Guide

Automating Cross-Platform Data Pipelines

A practical walkthrough for teams whose pipelines no longer fit inside one platform.

The Real Problem Is Coordination, Not Compute

Most teams do not have a transformation problem. ADF, Databricks, Functions, Logic Apps, Synapse, Fabric, and Cloud Composer all do their respective jobs well. The pain shows up between them: the ADF pipeline that finishes in the morning, the Databricks job that should follow it but instead runs on a fixed schedule and hopes ADF is done, the Cloud Function that nobody is sure ran at all.

Automation here means moving from "a clock said it was time" to "the graph says it is ready." The steps below describe a path most teams can follow regardless of orchestrator. The Polysync-specific notes show how each step works in practice.

Step 1. Connect Platforms Through a Vault, Not a Config File

The first step is wiring up your cloud platforms so the orchestrator can call them. The orchestrator should store a reference to a secret held in a vault, not a copy of the secret itself. Polysync supports four vault providers as the credential source:

Azure Key Vault
AWS Secrets Manager
Google Cloud Secret Manager
HashiCorp Vault

If you would rather not link your own vault, Polysync can hold the credential in its managed Azure Key Vault. In either case, the secret is read on demand at dispatch time and is not retained in application memory or logs.
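
As an illustration of the difference between a secret reference and a stored secret, here is a minimal Python sketch. It is not Polysync's API: the SecretRef class, the dispatch function, and the provider identifier are all hypothetical names, and the fetcher is a stand-in for whatever vault SDK client you would actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SecretRef:
    """Pointer to a secret. It holds no secret material itself."""
    provider: str  # hypothetical identifier, e.g. "azure-key-vault"
    name: str      # name of the secret inside that vault


# Hypothetical registry: provider identifier -> function that reads one secret.
# In a real system each function would wrap the vault's own SDK client.
Fetcher = Callable[[str], str]


def dispatch(ref: SecretRef,
             fetchers: Dict[str, Fetcher],
             call: Callable[[str], str]) -> str:
    """Resolve the secret on demand, use it for one call, then let it go."""
    secret = fetchers[ref.provider](ref.name)  # read at dispatch time
    try:
        return call(secret)
    finally:
        del secret  # drop the local name; nothing is cached or logged here
```

Only the SecretRef is persisted with the connection, so a leaked configuration file or log line exposes a vault name and a secret name, not a credential.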

Supported platforms today:

Azure: Data Factory, Databricks, Functions, Logic Apps, Synapse Pipelines, Fabric, AKS, Batch, AI Foundry, OpenAI.
Google Cloud: Cloud Composer, Vertex AI, Cloud Functions, Dataflow.
AWS: Step Functions, Glue, Lambda, SageMaker, Batch.

Step 2. Discover Jobs Rather Than Redefine Them

Once a platform is connected, the orchestrator should be able to list the work that already exists on it. Polysync calls each platform's API and imports the available resources as Jobs: ADF pipelines, Databricks notebooks, Logic Apps, Cloud Functions, Cloud Composer DAGs, and so on. Parameters that the resource accepts come along with it.

The distinction matters. A Job is the resource as it exists on the platform. A Task is a configured run of that Job, with parameters bound, dependencies attached, and schedules added. The same Job can back many Tasks with different parameter sets.
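
The Job and Task relationship can be sketched as two small data structures. This is a sketch under assumptions, not Polysync's data model: the field names (platform, resource, parameters, bound, upstream, schedules) are hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Job:
    """A resource discovered on a platform, plus the parameters it accepts."""
    platform: str
    resource: str
    parameters: Tuple[str, ...]


@dataclass
class Task:
    """A configured run of a Job: bound parameters, dependencies, schedules."""
    name: str
    job: Job
    bound: Dict[str, str] = field(default_factory=dict)
    upstream: List[str] = field(default_factory=list)
    schedules: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject parameters the discovered Job does not declare.
        unknown = set(self.bound) - set(self.job.parameters)
        if unknown:
            raise ValueError(
                f"unknown parameters for {self.job.resource}: {sorted(unknown)}"
            )


# One Job backing two Tasks with different parameter sets.
notebook = Job("databricks", "transform_orders", ("region", "run_date"))
emea = Task("orders_emea", notebook, bound={"region": "emea"})
apac = Task("orders_apac", notebook, bound={"region": "apac"})
```

Validating bound parameters against the discovered list is one reason to import Jobs rather than retype them: a misspelled parameter fails at configuration time, not at run time.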

Step 3. Express Dependencies as a DAG, Not as Cron Offsets

"Run the Databricks notebook at 06:15 because the ADF pipeline usually finishes at 06:10" is a smell. The orchestrator should be told the actual relationship and decide the time itself.

In a Polysync DAG each edge carries one of four trigger conditions:

On Success. Run only if the upstream task succeeded.
On Failure. Run only if the upstream task failed. Useful for alerts and compensations.
On Completion. Run regardless of outcome. Useful for cleanup.
On Skipped. Run only if the upstream task was skipped by an earlier condition.

Conditions are resolved recursively across the whole graph, so skip behaviour and branching produce the result you would expect without manual bookkeeping. Cycles are detected before any task starts; an accidental loop between two tasks is reported as a configuration error, not as a stuck run.
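
To make the resolution rules concrete, here is a small Python sketch of condition evaluation with up-front cycle detection. It is an illustration, not Polysync's engine, and it rests on two assumptions the text above does not settle: a task with several upstream edges runs only when every edge condition holds, and On Completion is satisfied by success or failure but not by a skip. The plan function and the condition identifiers are hypothetical. Cycle detection uses graphlib from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

SUCCESS, FAILURE, SKIPPED = "success", "failure", "skipped"

# Condition name -> upstream states that satisfy it.
# Assumption: "on_completion" means the upstream actually ran.
SATISFIED_BY = {
    "on_success": {SUCCESS},
    "on_failure": {FAILURE},
    "on_completion": {SUCCESS, FAILURE},
    "on_skipped": {SKIPPED},
}


def plan(edges, run):
    """Walk the graph in dependency order and return each task's final state.

    edges: {task: {upstream_task: condition_name}}
    run:   callable(task) -> SUCCESS or FAILURE
    """
    graph = {task: set(ups) for task, ups in edges.items()}
    # list() forces the full ordering first, so a cycle raises CycleError
    # before any task has been run.
    order = list(TopologicalSorter(graph).static_order())
    states = {}
    for task in order:
        ups = edges.get(task, {})  # tasks seen only as upstreams are roots
        ready = all(states[u] in SATISFIED_BY[cond] for u, cond in ups.items())
        states[task] = run(task) if ready else SKIPPED
    return states
```

For example, with an ADF task that succeeds and a Databricks task that fails, an On Failure alert attached to ADF is skipped, an On Completion cleanup attached to Databricks still runs, and a task attached On Skipped to the alert runs as well.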

Step 4. Map Outputs to Downstream Inputs

Useful pipelines pass values along. A Cloud Function returns the path of the file it wrote, and the next task needs that path as an input. Without parameter mapping, teams write custom passthrough code or a side channel through storage.

Polysync supports cross-task parameter mapping: a value produced by one task can be bound into a parameter of a downstream task, including across platforms (an ADF output flowing into a Databricks notebook parameter, for example). The mapping is part of the task configuration, not embedded in transformation code.
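
A mapping of this kind can be sketched as plain configuration plus one resolver function. The following Python is a hypothetical illustration, not Polysync's configuration format: resolve_parameters, the task names, and the output keys are invented for the example.

```python
from typing import Any, Dict, Tuple

# Downstream parameter name -> (upstream task name, key in that task's output)
Mapping = Dict[str, Tuple[str, str]]


def resolve_parameters(static: Dict[str, Any],
                       mapping: Mapping,
                       outputs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge statically bound parameters with values from upstream outputs."""
    params = dict(static)  # copy so the task configuration is not mutated
    for param, (task, key) in mapping.items():
        try:
            params[param] = outputs[task][key]
        except KeyError:
            raise KeyError(
                f"{param!r} needs output {key!r} of task {task!r}, "
                "which is not available"
            ) from None
    return params


# Hypothetical example: an ADF output feeding a Databricks notebook parameter.
outputs = {"adf_copy": {"output_path": "/landing/orders.parquet"}}
mapping = {"input_path": ("adf_copy", "output_path")}
params = resolve_parameters({"mode": "append"}, mapping, outputs)
```

Because the mapping lives beside the task configuration, neither the ADF pipeline nor the notebook needs to know the other exists.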

Step 5. Add Concurrency and Rate Limits Where the Real Constraints Are

The platforms you orchestrate have hard limits: Databricks job concurrency per workspace, Logic Apps action throughput, Cloud Functions concurrent executions. Without governance, parallel triggers cause HTTP 429 storms, dropped runs, and unpredictable costs.

Polysync concurrency profiles combine two well-known controls, applied per platform connection:

Leaky bucket for concurrency: at most N tasks in flight against this platform at any moment.
Rolling window for rate: at most M dispatches in any sliding T-second window.

Because the budget is per connection, a busy Databricks workspace cannot starve work that targets an unrelated ADF instance.
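
The two controls can be combined in a few lines. The Python below is a simplified sketch, not Polysync's implementation: ConcurrencyProfile and its method names are hypothetical, the concurrency cap is modelled as a plain in-flight counter and not as a literal leaky bucket, and the class is not thread-safe. The clock is injectable so the behaviour can be tested without sleeping.

```python
import time
from collections import deque


class ConcurrencyProfile:
    """Per-connection budget: an in-flight cap plus a sliding-window rate cap."""

    def __init__(self, max_in_flight, max_dispatches, window_seconds,
                 clock=time.monotonic):
        self.max_in_flight = max_in_flight
        self.max_dispatches = max_dispatches
        self.window_seconds = window_seconds
        self._clock = clock
        self._in_flight = 0
        self._recent = deque()  # timestamps of dispatches inside the window

    def try_dispatch(self) -> bool:
        """Return True and consume budget if a dispatch is allowed right now."""
        now = self._clock()
        # Forget dispatches that have slid out of the window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if self._in_flight >= self.max_in_flight:
            return False
        if len(self._recent) >= self.max_dispatches:
            return False
        self._in_flight += 1
        self._recent.append(now)
        return True

    def complete(self) -> None:
        """Call when a dispatched task finishes, whatever its outcome."""
        self._in_flight = max(0, self._in_flight - 1)


# One profile per platform connection keeps the budgets independent.
profiles = {
    "databricks-prod": ConcurrencyProfile(2, 3, 10),
    "adf-finance": ConcurrencyProfile(5, 20, 60),
}
```

A task that is refused would stay queued and be retried later. Exhausting the "databricks-prod" budget has no effect on "adf-finance", because each connection owns its own counters.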

Step 6. Schedule at the Roots, Let the Graph Do the Rest

Cron schedules attach to root tasks (tasks with no upstream dependencies). Schedules use standard cron syntax, are timezone-aware, and come with a visual builder for engineers who do not want to memorise asterisks. Multiple schedules per task are supported. Everything downstream runs because the graph says it is ready, not because it was given its own clock.
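
The rule that schedules belong only on roots is easy to check mechanically. The Python below is a hypothetical sketch, not Polysync's validation logic: root_tasks and validate_schedules are invented names, the cron check only counts five whitespace-separated fields, and timezone handling is left out.

```python
from typing import Dict, List, Set


def root_tasks(upstream: Dict[str, List[str]]) -> Set[str]:
    """Tasks with no upstream dependencies."""
    return {task for task, ups in upstream.items() if not ups}


def validate_schedules(upstream: Dict[str, List[str]],
                       schedules: Dict[str, List[str]]) -> None:
    """Raise ValueError if a schedule sits on a non-root task or is malformed."""
    roots = root_tasks(upstream)
    for task, crons in schedules.items():
        if task not in roots:
            raise ValueError(
                f"{task!r} has upstream dependencies; schedule a root task instead"
            )
        for expr in crons:
            # Shallow check only: standard cron has five fields.
            if len(expr.split()) != 5:
                raise ValueError(f"{expr!r} is not a five-field cron expression")


upstream = {"adf_ingest": [], "databricks_transform": ["adf_ingest"]}
# Two schedules on the same root: weekdays at 06:00 and Sundays at 22:00.
validate_schedules(upstream, {"adf_ingest": ["0 6 * * 1-5", "0 22 * * 0"]})
```

Attaching "15 6 * * *" to databricks_transform would be rejected, which is exactly the cron-offset pattern that Step 3 replaces with a dependency edge.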

Step 7. Watch It in One Place

The reason to do all of the above is to be able to answer "what is happening right now" with one screen. Polysync's monitoring view shows live queue depth, currently dispatched tasks, recent run history, duration trends, and status breakdowns across every connected platform.

What You End Up With

A pipeline that no longer depends on someone remembering which portal to open first. The graph triggers itself, governance is applied automatically, and the monitoring view tells the truth about what ran. The orchestrator becomes the system of record for what your data platform actually did today.