Guide

What Is ETL Orchestration?

A technical guide to scheduling, dependency resolution, and dispatcher design.

Definition

ETL orchestration is the discipline of running data pipelines in the right order, at the right time, under the right resource constraints, with the right response when something fails. It is the layer that sits between your scheduler and the cloud services that actually move and transform data.

A useful working definition: orchestration is the runtime that turns a graph of work into a sequence of platform API calls, applies governance to those calls, and records what happened.

Scheduling Is Not Orchestration

The two terms get used interchangeably. They are not the same thing.

Scheduling answers a single question: when should this start? A cron expression and a clock are enough.
Orchestration answers four: when does it start, what has to finish first, how many copies can run at once, and what happens on failure?

Every scheduler is a tiny orchestrator, and every orchestrator contains a scheduler. The interesting design choices live in the other three questions.

The DAG Is the Core Data Structure

Workflows are modelled as a directed acyclic graph (DAG). Each node is a task. Each edge declares that one task must wait for another before it runs. The "acyclic" part is non-negotiable: if task A waits for B and B waits for A, neither can ever start. A real orchestrator detects that loop before it executes anything and refuses the run.
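
As a minimal sketch of that up-front check, the snippet below uses Python's standard-library graphlib module to compute a run order and to refuse a graph that contains a loop. The function name execution_order and the sample task names are illustrative assumptions, not part of any particular orchestrator.

```python
from graphlib import TopologicalSorter, CycleError


def execution_order(dependencies):
    """Return a valid run order, or raise ValueError naming the cycle.

    `dependencies` maps each task to the set of tasks it waits for.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] lists the nodes in the cycle (first node repeated last).
        cycle = " -> ".join(exc.args[1])
        raise ValueError(f"cycle detected, refusing run: {cycle}") from exc


# A linear pipeline: load waits for transform, transform waits for extract.
pipeline = {"load": {"transform"}, "transform": {"extract"}, "extract": set()}
order = execution_order(pipeline)

# A waits for B and B waits for A: rejected before anything executes.
broken = {"A": {"B"}, "B": {"A"}}
try:
    execution_order(broken)
    cycle_error = None
except ValueError as err:
    cycle_error = str(err)
```

The point of the design is that validation happens on the whole graph before the first task is dispatched, so a bad edge never produces a half-finished run.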

Edges usually carry a trigger condition. Common conditions include:

On Success. Run the downstream task only if the upstream task finished successfully.
On Failure. Run the downstream task only if the upstream task failed (typical for alerting or compensation).
On Completion. Run regardless of outcome (typical for cleanup).
On Skipped. Run only if the upstream task was skipped by an earlier condition.

A solid orchestrator resolves these conditions recursively: the status of a node is a function of the resolved status of its ancestors, not just its immediate parent. That is what allows complex branches and skip paths to behave correctly without manual bookkeeping.
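
One way to picture recursive resolution is the sketch below. It assumes three terminal statuses (succeeded, failed, skipped), assumes that On Completion is satisfied by success or failure but not by a skip, and assumes a task with several incoming edges needs all of them satisfied. Real products differ on those points; the names SATISFIES and resolve_statuses are hypothetical.

```python
# Which upstream statuses satisfy each trigger condition (an assumption).
SATISFIES = {
    "on_success": {"succeeded"},
    "on_failure": {"failed"},
    "on_completion": {"succeeded", "failed"},
    "on_skipped": {"skipped"},
}


def resolve_statuses(edges, outcomes):
    """Resolve every task's final status on an acyclic graph.

    edges:    task -> list of (upstream_task, condition)
    outcomes: task -> result the task would produce if it actually runs
    """
    cache = {}

    def status(task):
        if task not in cache:
            upstream = edges.get(task, [])
            # Recursing here means a skip propagates from any ancestor.
            runnable = all(status(up) in SATISFIES[cond] for up, cond in upstream)
            cache[task] = outcomes[task] if runnable else "skipped"
        return cache[task]

    return {task: status(task) for task in outcomes}


edges = {
    "transform": [("extract", "on_success")],
    "load": [("transform", "on_success")],
    "alert": [("transform", "on_failure")],
    "notify_skip": [("load", "on_skipped")],
}
outcomes = {
    "extract": "failed",
    "transform": "succeeded",
    "load": "succeeded",
    "alert": "succeeded",
    "notify_skip": "succeeded",
}
statuses = resolve_statuses(edges, outcomes)
```

Here extract fails, so transform and load are skipped without ever running, alert is skipped because transform did not fail (it never ran), and notify_skip runs because load was skipped. None of that required per-task bookkeeping.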

Dispatchers, Queues, and Why They Run Separately

The component that actually calls the cloud platform API is the dispatcher. In any non-trivial orchestrator it runs as a separate process from the scheduler and from the UI for three reasons:

Throughput. Dispatching is bursty. Decoupling lets you scale dispatchers without scaling the UI.
Isolation. A misbehaving platform connection (slow API, throttling, auth failure) should not freeze unrelated work.
Durability. If the dispatcher restarts, work in the queue is still there. The scheduler keeps producing work even if a downstream dispatcher is busy.

The communication channel between them is typically a queue table or message broker, and the queue is where governance lives.
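
A queue table can be very small. The sketch below uses SQLite from the Python standard library purely for illustration; the table layout and the helper names open_queue, enqueue, and claim_next are assumptions, and a production system would more likely use a server database or a message broker.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) the queue table. Use a file path for durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " task TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " claimed_by TEXT)"
    )
    return conn


def enqueue(conn, task):
    """Scheduler side: record that a task is ready to dispatch."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO queue (task) VALUES (?)", (task,))


def claim_next(conn, dispatcher_id):
    """Dispatcher side: claim the oldest pending row, or return None."""
    with conn:
        row = conn.execute(
            "SELECT id, task FROM queue WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim a no-op if another dispatcher won.
        updated = conn.execute(
            "UPDATE queue SET status = 'claimed', claimed_by = ?"
            " WHERE id = ? AND status = 'pending'",
            (dispatcher_id, row[0]),
        )
        return row[1] if updated.rowcount == 1 else None


conn = open_queue()
enqueue(conn, "extract")
enqueue(conn, "transform")
first = claim_next(conn, "dispatcher-1")
second = claim_next(conn, "dispatcher-2")
third = claim_next(conn, "dispatcher-1")  # queue is empty by now
```

Because the scheduler only writes rows and the dispatcher only claims them, either side can restart or scale independently, and limits can be enforced at the moment of the claim.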

Concurrency and Rate Limits

Every managed cloud service has hard limits: Databricks job concurrency per workspace, ADF integration runtime slots, Logic Apps action throughput, Cloud Functions concurrent executions, and so on. Hitting these limits manifests as HTTP 429 responses, dropped runs, and unpredictable retry storms.

Two well-known algorithms handle this cleanly:

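Leaky bucket, in the sense used here, bounds the concurrent count: at most N tasks may be "in flight" against a given platform at any moment, and new tasks wait until a slot opens. (The textbook leaky bucket shapes a rate; a cap on in-flight work behaves more like a counting semaphore.)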
Rolling window bounds the rate: at most N tasks may be dispatched in any sliding T-second window, regardless of how quickly previous tasks finished.

A mature orchestrator lets you apply both, scoped per platform connection, so a noisy Databricks workspace cannot starve an ADF run that uses a different connection.
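
The two limits compose naturally in one small object per connection. The following is a sketch under stated assumptions: the class name ConnectionLimiter and its parameters are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow (a real dispatcher would likely pass time.monotonic()).

```python
from collections import deque


class ConnectionLimiter:
    """Caps in-flight tasks and dispatches per sliding window for one connection."""

    def __init__(self, max_in_flight, max_per_window, window_seconds):
        self.max_in_flight = max_in_flight
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.in_flight = 0
        self.dispatch_times = deque()  # timestamps of recent dispatches

    def try_dispatch(self, now):
        """Return True and record the dispatch if both limits allow it."""
        # Forget dispatches that have slid out of the window.
        while self.dispatch_times and now - self.dispatch_times[0] >= self.window_seconds:
            self.dispatch_times.popleft()
        if self.in_flight >= self.max_in_flight:
            return False  # concurrency cap reached
        if len(self.dispatch_times) >= self.max_per_window:
            return False  # rate cap reached
        self.in_flight += 1
        self.dispatch_times.append(now)
        return True

    def finish(self):
        """Call when a dispatched task reaches a terminal state."""
        self.in_flight -= 1


# At most 2 in flight, at most 3 dispatches per 10 seconds.
limiter = ConnectionLimiter(max_in_flight=2, max_per_window=3, window_seconds=10.0)
results = [limiter.try_dispatch(0.0), limiter.try_dispatch(1.0)]
results.append(limiter.try_dispatch(2.0))   # blocked: two already in flight
limiter.finish()
results.append(limiter.try_dispatch(3.0))   # a slot opened
limiter.finish()
limiter.finish()
results.append(limiter.try_dispatch(4.0))   # blocked: three dispatches in window
results.append(limiter.try_dispatch(10.0))  # the dispatch at t=0 has expired
```

Keeping one limiter per platform connection, for example in a dictionary keyed by connection name, is what stops one busy connection from consuming another's budget.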

State, Retries, and Idempotency

Distributed work fails for boring reasons: network blips, transient throttling, expired tokens, mid-flight deployments. A retry policy is not optional; it is part of the contract.

The non-obvious part is idempotency. A retry is safe only if the underlying platform call is idempotent or the orchestrator tracks the original run identifier. Good orchestrators capture the platform-side run ID on first dispatch and reconcile state from it on retry, rather than blindly issuing a second call.
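
The reconcile-before-resubmit pattern might look like the sketch below. FakePlatform is a stand-in written for this example, not any real SDK, and the state dictionary represents whatever durable store the orchestrator uses; both are assumptions.

```python
class FakePlatform:
    """Stand-in for a cloud API (hypothetical, for illustration only)."""

    def __init__(self):
        self.runs = {}
        self.submit_calls = 0

    def submit(self, job):
        self.submit_calls += 1
        run_id = f"run-{self.submit_calls}"
        self.runs[run_id] = "running"
        return run_id

    def status(self, run_id):
        return self.runs[run_id]


def dispatch(platform, state, task, job):
    """Submit once; on later attempts reconcile from the stored run ID."""
    run_id = state.get(task)
    if run_id is None:
        run_id = platform.submit(job)
        state[task] = run_id  # record the platform-side ID immediately
        return run_id, platform.status(run_id)
    status = platform.status(run_id)
    if status == "failed":
        # The original run is definitively dead, so a new submission is safe.
        run_id = platform.submit(job)
        state[task] = run_id
        status = platform.status(run_id)
    return run_id, status


platform = FakePlatform()
state = {}
first = dispatch(platform, state, "nightly_load", "load_job")
retry = dispatch(platform, state, "nightly_load", "load_job")  # e.g. after a timeout
calls_after_retry = platform.submit_calls

platform.runs["run-1"] = "failed"  # simulate the platform reporting a failure
third = dispatch(platform, state, "nightly_load", "load_job")
```

The retry after a timeout finds the original run still alive and does not start a duplicate. This sketch still has a small gap between the submit call returning and the ID being stored; closing it typically relies on a platform-side idempotency key where the platform offers one.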

Metadata-Driven vs Code-Driven

Two design schools dominate the orchestration market.

Code-driven orchestrators (Airflow, Dagster, Prefect) express workflows as code. The DAG is generated at parse time from Python.
Metadata-driven orchestrators (ADF, Polysync) express workflows as configuration. The DAG is data, queried at runtime from a database.

Both are valid. Code-driven gives you arbitrary logic and version control diffs. Metadata-driven gives you a UI that non-engineers can use, hot reload without redeployment, and a cleanly auditable configuration store. The right answer depends on who owns the pipelines.
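
To make "the DAG is data" concrete, the sketch below stores tasks and edges as rows and derives the run order by querying them at runtime. The two-table schema and the function load_run_order are illustrative assumptions and do not describe the schema of ADF, Polysync, or any other product.

```python
import sqlite3
from graphlib import TopologicalSorter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task (name TEXT PRIMARY KEY);
    CREATE TABLE edge (upstream TEXT NOT NULL, downstream TEXT NOT NULL);
    INSERT INTO task VALUES ('extract'), ('transform'), ('load');
    INSERT INTO edge VALUES ('extract', 'transform'), ('transform', 'load');
""")


def load_run_order(conn):
    """Query the current configuration and derive a run order from it."""
    graph = {name: set() for (name,) in conn.execute("SELECT name FROM task")}
    for upstream, downstream in conn.execute("SELECT upstream, downstream FROM edge"):
        graph[downstream].add(upstream)
    return list(TopologicalSorter(graph).static_order())


before = load_run_order(conn)

# A configuration change is a row insert, not a code deployment.
conn.execute("INSERT INTO task VALUES ('audit')")
conn.execute("INSERT INTO edge VALUES ('load', 'audit')")
after = load_run_order(conn)
```

In a code-driven tool the equivalent change would be an edit to a Python file that goes through review and deployment; here it is a write to the configuration store that the next run picks up.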

How Polysync Implements This

Polysync is a metadata-driven orchestrator delivered as a SaaS application through the Azure Marketplace. Concretely:

Jobs are discovered, not authored. A Job is the resource that exists on the platform (an ADF pipeline, a Databricks notebook, a Cloud Function). A Task is a configured run of that Job with parameters, dependencies, and schedules attached.
The DAG is resolved recursively with cycle detection up front; offending edges are reported before any task starts.
The dispatcher is a separate process from the web app, with state and locking that survives restarts so in-flight work is not lost.
Concurrency profiles combine leaky-bucket and rolling-window limits, applied per platform connection.
Parameters flow across tasks: the output of one task can be mapped into the input of another, including across platforms.
Credentials live in a vault, either your own (Azure Key Vault, AWS Secrets Manager, Google Cloud Secret Manager, HashiCorp Vault) or, if you prefer, in Polysync's managed Azure Key Vault.

Getting Started

Polysync is sold on the Azure Marketplace. Sign in with your Microsoft 365 account and your tenant is provisioned automatically.