NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →


Workflow & Orchestration

Workflow orchestration tools schedule, monitor, and manage data pipelines as directed acyclic graphs (DAGs). The dominant platforms are Apache Airflow, Dagster, and Prefect, with Astronomer providing managed Airflow.

A workflow orchestrator is the thing that decides what runs, when, and in what order. If your data stack is a kitchen, the orchestrator is the head chef holding the schedule on a clipboard: it knows that the soup needs stock before the stock can be made, that the dessert can wait until after dinner service, and that if the oven breaks the pastry chef should be paged at 2 a.m. Without an orchestrator, every pipeline is just a script someone runs by hand or a cron job that fails silently at 3 a.m.

In more technical terms: a workflow orchestrator schedules and executes pipelines — collections of tasks with dependencies between them — and gives you visibility into whether they ran, how long they took, and what failed. It is the connective tissue of the data stack: ingestion (Fivetran, Airbyte), transformation (dbt), reverse ETL, machine learning training, and notification systems all live as nodes in some orchestrator's graph.

Why Orchestration Exists

Before dedicated orchestrators, data teams used cron. Cron is fine for "run this script every hour" but breaks the moment you have dependencies. If job B needs the output of job A, and A is late, cron will happily run B against stale data and never tell you. If a job fails, cron forgets it ever happened. If you need to backfill three weeks of history because you found a bug, cron has no concept of "the run that should have happened last Tuesday."

Orchestrators exist to fix four specific problems that cron cannot:

1. Dependencies. Most data work is a chain. Extract from Salesforce, load into the warehouse, run dbt, refresh the BI cache, send the morning report. Each step needs the previous step to have actually succeeded. An orchestrator models these dependencies explicitly.

2. Retries and failure handling. Real systems fail. APIs rate-limit you, networks blip, warehouses hit query queues. An orchestrator retries with backoff, alerts a human if retries exhaust, and lets you resume from the failed step instead of restarting from scratch.

3. Observability. When the CEO asks "why is the dashboard wrong?" you need to know which pipeline failed, when, what error it threw, and what data it touched. An orchestrator gives you a UI with run history, logs, and lineage.

4. Backfills and reruns. Data pipelines exist in time. If you discover a bug in last week's transformation, you need to rerun the affected days, in order, without re-running everything. An orchestrator understands the concept of a "logical date" and can replay history.
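The retry behavior in point 2 can be sketched in plain Python. This is a toy illustration, not any orchestrator's actual implementation — `run_with_retries` and the flaky task are hypothetical names:

```python
import time

def run_with_retries(task, max_retries=3, base_delay=1.0):
    """Run a task, retrying with exponential backoff on failure."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface to alerting
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# A flaky task that fails twice, then succeeds — the kind of
# transient error (rate limits, network blips) orchestrators absorb.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(run_with_retries(flaky, base_delay=0.01))  # → ok
```

Real orchestrators layer alerting and resume-from-failure on top of exactly this loop; the backoff keeps a rate-limited API from being hammered on every retry.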

The DAG: One Idea That Defined the Category

The single most important concept in orchestration is the Directed Acyclic Graph, or DAG. The name sounds intimidating but the idea is simple:

  • Graph: a set of nodes (tasks) connected by arrows (dependencies).
  • Directed: the arrows have a direction. Task A points to Task B means "A must finish before B starts."
  • Acyclic: no loops. You can't have A depending on B depending on C depending on A. That would be impossible to run.

Kitchen analogy: A DAG is the recipe card the head chef tapes to the wall. "Chop onions" → "Sauté onions" → "Add stock" → "Simmer". You can't simmer before sautéing, and you can't sauté the same onions twice in a circle. Each arrow is a dependency; the whole picture is the DAG. Multiple parallel branches (the salad station, the dessert station) all converge at "plate the dish."
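The recipe DAG above can be expressed directly with Python's standard-library `graphlib`. A minimal sketch — the task names are the kitchen example, not any orchestrator's API:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# The recipe DAG: each task maps to the set of tasks it depends on.
deps = {
    "chop_onions": set(),
    "saute_onions": {"chop_onions"},
    "add_stock": {"saute_onions"},
    "simmer": {"add_stock"},
    "dress_salad": set(),          # a parallel branch
    "plate": {"simmer", "dress_salad"},  # branches converge here
}

# static_order() yields a valid execution order — and raises
# CycleError if the graph has a loop (i.e., is not acyclic).
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any order it emits respects every arrow: chopping always precedes sautéing, and both branches finish before plating.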

Once you accept the DAG as your model of the world, a lot of things become clean. Scheduling = "run this DAG every day at 6 a.m." Backfilling = "run this DAG for every day between Jan 1 and Jan 14." Failure handling = "if any node fails, mark its downstream nodes as upstream-failed." Almost every modern orchestrator — Airflow, Dagster, Prefect, Luigi, Argo Workflows, AWS Step Functions — is fundamentally a DAG executor with a UI on top.
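The "mark its downstream nodes as upstream-failed" rule is just a graph traversal. A minimal sketch, with a hypothetical edge map and function name rather than any orchestrator's real API:

```python
from collections import defaultdict, deque

def downstream_of(edges, failed):
    """Given edges (task -> list of downstream tasks) and a failed
    task, return every task that must be marked upstream-failed."""
    children = defaultdict(set)
    for up, downs in edges.items():
        children[up] |= set(downs)
    seen, queue = set(), deque([failed])
    while queue:  # breadth-first walk of everything downstream
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

edges = {"extract": ["load"], "load": ["dbt"], "dbt": ["bi_cache", "report"]}
print(sorted(downstream_of(edges, "load")))  # → ['bi_cache', 'dbt', 'report']
```

If `load` fails, nothing after it runs — but `extract` keeps its success state, which is exactly why you can resume from the failed step instead of restarting from scratch.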

Airflow's Dominance and the Challengers

The orchestration market has one dominant player and a handful of credible challengers. Apache Airflow is the SQL of orchestration: not the best tool by any aesthetic measure, but so widely adopted that "data engineer who has used Airflow" is essentially a job category. Airbnb open-sourced Airflow in 2015, it entered the Apache Incubator in 2016, and within a few years it was the default. By 2026, it powers pipelines at most Fortune 500 data teams and every cloud has a managed offering (AWS MWAA, Google Cloud Composer, Astronomer).

Airflow's dominance is also its biggest problem. The framework was designed in 2014 for a world where DAGs were authored as Python files that scheduled bash and SQL operators. It was not designed for the modern data stack where the things you orchestrate are data assets (a dbt model, a Snowflake table) rather than tasks (a script to run). Airflow knows that "task X ran successfully," not "table Y is now fresh." This task-centric view is why Airflow pipelines so often degrade into spaghetti as they scale.

Two main challengers emerged from this gap:

  • Dagster (2019, Nick Schrock, ex-Facebook GraphQL co-creator) made the radical bet that assets, not tasks, should be the primitive. You declare "this dbt model produces this table" and Dagster figures out the DAG, the scheduling, and the lineage automatically. For anyone starting a new data platform in 2026, Dagster is generally the recommended choice.
  • Prefect (2018, Jeremiah Lowin, former Airflow committer) pitched "negative engineering" — the idea that most engineering work in data isn't building features, it's preventing failure modes (retries, timeouts, alerting, restart logic). Prefect's API feels lighter and more Pythonic than Airflow's. It has a smaller mindshare than Dagster among new builds, but it's a genuinely well-designed system.
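The asset-oriented idea — declare what each asset depends on and let the system derive the DAG — can be illustrated in a few lines of plain Python. This is a toy registry for illustration only, not Dagster's actual API:

```python
from graphlib import TopologicalSorter

# Toy asset registry: asset name -> names of its upstream assets.
ASSETS = {}

def asset(deps=()):
    """Register a function as a data asset with declared upstreams."""
    def register(fn):
        ASSETS[fn.__name__] = set(deps)
        return fn
    return register

@asset()
def raw_orders(): ...

@asset(deps=["raw_orders"])
def cleaned_orders(): ...

@asset(deps=["cleaned_orders"])
def daily_revenue(): ...

# The DAG falls out of the declarations — no explicit wiring step.
print(list(TopologicalSorter(ASSETS).static_order()))
# → ['raw_orders', 'cleaned_orders', 'daily_revenue']
```

The point of the contrast: in a task-centric model you wire tasks together by hand; in an asset-centric model the graph, the lineage, and the refresh order are all derived from what each asset says it needs.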

There are also workflow-adjacent tools worth knowing about: Argo Workflows (Kubernetes-native), Temporal (general-purpose durable execution, used heavily outside data), Mage and Kestra (newer entrants), and AWS Step Functions (managed cloud-native).

Honest Take: What to Pick

| Scenario | Recommendation |
| --- | --- |
| Existing Airflow shop, large team | Stay on Airflow; use Astronomer for the managed plane |
| Greenfield platform, modern data stack | Dagster. Asset orientation pays off as the platform grows. |
| Small Python team, want elegant API | Prefect |
| Pure Kubernetes-native, ML focus | Argo Workflows or Kubeflow Pipelines |
| AWS-only, simple workflows | Step Functions |
| Cross-cloud durable execution beyond data | Temporal |

The general rule: if you're hiring a data engineer in 2026, they have used Airflow. If you're choosing for the next decade, you probably want Dagster.

Tools in This Category

  • Apache Airflow — The dominant open-source orchestrator. Created at Airbnb in 2014. Task-centric, Python-defined DAGs.
  • Astronomer — The commercial company built around managed Airflow. Made Airflow enterprise-friendly.
  • Dagster — The modern, asset-oriented challenger. The recommended choice for new data platforms in 2026.
  • Prefect — The "negative engineering" alternative with an elegant Pythonic API.

How TextQL Works with Orchestration

TextQL Ana reads from the warehouses and lakehouses that orchestrators populate. The orchestrator decides when the data is fresh; TextQL answers questions about the data once it lands. In platforms with rich asset metadata (Dagster especially, but also Airflow with the right plugins), TextQL can use orchestrator state to know which tables are stale, when they were last refreshed, and which upstream pipeline owns them — turning orchestrator metadata into trustworthy answers about freshness.
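As a toy illustration of the freshness idea — the table names, timestamps, and threshold below are all hypothetical, not TextQL's or any orchestrator's real metadata format:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical snapshot of orchestrator metadata:
# table -> when its owning pipeline last materialized it.
last_refreshed = {
    "analytics.orders": datetime.now(timezone.utc) - timedelta(hours=1),
    "analytics.revenue": datetime.now(timezone.utc) - timedelta(days=2),
}

def stale_tables(metadata, max_age=timedelta(hours=24)):
    """Return tables whose last refresh is older than max_age."""
    now = datetime.now(timezone.utc)
    return [table for table, ts in metadata.items() if now - ts > max_age]

print(stale_tables(last_refreshed))  # → ['analytics.revenue']
```

A staleness check this simple is enough to turn "the dashboard is wrong" into "the revenue table hasn't refreshed in two days — ask the pipeline that owns it."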

See TextQL in action

Workflow & Orchestration
Category: Pipeline scheduling & coordination
Also called: Workflow managers, schedulers, DAG runners
Core abstraction: Directed Acyclic Graph (DAG)
Key tools: Apache Airflow, Dagster, Prefect
Managed offerings: Astronomer, MWAA, Cloud Composer
Predecessors: cron, Oozie, Luigi, Azkaban
Typical users: Data engineers, platform teams
Monthly mindshare: ~600K · every data team needs orchestration; broad concept