Stack Overview
An opinionated map of the modern data stack, layer by layer, as TextQL sees it. From storage at the bottom to AI analysts at the top.
The modern data stack is the name for the loose collection of cloud-native, best-of-breed tools that have, over the last decade, replaced the monolithic on-premise data warehouse as the default architecture for analytics at most companies. It is not a single product. It is a set of layers, each owned by a different category of vendors, that collectively turn raw operational data into insight.
This wiki is organized around those layers. This page is the map.
The data stack is best read from the bottom up: data flows from the bottom (where it's stored cheaply and durably) toward the top (where humans actually use it to make decisions). Every tool in this wiki sits at one of these layers.
### 1. Storage (the substrate)
At the bottom of everything is cloud object storage: Amazon S3, Google Cloud Storage, or Azure Blob Storage. These are the giant, cheap, durable hash tables where bytes live. They replaced the previous generation's HDFS clusters and became the foundation that everything else is built on.
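To make the "hash table" framing concrete, here is a minimal in-memory sketch. `ToyObjectStore` and its methods are hypothetical names for illustration, not the API of S3, Google Cloud Storage, or Azure Blob Storage. It shows the core model only: flat keys mapped to byte blobs, with "folders" being nothing more than shared key prefixes.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for a bucket: flat keys -> byte blobs."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Writing to an existing key replaces the whole object.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # There are no real directories; listing is a prefix filter over keys.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("raw/orders/2024-01-01.parquet", b"orders-day-1")
store.put("raw/orders/2024-01-02.parquet", b"orders-day-2")
store.put("raw/users/2024-01-01.parquet", b"users-day-1")

print(store.list("raw/orders/"))
```

Everything higher in the stack ultimately reads and writes objects through an interface of roughly this shape.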
See: Storage / Data Lake
### 2. Table Formats (giving the storage shape)
Raw object storage holds files. To turn those files into something queryable like a database table, you need a table format: Apache Iceberg, Delta Lake, or Apache Hudi, each of which typically stores its data as Parquet files. Table formats add a transaction log, schema enforcement, time travel, and the ACID guarantees that turn a folder of files into a real table.
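The idea of a transaction log over immutable files can be sketched in a few lines. This is a toy model under simplifying assumptions, not how Iceberg, Delta Lake, or Hudi are implemented: `ToyTable`, `commit`, and `scan` are hypothetical names, and row lists stand in for Parquet files. Each commit appends a new snapshot (the list of live files) to the log, which is what makes time travel possible.

```python
class ToyTable:
    """Hypothetical table: immutable data files plus an append-only commit log."""

    def __init__(self):
        self.files = {}  # path -> list of row dicts (stand-in for Parquet files)
        self.log = []    # one entry per commit: the list of live file paths

    def commit(self, add=None, remove=()):
        """Publish a new snapshot in one step and return its version number."""
        add = add or {}
        removed = set(remove)
        current = self.log[-1] if self.log else []
        live = [path for path in current if path not in removed]
        for path, rows in add.items():
            self.files[path] = rows  # files are never edited, only added
            live.append(path)
        self.log.append(live)
        return len(self.log) - 1

    def scan(self, version=None):
        """Read every row visible at a snapshot (latest by default)."""
        if not self.log:
            return []
        live = self.log[-1] if version is None else self.log[version]
        return [row for path in live for row in self.files[path]]


table = ToyTable()
v0 = table.commit(add={"part-0.parquet": [{"id": 1}]})
v1 = table.commit(add={"part-1.parquet": [{"id": 2}]})
v2 = table.commit(remove=["part-0.parquet"])

print(table.scan())    # latest snapshot: [{'id': 2}]
print(table.scan(v1))  # time travel:     [{'id': 1}, {'id': 2}]
```

Note that the "delete" in `v2` removes nothing from storage; it only publishes a snapshot that no longer references the file, so older versions stay readable.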
See: Table Formats
### 3. Compute / Warehouses / Query Engines
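This is where queries actually run. There are two flavors: data warehouses, which manage their own storage, and query engines, which run directly against files in the lake.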
The lakehouse (Databricks Lakehouse Platform, Snowflake on Iceberg, Microsoft Fabric) is the architectural pattern where these two flavors converge: warehouse-style SQL on lake-style storage.
See: Data Warehouses | Query Engines | Data Lakehouse
### 4. Ingestion (ETL/ELT)
Getting data into the warehouse. Fivetran, Airbyte, and Stitch handle SaaS sources. Custom pipelines and event tracking handle product data.
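One common ingestion pattern is an incremental sync driven by a cursor. The sketch below is a generic illustration, not a description of how Fivetran, Airbyte, or Stitch work internally. `incremental_sync`, the row shape, and the `updated_at` field are all assumptions; a dict keyed by `id` stands in for a raw warehouse table.

```python
def incremental_sync(source_rows, destination, cursor):
    """Copy rows changed since `cursor` into `destination`; return the new cursor.

    Assumes each row has an `id` and an ISO-8601 `updated_at` string, so that
    plain string comparison orders timestamps correctly.
    """
    new_rows = [row for row in source_rows if row["updated_at"] > cursor]
    for row in new_rows:
        destination[row["id"]] = row  # upsert by primary key
    return max((row["updated_at"] for row in new_rows), default=cursor)


source = [
    {"id": 1, "email": "a@example.com", "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "email": "b@example.com", "updated_at": "2024-01-02T00:00:00"},
]
raw_users = {}

# First run: an empty cursor means "load everything".
cursor = incremental_sync(source, raw_users, cursor="")

# Later, one record changes at the source; only that row is re-read.
source[0] = {"id": 1, "email": "a.new@example.com", "updated_at": "2024-01-03T00:00:00"}
cursor = incremental_sync(source, raw_users, cursor)
```

Persisting the cursor between runs is what keeps each sync proportional to the amount of changed data rather than the size of the source.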
See: ETL / ELT | Event Tracking / CDP
### 5. Transformation
Modeling raw data into clean, business-ready tables inside the warehouse. dbt is the dominant tool. SQLMesh is the modern challenger. This is the "analytics engineering" layer that defines clean tables like users, orders, and revenue_by_day.
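As a rough illustration of what a transformation step does, the sketch below builds a `revenue_by_day` table from a raw orders table. It uses Python's built-in `sqlite3` as a stand-in warehouse. The `raw_orders` schema and the rule that only `paid` orders count toward revenue are assumptions made for the example, and this is plain SQL rather than dbt or SQLMesh project syntax.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for a real warehouse
con.executescript(
    """
    -- Raw layer: data as it landed from ingestion (hypothetical schema).
    CREATE TABLE raw_orders (
        order_id INTEGER, user_id INTEGER, amount_cents INTEGER,
        created_at TEXT, status TEXT
    );
    INSERT INTO raw_orders VALUES
        (1, 10, 2500, '2024-01-01 09:15:00', 'paid'),
        (2, 11, 4000, '2024-01-01 17:40:00', 'paid'),
        (3, 10, 1500, '2024-01-02 08:05:00', 'refunded'),
        (4, 12, 3000, '2024-01-02 12:30:00', 'paid');

    -- Modeled layer: one clean, business-ready row per day.
    CREATE TABLE revenue_by_day AS
    SELECT date(created_at) AS day,
           SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY date(created_at);
    """
)

rows = con.execute("SELECT day, revenue FROM revenue_by_day ORDER BY day").fetchall()
print(rows)  # [('2024-01-01', 65.0), ('2024-01-02', 30.0)]
```

The modeled table encodes decisions (cents to currency units, which statuses count) that downstream consumers no longer need to rediscover.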
See: ETL / ELT
### 6. Orchestration
Scheduling and dependency management for everything above. Airflow, Dagster, and Prefect are the major players.
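Dependency management boils down to running tasks in an order that respects a dependency graph. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The task names are hypothetical, and real orchestrators such as Airflow, Dagster, and Prefect layer scheduling, retries, and monitoring on top of this core idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "ingest_orders": set(),
    "ingest_users": set(),
    "build_revenue_by_day": {"ingest_orders"},
    "build_users": {"ingest_users"},
    "refresh_dashboard": {"build_revenue_by_day", "build_users"},
}


def run_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)  # upstream tasks always appear before the tasks that need them
```

Several orderings can be valid for the same graph, so the example asserts nothing about the exact sequence, only that dependencies come first.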
See: Orchestration
### 7. Catalog / Governance
The "where is everything and who owns it" layer. Unity Catalog, Atlan, Collibra, and Alation handle discovery, lineage, and access control.
See: Data Catalogs
### 8. Semantic Layer / Metrics
Where business definitions live. Cube, dbt Semantic Layer, LookML, and others define what "revenue" or "active user" means once, so every downstream tool agrees.
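The "define it once" idea can be sketched as a small metric registry that compiles to SQL. The `METRICS` structure, the `compile_metric` helper, and the table and column names are all hypothetical; this is not the configuration syntax of Cube, the dbt Semantic Layer, or LookML.

```python
# Hypothetical metric registry: each business term is defined exactly once.
METRICS = {
    "revenue": {
        "table": "orders",
        "expr": "SUM(amount)",
        "filter": "status = 'paid'",
    },
    "active_users": {
        "table": "events",
        "expr": "COUNT(DISTINCT user_id)",
        "filter": None,
    },
}


def compile_metric(name, group_by=None):
    """Turn a metric name (plus an optional dimension) into a SQL string."""
    metric = METRICS[name]
    select = [group_by] if group_by else []
    select.append(f"{metric['expr']} AS {name}")
    sql = f"SELECT {', '.join(select)} FROM {metric['table']}"
    if metric["filter"]:
        sql += f" WHERE {metric['filter']}"
    if group_by:
        sql += f" GROUP BY {group_by}"
    return sql


print(compile_metric("revenue", group_by="order_date"))
# SELECT order_date, SUM(amount) AS revenue FROM orders WHERE status = 'paid' GROUP BY order_date
```

Because every consumer calls the same definition, changing what counts as revenue means editing one entry rather than every dashboard query.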
See: Semantic Layer
### 9. BI / Dashboards
The consumption layer for executives and business users. Looker, Tableau, Power BI, and Sigma ship governed dashboards. The newer category of data workspaces (Hex, Mode, Deepnote) is where analysts author the analyses behind those dashboards.
See: Dashboards & BI | Data Workspaces
### 10. Activation / Reverse ETL
Pushing modeled warehouse data back out to operational tools where business teams actually work. Hightouch and Census own this layer.
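A reverse ETL sync can be thought of as a diff between modeled warehouse rows and the current state of the destination tool. The sketch below is a simplified illustration, not the implementation used by Hightouch or Census. `plan_sync`, the row fields, and the choice of `email` as the matching key are assumptions.

```python
def plan_sync(modeled_rows, destination_rows, key="email"):
    """Return the create/update operations needed to bring the destination up to date."""
    existing_by_key = {row[key]: row for row in destination_rows}
    operations = []
    for row in modeled_rows:
        existing = existing_by_key.get(row[key])
        if existing is None:
            operations.append(("create", row))
        elif existing != row:
            operations.append(("update", row))
        # Identical rows produce no operation, which keeps API usage low.
    return operations


warehouse = [
    {"email": "a@example.com", "plan": "pro", "lifetime_value": 1200},
    {"email": "b@example.com", "plan": "free", "lifetime_value": 0},
    {"email": "c@example.com", "plan": "pro", "lifetime_value": 300},
]
crm = [
    {"email": "a@example.com", "plan": "pro", "lifetime_value": 1200},
    {"email": "b@example.com", "plan": "free", "lifetime_value": 50},
]

for action, row in plan_sync(warehouse, crm):
    print(action, row["email"])
# update b@example.com
# create c@example.com
```

Sending only the differences matters in practice because destination APIs are usually rate limited.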
See: Reverse ETL
### 11. AI Analyst / Natural Language (the new top layer)
The newest layer of the stack, defined since ~2023: AI analysts that sit on top of everything else and let business users ask questions in plain English across the entire stack. TextQL Ana is the canonical example of a vendor-neutral AI analyst that spans the whole stack. Vendor-specific versions exist too — Snowflake Cortex Analyst, Databricks AI/BI Genie, Hex Magic — but each is scoped to its own platform.
See: TextQL in the Stack
The simple metaphor: the data stack is a lasagna. Each layer has its own job, its own vendors, and its own tradeoffs. The layers are mostly independent — you can swap out your BI tool without touching your warehouse, swap your warehouse without touching your storage, swap your reverse ETL without touching your transformation. That decoupling is the entire reason "best-of-breed" works as a strategy.
The opinionated TextQL view of where the puck is going: