Data Ecosystem Wiki
The vendor-neutral reference guide to every tool, format, and concept in the modern data stack.
Data Warehouses (12)
Databricks
Databricks is the lakehouse company founded by the creators of Apache Spark.
Data Warehouses
Cloud data warehouses are purpose-built analytical databases that store structur...
Snowflake
Snowflake is the dominant independent cloud data warehouse.
Databricks SQL Warehouse
Databricks SQL is the lakehouse's warehouse impersonation -- a high-performance ...
Snowflake Data Cloud
Snowflake's umbrella branding for its unified platform spanning data warehousing...
Apache Spark (Databricks)
Apache Spark is the distributed processing engine that Databricks was founded to...
Snowpark
Snowpark is Snowflake's DataFrame API and runtime for Python, Java, and Scala --...
Unity Catalog
Unity Catalog is Databricks' unified governance layer for data and AI assets -- ...
Google BigQuery
Google BigQuery is Google Cloud's serverless, pay-per-query data warehouse.
Snowpipe
Snowpipe is Snowflake's continuous, file-based ingestion service -- the way most...
Amazon Redshift
Amazon Redshift is AWS's MPP columnar data warehouse.
Delta Lake (Databricks)
Delta Lake is the open-source table format Databricks built to give cloud object...
ETL & Integration (7)
ETL & Data Integration
ETL and data integration tools move data from source systems into the warehouse ...
Fivetran
Fivetran is the dominant managed-connector service for moving data from SaaS app...
Informatica
Informatica is the legacy enterprise leader in data integration.
Matillion
Matillion is a cloud-native ETL/ELT platform with a visual GUI that pushes trans...
dbt
dbt (data build tool) is the de facto SQL-based transformation layer for the modern data stack.
Airbyte
Airbyte is the open-source challenger to Fivetran.
Stitch
Stitch was an early SaaS ELT service founded in 2016 from the ashes of RJMetrics.
BI & Dashboards (8)
Dashboards & BI
Business intelligence and dashboarding tools turn warehouse data into charts, da...
Tableau
Tableau is the original modern BI tool — the drag-and-drop visualization platfor...
Power BI
Microsoft Power BI is the largest BI tool in the world by user count, driven ent...
Looker
Looker is the BI tool that invented the modern semantic layer.
Amazon QuickSight
Amazon QuickSight is AWS's native cloud BI tool.
Sigma Computing
Sigma is the cloud-native BI tool that gives business users a spreadsheet interf...
ThoughtSpot
ThoughtSpot pioneered search-driven analytics — type a question, get a chart — a...
Apache Superset
Apache Superset is the dominant open-source BI tool.
Table Formats (7)
Table Formats
Table formats like Apache Iceberg, Delta Lake, and Apache Hudi add database sema...
Apache Iceberg
Apache Iceberg is the open table format that won the lakehouse format war.
Delta Lake
Delta Lake is Databricks' open table format.
Apache Hudi
Apache Hudi is the streaming-first table format born at Uber.
Apache Parquet
Apache Parquet is the columnar file format that powers essentially every modern ...
Apache ORC
Apache ORC is the columnar file format born in the Hive ecosystem.
Apache Avro
Apache Avro is a row-based, schema-first data serialization format.
Query Engines (9)
Query Engines & Virtualization
Query engines are the SQL brains that read data from somewhere else.
Trino
Trino is the open-source distributed SQL query engine forked from PrestoSQL in 2...
Dremio
Dremio is an open data lakehouse SQL engine built around Apache Iceberg, Apache ...
DuckDB
DuckDB is an embedded analytical database, the SQLite of analytics.
Denodo
Denodo is the original enterprise data virtualization platform.
Presto
Presto is the original distributed SQL query engine built at Facebook in 2012.
Apache Hive
Apache Hive is the original SQL-on-Hadoop engine, built at Facebook in 2008.
Databricks Photon
Photon is Databricks' from-scratch C++ vectorized execution engine for SQL and DataFrame workloads.
Starburst / Trino (split)
This page has been split into two: Trino (the open-source distributed SQL query ...
Semantic Layer (6)
Semantic Layer / Metrics
A semantic layer is the place where business definitions like 'revenue,' 'active...
AtScale
AtScale started as a virtual OLAP cube engine for Hadoop in 2013 and evolved int...
Cube
Cube is the leading open-source headless BI and semantic layer.
LookML
LookML is Looker's data modeling language, the proprietary file format that put ...
dbt Semantic Layer
The dbt Semantic Layer is dbt Labs' answer to LookML and Cube.
Transform (MetricFlow)
Transform was the headless metrics startup founded by ex-Airbnb engineers behind Minerva.
Data Catalog (7)
Data Catalog & Discovery
Data catalogs are the inventory and search layer for an organization's data.
Atlan
Atlan is a modern, design-first data catalog and collaboration platform founded ...
Alation
Alation is the established enterprise data catalog, founded in 2012 in Silicon Valley.
Collibra
Collibra is the European-born enterprise data governance giant, founded in 2008 in Brussels.
Select Star
Select Star is a modern, automated data catalog founded in 2020 and headquartered in San Francisco.
DataHub
DataHub is the open-source data catalog originally built at LinkedIn and now com...
Amundsen
Amundsen is the open-source data catalog originally built at Lyft in 2019.
Data Observability (5)
Data Observability
Data observability platforms monitor data quality, freshness, schema changes, an...
Monte Carlo
Monte Carlo is the company that invented the data observability category in 2019.
Bigeye
Bigeye is a data observability platform founded by former members of Uber's data team.
Great Expectations
Great Expectations is the open-source data testing framework that defined the de...
Acceldata
Acceldata is an enterprise data observability platform that covers data quality,...
Orchestration (5)
Workflow & Orchestration
Workflow orchestration tools schedule, monitor, and manage data pipelines as dir...
Apache Airflow
Apache Airflow is the dominant open-source workflow orchestrator.
Astronomer
Astronomer is the commercial company built around managed Apache Airflow.
Prefect
Prefect is a modern Python-native workflow orchestrator built on the philosophy ...
Dagster
Dagster is the modern, asset-oriented workflow orchestrator.
Event Streaming (8)
Event Streaming
Event streaming platforms move data as it happens, treating every change in your...
Apache Kafka
Apache Kafka is the dominant open-source event streaming platform.
Amazon Kinesis
Amazon Kinesis is AWS's managed event streaming service, launched in 2013 as Ama...
Confluent
Confluent is the commercial company founded in 2014 by Apache Kafka's original c...
Apache Pulsar
Apache Pulsar is a distributed messaging and streaming platform created at Yahoo in 2012-2013.
Upsolver
Upsolver is a streaming ETL platform founded in Tel Aviv in 2014 that turns SQL ...
StreamNative
StreamNative is the commercial company behind Apache Pulsar, founded in 2019 by ...
Redpanda
Redpanda is a Kafka-compatible streaming platform built from scratch in C++ by R...
Stream Processing (6)
Stream Processing
Stream processing engines compute on data while it is in motion -- transforming,...
Apache Flink
Apache Flink is the dominant open-source stream processing engine.
Materialize
Materialize is a streaming database that maintains SQL views incrementally as data changes.
ksqlDB
ksqlDB is Confluent's SQL interface for stream processing on top of Apache Kafka.
Ververica
Ververica is the original commercial company behind Apache Flink.
Decodable
Decodable is a modern managed Apache Flink platform founded in 2020 by Eric Samm...
Real-time Analytics (5)
Real-time Analytics
Real-time analytics databases are OLAP systems built to answer analytical querie...
ClickHouse
ClickHouse is the dominant real-time analytics database.
Apache Druid
Apache Druid is a real-time analytics database created at Metamarkets in 2011 by...
Apache Pinot
Apache Pinot is a real-time analytics database created at LinkedIn in 2013-2014 ...
Rockset
Rockset was a real-time analytics database founded in 2016 by ex-Facebook engine...
Storage (5)
Storage Primitives
Cloud object storage and distributed file systems — the bottom layer of the mode...
Amazon S3
Amazon S3 (Simple Storage Service) is the object storage service AWS launched in 2006.
Google Cloud Storage
Google Cloud Storage (GCS) is Google Cloud's object storage service.
Azure Blob Storage
Azure Blob Storage is Microsoft Azure's object storage service, launched in 2010.
HDFS
HDFS (Hadoop Distributed File System) was the distributed file system at the hea...
Data Lakehouse (2)
Reverse ETL (3)
Data Workspaces (4)
Data Workspaces & Notebooks
Data workspaces are collaborative environments where analysts write SQL and Pyth...
Mode
Mode is the original collaborative SQL and Python notebook for analytics teams.
Hex
Hex is a collaborative, reactive data workspace for SQL and Python that has beco...
Deepnote
Deepnote is a cloud-hosted, real-time collaborative Jupyter notebook for data science teams.
ML Platforms (4)
DS/ML Platforms
ML platforms are end-to-end environments for training, deploying, and operating ...
Databricks ML
Databricks ML is the machine learning side of the Databricks lakehouse.
Amazon SageMaker
Amazon SageMaker is AWS's end-to-end machine learning platform, launched in 2017.
DataRobot
DataRobot is the AutoML pioneer that defined the enterprise machine learning cat...
Governance & Security (4)
Data Governance & Security
Data governance and security tools control who can see which data, how it is mas...
Collibra Governance
Collibra's Data Governance and Protect modules are the governance-first half of ...
Privacera
Privacera is the commercial data access governance platform founded in 2016 by t...
Immuta
Immuta is the leading pure-play data access governance platform, founded in 2015...
Vendors (7)
Vendors
Vendors are the companies that build and sell data tools.
Amazon Web Services (AWS)
AWS is the original public cloud and the dominant hyperscaler.
Google Cloud Platform
Google Cloud is the third-largest hyperscaler and home to BigQuery -- the most t...
Microsoft (Azure & Fabric)
Microsoft is the enterprise incumbent of the cloud era.
Starburst
Starburst is the commercial company built around Trino, the open-source distribu...
Salesforce
Salesforce became a major data ecosystem vendor by acquisition, not by building.
dbt Labs
dbt Labs (formerly Fishtown Analytics) is the company behind dbt -- the SQL-base...
Stack Overview (3)
Stack Overview
An opinionated map of the modern data stack, layer by layer, as TextQL sees it.
How to Read This Wiki
A guide to navigating the Data Ecosystem Wiki: what each section is for, how the...
TextQL in the Stack
TextQL Ana is an AI analyst that sits on top of the entire data stack — warehous...