Data Warehouses | Data Ecosystem Wiki

Thirty Launches in Thirty Days · Read the recap →

Contents

Data Warehouses

Cloud data warehouses are purpose-built analytical databases that store structured data in a schema-enforced, query-optimized format. The dominant platforms are Snowflake, Databricks, Google BigQuery, and Amazon Redshift.

A data warehouse is a database designed specifically for answering questions about your business, not for running your business. That distinction matters. The database behind your app (Postgres, MySQL, etc.) is optimized for writing data quickly — processing thousands of transactions per second. A data warehouse is optimized for reading data quickly — scanning millions of rows to answer questions like "what were our top-selling products last quarter by region?"

Think of it this way: a data warehouse is like a really organized library. Every book is cataloged, shelved by topic, and indexed so you can find anything fast. A data lake, by contrast, is more like a storage unit — you throw everything in boxes and sort it later. An operational database is like your desk: small, fast, covered in whatever you're working on right now.

Data warehouses have existed since the 1980s (Bill Inmon coined the term in 1990), but the modern cloud data warehouse — pioneered by Amazon Redshift in 2012 and redefined by Snowflake and BigQuery — is a fundamentally different animal from its on-premise ancestors. The old version required buying hardware upfront, hiring a DBA team, and waiting weeks for capacity. The new version is elastic, serverless (or close to it), and billed by the query.

What Makes a Data Warehouse a Data Warehouse

Three properties separate a data warehouse from other places you might put data. These are mutually exclusive and collectively exhaustive — if a system has all three, it's a warehouse (or warehouse-like); if it's missing one, it's something else:

1. Schema enforcement (structure first). Data must conform to a defined structure before it enters the warehouse. You declare your columns, their types, and their relationships upfront. This is sometimes called "schema-on-write." A data lake does the opposite: you dump raw files in and figure out the structure when you read them ("schema-on-read"). The tradeoff is rigidity for reliability. Warehouse data is trustworthy because it already passed validation; lake data is flexible because anything goes.

2. Query optimization (speed for analytical patterns). Warehouses use tricks that make analytical queries fast at the expense of transactional speed. The most important is columnar storage: instead of storing each row together (like a spreadsheet), warehouses store each column together. Why? Because analytical queries almost always ask for a few columns across millions of rows ("give me the sum of revenue for all orders"), not all columns for a few rows. Columnar storage means the engine only reads the data it needs. This is also why warehouses are called OLAP systems (Online Analytical Processing) rather than OLTP systems (Online Transaction Processing).

3. Built-in governance and access control. Warehouses let you control who can see what, down to the row and column level. They maintain audit logs, support role-based access, and integrate with identity providers. Data lakes have historically been a free-for-all on this front (though that's changing with tools like Unity Catalog and Lake Formation).

Simple analogy for columnar storage: Imagine a spreadsheet with 1 million rows and 50 columns. You want the average of column C. In a traditional row-based database, the engine reads every cell in every row just to find column C — it has to walk through columns A, B, C, D... for each row. In a columnar database, all of column C is stored together in one contiguous block. The engine grabs just that block and ignores the other 49 columns entirely. That's why analytical queries on warehouses can be 10—100x faster.

Key Concepts, Explained Simply

OLAP vs OLTP. Your app's database is OLTP: optimized for lots of small, fast reads and writes (a user logs in, places an order, updates their profile). A warehouse is OLAP: optimized for fewer but much larger reads (scan all orders from last year, group by category, calculate growth). You generally cannot use one as the other. Running a heavy analytical query on your production OLTP database will slow down your app. That's the whole reason warehouses exist.

Star schemas and dimensional modeling. The dominant way to organize warehouse data is the star schema: a central fact table (events that happened, like orders or clicks, typically with numbers you want to aggregate) surrounded by dimension tables (descriptive context, like customers, products, dates). It's called a "star" because if you draw the relationships, the fact table is in the center with dimension tables radiating outward. This pattern makes queries intuitive — "give me total revenue (fact) by product category (dimension) by quarter (dimension)" — and lets the warehouse engine optimize aggressively.

ELT, not ETL. Older warehouses were expensive to run, so you transformed data before loading it (Extract, Transform, Load). Modern cloud warehouses have cheap, elastic compute, so teams now load raw data first and transform it inside the warehouse using tools like dbt (Extract, Load, Transform). This is a subtle but important shift: the warehouse became the transformation layer, not just the storage layer.

How Vendors Define "Data Warehouse" Differently

Here is something the vendor documentation will never tell you plainly: every major vendor defines "data warehouse" in a way that flatters their own product. Understanding these biases is essential to making sense of the market.

Snowflake wants "data warehouse" to be the center of the universe. Their pitch: put all your data in Snowflake — structured, semi-structured, even unstructured — and it becomes your single platform for analytics, data sharing, applications, and AI. They've expanded the warehouse concept to include data lakes, data apps, and marketplace features, branding it the "Data Cloud." When Snowflake says "warehouse," they mean everything.

Databricks wants you to stop saying "data warehouse" altogether. Their position: the warehouse was a necessary but limited concept, and the future is the lakehouse — an open architecture that gives you warehouse-style SQL performance on top of data lake storage (Delta Lake, Parquet files on S3/ADLS/GCS). When Databricks says "data warehouse," they mean "that legacy thing we're replacing." Their SQL Warehouse product is pointedly named to show warehouses are just one feature of their broader lakehouse.

Google BigQuery blurs the line by being serverless from day one. There's no cluster to manage, no warehouse "size" to pick. You just run queries and pay per byte scanned. BigQuery also natively queries data in Google Cloud Storage (lake-style), making the warehouse/lake distinction less meaningful. Google's pitch: stop thinking about infrastructure categories entirely.

Amazon Redshift is the most traditional of the four — it started as a classic MPP (massively parallel processing) columnar database. Amazon has since added Redshift Serverless, Redshift Spectrum (query S3 data), and integration with the broader AWS lake ecosystem (Lake Formation, Glue, Athena). But Redshift still feels the most like a "database you manage" compared to the others.

The honest takeaway: the term "data warehouse" means something slightly different depending on who you ask, and the differences are driven as much by commercial incentives as by technical architecture.

The Convergence: Warehouses and Lakes Are Merging

The single most important trend in data infrastructure from 2020 to 2026 has been the convergence of warehouses and data lakes. Both sides are absorbing each other's strengths:

Warehouses are becoming more lake-like. Snowflake now supports Apache Iceberg tables backed by your own cloud storage. BigQuery has BigLake. Redshift has Spectrum. You can now keep data in open formats on cheap object storage and query it through your warehouse — something only lakes used to offer.

Lakes are becoming more warehouse-like. Databricks added ACID transactions (via Delta Lake), SQL query engines (Photon), schema enforcement, and fine-grained access control (Unity Catalog). These were all things only warehouses used to provide.

The result: the "warehouse vs. lake" debate is increasingly a false dichotomy. Most large organizations end up with both, and the platforms themselves are converging toward a single architecture that combines open storage (Parquet/Iceberg on object storage) with high-performance SQL compute and strong governance. Whether you call that a "warehouse," "lakehouse," or "data cloud" depends mostly on which vendor's conference you attended last.

Where the puck is going: expect the category label to matter less and less. The winning architecture in most enterprises is open table formats on object storage, queried by one or more high-performance SQL engines, with a unified governance layer on top. Whether Snowflake or Databricks hosts that query engine is a commercial question, not an architectural one.

When to Pick a Warehouse vs. Alternatives

Scenario	Best fit	Why
—-	—-	—-
BI dashboards and reporting	Data warehouse	Schema enforcement + fast SQL = reliable dashboards
Ad hoc analytics by SQL-fluent teams	Data warehouse	Optimized for exactly this
Data science / ML on raw unstructured data	Data lake or lakehouse	Warehouses aren't built for unstructured data at scale
Real-time analytics (sub-second)	Real-time OLAP (ClickHouse, Druid, Pinot)	Warehouses have query latency in the seconds, not milliseconds
Streaming event data	Event streaming + lake or warehouse	Warehouses aren't real-time ingest systems
Small team, simple needs	Warehouse (BigQuery or Snowflake)	Managed, serverless, no infra to run
Large enterprise, multi-engine	Lakehouse on open formats	Avoid vendor lock-in, query from multiple engines

Tools in This Category

Snowflake — The dominant independent cloud data warehouse. Closed storage, expanding into data cloud platform.
Databricks — Originally a Spark/lake company, now offering SQL warehouses as part of its lakehouse platform.
Google BigQuery — Google's serverless, pay-per-query warehouse. Pioneered separation of storage and compute.
Amazon Redshift — AWS's MPP columnar warehouse. The longest-running cloud warehouse (launched 2012).

How TextQL Works with Data Warehouses

Data warehouses are the most common connection point for TextQL. TextQL Ana connects directly to Snowflake, Databricks, BigQuery, and Redshift, allowing business teams to ask questions of warehouse data in natural language without writing SQL. Because warehouses enforce schemas and have well-defined table structures, they're the ideal backend for AI-powered analytics — the structure that makes warehouses reliable for BI also makes them reliable for LLM-generated queries.

See TextQL in action

Data Warehouses

Category Analytical storage & compute

Also called Cloud DW, OLAP warehouse, analytical database

Not to be confused with Data lakes, lakehouses, operational databases (OLTP)

Key vendors Snowflake, Databricks, BigQuery, Redshift

First-gen Teradata, Oracle Exadata, IBM Netezza

Query language SQL (all major platforms)

Typical users Analytics engineers, data analysts, BI teams

Monthly mindshare ~2M · everyone with a BI report or analytics job; foundational concept across all data orgs