Data Lakehouse

The lakehouse architecture combines the flexibility of data lakes with the performance of data warehouses — or at least, that's the pitch. Here's what it actually means.

A data lakehouse is an architecture that attempts to combine the cheap, scalable, schema-agnostic storage of a data lake with the structured querying, ACID transactions, and governance features of a data warehouse. The term was coined by Databricks in 2020 to describe their platform and position it as the successor to both traditional data lakes and cloud warehouses.

In plain English: imagine you had a storage unit where you could throw in anything — files, images, logs, CSVs, Parquet files, whatever. That's a data lake. Now imagine you added shelving, labels, a checkout system, and a rule that says nobody can move two things at once without signing a ledger. That's basically what a lakehouse is — your storage unit got organized without losing the ability to throw anything in there.

Whether "lakehouse" describes a genuinely new architecture or is primarily a marketing term depends heavily on who you ask — and what they're selling.

Why This Term Exists

The history matters because "data lakehouse" is arguably the most vendor-loaded term in the entire data ecosystem.

Databricks coined it. In January 2020, Databricks co-founders Ali Ghodsi and Matei Zaharia, along with colleagues, published a blog post introducing "lakehouse" to describe an architecture where you store everything in open formats on cheap object storage (like S3) but layer on warehouse-grade features — transactions, indexing, SQL access. (A more formal paper followed at CIDR 2021.) The implicit argument: you don't need a separate warehouse like Snowflake. The lake becomes your warehouse.

This was a strategic move. Databricks had built its business on Apache Spark and data lakes. Snowflake had built its business on cloud warehouses. By coining "lakehouse," Databricks reframed the conversation: warehouses were legacy; the lakehouse was the evolution.

If you ask Databricks, a lakehouse is the natural evolution beyond warehouses — why pay to copy data into a proprietary warehouse when your lake can do the same job?

If you ask Snowflake, they'd say their warehouse already does everything a lakehouse claims to do — they just don't use the buzzword. Snowflake would (and does) argue that external tables, Iceberg support, and their storage layer already provide lake-like flexibility without sacrificing warehouse performance.

If you ask Google, they'll show you that BigQuery has quietly added lakehouse features — BigLake, native Iceberg support, open storage connectors — without ever centering their marketing around the term.

The honest framing: "lakehouse" is roughly 50% genuine architectural pattern and 50% competitive positioning. That doesn't make it meaningless — the architecture is real — but you should understand the term's origin before evaluating vendor claims.

The Three-Column Diagram (And What It Gets Right)

You've almost certainly seen the Databricks three-column comparison: Data Warehouse vs. Data Lake vs. Data Lakehouse, typically showing how warehouses have structure but are expensive and siloed, lakes are cheap but chaotic, and lakehouses magically get the best of both.

What that diagram gets right:

  • Data warehouses really do enforce structure, enable fast SQL, and support ACID transactions — but they historically struggle with unstructured data (images, logs, ML training sets) and lock data into proprietary formats.
  • Data lakes really are cheap and flexible — but without additional tooling, they devolve into "data swamps" where nobody knows what's in them, there's no transactional consistency, and querying is slow.
  • The lakehouse pattern genuinely does aim to address both sets of limitations.

What it oversimplifies:

  • It presents a clean linear evolution (warehouse, then lake, then lakehouse) when the real history is messier. Many organizations ran warehouses and lakes simultaneously for different workloads — that two-tier architecture was a deliberate design choice, not a failure.
  • It implies warehouses can't handle unstructured or semi-structured data, when modern warehouses (Snowflake, BigQuery, Redshift) have added extensive support for JSON, Parquet, and external tables.
  • It conveniently omits that the "lakehouse" column describes Databricks' product. The diagram is marketing material presented as a technology taxonomy.

The diagram is useful as a mental model for understanding the motivation behind lakehouse architecture. It's less useful as an objective comparison of what modern platforms actually do.

The Four Pillars: What Makes Something a Lakehouse

Technically, a lakehouse adds four categories of capability on top of a raw data lake. These are the features that turn a pile of files into something you can query like a warehouse:

### 1. ACID Transactions

Raw data lakes have no transactional guarantees. If two processes write to the same dataset simultaneously, you can end up with corrupted or partial data. Lakehouse architectures add ACID (Atomicity, Consistency, Isolation, Durability) transactions through open table formats like Delta Lake, Apache Iceberg, or Apache Hudi. These formats maintain a transaction log alongside the data files, ensuring that reads and writes are consistent.
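
To make this concrete, here's a minimal sketch using the open-source deltalake (delta-rs) Python package; the table path and columns are hypothetical, not from any particular system.

```python
# A minimal sketch of transactional writes with the deltalake (delta-rs)
# Python package. The path and columns are illustrative.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/events_delta"  # hypothetical location; S3/GCS URIs also work

# Each call is one atomic commit: it either fully appends an entry to the
# _delta_log transaction log or leaves the table untouched. A concurrent
# reader sees version 0 or version 1, never a half-written state.
write_deltalake(path, pd.DataFrame({"user_id": [1, 2], "action": ["click", "view"]}))
write_deltalake(path, pd.DataFrame({"user_id": [3], "action": ["click"]}), mode="append")

dt = DeltaTable(path)
print(dt.version())  # 1 -> two commits: versions 0 and 1
print(dt.history())  # the log doubles as an audit trail of every transaction
```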

### 2. Schema Enforcement and Evolution

A data lake will happily accept a CSV with 12 columns today and 15 columns tomorrow with no warning. Lakehouse table formats enforce schemas — they validate that incoming data matches the expected structure — while also supporting controlled schema evolution (adding a column, changing a type) without breaking existing queries.
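
Here's a hedged sketch of both behaviors, again with the deltalake package (path and columns hypothetical); the `schema_mode="merge"` option for opt-in evolution is available in recent versions of the package.

```python
# A minimal sketch of schema enforcement and evolution with the deltalake
# package. Path and column names are illustrative.
import pandas as pd
from deltalake import write_deltalake

path = "/tmp/orders_delta"  # hypothetical table
write_deltalake(path, pd.DataFrame({"order_id": [1], "amount": [9.99]}))

# An extra column: a raw lake would silently accept this file.
wider = pd.DataFrame({"order_id": [2], "amount": [5.00], "coupon": ["X1"]})

try:
    write_deltalake(path, wider, mode="append")  # enforcement: rejected
except Exception as err:
    print("append rejected:", err)

# Controlled evolution: explicitly opt in to adding the new column.
write_deltalake(path, wider, mode="append", schema_mode="merge")
```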

### 3. Indexing and Query Performance

Raw Parquet files on S3 can be scanned, but slowly. Lakehouse architectures add data skipping, Z-ordering, compaction, and various indexing strategies so that a SQL query doesn't have to read every file in a dataset. This is what closes the performance gap with traditional warehouses.
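
A short sketch of what that maintenance looks like with the deltalake package, continuing the hypothetical table from above:

```python
# A minimal sketch of compaction and Z-ordering with the deltalake package.
# The table path and clustering column are illustrative.
from deltalake import DeltaTable

dt = DeltaTable("/tmp/events_delta")     # hypothetical table from above
print(dt.optimize.compact())             # bin-packs many small files into fewer large ones
print(dt.optimize.z_order(["user_id"]))  # clusters rows so filters on user_id
                                         # can skip whole files via min/max stats
```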

### 4. Governance and Access Control

Data lakes historically had file-level permissions at best — you could control who accesses a bucket, but not who can see column X in table Y. Lakehouse platforms add fine-grained access control, data lineage, audit logging, and catalog integration (e.g., Unity Catalog for Databricks, or open catalogs like Apache Polaris for Iceberg).
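
A hedged sketch of the catalog half of this pillar using PyIceberg; the catalog name, endpoint, and credentials below are all hypothetical placeholders, and the exact properties depend on your deployment.

```python
# A minimal sketch of catalog integration with PyIceberg against a REST
# catalog (e.g., an Apache Polaris-style deployment). All connection details
# are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "analytics",                                    # hypothetical catalog name
    type="rest",
    uri="https://catalog.example.com/api/catalog",  # hypothetical endpoint
    credential="client-id:client-secret",           # hypothetical OAuth2 credentials
)

# The catalog, not the object store, decides which namespaces and tables this
# principal may see; this is where fine-grained access control lives.
for namespace in catalog.list_namespaces():
    print(namespace, catalog.list_tables(namespace))
```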

If a platform provides all four on top of open-format storage, it fits the lakehouse definition regardless of what the vendor calls it.

The Great Convergence: Everyone Is a Lakehouse Now

Here's the part that most vendor marketing won't tell you: the lakehouse "won" the narrative war, but architecturally, every major platform has converged to roughly the same place. The distinction between warehouse and lakehouse is increasingly academic.

| Vendor | Originally | What They've Added | Lakehouse? |
|---|---|---|---|
| Databricks | Spark-based data lake platform | SQL warehouses, Unity Catalog, serverless SQL, BI integrations | Yes (they coined the term) |
| Snowflake | Cloud data warehouse | Iceberg tables, external tables, Snowpark for Python/ML, unstructured data support | Functionally yes, though they avoid the term |
| Google BigQuery | Cloud data warehouse | BigLake, native Iceberg/Delta support, object table queries, open storage | Functionally yes, marketed as "data cloud" |
| Amazon | Redshift (warehouse) + S3 + Glue (lake) | Redshift Spectrum, Lake Formation, zero-ETL integrations, Iceberg support | Converging, sold as multiple services |
| Microsoft | Azure Synapse + ADLS | Microsoft Fabric (OneLake), Delta Lake native, unified analytics | Yes, via Fabric |

The pattern is clear: warehouse vendors added lake capabilities (open formats, unstructured data, external tables). Lake vendors added warehouse capabilities (SQL engines, transactions, governance). They met in the middle.

When does the distinction actually matter in practice?

  • Storage format ownership matters. If your data sits in open formats (Iceberg, Delta) on your own object storage, you have portability. If it sits in a vendor's proprietary storage, you're locked in regardless of what they call themselves. This is the one area where the lakehouse philosophy — open formats, separation of storage and compute — has genuine architectural implications.
  • Workload diversity matters. If you only run SQL analytics, a traditional warehouse works fine. If you also need ML training, streaming ingestion, and unstructured data processing on the same data, a lakehouse-style architecture (or any modern platform, really) is more appropriate.
  • Cost structure matters. Lakehouse architectures that store data in your own cloud storage (S3, GCS, ADLS) can be cheaper for large-scale data because you're paying commodity storage prices rather than warehouse storage markup.

When is it just branding? If you're running SQL dashboards on structured data in Snowflake and someone tells you that you need to "migrate to a lakehouse," you probably don't. The workload hasn't changed — only the buzzword.

The Real Technical Foundation: Open Table Formats + Compute Engines

Strip away the marketing and a lakehouse is really two things:

  1. An open table format (Delta Lake, Apache Iceberg, or Apache Hudi) that sits on top of commodity object storage and provides transactions, schema management, and metadata.
  2. A compute engine (query engine, SQL engine, or Spark cluster) that reads those formats and provides fast querying, indexing, and access control.

That's it. Everything else — the branding, the diagrams, the three-column comparisons — is packaging around these two components. If you understand table formats and query engines, you understand lakehouses.
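
To see the separation in practice, here's a minimal sketch (reusing the hypothetical table from above) where the deltalake package handles the table format and DuckDB serves as an interchangeable compute engine:

```python
# A minimal sketch of the format/engine split: deltalake reads the open table
# format, and DuckDB is just one of many engines that can query it via Arrow.
import duckdb
from deltalake import DeltaTable

# The storage layer neither knows nor cares which engine runs the query.
events = DeltaTable("/tmp/events_delta").to_pyarrow_dataset()

# DuckDB resolves `events` from the local Python scope (a replacement scan).
print(duckdb.sql("SELECT action, COUNT(*) AS n FROM events GROUP BY action").df())
```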

The industry's current trajectory suggests that Apache Iceberg is becoming the dominant open table format (with Snowflake, AWS, Google, and even Databricks adding Iceberg support alongside Delta Lake), which further erodes any meaningful distinction between "lakehouse" and "modern warehouse."

How TextQL Works with Data Lakehouses

TextQL sits above the lakehouse layer entirely. Whether your data lives in a Databricks lakehouse, Snowflake warehouse, BigQuery, or a combination of all three, TextQL Ana connects to each and lets teams query across them as a single unified asset. The lakehouse-vs-warehouse distinction is invisible to end users — which, in a way, proves the point that the distinction is increasingly just plumbing.

See TextQL in action