Delta Lake (Databricks)
Delta Lake is the open-source table format Databricks built to give cloud object storage the ACID guarantees of a database: the technical foundation of the lakehouse.
Delta Lake is what turns a folder full of Parquet files into something that behaves like a database table. It adds a transaction log on top of Parquet so that multiple readers and writers can work on the same data safely, schemas can be enforced and evolved, deletes and updates work, time travel is possible, and the lakehouse can offer the warehouse-like guarantees that bare object storage cannot.
The simple metaphor: Parquet is the brick; Delta Lake is the building inspector and the property records office. Bricks alone don't make a building — you need someone tracking which bricks are where, who added them, who removed them, and whether the structure is consistent. The Delta transaction log is that record-keeping layer.
This page is about Delta Lake as Databricks ships it — the version embedded in the Databricks Runtime, integrated with Unity Catalog, accelerated by Photon, and the default storage format for Databricks SQL. For the format itself in the abstract, see Delta Lake under table formats.
Delta Lake was built inside Databricks starting around 2017 to solve a problem that customers kept hitting: doing transactional updates on data in a data lake was almost impossible. The lake had cheap storage and infinite scale, but it lacked the most basic database property — atomicity. If two jobs wrote to the same table at the same time, you got partial files and corrupted state. If a job failed halfway through, you had to manually clean up. There was no UPDATE, no DELETE, no MERGE; everything was append-only or full-overwrite. Working with lake data felt like working with a database from the 1980s.
Delta was the answer. The core idea was to maintain a transaction log (a series of JSON files in a _delta_log directory) that records every commit to the table. Each commit is atomic; the log is the source of truth for which Parquet files belong to the table at any given version. This single architectural decision unlocked a cascade of warehouse-like features: ACID transactions, schema enforcement, schema evolution, MERGE INTO (upserts), DELETE, UPDATE, time travel ("show me this table as of last Tuesday"), and concurrent reads and writes without corruption.
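The mechanics of that log can be illustrated with a toy model. The sketch below (plain Python, not the Delta implementation) replays a list of commits, where each commit adds or removes Parquet file references, to compute the table's file set at any version — which is also, conceptually, how time travel resolves "the table as of version N." Real Delta commits carry more action types (metaData, protocol, commitInfo) and checkpointing, omitted here.

```python
# Toy model of a Delta _delta_log: each commit is a list of actions that
# add or remove Parquet files. Replaying commits 0..N yields the exact
# file set for table version N -- the basis of ACID snapshots and time travel.
# (Illustrative only; not the real Delta protocol.)

commits = [
    # version 0: initial write
    [{"add": {"path": "part-0000.parquet"}},
     {"add": {"path": "part-0001.parquet"}}],
    # version 1: an UPDATE rewrites one file
    [{"remove": {"path": "part-0001.parquet"}},
     {"add": {"path": "part-0002.parquet"}}],
]

def files_at_version(commits, version):
    """Replay the log up to `version` to get the live file set."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

print(sorted(files_at_version(commits, 0)))  # ['part-0000.parquet', 'part-0001.parquet']
print(sorted(files_at_version(commits, 1)))  # ['part-0000.parquet', 'part-0002.parquet']
```

Note that a "remove" action doesn't delete the physical file — it only drops it from the live set, which is why older versions remain queryable until VACUUM physically reclaims them.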
Databricks open-sourced Delta Lake in April 2019, donated it to the Linux Foundation in 2019, and made it the default table format for the Databricks platform. The launch was a direct counter-move to the Hudi project (created at Uber in 2016, open-sourced 2017) and the Iceberg project (created at Netflix in 2017, open-sourced 2018), which were solving the same problem from different starting points. The three formats — Hudi, Iceberg, Delta — are technically siblings; the differences are real but relatively narrow, and the choice of which one a company uses is increasingly a function of which platform they're already on.
The reason Delta Lake exists is the foundational lakehouse argument: if you can give object storage the guarantees of a database, you no longer need a separate warehouse. Without Delta (or Iceberg, or Hudi), the lakehouse is a slogan. With it, the lakehouse is an architecture. Delta is the engineering substrate that makes Databricks' entire strategic narrative work.
Delta Lake the open-source project is genuinely open and runs on Spark, Flink, Trino, Presto, DuckDB, and many other engines. But Delta Lake on Databricks gets a number of extras that the open-source version does not:
- Predictive optimization, which runs OPTIMIZE and VACUUM operations on tables automatically based on usage patterns.
- Photon-accelerated reads and writes on Delta tables.

Some of these features eventually flow back to the open-source project; others remain proprietary. The relationship is similar to Postgres vs. Aurora Postgres — the open core is real and useful, but the managed version has performance and operational features that give the vendor a meaningful edge.
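What OPTIMIZE actually does is compact many small files into fewer large ones. A hypothetical sketch of that bin-packing idea, in plain Python rather than the real engine (which rewrites Parquet data and commits matching add/remove actions to the log):

```python
# Conceptual sketch of OPTIMIZE-style compaction planning: greedily group
# small files into bins near a target output size. (Toy model; the real
# operation also considers clustering, partitioning, and file layout.)

TARGET_BYTES = 128 * 1024 * 1024  # assumed ~128 MB target output file size

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """First-fit-decreasing binning of files into compaction groups."""
    bins = []
    for size in sorted(file_sizes, reverse=True):
        for group in bins:
            if sum(group) + size <= target:
                group.append(size)  # fits in an existing output file
                break
        else:
            bins.append([size])     # start a new output file
    return bins

# Three small files, target size 100: compacts into two output groups.
print(plan_compaction([60, 60, 30], target=100))  # [[60, 30], [60]]
```

The point of the sketch is the why, not the how: thousands of tiny streaming-produced files each cost a metadata lookup and an object-store round trip, so collapsing them toward a target size is what keeps scans fast.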
MERGE INTO for CDC pipelines is the killer use case. Most production data pipelines that ingest from Kafka, Fivetran, Debezium, or any CDC source land in Delta tables via MERGE INTO; Delta makes upserts on lake data tractable. Schema changes follow familiar ALTER TABLE semantics. The main operational gotcha is small files: without regular OPTIMIZE runs, streaming workloads can leave you with thousands of small files and slow scans.

Delta Lake is the most consequential piece of data infrastructure Databricks has ever shipped, and it's the technology that turned the lakehouse from a slide into an architecture that ships. The competition with Iceberg is real and ongoing, and the honest read in 2026 is that Iceberg has won the open-format mindshare while Delta has won the Databricks-installed-base depth. Snowflake, BigQuery, AWS, and most non-Databricks vendors have standardized on Iceberg for their open-table-format support. Databricks customers, meanwhile, are mostly on Delta because it's the path of least resistance and because Photon and Unity Catalog were built around it.
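The MERGE INTO semantics that make CDC tractable can be modeled in a few lines. This is a minimal sketch using plain dicts keyed by primary key — the names (`merge_cdc`, the event tuples) are illustrative, not a Delta API — showing the three MERGE branches: matched rows update, unmatched source rows insert, and delete events remove the target row.

```python
# Toy model of MERGE INTO applied to a CDC feed. Target table is a dict
# keyed by primary key; change events are (op, key, row) tuples.
# (Illustrative semantics only, not the Delta engine.)

def merge_cdc(target, changes):
    """Apply CDC change events to a keyed target table, MERGE-style."""
    for op, key, row in changes:
        if op == "delete":
            target.pop(key, None)    # WHEN MATCHED AND op = 'delete' THEN DELETE
        elif key in target:
            target[key].update(row)  # WHEN MATCHED THEN UPDATE SET ...
        else:
            target[key] = dict(row)  # WHEN NOT MATCHED THEN INSERT ...
    return target

table = {1: {"name": "alice", "plan": "free"}}
events = [
    ("update", 1, {"plan": "pro"}),
    ("insert", 2, {"name": "bob", "plan": "free"}),
    ("delete", 1, {}),
]
print(merge_cdc(table, events))  # {2: {'name': 'bob', 'plan': 'free'}}
```

On a bare Parquet lake this operation is what's nearly impossible: an upsert means rewriting files, and without a transaction log there is no atomic way to swap the old files for the new ones.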
Databricks has hedged this very smartly. In 2023 they introduced UniForm, which lets a Delta table also expose itself as if it were an Iceberg table by writing Iceberg metadata alongside the Delta log. In effect, Databricks is saying "you can keep using Delta and still be readable by every Iceberg-compatible engine." This is the right defensive move: it neutralizes the lock-in argument against Delta while preserving Databricks' performance and feature edge on its native format. The acquisition of Tabular in mid-2024 (the company founded by the Iceberg creators) cemented the strategy — Databricks now employs the people who built Iceberg and is positioning to be a leader in both formats simultaneously.
The convergence story is starkest here. Delta and Iceberg are increasingly two dialects of the same idea, with bridge formats blurring even that distinction. The right way to think about Delta in 2026 is not "Databricks' proprietary format" but "the open table format optimized for the Databricks runtime, with an Iceberg compatibility layer." That's a more nuanced position than the format wars of 2021–2023 implied, and it reflects how rapidly the entire industry is converging on open storage as the foundation.
Delta Lake is mostly invisible to TextQL — by the time queries reach Delta tables through Databricks SQL, TextQL is just generating SQL, and the storage format underneath is an implementation detail. But Delta features matter indirectly. Time travel makes it possible to ask Ana "what did this dashboard look like on Monday" and reproduce the answer against a specific table version. Column-level lineage from Delta operations flows into Unity Catalog, which TextQL uses to ground its query generation. And because Delta is the default storage for most Databricks customers, every TextQL Databricks deployment is, in effect, a TextQL-on-Delta deployment.