Apache Iceberg

Apache Iceberg is the open table format that won the lakehouse format war. It is the reason you can now keep your data in your own S3 bucket, in open Parquet files, and still have ACID transactions, schema evolution, and time travel — without being locked into a single vendor's query engine. In 2026, if you are designing a new data lakehouse and you are not using Iceberg, you are doing something unusual and you had better have a good reason.

Think of Iceberg as a really meticulous filing clerk for your data lake. Your Parquet files are the documents. Iceberg keeps a running index of which files belong to which version of each table, what schema each file was written under, and exactly which files to read to answer any given query. The files never move. Only the index updates. Every commit is an atomic swap of one pointer.
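
The filing-clerk model above can be sketched in a few lines of Python. This is a toy illustration, not the real Iceberg implementation or API: data files are immutable, each table version is an immutable listing of files, and a commit is nothing more than an atomic swap of a single "current snapshot" pointer.

```python
class TinyTable:
    """Toy model of Iceberg's snapshot-pointer design (illustrative only)."""

    def __init__(self):
        self.snapshots = {}   # snapshot_id -> immutable tuple of file paths
        self.current = None   # the single pointer a commit atomically swaps
        self._next_id = 0

    def commit(self, files):
        """Write a new immutable file listing, then point at it."""
        self._next_id += 1
        self.snapshots[self._next_id] = tuple(files)  # files never move
        self.current = self._next_id                  # atomic pointer swap
        return self._next_id

    def read(self, snapshot_id=None):
        """Readers see one consistent snapshot; old snapshots stay readable."""
        return self.snapshots[snapshot_id or self.current]

table = TinyTable()
v1 = table.commit(["s3://bucket/data/a.parquet"])
v2 = table.commit(["s3://bucket/data/a.parquet", "s3://bucket/data/b.parquet"])

print(table.read())    # current snapshot lists both files
print(table.read(v1))  # the old listing is still intact for time travel
```

Note that nothing is ever modified in place: each commit adds a new snapshot, which is exactly why time travel falls out of the design for free.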

Origin: The Netflix Story

Iceberg was born at Netflix in 2017, out of frustration with the Hive table format. Netflix was running one of the largest analytical data lakes on earth — petabytes of Parquet and ORC files on S3 — and Hive's metadata layer was bending under the weight. Listing partitions required expensive S3 list operations. Schema changes were brittle. Atomic writes across partitions were essentially impossible. Correctness bugs were common.

Two Netflix engineers, Ryan Blue and Daniel Weeks, set out to design a better metadata layer from first principles. The core insight: don't track tables by directory structure (which is what Hive did). Track them by an explicit, versioned list of files, written as metadata in the same object store as the data. Every snapshot of a table is an immutable file listing. Committing a change means writing a new snapshot and atomically updating a pointer to it.

This sounds obvious in hindsight. It was not. The design unlocked everything: real ACID transactions on S3, reliable schema evolution, time travel for free, partition evolution without rewriting data, and the ability to support multiple query engines natively without any of them having to "own" the table.

Netflix donated Iceberg to the Apache Software Foundation in November 2018. It graduated to a top-level Apache project in May 2020.

The Four Properties, Iceberg-Style

Iceberg nails all four of the defining table format properties:

ACID transactions. Iceberg uses snapshot isolation. Every commit produces a new metadata file describing a new snapshot of the table. The catalog atomically swaps the current pointer. Two writers can prepare commits concurrently; the loser retries. Readers always see a consistent snapshot.
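
The "loser retries" protocol is classic optimistic concurrency. A minimal sketch, with invented names rather than the real catalog API: a commit lands only if the table pointer still equals the snapshot the writer started from; a losing writer re-reads the pointer and retries.

```python
class Catalog:
    """Toy catalog guarding a table's current-snapshot pointer."""

    def __init__(self):
        self.current_snapshot = 0

    def compare_and_swap(self, expected, new):
        """Atomic in a real catalog; shown sequentially for illustration."""
        if self.current_snapshot != expected:
            return False          # another writer committed first
        self.current_snapshot = new
        return True

catalog = Catalog()
base_b = catalog.current_snapshot                 # writer B starts from snapshot 0
assert catalog.compare_and_swap(0, 1)             # writer A commits first
assert not catalog.compare_and_swap(base_b, 2)    # B loses the race...
base_b = catalog.current_snapshot                 # ...re-reads the pointer...
assert catalog.compare_and_swap(base_b, 2)        # ...and retries successfully
```

Because readers only ever follow the pointer to one complete snapshot, they never observe a half-finished commit.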

Schema evolution. Iceberg tracks columns by unique ID, not by name or position. You can add, drop, rename, reorder, and even change types on columns without touching the underlying Parquet files. This is a major correctness improvement over Hive and Delta's early schema handling, where a dropped-and-re-added column could silently return old data.
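
A toy illustration (again, not the real Iceberg API) of why column IDs matter: data files store values keyed by column ID, so a rename is pure metadata, and a dropped-and-re-added column receives a fresh ID instead of silently resurrecting old data.

```python
import itertools

ids = itertools.count(1)

def column(name):
    """Every column gets a unique, never-reused ID."""
    return {"id": next(ids), "name": name}

schema = [column("user_id"), column("email")]   # ids 1 and 2
data_file = {1: "u42", 2: "old@example.com"}    # values keyed by column ID

# Rename: same ID, so values in existing files still resolve correctly.
schema[1]["name"] = "contact_email"
assert data_file[schema[1]["id"]] == "old@example.com"

# Drop "email" and re-add a column with the same name: it gets ID 3,
# so the old values cannot leak back under the new column.
schema = [schema[0], column("email")]
assert data_file.get(schema[1]["id"]) is None
```

Name- or position-based tracking (the Hive approach) fails exactly the second assertion: the re-added "email" column would match the old data by name.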

Time travel. Every snapshot is retained (subject to a retention policy). You can SELECT * FROM orders FOR VERSION AS OF 12345 or FOR TIMESTAMP AS OF '2026-01-01' and get the exact state of the table at that point. Rollback is a one-line operation.
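
Resolving FOR TIMESTAMP AS OF is conceptually a lookup over the table's commit history: find the last snapshot committed at or before the requested time. A minimal sketch of that resolution logic (illustrative, not the engine's actual code):

```python
import bisect

# Each commit records (commit_timestamp, snapshot_id), kept in commit order.
history = [(100, 1), (200, 2), (300, 3)]

def snapshot_as_of(history, ts):
    """Return the snapshot visible at time ts: the last commit with
    commit_timestamp <= ts."""
    idx = bisect.bisect_right([t for t, _ in history], ts) - 1
    if idx < 0:
        raise ValueError("table did not exist at that time")
    return history[idx][1]

assert snapshot_as_of(history, 250) == 2   # between commits -> earlier snapshot
assert snapshot_as_of(history, 300) == 3   # exact commit time is visible
```

Rollback is the same machinery in reverse: point the table back at an older snapshot ID.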

Partition evolution and hidden partitioning. This is where Iceberg truly shines. In Hive, partitioning was a directory-naming convention, and users had to manually reference partition columns in queries (WHERE event_date = '2026-01-01'). Iceberg hides partitioning: you query WHERE event_ts BETWEEN ... and Iceberg figures out which files to skip. Even better, you can change the partitioning scheme over time (from daily to hourly) without rewriting old data — old files use the old partitioning, new files use the new, and queries work across both transparently.
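
A toy sketch of both ideas together, with simplified names: files record a partition value produced by a transform of a source column (day(event_ts) under the old spec, hour(event_ts) under the new one), and the planner prunes files without the user ever naming a partition column. Real Iceberg prunes on value ranges per transform; equality is used here only to keep the sketch short.

```python
from datetime import datetime

def day(ts):  return ts.strftime("%Y-%m-%d")      # old partition transform
def hour(ts): return ts.strftime("%Y-%m-%d-%H")   # new partition transform

# Old files were written under the daily spec, newer files under the hourly
# spec; no data was rewritten when the spec changed.
files = [
    {"path": "old-1.parquet", "transform": day,  "value": "2026-01-01"},
    {"path": "new-1.parquet", "transform": hour, "value": "2026-01-02-09"},
    {"path": "new-2.parquet", "transform": hour, "value": "2026-01-02-17"},
]

def prune(files, query_ts):
    """Keep only files whose partition value matches the query timestamp
    under WHICHEVER spec each file was written with."""
    return [f["path"] for f in files if f["value"] == f["transform"](query_ts)]

print(prune(files, datetime(2026, 1, 2, 9, 30)))   # only the matching hourly file
```

The user's query mentions only event_ts; the per-file transform is the "hidden" part, and evaluating each file under its own spec is what makes partition evolution transparent.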

The War, and How Iceberg Won

From 2019 to 2024, the open table format market looked like a three-way race between Iceberg, Delta Lake, and Apache Hudi. Each had a major tech-company origin. Each had strong technical advocates. For a long time, it was unclear which would become the standard.

Iceberg won on two fronts: neutrality and engine support. Delta was effectively controlled by Databricks; even though it was donated to the Linux Foundation, the best Delta experience was always inside Databricks, and non-Databricks engines were second-class citizens. Hudi was technically strong but struggled to build adoption outside a few specific companies. Iceberg, by contrast, was built from day one to support multiple engines as first-class citizens — Spark, Trino, Flink, Presto, Hive, Impala, and eventually Snowflake, BigQuery, Redshift, DuckDB, ClickHouse, and StarRocks all shipped native Iceberg support.

The turning point was Snowflake's 2022 announcement that it would support Iceberg as a first-class table format. For Snowflake — the poster child of proprietary closed-format warehousing — to embrace an open format was a landmark. It signaled to every CTO and data architect that open table formats were no longer a lake-side concern; they were where the entire industry was heading.

The Tabular Drama: Databricks' $2B Defensive Play

In 2021, Ryan Blue, Daniel Weeks, and Jason Reid founded Tabular, the commercial company built around Iceberg. Tabular's pitch was simple: a managed Iceberg catalog and data platform where customers could keep their data in open format and use whichever engine they wanted — Snowflake, Databricks, Trino, anything. Tabular was, in effect, the neutral Switzerland of the lakehouse wars.

This was an existential problem for Databricks. Databricks' entire strategic position rested on Delta Lake being the dominant open format. If Iceberg became the standard and Tabular became the standard Iceberg platform, customers could trivially move between Databricks and Snowflake — and the lock-in Databricks depended on would evaporate.

In June 2024, Databricks acquired Tabular for a reported $1–2 billion (most reporting settled on roughly $2B). By any conventional metric, this was an extraordinary price for a company with minimal revenue. By the only metric that mattered — neutralizing the single biggest strategic threat to Databricks' moat — it was cheap insurance.

At the same event where the acquisition was announced (Data + AI Summit 2024), Databricks pledged to contribute to Iceberg, unify Delta and Iceberg, and invest in cross-format compatibility via UniForm (a Delta feature that writes Iceberg-readable metadata alongside Delta metadata). The messaging was "we're joining forces." The subtext was "we lost, and we're trying to own the winner."

Not to be outdone, Snowflake responded by doubling down on its own neutral Iceberg catalog, Polaris, and donating it to the Apache Software Foundation in 2024, where it entered the Apache Incubator in short order. Snowflake's pitch: if Databricks now owns the main Iceberg commercial vendor, Snowflake will be the keeper of the neutral open catalog.

The result: Iceberg is now the de facto universal table format, and the fight has moved up the stack to the catalog layer. Most major new lakehouse deployments today use Iceberg or are actively migrating toward it.

Engine Support

Iceberg is supported, natively or near-natively, by: Apache Spark, Trino, Presto, Apache Flink, Apache Hive, Apache Impala, Snowflake, Databricks (via UniForm), Google BigQuery (via BigLake), Amazon Athena / Redshift (via AWS Glue), DuckDB, ClickHouse, StarRocks, Dremio, and Doris, among others. This is by far the broadest engine support of any table format.

Where Iceberg Fits in the Stack

Iceberg sits at the table format layer, on top of Parquet (or ORC/Avro) files in object storage, below query engines and catalogs. It replaces the Hive metastore's table abstraction, though it can still use a Hive metastore, AWS Glue, Polaris, Nessie, Unity Catalog, or Tabular as its catalog. In a modern lakehouse, the canonical stack is: S3 + Parquet + Iceberg + Iceberg REST catalog (Polaris/Unity) + any number of query engines.
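
To make the stack concrete, here is a minimal, hypothetical catalog configuration in the style of pyiceberg's `.pyiceberg.yaml`, connecting a client to an Iceberg REST catalog over an S3 warehouse. The endpoint and bucket names are placeholders, not real services:

```yaml
# ~/.pyiceberg.yaml — sketch only; URI and warehouse are placeholders
catalog:
  default:
    type: rest
    uri: https://catalog.example.com      # Iceberg REST catalog endpoint
    warehouse: s3://my-bucket/warehouse   # where table data and metadata live
```

Any engine or client pointed at the same REST catalog sees the same tables, which is the whole point of keeping the catalog as a neutral, shared layer.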

Honest Take

Iceberg is the right default in 2026 for any new lakehouse, full stop. The main nuance: because Iceberg is now effectively co-controlled by Databricks (via Tabular) and challenged by Snowflake (via Polaris), the catalog layer is the new battleground. Picking Iceberg as your format is easy; picking which catalog and which governance plane to use around it is the real decision.

How TextQL Works with Apache Iceberg

TextQL Ana reads Iceberg tables through whichever engine a customer already uses — Snowflake, Databricks, Trino, Athena, or a direct Iceberg REST connection. Because Iceberg enforces schemas, tracks column IDs, and exposes rich metadata, it is an ideal substrate for LLM-driven query generation: the metadata gives the model the structural grounding it needs to produce correct SQL.

See TextQL in action

Apache Iceberg
Created 2017 at Netflix
Creators Ryan Blue, Daniel Weeks
Open-sourced 2018 (Apache incubation)
Top-level project May 2020
License Apache 2.0
Type Open table format
Commercial backer Tabular (acquired by Databricks, 2024)
Category Table Formats
Monthly mindshare ~80K · rapidly growing post-Tabular acquisition; ~10K GitHub stars; adopted by Snowflake & Databricks