Apache Hudi
Apache Hudi is the streaming-first table format born at Uber. It was the first of the three open table formats and remains technically excellent for CDC and upserts, but has been losing the format war to Apache Iceberg.
Apache Hudi (pronounced "hoodie", short for Hadoop Upserts Deletes and Incrementals) is the oldest of the three big open table formats, the streaming-first one, and the underdog in the format war. It was born at Uber in 2016 — before Iceberg, before Delta — to solve one extremely specific problem: how do you efficiently ingest a high-frequency stream of database change events (CDC) into a data lake and make them queryable in near real time?
Hudi's answer was smart, and for that specific problem it is still arguably the best of the three. The broader industry, however, has moved in a direction where Hudi's particular strengths matter less, and its commercial and community momentum has lagged. In 2026, Hudi is a respected, technically strong, but clearly losing entry in the format war.
Metaphor: if Iceberg is a meticulous librarian and Delta is the in-house clerk at a mega-vendor, Hudi is the overnight shift at a 24-hour logistics hub — built to absorb a continuous fire hose of changes and make the new state queryable as fast as possible.
Uber in 2015–2016 had a specific and painful problem. It was running hundreds of production OLTP databases, and it needed the changes from those databases reflected in its analytical data lake within minutes, not hours. The standard Hive-on-HDFS lake architecture of the time was append-only and batch-oriented. To handle updates, you had to rewrite entire partitions — often multiple gigabytes at a time — every time a single row changed. At Uber's scale, this was impossible.
Vinoth Chandar and a team at Uber built Hudi to solve this. The core innovation was a new storage strategy called Merge-on-Read (MoR): instead of immediately rewriting files on every update, Hudi writes small append logs alongside the base Parquet files and merges them at read time. Periodic compaction jobs fold the logs back into the base files in the background. The result is a table that can absorb millions of row-level updates and deletes per hour without rewriting all its data, and that stays queryable throughout.
Hudi also pioneered incremental queries — the ability to ask, "give me all the records that changed since timestamp X" — directly from the table format, without needing a separate CDC system on top.
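For example, here is a minimal PySpark sketch of an incremental pull. It assumes a Spark session with the Hudi Spark bundle on the classpath; the table path and commit timestamp are placeholders, and the option names are Hudi's standard Spark DataSource read options:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

# Placeholder path: point this at any existing Hudi table.
table_path = "s3://my-bucket/warehouse/trips_hudi"

# Pull only the records that changed after a given commit timestamp,
# straight from the table format, with no separate CDC system on top.
changes = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20260101000000")
    .load(table_path)
)
changes.show()
```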
Uber open-sourced Hudi in 2017. It entered the Apache incubator in 2019 and graduated to a top-level project in 2020. Onehouse, the commercial company around Hudi, was founded by Vinoth Chandar in 2021.
Hudi covers all four table format properties, though not evenly, with a particularly strong story on mutability:
ACID transactions. Hudi uses a timeline — an ordered log of table actions (commits, compactions, cleans, rollbacks). Writers coordinate via the timeline. Hudi supports both optimistic and (optionally) pessimistic concurrency control.
Schema evolution. Hudi supports adding, renaming, and dropping columns, with increasingly strong guarantees in recent versions. Historically this was a weaker area for Hudi compared to Iceberg, but it has improved substantially.
Time travel. Hudi's timeline lets you query the table as of any previous commit (a minimal query sketch follows this list).
Partition evolution. Hudi supports partition-level operations but does not have the clean hidden-partitioning / partition-evolution story that Iceberg does.
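A hedged time-travel sketch against the timeline, again with a placeholder path and instant; `as.of.instant` takes a Hudi commit timestamp from the timeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-time-travel").getOrCreate()

table_path = "s3://my-bucket/warehouse/trips_hudi"  # placeholder

# Query the table exactly as it stood at a past instant on the timeline.
# The instant is a Hudi commit timestamp (yyyyMMddHHmmss).
as_of = (
    spark.read.format("hudi")
    .option("as.of.instant", "20251231235959")
    .load(table_path)
)
as_of.show()
```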
The property where Hudi is genuinely best in class is record-level upserts and deletes at high frequency. Hudi supports primary keys as a first-class concept (something neither Iceberg nor Delta supports natively), which makes upsert semantics straightforward and efficient.
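To make that concrete, a minimal PySpark upsert sketch; the table name, path, and columns are hypothetical. The record key option is what gives Hudi its primary-key semantics:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

table_path = "s3://my-bucket/warehouse/trips_hudi"  # placeholder

# A batch of changed rows keyed by trip_id. On upsert, Hudi looks up each
# record by its key, updates it in place if it exists, and inserts it if not.
updates = spark.createDataFrame(
    [("trip-001", "sf", "completed", "2026-01-15 08:30:00")],
    ["trip_id", "city", "status", "updated_at"],
)

(
    updates.write.format("hudi")
    .option("hoodie.table.name", "trips_hudi")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")      # primary key
    .option("hoodie.datasource.write.partitionpath.field", "city")     # partition column
    .option("hoodie.datasource.write.precombine.field", "updated_at")  # keep latest on key collisions
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(table_path)
)
```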
Hudi exposes two table types, and the distinction is central to understanding what Hudi is for:
Copy-on-Write (CoW): When you update a row, Hudi rewrites the entire file containing that row. This is conceptually similar to how Iceberg and Delta historically handled updates. Good for read-heavy workloads where writes are less frequent.
Merge-on-Read (MoR): Updates are written to small row-based log files (in Avro) alongside the base Parquet files. Readers merge the logs with the base files on the fly. A background compaction process periodically folds the logs into new base files. This gives you much lower write latency and much higher ingestion throughput, at the cost of slightly more expensive reads (until compaction catches up).
Merge-on-Read is Hudi's defining contribution, and for high-throughput CDC workloads into a data lake it is still the best answer in the table format world.
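The table type is declared once, when the table is first written, and on a MoR table the read-side trade-off is chosen per query. A sketch with placeholder paths, using Hudi's standard Spark DataSource options:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-table-types").getOrCreate()

table_path = "s3://my-bucket/warehouse/events_hudi"  # placeholder

# The table type is fixed on first write:
#   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  # write-heavy, streaming
#   .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")  # read-heavy, batch

# On a MoR table, readers pick the trade-off per query.
# "snapshot" merges log files with base files on the fly: freshest data.
fresh = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load(table_path)
)

# "read_optimized" reads only compacted base Parquet files: cheaper reads,
# but results lag until the next compaction folds in recent log files.
fast = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(table_path)
)
```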
Hudi has real, durable technical strengths. It is not going away. But in the format war, it has lost.
Engine support stayed narrow. Iceberg and Delta aggressively built out connectors for every major query engine. Hudi's ecosystem is more Spark- and Flink-centric, with less first-class support in Trino, Snowflake, BigQuery, Redshift, and DuckDB.
Design complexity. Hudi's model has more moving parts than Iceberg's: table types, timeline services, compaction, clustering, cleaning, and index choices. This richness is powerful but creates operational complexity that newer teams find off-putting when a cleaner alternative exists.
Commercial momentum lagged. Onehouse is a real company with real customers, but it never developed the gravitational pull that Tabular briefly had for Iceberg, or that Databricks has always had for Delta. When Databricks paid $2B for Tabular in 2024, it cemented Iceberg's status as the standard. Hudi was not in that conversation.
The use case narrowed. Hudi's big differentiator was high-frequency CDC upserts into a lake. In 2026, a lot of that use case has migrated to: (a) direct replication into a warehouse (Fivetran, Estuary, Debezium-to-Snowflake), or (b) Iceberg tables with the newer row-level delete support and upsert helpers. Hudi is still arguably the fastest option for this workload, but the workload itself is less distinct than it was in 2017.
Loyal base remains. Uber still runs Hudi at enormous scale. A number of other high-throughput shops (notably in fintech and ride-sharing) still use it seriously. Onehouse continues to innovate. Hudi 1.0 was a significant release. But the default choice for a new lakehouse in 2026 is Iceberg, not Hudi.
Hudi sits at the table format layer on top of Parquet (for base files) and Avro (for log files) in object storage. It's most commonly paired with Apache Spark or Apache Flink for writes, and read from Spark, Trino, Presto, Hive, and (with varying quality) other engines. Onehouse offers a managed cloud service that abstracts most of the operational complexity.
Hudi is a technically excellent format that is on the wrong side of history. If you have a specific, high-throughput CDC-into-lake workload and your team already runs Spark or Flink heavily, Hudi is still a reasonable choice and may be the best one. For general-purpose lakehouse architectures in 2026, pick Iceberg. For anything inside the Databricks ecosystem, Delta (ideally with UniForm) is fine. Hudi's future is as a specialized streaming-oriented format, not as the universal default.
TextQL Ana reads Hudi tables through whatever query engine is already in place — typically Trino, Spark SQL, or a warehouse exposing external Hudi tables. Hudi's timeline and schema metadata provide the same structured grounding Ana uses on any other table format to generate reliable SQL from natural language.