Apache Hive

Apache Hive is the original SQL-on-Hadoop engine, built at Facebook in 2008. It made batch big data accessible to anyone who knew SQL and ruled the data lake for nearly a decade. Today its execution engine is legacy, but its metastore lives on as the de facto catalog standard.

Apache Hive is the original SQL-on-Hadoop engine. It was built at Facebook in 2008 to solve a very specific problem: Facebook had a fast-growing Hadoop cluster full of useful data, but the only way to query it was to write Java MapReduce jobs by hand. Most of the people who wanted to ask questions of that data — analysts, product managers, business intelligence teams — did not know how to write MapReduce. They knew SQL. Hive translated SQL queries into MapReduce jobs automatically, and in doing so it democratized big data analytics overnight.

For roughly the period 2010 to 2017, Hive was the SQL engine on the data lake. If you were doing analytical SQL on a Hadoop cluster, you were almost certainly using Hive. Today it is legacy. But the Hive Metastore — a side component of the original project — has had an unexpected second life as the de facto catalog standard for the entire open-source big data ecosystem.

Origin Story

Hive was started at Facebook in 2007-2008 by Joydeep Sen Sarma, Ashish Thusoo, and a small team. Facebook was running on a Hadoop cluster that had grown from gigabytes to petabytes in a couple of years, and the existing tooling was MapReduce in Java — powerful but inaccessible. The team wanted to give Facebook's analysts a way to query Hadoop data that felt like the SQL they already knew.

The result was Hive. The user wrote SQL ("HiveQL"), Hive parsed and planned it, and then it generated a sequence of MapReduce jobs that actually executed the work. A query might take minutes just to start (MapReduce had to spin up JVMs and stage data through HDFS between every stage) and an hour to finish, but you got an answer in SQL on petabytes of data, which had been functionally impossible six months earlier.
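The translation can be sketched in miniature. This toy Python example (not Hive's actual planner; the table and column names are invented) shows how a HiveQL aggregation like `SELECT page, COUNT(*) FROM visits GROUP BY page` decomposes into the map, shuffle, and reduce phases that Hive emitted as Hadoop jobs:

```python
from collections import defaultdict

def map_phase(rows):
    # Map: emit (group key, 1) for every input row.
    return [(row["page"], 1) for row in rows]

def shuffle(pairs):
    # Shuffle: group all values by key. Hadoop did this by sorting
    # and spilling intermediate data through HDFS between stages,
    # which is where much of the latency came from.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's values, i.e. COUNT(*).
    return {key: sum(values) for key, values in grouped.items()}

rows = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)  # {'/home': 2, '/about': 1}
```

Multi-stage queries (joins, nested aggregations) compiled into chains of these jobs, each landing its output to disk before the next began.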

Facebook open-sourced Hive in 2008, and it became an Apache top-level project in 2010. By 2012-2013, every Hadoop distribution — Cloudera, Hortonworks, MapR — shipped Hive as the default SQL layer. It became the universal language of the Hadoop world. Tools like Apache Sqoop loaded data into Hive tables, BI tools connected to it via JDBC, ETL pipelines wrote to it, dashboards read from it. Hive was the front door to the data lake.

Why Hive Is Legacy Now

Three things happened, more or less in parallel, that turned Hive from the dominant choice into a legacy one.

1. MapReduce was too slow. Hive's original execution engine was Hadoop MapReduce, and MapReduce was designed for fault-tolerant batch jobs that could afford to land intermediate data to disk between every stage. For a 12-hour ETL job, that's fine. For an interactive analyst who wants an answer in 10 seconds, it is unusably slow. Hive got faster over time — it added Tez as a more efficient execution engine, then LLAP for long-lived in-memory daemons, and people also wired Hive to run on Spark — but the architecture was always playing catch-up.

2. Presto, Impala, and Spark SQL showed there was a better way. Starting around 2012, a new generation of MPP SQL engines arrived: Presto at Facebook, Impala at Cloudera, and Spark SQL at Databricks. All of them shared the same basic architectural insight: long-lived worker processes, in-memory pipelined execution, and no MapReduce in the critical path. They were 10x to 100x faster than Hive on interactive queries. Once the new engines were stable, there was very little reason to keep using Hive's execution layer.

3. Hadoop itself fell out of favor. The whole on-premise Hadoop ecosystem — where Hive made the most sense — entered a long decline starting around 2017 as enterprises moved to cloud object storage (S3, ADLS, GCS) and cloud warehouses (Snowflake, BigQuery, Redshift). Cloudera and Hortonworks merged in 2019 in a defensive move; MapR was acquired and effectively wound down. Hive's natural habitat shrank.

By the early 2020s, choosing Hive's execution engine for a new analytical workload made no sense. There were better options for every scenario.

The Metastore Lives On

Here is the twist that makes Hive's story unusual: while Hive's execution engine became legacy, the Hive Metastore, a small component the team built almost as an afterthought, became one of the most widely used pieces of infrastructure in the open-source data ecosystem.

The Hive Metastore is a service that stores metadata about tables: their names, schemas, partition layouts, file locations, and types. When Presto or Spark or Trino wants to know "where are the files for this table and what columns does it have," they ask the Hive Metastore. The Metastore turned out to be useful far beyond Hive itself, and over the next decade essentially every other engine in the lake ecosystem learned to read and write it. Today, when you point Trino, Spark, Dremio, or even some commercial warehouses at a data lake, they typically discover the tables through a Hive Metastore (or a Metastore-compatible service like AWS Glue or Databricks Unity Catalog in Hive-compatible mode).
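A toy model of that lookup (the table name, paths, and `locate` helper are invented for illustration; the real Metastore is a Thrift service backed by a relational database) shows what an engine gets back before it plans a scan:

```python
# Illustrative sketch of Metastore contents: per-table schema,
# partition keys, and the file locations each partition maps to.
metastore = {
    "sales.orders": {
        "columns": [("order_id", "bigint"), ("amount", "double"), ("ds", "string")],
        "partition_keys": ["ds"],
        "partitions": {
            "ds=2026-01-01": "s3://lake/sales/orders/ds=2026-01-01/",
            "ds=2026-01-02": "s3://lake/sales/orders/ds=2026-01-02/",
        },
    }
}

def locate(table, partition_filter=None):
    """Return the file locations an engine should scan for a table,
    pruning partitions when a filter substring is given."""
    parts = metastore[table]["partitions"]
    return [loc for part, loc in parts.items()
            if partition_filter is None or partition_filter in part]

print(locate("sales.orders", "2026-01-02"))
# ['s3://lake/sales/orders/ds=2026-01-02/']
```

This is why the Metastore outlived Hive's execution engine: any engine that can answer "which files, which columns, which partitions" from this metadata can query the lake, regardless of who wrote the data.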

In other words: the most important legacy of Hive is its catalog, not its query engine. When people in 2026 say "we use Hive," there's a good chance they actually mean "we run a Hive Metastore that other engines query against."

The newer table formats (Iceberg, Delta Lake, Hudi) were designed in part to fix the limitations of Hive-style tables and the Metastore's bookkeeping around them: no ACID transactions, no schema evolution, no time travel, and a weak partitioning model. The next-generation catalogs (Unity Catalog, Polaris, Nessie, Gravitino) are gradually taking over from the Hive Metastore, but it remains extremely common in production.

When You'd Still Use Hive

Honest answer: you should not start new workloads on Hive in 2026. The reasons you might still encounter it are:

  • You inherited a Hadoop estate and the Hive jobs work well enough that no one has bothered to migrate them.
  • You're running a Hive Metastore as a catalog and you call that "Hive" but you query through Spark, Trino, or Presto.
  • You're using a Cloudera, Hortonworks (now Cloudera), or EMR distribution that ships Hive as part of the bundle.

For new analytical SQL on the lake, the choices are Trino, Dremio, DuckDB, Spark SQL, or a cloud warehouse — not Hive's execution engine.

TextQL Fit

TextQL connects to Hive via JDBC. In practice, the more useful integration for organizations on the Hive side of the world is to put Trino or Spark in front of the same Hive Metastore-managed tables and connect TextQL to that, getting modern execution speed while reusing the existing catalog and security model.
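As a concrete illustration of that pattern, pointing Trino at an existing Hive Metastore takes only a small catalog file (the catalog name, host, and port below are placeholders for your environment):

```properties
# etc/catalog/lake.properties — a Trino catalog backed by the
# existing Hive Metastore; tables stay exactly where they are.
connector.name=hive
hive.metastore.uri=thrift://metastore.internal:9083
```

Existing tables then appear under the `lake` catalog in Trino with no data migration, and TextQL (or any JDBC client) connects to Trino instead of HiveServer2.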

Apache Hive
Created 2008 (Facebook); Apache top-level 2010
Origin Facebook
Original creators Joydeep Sen Sarma, Ashish Thusoo, and team
License Apache 2.0
Query language HiveQL (SQL-like)
Execution engine Hadoop MapReduce (original); later Tez and Spark
Category Query Engines
Monthly mindshare ~80K · ~5K GitHub stars; legacy SQL-on-Hadoop; entrenched in old Hadoop shops