Unity Catalog
Unity Catalog is Databricks' unified governance layer for data and AI assets — the lakehouse's answer to the question of who can see what, and the foundation of cross-platform data sharing.
Unity Catalog is the layer in Databricks that knows what data exists, who is allowed to see it, where it came from, and where it went. It's the metastore (a catalog of every table, view, volume, and ML model), the access control system (row/column-level permissions, role-based access), the audit log (every query and access event), the lineage graph (which jobs read which tables and produced which downstream tables), and the cross-account sharing layer (Delta Sharing) — all in one product.
The simple metaphor: Unity Catalog is the building security desk for the lakehouse. It keeps the directory of every room (table) in the building, issues badges (permissions), logs every entry (audit), tracks where people went (lineage), and runs the visitor program for partner companies (sharing). Without it, the lakehouse is a building with no front desk.
Unity Catalog was announced at Data + AI Summit in May 2021 and reached general availability in August 2022. It was open-sourced under the Apache 2.0 license in June 2024 — a notable strategic move that turned it from a Databricks-exclusive product into a candidate for cross-vendor adoption.
The reason Unity Catalog exists is, again, the structural problem the lakehouse had to solve to compete with traditional warehouses. Snowflake had governance built in from day one. A single account, a single permission model, row and column policies, audit logs — all native, all consistent. Databricks did not. Pre-Unity Catalog, every Databricks workspace had its own Hive metastore, permissions were enforced at the workspace level, and there was no consistent way to grant a user access to a table in workspace A without giving them an account in workspace A. For an enterprise with dozens of workspaces (a normal state for large customers), governance was a sprawling mess of duplicated permissions, inconsistent access models, and audit gaps.
This was a real selling problem. Enterprise security teams looked at Databricks and said "we can't approve this; the governance story is too weak." Snowflake won deals on this dimension alone. Unity Catalog was the engineering response: one metastore per region, spanning all workspaces, with a consistent ANSI SQL permission model (GRANT SELECT ON TABLE ... TO ...) and a unified audit log. It was not glamorous work, but it removed an objection that had been costing Databricks deals for years.
The 2024 open-sourcing was a separate strategic move worth understanding: by making Unity Catalog open source, Databricks invited Snowflake, AWS, Google, and others to interoperate with it. The implicit pitch is "make Unity Catalog the universal metadata layer for the open lakehouse," which is both a defensive move (preventing a competitor's catalog from becoming the standard) and an offensive one (drawing the standardization gravity toward Databricks' design).
Unity Catalog organizes data in a three-level namespace: catalog → schema → object. An object can be a table, a view, a volume (a managed location for unstructured files), a function, a model, or an ML feature table. The full name of a table looks like main.sales.orders — catalog, schema, table.
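In Databricks SQL, the three levels map directly onto DDL. A minimal sketch (the catalog, schema, and table names here are illustrative):

```sql
-- Create the hierarchy: catalog -> schema -> table
CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.sales;
CREATE TABLE IF NOT EXISTS main.sales.orders (
  order_id BIGINT,
  region   STRING,
  amount   DECIMAL(10, 2)
);

-- Every object is addressable by its full three-part name
SELECT region, SUM(amount) FROM main.sales.orders GROUP BY region;
```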
The major capabilities:
Permissions. Standard SQL GRANT / REVOKE semantics on every object, with hierarchical inheritance (grant on a catalog cascades to all schemas and tables inside it). Permissions support row-level security via row filters and column-level security via column masks, defined as SQL functions.
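A sketch of what this looks like in Databricks SQL (the group, function, and table names are made up for illustration):

```sql
-- Hierarchical grant: SELECT on the catalog cascades to every schema and table in it
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT SELECT ON CATALOG main TO `analysts`;

-- Row-level security: a SQL function applied as a row filter
CREATE OR REPLACE FUNCTION main.sales.us_only(region STRING)
  RETURN IS_ACCOUNT_GROUP_MEMBER('admins') OR region = 'US';
ALTER TABLE main.sales.orders SET ROW FILTER main.sales.us_only ON (region);

-- Column-level security: a SQL function applied as a column mask
CREATE OR REPLACE FUNCTION main.sales.mask_email(email STRING)
  RETURN CASE WHEN IS_ACCOUNT_GROUP_MEMBER('support') THEN email ELSE '***' END;
ALTER TABLE main.sales.customers ALTER COLUMN email SET MASK main.sales.mask_email;
```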
Lineage. Every query and job that reads or writes a table is captured automatically, building a lineage graph from raw sources through every transformation to final BI dashboards and ML models. This is captured at the column level, not just the table level, and surfaced in the Databricks UI as well as via API. For data teams trying to answer "what would break if I drop this column," Unity Catalog lineage is the answer.
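Beyond the UI, lineage is queryable from system tables (assuming system tables are enabled on the account; table and column names below are illustrative):

```sql
-- Upstream: which tables feed main.sales.orders?
SELECT DISTINCT source_table_full_name
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.sales.orders';

-- Column level: what would break if we dropped orders.region?
SELECT DISTINCT target_table_full_name, target_column_name
FROM system.access.column_lineage
WHERE source_table_full_name = 'main.sales.orders'
  AND source_column_name = 'region';
```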
Audit logs. Every access, every query, every grant change is logged. The audit log is the primary source for security teams investigating incidents and for compliance reports.
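Audit events land in a system table as well; as a rough sketch (the exact request-parameter key is an assumption, check the audit log schema for your workspace):

```sql
-- Who has touched this table recently?
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND request_params['full_name_arg'] = 'main.sales.orders'  -- assumed key name
ORDER BY event_time DESC
LIMIT 100;
```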
Volumes. A first-class abstraction for unstructured files (images, PDFs, audio) governed by the same permission model as tables. This is how Databricks supports ML and unstructured data workflows under unified governance.
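Volumes use the same SQL surface as tables; a sketch with an invented volume name:

```sql
-- A volume is a governed location for files, addressed under /Volumes/...
CREATE VOLUME IF NOT EXISTS main.sales.contracts;

-- File access is granted with the same GRANT semantics as tables
GRANT READ VOLUME ON VOLUME main.sales.contracts TO `legal`;

-- Files inside are then readable at the volume's path
LIST '/Volumes/main/sales/contracts/';
```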
Delta Sharing. An open protocol (also open-sourced by Databricks, in 2021) for sharing live tables across accounts, clouds, and even across vendors. A Databricks account can share a Delta table with a customer who reads it from Snowflake, BigQuery, or pandas, without copying data. This is the direct competitor to Snowflake's Secure Data Sharing, with the key strategic difference that Delta Sharing is open and cross-platform by design.
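On the provider side, a share is just another governed object; a sketch with made-up share and recipient names:

```sql
-- Bundle tables into a share and grant it to an external recipient
CREATE SHARE IF NOT EXISTS sales_share;
ALTER SHARE sales_share ADD TABLE main.sales.orders;

CREATE RECIPIENT IF NOT EXISTS partner_co;
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_co;
```

The recipient then reads the live table through any Delta Sharing client — another Databricks account, or the open-source connector from pandas or Spark — without the data ever being copied.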
Federation. Unity Catalog can also act as a metadata layer over external systems — registering tables in Snowflake, MySQL, Postgres, Redshift, and BigQuery and applying its governance and lineage on top. This is the move from "Databricks' catalog" to "the catalog of your whole data estate," and it's where Unity Catalog gets most ambitious.
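Federation is configured in two steps — a connection, then a foreign catalog mounted on it. A sketch, assuming a Postgres source (host, secret scope, and database names are placeholders):

```sql
-- Register credentials for an external Postgres database
CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
  OPTIONS (
    host 'db.example.com',
    port '5432',
    user 'svc_databricks',
    password secret('my_scope', 'pg_password')
  );

-- Mount it as a foreign catalog in the same three-level namespace
CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales USING CONNECTION pg_conn
  OPTIONS (database 'analytics');

-- External tables are now queryable, and governable, like native ones
SELECT * FROM pg_sales.public.orders LIMIT 10;
```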
Unity Catalog is the unsexy infrastructure that makes the lakehouse credible to enterprise buyers. The engineering itself is solid but not surprising — it's the kind of work that has to exist for any platform that wants to be the system of record for an enterprise's data. The interesting move is the open-sourcing in 2024. By releasing Unity Catalog under the Apache 2.0 license and donating it to the Linux Foundation's LF AI & Data, Databricks made a play to become the standard metadata layer for the open lakehouse era, the same way Delta Lake (and Iceberg) became the standard table formats. If that play works, Unity Catalog ends up running governance for data that lives in Snowflake, BigQuery, and Redshift, not just Databricks — a position with enormous strategic value.
The honest comparison to Snowflake: Snowflake's governance is more polished and turnkey within Snowflake, but it ends at the Snowflake account boundary. Unity Catalog is messier in places but architecturally designed to span platforms, clouds, and vendors. Which one is "better" depends on whether you believe the future is "everything in one warehouse" (Snowflake's bet) or "open formats across many engines" (Databricks' bet). The market is increasingly siding with the second view, and Unity Catalog is positioned for that world.
The convergence story applies here too. Snowflake's Horizon governance layer and its Polaris Catalog (open-sourced in 2024 and since donated to the Apache Software Foundation) are the mirror of Unity Catalog, each company building an open governance layer designed to host the other's data. That two ostensibly competitive companies are racing to open-source their catalogs tells you everything about where the industry is heading: the catalog is becoming the new battleground, because that's where lock-in actually lives in an open-format world.
Unity Catalog is one of the most useful metadata sources TextQL can connect to. TextQL Ana reads Unity Catalog table descriptions, column comments, tags, and lineage to ground its SQL generation in the customer's actual semantics — not just schema names and types, but the meaning of each column and the relationships between tables. For Databricks customers, a well-tagged Unity Catalog deployment is the single biggest lever for improving AI analyst accuracy, because the same metadata that helps a human analyst understand the warehouse is exactly what the LLM needs.