Data Catalog & Discovery
Data catalogs are the inventory and search layer for an organization's data. They answer the question: what data do we have, where did it come from, who owns it, and can I trust it?
If the data warehouse is the library, the data catalog is the card catalog, the Dewey Decimal system, the librarian's Rolodex, and the "most requested books" list, merged into one product: an inventory of every data asset, plus the context needed to find it, understand it, and trust it.
The core question a catalog is built to answer is embarrassingly simple: "we have a table called fct_orders_v3_final. What is it, where did it come from, who owns it, and can I trust it?" At a small startup this is a Slack message. At a 5,000-person enterprise with 40,000 tables across Snowflake, Redshift, Postgres, S3, Salesforce, and a dying Oracle instance, it is a full-time job for a team of people — which is exactly why catalogs exist as a product category.
Think of a catalog as four products fused into one: (1) an automated inventory that crawls your sources and lists every table, column, dashboard, and pipeline; (2) a lineage graph showing how data flows from raw ingestion through transformations to dashboards; (3) a search engine so humans can find the right asset by typing "revenue"; and (4) a governance workflow layer for owners, certifications, glossaries, tags, and approvals. Every vendor in this category is really just balancing these four ingredients differently.
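Those four ingredients can be sketched as a toy in-memory catalog. This is illustrative only — `CatalogAsset`, `Catalog`, and everything else here are made-up names for the purpose of the sketch, not any vendor's API — but it shows how inventory, lineage, search, and governance hang off the same asset records.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogAsset:
    name: str                    # e.g. "fct_orders" or "dashboard_revenue"
    kind: str                    # "table", "dashboard", "pipeline", ...
    description: str = ""
    owner: str = ""              # governance: who is accountable
    tags: set = field(default_factory=set)  # "certified", "pii", ...

class Catalog:
    def __init__(self):
        self.assets = {}         # (1) inventory: name -> asset
        self.lineage = []        # (2) lineage: (upstream, downstream) edges

    def register(self, asset, upstreams=()):
        self.assets[asset.name] = asset
        self.lineage += [(u, asset.name) for u in upstreams]

    def search(self, term):      # (3) search: naive keyword match
        term = term.lower()
        return [a.name for a in self.assets.values()
                if term in a.name.lower() or term in a.description.lower()]

    def certify(self, name, owner):  # (4) governance workflow
        self.assets[name].owner = owner
        self.assets[name].tags.add("certified")

catalog = Catalog()
catalog.register(CatalogAsset("fct_orders", "table", "Order facts, one row per order"))
catalog.register(CatalogAsset("dashboard_revenue", "dashboard", "Monthly revenue"),
                 upstreams=["fct_orders"])
catalog.certify("fct_orders", owner="data-platform@example.com")

print(catalog.search("revenue"))  # finds the dashboard by name
```

A real product replaces each method with serious machinery — crawlers for `register`, SQL parsing for `lineage`, ranked full-text search, approval workflows for `certify` — but the shape of the data is roughly this.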
Catalogs are not a nice-to-have invented by vendors. They exist because modern data stacks have a discovery crisis. A decade ago a company might have had one warehouse and a hundred tables; today it has multiple warehouses, a lake, a lakehouse, streaming topics, SaaS sources, operational stores, and dbt models stacked on top of each other. The number of data assets grew by 100x but the number of people who understand any one of them did not.
The result: analysts recreate the same metrics three different ways, nobody knows which of four customer tables is the real one, deprecated pipelines keep running because nobody remembers what depends on them, and new hires take six months to figure out where anything lives. A catalog does not fix the underlying sprawl — nothing does — but it makes the sprawl legible.
There is also a regulatory reason. GDPR, CCPA, HIPAA, SOX, and the EU AI Act all require organizations to know where personal or sensitive data lives, who touched it, and how it was used. A catalog is often the fastest path to "yes, we can prove that."
The catalog category has lived through two clearly distinct eras, and understanding that split is the most important context for evaluating vendors.
First generation (2010s): enterprise governance suites. Collibra (2008), Alation (2012), Informatica Enterprise Data Catalog, and IBM InfoSphere defined the category. They were sold top-down to Chief Data Officers, deployed on-prem or in single-tenant cloud, priced in the high six figures, and shaped around the needs of compliance teams. Their UIs looked like SAP. Their lineage was mostly manual or imported from ETL tools. They were very good at what they were designed for (regulated industries with heavy governance mandates) and bad at what modern data teams actually want (a fast, pretty tool that analysts will voluntarily open).
Second generation (2020s): data-team-native catalogs. Atlan (2019), DataHub (open-sourced by LinkedIn in 2019, commercialized by Acryl Data in 2020), Select Star (2020), Secoda, OpenMetadata, and Castor were built for a different buyer: the head of data or analytics engineering lead at a cloud-native company. They assume your warehouse is Snowflake or BigQuery, your transformation layer is dbt, your BI tool is Looker or Tableau, and you want column-level lineage out of the box. They look like Linear or Notion. They deploy in hours. They cost tens of thousands, not millions.
The convergence. By 2025 the two generations are colliding. Modern catalogs are adding the governance workflows and policy engines that Collibra built; legacy catalogs are frantically rewriting their UIs and adding column-level lineage. Everyone is now shipping an "AI copilot" that summarizes tables and answers natural-language questions. The technical gap is narrowing, but the brand and buyer gap is not: Collibra still wins the regulated-bank RFP; Atlan still wins the Series-C startup.
If there is one thing that separates modern catalogs from legacy ones and from each other, it is column-level lineage. Table-level lineage tells you that fct_orders feeds dashboard_revenue. Column-level lineage tells you that fct_orders.discount_amount feeds dashboard_revenue.net_revenue through three specific dbt models and a CASE statement. The second is dramatically more useful: it lets you do real impact analysis before you drop a column, answer "where does this PII actually end up?" during an audit, and debug silent metric changes.
Column-level lineage is computationally hard (it requires parsing every SQL statement in your warehouse history), and it has become the benchmark every vendor now chases. Modern catalogs — Atlan, DataHub, Select Star, Castor, OpenMetadata — all ship it. Alation and Collibra have added it in recent releases, but their implementations are generally considered shallower. Amundsen never shipped it.
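Once the column-to-column edges exist, impact analysis itself is just graph traversal. A minimal sketch, assuming the edges have already been extracted by parsing warehouse SQL (the hard part); the table and column names below are a toy graph mirroring the `discount_amount` → `net_revenue` example, not real output from any catalog:

```python
from collections import defaultdict, deque

# Toy column-level lineage edges; in a real catalog these come from SQL parsing.
edges = [
    ("fct_orders.discount_amount", "stg_orders_enriched.discount_amount"),
    ("stg_orders_enriched.discount_amount", "int_revenue.net_line_amount"),
    ("int_revenue.net_line_amount", "dashboard_revenue.net_revenue"),
    ("fct_orders.order_id", "int_revenue.order_id"),  # unaffected branch
]

downstream = defaultdict(list)
for src, dst in edges:
    downstream[src].append(dst)

def impact(column):
    """Every column that would break if `column` were dropped or changed."""
    affected, queue = set(), deque([column])
    while queue:
        for nxt in downstream[queue.popleft()]:
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return sorted(affected)

print(impact("fct_orders.discount_amount"))
```

The same traversal run in reverse (flip the edges) answers the audit question: "which upstream sources, including any PII, end up in this dashboard field?"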
The structural force behind the whole landscape: the buyer has shifted from the CDO to the Head of Data or Platform, and products built for the old buyer feel wrong to the new one.
A catalog is the closest thing most organizations have to a single source of truth about what data exists and what it means. That metadata — table descriptions, column-level definitions, lineage, certifications, owners — is exactly the context an LLM needs to generate correct SQL. TextQL Ana integrates with Atlan, DataHub, Collibra, Alation, and Select Star, pulling certified metrics, glossary terms, and lineage into the prompt so that generated queries respect governance decisions the organization has already made. In practice, customers with a well-maintained catalog get noticeably better answers from TextQL than those without one — the catalog's metadata is the semantic layer the AI actually reads.
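Conceptually, "pulling catalog metadata into the prompt" means serializing the governed subset of assets into text the model reads before generating SQL. The sketch below is illustrative only — the `metadata` dictionary and `prompt_context` function are invented for this example and are not TextQL's actual integration — but it shows why certifications matter: they let the context builder steer the model toward blessed tables and away from raw ones.

```python
# Hypothetical catalog metadata for two tables, one certified and one raw.
metadata = {
    "fct_orders": {
        "description": "One row per order; certified source for revenue metrics.",
        "certified": True,
        "columns": {
            "order_id": "Primary key",
            "net_revenue": "Gross revenue minus discounts and refunds",
        },
    },
    "orders_raw": {
        "description": "Raw ingestion table; do not query directly.",
        "certified": False,
        "columns": {},
    },
}

def prompt_context(tables):
    """Render only certified assets into a context block for the LLM prompt."""
    lines = []
    for name, meta in tables.items():
        if not meta["certified"]:
            continue  # governance decision: keep raw tables out of the prompt
        lines.append(f"TABLE {name}: {meta['description']}")
        for col, desc in meta["columns"].items():
            lines.append(f"  - {col}: {desc}")
    return "\n".join(lines)

print(prompt_context(metadata))
```

The better the descriptions, definitions, and certifications in the catalog, the better this context block — which is the mechanism behind the observation that well-cataloged customers get better generated SQL.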