Databricks
Databricks — the lakehouse company founded by the creators of Apache Spark. Started as a notebook platform for data science, now one of the two dominant enterprise data platforms alongside Snowflake.
Databricks is what you get when the people who invented Apache Spark start a company and then spend ten years convincing the enterprise that the data warehouse is not the right shape for modern data. The short pitch: Databricks is a lakehouse — a platform that gives you warehouse-style SQL performance and governance on top of open file formats sitting in your own cloud object store. The longer pitch is that Databricks is trying to be the single platform for every kind of data workload a large enterprise has: SQL analytics, machine learning, data science, streaming, and AI model training.
If Snowflake's metaphor is "a database you don't manage," Databricks's metaphor is "a giant computer you can use for anything data-related" — batch ETL, Python notebooks, production ML pipelines, SQL dashboards, Spark jobs, fine-tuning an LLM. All of it, on all your data, in one place.
This flexibility is Databricks's greatest strength and its oldest PR problem. Everything is possible, but nothing is as simple as it is in Snowflake — at least not until recently.
Databricks was founded in 2013 by seven people from UC Berkeley's AMPLab, the same research group that had already produced Apache Mesos, the cluster manager that prefigured the Kubernetes era. The founding team included Matei Zaharia (the creator of Spark), Ali Ghodsi, and Ion Stoica, among others.
The founding thesis was that Hadoop was too hard. In 2013, doing "big data" meant running a Hadoop cluster with HDFS, Hive, Pig, and half a dozen other Apache projects duct-taped together. Spark was faster (10–100x for iterative workloads, thanks to in-memory execution) and had much nicer developer ergonomics (DataFrames, Python and Scala APIs). Databricks started life as "managed Spark in a notebook" — a cleaner, hosted way to run Spark on your data in S3 without the Hadoop operational tax.
For the first four or five years, Databricks was primarily a data science and ML tool. Data scientists loved it; data warehouse buyers mostly ignored it. The pivot that made Databricks an enterprise data platform happened in 2018–2020 with two moves: Delta Lake and the lakehouse thesis.
In 2019, Databricks open-sourced Delta Lake, a table format that adds ACID transactions, schema enforcement, and time travel to Parquet files sitting in cloud object storage. In 2020, Zaharia and co-authors published a paper titled "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics," making the argument that the warehouse/lake split was an accident of history, not an architectural necessity.
The thesis, paraphrased: Warehouses are expensive, closed, and bad at ML/unstructured data. Lakes are cheap and open but bad at reliability and SQL. A lakehouse fuses the two — open file formats in object storage, plus a transactional layer (Delta Lake, later also Iceberg), plus a fast SQL engine (Photon), plus governance (Unity Catalog). You get warehouse-grade SQL and lake-grade flexibility in the same system, on the same copy of data, with no proprietary storage format.
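The transactional layer is the piece that makes the thesis work, and it is conceptually simple. The sketch below is a toy model of a Delta-style transaction log, not Delta Lake's actual protocol: plain data files (standing in for Parquet) stay immutable, and an ordered log of commit records decides which files are "live" at each table version. Replaying the log up to an older version is exactly what Delta calls time travel.

```python
# Toy model of a Delta-style transaction log: immutable data files plus an
# ordered commit log (like the JSON files under _delta_log/). Illustrative
# only -- real Delta Lake's protocol has many more record types.
class ToyDeltaTable:
    def __init__(self):
        self.files = {}  # "object store": path -> rows (stands in for Parquet)
        self.log = []    # ordered commit records

    def commit(self, adds, removes=()):
        """Atomically publish a new table version: one log entry per commit."""
        version = len(self.log)
        self.log.append({"version": version,
                         "add": list(adds), "remove": list(removes)})
        return version

    def write(self, path, rows):
        self.files[path] = rows          # stage the data file first...
        return self.commit(adds=[path])  # ...the commit makes it visible

    def snapshot(self, version=None):
        """Replay the log up to `version` -- this is time travel."""
        live = set()
        upto = len(self.log) if version is None else version + 1
        for entry in self.log[:upto]:
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return [row for path in sorted(live) for row in self.files[path]]

t = ToyDeltaTable()
t.write("part-0.parquet", [{"id": 1}])
t.write("part-1.parquet", [{"id": 2}])
t.commit(adds=[], removes=["part-0.parquet"])  # a delete is just a new version
print(t.snapshot())           # current version sees only id 2
print(t.snapshot(version=1))  # time travel to version 1 sees ids 1 and 2
```

Note that the "delete" never touches the data file; it only writes a log entry. That is why old versions stay readable, and why warehouse-style ACID semantics can sit on top of dumb object storage.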
This was a direct shot at Snowflake, and it worked. "Lakehouse" went from a Databricks marketing term in 2020 to an industry-standard category by 2023. Every major vendor now claims to be lakehouse-compatible, even the ones who spent years calling it a made-up word.
A Databricks workspace has four layers worth understanding.
1. Open storage (your cloud object store). Data lives as Parquet files in S3, ADLS, or GCS, organized as Delta Lake tables (or increasingly Iceberg tables). The storage is in your own cloud account — Databricks does not host your data. This is a huge philosophical and commercial difference from Snowflake.
2. Compute clusters. You run workloads on ephemeral Spark clusters. Historically these were general-purpose Spark clusters; today compute comes in several flavors: all-purpose clusters for interactive work, job clusters spun up per run by Workflows, SQL warehouses for Databricks SQL, and serverless compute where Databricks manages the machines for you.
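For a sense of what "ephemeral cluster" means in practice, here is a minimal job-cluster spec of the kind passed to Databricks's Jobs/Clusters REST API. The field names follow the public API, but the specific runtime version and node type are placeholders, not recommendations:

```json
{
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": { "min_workers": 1, "max_workers": 4 }
  }
}
```

The cluster described here exists only for the duration of the job run, which is the cost model's core idea: storage is always-on and cheap, compute is transient and billed while it lives.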
3. Unity Catalog. A unified governance layer over all tables, files, models, and notebooks. Provides access control, lineage, auditing, and cross-workspace sharing. Unity Catalog was Databricks's answer to the critique that lakes had no real governance; launched in 2022, it's now the center of gravity of the platform.
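Unity Catalog's core abstraction is a three-level namespace — catalog.schema.table — with privileges granted at any level and inherited downward. The sketch below is an illustrative model of that inheritance rule in plain Python, not Unity Catalog's actual implementation; the principal names and grants are hypothetical.

```python
# Illustrative model of Unity Catalog-style inheritance: a grant on a catalog
# or schema flows down to everything inside it. Grants below are hypothetical.
grants = {
    ("analyst", "main"): {"USE CATALOG"},
    ("analyst", "main.sales"): {"USE SCHEMA", "SELECT"},
    ("etl_job", "main"): {"USE CATALOG", "USE SCHEMA", "SELECT", "MODIFY"},
}

def privileges(principal, full_name):
    """Union the grants on the object and on every enclosing level."""
    parts = full_name.split(".")
    privs = set()
    for i in range(1, len(parts) + 1):
        privs |= grants.get((principal, ".".join(parts[:i])), set())
    return privs

def can_select(principal, table):
    # Reading a table requires USE on the catalog and schema plus SELECT.
    return {"USE CATALOG", "USE SCHEMA", "SELECT"} <= privileges(principal, table)

print(can_select("analyst", "main.sales.orders"))    # True: schema-level grant
print(can_select("analyst", "main.finance.ledger"))  # False: no grant on finance
```

This is why Unity Catalog became the platform's center of gravity: one grant model covers tables, files, and models across workspaces, instead of per-engine ACLs.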
4. Workloads on top. MLflow (the de facto open-source ML experiment tracker, originated at Databricks), Databricks SQL, Databricks Workflows (orchestration), Model Serving, Mosaic AI (post-MosaicML acquisition, 2023, $1.3B), Genie and AI/BI Dashboards, Databricks Apps, Delta Live Tables for declarative pipelines, and Databricks Connect.
The mental model: one platform where a data engineer can write a dbt job, a data scientist can train an XGBoost model on the same table, and a BI analyst can hit it from a dashboard, all without copying the data anywhere.
Databricks's official pitch is that the world is moving from warehouses to lakehouses, and every other vendor is either catching up to that (Snowflake adding Iceberg) or irrelevant (legacy warehouses). They're explicit about their rivals, with Snowflake as the one that matters.
The Tabular acquisition (June 2024, ~$1–2B) was particularly notable: it brought the creators of Apache Iceberg in-house, signaling that Databricks is committing to Iceberg as a first-class format alongside Delta Lake, and trying to become the dominant lakehouse across both table formats. The open-source governance of Iceberg and Delta is converging under Databricks's influence.
Where Databricks's pitch is self-serving: the operational complexity is real. Running Databricks well requires genuine engineering skill. The "one platform for everything" pitch assumes you have the team to use it. Small companies often find Snowflake dramatically simpler to adopt.
Good at:
- ML, data science, and AI model training on the same data that serves analytics
- Large-scale Spark-based data engineering, batch and streaming alike
- Keeping data in open formats (Delta, Iceberg) in your own object store, with no proprietary storage lock-in

Bad at (or honest weaknesses):
- Simplicity: running the platform well takes genuine engineering skill
- Small-team adoption, where Snowflake is often dramatically easier to stand up
Databricks's strategy for the next phase is legible and ambitious: own both major open table formats and become the default data-plus-AI platform for the enterprise.
The honest view: Databricks and Snowflake are converging on the same product — a unified data + AI platform with open-format storage, SQL and Python workloads, governance, and AI primitives. The interesting question of 2026–2028 is whether enterprises pick one, run both, or whether a smaller open-source-native alternative disrupts them both from below.
TextQL Ana connects to Databricks via the SQL Warehouse endpoint and Unity Catalog, respecting table, row, and column-level access policies defined in Unity Catalog. Because Databricks holds so many different data shapes in one place — warehouse tables, Delta tables, streaming tables, and ML features — TextQL can reason across a broader surface than on a pure SQL warehouse. For customers running Databricks as their central platform, TextQL inherits the full Unity Catalog lineage and permission model, meaning business users get natural-language analytics with the same governance that covers data engineers and data scientists.
See TextQL in action