NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →


Databricks Photon

Photon is Databricks' from-scratch C++ vectorized execution engine for SQL and DataFrame workloads. It runs underneath the JVM-based Spark API and dramatically accelerates Databricks SQL workloads. It is also completely locked into the Databricks platform.

Photon is Databricks' next-generation vectorized query execution engine, written from scratch in C++ and designed to drop in underneath the existing Spark and SQL APIs. From the user's point of view, nothing changes — you write the same Spark SQL or DataFrame code you've always written — but underneath, supported operators run on Photon's C++ engine instead of the older JVM-based Spark execution path. The result is a 2-3x speedup on typical SQL workloads, sometimes much more on the operations Photon handles especially well.

The plain-English description: Photon is Databricks' answer to "Java is too slow for the kind of SQL performance modern warehouses need." Snowflake is written in C++. ClickHouse is written in C++. DuckDB is written in C++. The pattern is consistent — the fastest analytical engines all sit on a tightly-controlled, vectorized C++ runtime that uses SIMD instructions and careful memory layout to chew through columnar data at the speed of the underlying hardware. Spark, by contrast, was written in Scala on the JVM, which made it portable and pleasant to extend but left a lot of performance on the table. Photon closes that gap.

Origin Story

Databricks announced Photon at the Data + AI Summit in 2020, reached general availability in 2022, and published an academic paper at SIGMOD 2022 describing its design. The team had been working on it internally for years before that, motivated by a fundamental observation: as Databricks customers ran more SQL workloads on top of Spark, the JVM-based execution engine was becoming the bottleneck. Spark had layered query optimizations on top of the JVM (Tungsten, whole-stage code generation), but at some point you hit the limits of what you can squeeze out of the JVM, and the only way forward is to leave it.

Photon is Databricks rewriting the execution engine — not the API, not the optimizer, not the planner — in C++, and slotting it into the existing Spark architecture as a transparent accelerator. When a query plan compiles, the planner asks "can Photon handle these operators on this data?" If yes, those operators go to Photon. If no, they fall back to Spark's regular Tungsten execution. Coverage has expanded over time and now includes most common SQL and DataFrame operations.
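The routing decision is conceptually simple, and a small sketch makes it concrete. Everything below is illustrative — the names, the supported-operator set, and the plan shape are hypothetical, not Databricks internals:

```python
# Illustrative sketch of per-operator engine routing.
# All names here are hypothetical -- this shows the *shape* of the
# decision ("native path if supported, JVM fallback if not"), not
# actual Databricks code.

# Operators the native engine is assumed to support in this sketch.
NATIVE_SUPPORTED = {"scan", "filter", "project", "hash_aggregate", "hash_join"}

def route_operator(op_name: str) -> str:
    """Pick an execution path for one operator in the physical plan."""
    if op_name in NATIVE_SUPPORTED:
        return "photon"      # vectorized C++ path
    return "spark_jvm"       # transparent fallback: same results, no speedup

def route_plan(plan: list[str]) -> dict[str, str]:
    """Route every operator in a (flattened) physical plan."""
    return {op: route_operator(op) for op in plan}

# A Python UDF is the classic fallback case: the scan, filter, and
# aggregate run on the native path while the UDF stays on the JVM path.
plan = ["scan", "filter", "hash_aggregate", "python_udf"]
routing = route_plan(plan)
```

The key property the sketch captures is that fallback is per-operator and invisible to the user: a query with one unsupported operator still gets the speedup everywhere else.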

This was a substantial engineering investment. Building a vectorized C++ engine that interoperates safely with a JVM runtime is genuinely hard — memory management, type systems, exception handling, and performance interfaces all have to line up. The fact that it works at all, and works transparently, is one of the more impressive engineering accomplishments in the modern data infrastructure space.

What Photon Does Well

Vectorized columnar execution. Photon processes data in columnar batches ("vectors") rather than one row at a time. For each operator — a filter, an aggregate, a join hash probe — the C++ inner loop runs over a buffer of values using SIMD CPU instructions where possible. This is the same architectural pattern used by every modern fast analytical engine, and it's where most of the speedup comes from.
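The batch-at-a-time pattern is easy to sketch. This toy Python version is only illustrative — the real engine is C++, where a tight loop like this gets unrolled and auto-vectorized into SIMD instructions, which interpreted Python cannot do:

```python
# Toy sketch of batch-at-a-time ("vectorized") filtering.
# A row-at-a-time engine pays interpretation overhead once per row;
# a vectorized engine pays it once per batch and runs one tight loop
# over a contiguous buffer of column values.
from array import array

BATCH_SIZE = 4096  # typical order of magnitude for a batch of values

def filter_gt(batch: array, threshold: float) -> array:
    """Keep values greater than threshold. One hot loop per batch."""
    out = array("d")
    for v in batch:          # the single tight loop over the buffer
        if v > threshold:
            out.append(v)
    return out

batch = array("d", [1.0, 42.0, 7.5, 100.0])
kept = filter_gt(batch, 10.0)   # -> array('d', [42.0, 100.0])
```

Note that the data lives in a flat, contiguous buffer (`array("d")`), which is the other half of the trick: cache-friendly memory layout is what lets the SIMD loop run at hardware speed.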

Tight integration with Delta Lake and Parquet. Photon's readers for Parquet and Delta tables are heavily optimized. Predicate pushdown, column pruning, and data-skipping indexes are all integrated, so Photon often does much less I/O than a naive engine would. This matters a lot on the lakehouse storage that Databricks is built around.
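Data skipping is worth a sketch too, because it's where "much less I/O" comes from: each file in a Delta table carries per-column min/max statistics, and files whose range cannot satisfy the predicate are never read at all. The names below are hypothetical, but the mechanism is the standard one:

```python
# Illustrative data-skipping sketch: Delta-style per-file min/max stats
# let the reader drop whole files before doing any I/O on them.
# File names and the stats shape are hypothetical.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    col_min: int
    col_max: int

def prune_files(files: list[FileStats], lo: int, hi: int) -> list[str]:
    """Keep only files whose [col_min, col_max] range can overlap [lo, hi]."""
    return [f.path for f in files if f.col_max >= lo and f.col_min <= hi]

files = [
    FileStats("part-000.parquet", col_min=0,   col_max=99),
    FileStats("part-001.parquet", col_min=100, col_max=199),
    FileStats("part-002.parquet", col_min=200, col_max=299),
]
# Predicate: WHERE col BETWEEN 120 AND 150 -> only part-001 is read.
to_read = prune_files(files, 120, 150)
```

Combine this with column pruning (read only the columns the query mentions) and predicate pushdown into the Parquet reader, and a selective query touches a small fraction of the bytes on disk.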

Interactive SQL workloads. The biggest beneficiary of Photon is Databricks SQL Warehouses — the interactive BI/analyst workload, where you want sub-second to seconds-of-latency responses on dashboards. Databricks publishes TPC-DS benchmark numbers showing Photon-on-Databricks-SQL outperforming traditional warehouses. Take vendor benchmarks with the usual grain of salt, but the underlying improvement is real.

Drop-in transparency. From the user's perspective, Photon is just a checkbox (and in many newer cluster types, a default-on option). You don't have to rewrite anything. Your existing Spark SQL and DataFrame code automatically gets faster on supported operators. This is unusual for a major engine swap.
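On classic clusters, the checkbox corresponds to a runtime-engine setting in the cluster spec. A minimal fragment of a cluster-creation request might look like the following — field names follow the Databricks Clusters API as of this writing, and the version and node-type strings are placeholders, so verify against the current docs before using:

```json
{
  "cluster_name": "photon-example",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "runtime_engine": "PHOTON"
}
```

On Databricks SQL Warehouses there's nothing even to toggle — Photon is simply part of the engine.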

What Photon Does Not Do

It does not exist outside Databricks. This is the most important thing to understand about Photon strategically. It is not open source. There is no community edition. There is no way to download it, deploy it independently, or use it with non-Databricks Spark. Photon is locked into the Databricks platform, and that is a deliberate product decision. You cannot use Photon to query an Iceberg table from outside Databricks. You cannot run Photon against Postgres for federation. You cannot benchmark it against other engines on your own hardware.

This is the opposite of how Spark itself was built. Spark is open source, runs everywhere, and is one of the reasons Databricks succeeded — the openness of the engine drove adoption that Databricks then converted into commercial value. Photon goes the other way. It is a moat. The bet is that Databricks customers will pay for the speedup and that the performance lead over open-source alternatives will help defend against Snowflake and other competitors.

It does not cover everything. Photon supports a growing but not complete set of Spark SQL operators and data types. Workloads that use unsupported features fall back to the JVM Spark engine, which is fine functionally but doesn't get the speedup. The set of unsupported operations has shrunk over the years and continues to shrink, but coverage is not 100%.

It is not a federation engine. Photon executes queries against data Databricks already manages — Delta Lake, Iceberg via Unity Catalog, Parquet on object storage. It is not designed to reach out to dozens of external sources the way Trino is. Databricks' Lakehouse Federation feature does some of that, but it's a separate product layer.

The Honest Market Take

Photon is a real engineering accomplishment and a genuine performance improvement for Databricks customers. It is also a transparent strategic move to lock value into the Databricks runtime. Both of these things are true.

For an organization that has already chosen Databricks, Photon is essentially a free win — you get a 2-3x speedup on SQL workloads with no code changes, and the price is built into the cluster type you're already paying for. There's no good reason to turn it off.

For an organization choosing between platforms, Photon should not be the deciding factor. The deciding factors are everything else about Databricks — the lakehouse architecture, Unity Catalog, MLflow, the Spark heritage, the pricing model, the existing Spark talent on your team. If Databricks is the right platform overall, Photon is a nice bonus. If it's not, Photon won't change that.

The interesting analog is Velox, the open-source C++ vectorized engine being built primarily by Meta (and intended to drop into Presto and other engines). Velox is the open-source counter-bet: build the same kind of C++ vectorized execution engine, but make it embeddable in any project. If Velox succeeds, the long-term differentiation that Photon offers becomes harder to defend. If it doesn't, Photon stays a meaningful Databricks-only advantage.

TextQL Fit

TextQL connects to Databricks via the Databricks SQL Connector. When TextQL points at a Databricks SQL Warehouse, queries automatically execute on Photon (for supported operators) — there's nothing to configure on TextQL's side; the speedup is transparent. For TextQL deployments backed by a Databricks lakehouse, this is the natural and recommended setup, and it's the same pattern as connecting TextQL to Snowflake or BigQuery: you talk to the warehouse's SQL endpoint and let the underlying engine do its thing.

See TextQL in action


Databricks Photon
Announced: 2020
GA: 2022
Origin: Databricks
License: Commercial / proprietary (Databricks-only)
Written in: C++
Replaces: JVM-based Spark execution engine (for supported operators)
Category: Query Engines
Monthly mindshare: ~50K · Databricks-only; subset of Databricks SQL Warehouse users