DuckDB | Data Ecosystem Wiki


DuckDB

DuckDB is an embedded analytical database, often called the SQLite of analytics. It runs in-process, has no external dependencies, and on data that fits on one machine it often queries Parquet and CSV files faster than a distributed cluster would. MotherDuck is the commercial cloud product built on top of it.

DuckDB is an embedded analytical database. That phrase deserves unpacking. "Embedded" means it runs inside another process — a Python script, an R session, a Node app, a binary — with no server, no cluster, no daemon. "Analytical" means it's columnar and vectorized, optimized for the kind of group-by-and-aggregate queries that warehouses are built for. Put those two things together and you get something that did not really exist before DuckDB: a serious OLAP engine that fits inside your laptop.

The standard one-line description is "DuckDB is the SQLite of analytics." It is a pretty good line. SQLite proved that an in-process, single-file, zero-config relational database is enormously useful even though it doesn't scale horizontally: it ended up shipping in essentially every phone, browser, and operating system on Earth. DuckDB is making the same bet for analytics. Most analytical jobs do not actually need a cluster. They need a fast columnar engine that can chew through a few billion rows on a laptop with 32GB of RAM. DuckDB does that, and does it well and quickly.

Origin Story

DuckDB came out of the database research group at CWI in Amsterdam — the same institute that produced Python (Guido van Rossum) and that has been at the center of academic columnar database research for two decades. The creators, Mark Raasveldt and Hannes Mühleisen, wanted an analytical database that could be embedded in data science tools the same way SQLite is embedded in everything else. The first public release was in 2019.

The early DuckDB project was small, academic, and quietly excellent. Around 2021-2022 it started to go viral in data circles. Bloggers benchmarked it against Spark and Presto and discovered that on single-node workloads, which turn out to be most workloads, DuckDB was faster and simpler and required no infrastructure. Pandas users discovered that DuckDB could run SQL directly against their DataFrames. dbt added a DuckDB adapter. Suddenly DuckDB was everywhere.

In 2022, Hannes Mühleisen and Jordan Tigani (formerly of Google BigQuery) co-founded MotherDuck, a venture-backed company building a commercial cloud product on top of DuckDB. MotherDuck raised $100 million across seed, Series A, and Series B rounds and reached general availability in 2024. The pitch is "hybrid execution": queries run partly on your local DuckDB and partly in MotherDuck's cloud, so you get the speed of local for what's local and the scale of cloud for what's not.

Why DuckDB Is Having a Moment

There are a few reasons DuckDB went from academic curiosity to one of the most-discussed projects in the data ecosystem in roughly three years.

Single-node hardware got absurdly powerful. A modern laptop has 8-16 cores and 32-128GB of RAM. A modern cloud VM has 96+ cores and a terabyte of RAM. The set of analytical workloads that actually require horizontal scale is much smaller than the data industry pretended throughout the 2010s. Once people noticed, the case for "just use one fast machine" became obvious.

Parquet is everywhere. DuckDB reads Parquet files directly with SELECT * FROM 'file.parquet'. It also reads CSV and JSON files and, through extensions, Apache Iceberg and Delta Lake tables, Postgres and MySQL databases, objects in S3 and GCS, and Hugging Face datasets. The "no ETL, just query the file" experience is genuinely magical the first time you try it.

Zero installation pain. pip install duckdb. No JVM, no cluster, no Docker, no config files. Compare this to setting up Spark or Trino and the appeal is immediate. For data scientists and analysts who just want to run SQL against some files, DuckDB removes essentially every yak shave between you and the answer.

It is genuinely fast. DuckDB's execution engine is a modern vectorized columnar engine — the same kind of engineering that makes Snowflake and Photon fast — packed into a single binary. Benchmarks vary, but on the ClickBench and TPC-H workloads DuckDB consistently sits near the top of the single-node category, often beating distributed systems on equivalent hardware budgets.

It composes with the Python data stack. DuckDB queries Pandas DataFrames, Polars DataFrames, and Arrow tables in-place, with zero copy in many cases. You can write duckdb.sql("SELECT ... FROM my_dataframe") and it just works. This makes DuckDB the natural SQL layer for the modern Python notebook.

What DuckDB Is Not

It is not distributed. DuckDB runs on one machine. If your data and your working set genuinely don't fit on one big machine, DuckDB is not the answer. (MotherDuck addresses some of this by extending DuckDB into the cloud, but the core engine is single-node.)

It is not concurrent in the traditional database sense. DuckDB is designed for one (or a few) heavy analytical readers per process, not for thousands of simultaneous OLTP-style users. It is not a backend for your web app.

It is not OLTP. Like every other engine in this section, DuckDB is for analytics. Don't put your application's writes in it.

MotherDuck and the Commercial Play

MotherDuck is the most interesting thing happening in the DuckDB world commercially. The product is best described as "DuckDB with a cloud." You write the same DuckDB SQL you'd write locally, but tables can live in MotherDuck's cloud storage, and queries can execute partly local, partly cloud, with the planner deciding which side does which work. For a small team that wants the simplicity of DuckDB with the persistence and sharing of a hosted service, this is compelling. For a large enterprise weighing it against Snowflake or Databricks, the value proposition is more about cost and developer experience than raw scale.

The honest market take: DuckDB the open-source project has clearly won its category; no other embedded analytical SQL engine has comparable adoption. Whether MotherDuck the company can build a durable business on top of it is the open question. The same dynamic that made DuckDB great (single-node, embedded, runs anywhere) makes it harder to monetize than a hosted warehouse where every query has to go through your servers.

TextQL Fit

TextQL connects to DuckDB and to MotherDuck. DuckDB is a particularly good backend for TextQL in two scenarios. The first is local prototyping — pointing Ana at a folder of Parquet or CSV files and asking questions in natural language, with no warehouse required. The second is analytical sandboxes inside larger organizations, where small teams need fast, isolated SQL on a slice of data without provisioning real warehouse compute. DuckDB's "no infrastructure" property makes TextQL trivial to set up against it.


DuckDB

Created: 2019 (first release)
Origin: CWI (Centrum Wiskunde & Informatica), Amsterdam
Creators: Mark Raasveldt, Hannes Mühleisen
License: MIT
Written in: C++
Commercial: MotherDuck (founded 2022)
Category: Query Engines
Monthly mindshare: ~120K · ~22K GitHub stars; rising fast as the SQLite of analytics; massive growth in 2024-25