Real-time Analytics
Real-time analytics databases are OLAP systems built to answer analytical queries over fresh, high-volume event data in milliseconds rather than seconds. ClickHouse has become the dominant choice; Druid and Pinot are the prior generation; Rockset was acquired by OpenAI in 2024.
A real-time analytics database is a specialized OLAP system designed to do two things at the same time, both of which traditional warehouses do badly: ingest events as they happen, and answer analytical queries over those events in milliseconds. Cloud data warehouses like Snowflake and BigQuery are powerful, but they were built for "scan a billion rows in seconds when an analyst clicks Run." Real-time analytics databases are built for "scan a billion rows in 50 milliseconds when a user loads a dashboard, and do this for thousands of users concurrently, with data that's no more than a few seconds old."
The mental model worth holding: a data warehouse is for asking ad-hoc questions about yesterday. A real-time analytics database is for powering a live dashboard, an in-product analytics feature, or an alerting system that needs answers about what is happening right now. They are not the same product even though they are both "OLAP databases."
If warehouses are so good, why do we need a separate category at all? Three reasons:
1. Latency floor. Snowflake, BigQuery, and Redshift have query latency in the seconds-to-tens-of-seconds range for typical analytical queries. That is fine for an analyst running a query, but unacceptable for a user-facing dashboard that needs to render in under a second when 10,000 users hit it at the same time. Real-time OLAP databases hit p95 query latencies in the tens of milliseconds for the same kinds of aggregations.
2. Concurrency. Warehouses are designed for tens to low hundreds of concurrent queries. A user-facing analytics product might need thousands or tens of thousands of concurrent queries. Real-time OLAP engines are built for that level of concurrency on commodity hardware.
3. Streaming ingestion. Warehouses traditionally batch-load data on intervals (Snowpipe Streaming has narrowed this gap, but not fully). Real-time OLAP databases ingest events directly from Kafka or Kinesis with end-to-end freshness in seconds.
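A rough way to see how the first two points compound is Little's law: sustained throughput is roughly the number of queries in flight divided by how long each one takes. The sketch below plugs in illustrative figures picked from the ranges quoted above (they are assumptions for the arithmetic, not benchmarks of any product).

```python
def sustained_qps(concurrent_queries: int, latency_seconds: float) -> float:
    """Little's law: throughput = queries in flight / time each query takes."""
    return concurrent_queries / latency_seconds


# Illustrative inputs only, chosen from the ranges in the list above.
warehouse_qps = sustained_qps(concurrent_queries=100, latency_seconds=5.0)
realtime_qps = sustained_qps(concurrent_queries=1000, latency_seconds=0.05)

print(f"warehouse:      ~{warehouse_qps:.0f} queries/s")
print(f"real-time OLAP: ~{realtime_qps:.0f} queries/s")
```

Under these assumed inputs the gap is about three orders of magnitude, which is why a dashboard served to thousands of end users is a different workload from an analyst clicking Run.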
The classic use case where this matters: in-product analytics. Think LinkedIn's "who viewed your profile" page, or Stripe's transaction dashboard, or any SaaS product with a live "your usage this month" view. These are queries running over event data, served to end users, with low latency and high concurrency. A warehouse cannot do this economically. A real-time OLAP database can.
Almost every real-time OLAP database shares a similar architectural recipe:
1. Columnar storage. Like warehouses, they store data column-by-column, not row-by-row. Analytical queries that touch a few columns over many rows scan only the relevant columns.
2. Aggressive compression and encoding. Specialized codecs (delta-of-delta for timestamps, dictionary and run-length encoding for low-cardinality strings, Gorilla-style XOR compression for floats) shrink data dramatically, and some engines can evaluate certain filters directly on the encoded values instead of decompressing them first.
3. Pre-aggregation or rollups. Most engines support rolling up incoming data into pre-computed aggregations at ingest time (Druid is famous for this). This trades some query flexibility for massive speedups.
4. Time-partitioned segments. Data is split into time-based segments (hourly, daily) so queries that filter by recent time ranges only scan recent segments. This is the single most important optimization for time-series-shaped event data.
5. In-memory or SSD-resident hot data. The recent data lives in memory or local SSD; older data is offloaded to cheaper storage. The mix is engine-specific.
6. Streaming ingestion. Direct Kafka/Kinesis consumers, with events queryable seconds after ingestion.
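To make the encoding item in the recipe concrete, here is a minimal sketch of delta-of-delta encoding for timestamps. It shows only the arithmetic. Real codecs also bit-pack the resulting small integers, which this sketch omits, and the function names are hypothetical rather than any engine's API.

```python
from typing import List


def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    """Return [first value, first delta, second-order deltas...]."""
    if len(timestamps) < 2:
        return list(timestamps)
    first_delta = timestamps[1] - timestamps[0]
    out = [timestamps[0], first_delta]
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # near zero for regularly spaced events
        prev_delta = delta
    return out


def delta_of_delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_of_delta_encode."""
    if len(encoded) < 2:
        return list(encoded)
    values = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        values.append(values[-1] + delta)
    return values


# Events roughly every 10 seconds, with one arriving a second late.
ts = [1700000000, 1700000010, 1700000020, 1700000030, 1700000041, 1700000051]
encoded = delta_of_delta_encode(ts)
print(encoded)  # [1700000000, 10, 0, 0, 1, -1]
assert delta_of_delta_decode(encoded) == ts
```

After the first two entries, the stream is mostly zeros and tiny integers, which is what lets a bit-packing stage store each timestamp in a few bits.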
The differences between ClickHouse, Druid, and Pinot are in the details of these recipes: how they handle joins (some do not handle them well), how they manage schema evolution, what kinds of queries they optimize for, and how they trade off ingestion latency vs. query latency vs. operational complexity.
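As an illustration of two recipe items, ingest-time rollup and time-partitioned segments, the toy below keeps one pre-aggregated counter per hour and region, and skips every hourly segment outside the query's time filter. It is a sketch under simplifying assumptions (in-memory dictionaries, a single count metric, hypothetical function names), not how any particular engine lays out its segments.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

# segments[hour_start][region] -> rolled-up event count
Segments = Dict[datetime, Dict[str, int]]


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hourly segment."""
    return ts.replace(minute=0, second=0, microsecond=0)


def ingest(segments: Segments, events: Iterable[Tuple[datetime, str]]) -> None:
    """Roll up at ingest: store a count per (hour, region), not raw rows."""
    for ts, region in events:
        segments[hour_bucket(ts)][region] += 1


def count_by_region(
    segments: Segments, start: datetime, end: datetime
) -> Tuple[Dict[str, int], int]:
    """Sum counts for segments in [start, end); also report segments read."""
    totals: Dict[str, int] = defaultdict(int)
    scanned = 0
    for hour, counts in segments.items():
        if not (start <= hour < end):
            continue  # pruned: this segment cannot match the time filter
        scanned += 1
        for region, n in counts.items():
            totals[region] += n
    return dict(totals), scanned


def utc(hour: int, minute: int) -> datetime:
    return datetime(2026, 1, 1, hour, minute, tzinfo=timezone.utc)


segments: Segments = defaultdict(lambda: defaultdict(int))
ingest(segments, [
    (utc(9, 5), "eu"), (utc(9, 40), "eu"), (utc(10, 15), "us"),
    (utc(11, 1), "eu"), (utc(11, 30), "us"), (utc(11, 59), "us"),
])

totals, scanned = count_by_region(segments, utc(11, 0), utc(12, 0))
print(totals, f"scanned {scanned} of {len(segments)} segments")
```

The rollup is where query flexibility is given up: once events are collapsed to hourly counts per region, a question about individual users or per-minute detail can no longer be answered from this table.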
For a long time, the conventional wisdom in this category was a three-way race between Apache Druid (built at Metamarkets, oldest of the three, ad-tech roots), Apache Pinot (LinkedIn-built, used to power LinkedIn's profile views), and ClickHouse (Yandex origin, originally for clickstream analytics). All three solved similar problems with similar architectures.
That race is over. ClickHouse won. It is the dominant real-time OLAP database in 2026 by every meaningful measure: adoption, ecosystem, mindshare, GitHub activity, hiring demand.
Apache Druid is fading, not dead. It remains in production at large adopters like Netflix, Airbnb, and Lyft. New deployments are rarer.
Apache Pinot still has its niche — LinkedIn famously uses it to power user-facing features at scale, and StarTree is doing real work commercializing it — but it has not displaced ClickHouse outside its core constituency.
Rockset was the venture-backed entrant that pitched "real-time analytics with full SQL on schemaless data." It was technically interesting and well-funded but never reached the scale of the others. OpenAI acquired Rockset in June 2024, ostensibly to use its real-time analytics capabilities in OpenAI's infrastructure. Rockset's external commercial product was wound down. The lesson, perhaps: this category is hard to monetize as a standalone vendor when ClickHouse exists as a free open-source alternative.
Real-time OLAP databases sit downstream of event streaming and stream processing, and at the same level as data warehouses — but for a different use case. The typical architecture: events flow through Kafka, are optionally enriched by Flink, and are written into ClickHouse (or Druid, or Pinot) where they become queryable for live dashboards, in-product analytics, observability tools, and operational alerting.
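The shape of that pipeline can be sketched in a few lines of plain Python. Everything here is an in-process stand-in chosen so the example runs anywhere: a `deque` plays the Kafka topic, a function plays the Flink enrichment job, and a list plays the ClickHouse table. The event fields and the `PLAN_BY_USER` lookup are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Dict, List

Event = Dict[str, str]

# Hypothetical reference data joined in by the enrichment stage.
PLAN_BY_USER = {"u1": "pro", "u2": "free"}


def enrich(event: Event) -> Event:
    """Stream-processing stage: attach the user's plan to the event."""
    return {**event, "plan": PLAN_BY_USER.get(event["user"], "unknown")}


def run_pipeline(topic: Deque[Event], table: List[Event]) -> None:
    """Drain the stream, enrich each event, append it to the OLAP table."""
    while topic:
        table.append(enrich(topic.popleft()))


def page_views_by_plan(table: List[Event]) -> Counter:
    """Serving-side query: an aggregation over the freshly written rows."""
    return Counter(row["plan"] for row in table if row["type"] == "page_view")


topic: Deque[Event] = deque([
    {"user": "u1", "type": "page_view"},
    {"user": "u2", "type": "page_view"},
    {"user": "u1", "type": "click"},
    {"user": "u3", "type": "page_view"},
])
table: List[Event] = []

run_pipeline(topic, table)
print(dict(page_views_by_plan(table)))
```

In a real deployment each stand-in is a separate distributed system, and the enrichment stage is optional: many pipelines write from the stream straight into the OLAP database.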
Real-time OLAP databases are not a replacement for warehouses. They are a complement. Most large data organizations run both: a warehouse for batch analytics, ad-hoc questions, and ML training data; a real-time OLAP database for the use cases that need millisecond latency on fresh event data.
| Scenario | Real-time OLAP? | Alternative or note |
|---|---|---|
| User-facing live dashboard with thousands of concurrent users | Yes | Warehouse cannot handle the concurrency |
| In-product analytics ("your usage this month") | Yes | Same |
| Observability / log analytics at high volume | Yes (often ClickHouse) | Elasticsearch is the legacy alternative |
| Real-time fraud detection or alerting | Yes | Combined with stream processing |
| Internal BI dashboards refreshed daily | No | Warehouse |
| Ad-hoc analyst queries on historical data | No | Warehouse |
| ML training data preparation | No | Lakehouse / warehouse |
| Real-time leaderboards with millisecond freshness | Maybe | Or Materialize |
Real-time OLAP databases are first-class TextQL connections. Because ClickHouse, Druid, and Pinot all speak SQL (with some dialect variation), TextQL Ana can query them directly the same way it queries Snowflake or BigQuery. The interesting use case is freshness: when a business user asks "how many sign-ups did we get in the last hour, by region," TextQL pointed at ClickHouse can answer with data that is seconds old, not yesterday's batch. Real-time OLAP backends are how TextQL goes from "analytics on historical data" to "analytics on what's happening right now."
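For a sense of what the sign-ups question looks like as SQL, the sketch below runs one plausible form of it. Several things are assumptions made for illustration: the `signups` table and its columns are hypothetical, SQLite stands in for ClickHouse so the example runs without a server, and the "current time" is pinned to a constant so the result is reproducible. It does not show what TextQL actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (ts INTEGER, region TEXT)")  # ts = Unix seconds

now = 1_700_003_600  # fixed "current time" so the example is reproducible
conn.executemany(
    "INSERT INTO signups VALUES (?, ?)",
    [
        (now - 30, "emea"),
        (now - 600, "amer"),
        (now - 1200, "amer"),
        (now - 3599, "apac"),
        (now - 7200, "emea"),  # two hours old, so outside the window
    ],
)

rows = conn.execute(
    """
    SELECT region, COUNT(*) AS n
    FROM signups
    WHERE ts >= ? - 3600
    GROUP BY region
    ORDER BY n DESC, region
    """,
    (now,),
).fetchall()

print(rows)  # [('amer', 2), ('apac', 1), ('emea', 1)]
```

The query itself is ordinary SQL. What changes with a real-time backend is that the newest rows in the table are seconds old, so the sign-up from 30 seconds ago is already counted.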