Apache Pulsar
Apache Pulsar is a distributed messaging and streaming platform created at Yahoo in 2012-2013. Open-sourced in 2016 and an Apache top-level project since 2018, it is the most architecturally interesting Kafka alternative (it separates compute from storage via Apache BookKeeper) and the most widely ignored outside of a handful of large adopters.
Apache Pulsar is a distributed messaging and streaming platform that does most of what Kafka does, plus some things Kafka does not, with an architecture many engineers consider cleaner. It was born at Yahoo in 2012-2013 to solve a messaging problem at Yahoo's scale — not just "high throughput," but "thousands of internal teams sharing one messaging system without stepping on each other." That multi-tenant origin shaped Pulsar's design in ways that still distinguish it from Kafka today.
This page is about the open-source project governed by the Apache Software Foundation. The commercial company founded by Pulsar's creators — the one that sells managed Pulsar — is StreamNative, and it has its own page.
If Apache Kafka is the obvious answer to event streaming in 2026, Pulsar is the technically interesting one that lost the popularity contest. It is the platform engineers admire on architecture diagrams and rarely deploy unless they work at one of a handful of very large companies.
In the early 2010s, Yahoo was running dozens of properties — Mail, Sports, Finance, News, Tumblr, Flickr — and dozens of internal teams needed messaging infrastructure. The engineering reality of the time was that each team would either run their own RabbitMQ cluster, hack something together on top of a database, or fight with the existing shared infrastructure. Yahoo's messaging team was tasked with building something that could be a true shared service: one cluster, many tenants, isolation guarantees, geographic replication, and enough scale to handle Yahoo's overall traffic.
The team was led by engineers including Matteo Merli and Sijie Guo. They began deploying Pulsar inside Yahoo around 2013. The project was open-sourced in September 2016, entered the Apache Incubator in June 2017, and graduated to a top-level Apache project in September 2018.
The multi-tenant origin matters because it explains Pulsar's most distinctive features. Kafka was built at LinkedIn to solve LinkedIn's data integration problem — multi-tenancy was an afterthought. Pulsar was built at Yahoo to be a shared service across Yahoo from day one. Tenancy, geo-replication, and resource isolation are not features bolted on; they are in the architecture.
In 2019, Guo founded StreamNative, where Merli serves as CTO; it became the primary commercial sponsor of Pulsar in much the same way Confluent stewards Kafka. DataStax — best known as the commercial company behind Apache Cassandra — also adopted Pulsar as the underlying technology for its Astra Streaming product, giving Pulsar a second commercial home.
Pulsar's most distinctive design choice is the separation of compute and storage. In Kafka, brokers do both — they accept writes from producers and they store the data on local disk. If you need more storage, you add brokers; if you need more throughput, you add brokers. The two scale together whether you want them to or not.
In Pulsar, brokers are stateless — they handle producer/consumer connections and route messages, but they do not store data. Storage is handled by Apache BookKeeper, a separate distributed log storage system (originally built at Yahoo to hold the HDFS NameNode write-ahead log). BookKeeper nodes (called "bookies") store message data in segments. The result is that you can scale brokers and storage independently, replace brokers without moving data, and add storage capacity without rebalancing.
Think of it this way: Kafka is a single-tier architecture where each broker is its own little database. Pulsar is a two-tier architecture where the brokers are routers and BookKeeper is the database. Two-tier architectures are more flexible at scale; one-tier architectures are simpler to operate at small scale. That tradeoff is the fundamental Pulsar-vs-Kafka decision in a nutshell.
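The split can be sketched in a few lines of Python. This is a toy model, not Pulsar code: real placement uses BookKeeper ensembles and ledger metadata, and the class names here are invented for illustration. The key point survives the simplification, though: because brokers hold no state, scaling compute never moves data.

```python
# Toy model of a two-tier cluster: stateless "brokers" route writes into
# segments stored on "bookies". Names are illustrative, not Pulsar APIs.
from collections import defaultdict

class Bookie:
    """Storage node: holds message segments per topic."""
    def __init__(self, name):
        self.name = name
        self.segments = defaultdict(list)  # topic -> list of messages

class TwoTierCluster:
    """Brokers keep no data, so adding or removing one never moves messages."""
    def __init__(self, brokers, bookies):
        self.brokers = list(brokers)                 # names only: stateless
        self.bookies = [Bookie(b) for b in bookies]  # the storage tier

    def publish(self, topic, message):
        # Any broker can accept the write; placement depends only on storage.
        bookie = self.bookies[hash(topic) % len(self.bookies)]
        bookie.segments[topic].append(message)

    def stored(self, topic):
        return [m for b in self.bookies for m in b.segments[topic]]

cluster = TwoTierCluster(brokers=["broker-1"], bookies=["bookie-1", "bookie-2"])
cluster.publish("orders", "order-1")
cluster.brokers.append("broker-2")   # scale compute: no data moves
cluster.publish("orders", "order-2")
assert cluster.stored("orders") == ["order-1", "order-2"]
```

Contrast this with a one-tier model, where adding a node means copying partitions onto its local disk before it can serve traffic.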
This architecture has real advantages: brokers and storage scale independently, brokers can be replaced or restarted without moving data (there is no local state to rebuild), and storage capacity can be added without rebalancing existing partitions.
It also has a feature Kafka does not: unified queue and stream semantics. Pulsar topics support multiple subscription modes — exclusive (one consumer), shared (queue-style load balancing), failover, and key-shared. This means you can use Pulsar as both "Kafka-like log" and "RabbitMQ-like work queue" from the same cluster. For organizations that previously ran both, this consolidation is appealing.
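The four subscription modes can be illustrated with a toy dispatcher. This is plain Python, not the Pulsar client API; the real broker also tracks acknowledgements, redelivery, and hash ranges for key-shared, all omitted here.

```python
# Toy sketch of Pulsar's subscription modes: given (key, message) pairs,
# decide which consumer receives each message.
from itertools import cycle

def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} under a given subscription mode."""
    out = {c: [] for c in consumers}
    if mode == "exclusive":            # one consumer gets everything
        for _key, msg in messages:
            out[consumers[0]].append(msg)
    elif mode == "shared":             # queue-style round-robin load balancing
        rr = cycle(consumers)
        for _key, msg in messages:
            out[next(rr)].append(msg)
    elif mode == "key_shared":         # same key always -> same consumer
        for key, msg in messages:
            out[consumers[hash(key) % len(consumers)]].append(msg)
    elif mode == "failover":           # first consumer, until it fails over
        for _key, msg in messages:
            out[consumers[0]].append(msg)
    return out

msgs = [("user-a", "m1"), ("user-b", "m2"), ("user-a", "m3")]
print(dispatch(msgs, ["c1", "c2"], "shared"))
# → {'c1': ['m1', 'm3'], 'c2': ['m2']}
```

Exclusive and failover give Kafka-like log semantics; shared gives RabbitMQ-like work queues; key-shared keeps per-key ordering while still spreading load.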
Like Kafka, Pulsar has a small constellation of related capabilities, most of them embedded in the broker rather than separate projects:
Pulsar Functions. A lightweight, in-broker stream processing API. Analogous to Kafka Streams but without requiring a separate application or library. Useful for simple per-event transformations; not a substitute for Apache Flink for serious stateful processing.
Pulsar IO. A connector framework analogous to Kafka Connect, built on top of Pulsar Functions. Offers connectors to common sources and sinks. The connector library is smaller than Kafka Connect's.
Pulsar SQL. A Presto/Trino integration that lets you query topics directly with SQL. Useful for ad hoc inspection; not the primary way most teams query Pulsar data.
Apache BookKeeper. Strictly speaking a separate ASF project, but functionally inseparable from Pulsar in production. BookKeeper is the durable storage tier, and operating Pulsar means operating BookKeeper.
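Pulsar Functions boils down to a per-event callback that the broker runtime invokes for each message on an input topic. A minimal sketch of that shape, with a stub runner standing in for the runtime: the process(input, context) signature mirrors the real Python SDK's Function class, but ExclaimFunction and run_locally are invented here for illustration.

```python
# Sketch of the Pulsar Functions programming model. In the real Python SDK a
# function subclasses pulsar.Function; here the runtime is stubbed so the
# shape is runnable without a broker.

class ExclaimFunction:
    """Per-event transform, mirroring the SDK's process(input, context)."""
    def process(self, input, context):
        # In a real deployment, context exposes the logger, user config,
        # state access, and publishing to other topics. Stubbed here.
        return input + "!"

def run_locally(fn, events):
    """Stand-in for the in-broker runtime: apply fn to each input event."""
    return [fn.process(e, context=None) for e in events]

print(run_locally(ExclaimFunction(), ["hello", "pulsar"]))
# → ['hello!', 'pulsar!']
```

This per-event model is why Functions suits simple transformations and routing but not joins, windows, or large state, which is where Flink takes over.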
Despite being technically elegant, Pulsar never came close to Kafka's adoption. The honest reasons:
1. Timing. By 2016, when Pulsar was open-sourced, Kafka was already dominant. Confluent had been selling Kafka commercially for two years, the Kafka ecosystem (Connect, Streams, Schema Registry) was mature, and every adjacent tool integrated with Kafka first. Pulsar came late to a market that had already coalesced.
2. Operational complexity. Pulsar's architecture is elegant on paper but operationally heavier. You run brokers, you run bookies, and (historically) you run ZooKeeper. Three components instead of one means three things to monitor, three things to upgrade, three things to debug. Kafka's "everything in one process" model is operationally simpler even if the scaling story at extreme scale is worse.
3. Ecosystem gap. Every modern data tool has a Kafka connector. Many fewer have a Pulsar connector, and those that do often lag in features. If you want to plug your stream into Snowflake, ClickHouse, Flink, or Debezium, the Kafka path is paved and the Pulsar path has potholes.
4. The Kafka API became the standard. Even the Pulsar ecosystem acknowledged this: a "Kafka-on-Pulsar" (KoP) protocol handler exists that lets Kafka clients talk to Pulsar brokers. When your differentiating product needs to emulate the competitor's protocol, you have lost the protocol war.
5. The commercial sponsor never reached Confluent's scale. StreamNative is a real company with real customers, but it has not had the capital, headcount, or ecosystem-building muscle to push Pulsar into the default-choice position the way Confluent did with Kafka.
Pulsar is not dead: it has real, large-scale adopters, Tencent and ByteDance among them, whose use cases genuinely fit Pulsar better than Kafka.
The pattern is consistent: very large organizations with explicit multi-tenant or multi-region requirements, where Pulsar's architectural advantages outweigh the ecosystem cost.
The honest case for Pulsar in 2026: you have a multi-tenant requirement that Kafka would force you to solve by running many separate Kafka clusters, or you need built-in geo-replication and tiered storage and you do not want to wait for Kafka to mature its versions of those features, or you are running at the scale of Tencent or ByteDance and the architectural ceiling of Kafka's single-tier model is a real concern. If your main constraint is "we just need a Kafka-shaped thing," Kafka is the right answer.
Pulsar is the streaming category's "technically better, commercially second" story. It is the Betamax to Kafka's VHS, the BeOS to Kafka's Windows. The architecture is genuinely cleaner in several ways. The operational story for very-large multi-tenant deployments is genuinely better. But none of that mattered enough to overcome Kafka's three-year head start, larger ecosystem, simpler operational model at small-to-medium scale, and dominant commercial sponsor.
If you are evaluating streaming platforms in 2026 and you are not Tencent or ByteDance, the default rational choice is Kafka. Pulsar is the answer when you have a specific reason — multi-tenancy at scale, geo-replication as a first-class concern, the unified queue-plus-stream semantics — that genuinely tips the calculus. Those reasons exist, and Pulsar shops are real, but they are the exception rather than the rule.
Pulsar lives in the same layer as Kafka and Kinesis — it transports events between producers and consumers. Pulsar Functions provides a lightweight stream processing API embedded in the brokers (analogous to Kafka Streams), and Pulsar IO provides connectors to external systems (analogous to Kafka Connect). Most organizations that pick Pulsar still use Apache Flink for serious stream processing rather than Pulsar Functions.
As with Kafka, TextQL Ana does not query Pulsar directly. It connects to the systems where Pulsar events eventually land — typically a data warehouse, real-time analytics database, or lakehouse. The Pulsar layer determines event freshness; the analytics layer determines what TextQL can answer.