Stream Processing
Stream processing engines compute on data while it is in motion: transforming, joining, aggregating, and filtering events as they flow. Apache Flink is the dominant open-source engine; Ververica, Decodable, Confluent (via Immerok), and AWS all sell managed Flink. Materialize and ksqlDB compete in different niches.
A stream processor is a computation engine that runs continuously on data in motion. While an event streaming platform like Kafka is the transport — a durable log that producers write to and consumers read from — a stream processor is the compute layer that does interesting things to those events as they pass through. It joins streams together, aggregates events into windows, deduplicates, enriches with reference data, detects patterns, and emits the results to downstream systems.
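As a toy illustration of that compute layer (not a real engine; a plain Python list stands in for a Kafka topic, and the names here are invented for the sketch), the dedupe-and-enrich steps above can be written as generator stages:

```python
from collections import Counter

def dedupe(events, key="event_id"):
    # Drop events whose id was already seen (a deduplication cache).
    seen = set()
    for e in events:
        if e[key] not in seen:
            seen.add(e[key])
            yield e

def enrich(events, reference):
    # Join each event with reference data (here, user -> country).
    for e in events:
        yield {**e, "country": reference.get(e["user"], "unknown")}

# Stand-in for a Kafka topic; a real pipeline would consume continuously.
raw = [
    {"event_id": 1, "user": "a"},
    {"event_id": 1, "user": "a"},   # duplicate delivery
    {"event_id": 2, "user": "b"},
]
reference = {"a": "US", "b": "DE"}

processed = list(enrich(dedupe(raw), reference))
counts = Counter(e["country"] for e in processed)
print(counts)   # Counter({'US': 1, 'DE': 1})
```

A real stream processor does the same shapes of work, but distributed, stateful across restarts, and continuous rather than over a finite list.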
The mental model worth holding: if event streaming is the highway, stream processing is the factory next to the highway. Trucks of events drive in, the factory does work on them, and trucks of processed events drive back out. The highway and the factory are different things, with different vendors, different engineering teams, and different operational concerns.
Before going further, here is the distinction this wiki insists on: the open-source engine and the commercial company that sells it are not the same thing. In stream processing, the dominant open-source engine is Apache Flink, and there are multiple commercial vendors selling managed Flink, each with a slightly different positioning:
Multiple meaningful commercial vendors selling the same open-source engine is unusual — the market more commonly resolves into one dominant steward (Confluent for Kafka, Databricks for Spark). Flink has not consolidated that way, and the multi-vendor landscape is one of the more interesting features of the stream processing market in 2026.
| OSS engine | Commercial vendors | Notes |
|---|---|---|
| Apache Flink | Ververica, Decodable, Confluent (via Immerok), AWS, Aiven | The dominant engine, with multiple commercial homes. |
| Apache Spark Structured Streaming | Databricks, AWS EMR, others | Streaming bolted onto a batch engine. The "good enough" option for Spark shops. |
| Apache Beam | Google Dataflow (the canonical implementation), Flink runner, Spark runner | A pipeline API, not an engine. Compiles to Flink, Spark, or Dataflow. |
| ksqlDB | Confluent only | Confluent's SQL-on-Kafka engine, strategically de-emphasized in favor of Flink SQL. |
| Materialize (proprietary) | Materialize | Incremental view maintenance, built on differential dataflow. Not Flink-based. |
| RisingWave | RisingWave Labs | Newer SQL-streaming database, similar incremental computation model to Materialize. |
The traditional data world is batch: you collect a day's worth of data, run a job at midnight that processes all of it, and have results by morning. The streaming world is the opposite: you process each event as it arrives, maintain running state, and emit incrementally updated results.
The key insight is that batch processing is a special case of stream processing — specifically, the case where the "stream" happens to be a finite, bounded dataset. This is the philosophical foundation of Apache Flink, which treats batch as bounded streaming and uses the same engine for both. (Spark, by contrast, started as a batch system and bolted streaming on top, which is why Spark Structured Streaming has always felt slightly awkward compared to Flink.)
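One way to see "batch is bounded streaming" concretely: the same streaming operator works unchanged over a finite list and over an unbounded generator, with the bounded case simply terminating. A minimal Python sketch of the idea (not Flink's actual unified runtime):

```python
import itertools

def running_sum(stream):
    # One streaming operator: maintains state, emits an update per event.
    total = 0
    for x in stream:
        total += x
        yield total

# Bounded input ("batch"): the stream just happens to end.
print(list(running_sum([1, 2, 3, 4])))     # [1, 3, 6, 10]

# Unbounded input: same operator, results consumed incrementally.
live = running_sum(itertools.count(1))
print(list(itertools.islice(live, 4)))     # [1, 3, 6, 10]
```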
The practical consequence: a real stream processor must handle hard things that batch systems can ignore. Things like:
- Events that arrive out of order, and late events that show up after their window has closed
- Stateful computation (running counts, join buffers, deduplication caches) that must survive worker crashes
- Exactly-once semantics end to end, despite retries and replays
- Windowing: carving an unbounded stream into finite slices you can aggregate
These are hard problems. They are what separates a "real" stream processor from a script that polls Kafka in a loop.
Stream processing engines fall into three rough architectural camps. Understanding these is the cleanest way to think about the category.
1. Dataflow processors (Flink, Spark Structured Streaming, Beam). You write a pipeline as a directed graph of operators — map, filter, join, aggregate — that the engine deploys across workers. State is checkpointed periodically; on failure, the engine restarts from the last checkpoint and replays events from the source. This is the dominant model for serious stream processing at scale. Apache Flink is the leader of this camp and, in 2026, the default choice for any new large-scale streaming workload.
2. SQL-on-streams (ksqlDB, Flink SQL, Materialize, RisingWave). You write SQL that describes a continuous query, and the engine compiles it into a dataflow under the hood. The pitch is accessibility — you don't need to write Java or Scala or learn a custom DSL. ksqlDB was the early leader of this camp; Flink SQL has since become the standard, and Materialize and RisingWave compete on incremental view maintenance.
3. Incremental view maintenance (Materialize, RisingWave, Feldera). A specialized variant of SQL-on-streams that focuses on maintaining the result of a SQL query as new data arrives, rather than processing events one at a time. You write a CREATE MATERIALIZED VIEW, and the engine guarantees the view is always up to date with sub-second latency. Materialize is its commercial flagship.
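The incremental model can be illustrated with a hand-rolled counter view: instead of recomputing a GROUP BY from scratch on each change, the view applies insert and delete deltas as they arrive. This is a toy sketch of the idea, not Materialize's differential dataflow:

```python
class MaterializedCount:
    """Keeps the result of SELECT key, count(*) ... GROUP BY key
    up to date incrementally, one delta at a time."""
    def __init__(self):
        self.view = {}

    def apply(self, key, delta):
        # delta: +1 for an inserted row, -1 for a deleted/retracted row.
        self.view[key] = self.view.get(key, 0) + delta
        if self.view[key] == 0:
            del self.view[key]   # groups with count 0 leave the result

view = MaterializedCount()
for key, delta in [("a", +1), ("b", +1), ("a", +1), ("b", -1)]:
    view.apply(key, delta)
print(view.view)   # {'a': 2}
```

Each update costs work proportional to the delta, not to the size of the underlying data, which is the entire appeal of the approach.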
Windowing. A window is a finite slice of an infinite stream. The most common types: tumbling (every 5 minutes, non-overlapping), sliding (last 5 minutes, recomputed every 1 minute), and session (group events that happen close together, with a timeout). Windowing is how you turn a continuous stream into something you can aggregate.
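Window assignment is simple arithmetic on event timestamps. A sketch of tumbling and sliding assignment (timestamps in seconds, values chosen for illustration; session windows need stateful gap tracking and are omitted):

```python
def tumbling(ts, size):
    # Each event falls in exactly one non-overlapping window.
    start = ts // size * size
    return [(start, start + size)]

def sliding(ts, size, slide):
    # Each event falls in size/slide overlapping windows [s, s + size)
    # with s a multiple of slide and ts - size < s <= ts.
    first = (ts - size) // slide * slide + slide
    last = ts // slide * slide
    return [(s, s + size) for s in range(first, last + 1, slide)]

print(tumbling(130, size=300))             # [(0, 300)]
print(sliding(130, size=300, slide=60))    # 5 windows, each containing 130
```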
Watermarks. Since events can arrive out of order, the engine needs to know when it is "safe" to close a window. A watermark is a timestamp the engine maintains that says "I do not expect any more events older than this." When the watermark passes the end of a window, the window is finalized and its result is emitted. Late events that arrive after the watermark are either dropped, sent to a side output, or used to update the already-emitted result.
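The watermark mechanics can be sketched with the common bounded-out-of-orderness heuristic: the watermark trails the highest event time seen by a fixed allowance, windows close when the watermark passes their end, and (in this toy version) late events are simply dropped:

```python
from collections import defaultdict

ALLOWED_LATENESS = 2   # watermark trails max event time by 2 (assumption)
WINDOW = 10            # tumbling event-time windows [0,10), [10,20), ...

open_windows = defaultdict(int)   # window start -> running count
finalized = {}
watermark = float("-inf")

for ts in [1, 4, 12, 3, 15, 9, 21]:   # event times, arriving out of order
    watermark = max(watermark, ts - ALLOWED_LATENESS)
    if ts <= watermark:
        continue   # late event: dropped (an engine could side-output it)
    open_windows[ts // WINDOW * WINDOW] += 1
    # Finalize any window whose end the watermark has passed.
    for start in [w for w in open_windows if w + WINDOW <= watermark]:
        finalized[start] = open_windows.pop(start)

print(finalized)      # {0: 2}: window [0,10) closed holding ts=1 and ts=4
print(open_windows)   # windows [10,20) and [20,30) still open
```

Note that ts=3 and ts=9 arrive behind the watermark and are discarded even though their window already emitted a result; the lateness allowance is exactly the trade-off between latency and completeness.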
State and checkpointing. Streaming jobs are stateful: they remember running counts, joined records, deduplication caches. That state lives in the worker's memory and on local disk, and is periodically checkpointed to durable storage (S3, HDFS) so that if a worker crashes, the engine can recover from the last checkpoint and replay events from the source.
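The recovery loop can be sketched end to end: checkpoint (state, source offset) pairs, crash, restore the checkpoint, rewind the source, replay. A toy simulation with a Python list as the replayable log (real engines checkpoint asynchronously to S3 or HDFS):

```python
import copy

SOURCE = ["a", "b", "b", "a", "c"]   # replayable log (stand-in for Kafka)

def process(state, event):
    state[event] = state.get(event, 0) + 1   # running count per key

# Run, checkpointing (state, offset) every 2 events; crash at offset 3.
state, checkpoint = {}, ({}, 0)
try:
    for offset, event in enumerate(SOURCE):
        if offset == 3:
            raise RuntimeError("worker crash")
        process(state, event)
        if (offset + 1) % 2 == 0:
            checkpoint = (copy.deepcopy(state), offset + 1)   # to "S3"
except RuntimeError:
    # Recovery: restore checkpointed state, rewind the source, replay.
    state, resume = copy.deepcopy(checkpoint[0]), checkpoint[1]
    for event in SOURCE[resume:]:
        process(state, event)

print(state)   # {'a': 2, 'b': 2, 'c': 1}
```

The event at offset 2 was processed once before the crash and once during replay, but because the state rolled back to the checkpoint, it is counted exactly once in the final result.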
Exactly-once semantics. The holy grail. Achieved through a combination of source replayability (Kafka), checkpointed state, and transactional or idempotent sinks. Flink was the first open-source engine to ship a credible end-to-end exactly-once implementation in 2017.
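The sink side of that combination can be sketched as an idempotent writer: each write carries a version (for example, the checkpoint epoch), so a replay after recovery cannot double-apply. This is a toy sketch of one ingredient; transactional sinks reach the same end with two-phase commit instead:

```python
class IdempotentSink:
    # Deduplicates writes by (key, version) so replayed writes are no-ops.
    def __init__(self):
        self.rows = {}
        self.applied = set()

    def upsert(self, key, value, version):
        if (key, version) in self.applied:
            return   # replayed write after recovery: skip
        self.applied.add((key, version))
        self.rows[key] = value

sink = IdempotentSink()
sink.upsert("total", 10, version=1)
sink.upsert("total", 10, version=1)   # duplicate delivery during replay
sink.upsert("total", 25, version=2)
print(sink.rows)   # {'total': 25}
```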
Apache Flink won the open-source category. It is the default choice for serious streaming at scale (Netflix, Uber, Stripe, Alibaba, ByteDance, Pinterest), and the foundation of every major managed streaming compute service. In 2026, if you are picking a stream processor for a new project and you do not have a strong reason to choose otherwise, you should pick Flink.
The commercial Flink market is unusually fragmented. Unlike Kafka (where Confluent is the dominant commercial steward) or Spark (where Databricks dominates), Flink has no single dominant vendor. Ververica has the deepest committer concentration but is constrained by its Alibaba ownership. Decodable has the most polished modern UX but is a smaller company. Confluent Cloud Flink has the largest distribution channel via the Kafka customer base. AWS Managed Service for Apache Flink dominates AWS-resident workloads by default. None of these has clearly won, and the choice between them often comes down to which other vendor relationships you already have.
Spark Structured Streaming is the "good enough" option for Spark shops. It is not Flink-quality for low-latency or stateful workloads, but if you already run Spark for batch and your streaming requirements are modest (latency in tens of seconds, simple transformations), it is the path of least resistance. This is what Databricks pushes.
ksqlDB has been quietly de-emphasized. Confluent has pivoted toward Flink SQL on Confluent Cloud as its primary stream processing story. ksqlDB still exists and is still useful for simple Kafka-to-Kafka transformations, but its strategic momentum has stalled.
Materialize is the bet on incremental view maintenance as a real category. Founded in 2019 by people who came from the academic world of differential dataflow, Materialize treats stream processing as "make a SQL view always up to date." Technically distinctive, with loyal users, but the category is small compared to Flink-style dataflow.
| Scenario | Stream processor? | Why |
|---|---|---|
| Real-time fraud detection or alerting | Yes (Flink) | Latency in milliseconds, stateful pattern detection |
| Continuously updated leaderboards or counters | Yes (Flink, Materialize) | Stateful aggregation over a stream |
| Joining a Kafka stream with a slowly-changing reference table | Yes (Flink, Materialize) | Streaming joins are exactly the use case |
| Hourly batch report from data in S3 | No | Use a warehouse and a workflow orchestrator |
| "I want my Kafka data in the warehouse, that's it" | Maybe | Decodable, Upsolver, Snowpipe Streaming, or Flink SQL |
| Simple Kafka topic-to-topic transformation | Yes (ksqlDB or Kafka Streams) | Lighter weight than full Flink |
| ML feature computation in real time | Yes (Flink) | Stateful feature pipelines feeding online models |
Open-source engines: Apache Flink, Spark Structured Streaming, Apache Beam, Kafka Streams, ksqlDB, RisingWave.
Commercial vendors of Flink: Ververica, Decodable, Confluent (Confluent Cloud Flink, via Immerok), AWS (Managed Service for Apache Flink), Aiven.
Other engines and approaches: Materialize, RisingWave, Feldera, Google Dataflow.
Stream processors are typically upstream of where TextQL connects. They sit between Kafka and the analytical destinations — warehouses, lakehouses, or real-time OLAP databases — that TextQL Ana queries. The stream processing layer determines what shape the data is in by the time TextQL sees it: whether events are pre-aggregated, joined, enriched, and how fresh they are. A well-designed streaming pipeline puts clean, queryable data in front of TextQL within seconds of the original event.