Stream Processing
Stream processing engines compute on data while it is in motion: transforming, joining, aggregating, and filtering events as they flow. Apache Flink is the dominant open-source engine; Ververica, Decodable, Confluent (via Immerok), and AWS all sell managed Flink. Materialize and ksqlDB compete in different niches.
A stream processor is a computation engine that runs continuously on data in motion. While an event streaming platform like Kafka is the transport — a durable log that producers write to and consumers read from — a stream processor is the compute layer that does interesting things to those events as they pass through. It joins streams together, aggregates events into windows, deduplicates, enriches with reference data, detects patterns, and emits the results to downstream systems.
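As a toy illustration of that compute layer (not a real engine; a plain Python list stands in for a Kafka topic, and the names here are invented for the sketch), the dedupe-and-enrich steps above can be written as generator stages:

```python
from collections import Counter

def dedupe(events, key="event_id"):
    # Drop events whose id was already seen (a deduplication cache).
    seen = set()
    for e in events:
        if e[key] not in seen:
            seen.add(e[key])
            yield e

def enrich(events, reference):
    # Join each event with reference data (here, user -> country).
    for e in events:
        yield {**e, "country": reference.get(e["user"], "unknown")}

# Stand-in for a Kafka topic; a real pipeline would consume continuously.
raw = [
    {"event_id": 1, "user": "a"},
    {"event_id": 1, "user": "a"},   # duplicate delivery
    {"event_id": 2, "user": "b"},
]
reference = {"a": "US", "b": "DE"}

processed = list(enrich(dedupe(raw), reference))
counts = Counter(e["country"] for e in processed)
print(counts)   # Counter({'US': 1, 'DE': 1})
```

A real stream processor does the same shapes of work, but distributed, stateful across restarts, and continuous rather than over a finite list.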
The mental model worth holding: if event streaming is the highway, stream processing is the factory next to the highway. Trucks of events drive in, the factory does work on them, and trucks of processed events drive back out. The highway and the factory are different things, with different vendors, different engineering teams, and different operational concerns.
Before going further, here is the distinction this wiki insists on: the open-source engine and the commercial company that sells it are not the same thing. In stream processing, the dominant open-source engine is Apache Flink, and there are multiple commercial vendors selling managed Flink, each with a slightly different positioning:
Multiple meaningful commercial vendors selling the same open-source engine is unusual — the market more commonly resolves into one dominant steward (Confluent for Kafka, Databricks for Spark). Flink has not consolidated that way, and the multi-vendor landscape is one of the more interesting features of the stream processing market in 2026.
| OSS engine | Commercial vendors | Notes |
|---|---|---|
| Apache Flink | Ververica, Decodable, Confluent (via Immerok), AWS, Aiven | The dominant engine, with multiple commercial homes. |
| Apache Spark Structured Streaming | Databricks, AWS EMR, others | Streaming bolted onto a batch engine. The "good enough" option for Spark shops. |
| Apache Beam | Google Dataflow (the canonical implementation), Flink runner, Spark runner | A pipeline API, not an engine. Compiles to Flink, Spark, or Dataflow. |
| ksqlDB | Confluent only | Confluent's SQL-on-Kafka engine, strategically de-emphasized in favor of Flink SQL. |
| Materialize (proprietary) | Materialize | Incremental view maintenance, built on differential dataflow. Not Flink-based. |
| RisingWave | RisingWave Labs | Newer SQL-streaming database, similar incremental computation model to Materialize. |
The traditional data world is batch: you collect a day's worth of data, run a job at midnight that processes all of it, and have results by morning. The streaming world is the opposite: you process each event as it arrives, maintain running state, and emit incrementally updated results.
The key insight is that batch processing is a special case of stream processing — specifically, the case where the "stream" happens to be a finite, bounded dataset. This is the philosophical foundation of Apache Flink, which treats batch as bounded streaming and uses the same engine for both. (Spark, by contrast, started as a batch system and bolted streaming on top, which is why Spark Structured Streaming has always felt slightly awkward compared to Flink.)
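One way to see "batch is bounded streaming" concretely: the same streaming operator works unchanged over a finite list and over an unbounded generator, with the bounded case simply terminating. A minimal Python sketch of the idea (not Flink's actual unified runtime):

```python
import itertools

def running_sum(stream):
    # One streaming operator: maintains state, emits an update per event.
    total = 0
    for x in stream:
        total += x
        yield total

# Bounded input ("batch"): the stream just happens to end.
print(list(running_sum([1, 2, 3, 4])))     # [1, 3, 6, 10]

# Unbounded input: same operator, results consumed incrementally.
live = running_sum(itertools.count(1))
print(list(itertools.islice(live, 4)))     # [1, 3, 6, 10]
```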
The practical consequence: a real stream processor must handle hard things that batch systems can ignore. Things like:
- Events that arrive out of order, and late events that show up after their window has closed
- Stateful computation (running counts, join buffers, deduplication caches) that must survive worker crashes
- Exactly-once semantics end to end, despite retries and replays
- Windowing: carving an unbounded stream into finite slices you can aggregate
These are hard problems. They are what separates a "real" stream processor from a script that polls Kafka in a loop.
Stream processing engines fall into three rough architectural camps. Understanding these is the cleanest way to think about the category.
1. Dataflow processors (Flink, Spark Structured Streaming, Beam). You write a pipeline as a directed graph of operators — map, filter, join, aggregate — that the engine deploys across workers. State is checkpointed periodically; on failure, the engine restarts from the last checkpoint and replays events from the source. This is the dominant model for serious stream processing at scale. Apache Flink is the leader of this camp and, in 2026, the default choice for any new large-scale streaming workload.
2. SQL-on-streams (ksqlDB, Flink SQL, Materialize, RisingWave). You write SQL that describes a continuous query, and the engine compiles it into a dataflow under the hood. The pitch is accessibility — you don't need to write Java or Scala or learn a custom DSL. ksqlDB was the early leader of this camp; Flink SQL has since become the standard, and Materialize and RisingWave compete on incremental view maintenance.
3. Incremental view maintenance (Materialize, RisingWave, Feldera). A specialized variant of SQL-on-streams that focuses on maintaining the result of a SQL query as new data arrives, rather than processing events one at a time. You write a CREATE MATERIALIZED VIEW, and the engine guarantees the view is always up to date with sub-second latency. Materialize is its commercial flagship.
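The incremental model can be illustrated with a hand-rolled counter view: instead of recomputing a GROUP BY from scratch on each change, the view applies insert and delete deltas as they arrive. This is a toy sketch of the idea, not Materialize's differential dataflow:

```python
class MaterializedCount:
    """Keeps the result of SELECT key, count(*) ... GROUP BY key
    up to date incrementally, one delta at a time."""
    def __init__(self):
        self.view = {}

    def apply(self, key, delta):
        # delta: +1 for an inserted row, -1 for a deleted/retracted row.
        self.view[key] = self.view.get(key, 0) + delta
        if self.view[key] == 0:
            del self.view[key]   # groups with count 0 leave the result

view = MaterializedCount()
for key, delta in [("a", +1), ("b", +1), ("a", +1), ("b", -1)]:
    view.apply(key, delta)
print(view.view)   # {'a': 2}
```

Each update costs work proportional to the delta, not to the size of the underlying data, which is the entire appeal of the approach.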
Windowing. A window is a finite slice of an infinite stream. The most common types: tumbling (every 5 minutes, non-overlapping), sliding (last 5 minutes, recomputed every 1 minute), and session (group events that happen close together, with a timeout). Windowing is how you turn a continuous stream into something you can aggregate.
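Window assignment is simple arithmetic on event timestamps. A sketch of tumbling and sliding assignment (timestamps in seconds, values chosen for illustration; session windows need stateful gap tracking and are omitted):

```python
def tumbling(ts, size):
    # Each event falls in exactly one non-overlapping window.
    start = ts // size * size
    return [(start, start + size)]

def sliding(ts, size, slide):
    # Each event falls in size/slide overlapping windows [s, s + size)
    # with s a multiple of slide and ts - size < s <= ts.
    first = (ts - size) // slide * slide + slide
    last = ts // slide * slide
    return [(s, s + size) for s in range(first, last + 1, slide)]

print(tumbling(130, size=300))             # [(0, 300)]
print(sliding(130, size=300, slide=60))    # 5 windows, each containing 130
```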
Watermarks. Since events can arrive out of order, the engine needs to know when it is "safe" to close a window. A watermark is a timestamp the engine maintains that says "I do not expect any more events older than this." When the watermark passes the end of a window, the window is finalized and its result is emitted. Late events that arrive after the watermark are either dropped, sent to a side output, or used to update the already-emitted result.
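The watermark mechanics can be sketched with the common bounded-out-of-orderness heuristic: the watermark trails the highest event time seen by a fixed allowance, windows close when the watermark passes their end, and (in this toy version) late events are simply dropped:

```python
from collections import defaultdict

ALLOWED_LATENESS = 2   # watermark trails max event time by 2 (assumption)
WINDOW = 10            # tumbling event-time windows [0,10), [10,20), ...

open_windows = defaultdict(int)   # window start -> running count
finalized = {}
watermark = float("-inf")

for ts in [1, 4, 12, 3, 15, 9, 21]:   # event times, arriving out of order
    watermark = max(watermark, ts - ALLOWED_LATENESS)
    if ts <= watermark:
        continue   # late event: dropped (an engine could side-output it)
    open_windows[ts // WINDOW * WINDOW] += 1
    # Finalize any window whose end the watermark has passed.
    for start in [w for w in open_windows if w + WINDOW <= watermark]:
        finalized[start] = open_windows.pop(start)

print(finalized)      # {0: 2}: window [0,10) closed holding ts=1 and ts=4
print(open_windows)   # windows [10,20) and [20,30) still open
```

Note that ts=3 and ts=9 arrive behind the watermark and are discarded even though their window already emitted a result; the lateness allowance is exactly the trade-off between latency and completeness.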
State and checkpointing. Streaming jobs are stateful: they remember running counts, joined records, deduplication caches. That state lives in the worker's memory and on local disk, and is periodically checkpointed to durable storage (S3, HDFS) so that if a worker crashes, the engine can recover from the last checkpoint and replay events from the source.
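The recovery loop can be sketched end to end: checkpoint (state, source offset) pairs, crash, restore the checkpoint, rewind the source, replay. A toy simulation with a Python list as the replayable log (real engines checkpoint asynchronously to S3 or HDFS):

```python
import copy

SOURCE = ["a", "b", "b", "a", "c"]   # replayable log (stand-in for Kafka)

def process(state, event):
    state[event] = state.get(event, 0) + 1   # running count per key

# Run, checkpointing (state, offset) every 2 events; crash at offset 3.
state, checkpoint = {}, ({}, 0)
try:
    for offset, event in enumerate(SOURCE):
        if offset == 3:
            raise RuntimeError("worker crash")
        process(state, event)
        if (offset + 1) % 2 == 0:
            checkpoint = (copy.deepcopy(state), offset + 1)   # to "S3"
except RuntimeError:
    # Recovery: restore checkpointed state, rewind the source, replay.
    state, resume = copy.deepcopy(checkpoint[0]), checkpoint[1]
    for event in SOURCE[resume:]:
        process(state, event)

print(state)   # {'a': 2, 'b': 2, 'c': 1}
```

The event at offset 2 was processed once before the crash and once during replay, but because the state rolled back to the checkpoint, it is counted exactly once in the final result.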
Exactly-once semantics. The holy grail. Achieved through a combination of source replayability (Kafka), checkpointed state, and transactional or idempotent sinks. Flink was the first open-source engine to ship a credible end-to-end exactly-once implementation in 2017.
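The sink side of that combination can be sketched as an idempotent writer: each write carries a version (for example, the checkpoint epoch), so a replay after recovery cannot double-apply. This is a toy sketch of one ingredient; transactional sinks reach the same end with two-phase commit instead:

```python
class IdempotentSink:
    # Deduplicates writes by (key, version) so replayed writes are no-ops.
    def __init__(self):
        self.rows = {}
        self.applied = set()

    def upsert(self, key, value, version):
        if (key, version) in self.applied:
            return   # replayed write after recovery: skip
        self.applied.add((key, version))
        self.rows[key] = value

sink = IdempotentSink()
sink.upsert("total", 10, version=1)
sink.upsert("total", 10, version=1)   # duplicate delivery during replay
sink.upsert("total", 25, version=2)
print(sink.rows)   # {'total': 25}
```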
Apache Flink won the open-source category. It is the default choice for serious streaming at scale (Netflix, Uber, Stripe, Alibaba, ByteDance, Pinterest), and the foundation of every major managed streaming compute service. In 2026, if you are picking a stream processor for a new project and you do not have a strong reason to choose otherwise, you should pick Flink.
The commercial Flink market is unusually fragmented. Unlike Kafka (where Confluent is the dominant commercial steward) or Spark (where Databricks dominates), Flink has no single dominant vendor. Ververica has the deepest committer concentration but is constrained by its Alibaba ownership. Decodable has the most polished modern UX but is a smaller company. Confluent Cloud Flink has the largest distribution channel via the Kafka customer base. AWS Managed Service for Apache Flink dominates AWS-resident workloads by default. None of these has clearly won, and the choice between them often comes down to which other vendor relationships you already have.
Spark Structured Streaming is the "good enough" option for Spark shops. It is not Flink-quality for low-latency or stateful workloads, but if you already run Spark for batch and your streaming requirements are modest (latency in tens of seconds, simple transformations), it is the path of least resistance. This is what Databricks pushes.
ksqlDB has been quietly de-emphasized. Confluent has pivoted toward Flink SQL on Confluent Cloud as its primary stream processing story. ksqlDB still exists and is still useful for simple Kafka-to-Kafka transformations, but its strategic momentum has stalled.
Materialize is the bet on incremental view maintenance as a real category. Founded in 2019 by people who came from the academic world of differential dataflow, Materialize treats stream processing as "make a SQL view always up to date." Technically distinctive, with loyal users, but the category is small compared to Flink-style dataflow.
| Scenario | Stream processor? | Why |
|---|---|---|
| Real-time fraud detection or alerting | Yes (Flink) | Latency in milliseconds, stateful pattern detection |
| Continuously updated leaderboards or counters | Yes (Flink, Materialize) | Stateful aggregation over a stream |
| Joining a Kafka stream with a slowly-changing reference table | Yes (Flink, Materialize) | Streaming joins are exactly the use case |
| Hourly batch report from data in S3 | No | Use a warehouse and a workflow orchestrator |
| "I want my Kafka data in the warehouse, that's it" | Maybe | Decodable, Upsolver, Snowpipe Streaming, or Flink SQL |
| Simple Kafka topic-to-topic transformation | Yes (ksqlDB or Kafka Streams) | Lighter weight than full Flink |
| ML feature computation in real time | Yes (Flink) | Stateful feature pipelines feeding online models |
Open-source engines: Apache Flink, Spark Structured Streaming, Apache Beam, Kafka Streams, ksqlDB, RisingWave.
Commercial vendors of Flink: Ververica, Decodable, Confluent (Confluent Cloud Flink, via Immerok), AWS (Managed Service for Apache Flink), Aiven.
Other engines and approaches: Materialize, RisingWave, Feldera, Google Dataflow.
Stream processors are typically upstream of where TextQL connects. They sit between Kafka and the analytical destinations — warehouses, lakehouses, or real-time OLAP databases — that TextQL Ana queries. The stream processing layer determines what shape the data is in by the time TextQL sees it: whether events are pre-aggregated, joined, enriched, and how fresh they are. A well-designed streaming pipeline puts clean, queryable data in front of TextQL within seconds of the original event.