NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →


Apache Avro

Apache Avro is a row-based, schema-first data serialization format. It is the lingua franca of Kafka and the standard wire format for event streaming, where its strong schema evolution guarantees matter more than columnar scan performance.

Apache Avro is the row-based companion to the columnar formats that dominate analytics. Where Parquet stores data by column for fast scans, Avro stores data by row with a compact binary encoding and a strong schema. It is not what you would choose for running analytical queries on a petabyte-scale data lake. It is exactly what you would choose for serializing events flowing through Kafka, RPC messages between services, or change data capture streams — workloads where the unit of work is a single record and where schema compatibility over time is non-negotiable.

The metaphor: if Parquet is a filing cabinet with a drawer per field, Avro is a shipping envelope — compact, self-describing (the schema rides along or is referenced), and designed to survive a long journey through producers and consumers that don't always agree on exactly what the schema is today.
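
Concretely, an Avro schema is just a JSON document that travels with the data or is referenced by it. A minimal, hypothetical schema for a user event might look like this (the record and field names are illustrative, not from any real system):

```json
{
  "type": "record",
  "name": "UserEvent",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "occurred_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```

A record serialized against this schema is just the three field values in order; the field names never appear in the payload.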

Origin: Doug Cutting and the Hadoop Era

Avro was created in 2009 by Doug Cutting — the same person who created Apache Hadoop and Apache Lucene — as a new data serialization format for the Hadoop ecosystem. The motivation: existing options were all flawed. Java's built-in serialization was slow and Java-only. Thrift and Protocol Buffers required generating code from schema definitions, which created painful deployment coupling between producers and consumers. Writable (Hadoop's native serialization interface) was Java-only and evolved poorly.

Cutting wanted a format with three properties: language neutral, code-generation optional (so you could read and write Avro data dynamically from a schema string at runtime), and first-class schema evolution. Avro became an Apache top-level project in 2011.

For a few years, Avro was positioned as a general-purpose data format for Hadoop — file storage, RPC, MapReduce input/output. It was serviceable but never dominant for batch analytics, because by 2013 Parquet arrived and was dramatically better for columnar scans. Avro's status as an analytics format declined just as its status as an event streaming format took off.

Why Avro Won Event Streaming

Avro became the de facto standard serialization format for Apache Kafka, almost accidentally. Confluent (the commercial company behind Kafka) built the Confluent Schema Registry around Avro in 2015. The Schema Registry lets producers publish schemas, assigns each schema a unique ID, and lets consumers look up schemas by ID to deserialize messages. This model worked extraordinarily well for Kafka, and Avro's properties made it the natural fit:

  • Row-based layout. Kafka messages are individual records, not batches of columnar data. A row-based format is the right shape.
  • Compact binary encoding. No field names in the payload; just ordered values defined by the schema. This is substantially smaller than JSON.
  • Self-describing via schema. The schema is not optional. Every Avro message is interpreted against a schema, which means consumers know exactly what they are reading.
  • Strong schema evolution rules. Avro defines clear, mechanical rules for forward and backward compatibility: which changes are safe, which are breaking, what happens if a reader sees a newer or older schema than what's in the data. This is the killer feature.
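
The compactness is easy to see by hand. Avro encodes a record as its field values in schema order: integers as zigzag-encoded varints, strings as a length prefix followed by UTF-8 bytes. Here is an illustrative pure-Python sketch for a hypothetical record with fields (name: string, age: long) — real producers would use a library such as fastavro or the official avro package rather than hand-rolling this:

```python
import json

def zigzag_varint(n: int) -> bytes:
    """Encode a long the way Avro does: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)      # zigzag: small magnitudes become small codes
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def encode_user(name: str, age: int) -> bytes:
    """Avro binary encoding of a record (name: string, age: long)."""
    payload = name.encode("utf-8")
    # string = length prefix + bytes; long = zigzag varint; fields in schema order
    return zigzag_varint(len(payload)) + payload + zigzag_varint(age)

record = {"name": "ada", "age": 36}
avro_bytes = encode_user(**record)
json_bytes = json.dumps(record).encode("utf-8")
print(len(avro_bytes), len(json_bytes))  # 5 vs 26 bytes for this record
```

Five bytes against twenty-six for the JSON equivalent — and the gap widens as field names get longer, because Avro never pays for them.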

Schema evolution is the thing Avro is unambiguously best at. In a long-running Kafka topic, producers and consumers are deployed on different cadences. A producer rolls out a new version that adds a field. Old consumers, still on the previous schema, need to keep reading the stream without blowing up. Avro's model makes this a solved problem: as long as you follow the compatibility rules (e.g., new fields must have defaults), old consumers can read new messages and new consumers can read old messages.
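
As a concrete (hypothetical) example: suppose version 1 of the schema had only user_id and event_type. A version 2 that is safe in both directions adds a field, but only because the new field carries a default:

```json
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "region", "type": ["null", "string"], "default": null}
  ]
}
```

A consumer still on v1 ignores region in new messages; a consumer on v2 reading old messages gets the default, null. Drop the "default" and the second half of that guarantee disappears.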

Schema Evolution Rules, in Brief

Avro's schema compatibility model, enforced by tools like the Confluent Schema Registry, is the spec sheet most production Kafka platforms live by:

  • Backward compatible: the new schema can read data written with the old schema. Adding a field with a default is backward compatible; removing a field is backward compatible (the new reader simply ignores it in old data).
  • Forward compatible: the old schema can read data written with the new schema. Adding a field is forward compatible (old readers skip fields they don't know); removing a field is forward compatible only if the old reader schema defines a default for it.
  • Fully compatible: both backward and forward compatible.

These rules sound academic but are absolutely load-bearing for any serious event-driven architecture. They are the reason a company can run a Kafka topic for five years and still evolve its schema without coordinated big-bang deployments.
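
The registry's contract is also visible on the wire. Confluent's serializers prepend a 5-byte header to every Avro payload — a zero "magic" byte, then the schema ID as a 4-byte big-endian integer — which is how a consumer knows which registered schema to resolve against. A minimal stdlib-only sketch of parsing that framing (the helper function is hypothetical, not part of the Confluent client API):

```python
import struct

def split_confluent_message(msg: bytes) -> tuple[int, bytes]:
    """Split a Confluent-framed Kafka message into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", msg[:5])  # 1 magic byte + u32 BE id
    if magic != 0:
        raise ValueError(f"not Confluent wire format (magic byte {magic})")
    return schema_id, msg[5:]

# A fabricated message: header for schema ID 42 plus an opaque Avro payload.
framed = struct.pack(">bI", 0, 42) + b"\x06adaH"
schema_id, payload = split_confluent_message(framed)
print(schema_id)  # 42
```

The consumer then fetches schema 42 from the registry (and caches it), and deserializes the payload against its own reader schema using the compatibility rules above.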

Avro vs Parquet (Row vs Column)

This is the cleanest dichotomy in the data format world:

| Property | Avro (row) | Parquet (column) |
| --- | --- | --- |
| Layout | Row by row | Column by column |
| Best for | Streaming, record-at-a-time | Batch analytics, column scans |
| Write-heavy vs read-heavy | Write-heavy friendly | Read-heavy friendly |
| Compression ratio | Good | Excellent |
| Per-record access | Fast | Slow |
| Scan/aggregate queries | Slow | Fast |
| Schema evolution | Gold standard | Good, but depends on table format |
| Typical use | Kafka, RPC, CDC | Data lake, warehouse |

The punchline: Avro and Parquet are not competitors. Most modern data architectures use both. Avro serializes events in motion through Kafka. When those events land in a data lake, they are typically converted to Parquet for long-term storage and analytical querying. The Confluent ecosystem and tools like Kafka Connect make this handoff routine.

Avro's Supporting Roles

Avro also shows up in a few other places worth knowing:

  • Apache Hudi log files. Hudi's merge-on-read tables store row-level change logs as Avro files alongside the Parquet base files. This is a case where a row-based format is exactly right: the logs capture individual record mutations.
  • Confluent Schema Registry. The de facto schema registry for Kafka speaks Avro natively (and also JSON Schema and Protobuf).
  • Debezium. The leading open-source CDC tool serializes change events as Avro by default in Kafka deployments.
  • Hadoop and Spark job inputs/outputs. Still supported, though less common than Parquet in modern stacks.

Honest Take

Avro is a permanent, load-bearing piece of the data ecosystem in one specific role: the wire format for event streaming and CDC. In that role it is excellent and unchallenged. For analytical storage, Parquet beat it years ago and is not looking back. The right mental model: if the data is in motion and handled one record at a time, reach for Avro; if the data is at rest and scanned in bulk, reach for Parquet. Most modern architectures need both.

How TextQL Works with Apache Avro

Avro is rarely the format TextQL queries directly. Instead, Avro is typically the format of the event stream that becomes the data Ana queries. Change events flow through Kafka as Avro, land in a data lake as Parquet, register as an Iceberg or Delta table, and that is where TextQL Ana takes over — running LLM-generated SQL against the resulting analytical tables. The Avro schema registry also plays a role upstream: the schemas defined there are effectively the contract that governs what downstream analytical tables will look like.

See TextQL in action

Apache Avro
Created: 2009, within the Apache Hadoop project
Creator: Doug Cutting (also created Hadoop)
Apache top-level project: 2011
License: Apache 2.0
Type: Row-based, schema-first serialization format
Primary use case: Event streaming (Kafka), RPC, schema registries
Category: Table Formats
Monthly mindshare: ~150K · Kafka schema standard; row-oriented analog to Parquet