Apache Avro
Apache Avro is a row-based, schema-first data serialization format. It is the lingua franca of Kafka and the standard wire format for event streaming, where its strong schema evolution guarantees matter more than columnar scan performance.
Apache Avro is the row-based companion to the columnar formats that dominate analytics. Where Parquet stores data by column for fast scans, Avro stores data by row with a compact binary encoding and a strong schema. It is not what you would choose for running analytical queries on a petabyte-scale data lake. It is exactly what you would choose for serializing events flowing through Kafka, RPC messages between services, or change data capture streams — workloads where the unit of work is a single record and where schema compatibility over time is non-negotiable.
The metaphor: if Parquet is a filing cabinet with a drawer per field, Avro is a shipping envelope — compact, self-describing (the schema rides along or is referenced), and designed to survive a long journey through producers and consumers that don't always agree on exactly what the schema is today.
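Avro schemas are themselves plain JSON documents, which is part of why they are so easy to store, version, and diff in a registry. A minimal record schema (the names here are illustrative) looks like:

```json
{
  "type": "record",
  "name": "UserSignedUp",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "plan", "type": "string", "default": "free"}
  ]
}
```

The `default` values are not decoration: they are what makes the schema evolution story described below actually work.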
Avro was created in 2009 by Doug Cutting — the same person who created Apache Hadoop and Apache Lucene — as a new data serialization format for the Hadoop ecosystem. The motivation: existing options were all flawed. Java's built-in serialization was slow and Java-only. Thrift and Protocol Buffers required generating code from schema definitions, which created painful deployment coupling between producers and consumers. Writable (Hadoop's native serialization interface) was Java-only and evolved poorly.
Cutting wanted a format with three properties: language neutral, code-generation optional (so you could read and write Avro data dynamically from a schema string at runtime), and first-class schema evolution. Avro became an Apache top-level project in 2011.
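The compact binary encoding is simple enough to sketch by hand. Per the Avro specification, longs are zig-zag varint encoded, strings are length-prefixed UTF-8, and a record is just its fields written back-to-back in schema order — no per-field tags, because the schema supplies the structure. A minimal pure-Python illustration (the record schema in the comment is an assumed example):

```python
def encode_long(n: int) -> bytes:
    """Zig-zag then variable-length encode a long, per the Avro spec."""
    n = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro string: long byte-length prefix, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

# A record is its fields concatenated in schema order -- no field tags.
# Assumed schema: {"fields": [{"name": "name", "type": "string"},
#                             {"name": "age",  "type": "long"}]}
record = encode_string("ada") + encode_long(36)
print(record)  # b'\x06adaH'  (0x06 = zig-zag length 3, 0x48 = zig-zag 36)
```

This tag-free layout is why Avro messages are smaller than equivalent JSON — and also why a reader cannot decode them without knowing the writer's schema, which is exactly the problem the Schema Registry solves.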
For a few years, Avro was positioned as a general-purpose data format for Hadoop — file storage, RPC, MapReduce input/output. It was serviceable but never dominant for batch analytics, because by 2013 Parquet arrived and was dramatically better for columnar scans. Avro's status as an analytics format declined just as its status as an event streaming format took off.
Avro became the de facto standard serialization format for Apache Kafka, almost accidentally. Confluent (the commercial company behind Kafka) built the Confluent Schema Registry around Avro in 2015. The Schema Registry lets producers publish schemas, assigns each schema a unique ID, and lets consumers look up schemas by ID to deserialize messages. This model worked extraordinarily well for Kafka, and Avro's properties made it the natural fit: the compact binary encoding keeps per-message overhead low on high-throughput topics, schemas are JSON documents that are easy to store and version in a registry, code generation is optional so consumers can deserialize dynamically from a registry-fetched schema, and the schema evolution rules are precise enough for a registry to enforce mechanically.
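Concretely, Confluent's serializers prepend a small header to every Kafka message: a magic byte of 0, the 4-byte big-endian schema ID, then the Avro-encoded payload. A consumer peels the ID off and fetches the matching schema from the registry before decoding. A stdlib-only sketch of that framing (the payload bytes here are a stand-in, not a real Avro record):

```python
import struct

MAGIC_BYTE = 0

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Confluent wire format: magic byte 0, 4-byte big-endian schema ID, payload."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed Kafka message back into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Confluent-framed message")
    return schema_id, message[5:]

msg = frame(42, b"placeholder-avro-bytes")
schema_id, payload = unframe(msg)
print(schema_id)  # 42
```

Five bytes of overhead per message buys every consumer the ability to find the exact writer schema — the alternative, shipping the full JSON schema with each message, would dwarf most payloads.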
Schema evolution is the thing Avro is unambiguously best at. In a long-running Kafka topic, producers and consumers are deployed on different cadences. A producer rolls out a new version that adds a field. Old consumers, still on the previous schema, need to keep reading the stream without blowing up. Avro's model makes this a solved problem: as long as you follow the compatibility rules (e.g., new fields must have defaults), old consumers can read new messages and new consumers can read old messages.
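In practice that looks like the pair of schemas below: v2 adds a field with a default, so a consumer still on v1 simply skips the bytes it does not recognize, and a v2 consumer reading old v1 data fills in the default. (Record and field names are illustrative.)

```
v1 -- what old consumers were built against:
{"type": "record", "name": "OrderPlaced", "fields": [
  {"name": "order_id", "type": "long"}
]}

v2 -- adds a field WITH a default, so both directions stay compatible:
{"type": "record", "name": "OrderPlaced", "fields": [
  {"name": "order_id", "type": "long"},
  {"name": "currency", "type": "string", "default": "USD"}
]}
```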
Avro's schema compatibility model, enforced by tools like the Confluent Schema Registry, is the spec sheet most production Kafka platforms live by:

- BACKWARD (the registry's default): consumers using the new schema can read data written with the previous schema. Deleting fields is allowed; added fields must have defaults.
- FORWARD: data written with the new schema can be read by consumers still on the previous schema. Adding fields is allowed; only fields with defaults may be deleted.
- FULL: both directions at once, the safest and most restrictive mode.
- Each mode also has a TRANSITIVE variant that checks compatibility against all previous versions, not just the latest one.
These rules sound academic but are absolutely load-bearing for any serious event-driven architecture. They are the reason a company can run a Kafka topic for five years and still evolve its schema without coordinated big-bang deployments.
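A deliberately simplified sketch of the check a registry performs — real registries also handle type promotions, unions, aliases, and more; this covers only the added-field-needs-a-default rule, against hypothetical schemas:

```python
def added_fields_have_defaults(old_schema: dict, new_schema: dict) -> bool:
    """Backward compatibility (simplified): every field present in the new
    schema but absent from the old one must carry a default, so a reader
    using the new schema can still decode data written with the old one."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    return all(
        "default" in f
        for f in new_schema["fields"]
        if f["name"] not in old_fields
    )

v1 = {"type": "record", "name": "PaymentEvent",
      "fields": [{"name": "payment_id", "type": "long"}]}
v2_ok = {"type": "record", "name": "PaymentEvent",
         "fields": [{"name": "payment_id", "type": "long"},
                    {"name": "method", "type": "string", "default": "card"}]}
v2_bad = {"type": "record", "name": "PaymentEvent",
          "fields": [{"name": "payment_id", "type": "long"},
                     {"name": "method", "type": "string"}]}

print(added_fields_have_defaults(v1, v2_ok))   # True
print(added_fields_have_defaults(v1, v2_bad))  # False
```

A registry configured for BACKWARD compatibility would reject the `v2_bad` registration outright — which is precisely the guardrail that prevents a producer deploy from breaking every downstream consumer.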
This is the cleanest dichotomy in the data format world:
| Property | Avro (row) | Parquet (column) |
|---|---|---|
| Layout | Row by row | Column by column |
| Best for | Streaming, record-at-a-time | Batch analytics, column scans |
| Write-heavy vs read-heavy | Write-heavy friendly | Read-heavy friendly |
| Compression ratio | Good | Excellent |
| Per-record access | Fast | Slow |
| Scan aggregate queries | Slow | Fast |
| Schema evolution | Gold standard | Good, but depends on table format |
| Typical use | Kafka, RPC, CDC | Data lake, warehouse |
The punchline: Avro and Parquet are not competitors. Most modern data architectures use both. Avro serializes events in motion through Kafka. When those events land in a data lake, they are typically converted to Parquet for long-term storage and analytical querying. The Confluent ecosystem and tools like Kafka Connect make this handoff routine.
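A sketch of what that handoff looks like with Confluent's S3 sink connector — the property names follow the connector's documented configuration, but the connector name, topic, bucket, and registry URL are placeholders:

```json
{
  "name": "events-to-lake",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "order-events",
    "s3.bucket.name": "my-data-lake",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

One connector config, and Avro events in motion become Parquet files at rest — the row-to-column conversion happens in the pipeline, not in application code.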
Avro also shows up in a few other places worth knowing:

- Object container files: Avro's own on-disk file format, still used as a landing and interchange format in data lakes and supported as an import/export format by warehouses such as BigQuery.
- Apache Iceberg internals: Iceberg's manifest and manifest-list metadata files are themselves stored as Avro.
- Change data capture: Debezium and similar CDC tools commonly emit change events as Avro via a schema registry.
- RPC: Avro ships its own RPC layer, though it never achieved the adoption of gRPC or Protocol Buffers.
Avro is a permanent, load-bearing piece of the data ecosystem in one specific role: the wire format for event streaming and CDC. In that role it is excellent and unchallenged. For analytical storage, Parquet beat it years ago and is not looking back. The right mental model is: if the data is moving and one record at a time, reach for Avro; if the data is at rest and being scanned in bulk, reach for Parquet. Most modern architectures need both.
Avro is rarely the format TextQL queries directly. Instead, Avro is typically the format of the event stream that becomes the data TextQL Ana queries. Change events flow through Kafka as Avro, land in a data lake as Parquet, register as an Iceberg or Delta table, and that is where Ana takes over — running LLM-generated SQL against the resulting analytical tables. The schema registry also plays a role upstream: the Avro schemas defined there are effectively the contract that governs what downstream analytical tables will look like.