Amazon S3
Amazon S3 (Simple Storage Service) is the object storage service AWS launched in 2006. Its API became the de facto standard for object storage on the internet, and S3 is the storage substrate beneath nearly all modern data lakes and lakehouses.
Amazon S3 is, more than any other single product, the foundation of the modern internet's data layer. Launched on March 14, 2006 — before AWS was even called AWS in the way we mean it today — S3 was the first general-purpose cloud object storage service, and its API became the de facto standard for object storage everywhere. When you say "object storage" in 2026, you essentially mean "an API-compatible reimplementation of S3." That standardization is the most important thing about S3, more important than any individual feature.
S3 is also the storage substrate beneath almost every modern data lake and lakehouse. The Parquet files that hold your Iceberg or Delta tables live in S3. The raw JSON your application logs land in S3. The intermediate files between Spark stages spill to S3. The model weights your ML team trains land in S3. If a piece of data is large and durable and not a row in a transactional database, it is overwhelmingly likely to be stored as an S3 object.
S3 is an object store, not a file system. The distinction matters. A file system is hierarchical (folders contain files, files have a path) and supports operations like rename, append, and partial overwrite. An object store is a flat key-value namespace — you have a "bucket" (a top-level container) and inside the bucket, every object has a unique key (which can contain slashes, but those slashes are part of the key, not real folders) and a blob of bytes. The operations are roughly: PUT (upload an object), GET (download an object), LIST (list keys with a prefix), DELETE (remove an object). That's mostly it.
The simple analogy: a file system is like a filing cabinet with drawers and folders. S3 is like a giant hash table where every entry has a name and some bytes. You don't navigate it; you look things up by key.
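The flat-namespace model is easy to sketch. The toy class below is a pure-Python illustration (not the real S3 or boto3 API): a bucket modeled as a dictionary, mirroring the four core operations.

```python
class ToyObjectStore:
    """A bucket as a flat dict: keys are strings (slashes allowed but not
    special), values are byte blobs. Mirrors S3's four core operations."""

    def __init__(self):
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data  # full overwrite; no append, no partial write

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str = "") -> list[str]:
        # "Folders" are just a prefix convention over the flat namespace.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)


bucket = ToyObjectStore()
bucket.put("logs/2026/01/app.json", b'{"evt": "start"}')
bucket.put("logs/2026/02/app.json", b'{"evt": "stop"}')
print(bucket.list("logs/2026/"))  # → ['logs/2026/01/app.json', 'logs/2026/02/app.json']
```

Note that listing by prefix is how every "folder view" of S3 actually works: the hierarchy is a naming convention, not a structure the store knows about.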
This minimalism is the secret of S3's scale. By giving up file-system semantics (especially atomic rename and partial writes), S3 could be massively distributed, absurdly durable (eleven nines), and cheap; it was eventually consistent at first and has been strongly consistent since December 2020. That tradeoff (no append, no rename) is exactly the gap that table formats like Iceberg and Delta Lake were invented to paper over.
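To see why the missing rename matters, here is a sketch of how clients emulate it (in real S3, a CopyObject followed by a DeleteObject). The two-step dance is why the classic HDFS pattern of "write to a temp path, then atomically rename to commit" is unsafe on object stores.

```python
def emulated_rename(objects: dict[str, bytes], src: str, dst: str) -> None:
    """'Rename' on an object store is really copy + delete. Between the two
    steps both keys exist, and a crash mid-way leaves both behind: readers
    never observe an atomic swap."""
    objects[dst] = objects[src]  # step 1: copy (server-side in real S3)
    del objects[src]             # step 2: delete the original


store = {"tmp/part-0001.parquet": b"rows"}
emulated_rename(store, "tmp/part-0001.parquet", "table/part-0001.parquet")
print(sorted(store))  # → ['table/part-0001.parquet']
```

Table formats sidestep this entirely: instead of renaming data files into place, they write immutable files once and commit by updating a small metadata pointer.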
S3 was the first AWS service released after SQS, and arguably the one that made AWS into AWS. The pitch was radical at the time: pay-as-you-go storage with a REST API, durable enough that you could trust it, cheap enough that startups could afford it, no upfront commitment. The launch price was $0.15 per GB-month, which was already shockingly cheap compared to running your own SAN.
Within a few years, S3 was the default place to put any blob of data on the internet. Image hosts moved to S3. Backups moved to S3. Static websites moved to S3. By the 2010s, S3 was so ubiquitous that the API itself became the standard — not because anyone formally standardized it, but because every storage product on Earth eventually had to support "S3 compatibility" or be ignored. Today, almost every object storage product implements the S3 API: MinIO, Wasabi, Backblaze B2, Cloudflare R2, Google Cloud Storage (via its interoperability mode), Ceph, and even Azure Blob Storage indirectly, through third-party translation layers. The S3 API is the POSIX of cloud storage.
The opinionated take, in plain English: S3 won because it was first, simple, and good enough to never need to be replaced. The API has barely changed in 20 years. There are exactly four operations you really need (PUT, GET, LIST, DELETE), and S3 had them in 2006. Every feature added since — versioning, lifecycle policies, encryption, access points, multipart upload, strong consistency — has been additive, not breaking.
Compare this to the storage protocols that existed before S3. NFS was complex and tied to file system semantics. Hadoop's HDFS was tied to a JVM cluster you had to operate yourself. SAN protocols were enterprise-only and came with hardware. None of them had a simple HTTP API. None of them were billed by the GB-month with no commitment. S3 made storage feel like a utility, and once you can use storage like a utility, you stop wanting to operate it.
The result is that the entire data lake / lakehouse architecture only works because S3 exists (and because Google Cloud Storage and Azure Blob Storage exist as compatible alternatives in their respective clouds). The idea that you can store petabytes of Parquet files cheaply, durably, and on someone else's hardware — and then point any query engine on Earth at them — is a direct consequence of S3.
S3 has many storage classes, but the mental model is simple: hot, frequently read data lives in S3 Standard; data touched less often moves to Standard-IA; archives go to the Glacier tiers (Instant Retrieval, Flexible Retrieval, Deep Archive); and Intelligent-Tiering automates the movement when access patterns are unpredictable. Each step down is cheaper per GB-month but costs more, in retrieval fees or latency, to read back.
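Transitions between classes are usually automated with lifecycle rules rather than done by hand. A hedged example of the lifecycle configuration shape S3 accepts (the rule ID, prefix, and day counts here are illustrative placeholders):

```json
{
  "Rules": [
    {
      "ID": "tier-down-logs",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

This rule moves objects under `logs/` to Standard-IA after 30 days, to Glacier after 90, and deletes them after a year, with no compute of your own involved.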
S3 storage is cheap. S3 egress is not. AWS charges for data transferred out of S3 to the public internet at rates that have made the egress fee one of the most-complained-about line items in cloud bills. This pricing structure is also a strategic moat: it makes it expensive to leave AWS. Cloudflare R2 and Backblaze B2 specifically launched as S3-compatible alternatives that don't charge egress, and have grown rapidly on that pitch alone.
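A rough worked example makes the asymmetry concrete. The rates below are assumptions (first-tier list prices in the style of us-east-1, which vary by region, volume tier, and time), but the shape of the math holds:

```python
# Illustrative S3 pricing arithmetic. Rates are assumptions based on
# first-tier list prices and change over time; check current pricing.
STORAGE_PER_GB_MONTH = 0.023  # S3 Standard storage
EGRESS_PER_GB = 0.09          # transfer out to the public internet

def monthly_storage_cost(gb: float) -> float:
    """Cost to keep `gb` gigabytes in S3 Standard for one month."""
    return gb * STORAGE_PER_GB_MONTH

def one_time_egress_cost(gb: float) -> float:
    """Cost to move `gb` gigabytes out of AWS once."""
    return gb * EGRESS_PER_GB

gb = 1024  # ~1 TB
print(f"store 1 TB for a month: ${monthly_storage_cost(gb):.2f}")
print(f"move 1 TB out once:     ${one_time_egress_cost(gb):.2f}")
```

At these rates, storing a terabyte for a month costs roughly a quarter of what it costs to move that terabyte out once, which is exactly the moat described above.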
S3 sits at the very bottom of the analytics stack, beneath query engines, table formats, and warehouses. It is the substrate. Nearly every other tool in the data stack either reads from or writes to S3 (or its GCS/ADLS equivalents): query engines like Athena, Trino, and Spark read objects directly; table formats like Iceberg and Delta Lake store their metadata and data files as S3 objects; warehouses like Snowflake and Redshift load from it and increasingly query it in place.
TextQL Ana connects to S3 indirectly through the warehouse or query engine that sits on top of it. When a business user asks Ana a question about data stored as Parquet files in an S3-backed lakehouse, Ana sends the query to the underlying engine (Athena, Trino, Databricks SQL, Snowflake on Iceberg) and returns the result. Ana doesn't replace S3 or the engines that read it — it's the natural-language layer on top.