NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →


Storage Primitives

Cloud object storage and distributed file systems — the bottom layer of the modern data stack where every byte ultimately lives.

Object storage is like a really, really big USB drive in the cloud. You put files in, you get files out. It doesn't care what's in the files — could be CSVs, Parquet files, photos, whatever. It just stores bytes.

That's the entire mental model. Storage primitives are the absolute bottom layer of the data ecosystem. Everything else — data lakes, lakehouses, warehouses, query engines — is built on top of this layer or exists in response to its limitations.

Why storage primitives matter

Think of storage as a parking garage. It holds cars, but it has no idea where anyone is driving. A data warehouse is more like a valet service — it knows the cars, knows the owners, and can retrieve what you want instantly. Storage doesn't do any of that. It just holds things.

This distinction is the key to understanding the entire modern data stack. Storage gives you three things:

  1. Durability — your data won't disappear. Cloud object stores replicate across multiple data centers. S3 famously advertises 99.999999999% (eleven 9s) durability. You are more likely to be struck by lightning than to lose a file in S3.
  2. Scalability — it's basically infinite. You never need to "provision more storage." You just keep writing files.
  3. Cost — it's the cheapest way to store data, period. Orders of magnitude cheaper per GB than a data warehouse.
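The eleven-nines figure is easier to appreciate as an expected-loss calculation. A rough sketch, assuming losses are independent per object per year (which is how the design target is usually stated; the object count is a made-up example):

```python
# Back-of-the-envelope: what eleven 9s of annual durability implies.
# Assumes independent per-object loss; 10M objects is a hypothetical bucket.
annual_loss_probability = 1 - 0.99999999999  # ~1e-11 per object per year

objects = 10_000_000
expected_losses_per_year = objects * annual_loss_probability

print(f"P(losing any given object this year): {annual_loss_probability:.1e}")
print(f"Expected objects lost per year: {expected_losses_per_year:.1e}")
# Even with 10 million objects, you'd expect to wait on the order of
# 10,000 years before losing a single one.
```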

What storage does not give you: query capability, schema awareness, transactions, or any understanding of what's inside the files. That's why every other layer of the stack exists.

The three types of storage

Not all storage works the same way. There are three fundamental paradigms, and understanding the differences explains a lot about why the data stack looks the way it does.

### Object storage

Object storage is the dominant model for analytics data. You store objects (files) in buckets (top-level containers), each identified by a unique key (its path). There's no directory hierarchy — that /data/2024/01/events.parquet path is just a flat string key, even though tools render it as folders.
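The flat-namespace point can be made concrete with a toy in-memory "bucket." This is a sketch, not any real SDK: `list_objects` just mimics the shape of S3's prefix-filtered listing over flat string keys.

```python
# Object storage has no real directories: each object is one flat key.
# "Folders" are an illusion created by prefix queries over those keys.
bucket = {
    "data/2024/01/events.parquet": b"...",
    "data/2024/02/events.parquet": b"...",
    "logs/app.log": b"...",
}

def list_objects(store, prefix=""):
    """Toy stand-in for S3's ListObjectsV2: filter flat keys by prefix."""
    return sorted(k for k in store if k.startswith(prefix))

print(list_objects(bucket, prefix="data/2024/"))
# ['data/2024/01/events.parquet', 'data/2024/02/events.parquet']
```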

Object storage is optimized for throughput, not latency. Writing and reading large files is fast. Listing millions of objects or making thousands of small random reads is slow. This single tradeoff explains much of the design of table formats like Apache Iceberg and Delta Lake, which exist partly to work around object storage's listing and read overhead.
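A quick arithmetic sketch of why listing hurts at scale: S3's ListObjectsV2 returns at most 1,000 keys per request, so enumerating a large table's files requires a long chain of sequential, paginated calls (the object count here is illustrative):

```python
import math

# Why table formats avoid LIST-heavy query planning: listing is paginated.
# S3's ListObjectsV2 returns at most 1,000 keys per request.
objects = 10_000_000
keys_per_request = 1_000

list_requests = math.ceil(objects / keys_per_request)
print(f"LIST calls to enumerate the table: {list_requests:,}")  # 10,000

# A table format like Iceberg instead reads a few manifest files that
# already name every data file, turning O(objects) LIST calls into a
# handful of GETs regardless of table size.
```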

### Block storage

Block storage (AWS EBS, Azure Managed Disks, GCP Persistent Disks) is the cloud analogue of a server's local hard drive. It's raw storage organized into fixed-size blocks, attached to a single compute instance. It's fast for random reads and writes — which is why databases use it — but it doesn't scale independently, and you can't share it across machines without a distributed filesystem on top.

### File storage (NFS/shared filesystems)

Network-attached file storage (AWS EFS, Azure Files, Google Filestore) provides a traditional filesystem interface shared across machines. It's a middle ground: more structured than object storage, more shareable than block storage. Rarely used for large-scale analytics data because it's more expensive than object storage without meaningful advantages for that workload.

For the data ecosystem, object storage is the one that matters. When people say "storage layer" in a data stack context, they mean object storage.

The major players

### Amazon S3

S3 is the de facto standard. Launched in 2006, it essentially invented cloud object storage and defined the API that everyone else copied. When you see "S3-compatible" on a product page, they mean it speaks the S3 REST API. This is true of Google Cloud Storage, MinIO, Cloudflare R2, Backblaze B2, and dozens of others.
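"S3-compatible" means you can point the same tooling at a different backend just by swapping the endpoint. A sketch using the AWS CLI's `--endpoint-url` flag; the bucket name and account ID are placeholders, not real deployments:

```shell
# Same CLI, same API, different storage backend.
# Bucket names and the account ID below are illustrative.
aws s3 ls s3://my-bucket/ \
  --endpoint-url https://play.min.io   # a MinIO deployment

aws s3 cp events.parquet s3://my-bucket/data/ \
  --endpoint-url https://<accountid>.r2.cloudflarestorage.com   # Cloudflare R2
```

This is the practical payoff of S3 winning the API war: one set of credentials-and-endpoint configuration is the only per-vendor difference.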

S3 won the API war. This matters because it means tooling built for S3 works almost everywhere, and it means AWS has less storage lock-in than you might expect — the real lock-in is in the ecosystem around S3 (IAM roles, VPC networking, Lambda integrations, proximity to Redshift and EMR compute).

### Google Cloud Storage (GCS)

GCS is functionally equivalent to S3. It supports its own native API plus an S3-compatible interoperability mode. The main reason to use GCS over S3 is that the rest of your stack is on Google Cloud — BigQuery reads from GCS natively, Dataproc lives in GCP, and data movement between GCP services in the same region is free.

### Azure Blob Storage

Azure Blob Storage is Microsoft's equivalent. It has its own API conventions (containers instead of buckets, blobs instead of objects) and does not natively speak the S3 API; S3-compatible access requires a third-party gateway such as MinIO. Azure shops use it because their Synapse, Databricks-on-Azure, and Fabric ecosystems assume it.

### HDFS

HDFS (Hadoop Distributed File System) is the legacy option. It was the storage layer of the Hadoop era — a distributed filesystem that ran across a cluster of commodity Linux machines. HDFS gave you durability through replication and scalability through adding nodes.

HDFS is largely being replaced by cloud object storage. The economics are straightforward: maintaining an HDFS cluster requires managing servers, while S3 requires a credit card. Most organizations still running HDFS are either mid-migration to the cloud or have specific on-premises requirements. New greenfield architectures almost never choose HDFS.

### The honest comparison

All three cloud object stores are functionally identical for analytics workloads. The storage itself is a commodity. Nobody switches cloud providers because of object storage — they pick the storage that matches the cloud they're already on. The differentiators are all in the surrounding ecosystem: IAM, networking, compute engine proximity, and managed service integration.

Cost tiers: hot, warm, cold, archive

Every cloud provider offers tiered storage pricing because not all data is accessed equally. The tradeoff is always the same: cheaper storage, more expensive or slower retrieval.

| Tier | Use case | Retrieval | Example (AWS) |
| --- | --- | --- | --- |
| Hot | Frequently accessed data, active analytics | Instant | S3 Standard |
| Warm | Infrequently accessed, still needs fast reads | Instant, higher per-read cost | S3 Standard-IA |
| Cold | Rarely accessed, bulk retrieval OK | Minutes to hours | S3 Glacier Instant / Flexible Retrieval |
| Archive | Compliance, long-term retention | Hours to days | S3 Glacier Deep Archive |

The price spread is dramatic — archive storage can be 20-50x cheaper per GB than hot storage. Lifecycle policies automatically move aging data down the tiers. This is well-understood infrastructure, not a competitive differentiator.
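To make the spread concrete, here is a rough cost comparison. The per-GB-month prices are ballpark US-East figures at time of writing, included purely as assumptions for illustration; check current pricing before relying on them:

```python
# Illustrative per-GB-month prices (ballpark S3 us-east-1 figures;
# treat these as assumptions, not quoted rates).
price_per_gb_month = {
    "hot (S3 Standard)":       0.023,
    "warm (S3 Standard-IA)":   0.0125,
    "cold (Glacier Flexible)": 0.0036,
    "archive (Deep Archive)":  0.00099,
}

gb = 100 * 1024  # 100 TB expressed in GB
for tier, price in price_per_gb_month.items():
    print(f"{tier:26s} ${gb * price:>10,.2f}/month")

ratio = price_per_gb_month["hot (S3 Standard)"] / price_per_gb_month["archive (Deep Archive)"]
print(f"hot vs archive spread: ~{ratio:.0f}x")  # ~23x at these example prices
```

Retrieval fees run the other way: pulling data back out of archive tiers costs extra, which is why lifecycle policies should only demote data that genuinely won't be read.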

The relationship to everything above

Storage primitives are the foundation, but they're deliberately dumb. The entire modern data stack is a series of layers that add intelligence on top of raw storage:

  • Table formats (Iceberg, Delta Lake, Hudi) add schema, ACID transactions, and time travel to files sitting in object storage.
  • Query engines (Trino, DuckDB, Presto) add the ability to run SQL against those files.
  • Data lakes are just organized object storage — a set of conventions for how files are laid out in buckets.
  • Lakehouses are storage + table formats + query engines composed together.
  • Data warehouses skip the composition entirely and provide storage, schema, and compute as a bundled service.
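The "data lakes are conventions" point can be sketched in a few lines. One widespread convention is Hive-style partitioning, where partition values are encoded directly into the object key; the helper below is hypothetical, just to show the shape:

```python
from datetime import date

# Data lakes are conventions over flat keys. Hive-style partitioning
# encodes partition values (here, a date) into the key itself, so
# engines can prune whole date ranges by prefix alone.
def partition_key(table, dt, filename):
    """Hypothetical helper: build a Hive-style partitioned object key."""
    return f"{table}/year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/{filename}"

key = partition_key("events", date(2024, 1, 15), "part-0000.parquet")
print(key)  # events/year=2024/month=01/day=15/part-0000.parquet
```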

Understanding storage as the base layer makes the rest of the stack legible. Every architectural decision above this layer is, in some sense, compensating for the fact that storage is "just bytes" and the rest of the business needs structure, speed, and semantics.


How TextQL works with storage primitives

Storage is a solved problem — the hard part is making sense of everything that sits on top of it. TextQL connects across the storage, table format, and compute layers so teams can query their data wherever it lives without needing to know which bucket, format, or engine is underneath. Read more about TextQL in the stack.

See TextQL in action
