NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →

Snowpipe

Snowpipe is Snowflake's continuous, file-based ingestion service: the way most Snowflake customers get data into the warehouse in near real time.

Snowpipe is the conveyor belt that drops new files into Snowflake automatically. You point it at a cloud storage location (an S3 bucket, an Azure container, a GCS bucket), tell it which target table to load into, and from then on, every new file that lands in that location gets loaded into Snowflake within a minute or two — without you running a COPY INTO statement, scheduling a job, or sizing a virtual warehouse.

The simple metaphor: Snowpipe is the mailroom of the data warehouse. Files arrive at the loading dock, the mailroom picks them up, and they end up sorted into the right table on the right shelf. You don't have to walk down to the dock and check.

Origin Story

Snowpipe became generally available in 2017, three years after Snowflake's core warehouse went GA. It was launched to solve a specific, embarrassing problem with the original Snowflake design: a "modern" cloud warehouse still required customers to run manual COPY INTO commands on a schedule to load data. That was acceptable in 2014 when Snowflake was selling against Teradata and the bar was low, but by 2017 the competitive landscape included streaming-native systems and the modern data stack was forming around the idea that data should just arrive. Customers wanted "set it and forget it" ingestion; Snowflake had nothing to offer them.

Snowpipe was the answer. Architecturally, it was a small but important shift: Snowflake would maintain its own serverless ingestion fleet (separate from customer virtual warehouses), watch cloud storage for new files via cloud-native event notifications (S3 Event Notifications, Azure Event Grid, Google Cloud Pub/Sub for GCS), and load files within ~1 minute of arrival. Customers didn't have to size warehouses, didn't have to schedule jobs, and got billed in tiny per-file increments.

In 2023, Snowflake added Snowpipe Streaming, a row-level streaming API that bypasses the file-staging step entirely. That second iteration was the response to a different competitor: anyone using Kafka, Kinesis, or a CDC tool who wanted true low-latency streaming and was getting impatient with the file-based model. Snowpipe Streaming pushes the latency floor down from ~1 minute to a few seconds.

How It Works

Classic Snowpipe is conceptually three pieces:

1. A stage. A pointer to a cloud storage location. Snowflake calls these "external stages" and they're just metadata that tells the warehouse "watch this S3 prefix."

2. A pipe. A small object that wraps a COPY INTO target_table FROM @stage statement. The pipe defines what gets loaded, into which table, with which file format and transformations.

3. An event hook. Either (a) the cloud provider notifies Snowflake automatically when a new file lands (event-based, the default and recommended path), or (b) your code calls Snowpipe's REST API to say "here are some new files." Snowflake then queues those files, picks them up on its serverless ingestion fleet, and loads them.
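Under those three pieces, a minimal pipe definition looks like the sketch below. The bucket, integration, and table names are hypothetical; the DDL shape follows Snowflake's documented CREATE STAGE / CREATE PIPE syntax.

```sql
-- 1. A stage: metadata pointing at a cloud storage location.
--    (Bucket, integration, and object names here are hypothetical.)
CREATE STAGE raw_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration;

-- 2. A pipe: a thin wrapper around the COPY INTO statement that
--    defines what gets loaded, into which table, with which format.
--    AUTO_INGEST = TRUE enables the event-notification path.
CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
  COPY INTO events_table
  FROM @raw_stage
  FILE_FORMAT = (TYPE = 'PARQUET');

-- 3. The event hook: wire S3 event notifications to the queue shown
--    in the pipe's notification_channel column. After that, new files
--    load automatically.
SHOW PIPES;
```

Once the cloud-side notification is routed, there is nothing left to run: files that land under the prefix are queued and loaded by Snowflake's serverless fleet.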

You pay per file loaded, not per warehouse-hour. There is no virtual warehouse to size for ingestion — Snowflake handles compute on its end. This pricing model is one of the reasons Snowpipe took off: for spiky, file-based workloads, it's dramatically cheaper than running a dedicated warehouse just to handle COPY INTO.
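Because billing happens per file on Snowflake's side, the usual way to see what a pipe actually costs is the PIPE_USAGE_HISTORY table function. A sketch (time window and grouping are illustrative):

```sql
-- Credits consumed, bytes loaded, and file counts per pipe
-- over the last 24 hours.
SELECT pipe_name,
       SUM(credits_used)   AS credits,
       SUM(bytes_inserted) AS bytes_loaded,
       SUM(files_inserted) AS files_loaded
FROM TABLE(INFORMATION_SCHEMA.PIPE_USAGE_HISTORY(
       DATE_RANGE_START => DATEADD('day', -1, CURRENT_TIMESTAMP())))
GROUP BY pipe_name;
```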

Snowpipe Streaming is a different beast. Instead of files, you push individual rows (or small batches of rows) through a Java SDK or a Snowflake Connector for Kafka. Rows land in the target table within seconds, not minutes. There's no file staging, no event notification, no COPY INTO. The tradeoff: you have to integrate the SDK into your producer, and the per-row pricing model is different. For Kafka users especially, Snowpipe Streaming is now the default path — it's cheaper, faster, and simpler than the file-based approach.

What It's Good At

  • Loading the output of upstream batch jobs. A dbt Cloud job, a Spark job, or a Fivetran sync writes Parquet to S3; Snowpipe picks it up automatically.
  • CDC pipelines from operational databases. Tools like Fivetran, Airbyte, and Estuary (or Debezium, via a storage sink connector) drop change files into S3, and Snowpipe loads them within a minute.
  • Webhook and event payloads landing in object storage. A Lambda or Cloud Function writes JSON files to S3; Snowpipe loads them as semi-structured data into a VARIANT column.
  • Replacing scheduled COPY INTO. Anywhere you used to run a cron job that called COPY INTO, Snowpipe is almost always cheaper and lower-latency.
  • Kafka ingestion (via Snowpipe Streaming). The Snowflake Connector for Kafka now writes through Snowpipe Streaming and is the supported path for streaming Kafka data into Snowflake.
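The webhook/JSON pattern above typically boils down to a pipe whose COPY INTO targets a single VARIANT column. A sketch, with hypothetical table and stage names:

```sql
-- Raw webhook payloads land as one JSON document per row.
CREATE TABLE webhook_events (payload VARIANT);

CREATE PIPE webhook_pipe AUTO_INGEST = TRUE AS
  COPY INTO webhook_events
  FROM @webhook_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- Downstream, fields are queried with path notation and casts:
SELECT payload:event_type::STRING    AS event_type,
       payload:timestamp::TIMESTAMP AS event_ts
FROM webhook_events;
```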

What It's Not Good At

  • True sub-second latency. Even Snowpipe Streaming targets seconds, not milliseconds. If you need real-time analytics with millisecond freshness, you want a real-time OLAP system (ClickHouse, Druid, Pinot), not Snowflake.
  • Heavy in-flight transformation. Snowpipe is a loader, not an ETL engine. You can do simple column mappings and casts in the COPY INTO statement, but anything substantial belongs in a downstream transformation layer (dbt, Snowpark, scheduled tasks).
  • Many tiny files. Snowpipe is happiest with files in the 100–250 MB range. If you stream millions of 1KB files, you'll pay per-file overhead and ingestion will lag. Either batch upstream or use Snowpipe Streaming.
  • Strict ordering guarantees. Files can be loaded out of order. If your downstream logic depends on strict event ordering, you need to handle that with a sequence column or merge logic, not assume Snowpipe preserves it.
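Since files can load out of order, the usual guard is a MERGE keyed on an id with a sequence check, so a stale change row never overwrites a newer one. A sketch with hypothetical table and column names:

```sql
-- Late-arrival guard: apply a change row only if its sequence
-- number is newer than what the target already holds.
MERGE INTO customers t
USING staged_changes s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.seq > t.seq THEN
  UPDATE SET t.name = s.name, t.seq = s.seq
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, seq)
  VALUES (s.customer_id, s.name, s.seq);
```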

The Opinionated Take

Snowpipe is one of the least glamorous and most important products in the Snowflake stack. Nobody buys Snowflake for Snowpipe, but almost every production Snowflake deployment depends on it. It quietly removed an entire category of customer pain (manual file loading, warehouse sizing for ingestion) and did so with a pricing model that aligned with how customers actually used it.

The interesting strategic shift is the move from file-based Snowpipe to row-based Snowpipe Streaming. File-based ingestion was a concession to the world Snowflake grew up in, where everyone already had S3 and the Modern Data Stack was file-batch underneath. Snowpipe Streaming is Snowflake admitting that the next decade is row-based and event-driven, and that if they don't own that ingestion path, Confluent or a streaming-native warehouse will. The two coexist, but the center of gravity is moving toward streaming.

The honest comparison to Databricks: Databricks Auto Loader is the structural equivalent of Snowpipe (file-based, event-triggered, schema-evolving), and Delta Live Tables plus Structured Streaming sit in the same conceptual neighborhood as Snowpipe Streaming. Both companies arrived at similar designs for the same reason — the customer wants files to disappear from the dock and reappear in tables, automatically, with low latency, without managing a cluster.

How TextQL Fits

Snowpipe is mostly invisible to TextQL: by the time TextQL Ana queries Snowflake, Snowpipe has already done its job. But it matters in one specific way — Snowpipe is what makes Snowflake feel "fresh enough" for AI-driven exploration. When a business user asks Ana about today's data, the reason that data exists is almost always because Snowpipe (or its streaming sibling) loaded it within the last few minutes. A well-configured Snowpipe pipeline is the difference between an AI analyst that feels live and one that feels stale.

See TextQL in action

Snowpipe
Released: 2017 (GA)
Vendor: Snowflake
Type: Continuous (micro-batch) data ingestion
Sources: S3, Azure Blob, GCS; REST API; Kafka (via Snowpipe Streaming)
Category: Data Warehouse
Monthly mindshare: ~30K · ingestion-only feature; Snowflake customers doing streaming loads