NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →


Google BigQuery

Google BigQuery — Google Cloud's serverless, pay-per-query data warehouse. Pioneered separation of storage and compute and remains the easiest warehouse to get started with.

Google BigQuery is a data warehouse you don't have to run. There's no cluster to provision, no nodes to resize, no "warehouse size" dropdown to pick. You drop data in, write SQL, and Google figures out the rest. The bill shows up at the end of the month based on how much data your queries scanned (or how many "slots" of compute you reserved).

If Snowflake is a rental car where you still have to pick your engine size, BigQuery is an Uber — you say where you want to go, and a car shows up. You never think about the car.

That "just show up and query" experience is BigQuery's single biggest advantage, and it comes from the fact that BigQuery is not really a product Google built for you. It's a commercial wrapper around a system Google built for itself almost two decades ago to search its own logs.

Origin Story: From Dremel to BigQuery

The story starts in 2006 inside Google, with a system called Dremel. Google engineers needed to run interactive SQL-like queries across petabytes of log data — web crawl stats, ad click streams, production traces — and MapReduce was too slow. MapReduce is a batch system; you submit a job and come back hours later. Dremel was designed to return results in seconds over trillions of rows.

The key insight of Dremel, published in a famous 2010 paper ("Dremel: Interactive Analysis of Web-Scale Datasets"), was to combine two ideas:

  1. Columnar storage with a nested record format (later open-sourced as the inspiration for Apache Parquet) so you only read the columns your query touches.
  2. A massively parallel tree architecture where a root node fans a query out to thousands of leaf workers, each scanning a shard of data, and the results are aggregated back up the tree.
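The two ideas above can be sketched as a toy: leaf workers each scan one shard of a single column and return partial aggregates, and a root merges them on the way back up the tree. Everything here (shard contents, worker count, the aggregation) is illustrative, not Google's implementation.

```python
# Toy sketch of Dremel-style tree execution. Each leaf worker scans one
# columnar shard of a single "clicks" column — columnar storage means only
# the queried column is ever read — and the root merges partial aggregates.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical columnar shards: each holds one slice of the "clicks" column.
shards = [[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]]

def leaf_scan(shard):
    # A leaf worker scans its shard and emits a partial (count, sum).
    return len(shard), sum(shard)

def root_aggregate(partials):
    # The root merges the partial aggregates coming back up the tree.
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_count, total_sum

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(leaf_scan, shards))

count, total = root_aggregate(partials)
print(count, total)  # equivalent to SELECT COUNT(*), SUM(clicks) → 12 52
```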

At Google's internal scale, Dremel could scan tens of billions of rows in seconds. In 2010, Google announced BigQuery as a commercial version of Dremel, and it became generally available in 2011 as one of the earliest Google Cloud products. The team was led by engineers including Jordan Tigani and Siddartha Naidu; Tigani later co-founded MotherDuck around DuckDB, partly as a reaction to what he saw running BigQuery for a decade.

BigQuery predates Snowflake's public launch by about three years and predates Redshift by about a year. It was, depending on how you count, the first true serverless cloud data warehouse.

How BigQuery Actually Works

BigQuery looks like a single black box, but inside it's four services bolted together:

1. Colossus (storage). Your tables live as columnar files in Colossus, Google's successor to GFS. The format is called Capacitor and it's Google's proprietary columnar encoding, with heavy compression and value-level indexing. Crucially, storage is fully separated from compute — Colossus is a planet-scale object store that BigQuery queries read from directly.

2. Dremel (query engine). When you run a query, BigQuery parses the SQL, builds a query plan, and dispatches it as a tree of workers called "slots." A slot is a unit of CPU + memory. Large queries get more slots and finish faster; small queries get fewer. You never see slots as individual machines because they're scheduled out of a shared pool.

3. Jupiter (network). Google's internal petabit-scale datacenter network. This is the unsung hero of BigQuery — because Jupiter is so fast, BigQuery can shuffle data between workers at speeds that would melt a normal cloud network. This is why BigQuery handles joins at scale surprisingly well.

4. Borg (scheduling). Google's cluster manager (the ancestor of Kubernetes) schedules slots onto physical machines. You never see this layer.

The practical result: you type SELECT ... FROM huge_table and within a few hundred milliseconds, thousands of workers across Google's fleet are scanning your data in parallel. No warehouse to wake up, no autoscaling lag.

Pricing: The Most Controversial Thing About BigQuery

BigQuery has two main pricing models, and picking the wrong one can cost you ten times more than the right one.

On-demand pricing charges per byte scanned — roughly $6.25 per TiB as of 2026. This sounds simple but has a nasty property: a SELECT * on a wide table is dramatically more expensive than a selective query, because columnar storage means you pay for every column you touch. Teams that don't know this get hit with $50,000 surprise bills.
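A quick sketch of the arithmetic, assuming the ~$6.25/TiB rate above (actual rates vary by region and change over time):

```python
# Back-of-the-envelope on-demand cost model. The rate is the one quoted
# in the text; check current regional pricing before relying on it.
PRICE_PER_TIB = 6.25  # USD per TiB scanned, on-demand

def query_cost_usd(bytes_scanned: int, price_per_tib: float = PRICE_PER_TIB) -> float:
    # BigQuery bills by bytes scanned, so cost scales with columns touched.
    return bytes_scanned / 2**40 * price_per_tib

# SELECT * over a 10 TiB table pays for every column:
full_scan = query_cost_usd(10 * 2**40)    # 62.50 USD
# Selecting 2 of 20 equally sized columns scans only ~1 TiB:
narrow_scan = query_cost_usd(1 * 2**40)   # 6.25 USD
```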

Capacity (reserved slots) pricing lets you buy slots — either flat-rate monthly commitments or the newer BigQuery Editions (Standard, Enterprise, Enterprise Plus) launched in 2023, which auto-scale slot usage within a range. This is flat-rate compute, like a traditional warehouse.

The rough rule: if you're spending more than about $10K/month on on-demand, switch to Editions. Below that, on-demand is almost always cheaper and simpler.
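The break-even behind that rule of thumb can be sketched in one line, again assuming the ~$6.25/TiB on-demand rate (Editions slot pricing varies by edition, commitment, and region):

```python
# Sketch of the arithmetic behind the "$10K/month" rule of thumb.
PRICE_PER_TIB = 6.25                 # USD per TiB scanned, on-demand
monthly_on_demand_spend = 10_000.0   # the threshold from the text

# How much scanning that spend buys each month; above this volume,
# reserved capacity starts to look attractive.
tib_scanned_per_month = monthly_on_demand_spend / PRICE_PER_TIB
print(tib_scanned_per_month)  # → 1600.0 TiB/month
```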

This is where BigQuery's positioning gets awkward. For years, Google pitched on-demand as "pay only for what you use." In practice, it's pay-per-query-shape, which is an alien concept to most data teams who think in terms of compute hours. The 2023 introduction of Editions was a quiet admission that the consumption model needed to look more like Snowflake's for BigQuery to compete in large enterprises.

Where BigQuery Fits, Vendor-Style

Google's official positioning: BigQuery is the analytical heart of a broader "data cloud" that includes BigLake (open-format table support), BigQuery ML (in-warehouse machine learning), Vertex AI integration, and Gemini-powered natural language querying. The pitch is that BigQuery is the only warehouse deeply integrated with a tier-one AI stack, because Google owns both ends.

What Google would rather you not focus on: BigQuery is structurally married to Google Cloud. If your org runs on AWS or Azure, the cross-cloud story — BigQuery Omni, which runs BigQuery compute inside other clouds — exists but is a second-class experience with real feature gaps. BigQuery is by far the strongest warehouse if you're on GCP and a much weaker fit if you're not.

BigQuery vs Snowflake. Snowflake is multi-cloud first; BigQuery is Google-first. Snowflake gives you fine-grained control over warehouses (T-shirt sizes, auto-suspend, per-workload isolation); BigQuery is more opaque and more automatic. Most teams pick Snowflake when they want control and multi-cloud portability, and BigQuery when they want to write zero operational code and they already live on GCP.

BigQuery vs Redshift. Not really a contest anymore. Redshift is faster than it used to be, but BigQuery is dramatically easier to operate and scales more smoothly. Redshift wins mostly on "we're already all-in on AWS and IAM integration matters."

BigQuery vs Databricks. Databricks is optimized for ML/data-science workloads and raw file access; BigQuery is optimized for SQL analytics and BI. They're converging but still clearly different cultures — Databricks feels like an ML platform that learned SQL, BigQuery feels like a SQL engine that learned ML.

What BigQuery Is Good At (and Not)

BigQuery is excellent at:

  • Ad hoc SQL over massive datasets. The Dremel DNA shows here — scanning a petabyte-scale table with a filter often takes under a minute.
  • Zero-ops for small teams. You literally cannot misconfigure a BigQuery cluster because there's no cluster.
  • Semi-structured data. JSON, nested and repeated fields, and arrays are first-class. BigQuery was built on a nested record model from day one, before the industry caught up.
  • Ecosystem integration inside GCP. Native connectors to GA4, Google Ads, Search Console, Firebase, Pub/Sub, Dataflow, and Looker Studio. If you're a digital-native company, much of your marketing data arrives in BigQuery for free.
  • In-warehouse ML. BigQuery ML lets you train models with CREATE MODEL SQL. It's not Databricks, but it's shockingly good for forecasting, clustering, and logistic regression without leaving SQL.
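As a concrete illustration of the last point, a BigQuery ML model is defined entirely in SQL. The dataset, table, and column names below are hypothetical; the statement follows BigQuery ML's documented CREATE MODEL form, and the client call is commented out because it needs credentials and a live project.

```python
# Hedged sketch of a BigQuery ML model definition: a logistic regression
# trained with plain SQL. All identifiers (my_dataset.churn_model, the
# customers table and its columns) are made up for illustration.
bqml_ddl = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `my_dataset.customers`
"""

# With credentials configured, this would run via the official client:
# from google.cloud import bigquery
# bigquery.Client().query(bqml_ddl).result()
```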

BigQuery is bad at:

  • Predictable costs on variable workloads. On-demand pricing rewards discipline and punishes experimentation. One rogue intern's SELECT * can blow a month's budget.
  • Sub-second latency. BigQuery queries have meaningful startup overhead — a trivial query takes ~1 second. It's not a replacement for ClickHouse or Druid for dashboards that need 50ms responses. BI Engine helps but doesn't fully close the gap.
  • Transactional or high-frequency update workloads. BigQuery supports DML, but it's not meant for thousands of small updates per second. Streaming inserts help but have their own quirks.
  • Multi-cloud portability. Data in BigQuery's native storage is locked in. Omni and BigLake help, but the core experience assumes GCP.
  • Fine-grained resource control. If you want to pin a specific query to a specific warehouse for performance isolation, Snowflake is more explicit.

Where the Puck Is Going

Three trends worth watching:

1. BigLake and Iceberg. Google's answer to the open-format trend is BigLake, which lets BigQuery query Apache Iceberg and Parquet files in Google Cloud Storage (and increasingly S3) with the same engine. This is Google hedging against the "warehouse as closed store" critique from Databricks.

2. Gemini in BigQuery. Google is integrating Gemini models into BigQuery Studio for natural-language-to-SQL and code assist. Expect the warehouse boundary to blur with the AI layer over the next two years.

3. BigQuery Editions as the new default. On-demand pricing will quietly become a niche option for small users. Enterprises will live on Editions, which looks much more like Snowflake credits.

TextQL and BigQuery

TextQL Ana connects natively to BigQuery through the standard SQL dialect and BigQuery's authorization model, including row-level and column-level security. Because BigQuery tables are schema-enforced and typically well-governed (especially under a Dataplex or Dataform setup), they tend to be the cleanest substrate for LLM-generated SQL. TextQL respects BigQuery slot reservations and project-level cost controls, which matters for teams on Editions pricing that want to avoid runaway on-demand scans.

See TextQL in action

Google BigQuery
Founded 2010 (public launch 2011)
HQ Mountain View, CA
Parent Google Cloud (Alphabet)
Category Data Warehouse
Based on Dremel (Google internal, 2006)
Storage Colossus (columnar, Capacitor format)
Pricing On-demand ($/TB scanned) or reserved slots
Monthly mindshare ~700K; serverless model means many casual users; ~3M GCP projects with BigQuery enabled; #1 cloud DW by user count