BigQuery Deep Intuition — Deep Tech Intuition

1. One-Sentence Essence

BigQuery is a multi-tenant, serverless implementation of Dremel — a tree-of-workers query engine — running over a columnar file format on a shared-nothing distributed file system, where storage and compute are separate services that scale independently and communicate over a Clos-network fabric fast enough to make remote disk feel local.

That’s a mouthful. Unpack it:

Multi-tenant, serverless: there is no “your cluster.” Every query you run borrows workers from a shared pool the size of a small datacenter, executes, and returns them. You never see a machine.
Dremel: a 2010-era Google research system whose execution model is a tree of coordinator → mixers → leaf workers. BigQuery is, fundamentally, productized Dremel.
Columnar file format: data is stored column-by-column (Capacitor format) on disk, so queries that read 3 columns out of 200 only touch ~1.5% of the bytes.
Shared-nothing distributed file system: storage is Colossus, Google’s successor to GFS. No single machine “owns” your table. Every byte is replicated and erasure-coded across many disks.
Storage and compute are separate: this is the architectural pivot. Compute clusters can scale to zero while data sits durably in storage. Two queries against the same table involve zero coordination at the storage layer.
Network fast enough to make remote disk feel local: Jupiter, Google’s datacenter network, gives every machine ~petabits/sec of cross-section bandwidth. This is the secret ingredient — without it, the separated storage/compute model would die from network latency.

If you remember one thing: BigQuery succeeds because the network is fast enough that storage doesn’t need to be near compute. That single physical fact is what lets the whole architecture exist. Every other property — elasticity, no provisioning, paying per-query — is downstream of that.

2. The Problem It Solved

In the late 2000s, if you had a petabyte of data and wanted to ask a question of it, you had two bad options.

Option one: a traditional MPP data warehouse. Teradata, Netezza, Oracle Exadata. These were specialized appliances — racks of CPU and tightly-coupled disk that you bought as a unit. They were fast for the data they were sized for, but they were rigid. Storage and compute were welded together: to store more data, you bought more compute (whether you needed it or not); to do a one-off massive query, you couldn’t borrow compute from somewhere else. Price tags ran into millions, capacity planning was a perpetual political fight, and “the warehouse is full” was a sentence engineers heard regularly.

Option two: Hadoop and MapReduce. Cheap commodity hardware, open source, and you could store everything. But MapReduce was fundamentally a batch system. A query took minutes to hours. Engineers wrote Java to ask questions that should have been a SELECT. Hive bolted SQL on top, but it just rewrote your query as MapReduce jobs underneath, so it was slow in the same way. Interactive analysis at scale didn’t exist.

Google had this exact problem internally. Engineers across the company needed to ask quick questions of huge logs — ad clicks, search behavior, indexing pipeline output. MapReduce was the wrong tool for “give me the top 100 user-agents that hit error 500 yesterday.” So in 2006 they built Dremel: a system designed from the start for interactive analysis of read-only data, using two key ideas — columnar storage of nested records, and tree-shaped distributed execution borrowed from web search infrastructure (web search engines had been dispatching queries across thousands of leaf servers and aggregating results for over a decade by then). Dremel could chew through a trillion rows in seconds. It was published in a 2010 paper.

In 2011, Google packaged Dremel as a public cloud service: BigQuery. The pitch was radical for the era — no clusters to provision, no nodes to size, just write SQL and get an answer. That serverless framing wasn’t marketing fluff; it was a direct consequence of how Dremel was already being used internally as a multi-tenant service. The unique thing BigQuery did, that nobody else in the cloud warehouse market did at first, was sell you the engine without selling you the cluster. (Snowflake, founded the same year, sold a similar separation-of-storage-and-compute story but with explicit virtual warehouses you sized — a different bet that turned out to also work.)

The deep insight: once your network fabric is fast enough, “data near compute” stops being a hard constraint, and once that constraint goes, you don’t need to own a cluster to query a petabyte. That insight is the entire reason BigQuery exists.

3. The Concepts You Need

Before we go further, you need the vocabulary. None of this is optional jargon — these are the words you need to think clearly about what comes next. Read this section carefully; later sections will reference these terms by name.

Storage and physical layout

Colossus: Google’s distributed file system, the successor to GFS. Every BigQuery byte lives in Colossus. It handles replication, erasure coding (Reed-Solomon), and durability. From BigQuery’s perspective, Colossus is “infinite, durable, somewhat-slow remote storage.” Treat it as your S3 equivalent, except integrated tighter with BigQuery’s metadata layer.
Capacitor: BigQuery’s columnar file format. Capacitor is what’s actually written into Colossus when you load data. It’s a successor to ColumnIO (the format from the original Dremel paper) and shares deep DNA with Apache Parquet — they were developed by overlapping people, and Parquet adopted Capacitor’s nesting-encoding scheme. Capacitor stores each column separately, applies aggressive compression (RLE, dictionary, frame-of-reference), and can encode deeply nested records without losing structure.
Shard (sometimes called a “block” or, in the original paper, a “tablet”): the unit of storage parallelism. A table is split into many shards. Each shard contains all columns for a contiguous slice of rows. When you query, BigQuery dispatches one slot per shard (roughly). Shard size is automatically managed — you don’t pick it.
Definition and repetition levels: a clever encoding (introduced in the Dremel paper, copied by Parquet) that lets a columnar format losslessly store nested records — ARRAY<STRUCT<...>> and friends — without having to read parent columns to reconstruct child structure. You don’t need to know the math; you need to know that nested data is a first-class citizen in BigQuery, not bolted on, and that’s why queries against nested fields don’t pay a structural penalty.

Compute and execution

Dremel: the query engine. Don’t think of it as software that runs on your behalf — think of it as a continuously-running, multi-tenant service inside Google datacenters that accepts queries from millions of customers and dispatches work onto a shared pool of workers.
Slot: a virtual unit of compute. Roughly: one slot ≈ one CPU core’s worth of in-flight work for one second. When you “use 1,000 slots for 5 seconds,” you’ve consumed 5,000 slot-seconds. Slots are the currency of BigQuery compute — pricing, quotas, scheduling, and reservations are all denominated in them. Importantly, the number of slots used by a query and the amount of slot-seconds consumed are different things. You can use 2,000 slots for 1 second or 100 slots for 20 seconds — same total work, different parallelism.
Borg: Google’s cluster manager (the spiritual ancestor of Kubernetes). When BigQuery needs workers to run a query, Borg allocates them onto physical machines somewhere in the datacenter. You will never interact with Borg directly. It just means: “your slots are scheduled by the same system that schedules Search and Gmail.”
Serving tree (or execution tree): the runtime topology of a query. A coordinator at the root, one or more layers of mixers in the middle, and many leaves at the bottom. The query is rewritten and pushed down the tree; partial results flow back up. We’ll spend a lot of time on this.
Leaves / leaf workers: the workers at the bottom of the serving tree. They actually read columnar data from Colossus and execute filters, projections, and partial aggregations. Most of the query’s CPU work happens here.
Mixers: intermediate-level workers that aggregate partial results from leaves (or from lower mixers) and pass them up. They exist because if 5,000 leaves all reported directly to one root, the root would be a bottleneck. Tree topology fans the aggregation out.
Coordinator / root: the worker that receives your query, plans it, dispatches the tree, collects final results, and returns them.

Network and shuffle

Jupiter: Google’s datacenter network fabric. It’s a Clos topology (multi-stage non-blocking switch) that delivers ~petabits/sec of cross-section bandwidth across an entire datacenter. The relevant fact: any machine can talk to any other machine at near-line-rate, and bandwidth is so high that “where data lives” stops mattering at the latency BigQuery cares about. This is what makes separating compute from storage tractable.
Shuffle: the act of redistributing intermediate data between query stages so that downstream workers see the rows they need. Joins, aggregations, and window functions all shuffle. BigQuery’s shuffle is in-memory and remote — intermediate data lives in a pool of dedicated memory-server nodes that producers write to and consumers read from. This matters: shuffle isn’t a barrier (consumers can read while producers are still writing), and it isn’t your slot’s problem (the memory is separate). But shuffle bytes still cost you wall-clock time and can blow out the shuffle quota.

Data organization

Partitioning: dividing a table into discrete sections by the value of one column (typically a date or timestamp, sometimes an integer range, sometimes ingestion time). Each partition is a logically separate set of shards. Queries with a WHERE clause on the partition column can prune partitions before scanning — paying for and reading only the partitions that matter.
Clustering: sorting the data within shards by one or more columns (up to four). Unlike partitioning, clustering doesn’t divide the table into separate buckets — it just orders the rows so that rows with similar cluster-column values end up co-located. Filters on clustered columns let BigQuery skip blocks within a shard. Clustering is BigQuery’s answer to “I have many high-cardinality filter columns and partitioning by all of them is impossible.”
Block pruning: at query time, BigQuery uses metadata about each shard (and within clustered shards, each block) to skip reads that can’t possibly match the filter. Partition pruning is one form; cluster pruning is another. Pruning is how BigQuery makes a table you couldn’t afford to fully scan still cheap to query.

Pricing / billing

On-demand pricing: you pay per byte scanned. ~$6.25 per TiB processed (US, current rate). This was the original BigQuery model and is still the default. A single project gets up to 2,000 concurrent slots, shared across queries, allocated automatically. You can’t see or control how many slots you got.
Capacity (Editions) pricing: you pay for slot-hours. You commit to a baseline of slots, optionally let it autoscale up to a max, and you’re billed per slot per second (1-minute minimum). Three editions exist: Standard (basic features), Enterprise (BQML, materialized views, idle slot sharing, security features), Enterprise Plus (CMEK, advanced governance, highest tier). On-demand and Editions can coexist on a per-project basis — it’s not an account-wide flag.
Active vs long-term storage: a partition or table that hasn’t been modified in 90 days drops to ~half storage cost automatically. No performance difference. This is one of those quietly-helpful BigQuery features people forget exists.

OK. That’s the vocabulary. From here on, I’ll use these terms without re-defining.

4. The Distilled Introduction

This section covers what a 10-hour BigQuery course covers, distilled. Think of this as “what a competent practitioner needs to be able to do” — the mental sequence from “I have data” to “I have insights.” If you’ve used BigQuery before, you can skim. If you haven’t, this is your fast track.

Getting set up

There is no installation. BigQuery is a service. You interact with it through:

The Google Cloud Console UI (the in-browser SQL editor with autocomplete, query plan viewer, etc.). Most engineers spend most of their BigQuery time here.
The bq CLI (a thin wrapper over the REST API). Useful for scripting.
Client libraries for every major language. Python’s google-cloud-bigquery is the workhorse.
JDBC/ODBC drivers for BI tools.
The Storage Read API and Storage Write API: gRPC interfaces for high-throughput read/write that bypass the SQL layer. We’ll come back to these.

To use BigQuery you need a Google Cloud project, billing enabled, and the BigQuery User or BigQuery Admin IAM role on it. Datasets live inside projects; tables live inside datasets. The fully-qualified name is project.dataset.table — backticks around the whole thing in SQL.

The unit hierarchy

project
└── dataset           ← like a schema; access is granted at this level
    └── table         ← partitioned + clustered; columnar; billed for storage
    │   └── partition ← physical division by date/timestamp/integer
    │       └── shard ← unit of parallel scan; auto-managed
    └── view          ← saved SQL; runs every time
    └── materialized view ← actually-stored, auto-refreshed view
    └── routine       ← UDF, stored procedure, table function

Datasets are regional (or multi-regional). You cannot join a US dataset with an EU dataset directly — the data physically lives in different places, and BigQuery enforces that separation. If you need cross-region work, you replicate.

Loading data

Five paths get data in. They differ mostly in latency, cost, and atomicity.

Batch load (bq load, load job via API, “Cloud Storage → BigQuery” connector). You drop files in Cloud Storage (CSV, JSON, Avro, Parquet, ORC) and ask BigQuery to ingest them. Free from a query-cost standpoint (loads consume “free” slots), atomic per-job, can append or replace, supports schema autodetect. This is the default for batch ETL. Use this unless you have a reason not to.
Storage Write API (gRPC, replaces the legacy tabledata.insertAll streaming API). Two modes: a default stream (at-least-once, simple) and application-created streams (exactly-once via offsets, transactional commit). Direct columnar writes — the data lands in proper Capacitor format almost immediately, no streaming buffer purgatory. This is what you should use for any new streaming pipeline.
Legacy streaming inserts (tabledata.insertAll). The original streaming API — JSON over HTTP, rows go to a streaming buffer first, then get periodically extracted into columnar storage. Has weird edge cases: rows in the buffer are visible to queries but invisible to copy/extract jobs; DML against recently-streamed rows used to fail outright (now sort of works). Newer code should use the Storage Write API instead.
DML statements (INSERT, UPDATE, DELETE, MERGE). Atomic, immediately visible. Slow and expensive at scale. Fine for occasional small mutations. Don’t use for bulk ingestion.
External tables / federated queries. Define a table whose data actually lives in Cloud Storage, Bigtable, Spanner, Drive, etc. Query in place. Slower than native storage, but handy when you don’t want to copy.

The big call: use Storage Write API for streaming, batch load for everything else, DML only for genuine row-level mutations.

A first useful workflow: load → query → optimize

You’ll spend most of your BigQuery life in this loop.

-- Create a dataset
CREATE SCHEMA `my-project.analytics`
  OPTIONS (location = 'US');

-- Create a partitioned, clustered table
CREATE TABLE `my-project.analytics.events` (
  event_time TIMESTAMP NOT NULL,
  user_id    STRING NOT NULL,
  event_type STRING,
  payload    JSON
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type;

-- Load from Cloud Storage
LOAD DATA INTO `my-project.analytics.events`
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/2024-*.parquet']
);

-- Query with partition pruning AND cluster pruning
SELECT user_id, COUNT(*) AS n
FROM `my-project.analytics.events`
WHERE DATE(event_time) BETWEEN '2024-06-01' AND '2024-06-07'
  AND user_id = 'user_42'
GROUP BY user_id;

That last query, on a 10-billion-row table, will scan a few megabytes — not because BigQuery is magic, but because partition pruning skipped 99% of the dates and cluster pruning skipped 99% of the user_ids within the remaining week. Without the partitioning and clustering, the same query scans the full table.

The fifteen things you’ll do most often

Here’s the working set. Memorize the shape of each, not the exact syntax:

Run a SELECT and see what it costs before running. Click the “validator” in the UI or use bq query --dry_run — BigQuery returns the bytes that would be processed without actually running. Always do this for queries you haven’t run before.
Filter on the partition column. Always. WHERE event_date BETWEEN x AND y or WHERE _PARTITIONTIME = ... for ingestion-time-partitioned tables. Forgetting this is the #1 way to burn money.
SELECT * EXCEPT(col1, col2) when you need most columns but not a few. Beats listing 80 column names.
Avoid SELECT *. Columnar storage means you pay only for columns you read. SELECT * reads everything.
WITH (CTEs) for readability — but CTEs are not materialized. They’re inlined into the outer query. If you reference a CTE three times, you compute it three times. For repeated heavy CTEs, materialize to a TEMP TABLE instead.
CREATE TEMP TABLE inside a script for intermediate results that get used multiple times. Cleaner than CTEs in those cases.
MERGE INTO for upserts. Atomic. The right tool when you have late-arriving updates against a base table.
Window functions (ROW_NUMBER() OVER (PARTITION BY ...), LAG, LEAD, etc.). Standard SQL, but watch for memory issues — see gotchas section.
UNNEST(array_column) to flatten a repeated field into rows. This is the BigQuery idiom for working with nested data — you don’t normalize first, you UNNEST at query time.
STRUCT(...) and ARRAY[...] constructors for building nested output, often inside ARRAY_AGG for collapsing related rows into a parent row’s nested array.
APPROX_COUNT_DISTINCT instead of COUNT(DISTINCT ...) when an exact answer doesn’t matter. HyperLogLog under the hood, dramatically cheaper.
Materialized views for queries that get run many times against slowly-changing data. BigQuery automatically incremental-refreshes them and can transparently route a base-table query to the materialized view.
Authorized views for sharing access to a slice of a table without granting access to the whole table.
Job history / INFORMATION_SCHEMA.JOBS for seeing what queries ran, what they cost, who ran them, and how long they took. This is your audit trail and your optimization starting point.
The query plan explanation (the “Execution Details” tab in the UI). Shows the stage-by-stage breakdown of your query — how many bytes shuffled, how long each stage took, where slot time was spent. Learning to read this is the single biggest level-up for a BigQuery user. If you’ve never opened it, open it on your next query.

Configuration that matters from day one

Location: pick the right region (or multi-region) for your dataset at creation time. Cannot be changed. EU vs US is a meaningful distinction for compliance and cost.
Cache results: queries that return the same result against unchanged data return cached results for free, for 24 hours. Default-on. Disable if you need fresh reads (rare).
Maximum bytes billed: a per-query safety cap. If a query tries to scan more than N bytes, it errors out instead of running. Set this organization-wide if you want hard cost controls.
Default partition expiration: per-dataset, automatically deletes old partitions. Essential for log-style tables.
Default table expiration: same idea but for whole tables. Useful in dev/sandbox datasets.
bq --use_legacy_sql=false: the default since 2017, but if you find old code with weird [project:dataset.table] syntax, that’s legacy SQL. Migrate it.

That’s the working knowledge. From here on, everything we cover is about why this works and how to do it well.

5. The Mental Model

There are four ideas that, once you have them in your head, let you predict almost any BigQuery behavior without consulting docs. Internalize these and the rest of the system becomes mechanical to reason about.

Core Idea 1: Storage and compute are separate services that meet at the network.

In a traditional data warehouse, compute and storage live on the same machine. The data is on the disk attached to the CPU that’s processing it. Performance comes from physical proximity. Capacity planning is “buy more boxes” — and you buy compute and storage together whether you wanted to or not.

BigQuery breaks this assumption. Your data lives in Colossus. Your queries run on stateless workers Borg conjures up on demand. The two layers communicate over Jupiter. There is no “your machine.” There is no “your cluster.” A query on your table this morning ran on a different physical worker than the same query this afternoon, and the storage didn’t move.

What this predicts:

Concurrency is a non-event for storage. A thousand queries can hit the same table simultaneously without any contention at the storage layer. There’s no “lock,” no “waiting for the disk.” The bottleneck, if any, is at the compute pool.
Storage is essentially infinite. You will never get a “disk full” error in any normal sense. The thing you’ll bump into is a quota (rows per stream, partitions per table, slot-seconds per query) — administrative limits, not physical ones.
You don’t tune storage separately from compute. The thing you “tune” is how much work the compute layer has to do — and that’s controlled by query shape, partitioning, and clustering, all of which exist to reduce the bytes of storage you ask compute to read.
Cold-start vs warm performance is mostly absent. Because there’s no cluster to warm up, there’s no bring-up cost for occasional queries. (Technically, there is some cost for very large queries to spin up many workers, but on the order of seconds, not minutes.)
Backups, snapshots, and time travel are storage features, not query features. BigQuery keeps 7 days of history “for free” and lets you query an older state with FOR SYSTEM_TIME AS OF. This works because it’s all just immutable Capacitor files in Colossus — older versions still exist.

Core Idea 2: Capacitor reads only the columns you ask for, only the shards that match.

The storage layer isn’t a passive byte store — it’s aware of what’s inside it. When you submit a query:

The planner figures out which columns the query touches. Capacitor stores each column separately, so the leaves only fetch those column files.
The planner figures out which partitions the query’s WHERE clause permits. Other partitions’ shards aren’t even listed.
Within the surviving partitions, if the table is clustered, the planner uses block-level metadata to figure out which blocks within shards could possibly contain matching rows. Other blocks aren’t read.
Within the blocks that are read, Capacitor’s encodings (RLE, dictionary) often let it answer simple predicates without decoding — running a comparison on dictionary IDs is dramatically faster than decompressing strings.

This is the chain of reductions that turns a 10-billion-row table into a 5-megabyte scan. Every layer is an opportunity to skip data, and the planner exploits all of them automatically — but only if your query gives it the information to.

What this predicts:

SELECT * is genuinely expensive. Every column you ask for is bytes scanned. With on-demand pricing, every byte costs money. With slot-based pricing, every byte costs slot-time.
Filters on partition / cluster columns are essentially free wins. They don’t just speed the query — they reduce the bytes scanned, which reduces the cost.
Filters that can’t prune (WHERE LOWER(col) = ..., function calls on the column) defeat pruning. Predicate pushdown only works on simple comparisons against the raw column.
Wide tables are fine. Don’t be afraid of 200-column tables. You only pay for what you read. This is the opposite of row-store advice.
Sometimes denormalizing into nested fields is faster than joining. A STRUCT or ARRAY lives in a single column; querying it doesn’t require a join. We’ll come back to this.

Core Idea 3: Queries are trees of stateless workers, with shuffle as the in-memory connective tissue.

When BigQuery runs your query, it doesn’t run it as a single program on one machine. It compiles your SQL into a DAG of stages, instantiates the DAG as a tree of workers (root → mixers → leaves), and runs the stages in roughly bottom-up order. Each stage:

Reads input (from Colossus, or from shuffle).
Does its work in parallel across many slots.
Writes output (to shuffle, or to the next stage’s input, or to results).

Between stages, data flows through shuffle — a separate, dedicated pool of memory-server nodes (and Colossus as overflow when memory is exhausted). Workers don’t talk to each other directly; they talk to shuffle. The producer worker writes its output to shuffle keyed appropriately; the consumer worker reads its inputs from shuffle.

What this predicts:

Workers are stateless and disposable. If a worker dies, BigQuery just picks up the work and runs it on a different worker, reading from shuffle to recover state. You will rarely notice a worker failure even on long queries — it’s invisible because it has to be (Google datacenter machines fail constantly).
Shuffle is the limiting resource for big joins and group-bys. If you GROUP BY user_id on a billion-row table, every row must be hashed and sent through shuffle. The bytes shuffled scales with the bytes of input, and you can hit shuffle quotas. Looking at “bytes shuffled” in the query plan is often more diagnostic than “bytes scanned.”
Two-table joins typically shuffle both sides by the join key, unless one side is small enough to broadcast — in which case BigQuery sends a copy of the small side to every leaf and avoids shuffling the big side. The threshold is “tens of MB to ~10 GB” — exact value isn’t documented but BigQuery picks automatically.
Window functions can shuffle the entire dataset to a single worker if the OVER clause has no PARTITION BY — which is why ROW_NUMBER() OVER (ORDER BY x) over a billion rows blows up with memory errors. The fix is to add a PARTITION BY that distributes work.
Dynamic execution adapts to data. BigQuery’s planner has access to runtime statistics; if early stages reveal the data is more skewed than expected, later stages adjust. You don’t usually need to think about this, but it’s why the same query on different data sometimes runs with different topologies.

Core Idea 4: Slots are the universal unit of compute, but you can’t see them.

Every CPU-second of work BigQuery does on your behalf is denominated in slot-seconds. A slot is roughly “one CPU core’s worth of work.” The query plan tells you how many slots a stage used and for how long.

But here’s the thing: you don’t allocate slots; they’re allocated to you. Under on-demand pricing, BigQuery’s scheduler decides how many slots your query gets based on its complexity, the table size, and overall system load. You can hit a soft cap (~2,000 concurrent slots per project, with bursts higher), but you can’t say “give this query 1,000 slots, please.” Under capacity (Editions) pricing, you reserve a baseline; queries draw from the reservation; if you’ve enabled autoscaling, more slots come in increments of 100 to handle bursts.

What this predicts:

Throwing more slots at a query has diminishing returns past a point. A query that processes 100 GB might use 800 slots productively. At 2,000 slots, the extra slots are sitting idle waiting for I/O or waiting for shuffle. More parallelism isn’t always faster.
A slow query is rarely “BigQuery is slow today.” It’s almost always either (a) shuffle bottleneck, (b) a single skewed stage where one slot has all the work, or (c) genuine data scale. The query plan tells you which.
Per-query latency has a floor. Even a tiny query takes ~1-2 seconds because the system has to plan, dispatch, and return results across the tree. BigQuery is optimized for throughput on large queries, not point-lookup latency. (BI Engine, an in-memory cache layer, exists specifically to bring sub-second latency to dashboard queries.)
Concurrency is a slot-pool problem. If twenty users all run massive queries simultaneously, they share a slot pool. Under on-demand, this means queries queue. Under reservations, this means autoscale kicks in or you hit your max.
Slots and bytes are correlated but not identical. A query that reads 1 TiB and does heavy aggregation might use far more slot-time than a query that reads 1 TiB and just filters. On-demand pricing bills the bytes; capacity pricing bills the slot-time. The choice between pricing models depends on which of those numbers dominates your workload.

These four ideas are the model. Memorize them. Most of what follows is consequences of them.

6. The Architecture in Plain English

Let’s walk a query through the whole system, end to end. You type:

SELECT user_id, SUM(amount) AS total
FROM `proj.sales.transactions`
WHERE DATE(event_time) = '2024-06-15'
  AND country = 'US'
GROUP BY user_id
ORDER BY total DESC
LIMIT 100;

Press run. Here’s what physically happens.

Step 1: Submission and metadata fetch

Your client (browser, bq CLI, Python library) hits the BigQuery REST API with a jobs.insert call. The request lands at a frontend service, which authenticates you (IAM check), validates the SQL syntactically, and creates a Job entity in BigQuery’s metadata store. The job has an ID; you can see it in the job history.

The job is handed to the query planner. The planner pulls the table’s metadata from BigQuery’s metadata service: schema, partitioning info, clustering info, table size, statistics, and crucially, the list of shards (with metadata about each shard’s content range — min/max values per column, used for pruning). This metadata fetch is the unsung hero of BigQuery — it’s what lets the planner know which 0.3% of the table needs to be read without doing any I/O on the data itself.

Step 2: Planning and rewriting

The planner builds an execution DAG. For our query, it might look like:

Stage 0:  read shards from sales.transactions (partition 2024-06-15 only)
          filter country = 'US'
          project user_id, amount
          partial GROUP BY user_id, partial SUM(amount)
          → write partial groups to shuffle, hash-keyed by user_id

Stage 1:  read from shuffle, grouped by user_id
          finalize SUM
          → write to next shuffle, sorted by total DESC

Stage 2:  TOP 100 from sorted input
          → write to result table

The plan exploits partition pruning (only the 2024-06-15 partition’s shards are listed), column pruning (only event_time, country, user_id, amount are read; payload and other columns are not), and predicate pushdown (country = 'US' filter happens during the scan, not after).

If the table were clustered on country, the planner would also enable block pruning — within each shard, only blocks containing ‘US’ would be read.

Step 3: Tree dispatch

The planner submits the DAG to Dremel’s dispatcher, which decides how many slots to allocate per stage and constructs the serving tree. It asks Borg for workers; Borg finds them somewhere in the datacenter and spins up Dremel processes on them. The workers form a tree: a root, one or more layers of mixers, and many leaves (typically hundreds, sometimes thousands depending on table size and complexity).

The query is rewritten as it descends the tree. The root receives the SQL. The root rewrites it into per-mixer subqueries. Each mixer rewrites again into per-leaf subqueries. By the time it reaches the leaves, each leaf has a tiny task: “read these specific shards, apply this filter, do this partial aggregation, send results to shuffle bucket N.”

Step 4: Leaf execution — the actual work

This is where the bytes are read. Each leaf:

Opens the Capacitor files in Colossus that contain its assigned shards. Only the relevant column files for its columns of interest.
Streams Capacitor blocks into memory. Blocks are decompressed (RLE → expanded values, dictionary IDs → strings).
Applies filters. Where possible, filters run on encoded data — comparing a dictionary ID to a target ID is faster than full decompression.
Applies the partial aggregation (SUM(amount) per user_id).
Writes its partial results to shuffle, hash-keyed by user_id so that all rows for the same user end up in the same shuffle bucket.

Leaves don’t communicate with each other. They only communicate with Colossus (read) and shuffle (write).

Step 5: Shuffle — the in-memory connective tissue

Shuffle is a separate service. It’s a pool of dedicated machines whose entire job is to hold intermediate query data in memory and serve it efficiently. When a leaf writes a row to shuffle keyed by user_id, the shuffle service routes that row to a specific shuffle node based on the hash of user_id. Rows for user_id=42 all go to one shuffle node. Rows for user_id=43 go to another (or possibly the same, depending on hashing).

This is a purely in-memory repartitioning, with disk overflow on Colossus only when memory fills. Crucially, it’s not a barrier — Stage 1 consumers can start reading rows from a shuffle bucket as soon as the first row is written there, even if Stage 0 producers haven’t finished. This pipelining is one reason BigQuery feels faster than systems like classic Spark, which historically waited for all map-side output before any reduce-side work began.

Step 6: Mixer / next-stage execution

Stage 1 workers (mixers, in tree topology) read from shuffle. Each Stage 1 worker handles a slice of the user_id space. It reads all partial sums for its slice (which, thanks to the hash routing in step 5, are all in its local shuffle buckets), finalizes the SUM, and writes the final (user_id, total) pairs to the next shuffle (this time keyed for sorting).

Step 7: Sort and limit

Stage 2 reads from the second shuffle, sorts within its slice, and applies LIMIT 100 per slice. Then the root collects the per-slice top-100 results, merges them, and produces the final top-100.

Step 8: Result delivery

The root writes the result to a temporary result table (also in Colossus, also in Capacitor format). The Job entity is updated to “DONE.” The client receives a notification (via polling or callback) and can fetch results either through the REST API (for small results) or through the Storage Read API (gRPC, for large results — much faster).

The cached result lives for 24 hours. If you re-run the same SQL against the same data, BigQuery will return the cached result without running anything.

Where state lives

Look at the trip we just took:

Persistent state: Colossus — your tables, results, table snapshots, time-travel history. Distributed file system.
Metadata state: BigQuery’s metadata service — schema, partitions, shards, statistics. Highly available distributed store.
Transient query state: shuffle service — intermediate query data. In memory, regional pool.
Compute state: workers — basically none. They read from Colossus or shuffle, write to shuffle or Colossus, and die. Stateless.

The fact that workers are stateless is what enables fault tolerance, autoscaling, and multi-tenancy. The fact that storage and compute are separate services is what enables independent scaling of each. The fact that the network is fast enough to make this all work is Jupiter.

That’s the architecture. Everything in BigQuery — every quirk, every gotcha, every judgment call — is downstream of this picture.

7. The Things That Bite You

These are the behaviors that confound smart engineers in their first six months. Each of them is a direct consequence of the mental model — but you have to internalize them to stop getting bitten.

1. Forgetting to filter on the partition column

You partition events by DATE(event_time). You write:

WHERE event_time >= '2024-06-15'

You expect partition pruning. You don’t get it — or rather, you might, but it depends on whether the planner can prove the filter aligns with partition boundaries. Some forms of date arithmetic break it. The truly safe form, especially for ingestion-time partitioned tables, is:

WHERE _PARTITIONTIME = TIMESTAMP('2024-06-15')
-- or for column-based partitions:
WHERE DATE(event_time) = '2024-06-15'

Why this happens: partition pruning depends on the planner being able to evaluate the predicate against partition metadata without looking at row data. Function calls, type coercions, or comparisons against complex expressions can defeat this. Always do a dry run after writing a query against a partitioned table. If the bytes-scanned estimate is “the whole table,” you didn’t prune.

A particularly nasty version: filtering on a column that equals the partition column but isn’t the partition column itself. For example, the table is partitioned by ingestion time (_PARTITIONTIME), and your data also has a created_at column that’s basically identical. Filtering on created_at won’t prune — only _PARTITIONTIME will.

2. CTEs are not materialized

WITH big_cte AS (
  SELECT ... FROM huge_table WHERE expensive_filter
)
SELECT * FROM big_cte JOIN other_table ON ...
UNION ALL
SELECT * FROM big_cte WHERE x = 1
UNION ALL
SELECT * FROM big_cte WHERE x = 2;

You wrote big_cte once. BigQuery will compute it three times — once per reference. CTEs are syntactic sugar; they get inlined. If big_cte reads 500 GB, you’ve billed for 1.5 TB.

The fix: if a heavy CTE is referenced multiple times, materialize it:

CREATE TEMP TABLE big_cte AS
  SELECT ... FROM huge_table WHERE expensive_filter;

-- Now references are cheap reads from the temp table

This trips up engineers from Postgres or Spark backgrounds, where CTEs sometimes do get materialized or planners de-duplicate. BigQuery’s planner doesn’t automatically deduplicate CTE references.

3. `LIMIT` doesn’t reduce cost

LIMIT 1000 on a multi-billion row table reduces wall-clock time (because shuffle work is reduced) but does not reduce bytes scanned. The full table scan happens; LIMIT applies after. Engineers from row-store backgrounds expect the opposite — in OLTP databases, LIMIT often lets the planner stop reading early. Not in a columnar system reading all shards in parallel.

Use LIMIT for output truncation only. To explore data, use the table preview feature (free, paginated, doesn’t count as a query). To reduce scan, add real predicates on partition/cluster columns.

4. Functions on filter columns kill pruning

WHERE LOWER(country) = 'us'      -- DEFEATS pruning
WHERE country = 'US'             -- ENABLES pruning (if clustered)

WHERE DATE(timestamp_col) = '2024-06-15'   -- depends on partition definition
WHERE timestamp_col >= TIMESTAMP('2024-06-15')
  AND timestamp_col <  TIMESTAMP('2024-06-16')   -- always prunable

When you wrap the column in a function, the planner generally can’t push the predicate down into the storage scan. It has to read everything and apply the function row-by-row. Same issue exists in Postgres, but it bites harder in BigQuery because you’re paying per byte scanned.

For partitioned tables, BigQuery has special-cased some forms (like DATE(timestamp_col) = constant against a DATE-truncated partition) but not all. Always verify with a dry run.

5. Window functions can serialize the entire query

SELECT
  *,
  ROW_NUMBER() OVER (ORDER BY created_at) AS rn
FROM events_billion_rows;

This window has no PARTITION BY. That means the entire result needs to be ordered globally. BigQuery has to send every row through a single worker (or a small number) to compute the running row number. Memory blows up. Query fails with Resources exceeded during query execution.

Fix: add a PARTITION BY whenever you can:

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at)

Now the work parallelizes. Each user’s rows go to one worker, but different users go to different workers.

If you genuinely need a global row number on a billion rows… reconsider. There’s almost always a way to express what you want without a global ordering.

6. Shuffle quotas are real and you’ll find them

For very large queries with massive joins or aggregations, you can hit:

Resources exceeded during query execution: Your project or organization
exceeded the maximum disk and memory limit available for shuffle operations.

This is not “your query is slow.” This is “your query couldn’t run at all.” The shuffle pool, while huge, isn’t infinite, and it’s shared across your project. A query that wants to shuffle 50 TB of intermediate data can simply not be admitted.

Mitigations: pre-filter aggressively before joining, replace string join keys with integer surrogate keys (massive shuffle reduction), break the query into multiple smaller queries with temp tables, or pay for more slots (Editions). The fundamental fix is “shuffle less,” which means thinking about what your query is actually asking the system to redistribute.

7. Streaming buffer makes recently-streamed data weird

If you use the (legacy) tabledata.insertAll streaming API, rows go into a streaming buffer first, then get extracted into Capacitor storage on a delay. While in the buffer:

Queries see them (good).
Copy and extract jobs don’t see them (surprising).
DML against them used to fail with “Streaming buffer error” (somewhat fixed now, but still has gotchas).
Time travel doesn’t work for them yet.

This is one of the reasons the Storage Write API was introduced — it writes directly to columnar storage with no streaming buffer in between. For new code, always use the Storage Write API; the legacy streaming API has all these edge cases waiting to bite you.

8. Joining with a single-row “config” table can defeat the optimizer

WITH config AS (SELECT '2024-06-15' AS d)
SELECT *
FROM big_table, config
WHERE event_date = config.d;

You’d think this is fine. Sometimes it is. Sometimes BigQuery doesn’t realize config.d is a constant in the way it would need to to apply partition pruning, and you scan the whole table.

The fix: declare it as a query parameter or a DECLARE variable in a script, both of which BigQuery handles as constants:

DECLARE d DATE DEFAULT '2024-06-15';
SELECT * FROM big_table WHERE event_date = d;

9. The 4,000-partition-per-table limit

You partition a table by user_id (high cardinality, integer). Or by hour over many years. Or by some combination that produces lots of partitions. You hit the limit:

Too many partitions created by this table. The maximum is 4000.

Why: every partition is metadata BigQuery has to maintain. Beyond 4,000, performance of metadata operations degrades, so there’s a hard cap.

This limit informs the partition vs cluster decision: if your filter column has high cardinality (many distinct values), partitioning by it doesn’t work. Cluster on it instead. Reserve partitioning for low-to-moderate cardinality columns where you have <4,000 distinct values — typically a date.

10. Skew makes some slots do all the work

SELECT key, COUNT(*) FROM events GROUP BY key;

If one value of key appears in 90% of rows, the slot that gets that hash bucket does 90% of the work. The other slots finish in seconds; the skewed one runs for minutes; the query latency is dominated by that one slot. The query plan shows “stage 1 has one slot at 99% utilization, others at 1%” — the textbook signature of skew.

Fixes are query-specific: pre-filter the dominant key separately, salt the key (GROUP BY CONCAT(key, MOD(some_id, 100)) then re-aggregate), or use APPROX_* functions if exact counts aren’t required.

Skew is fundamental to distributed systems. BigQuery’s hash-based shuffle assumes the hash function distributes work evenly. When data is genuinely lopsided, that assumption breaks. Watch for “Wait” time at the slot level in the query plan — that’s slots-blocked-on-other-slots, the calling card of skew.

8. The Judgment Calls

This section is what separates someone who uses BigQuery from someone who understands it. None of these have hand-wave-y answers; each is a real tradeoff with a defensible right answer in context. I’ll tell you what experienced engineers actually do and why.

Judgment Call 1: Partition or cluster?

Situation: you have a large table and want pruning.

Partition when: you have a single low-to-moderate cardinality column (date, often timestamp truncated to day), and queries reliably filter on that column. Partitioning gives you:

Cost predictability (dry run shows post-pruning cost).
Partition-level operations (delete a partition, set partition expiration, load into a specific partition).
Strong guarantees about what gets read.

Cluster when: you want pruning on a high-cardinality column, or on multiple columns, or the column has too many distinct values for partitioning. Clustering is softer (cost not known until after the query runs, since pruning happens at execution time) but more flexible.

Both when: you have a clear time dimension and a high-cardinality lookup column. This is the modal case for analytics tables. PARTITION BY DATE(event_time) CLUSTER BY user_id, event_type is the workhorse pattern.

The signal: look at your top 20 expensive queries’ WHERE clauses. The most common single low-cardinality filter is your partition column. The next 1-3 most common high-cardinality filters are your cluster columns. Use the “partitioning and clustering recommender” in the Cloud Console (it analyzes 30 days of query patterns).

What experienced engineers do: default to PARTITION BY DATE(<event_time>) CLUSTER BY <user_id or other primary lookup key>. They reach for that pattern as the starting point and only deviate when query patterns demand it.

Judgment Call 2: On-demand or capacity (Editions) pricing?

Situation: you’re choosing the billing model for a new project.

On-demand when: your usage is bursty, unpredictable, or low-volume; you don’t have steady-state slot needs above ~100 slots; you want zero capacity management. Pay per byte scanned. Up to 2,000 slots per project, allocated by the system.

Capacity (Editions) when: you have a predictable baseline workload that consistently uses many slots; you want cost predictability (slot-hours don’t depend on data volume); you want features like materialized view refresh, BQML beyond a certain scale, idle slot sharing across projects.

The signal: compute your monthly on-demand spend. If it’s regularly above ~$3,000-$5,000 from query costs, capacity pricing with a baseline commitment will likely save money. If you have spiky workloads (nightly batch + ad hoc), use a small commitment + autoscaling. If you barely query, stay on-demand.

What experienced engineers do: start on-demand for new projects. Move to a Standard or Enterprise reservation only when monthly bill stabilizes and you have weeks of usage data to size the commitment. Many orgs end up with a hybrid: some projects on-demand (dev, ad-hoc), some on Editions (production pipelines).

Pitfall: people commit to slots without measuring. Autoscaling rounds up to multiples of 100 and has a 1-minute minimum, so a workload that needs 30 slots for 10 seconds gets billed for 100 slots × 60 seconds. For tiny intermittent workloads, on-demand is often cheaper than the smallest commitment.

Judgment Call 3: Materialized view, scheduled query into a table, or live view?

Situation: you have a derived dataset that gets queried often.

Live view when: the derivation is cheap, the underlying data changes constantly, freshness matters more than cost, or it’s a security/access-control wrapper rather than a performance optimization.

Materialized view when: the underlying data appends or has limited updates, the aggregation is expressible in the (somewhat restricted) materialized view SQL, you want automatic refresh, and you want BigQuery to automatically substitute the materialized view into queries that reference the base table (this is a real superpower — your existing dashboards get faster without code changes).

Scheduled query → table when: the materialization logic is too complex for materialized view SQL, you need fine-grained control over refresh schedule, or you want to materialize multiple intermediate stages and pick which one queries hit.

The signal: if the same heavy query runs many times against slowly-changing data, you probably want a materialized view. If you find yourself wanting MERGE or stored-procedure logic in the materialization, you want a scheduled query into a table.

What experienced engineers do: prefer materialized views when the SQL fits — the automatic-substitution feature is genuinely magical, and the incremental refresh is well-engineered. Fall back to scheduled queries when MV restrictions bite (no OUTER JOIN outside specific patterns, no LIMIT, no analytic functions in some places).

Judgment Call 4: Storage Write API, batch load, or DML?

Situation: you need to get data in.

Batch load when: data arrives in files, latency tolerance is “available within an hour,” workload is hourly/daily ETL. Cheapest, simplest, atomic. Default choice.

Storage Write API when: you need rows queryable within seconds of generation, the source is application code or a streaming pipeline, you want exactly-once semantics. New code’s streaming default.

DML when: you have small, occasional row-level mutations (correct a value, delete a few records). Atomic, immediate, but slow and expensive at scale. Do not use for bulk ingestion. A nightly INSERT INTO ... SELECT of millions of rows is a common anti-pattern; use a load job instead.

Federated / external tables when: you genuinely don’t want to copy the data and the source format is supported. Slower than native, but avoids duplication. Useful for one-off analyses or for data that primarily lives elsewhere.

The signal: if you’re writing more than ~100 rows/second per stream sustained, and you can’t tolerate waiting for the next batch, you need streaming. Otherwise, batch.

What experienced engineers do: batch load whenever possible. For real-time use cases, use Dataflow as the pipeline orchestrator with the BigQueryIO connector that uses Storage Write API under the hood — this gets you exactly-once, retries, and dead-letter handling for free.

Judgment Call 5: Nested fields or normalized schema?

Situation: you have a one-to-many relationship — orders and their line items, users and their events.

Nested (ARRAY<STRUCT<...>>) when: the child rows are always queried in the context of the parent, the relationship is hierarchical, and you can tolerate slightly more verbose queries (UNNEST everywhere). Nested gives you join-free queries, smaller storage, and faster scans.

Normalized when: child rows are independently queried (you frequently query line items without their orders), or the child data is genuinely huge and varied, or you’re moving from a relational source where normalization is the natural shape.

The signal: ask “do my queries always JOIN parent + child, or do I sometimes query child alone?” If always joined, denormalize into nested. If sometimes alone, normalize.

What experienced engineers do: lean toward nested. BigQuery’s storage was specifically designed for nested data — it’s not a hack. The performance advantage is real, and the SQL becomes natural once you’re comfortable with UNNEST. Most people who avoid nested schemas do so because they’re uncomfortable with the SQL, not because of a real performance reason.

Judgment Call 6: Approximate or exact aggregations?

Situation: you need COUNT(DISTINCT), percentiles, top-K, etc.

Exact when: the answer is reported externally (regulatory, billing, customer-facing counts), or the cardinality is small enough that exact is cheap.

Approximate (APPROX_COUNT_DISTINCT, APPROX_QUANTILES, APPROX_TOP_COUNT) when: the answer is for internal analysis, dashboards, or anywhere a few percent error is fine.

The signal: how is the number being used? “I want to see roughly how many unique users hit this endpoint” is approximate. “How many users do we bill?” is exact.

What experienced engineers do: default to approximate for analytics. The performance gap is enormous — COUNT(DISTINCT) shuffles every distinct value through the system; HyperLogLog (which APPROX_COUNT_DISTINCT uses) sends tiny sketches. On a billion-row table, approximate runs in seconds vs minutes.

Judgment Call 7: Build a star schema or wide denormalized tables?

Situation: you’re modeling for analytics — the classic Kimball question.

Wide denormalized fact table when: the dimensions are small and slow-changing, queries always combine fact + dimensions anyway, you want maximum scan performance.

Star schema (separate fact and dimension tables) when: dimensions are large or change rapidly (slowly-changing dimensions are a real thing), you have multiple fact tables sharing dimensions and want a single source of truth for each.

The signal: in a row-store, star schema saves space (denormalization duplicates data). In BigQuery’s columnar format with aggressive compression, the storage cost of denormalization is dramatically lower — duplicated low-cardinality columns compress to almost nothing via dictionary encoding. So the traditional argument for star schemas is weakened.

What experienced engineers do: lean wider than they would in Postgres or Snowflake. A common pattern is to keep raw data in star schema (for upstream flexibility) and build wide denormalized tables for query consumption (for performance). The wide tables are materialized views or scheduled queries that snapshot the joined state.

Judgment Call 8: Optimizing for cost or for latency?

Situation: you have a slow, expensive query.

Optimize for bytes scanned (cost) when: on-demand pricing, ad-hoc query that runs occasionally, the query reads a lot of data.

Optimize for slot-time (latency) when: capacity pricing, query runs in a critical path, the query has heavy aggregation or shuffle relative to scan size.

The signal: open the query plan. Compare “bytes processed” to “slot time consumed” mentally. If a query reads 100 GB but uses 30 minutes of slot-time, the work is in the compute, not the I/O — partition/cluster pruning won’t help much; you need to optimize the algorithm. If a query reads 5 TB but uses 2 minutes of slot-time, it’s I/O bound — pruning is the lever.

What experienced engineers do: always check the plan before optimizing. The wrong mental model (“just add an index” — there are no indexes) leads to wasted effort. The plan tells you whether scan, shuffle, or compute is dominant, and that determines what to fix.

Judgment Call 9: Large query in one go, or break into stages?

Situation: you have a complex multi-step transformation.

One large query when: all stages can stream pipelined through shuffle, intermediate results aren’t reused, the planner can do a good job globally optimizing.

Multi-step with temp tables when: an intermediate is reused multiple times (CTE problem from before), debugging requires inspecting intermediates, the query is so complex the planner is making bad choices, or memory/shuffle limits force splitting.

The signal: if a single query keeps failing with resource-exceeded errors, or has wildly variable runtime, split it. If it runs reliably and you don’t reuse intermediates, leave it as one query.

What experienced engineers do: write a big query first; split only when the planner falls down. Splitting introduces orchestration concerns (what if step 2 fails?) and prevents global optimizations. Don’t pre-emptively over-engineer.

Judgment Call 10: Region selection — single-region or multi-region?

Situation: you’re picking where your dataset lives.

Single-region (e.g., us-central1, europe-west1) when: you want lowest cost, you have data residency requirements that map to a specific country, you’re co-locating with other GCP services in the same region.

Multi-region (US, EU) when: you want highest availability, you don’t need region-specific residency, and you can pay slightly more for storage in exchange for resilience.

The signal: what failure domain matters to you? Multi-region tolerates a regional outage transparently. Single-region is cheaper and simpler.

What experienced engineers do: use multi-region for production analytics workloads where resilience matters, single-region for dev/test or for compliance-bound data. Once you pick, you’re stuck — you cannot move a dataset between regions without copying everything. So get this right at creation time.

Judgment Call 11: BigQuery, BigLake, or external table?

Situation: you have data in Cloud Storage. Where should it live for querying?

Native BigQuery storage (load it in): best query performance, full feature set (DML, time travel, materialized views, streaming inserts), incurs storage cost in BigQuery. Default.

BigLake table over Parquet/ORC in GCS: BigQuery uses its query engine on data in open formats sitting in your GCS bucket. Great for data-lake patterns where multiple engines (Spark, Presto, BQ) need to read the same files. Slightly slower than native, but unified governance via BigQuery IAM and metadata.

External table (legacy federated): BigQuery treats GCS files as a table for queries. Slower, fewer features, no caching. Useful for one-off analyses.

The signal: who else needs this data? If only BigQuery, native. If a lake-first architecture with Spark/Iceberg/Hudi, BigLake. If just a “let me peek at this CSV,” external.

What experienced engineers do: native by default. BigLake when there’s a real lakehouse story. External tables for transient stuff that’s going to disappear.

Judgment Call 12: When not to use BigQuery

This isn’t a feature question; it’s an architectural one. BigQuery is wrong for:

OLTP workloads. Per-row inserts, transactional updates, primary-key lookups under 100ms. Use Spanner, Cloud SQL, or AlloyDB.
Sub-second dashboards on petabytes. BigQuery’s per-query latency floor is ~1-2 seconds. BI Engine helps, but for true sub-second, push aggregates to Bigtable or a caching layer.
High-frequency mutable data with point lookups. Use Bigtable (and federate from BigQuery if you also need analytics).
Tiny data. A 10 GB workload that runs hourly may be cheaper and simpler on Postgres, especially if it doesn’t need scale.
When you need indexes for selective lookups on arbitrary columns. BigQuery’s pruning is via partition + cluster, not arbitrary indexes. Search-index style queries belong in Elasticsearch / Vertex AI Search.

What experienced engineers do: pair BigQuery with operational stores. The pattern is: OLTP system (Spanner / Postgres) → CDC → BigQuery for analytics. Trying to use BigQuery for both ends in tears.

9. The Commands and APIs That Actually Matter

The Distilled Introduction covered commands in the context of a workflow. This section is the working practitioner’s reference — what you reach for and why.

Inspecting

-- What's in this dataset?
SELECT table_name, table_type, creation_time
FROM `proj.dataset.INFORMATION_SCHEMA.TABLES`
ORDER BY creation_time DESC;

-- What does this table actually look like (size, partitioning)?
SELECT *
FROM `proj.dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE table_name = 'mytable';

SELECT *
FROM `proj.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'mytable';

-- What did I run yesterday and what did it cost?
SELECT
  job_id, user_email, query, total_bytes_billed,
  total_slot_ms, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND statement_type = 'SELECT'
ORDER BY total_bytes_billed DESC;

INFORMATION_SCHEMA is your friend. It exposes nearly everything BigQuery knows about your queries, jobs, tables, partitions, and reservations. Learning to query INFORMATION_SCHEMA is the difference between guessing and knowing.

Estimating cost before running

bq query --use_legacy_sql=false --dry_run "
  SELECT user_id, COUNT(*) FROM \`proj.dataset.events\`
  WHERE DATE(event_time) = '2024-06-15'
"

Returns the bytes that would be processed. Multiply by $6.25 / TiB to get on-demand cost. Always dry-run unfamiliar queries against large tables.

Schema and table management

-- Make a table partitioned and clustered (most common pattern)
CREATE TABLE proj.ds.events
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type
AS SELECT * FROM proj.ds.staging_events;

-- Add a column (cheap, metadata-only operation)
ALTER TABLE proj.ds.events ADD COLUMN session_id STRING;

-- Drop a column (rewrites the column data — expensive on big tables)
ALTER TABLE proj.ds.events DROP COLUMN deprecated_field;

-- Set partition expiration (auto-delete old partitions)
ALTER TABLE proj.ds.events
SET OPTIONS (partition_expiration_days = 90);

-- Snapshot for backup
CREATE SNAPSHOT TABLE proj.ds.events_2024_06_15
CLONE proj.ds.events
FOR SYSTEM_TIME AS OF TIMESTAMP('2024-06-15 00:00:00');

Writing data efficiently

For batch loads, use the LOAD DATA SQL statement (or bq load CLI) and prefer Parquet or Avro over CSV/JSON. Parquet ingestion is dramatically faster and cheaper because BigQuery can skip the row-to-columnar conversion step in some cases.

-- Append from Cloud Storage Parquet files
LOAD DATA INTO proj.ds.events
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://bucket/path/*.parquet']
);

-- Replace specific partitions (overwrite a day's data)
LOAD DATA OVERWRITE proj.ds.events
PARTITION (DATE '2024-06-15')
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://bucket/2024-06-15/*.parquet']
);

For the Storage Write API, use the Google Cloud client library for your language. Don’t try to call the gRPC API directly unless you really need to — the libraries handle the stream lifecycle, retries, and offset bookkeeping for exactly-once semantics.

Querying patterns worth knowing

-- Anti-join via NOT EXISTS (often faster than NOT IN with NULLs)
SELECT a.* FROM tableA a
WHERE NOT EXISTS (SELECT 1 FROM tableB b WHERE b.id = a.id);

-- Pivot (BigQuery has native PIVOT and UNPIVOT)
SELECT * FROM (
  SELECT category, region, sales FROM proj.ds.fact
)
PIVOT (SUM(sales) FOR region IN ('US', 'EU', 'APAC'));

-- Conditional aggregation (often replaces multiple subqueries)
SELECT
  user_id,
  COUNTIF(event_type = 'click') AS clicks,
  COUNTIF(event_type = 'purchase') AS purchases,
  AVG(IF(event_type = 'view', duration, NULL)) AS avg_view_dur
FROM proj.ds.events
GROUP BY user_id;

-- ARRAY_AGG with LIMIT to get the most recent N per group
SELECT
  user_id,
  ARRAY_AGG(event ORDER BY event_time DESC LIMIT 10) AS last_10
FROM proj.ds.events
GROUP BY user_id;

-- QUALIFY (filter after window functions)
SELECT user_id, event_time, value
FROM proj.ds.events
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) = 1;

The last one — QUALIFY — is one of those features people don’t know about and miss for years. It’s a WHERE clause that runs after window functions. Without it, you have to wrap the query in a subquery to filter on a ROW_NUMBER() result. With QUALIFY, the code is half as long.

Scripting

BigQuery supports a procedural scripting language. For complex multi-step transforms:

DECLARE start_date DATE DEFAULT '2024-06-01';

CREATE TEMP TABLE staging AS
SELECT * FROM proj.ds.raw_events
WHERE DATE(event_time) >= start_date;

CREATE TEMP TABLE cleaned AS
SELECT * EXCEPT(_raw_payload),
  PARSE_JSON(_raw_payload) AS payload
FROM staging
WHERE _is_valid;

INSERT INTO proj.ds.events
SELECT * FROM cleaned;

Inside scripts you get variables, loops, conditionals, and exceptions. This is the right tool when a single SQL statement can’t express what you want.

Permissions you actually need to know

roles/bigquery.dataViewer — read tables.
roles/bigquery.dataEditor — read/write tables in a dataset.
roles/bigquery.user — run queries, create datasets/jobs in a project.
roles/bigquery.jobUser — run queries (but not create datasets).
roles/bigquery.admin — full control.

The granularity gotcha: you can grant dataViewer at the dataset level, table level, or column level (with policy tags). For row-level filtering, use authorized views or row-level access policies.

10. How It Breaks

When BigQuery misbehaves, the failure modes fall into a small number of categories. Knowing them by sight cuts debugging time enormously.

Failure mode 1: “Resources exceeded during query execution”

Symptoms: query fails with this error, possibly mentioning shuffle, memory, or analytic OVER clauses.

Root cause: a single stage of the query needs more resources than any one slot has, or more shuffle than the project quota allows. Almost always one of:

Skewed shuffle — one hash bucket has way more data than others.
Window function without PARTITION BY — global ordering forces serialization.
Cartesian-product blow-up — a CROSS JOIN or a JOIN with the wrong key.
Genuinely too much data — you’re trying to shuffle 100 TB through a project that doesn’t have the slot capacity.

How to diagnose: open the query plan. Look at the stage that failed. Check input rows, output rows, slot wait time, and per-slot work distribution. Skew shows up as “max slot time >> avg slot time.” Cartesian shows as “output rows >> input rows.”

How to fix: depends on which it is. Skew → salt the key, or pre-filter the dominant value. Window function → add PARTITION BY. Cartesian → fix the join condition. Data scale → break into smaller queries with temp tables, or get more slots.

Failure mode 2: Query is much slower than expected

Symptoms: query you ran yesterday in 10 seconds takes 5 minutes today.

Root cause (in rough order of probability):

Slot contention. On-demand has a shared pool; if your project (or a noisy neighbor) is using slots, your query queues. Capacity pricing has a hard limit; if your reservation is full, queries queue.
Data volume changed. The table grew. The partition you’re scanning has more data.
Plan changed. The optimizer made a different choice based on updated stats — maybe it switched from broadcast to shuffle join.
Cache miss. Repeated queries return cached results in milliseconds; if anything changed (schema, data, even a comment), the cache is invalidated.

How to diagnose: check INFORMATION_SCHEMA.JOBS for the same query before/after — compare slot time and bytes. Open both query plans side by side; look for stages that grew.

How to fix: depending on cause. Contention — buy slots, schedule away from busy times, or split workload across projects. Plan change — use query hints (@@optimizer_options) sparingly, or restructure SQL. Data growth — add or refine partition/cluster.

Failure mode 3: Costs spike

Symptoms: monthly bill jumps from $X to $5X with no obvious change.

Root cause (in rough order of probability):

Someone removed a partition filter in a query that runs frequently. A daily dashboard that used to scan 1 GB now scans 1 TB because the user changed WHERE date >= '2024-06-01' to WHERE date >= '2024-01-01'.
A scheduled query started failing and retrying. Each retry scans the data again.
A new BI tool connected and is issuing SELECT * FROM huge_table under the hood for previews.
Streaming costs grew — streaming inserts are billed per row and can sneak up on you.
SELECT * in a query someone copy-pasted from somewhere.

How to diagnose: query INFORMATION_SCHEMA.JOBS_BY_PROJECT for the time period, group by user / query hash, and look at top spenders by total_bytes_billed. The culprit usually pops out within minutes.

How to fix: short-term, kill the offender. Long-term, set a maximum_bytes_billed ceiling project-wide, set up cost alerts in Cloud Billing, and review INFORMATION_SCHEMA.JOBS weekly.

Failure mode 4: Streaming data isn’t visible / DML against streamed data fails

Symptoms: your application writes to BigQuery via streaming, but the rows aren’t queryable, or DML against them errors.

Root cause: legacy streaming buffer. Rows take a moment to be queryable, and DML-against-streaming-buffer has historically failed.

How to fix: migrate to the Storage Write API. The streaming buffer disappears as a concept; rows are immediately in Capacitor storage, and DML works normally.

Failure mode 5: Cluster pruning isn’t kicking in

Symptoms: you clustered the table on user_id, but a query filtering on user_id still scans the whole table.

Root cause (in order):

The filter expression isn’t a simple equality (you used LOWER(user_id) or wrapped in a function).
The WHERE predicate involves the cluster column but in a way the planner can’t push down (e.g., cluster column on the right side of a non-equality JOIN).
The clustering is genuinely fresh but the table hasn’t been re-clustered yet (BigQuery re-clusters in the background; freshly loaded data may not be sorted yet).

How to fix: rewrite the predicate as a direct equality or range on the raw cluster column. If recently loaded, wait — BigQuery’s automatic re-clustering is asynchronous and free, but takes time on big tables.

Failure mode 6: `MERGE` is slow / expensive

Symptoms: a nightly upsert via MERGE INTO target USING source ON target.id = source.id takes hours and bills heavily.

Root cause: MERGE against a partitioned table is most efficient when the source data is also partitioned by the same key, and when the JOIN condition includes the partition column. Without that, BigQuery may scan the entire target table to find rows to update.

How to fix: include the partition column in the JOIN condition (target.id = source.id AND target.date = source.date), so partition pruning applies on the target. Pre-filter the source if you can. Consider the Storage Write API’s CDC upsert mode for streaming use cases.

The standard debugging workflow

When something is wrong, run through this checklist:

Open the query plan. What does the bytes processed show vs what you expected? What was the slot time?
Look at per-stage rows. Which stage exploded the row count? That’s your join/cartesian or your window-function-without-partition.
Look at per-slot timing. Is one slot at 99% while others are idle? That’s skew.
Look at shuffle bytes. Is shuffle dramatically larger than scan? That’s a join or aggregation that’s expensive.
Check INFORMATION_SCHEMA.JOBS for similar queries — did the runtime change recently? When?
Compare to the dry run. If runtime bytes wildly exceed dry-run bytes, dynamic execution decided something different (often skew).
Run with --dry_run and inspect the plan without running. Sometimes you can see the planner’s plan and spot trouble before paying for it.

The query plan is the single most important debugging tool in BigQuery. Make a habit of opening it on every non-trivial query, even when things look fine. It builds intuition.

11. The Taste Test

You can usually tell within thirty seconds whether someone using BigQuery knows what they’re doing. Here’s what good vs bad looks like in practice.

Schema and table design

beginner pattern:

CREATE TABLE events (
  id STRING,
  data JSON,
  created TIMESTAMP
);
-- No partitioning. No clustering. JSON blob holding everything.
-- Will be unusable at 100 GB. Painful at 1 TB. Catastrophic at 10 TB.

experienced pattern:

CREATE TABLE events (
  event_id STRING NOT NULL,
  event_time TIMESTAMP NOT NULL,
  user_id STRING NOT NULL,
  event_type STRING NOT NULL,
  -- typed top-level fields for hot-path filters
  device_type STRING,
  country STRING,
  -- structured payload, but TYPED, not JSON blob
  payload STRUCT<
    session_id STRING,
    page_url STRING,
    referrer STRING,
    properties ARRAY<STRUCT<key STRING, value STRING>>
  >
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type
OPTIONS (
  partition_expiration_days = 730,
  description = "User behavior events. Partitioned daily, clustered for user/event lookups."
);

The experienced version: partitioned for time-based pruning, clustered for the common high-cardinality filter columns, types known at column level (so the planner can prune and the user can index by IDE autocomplete), with partition expiration so storage doesn’t grow forever, and described so the next person knows what it is.

Query style

beginner:

SELECT * FROM events WHERE event_time > '2024-06-01';

experienced:

SELECT event_id, user_id, event_type, payload.session_id
FROM `proj.analytics.events`
WHERE DATE(event_time) BETWEEN '2024-06-01' AND '2024-06-15'
  AND event_type = 'purchase';

What’s different: the experienced version names columns explicitly (so we only pay for what we need), uses fully-qualified table names (so it works regardless of default project), filters by DATE(event_time) matching the partition definition (so partition pruning kicks in), and bounds the upper end of the date range too (so accidentally not specifying an end-date doesn’t scan the future-and-also-everything).

Aggregation patterns

beginner:

SELECT COUNT(DISTINCT user_id) FROM huge_events;
-- Will shuffle every distinct user_id through the system.
-- For exact counts of low-cardinality this is fine. For 100M+ users, it's wasteful.

experienced:

-- For an analytics dashboard:
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users FROM huge_events;
-- For a business-critical exact count:
SELECT COUNT(DISTINCT user_id) FROM huge_events;
-- (experienced knows which case is which and chooses accordingly.)

CTE / temp table usage

beginner:

WITH base AS (
  SELECT * FROM huge_table WHERE expensive_predicate
)
SELECT * FROM base WHERE x = 1
UNION ALL
SELECT * FROM base WHERE x = 2
UNION ALL
SELECT * FROM base WHERE x = 3;
-- 'base' computed three times. 3x the bytes.

experienced:

CREATE TEMP TABLE base AS
  SELECT * FROM huge_table WHERE expensive_predicate;

SELECT * FROM base WHERE x = 1
UNION ALL
SELECT * FROM base WHERE x = 2
UNION ALL
SELECT * FROM base WHERE x = 3;
-- 'base' computed once. Subsequent reads are cheap.

Joins

beginner:

SELECT a.*, b.*
FROM big_table a
JOIN big_table b ON LOWER(a.user_id) = LOWER(b.user_id)
WHERE a.event_time = b.event_time;
-- LOWER() on join key defeats clustering on user_id.
-- Implicit cartesian risk if event_time matches across rows.

experienced:

SELECT a.event_id, a.amount, b.first_seen
FROM `proj.ds.events` a
JOIN `proj.ds.user_profiles` b USING (user_id)
WHERE DATE(a.event_time) = '2024-06-15';
-- Direct join on raw user_id (cluster pruning works on user_profiles).
-- Partition filter on a.event_time means a is pruned to one day.
-- Only the columns we actually need.
-- USING is cleaner than ON when the join column has the same name.

Pipelines and workflow

beginner: a single 800-line query with 20 nested CTEs that occasionally fails with Resources exceeded and nobody knows why.

experienced: a multi-statement script with CREATE TEMP TABLE for each meaningful intermediate, partition-aligned MERGE statements for upserts, parameterized for the run date so backfills are trivial, and an INFORMATION_SCHEMA.JOBS query at the end logging row counts and slot time for monitoring.

Operations

beginner: doesn’t know what the query plan looks like. Tunes by trial and error.

experienced: opens the query plan on every non-trivial query. Writes INFORMATION_SCHEMA queries to find expensive jobs. Has a saved bookmark to “queries that scanned >1 TB last week, ranked by frequency.” Sets maximum_bytes_billed on dashboards. Reviews the partitioning-and-clustering recommender output monthly.

Cost awareness

beginner: surprised when the bill comes.

experienced: knows roughly how many TB their workload scans per month, watches the trend in INFORMATION_SCHEMA, set up budget alerts, and can tell you within a factor of 2 what any given query will cost before running it.

The vibe check

Things that make me trust someone’s BigQuery work, in 30 seconds of looking at their code:

Tables have explicit partitioning + clustering with a comment explaining the choice.
Queries name columns, qualify tables, filter on partition columns.
CTEs are used for readability with single references; multi-reference intermediates use temp tables.
They use APPROX_* where it’s clearly fine.
They use nested fields where the relationship is hierarchical.
Their dashboards have maximum_bytes_billed set.
They’ve checked the query plan recently.
They know the difference between bytes scanned and slot time.

Things that make me suspicious:

SELECT * in production queries.
DML used for bulk inserts.
No partitioning on tables clearly intended to grow over time.
Date columns that are STRING because “we’ll figure out the format later.”
Joins on string keys when integer surrogates would do.
A 900-line query that “works.”
Cost surprises met with confusion rather than with INFORMATION_SCHEMA.JOBS queries.

12. Where to Go Deeper

These are the resources I actually recommend. They’re ranked by what gives you the most depth per minute of reading.

The original Dremel paper (2010): “Dremel: Interactive Analysis of Web-Scale Datasets” — Sergey Melnik et al. It’s only 10 pages. Read it. The columnar nested encoding, the serving tree, and the experimental results on a 24-billion-row table all started here. Available from research.google. Read this first if you want to understand BigQuery as a system rather than a product.
“BigQuery under the hood” and “Inside Capacitor” — Google Cloud blog — Two posts by the BigQuery team that are unusually candid about internals. The Capacitor post in particular gives you concrete intuition about the storage format and why nested data is fast. Both are short. Read them after the Dremel paper.
“In-memory query execution in Google BigQuery” — Google Cloud blog — Walks through how shuffle works in BigQuery and why it’s different from MapReduce / Spark shuffle. If you only read one thing about how distributed execution works in BigQuery, read this.
The official BigQuery best-practices documentation, specifically: “Optimize query computation,” “Optimize storage,” and “Introduction to optimizing query performance.” Most cloud docs are tedious; these aren’t. They distill production wisdom into action items. The “anti-patterns” section of the query optimization page is gold.
“Storing and Querying Tree-structured Records in Dremel” (2014) — The follow-up paper to Dremel, which formalizes the algebra of operating on nested columnar data. You don’t need this to use BigQuery, but it’s beautiful if you want to understand why the encoding is what it is.
Felipe Hoffa’s blog and Twitter (@felipehoffa) — Felipe was a long-time BigQuery developer advocate and writes the most consistently good intermediate-to-advanced BigQuery content on the internet. Search his name + your topic and you’ll usually find something useful.
Hands-on project: take a public dataset (BigQuery has dozens — try bigquery-public-data.crypto_ethereum or bigquery-public-data.github_repos), build a meaningful analytical query, then iterate on optimizing it. Start with SELECT *, then progressively add partitioning predicates, restrict columns, and watch the dry-run cost drop. Open the query plan after each version. The intuition builds in your hands, not just in your head.
The “BigQuery Sandbox” — free tier on the official Google Cloud Console. 1 TB free queries per month, 10 GB free storage, no credit card required for the sandbox. Enough to do real exploration.

If you want one trajectory: read this document, then the Dremel paper, then the Capacitor blog post, then the in-memory shuffle post, then go run queries against a public dataset until the intuitions in this document feel like reflexes rather than memorized rules.

The fundamentals haven’t changed since the 2010 Dremel paper — the engine has gotten better, the storage format has evolved, the pricing has been re-architected, but the core ideas (columnar storage, tree execution, separated storage and compute, fast network, multi-tenant slots) are exactly what was sketched in that paper. Understand those five ideas and you understand BigQuery.

Now go open a query plan.

The ideas are mine. The writing is AI assisted

1. One-Sentence Essence

2. The Problem It Solved

3. The Concepts You Need

Storage and physical layout

Compute and execution

Network and shuffle

Data organization

Pricing / billing

4. The Distilled Introduction

Getting set up

The unit hierarchy

Loading data

A first useful workflow: load → query → optimize

The fifteen things you’ll do most often

Configuration that matters from day one

5. The Mental Model

Core Idea 1: Storage and compute are separate services that meet at the network.

Core Idea 2: Capacitor reads only the columns you ask for, only the shards that match.

Core Idea 3: Queries are trees of stateless workers, with shuffle as the in-memory connective tissue.

Core Idea 4: Slots are the universal unit of compute, but you can’t see them.

6. The Architecture in Plain English

Step 1: Submission and metadata fetch

Step 2: Planning and rewriting

Step 3: Tree dispatch

Step 4: Leaf execution — the actual work

Step 5: Shuffle — the in-memory connective tissue

Step 6: Mixer / next-stage execution

Step 7: Sort and limit

Step 8: Result delivery

Where state lives

7. The Things That Bite You

1. Forgetting to filter on the partition column

2. CTEs are not materialized

3. LIMIT doesn’t reduce cost

4. Functions on filter columns kill pruning

5. Window functions can serialize the entire query

6. Shuffle quotas are real and you’ll find them

7. Streaming buffer makes recently-streamed data weird

8. Joining with a single-row “config” table can defeat the optimizer

9. The 4,000-partition-per-table limit

10. Skew makes some slots do all the work

8. The Judgment Calls

Judgment Call 1: Partition or cluster?

Judgment Call 2: On-demand or capacity (Editions) pricing?

Judgment Call 3: Materialized view, scheduled query into a table, or live view?

Judgment Call 4: Storage Write API, batch load, or DML?

Judgment Call 5: Nested fields or normalized schema?

Judgment Call 6: Approximate or exact aggregations?

Judgment Call 7: Build a star schema or wide denormalized tables?

Judgment Call 8: Optimizing for cost or for latency?

Judgment Call 9: Large query in one go, or break into stages?

Judgment Call 10: Region selection — single-region or multi-region?

Judgment Call 11: BigQuery, BigLake, or external table?

Judgment Call 12: When not to use BigQuery

9. The Commands and APIs That Actually Matter

Inspecting

Estimating cost before running

Schema and table management

Writing data efficiently

Querying patterns worth knowing

Scripting

Permissions you actually need to know

10. How It Breaks

Failure mode 1: “Resources exceeded during query execution”

Failure mode 2: Query is much slower than expected

Failure mode 3: Costs spike

Failure mode 4: Streaming data isn’t visible / DML against streamed data fails

Failure mode 5: Cluster pruning isn’t kicking in

Failure mode 6: MERGE is slow / expensive

The standard debugging workflow

11. The Taste Test

Schema and table design

Query style

Aggregation patterns

CTE / temp table usage

Joins

Pipelines and workflow

Operations

Cost awareness

The vibe check

3. `LIMIT` doesn’t reduce cost

Failure mode 6: `MERGE` is slow / expensive