deep·tech·intuition
intermediate ·

dbt Deep Intuition

An experienced engineer's guide to dbt

1. One-Sentence Essence

dbt is a SQL compiler that turns a folder of select statements into a DAG of materialized tables and views in your data warehouse — and nothing else.

dbt does not move data. dbt does not have a runtime engine. dbt does not orchestrate anything outside of itself. It reads your SQL files, resolves the references between them with Jinja, figures out the dependency order, and submits CREATE TABLE AS / CREATE VIEW AS statements to a warehouse you connect it to. Everything else — tests, documentation, snapshots, incremental logic — is more SQL that dbt generates and submits to that same warehouse.

The rest of this document is unpacking what you can do once you have that machine.


2. The Problem It Solved

Before dbt, the transformation layer of an analytics stack was a horror show. You had three flavors of bad option, and most companies had all three at once.

Option 1: Stored procedures. Your transformation logic lived inside the warehouse as CREATE PROCEDURE blocks. There was no version control, no code review, no tests, no documentation. Two analysts would edit the same procedure on Tuesday and overwrite each other. When something broke at 3am you’d open a SQL client and start reading 4,000-line procedures with no comments.

Option 2: Drag-and-drop ETL tools. Informatica, Talend, SSIS. The logic was buried in a GUI as boxes-and-arrows. You couldn’t git diff it. Migrating between tools meant rebuilding everything. The vendor charged you per CPU.

Option 3: A pile of Python scripts running on a cron. Every team rolled their own. There was no shared idiom for “how do you build a table from another table.” You ended up with a Python script that pulls data out of Postgres, transforms it in Pandas (running out of memory at 10GB), and writes it back. You were paying to move data out of a system that could already do the transformation, just to move it back in.

Meanwhile two things happened that changed the economics. Cloud warehouses got cheap and fast — Redshift, BigQuery, then Snowflake. Suddenly the warehouse itself could chew through terabytes in minutes, and storage was practically free. Software engineering practices got mature — git, pull requests, CI, modular code, tests. Analysts looked at how their software-engineering colleagues worked and asked: why don’t we get any of this?

dbt’s insight, articulated by its creators (originally Fishtown Analytics, now dbt Labs) around 2016, was almost embarrassingly simple: stop trying to be an ETL tool. The warehouse can already transform data, and it’s good at it. Just give analysts a way to write SQL like software engineers write code — files in a git repo, references between them, tests, environments, code review — and compile that down to plain SQL the warehouse already knows how to run. The “ELT” pattern (Extract, Load, then Transform inside the warehouse) was already happening. dbt was the missing tool for the T.

This is why dbt is described as bringing “software engineering best practices to analytics.” It’s not adding magic. It’s adding a build system to a layer that previously didn’t have one.


3. The Concepts You Need

Before anything else makes sense, you need this vocabulary. Each of these will appear repeatedly. None of them is optional.

The structural concepts

Project. A folder on disk with a dbt_project.yml file at its root. Everything dbt does is scoped to a project. A project is the unit of version control — typically one git repo per project, though larger orgs split into multiple projects (this is called dbt Mesh).

Model. A single .sql file containing exactly one select statement. That’s it. dbt will wrap your select in CREATE TABLE AS (...) or CREATE VIEW AS (...) and run it. The file’s name (without the .sql) becomes the table/view name in the warehouse. Models are the core unit of work in dbt — when people say “I’m building a dbt model,” they mean writing a SQL file.

Source. A raw table in the warehouse that dbt didn’t create — it was loaded by something upstream (Fivetran, Airbyte, your custom ingestion). Sources aren’t files; they’re declarations in a YAML file that say “this raw table exists in this database, in this schema, with this name, and we’re going to call it stripe.orders from inside our dbt code.”

Seed. A small CSV checked into the repo that dbt loads into the warehouse as a table. Useful for static lookup data — country code mappings, fiscal calendar, account exclusion lists. Not for actual data; if your CSV is over a few thousand rows, it doesn’t belong as a seed.

Snapshot. A special model type that implements Slowly Changing Dimension Type 2 (SCD2) — it captures changes over time in a mutable source table. We’ll cover this fully in section 5; for now, know that it’s the third kind of “thing dbt builds in your warehouse.”

Test. An assertion about your data, written as a SQL query that returns failing rows. If the query returns zero rows, the test passes. If it returns any rows, those rows are the failures. dbt has four built-in generic tests (unique, not_null, accepted_values, relationships) and you can write your own.

Macro. A reusable Jinja function. Think def in Python. Macros let you generate SQL programmatically — looping, conditionals, parameterization.

The compilation concepts

Jinja. A Python templating language (used in Flask, Django, Ansible). dbt uses it heavily. Anything in {{ ... }} is an expression that produces output; anything in {% ... %} is a control statement (if, for, set). Your warehouse doesn’t speak Jinja — dbt evaluates the Jinja and produces plain SQL, then sends the plain SQL to the warehouse.

ref(). A Jinja function. The most important one in all of dbt. When model customers.sql writes {{ ref('stg_customers') }}, dbt does two things at compile time: (1) replaces it with the fully-qualified table name (e.g. analytics_dev.staging.stg_customers), and (2) records that customers depends on stg_customers in the dependency graph. Without ref(), dbt cannot know the order to build models.

source(). Jinja function for referring to a raw source table — {{ source('stripe', 'orders') }}. Same idea as ref() but for sources rather than models.

Manifest. The output of dbt’s parse phase. A JSON file (target/manifest.json) that contains every model, source, test, macro, and edge in the dependency graph. This is the project’s compiled state — the manifest is what dbt uses to plan what to run.

DAG. Directed Acyclic Graph. The dependency graph of your project, derived from ref() and source() calls. dbt walks the DAG topologically — anything a model depends on must finish before that model runs. Cycles are forbidden (and dbt will refuse to run your project if you accidentally create one).

Compiled SQL. What target/compiled/.../my_model.sql contains: your model after Jinja has been resolved, but before any DDL has been wrapped around it. This is the SQL you’d run yourself in a query console. Run SQL (in target/run/) is the compiled SQL plus the materialization wrapping (the CREATE TABLE AS, the merge statement, etc.) — this is what actually gets submitted to the warehouse.

The materialization concepts

Materialization. The strategy dbt uses to persist a model’s output in the warehouse. There are five built-in: view (default), table, incremental, ephemeral, and materialized_view. The choice changes the SQL dbt generates from your select. Same model, different materialization, completely different write pattern.

Incremental. A materialization where the first run builds the full table, but subsequent runs only insert/update new or changed rows — typically detected with where created_at > (select max(created_at) from {{ this }}). The whole point is to avoid rebuilding 5TB of historical data every night when only yesterday changed.

is_incremental(). A Jinja function that returns true only when (a) the model is materialized as incremental, (b) the table already exists, and (c) you’re not running with --full-refresh. The standard pattern is to wrap your incremental filter in {% if is_incremental() %} ... {% endif %} so the first run gets all the data and subsequent runs only get the delta.

Adapter. The plugin that lets dbt talk to a specific warehouse — dbt-snowflake, dbt-bigquery, dbt-postgres, dbt-redshift, dbt-databricks, dbt-duckdb. Adapters translate dbt’s generic operations (e.g. “merge these rows”) into warehouse-specific SQL. Most user-facing behavior is the same across adapters; the differences live in incremental_strategy, materialization details, and warehouse-specific configs.

The environment concepts

Profile. Connection credentials. Lives in ~/.dbt/profiles.yml, never in your repo. A profile has multiple targets (e.g. dev, prod) — one of which is the default.

Target. A named environment within a profile — typically dev (your sandbox) and prod (the real deal). Each target has its own database, schema, threads, and credentials. The {{ target.name }} Jinja variable lets your models behave differently in dev vs prod (e.g. only process the last 7 days in dev to keep development fast).

Threads. How many models dbt builds in parallel. dbt walks the DAG and submits up to threads models concurrently to the warehouse. More threads = faster, until your warehouse is the bottleneck.

These twenty-something concepts are the entire vocabulary. Once you have them, the rest is mechanics.


4. The Distilled Introduction

Setup

dbt Core is a Python package. Install the adapter for your warehouse — that pulls in dbt itself as a dependency:

pip install dbt-snowflake   # or dbt-bigquery, dbt-postgres, etc.
dbt --version

dbt init my_project creates a project skeleton. It prompts you for connection details and writes them to ~/.dbt/profiles.yml. The repo gets a dbt_project.yml (project config) and a models/ folder.

In production, most teams use one of: dbt Cloud (the hosted UI/scheduler from dbt Labs), or dbt Core orchestrated by Airflow / Dagster / Prefect / GitHub Actions. There is no built-in scheduler in dbt Core — dbt run is the entire interface. Something else has to call it on a schedule. (As of 2025, dbt Labs released the Fusion engine, a Rust rewrite of dbt with built-in features like state-aware orchestration; we’ll cover that in Section 5 along with the mental model.)

Your first model

Create models/customers.sql:

select
    customer_id,
    first_name,
    last_name,
    created_at
from {{ source('jaffle_shop', 'customers') }}
where created_at >= '2020-01-01'

You’d also need a models/_sources.yml declaring the source:

version: 2
sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop_raw
    tables:
      - name: customers

Run it:

dbt run --select customers

dbt compiles {{ source('jaffle_shop', 'customers') }} to raw.jaffle_shop_raw.customers, wraps the whole select in create or replace view dev.my_schema.customers as (...), and submits it. You now have a view called customers in your dev schema. That’s the entire flow.

Building a layered project

Real projects layer models. The convention is staging → intermediate → marts.

Staging models (models/staging/jaffle_shop/stg_jaffle_shop__customers.sql) sit 1-to-1 with source tables. Their job is only: rename columns to your conventions, cast types, do trivial cleanup. No joins, no business logic, no aggregations. They’re the only place that uses source(). Materialize them as views (default) — they’re cheap, always fresh, and just a thin facade over raw data.

-- stg_jaffle_shop__customers.sql
select
    id          as customer_id,
    first_name,
    last_name,
    email,
    created_at
from {{ source('jaffle_shop', 'customers') }}

Intermediate models (models/intermediate/int_orders_with_payments.sql) do the joining and reshaping that’s complex enough to deserve its own model but doesn’t represent a final business concept. They ref() staging models. Often materialized as ephemeral or view.

Mart models (models/marts/finance/orders.sql) are the polished, business-facing tables. They have clean names (orders, customers, not fct_orders__v3_final_REAL), they’re the things your BI tool queries, they’re materialized as table (or incremental if large). They ref() other marts and intermediates and (occasionally) staging models.

The arc is: source-conformed → business-conformed. You start with whatever shape Stripe or Shopify hands you, and you end with the shape your finance team thinks in.

ref() and the DAG

Inside your models, you never write select * from analytics.staging.stg_customers. You write select * from {{ ref('stg_customers') }}. dbt does two jobs from this:

  1. At compile time, replaces the call with the actual table name in whatever environment you’re running in. In dev it becomes dev_alice.staging.stg_customers. In prod it becomes analytics.staging.stg_customers. Your code is environment-agnostic.

  2. Builds the DAG. If customers calls ref('stg_customers'), dbt knows stg_customers must build first. You never write a build order; the DAG is implicit in your refs.

When you run dbt run, dbt:

  1. Parses every .sql and .yml file. Finds every ref and source call. Builds the manifest.
  2. Plans the DAG: a topological order, parallelizing across threads.
  3. Compiles each model: resolves Jinja, produces compiled SQL.
  4. Executes: submits run-SQL to the warehouse, in DAG order.

Materializations: choosing the right one

This is one of the most common decisions you’ll make. Five options:

view (default). dbt creates a view. Zero storage, always reflects upstream. Use for staging models and any model that’s queried rarely or where the underlying SQL is cheap.

table. dbt does create or replace table ... as (select ...). Full rebuild on every run. Use for marts, especially anything queried by dashboards (snappy reads matter), and for any model whose SQL is too expensive to recompute on every BI query.

incremental. Builds full table on first run; appends/merges new rows on subsequent runs. Use when (a) the model is large enough that rebuilding hurts, AND (b) you can identify “new” rows reliably (a created_at or updated_at column, an event timestamp).

{{ config(materialized='incremental', unique_key='order_id') }}

select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

The {% if is_incremental() %} block is omitted on the first run (so the full table builds) and active on subsequent runs (so only deltas are processed). The unique_key tells dbt to merge on this column rather than blindly append.

ephemeral. Doesn’t build anything in the warehouse. Instead, dbt inlines the model’s SQL as a CTE in any model that refs it. Use rarely — for tiny intermediate logic that exists only to keep code DRY but doesn’t deserve its own object. Heavy ephemeral chains produce massive compiled queries that are hard to debug.

materialized_view. The warehouse-managed materialized view (Snowflake’s dynamic tables, BigQuery materialized views, Postgres MVs). dbt creates and configures the MV; the warehouse refreshes it. Newer, less commonly used.

Default rule: views for staging, tables for marts, incremental only when you’ve measured that table is too slow. Don’t reach for incremental first — it’s an order of magnitude more complex and an order of magnitude more bug-prone.

Tests

A test is a SQL query that should return zero rows. dbt has four built-in generic tests, configured in YAML:

# models/marts/_marts__models.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id

Run them with dbt test. dbt compiles each test to a select count(*) ... where <bad condition> query and runs it. If the count is non-zero, the test fails.

You can write singular tests as standalone SQL files in tests/:

-- tests/orders_no_negative_amounts.sql
select * from {{ ref('orders') }} where amount < 0

You can also write custom generic tests that take parameters — this lets you reuse one test across many models. And as of dbt 1.8+, unit tests let you test transformation logic with mocked input rows (genuine engineer-style unit tests, not data tests).

Snapshots

Snapshots solve a specific problem: your source table mutates in place (status changes from “pending” to “shipped” with no history kept), but you want history. Snapshot definitions go in the snapshots/ folder:

{% snapshot orders_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

You run dbt snapshot (separately from dbt run). On the first run, it captures the current state of the table with dbt_valid_from and dbt_valid_to columns. On subsequent runs, when it sees a row whose updated_at is newer than what’s in the snapshot table, it closes off the previous version (sets dbt_valid_to) and inserts a new row with the current values. Result: full SCD2 history, with no custom code on your part.

dbt build vs dbt run vs dbt test

dbt run builds models. dbt test runs tests. dbt seed loads seeds. dbt snapshot runs snapshots.

dbt build does all of the above, in DAG order, with one extra superpower: if a test fails on a model, downstream models that depend on it are skipped. This is the “data quality circuit breaker” you want in production. dbt build is the production command. dbt run is for development when you want speed and don’t care about tests right now.

Selectors

The thing that makes dbt usable on big projects is the --select (or -s) flag. You almost never run all your models — you run a subset.

dbt build --select my_model              # just my_model
dbt build --select my_model+             # my_model and all descendants
dbt build --select +my_model             # my_model and all ancestors
dbt build --select +my_model+            # the full lineage around my_model
dbt build --select tag:nightly           # everything tagged 'nightly'
dbt build --select staging.stripe        # everything in models/staging/stripe/
dbt build --select state:modified+       # everything you changed + descendants
dbt build --select source:stripe+        # everything downstream of any stripe source

The + operator is the workhorse. model+ means “this model and everything that depends on it.” +model means “this model and everything it depends on.” You’ll type + more than you type dbt.

Documentation

Every model and column can have a description in YAML:

- name: orders
  description: "One row per order. Source: `jaffle_shop.orders` (Postgres replica)."
  columns:
    - name: order_id
      description: "Primary key. UUID."

dbt docs generate produces a static site with model descriptions, column descriptions, lineage graphs, and the compiled SQL for every model. It’s hosted at localhost:8080 after dbt docs serve. Most teams pin this to a URL their analysts and stakeholders use as the catalog.

Packages

dbt has a package system. packages.yml lists dependencies; dbt deps installs them.

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.10.1

dbt-utils is essentially the standard library — surrogate keys, date spines, pivoting, deduplication. dbt-expectations is a port of Great Expectations as dbt tests. The codegen package generates source YAML and staging model boilerplate for you. Install the first one (dbt-utils) on day one; you’ll reach for it within an hour.

That’s the working surface area. There’s more — exposures, semantic models, hooks, custom materializations — but if you’ve internalized the above, you can ship.


5. The Mental Model

Four ideas to stop any surprises.

Core Idea 1: dbt is a SQL compiler, not a runtime.

dbt itself never touches your data. It reads files, evaluates Jinja, and produces SQL strings. It hands those strings to the warehouse via the adapter. The warehouse does all the work. dbt’s job is finished the moment the SQL is submitted.

This predicts a lot:

  • dbt performance is mostly warehouse performance. If your dbt run takes 2 hours, the fix is almost always smarter SQL or a bigger warehouse, not a dbt-level change.
  • dbt can’t optimize your SQL. It will compile your select * from x cross join y exactly as written. There’s no query planner above the warehouse’s.
  • Anything you can do in your warehouse’s SQL, you can do in dbt. Window functions, CTEs, JSON parsing, vendor-specific functions — all fair game.
  • dbt errors are split into two kinds. Compilation errors (Jinja didn’t resolve, ref points nowhere, YAML is malformed) happen in dbt itself, before SQL is sent. Runtime errors (column doesn’t exist, divide by zero, permission denied) come back from the warehouse. The fix path is completely different — and recognizing which kind of error you’re looking at is the first step in debugging anything.

Core Idea 2: The DAG is implicit in ref(). Everything follows from this.

Once you write {{ ref('stg_customers') }} instead of select from stg_customers, three superpowers light up:

  1. Build order is automatic. dbt walks the dependency graph topologically. You never schedule “run staging models, then intermediate, then marts.” The graph does it for you. You add a model and dbt knows where it fits.

  2. Environments come for free. Because ref() resolves the actual table name at compile time based on the current target, the same code runs in dev_alice, dev_bob, staging, and prod with different physical tables. No environment-specific SQL. No find-and-replace before deploys.

  3. Selectors work. dbt run --select my_model+ means “this model and everything downstream” — but only because dbt knows what is downstream. It knows because of ref().

This is why the cardinal sin in dbt is hardcoded referencesselect * from analytics.staging.stg_customers instead of {{ ref('stg_customers') }}. The hardcoded version works, but it severs the model from the DAG. dbt won’t know it’s a dependency. Build order will be wrong. Environment switching will break. Lineage will lie. Linters exist precisely to catch this (the dbt_project_evaluator package will flag it).

Core Idea 3: A model is the SQL you write. A materialization is what dbt wraps around it.

This separation is fundamental and weirdly underappreciated by beginners. Your model file is a select statement — describing what data should exist. The materialization config tells dbt how to physically persist it: as a view, as a full-rebuild table, as an incrementally-merged table, as nothing-at-all (ephemeral).

Same model, different materializations, drastically different behavior:

MaterializationWhat dbt sends to the warehouseWhen to use
viewcreate or replace view X as (your select)Staging; cheap; freshness matters
tablecreate or replace table X as (your select)Marts; query speed matters; full rebuild is acceptable
incrementalFirst run: create table. Later: merge / insert overwrite / delete+insert based on is_incremental() filterLarge tables where rebuild is too slow
ephemeralNothing built. SQL inlined as a CTE in downstream modelsGlue logic too small to deserve a real model

The implication: changing a materialization should not require rewriting the model. You can promote staging models from view to table, marts from table to incremental, without touching the SQL. You touch the {{ config() }} block at the top, or the dbt_project.yml directory-level config, and that’s it. This means start simple — view, table, incremental, in that order — and only escalate when you’ve measured a real problem.

Core Idea 4: dbt is stateless between runs (and this is the source of half its quirks).

dbt Core, in its classic form, has no memory. Every dbt run starts fresh. It parses your project, builds the manifest, runs models. When the run ends, dbt forgets everything. The only state is what’s in the warehouse: the actual tables and views.

This predicts:

  • Incremental models work via SQL queries against the warehouseselect max(updated_at) from {{ this }} — not via some dbt-managed bookmark. dbt has no idea what was processed last time. It asks the table.
  • Deletes don’t propagate. dbt builds models from upstream sources. If a row is deleted from the source, your incremental table still has it. There’s no cascade. (For SCD2 history tracking, that’s what snapshots are for.)
  • “State comparison” requires you to keep an old manifest.json around. Slim CI works by you saving prod’s manifest as an artifact, then running --select state:modified+ --defer --state ./prod-artifacts in CI to “compare against prod.” dbt isn’t doing magic — you handed it the previous state file.
  • Schema drift bites you. If a column appears in the source, your model picks it up; if a column changes type, dbt cheerfully creates the table with the new type. There’s no migration story; the warehouse is the migration.

This is also where the dbt Fusion engine (released as public beta May 2025) and competitors like SQLMesh come in. Fusion adds state-aware orchestration: a real-time fingerprint of model code and data state, so dbt can skip models whose inputs haven’t changed. SQLMesh was built stateful from day one and uses that to do environment-promotion via pointer swaps rather than rebuilds. If you’re learning dbt today, expect statefulness to become more central — but understand that the classic mental model is stateless, and that’s still what dbt Core does today.


6. The Architecture in Plain English

Here’s what happens when you type dbt run. Not the marketing diagram — the actual sequence.

Phase 1: Read the project. dbt walks the models/, macros/, snapshots/, seeds/, tests/ directories. It reads every .sql, .yml, and .csv. It also reads your installed packages (everything in dbt_packages/, populated by dbt deps). It reads dbt_project.yml to figure out which directory configs apply to what.

Phase 2: Parse. For every model file, dbt does a first-pass Jinja render with execute = False. The goal here is to extract the static structure: every ref(), source(), and config() call. Anything that would actually need to talk to the warehouse (like run_query()) is replaced with placeholders or skipped. dbt builds the manifest — a JSON object containing every node (models, sources, tests, etc.), every edge (the dependency graph), and every config. The manifest is written to target/manifest.json. This is the parse phase, and it’s where most “compilation errors” surface.

Phase 3: Plan. dbt takes your --select argument (or all models if you didn’t pass one), filters the manifest down to selected nodes, and computes a topological order. It also figures out parallelism: up to threads models can run concurrently as long as their dependencies are satisfied.

Phase 4: Compile. For each selected node, dbt does a second-pass Jinja render with execute = True. Now ref() resolves to actual table names (computed from your target’s database/schema and the model’s name). is_incremental() checks whether the table exists. run_query() actually queries the warehouse. The output is compiled SQL, written to target/compiled/<project>/<path>.sql.

Phase 5: Build run-SQL. The compiled SQL is wrapped in materialization-specific DDL. For a table model, this is create or replace table X as (compiled SQL). For incremental, it’s a merge statement (or insert overwrite, depending on adapter and strategy) using the compiled SQL as the source. For a view, create or replace view. The result lands in target/run/<project>/<path>.sql. This is the actual SQL the warehouse runs.

Phase 6: Execute. dbt opens connections to the warehouse (one per thread). It submits run-SQL for each node when its dependencies finish. It captures success/failure, timing, row counts. Results stream to the console.

Phase 7: Write artifacts. When the run finishes, dbt writes target/run_results.json (every node’s outcome) and updates target/manifest.json. These artifacts are the key to a lot of tooling — Slim CI uses them, dbt docs uses them, monitoring tools parse them.

A few things to internalize from this:

  • Parse is global; compile is per-model. Adding a new model to a 5,000-model project still parses all 5,000. This is why parse time matters at scale and why dbt Fusion (rewritten in Rust) was built — Python’s parse-time overhead becomes the bottleneck.

  • Where state lives: in two places only. Your code lives in git. Your data lives in the warehouse. dbt itself stores nothing between runs except the artifacts in target/ (which you regenerate from scratch on each run). This is why moving dbt projects is trivial and why “what’s in production right now?” is unambiguous: it’s whatever’s in the warehouse, period.

  • The warehouse connection is the bottleneck. dbt parallelizes by submitting concurrent queries. Your warehouse decides how many it can actually run at once. On Snowflake, this depends on warehouse size. On BigQuery, it’s slot-based. On Redshift, query queues. Tuning threads is a dance between dbt parallelism and warehouse capacity.

  • dbt Cloud and Fusion add layers. dbt Cloud adds a scheduler, a web IDE, and (with state-aware orchestration on Fusion) a persistent state store that tracks code and data fingerprints. But the core compile-and-submit mechanic above is unchanged. If you understand dbt Core’s flow, you understand 90% of what dbt Cloud is doing — Cloud is dbt Core in a nicer wrapper with state on top.


7. The Things That Bite You

Six to nine months in, every dbt user has hit most of these. They all trace back to the four core ideas above. Knowing them in advance is worth months of pain.

Gotcha 1: ref() in dynamic Jinja silently drops dependencies.

If you write {{ ref('something') }} inside an {% if execute %} block or behind a runtime condition, dbt won’t see it during parse. It’ll happily compile and run — but the DAG won’t know about the dependency. Build order can be wrong. Lineage is wrong. Slim CI’s state:modified+ won’t pick up downstream changes.

The fix: put a SQL comment with the ref() outside the conditional, like -- depends_on: {{ ref('something') }}. dbt parses this and registers the dependency, even if the actual select happens conditionally. This is a hack, but it’s the official hack.

This connects to Core Idea 2: the DAG is built from ref() calls discovered at parse time. Conditional refs hide from the parser.

Gotcha 2: The unique_key lie in incremental models.

You write an incremental model with unique_key='order_id', expecting deduplication. Your run finishes successfully. Months later you notice your table has duplicates.

What happened: the unique_key config doesn’t enforce uniqueness on the source data. It tells dbt how to match incoming rows against existing rows when merging. If your incremental SELECT returns the same order_id twice in a single batch, the merge can fail or produce duplicates depending on the warehouse. If the source data has duplicates, dbt will faithfully insert all of them.

The fix is twofold. First, deduplicate inside the model — qualify row_number() over (partition by order_id order by updated_at desc) = 1 is the standard incantation. Second, add a unique test on the model so you find out the moment dedup logic breaks.

Gotcha 3: Schema changes in incremental models.

You add a column to your incremental model. You run. You get an error like target table has columns [a,b,c] but source query produces [a,b,c,d].

dbt’s default behavior on an incremental model whose schema changed is to fail loudly. This is not a bug — it’s the safer default. You don’t want a column silently disappearing because you forgot to update the target.

The fix: set on_schema_change in the config. Options are ignore (skip the new column entirely — usually wrong), append_new_columns (add the column to the target, leave old data NULL — usually right), sync_all_columns (add new columns and drop removed ones — most aggressive), fail (the default). Pick append_new_columns and move on.

For more invasive changes, run dbt run --full-refresh --select my_model to drop and rebuild from scratch.

Gotcha 4: Hardcoded production references in dev.

Someone writes select * from prod.analytics.users in a model. It “works” — they can see prod data. They commit it. It runs in prod. It runs in dev. Nobody notices anything is wrong.

Then someone runs Slim CI in a feature branch. The state:modified+ selector misses dependencies because there’s no ref() to follow. Or they run dbt against a non-prod warehouse and the model breaks. Or the prod schema gets renamed and now the model fails everywhere.

This is the most common dbt anti-pattern. Linters (dbt_project_evaluator) catch it. Code review should catch it. Every reference to a dbt-managed table must go through ref(). Every reference to a raw source must go through source(). No exceptions.

Gotcha 5: Snapshots that aren’t running often enough.

You set up a snapshot on a table where rows update multiple times per day. Your snapshot job runs once daily. You discover too late that you’ve only captured the daily endpoint state — every intermediate state was overwritten in the source before your snapshot saw it.

Snapshots only see what’s there when they run. They are not a CDC stream. If you need every change, you need either (a) a CDC tool feeding an append-only log, which dbt then snapshots, or (b) snapshot frequency higher than change frequency. Daily snapshots are fine for daily-changing data; they are silent data loss for hourly-changing data.

Gotcha 6: Tests pass in dev because dev has 1% of the data.

You filter dev to last 7 days for speed. Your unique test passes — there are no duplicates in 7 days. You merge to prod. Your unique test fails because last year’s batch load had duplicates that dev never saw.

The fix isn’t subtle. Either (a) run a final test pass against full data before deploying (typical Slim CI pattern: build modified models in dev, then defer-and-test against prod), or (b) accept that some tests can only catch bugs in prod and have an alerting strategy for that.

Gotcha 7: Ephemeral chains turn into mile-long compiled queries.

You make a bunch of intermediate models ephemeral because “they don’t deserve a table.” Each one becomes a CTE inlined into its consumers. Now your final model’s compiled SQL is 800 lines of nested CTEs and your warehouse’s optimizer is choking on it. Debugging a result requires reading the entire compiled file because there are no intermediate tables to query.

Rule: ephemeral is for trivial logic — a one-line column rename, a constant filter — that you don’t want as a real object. The moment you’d want to inspect intermediate results, materialize as view or table.

Gotcha 8: dbt run doesn’t run tests.

Beginners are shocked by this. You set up tests, run dbt run, get green output, and assume your tests passed. They didn’t. dbt run runs only models. dbt test runs only tests. dbt build runs both, in DAG order, with the right circuit-breaker behavior.

Use dbt build in production. Use dbt run only when you specifically want to skip tests (development, fast iteration). Most “tests are passing” incidents trace back to someone using run and never realizing tests existed.

Gotcha 9: Macro changes silently invalidate Slim CI.

state:modified+ compares the current manifest to a previous manifest. If you change a macro that’s used by 200 models, every one of those 200 models is “modified” — because their compiled SQL changes. So far, so good.

But if you write a macro and use a var() or env_var() inside it, dbt can’t track the lineage of the variable. A change to the variable’s value won’t show up as state:modified. Your CI will skip models that should rebuild. You’ll deploy stale data and not know it.

The fix: don’t put logic in macros that depends on values dbt can’t track. Or, treat changes to vars as an explicit --full-refresh event and don’t rely on Slim CI for that case.


8. The Judgment Calls

These are the decisions an experienced dbt practitioner makes without thinking. They’re worth thinking about carefully.

Judgment Call 1: View vs Table vs Incremental.

The decision: how to materialize a model.

Option A — view: Always-fresh, zero storage, computation happens at query time.

Option B — table: Pre-computed once per build, fast to query, costs storage and compute on every dbt run.

Option C — incremental: Pre-computed but only for new/changed rows; complex to write and debug.

What experienced engineers actually choose: Start with view. Promote to table when query latency on the view becomes unacceptable for downstream consumers. Promote to incremental only when the table rebuild cost becomes unacceptable for your warehouse. In practice, staging models stay views forever; intermediate models are usually views or ephemeral; marts start as tables; only the largest fact tables become incremental. The signal: you measure that the cheaper option is causing real pain. Don’t preempt.

Judgment Call 2: Incremental strategy: merge vs insert_overwrite vs append vs delete+insert.

The decision: when a model is incremental, how should new data be merged in?

  • merge (default on Snowflake, BigQuery, Databricks): SQL MERGE statement, upserts on unique_key. Correct, expensive, the safe default.
  • insert_overwrite (BigQuery, Spark, Databricks): replaces entire partitions with new data. Cheap when you have a partitioned table and your delta is partition-scoped (e.g. always rebuilding “yesterday’s data”).
  • append: blind insert, no dedup. Fastest. Only safe when source is genuinely append-only (event logs, immutable transactions).
  • delete+insert (Redshift, Snowflake): delete matching rows by unique_key, then insert. Outperforms merge on Redshift specifically.

What experienced engineers actually choose: For mutable rows (orders changing status, profiles being edited), merge. For partition-scoped late-arriving data on BigQuery, insert_overwrite — often 90%+ cheaper. For pure event streams (clickstream, audit logs), append — fastest, no dedup overhead. For Redshift specifically, delete+insert because Redshift’s MERGE is emulated and slow.

The signal: look at your data. If rows mutate, merge. If they don’t but late-arrive in partitions, insert_overwrite. If they’re pure inserts, append.

Judgment Call 3: Should this be a snapshot or an incremental model?

The decision: source data mutates and you want history. Two options:

Option A — Snapshot: dbt-managed SCD2, automatic dbt_valid_from / dbt_valid_to.

Option B — Custom incremental SCD2 model: handcrafted, more flexible, no dbt magic.

What experienced engineers actually choose: Snapshots for “I want SCD2 with no surprises.” They’re declarative, well-tested, and the right call 80% of the time. Custom incremental SCD2 makes sense when (a) you need behavior dbt snapshots don’t give you, like soft-delete handling or SCD3, (b) you need very high frequency (snapshots are heavy on the warehouse for very wide tables), or (c) you’re already capturing CDC events and want to project them differently.

The signal: if your answer to “what changed?” is just “look at the row’s columns now vs before,” snapshot. If you need to reason about why it changed or carry richer event semantics, build it yourself.

Judgment Call 4: One project (monorepo) vs dbt Mesh (multi-project).

The decision: as your dbt project grows past ~500 models or two teams, do you split it?

Option A — Monorepo: one dbt_project.yml, everything in one git repo, shared CI, simple model resolution.

Option B — dbt Mesh: multiple projects, each owned by a domain team, with cross-project ref() and access controls.

What experienced engineers actually choose: Monorepo until it hurts. The pain points that justify Mesh are: (a) parse times exceeding several minutes, (b) teams stepping on each other’s models without clear ownership, (c) governance requirements (e.g. PII isolation), (d) needing different release cadences per domain.

For most teams under 1,000 models and one platform team plus embedded analysts, monorepo wins. The mesh tax — managing project dependencies, public/private access, coordinating versioning — is real and not always worth it. If you split prematurely, you spend the next year regretting the boundary you drew.

Judgment Call 5: Where does business logic live — staging or marts?

The decision: a transformation rule (e.g. “exclude internal test accounts”) could go in staging, intermediate, or marts.

Option A — staging: every downstream model gets it for free.

Option B — marts: only applied where it’s contextually correct.

What experienced engineers actually choose: Staging models do not embed business logic. Their job is renaming, casting, light cleanup — making source data conformant. Anything that could be “wrong” by another consumer’s definition (filters, aggregations, joins) goes in intermediate or marts.

The signal: if “should this be excluded?” has more than one answer depending on who’s asking, you cannot put the exclusion in staging. The “internal test accounts” filter feels universal until the fraud team needs to include them for an investigation.

Judgment Call 6: Slim CI vs full CI.

The decision: when a developer opens a PR, what does CI build?

Option A — Full build: every model, every test. Slow, expensive, complete.

Option B — Slim CI: state:modified+ --defer --state prod-artifacts — only build models you changed plus their downstream, defer everything else to prod.

What experienced engineers actually choose: Slim CI, always, on any project past ~50 models. The compute savings are massive (10x-100x typical) and you only pay it for the models actually affected by the change. The setup work — keeping prod artifacts uploaded somewhere CI can fetch — is one-time and well documented. Pair it with dbt build --fail-fast and you’ll catch issues fast and cheap.

The signal: if your CI takes more than 10 minutes, you should be on Slim CI. If you’ve ever skipped CI because it’s too slow, you definitely should.

Judgment Call 7: dbt Core (free, self-hosted) vs dbt Cloud vs Fusion.

The decision: which dbt to use.

Option A — dbt Core + your own orchestrator (Airflow, Dagster, GitHub Actions): cheapest, most control, you manage everything.

Option B — dbt Cloud: paid, hosted scheduler, web IDE, semantic layer, integrated CI.

Option C — dbt Fusion engine (Rust rewrite, public beta May 2025): faster parse, state-aware orchestration, runs on dbt Cloud platform.

What experienced engineers actually choose: dbt Core for small teams and embedded data engineers who already have Airflow/Dagster. dbt Cloud (or Fusion-on-Cloud) for teams that want to ship faster and have non-engineer analysts contributing — the IDE and managed CI/CD remove a lot of friction. Fusion is becoming the default in dbt Cloud as of 2025–2026; if you’re starting today on dbt Cloud, you’re probably on Fusion anyway.

The signal: if your data team is two engineers who like the command line, Core. If your data team is twenty mixed engineers and analysts who need a sane web UI, Cloud. If you’re scaling past a few thousand models and noticing parse times, evaluate Fusion specifically for that.

Judgment Call 8: Should this be a singular test or a generic test?

The decision: you’ve identified a data assertion. How do you write it?

Option A — Generic test: parameterizable, declared in YAML, can be reused across models.

Option B — Singular test: one-off SQL file, applies to one specific model.

What experienced engineers actually choose: Singular when the test is logic-specific to one model (“orders.total should equal sum of order_items.line_total”). Generic when the assertion repeats across models (“our tenant_id columns should never be null”).

The mistake to avoid: don’t promote singular tests to generic prematurely. Wait until you’d repeat the same logic at least three times. Generic tests have a real complexity tax (Jinja, parameterization) and a singular test you read once is way more legible than a generic test with five arguments you have to mentally instantiate every time.

Judgment Call 9: Ephemeral vs view for intermediate logic.

The decision: you have a small intermediate transformation. Persist as a view (queryable, cheap) or inline as ephemeral (no warehouse object)?

Option A — view: real warehouse object, queryable, debuggable, shows up in lineage as a real node.

Option B — ephemeral: inlined as CTE in downstream consumers, no warehouse footprint.

What experienced engineers actually choose: View, almost always. Views are essentially free (zero storage, just stored SQL), they’re queryable when debugging (“what does this intermediate look like?”), and they don’t bloat downstream compiled SQL. Ephemeral is only worth it for genuinely trivial logic that you specifically don’t want users to discover and query directly.

The signal: if you’ve ever wanted to select * from this_intermediate_thing to debug something, it should not be ephemeral.

Judgment Call 10: Run frequency.

The decision: how often does your dbt project run in production?

Options: hourly, every-few-hours, daily, on-event.

What experienced engineers actually choose: Match the slowest acceptable freshness across your stakeholders, then run on that schedule. If finance needs daily-fresh-by-9am numbers, run nightly. If product analytics needs hourly cohorts, run hourly. Don’t run more often than needed — every dbt run is warehouse compute.

The trap: someone says “real-time would be nice” and the team rebuilds everything for hourly runs that nobody actually uses for hourly decisions. Hourly is often a 4x or 24x compute increase for zero business outcome. Push back on freshness requests with “what specifically is the decision that requires fresher data?”

Judgment Call 11: Test severity: error vs warn.

The decision: a test fails. Does it stop the pipeline?

Option A — severity: error (default): test failure breaks the build, downstream models skip.

Option B — severity: warn: test failure is logged but doesn’t stop anything.

What experienced engineers actually choose: error for invariants (primary key uniqueness, not-null on join columns). warn for “this looks suspicious but isn’t catastrophic” — e.g. row count drops by 20% (might be a real signal, might be a holiday). Never set everything to error; you’ll get burned by flaky tests blocking deploys and people will start ignoring the alerts. Never set everything to warn; you’ll deploy broken data because nothing actually stopped the pipeline.

The signal: if you’re tempted to set severity to warn because a test is “flaky,” the test itself is wrong — fix the test, don’t downgrade the severity.


9. The Commands/APIs That Actually Matter

Most of dbt’s surface area is ignorable. Here’s the 20% you’ll use 80% of the time.

Daily commands

dbt build --select my_model+              # build my model and everything downstream, run tests
dbt build --select +my_model              # build everything upstream first, then my model
dbt build --select state:modified+ --defer --state prod-artifacts  # Slim CI / dev iteration
dbt run --select my_model --full-refresh  # rebuild incremental from scratch
dbt test --select my_model                # tests on this model and (by default) tests that reference it
dbt compile --select my_model             # produce compiled SQL without executing — invaluable for debugging Jinja
dbt deps                                  # install/update packages
dbt seed                                  # load CSV seeds
dbt snapshot                              # run all snapshots (separate from dbt run by design)

dbt build should be the default. dbt run is a sharp tool for development iterations. dbt compile is your debugging best friend — when Jinja is doing something weird, compile and read the output.

Selectors you’ll actually use

--select my_model                         # exact match
--select my_model+                        # plus all descendants
--select +my_model                        # plus all ancestors
--select +my_model+                       # full lineage around it
--select 1+my_model                       # only one hop upstream
--select staging.stripe                   # everything in models/staging/stripe/
--select tag:nightly                      # everything tagged
--select source:stripe+                   # everything downstream of any stripe source
--select state:modified+                  # what changed (vs prior manifest), plus descendants
--select result:fail+                     # rerun failed nodes from last run plus descendants
--exclude tag:slow                        # exclude something
--select "tag:nightly,config.materialized:incremental"  # AND (intersection)
--select tag:nightly tag:weekly           # OR (union, space-separated)

The + operator is the workhorse. The state:modified+ selector with --defer --state prod-artifacts is the foundation of Slim CI.

Important config keys

In a model’s {{ config() }} block:

{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    on_schema_change='append_new_columns',
    cluster_by=['customer_id', 'order_date'],     -- Snowflake / BigQuery
    partition_by={'field': 'order_date', 'data_type': 'date'},  -- BigQuery
    tags=['finance', 'critical'],
    schema='analytics_finance'
) }}

In dbt_project.yml, directory-level configs (apply to all models in a folder):

models:
  jaffle_shop:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      finance:
        +schema: finance
        +tags: ['finance']

Folder-level config is how you stay DRY. Don’t put materialized: view in every staging model — set it once at the directory level.

Jinja you’ll actually use

{{ ref('model_name') }}                   -- reference another model
{{ ref('package_name', 'model_name') }}   -- reference across packages
{{ source('source_name', 'table_name') }} -- reference raw source
{{ this }}                                -- the current model's relation (table/schema/db)
{{ var('start_date') }}                   -- from dbt_project.yml or --vars
{{ env_var('DBT_DATABASE') }}             -- from environment variables
{{ target.name }}                         -- 'dev', 'prod', etc.
{{ config(...) }}                         -- model config block
{% if is_incremental() %} ... {% endif %} -- incremental gate
{% if execute %} ... {% endif %}          -- only during execution, not parse
{% set my_var = 'value' %}                -- variable assignment
{% for x in some_list %} ... {% endfor %} -- loops (e.g. unioning many tables)
{% macro my_macro(arg1) %} ... {% endmacro %}  -- define a macro
{{ dbt_utils.generate_surrogate_key([...]) }}   -- common dbt-utils macro

You’ll write ref and source thousands of times. You’ll write is_incremental for every incremental model. The rest you’ll reach for as needed.

Useful ad-hoc commands

dbt docs generate && dbt docs serve       # local docs site
dbt ls --select my_model+ --resource-type model  # what would a selector pick?
dbt parse                                 # just parse, don't compile or run
dbt run-operation my_macro --args '{arg1: foo}'  # execute a macro standalone
dbt clean                                 # delete target/ and dbt_packages/
dbt source freshness                      # check if sources are within stale-ness limits

dbt ls is underused — when you’re not sure what a selector would pick, list it before running.


10. How It Breaks

Production dbt failures fall into a small number of categories. Here’s how to recognize them and what to do.

Failure mode 1: SQL compilation error.

Symptoms: Compilation Error in model my_model. The macro 'foo' is undefined. Or Encountered an error: Could not find ref at the parsing stage.

Root cause: the Jinja didn’t resolve. You called a macro that doesn’t exist; ref() points to a non-existent model; YAML is malformed.

Diagnose: dbt parse will reproduce the error without running anything. Read the error message carefully — dbt is generally precise about which file and line.

Fix: usually a typo, a missing package (run dbt deps), or a misnamed model. If it’s a ref() pointing nowhere, check for renames in the same PR.

Failure mode 2: Database error inside a model.

Symptoms: Database Error in model my_model. Column 'X' does not exist. Or syntax errors. Or Insufficient privileges.

Root cause: the SQL itself is wrong, or the warehouse rejected it. dbt did its job; the warehouse said no.

Diagnose: open target/run/<project>/<path>.sql, copy the SQL, paste it into a SQL console, and run it directly. You’ll see the same error in a more familiar tool. Now you can iterate without going through dbt’s wrapper.

Fix: fix the SQL. dbt isn’t involved.

Failure mode 3: Incremental model produces duplicates.

Symptoms: unique test fails on a model that was previously fine. Row counts are slowly growing larger than expected.

Root cause: source data has duplicates, or your incremental filter is overlapping with already-loaded data. A common pattern: filter is created_at > max(created_at) but created_at isn’t strictly monotonic, so the same row gets re-inserted on the next run.

Diagnose: dbt show --select my_model --limit 10 to inspect output. Then write a query: select unique_key, count(*) from my_model group by 1 having count(*) > 1. Look at the duplicates. Are the source rows duplicated? Is the filter wrong?

Fix: dedup inside the model with qualify row_number() over (...) = 1. Make sure your filter uses a column that’s strictly increasing (preferably _loaded_at from your ingestion tool, not source-system updated_at). Add a unique test so this is caught immediately next time.

Failure mode 4: Slim CI doesn’t pick up changes.

Symptoms: CI says “0 models to build” when you definitely changed something.

Root cause: Slim CI compares against a previous manifest. If the manifest is wrong (stale, from a different branch, missing) or the change is something state:modified can’t detect (a var() change that doesn’t appear in compiled SQL), CI will skip.

Diagnose: run dbt ls --select state:modified+ --state prod-artifacts locally with the same artifacts. See what dbt thinks changed. Check prod-artifacts/manifest.json — is it actually the latest prod state?

Fix: refresh the manifest artifact (re-upload from latest prod run). For changes that don’t show up in state:modified (vars, env_vars), do a --full-refresh of the affected models manually.

Failure mode 5: Snapshot mysteriously creates duplicates.

Symptoms: snapshot table has multiple “current” rows (rows where dbt_valid_to is null) for the same unique_key.

Root cause: unique_key isn’t actually unique in the source. Snapshots assume the unique key is unique; if it’s not, dbt’s matching logic breaks and you get parallel histories.

Diagnose: select unique_key, count(*) from source_table group by 1 having count(*) > 1. Find duplicates. They shouldn’t exist.

Fix: dedupe your snapshot input (use an ephemeral CTE inside the snapshot definition, or fix upstream). Add a uniqueness test to the source.

Failure mode 6: dbt run is slow but the warehouse looks idle.

Symptoms: dbt run takes hours; warehouse CPU is low; queries individually look fast.

Root cause: parse time. On large projects (thousands of models), Python’s parsing is the bottleneck — dbt is single-threaded reading and rendering all your YAML and Jinja. Or: threads is set to 1.

Diagnose: dbt parse --no-version-check and time it. If it takes 5+ minutes, you’ve got parse-bound runs. Check dbt_project.yml for threads setting.

Fix: bump threads to match warehouse capacity (8-16 typical for Snowflake; check your warehouse). For parse time, evaluate dbt Fusion (the Rust rewrite parses ~30x faster). For really large projects, consider dbt Mesh to split into smaller per-domain projects.

Failure mode 7: “It works in dev but breaks in prod.”

Symptoms: a model passes all tests in dev, fails in prod.

Root cause: dev typically processes a fraction of the data (a where target.name == 'dev' and created_at > current_date - 7 pattern). Prod sees the full history. Tests that pass on a clean 7-day slice can fail on years of historical data with edge cases.

Diagnose: run the failing model’s logic against full prod data using --defer against prod artifacts. Check the failures.

Fix: depends on the failure. Often it’s null handling for old data. Sometimes it’s a genuine duplicate from a one-time historical batch error. In any case, your dev sample needs to be representative enough to catch this — or your CI needs a step that runs against full data before merging.

The general debug workflow

When something is wrong and you don’t know what:

dbt parse                                                  # Did the project parse?
dbt compile --select problem_model                         # Did it compile? Read target/compiled/.../problem_model.sql
dbt run --select problem_model --debug                     # --debug shows the actual SQL submitted
# Open target/run/<project>/<path>.sql, copy SQL, paste into warehouse console
# Look at logs/dbt.log for warehouse-side errors
# dbt show --select problem_model --limit 50              # inspect output rows
# dbt ls --select state:modified+ --state path             # see what dbt thinks changed

The first thing to do is always read the compiled SQL. Most “dbt bugs” are misunderstandings about what dbt actually compiled to, and the compiled SQL tells you the truth.


11. The Taste Test

You can tell an experienced dbt engineer from a beginner one in five minutes by looking at their project. Here’s what to look for.

Good vs Bad: the staging layer

Bad:

-- stg_orders.sql
select
    *,
    case when status = 'cancelled' then null else amount end as net_amount,
    sum(amount) over (partition by customer_id) as customer_lifetime_value
from {{ source('shop', 'orders') }}
where customer_id not in (select id from {{ ref('internal_test_users') }})

This is wrong on multiple axes. Staging models should not have business logic, should not have joins to other models, should not aggregate, should not filter business-specifically. You’ve made every downstream consumer pay this filter and computation whether they want it or not. Your “staging” layer is actually a marts layer wearing a costume.

Good:

-- stg_shop__orders.sql
select
    id              as order_id,
    customer_id,
    status,
    amount_cents / 100.0 as amount,
    created_at,
    updated_at
from {{ source('shop', 'orders') }}

Renames, casts, light cleanup. That’s it. Anything else lives in marts where it belongs.

Good vs Bad: incremental logic

Bad:

{{ config(materialized='incremental', unique_key='order_id') }}

select * from {{ ref('stg_orders') }}
where created_at > '2024-01-01'

There’s no is_incremental() gate. The filter applies on every run, including the first full build. The unique_key is set, but there’s no actual deduplication of the source; you’re trusting it. No updated_at filter, so any row updated post-load is missed forever.

Good:

{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='append_new_columns',
    incremental_strategy='merge'
) }}

with new_data as (
    select * from {{ ref('stg_orders') }}
    {% if is_incremental() %}
      -- look back 3 days to capture late-arriving updates
      where updated_at > (select dateadd(day, -3, max(updated_at)) from {{ this }})
    {% endif %}
),

deduplicated as (
    select * from new_data
    qualify row_number() over (
        partition by order_id order by updated_at desc
    ) = 1
)

select * from deduplicated

Everything is explicit: what schema changes are allowed, what strategy to merge with, an explicit lookback to handle late-arriving data, an explicit dedup so duplicates from upstream don’t propagate. This is what production-grade incremental looks like.

Good vs Bad: project structure

Bad:

models/
  customers.sql
  orders.sql
  daily_revenue.sql
  ml_features_v2.sql
  ml_features_v2_FINAL.sql
  ml_features_v2_FINAL_REAL.sql
  test_dont_use.sql

No structure. Models named like office documents. No staging/marts separation. The deltas of human chaos are visible in the filenames.

Good:

models/
  staging/
    shop/
      _shop__sources.yml
      _shop__models.yml
      stg_shop__customers.sql
      stg_shop__orders.sql
    stripe/
      stg_stripe__charges.sql
  intermediate/
    finance/
      int_orders_with_refunds.sql
  marts/
    finance/
      _finance__models.yml
      orders.sql
      customers.sql
    marketing/
      campaign_attribution.sql

Source-system staging folders, business-domain marts folders, a clear staging → intermediate → marts flow. Naming follows stg_<source>__<entity>.sql, int_<entity>_<verb>.sql, mart names match business concepts.

Good vs Bad: tests

Bad:

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null

One test on one column. The model has 30 columns, complex business logic, joins, aggregations. None of it is verified.

Good:

models:
  - name: orders
    description: "One row per customer order. Source: Shop's `orders` table."
    columns:
      - name: order_id
        description: "Primary key."
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "FK to customers."
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'cancelled']
      - name: amount
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Primary key tested for uniqueness and not-null. Foreign key tested for referential integrity. Enums tested for accepted values. Numeric invariants tested. Documentation present. This is the bare minimum for a mart model.

Good vs Bad: the dbt_project.yml

Bad:

models:
  my_project:
    +materialized: view

Every model is a view. No layered defaults. Every model that needs to be a table will repeat that config inline.

Good:

models:
  jaffle_shop:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      finance:
        +schema: finance
      marketing:
        +schema: marketing

Defaults are set at the directory level. Marts go to per-domain schemas automatically. No model-level config needed unless deviating from the default. This is what “DRY at the project level” looks like.

The experienced engineer’s smell test

Look at a project you don’t own. In the first five minutes, ask:

  1. Are staging models trivial (rename + cast only)? If they have joins or business logic, the project’s layering is broken.
  2. Does every primary key have unique and not_null tests? If not, foundational data quality is missing.
  3. Is the dbt_project.yml doing folder-level config? If every model has its own materialization config, the project will be unmaintainable.
  4. Are incremental models gated with is_incremental() and dedupe logic? If not, you’re going to find duplicates eventually.
  5. Are there hardcoded table references like analytics.staging.foo? If so, the DAG is lying.
  6. Is dbt build the production command, or is it dbt run? If run, tests aren’t enforced.
  7. Is CI using state:modified+ and --defer? If full builds, you’re burning compute.

Seven questions. They tell you almost everything about the project’s maturity.


12. Where to Go Deeper

The best resources, ranked by what you’ll actually learn from them.

1. The dbt official docs at docs.getdbt.com

Read the “How we structure our dbt projects” guide first. Then “Best practices for materializations.” Then “Best practices for workflows.” These three pages contain more applied wisdom than most paid courses. The reference docs (Jinja functions, configuration keys) are also genuinely good — most are written by maintainers who use the tool in production. Read these when you have a specific question; they’re well-indexed and search works.

2. The Jaffle Shop sample project

The canonical dbt example project (github.com/dbt-labs/jaffle_shop). Clone it, set it up against DuckDB locally (no warehouse account needed), and read every file. It’s small enough to read end-to-end in an hour and demonstrates every layering convention dbt Labs recommends. When you’re confused about how a real project fits together, this is your reference.

3. Claire Carroll’s original “How we structure our dbt projects” post

The seminal essay that defined staging/intermediate/marts as the dominant pattern. The current docs page is an update of this. The original is shorter and sharper. Search “Carroll dbt structure” — it’s still findable on the dbt blog and in archived form. Read it for the why behind the patterns the docs now describe as orthodoxy.

4. The dbt Discourse forum (discourse.getdbt.com)

Specifically the “On the Limits of Incrementality” thread and similar deep-dive discussions. Higher signal than Stack Overflow, less noise than Slack. Search by topic — common pitfalls have already been litigated by hundreds of people. The “Coalesce 2022” and later conference talks (videos on YouTube) include the best practitioner-oriented content.

5. The dbt_project_evaluator package

Install it (dbt-labs/dbt_project_evaluator on dbt Hub), run it against your project, read every issue it flags. It’s a linter codifying dbt’s best practices. Going through its complaints on a real project will teach you more than any course — every flag corresponds to a real anti-pattern, and fixing them grows your taste.

6. Tristan Handy’s blog (roundup.getdbt.com)

The dbt Labs CEO’s writing. More strategic than tactical, but the “What is analytics engineering?” essay defines the role dbt is built for. Read it once — it’ll calibrate how you think about the broader role of the transformation layer in an org.

7. The SQLMesh docs and intro talks

Even if you stay on dbt, reading SQLMesh’s design docs sharpens your understanding of dbt — by contrast. Their “virtual data environments,” stateful execution, and column-level lineage are exactly the things dbt Core is not, and seeing the alternatives makes the dbt design choices visible. (Plus, dbt Fusion is partly a response to this competitive pressure, so understanding SQLMesh helps you evaluate Fusion.)

8. Build a real small project

The hands-on test: pick a small public dataset (NYC taxi trips, OpenAQ air quality, your own apartment electricity bills). Set up dbt against DuckDB or BigQuery sandbox. Build staging models, two or three marts, tests on every primary key, a snapshot, an incremental model, and Slim CI in GitHub Actions. The whole thing should take a weekend. You’ll learn more than reading any guide because you’ll hit the gotchas in this document personally.


Final note: dbt’s surface area looks large but its core is small. SQL files in folders, with ref() between them, compiled and submitted to a warehouse. Materializations decide how it lands. Tests assert it’s correct. Snapshots track history. Selectors and Slim CI make it operable at scale. Once those four core ideas in Section 5 are internalized, the rest of dbt is filling in details. Build the mental model first; everything else follows.


The ideas are mine. The writing is AI assisted