Use when developing BigQuery Dataform transformations, SQLX files, source declarations, or troubleshooting pipelines - enforces TDD workflow (tests first), ALWAYS using ${ref()} instead of hardcoded table paths, comprehensive columns: {} documentation, safety practices (--schema-suffix dev, --dry-run), .sqlx for new declarations, no schema config in operations/tests, and architecture patterns that prevent technical debt under time pressure
Core principle: Safety practices and proper architecture are NEVER optional in Dataform development, regardless of time pressure or business urgency.
REQUIRED FOUNDATION: This skill builds upon superpowers:test-driven-development. All TDD principles from that skill apply to Dataform development. This skill adapts TDD specifically for BigQuery Dataform SQLX files.
Official Documentation: For Dataform syntax, configuration options, and API reference, see https://cloud.google.com/dataform/docs
Best Practices Guide: For repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
Time pressure does not justify skipping safety checks or creating technical debt. The time "saved" by shortcuts gets multiplied into hours of debugging, broken dependencies, and production issues.
Use this skill for ANY Dataform work:
- Creating or modifying SQLX transformation files
- Adding or updating source declarations
- Troubleshooting compilation, dependency, or data quality issues

Especially use when:
- Under time pressure or business urgency (when shortcuts are most tempting)
- Working with external data sources that need declarations

Related Skills:
- superpowers:test-driven-development (required foundation)
- superpowers:brainstorming (before TDD, for unclear requirements)
- superpowers:systematic-debugging and superpowers:root-cause-tracing (troubleshooting)
- elements-of-style:writing-clearly-and-concisely (documentation prose)
These are ALWAYS required. No exceptions for deadlines, urgency, or "simple" tasks:
### `--schema-suffix dev` for Testing

```shell
# WRONG: Testing in production
dataform run --actions my_table

# CORRECT: Test in dev first
dataform run --schema-suffix dev --actions my_table
```
Why: Writes to schema_dev.my_table instead of schema_prod.my_table (or adds _dev suffix based on your configuration). Allows safe testing without impacting production data or dashboards.
### `--dry-run` Before Execution

```shell
# Check compilation
dataform compile

# Validate SQL without executing
dataform run --schema-suffix dev --dry-run --actions my_table

# Only then execute
dataform run --schema-suffix dev --actions my_table
```
Why: Catches SQL errors, missing dependencies, and cost estimation before using BigQuery slots.
WRONG: Using tables without source declarations
```sql
-- This will break dependency tracking
FROM `project_id.external_schema.table_name`
```
CORRECT: Create source declaration first
```sqlx
-- definitions/sources/external_system/table_name.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_schema",
  name: "table_name"
}
```

```sql
-- Then reference it
FROM ${ref("table_name")}
```
WRONG: Hardcoded table paths
```sql
-- NEVER do this
FROM `project.external_schema.table_name`
FROM `project.reporting_schema.customer_metrics`
SELECT * FROM project.source_schema.customers
```
CORRECT: Always use ${ref()}
```sql
-- Create source declaration first, then reference
FROM ${ref("table_name")}
FROM ${ref("customer_metrics")}
SELECT * FROM ${ref("customers")}
```
Why:
- `${ref()}` keeps the dependency graph intact, so Dataform runs tables in the correct order
- `--schema-suffix dev` only works when references resolve through `${ref()}`
- Renaming or moving tables stays safe; hardcoded paths break silently
Exception: None. There is NO valid reason to use hardcoded table paths in SQLX files.
WRONG: Including schema in ref() unnecessarily
```sql
FROM ${ref("external_schema", "sales_order")}
```
CORRECT: Use single argument when source declared
```sql
FROM ${ref("sales_order")}
```
When to use two-argument ref(): only when two declared sources share the same table name in different schemas and the reference must be disambiguated.
Why: the single-argument form stays valid if the source schema changes, and keeps references consistent across the project.
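When disambiguation is genuinely required, the two-argument form selects which declaration to resolve. A hedged sketch (schema and table names are hypothetical):

```sqlx
-- Two systems each declare a table named "orders"; the first argument
-- picks the right declaration (names here are illustrative)
SELECT order_id, order_total
FROM ${ref("erp_schema", "orders")}  -- the ERP copy, not the web-shop copy
```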
Always verify your output:
```shell
# Check row counts
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`project.schema_dev.my_table\`"

# Check for nulls in critical fields
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`project.schema_dev.my_table\`
   WHERE key_field IS NULL"
```
Why: Catches silent failures (empty tables, null values, bad joins) immediately.
Even for "quick" work, follow these patterns:
Reference: For detailed guidance on repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
```
definitions/
  sources/        # External data declarations
  intermediate/   # Transformations and business logic
  output/         # Final tables for consumption
    reports/      # Reporting tables
    marts/        # Data marts for specific use cases
```
Don't: Create monolithic queries directly in output layer
Do: Break into intermediate steps for reusability and testing
```sqlx
config {
  type: "incremental",
  uniqueKey: "order_id",
  bigquery: {
    partitionBy: "DATE(order_date)",
    clusterBy: ["customer_id", "product_id"]
  }
}
```
When to use incremental: Tables that grow daily (events, transactions, logs)
When to use full refresh: Small dimension tables, aggregations with lookback windows
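Inside an incremental table's SQL body, Dataform's `when(incremental(), ...)` and `self()` helpers restrict each run to new rows. A minimal sketch, assuming an `orders` source with an `order_date` column (column names are illustrative):

```sqlx
SELECT order_id, customer_id, product_id, order_date, order_total
FROM ${ref("orders")}
-- On incremental runs, only pull rows newer than what is already loaded
${when(incremental(),
  `WHERE order_date > (SELECT MAX(order_date) FROM ${self()})`)}
```

On the first run (or with `--full-refresh`) the WHERE clause is omitted and the whole table is built.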
```sqlx
config {
  type: "table",
  assertions: {
    uniqueKey: ["call_id"],
    nonNull: ["customer_phone_number", "start_time"],
    rowConditions: ["duration >= 0"]
  }
}
```
Why: Catches data quality issues automatically during pipeline runs.
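Conceptually, the `uniqueKey` assertion above compiles to a duplicate-finding query; any returned row fails the pipeline run. A rough sketch of the equivalent check (table name is a placeholder):

```sql
-- Approximate equivalent of uniqueKey: ["call_id"]
SELECT call_id, COUNT(*) AS occurrences
FROM my_schema.my_table  -- the asserted table
GROUP BY call_id
HAVING COUNT(*) > 1      -- any duplicates cause the assertion to fail
```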
STRONGLY PREFER: .sqlx files for ALL new declarations
```sqlx
-- definitions/sources/external_system/table_name.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_schema",
  name: "table_name",
  columns: {
    id: "Unique identifier for records",
    // ... more columns
  }
}
```
ACCEPTABLE (legacy only): .js files for existing declarations
```js
// definitions/sources/legacy_declarations.js (existing file)
declare({
  database: "project_id",
  schema: "source_schema",
  name: "customers"
});
```
Rule: ALL NEW source declarations MUST be .sqlx files. Existing .js declarations can remain but should be migrated to .sqlx when modifying them.
Why: .sqlx files support column documentation, are more maintainable, and integrate better with Dataform's dependency tracking.
Operations: Files in definitions/operations/ should NOT include schema: config
```sqlx
-- CORRECT
config {
  type: "operations",
  tags: ["daily"]
}

-- WRONG
config {
  type: "operations",
  schema: "dataform",  // DON'T specify schema
  tags: ["daily"]
}
```
Tests/Assertions: Files in definitions/test/ should NOT include schema: config
```sqlx
-- CORRECT
config {
  type: "assertion",
  description: "Check for duplicates"
}

-- WRONG
config {
  type: "assertion",
  schema: "dataform_assertions",  // DON'T specify schema
  description: "Check for duplicates"
}
```
Why: Operations live in the default dataform schema and assertions live in dataform_assertions schema (configured in workflow_settings.yaml). Specifying schema explicitly can cause conflicts.
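Those defaults are set once in workflow_settings.yaml. A hedged sketch of the relevant keys (Dataform core 3.x; values are illustrative, not prescriptions):

```yaml
# workflow_settings.yaml -- example values only
defaultProject: my-gcp-project
defaultLocation: US
defaultDataset: dataform                      # operations land here
defaultAssertionDataset: dataform_assertions  # assertions land here
```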
All tables with type: "table" MUST include comprehensive columns: {} documentation in the config block.
Writing Clear Documentation: When writing column descriptions, commit messages, or any prose that humans will read, use elements-of-style:writing-clearly-and-concisely to ensure clarity and conciseness.
WRONG: Table without column documentation
```sqlx
config {
  type: "table",
  schema: "reporting"
}

SELECT customer_id, total_revenue FROM ${ref("orders")}
```
CORRECT: Complete column documentation
```sqlx
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier from source system",
    total_revenue: "Sum of all order amounts in USD, excluding refunds"
  }
}

SELECT customer_id, total_revenue FROM ${ref("orders")}
```
Column descriptions should be derived from:
- Source system documentation
- Upstream table column descriptions
- The business logic applied in the transformation
Example with ERP source documentation:
```sqlx
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier from ERP system",
    customer_name: "Customer legal business name",
    account_group: "Customer classification code for account management",
    credit_limit: "Maximum allowed credit in USD"
  }
}
```
When applicable, source declarations should also document columns:
```sqlx
-- definitions/sources/external_api/events.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_api",
  name: "events",
  description: "Event records from external API with enriched data",
  columns: {
    event_id: "Unique event identifier from API",
    user_id: "User identifier who triggered the event",
    event_type: "Type of event (click, view, purchase, etc.)",
    timestamp: "UTC timestamp when event occurred",
    properties: "JSON object containing event-specific properties"
  }
}
```
Why document sources: Downstream tables inherit and extend these descriptions, creating documentation consistency across the pipeline.
REQUIRED BACKGROUND: You MUST understand and follow superpowers:test-driven-development
BEFORE TDD: When creating NEW features with unclear requirements, use superpowers:brainstorming FIRST to refine rough ideas into clear designs. Only start TDD once you have a clear understanding of what needs to be built.
When creating NEW features or tables in Dataform, apply the TDD cycle:
The superpowers:test-driven-development skill provides the foundational TDD principles. This section adapts those principles specifically for Dataform tables and SQLX files.
WRONG: Implementation-first approach
1. Write SQLX transformation
2. Test manually with bq query
3. "It works, ship it"
CORRECT: Test-first approach
1. Write data quality assertions first
2. Write unit tests for business logic
3. Run tests - they should FAIL (table doesn't exist yet)
4. Write SQLX transformation
5. Run tests - they should PASS
6. Refactor transformation if needed
Step 1: Write assertions first (definitions/assertions/assert_customer_metrics.sqlx)
```sqlx
config {
  type: "assertion",
  description: "Customer metrics must have valid data"
}

-- This WILL fail initially (table doesn't exist)
SELECT 'Duplicate customer_id' AS test
FROM ${ref("customer_metrics")}
GROUP BY customer_id
HAVING COUNT(*) > 1

UNION ALL

SELECT 'Negative lifetime value' AS test
FROM ${ref("customer_metrics")}
WHERE lifetime_value < 0
```
Step 2: Run tests - watch them fail
```shell
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# ERROR: Table customer_metrics does not exist ✓ EXPECTED
```
Step 3: Write minimal implementation (definitions/output/reports/customer_metrics.sqlx)
```sqlx
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier",
    lifetime_value: "Total revenue from customer in USD"
  }
}

SELECT
  customer_id,
  SUM(order_total) AS lifetime_value
FROM ${ref("orders")}
GROUP BY customer_id
```
Step 4: Run tests - watch them pass
```shell
dataform run --schema-suffix dev --actions customer_metrics
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# No rows returned ✓ TESTS PASS
```
If you're thinking:
- "I'll add assertions after the table works"
- "Tests after achieve the same result"
- "It's just a quick report, it doesn't need tests"
All of these mean: You're skipping TDD. Write tests first, then implementation.
See also: The superpowers:test-driven-development skill contains additional TDD rationalizations and red flags that apply universally to all code, including Dataform SQLX files.
| Task | Command | Notes |
|---|---|---|
| Compile only | dataform compile | Check syntax, no BigQuery execution |
| Dry run | dataform run --schema-suffix dev --dry-run --actions table_name | Validate SQL, estimate cost |
| Test in dev | dataform run --schema-suffix dev --actions table_name | Safe execution in dev environment |
| Run with dependencies | dataform run --schema-suffix dev --include-deps --actions table_name | Run upstream dependencies first |
| Run by tag | dataform run --schema-suffix dev --tags looker | Run all tables with tag |
| Production deploy | dataform run --actions table_name | Only after dev testing succeeds |
| Excuse | Reality | Fix |
|---|---|---|
| "Too urgent to test in dev" | Production failures waste MORE time than dev testing | 3 minutes testing saves 60 minutes debugging |
| "It's just a quick report" | "Quick" reports become permanent tables | Use proper architecture from start |
| "Business is waiting" | Broken output wastes stakeholder time | Correct results delivered 10 minutes later > wrong results now |
| "Hardcoding table path is faster than ${ref()}" | Breaks dependency tracking, creates maintenance nightmare | Create source declaration, use ${ref()} (30 seconds) |
| "I'll refactor it later" | Technical debt rarely gets fixed | Do it right the first time (saves time overall) |
| "Correctness over elegance" | Architecture = maintainability, not elegance | Proper structure IS correctness |
| "I'll add tests after" | After = never | Write tests FIRST (TDD), then implementation |
| "I'll add documentation after" | After = never | Add columns: {} in config block immediately |
| "Working late, just need it working" | Exhaustion causes mistakes | Discipline matters MORE when tired |
| "Column docs are optional for internal tables" | All tables become external eventually | Document everything, always |
| "Tests after achieve same result" | Tests-after = checking what it does; tests-first = defining what it should do | TDD catches design flaws early |
If you're thinking any of these thoughts, STOP and follow the skill:
- "I'll skip `--schema-suffix dev` just this once"
- "No time for `--dry-run`"

All of these mean: You're about to create problems. Follow the non-negotiable practices.
```sql
-- WRONG: Direct table reference
FROM `project.external_schema.contacts`

-- CORRECT: Declare source first
FROM ${ref("contacts")}
```
Fix: Create source declaration in definitions/sources/ before using in queries.
```sql
-- WRONG: When source exists
FROM ${ref("dataset_name", "table_name")}

-- CORRECT
FROM ${ref("table_name")}
```
Fix: Use single-argument ref() when source declaration exists. Dataform handles full path resolution.
Symptom: "I'll deploy directly to production because it's urgent"
Fix: --schema-suffix dev takes 30 seconds longer than production deploy. Production failures take hours to fix.
Symptom: 200-line SQLX file with 5 CTEs doing multiple transformations
Fix: Break into intermediate tables. Each table should do ONE transformation clearly.
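For example, the monolithic file can be split so each table does one job. A hedged sketch (file and column names are illustrative):

```sqlx
-- definitions/intermediate/order_totals.sqlx: ONE aggregation step
config { type: "table" }

SELECT customer_id, SUM(order_total) AS total_revenue
FROM ${ref("orders")}
GROUP BY customer_id

-- A downstream definitions/output/reports/ file would then join
-- ${ref("order_totals")} with customer attributes, keeping each
-- file to a single clear transformation
```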
Symptom: Table config without column descriptions
Fix: Add comprehensive columns: {} block to EVERY table with type: "table". Get descriptions from source docs, upstream tables, or business logic.
Symptom: Creating SQLX file, then adding assertions afterward (or never)
Fix: Follow TDD cycle - write assertions first, watch them fail, write implementation, watch tests pass.
Symptom: Creating NEW definitions/sources/sources.js files with declare() functions
Fix: Create .sqlx files in definitions/sources/[system]/[table].sqlx with proper config blocks and column documentation. Existing .js files can remain until they need modification.
Symptom: Using backtick-quoted table paths in queries
```sql
FROM `project.external_api.events`
SELECT * FROM project.source_schema.customers
```
Fix: ALWAYS use ${ref()} after creating source declarations
```sql
FROM ${ref("events")}
SELECT * FROM ${ref("customers")}
```
Why critical: Hardcoded paths break dependency tracking, prevent --schema-suffix from working, and make refactoring impossible.
Symptom: Operations or test files with explicit schema configuration
```sqlx
config {
  type: "operations",
  schema: "dataform",  // Wrong!
}
```
Fix: Remove schema: config - operations and tests use default schemas from workflow_settings.yaml
When under extreme time pressure (board meeting in 2 hours, production down, stakeholder waiting):
- Still use `--schema-suffix dev` (adds ~30 seconds)
- Still run `--dry-run` (catches errors before they cost hours)
- Still create source declarations and use `${ref()}` (~30 seconds each)
- Still verify output with row counts before sharing
The bottom line: Safety practices save time. Skipping them wastes time. Even under pressure.
RECOMMENDED APPROACH: When encountering ANY bug, test failure, or unexpected behavior, use superpowers:systematic-debugging before attempting fixes. For errors deep in execution or cascading failures, use superpowers:root-cause-tracing to identify the original trigger.
Official Reference: For Dataform-specific errors, configuration issues, or syntax questions, consult https://cloud.google.com/dataform/docs
Quick fixes:
- Verify the source declaration exists in `definitions/sources/`
- Run `dataform compile` to see resolved SQL

If issue persists: Use superpowers:systematic-debugging for structured root cause investigation.
Quick fixes:
- Use `${ref("table_name")}`, not direct table references

If issue persists: Use superpowers:root-cause-tracing to trace the dependency chain back to the source of the cycle.
Quick fixes:
- Check `partitionBy` and `clusterBy` settings on large tables
- Use `--dry-run` to estimate bytes scanned before executing

If issue persists: Use superpowers:systematic-debugging to investigate query performance systematically.
Scenario: "Quick" report created without source declarations, skipping dev testing.
Cost:
- Broken dependency tracking when upstream tables change
- Failures discovered in production by stakeholders, not by tests
- Hours of debugging that minutes of dev testing would have prevented

With proper practices:
- Source declarations and ${ref()}: seconds per table
- Dev testing with --schema-suffix dev and --dry-run: a few minutes
- Correct results, intact dependency graph, no rework
Takeaway: Discipline is faster than shortcuts.