Use when developing BigQuery Dataform transformations, SQLX files, source declarations, or troubleshooting pipelines - enforces TDD workflow (tests first), ALWAYS using ${ref()} instead of hardcoded table paths, comprehensive columns: {} documentation, safety practices (--schema-suffix dev, --dry-run), .sqlx for new declarations, no schema config in operations/tests, and architecture patterns that prevent technical debt under time pressure
Core principle: Safety practices and proper architecture are NEVER optional in Dataform development, regardless of time pressure or business urgency.
REQUIRED FOUNDATION: This skill builds upon superpowers:test-driven-development. All TDD principles from that skill apply to Dataform development. This skill adapts TDD specifically for BigQuery Dataform SQLX files.
Official Documentation: For Dataform syntax, configuration options, and API reference, see https://cloud.google.com/dataform/docs
Best Practices Guide: For repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
Time pressure does not justify skipping safety checks or creating technical debt. The time "saved" by shortcuts gets multiplied into hours of debugging, broken dependencies, and production issues.
Use this skill for ANY Dataform work:
- Creating or modifying SQLX transformation files
- Adding or updating source declarations
- Troubleshooting compilation, dependency, or data quality issues

Especially use when:
- Under time pressure or business urgency (when shortcuts are most tempting)
- Working with external data sources that need declarations

Related Skills:
- superpowers:test-driven-development (required foundation)
- superpowers:brainstorming (before TDD, for unclear requirements)
- superpowers:systematic-debugging and superpowers:root-cause-tracing (troubleshooting)
- elements-of-style:writing-clearly-and-concisely (documentation prose)
These are ALWAYS required. No exceptions for deadlines, urgency, or "simple" tasks:
### `--schema-suffix dev` for Testing

```shell
# WRONG: Testing in production
dataform run --actions my_table

# CORRECT: Test in dev first
dataform run --schema-suffix dev --actions my_table
```
Why: Writes to schema_dev.my_table instead of schema_prod.my_table (or adds _dev suffix based on your configuration). Allows safe testing without impacting production data or dashboards.
### `--dry-run` Before Execution

```shell
# Check compilation
dataform compile

# Validate SQL without executing
dataform run --schema-suffix dev --dry-run --actions my_table

# Only then execute
dataform run --schema-suffix dev --actions my_table
```
Why: Catches SQL errors, missing dependencies, and cost estimation before using BigQuery slots.
WRONG: Using tables without source declarations
```sql
-- This will break dependency tracking
FROM `project_id.external_schema.table_name`
```
CORRECT: Create source declaration first
```sqlx
-- definitions/sources/external_system/table_name.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_schema",
  name: "table_name"
}
```

```sql
-- Then reference it
FROM ${ref("table_name")}
```
WRONG: Hardcoded table paths
```sql
-- NEVER do this
FROM `project.external_schema.table_name`
FROM `project.reporting_schema.customer_metrics`
SELECT * FROM project.source_schema.customers
```
CORRECT: Always use ${ref()}
```sql
-- Create source declaration first, then reference
FROM ${ref("table_name")}
FROM ${ref("customer_metrics")}
SELECT * FROM ${ref("customers")}
```
Why:
- `${ref()}` keeps the dependency graph intact, so Dataform runs tables in the correct order
- `--schema-suffix dev` only works when references resolve through `${ref()}`
- Renaming or moving tables stays safe; hardcoded paths break silently
Exception: None. There is NO valid reason to use hardcoded table paths in SQLX files.
WRONG: Including schema in ref() unnecessarily
```sql
FROM ${ref("external_schema", "sales_order")}
```
CORRECT: Use single argument when source declared
```sql
FROM ${ref("sales_order")}
```
When to use two-argument ref(): only when two declared sources share the same table name in different schemas and the reference must be disambiguated.
Why: the single-argument form stays valid if the source schema changes, and keeps references consistent across the project.
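When disambiguation is genuinely required, the two-argument form selects which declaration to resolve. A hedged sketch (schema and table names are hypothetical):

```sqlx
-- Two systems each declare a table named "orders"; the first argument
-- picks the right declaration (names here are illustrative)
SELECT order_id, order_total
FROM ${ref("erp_schema", "orders")}  -- the ERP copy, not the web-shop copy
```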
Always verify your output:
```shell
# Check row counts
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`project.schema_dev.my_table\`"

# Check for nulls in critical fields
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`project.schema_dev.my_table\`
   WHERE key_field IS NULL"
```
Why: Catches silent failures (empty tables, null values, bad joins) immediately.
Even for "quick" work, follow these patterns:
Reference: For detailed guidance on repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
```
definitions/
  sources/        # External data declarations
  intermediate/   # Transformations and business logic
  output/         # Final tables for consumption
    reports/      # Reporting tables
    marts/        # Data marts for specific use cases
```
Don't: Create monolithic queries directly in output layer
Do: Break into intermediate steps for reusability and testing
```sqlx
config {
  type: "incremental",
  uniqueKey: "order_id",
  bigquery: {
    partitionBy: "DATE(order_date)",
    clusterBy: ["customer_id", "product_id"]
  }
}
```
When to use incremental: Tables that grow daily (events, transactions, logs)
When to use full refresh: Small dimension tables, aggregations with lookback windows
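Inside an incremental table's SQL body, Dataform's `when(incremental(), ...)` and `self()` helpers restrict each run to new rows. A minimal sketch, assuming an `orders` source with an `order_date` column (column names are illustrative):

```sqlx
SELECT order_id, customer_id, product_id, order_date, order_total
FROM ${ref("orders")}
-- On incremental runs, only pull rows newer than what is already loaded
${when(incremental(),
  `WHERE order_date > (SELECT MAX(order_date) FROM ${self()})`)}
```

On the first run (or with `--full-refresh`) the WHERE clause is omitted and the whole table is built.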
```sqlx
config {
  type: "table",
  assertions: {
    uniqueKey: ["call_id"],
    nonNull: ["customer_phone_number", "start_time"],
    rowConditions: ["duration >= 0"]
  }
}
```
Why: Catches data quality issues automatically during pipeline runs.
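Conceptually, the `uniqueKey` assertion above compiles to a duplicate-finding query; any returned row fails the pipeline run. A rough sketch of the equivalent check (table name is a placeholder):

```sql
-- Approximate equivalent of uniqueKey: ["call_id"]
SELECT call_id, COUNT(*) AS occurrences
FROM my_schema.my_table  -- the asserted table
GROUP BY call_id
HAVING COUNT(*) > 1      -- any duplicates cause the assertion to fail
```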
STRONGLY PREFER: .sqlx files for ALL new declarations
```sqlx
-- definitions/sources/external_system/table_name.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_schema",
  name: "table_name",
  columns: {
    id: "Unique identifier for records",
    // ... more columns
  }
}
```
ACCEPTABLE (legacy only): .js files for existing declarations
```js
// definitions/sources/legacy_declarations.js (existing file)
declare({
  database: "project_id",
  schema: "source_schema",
  name: "customers"
});
```
Rule: ALL NEW source declarations MUST be .sqlx files. Existing .js declarations can remain but should be migrated to .sqlx when modifying them.
Why: .sqlx files support column documentation, are more maintainable, and integrate better with Dataform's dependency tracking.
Operations: Files in definitions/operations/ should NOT include schema: config
```sqlx
-- CORRECT
config {
  type: "operations",
  tags: ["daily"]
}

-- WRONG
config {
  type: "operations",
  schema: "dataform",  // DON'T specify schema
  tags: ["daily"]
}
```
Tests/Assertions: Files in definitions/test/ should NOT include schema: config
```sqlx
-- CORRECT
config {
  type: "assertion",
  description: "Check for duplicates"
}

-- WRONG
config {
  type: "assertion",
  schema: "dataform_assertions",  // DON'T specify schema
  description: "Check for duplicates"
}
```
Why: Operations live in the default dataform schema and assertions live in dataform_assertions schema (configured in workflow_settings.yaml). Specifying schema explicitly can cause conflicts.
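Those defaults are set once in workflow_settings.yaml. A hedged sketch of the relevant keys (Dataform core 3.x; values are illustrative, not prescriptions):

```yaml
# workflow_settings.yaml -- example values only
defaultProject: my-gcp-project
defaultLocation: US
defaultDataset: dataform                      # operations land here
defaultAssertionDataset: dataform_assertions  # assertions land here
```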
All tables with type: "table" MUST include comprehensive columns: {} documentation in the config block.
Writing Clear Documentation: When writing column descriptions, commit messages, or any prose that humans will read, use elements-of-style:writing-clearly-and-concisely to ensure clarity and conciseness.
WRONG: Table without column documentation
```sqlx
config {
  type: "table",
  schema: "reporting"
}

SELECT customer_id, total_revenue FROM ${ref("orders")}
```
CORRECT: Complete column documentation
```sqlx
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier from source system",
    total_revenue: "Sum of all order amounts in USD, excluding refunds"
  }
}

SELECT customer_id, total_revenue FROM ${ref("orders")}
```
Column descriptions should be derived from:
- Source system documentation
- Upstream table column descriptions
- The business logic applied in the transformation
Example with ERP source documentation:
```sqlx
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier from ERP system",
    customer_name: "Customer legal business name",
    account_group: "Customer classification code for account management",
    credit_limit: "Maximum allowed credit in USD"
  }
}
```
When applicable, source declarations should also document columns:
```sqlx
-- definitions/sources/external_api/events.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_api",
  name: "events",
  description: "Event records from external API with enriched data",
  columns: {
    event_id: "Unique event identifier from API",
    user_id: "User identifier who triggered the event",
    event_type: "Type of event (click, view, purchase, etc.)",
    timestamp: "UTC timestamp when event occurred",
    properties: "JSON object containing event-specific properties"
  }
}
```
Why document sources: Downstream tables inherit and extend these descriptions, creating documentation consistency across the pipeline.
REQUIRED BACKGROUND: You MUST understand and follow superpowers:test-driven-development
BEFORE TDD: When creating NEW features with unclear requirements, use superpowers:brainstorming FIRST to refine rough ideas into clear designs. Only start TDD once you have a clear understanding of what needs to be built.
When creating NEW features or tables in Dataform, apply the TDD cycle:
The superpowers:test-driven-development skill provides the foundational TDD principles. This section adapts those principles specifically for Dataform tables and SQLX files.
WRONG: Implementation-first approach
1. Write SQLX transformation
2. Test manually with bq query
3. "It works, ship it"
CORRECT: Test-first approach
1. Write data quality assertions first
2. Write unit tests for business logic
3. Run tests - they should FAIL (table doesn't exist yet)
4. Write SQLX transformation
5. Run tests - they should PASS
6. Refactor transformation if needed
Step 1: Write assertions first (definitions/assertions/assert_customer_metrics.sqlx)
```sqlx
config {
  type: "assertion",
  description: "Customer metrics must have valid data"
}

-- This WILL fail initially (table doesn't exist)
SELECT 'Duplicate customer_id' AS test
FROM ${ref("customer_metrics")}
GROUP BY customer_id
HAVING COUNT(*) > 1

UNION ALL

SELECT 'Negative lifetime value' AS test
FROM ${ref("customer_metrics")}
WHERE lifetime_value < 0
```
Step 2: Run tests - watch them fail
```shell
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# ERROR: Table customer_metrics does not exist ✓ EXPECTED
```
Step 3: Write minimal implementation (definitions/output/reports/customer_metrics.sqlx)
```sqlx
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier",
    lifetime_value: "Total revenue from customer in USD"
  }
}

SELECT
  customer_id,
  SUM(order_total) AS lifetime_value
FROM ${ref("orders")}
GROUP BY customer_id
```
Step 4: Run tests - watch them pass
```shell
dataform run --schema-suffix dev --actions customer_metrics
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# No rows returned ✓ TESTS PASS
```
If you're thinking:
- "I'll add assertions after the table works"
- "Tests after achieve the same result"
- "It's just a quick report, it doesn't need tests"
All of these mean: You're skipping TDD. Write tests first, then implementation.
See also: The superpowers:test-driven-development skill contains additional TDD rationalizations and red flags that apply universally to all code, including Dataform SQLX files.
| Task | Command | Notes |
|---|---|---|
| Compile only | dataform compile | Check syntax, no BigQuery execution |
| Dry run | dataform run --schema-suffix dev --dry-run --actions table_name | Validate SQL, estimate cost |
| Test in dev | dataform run --schema-suffix dev --actions table_name | Safe execution in dev environment |
| Run with dependencies | dataform run --schema-suffix dev --include-deps --actions table_name | Run upstream dependencies first |
| Run by tag | dataform run --schema-suffix dev --tags looker | Run all tables with tag |
| Production deploy | dataform run --actions table_name | Only after dev testing succeeds |
| Excuse | Reality | Fix |
|---|---|---|
| "Too urgent to test in dev" | Production failures waste MORE time than dev testing | 3 minutes testing saves 60 minutes debugging |
| "It's just a quick report" | "Quick" reports become permanent tables | Use proper architecture from start |
| "Business is waiting" | Broken output wastes stakeholder time | Correct results delivered 10 minutes later > wrong results now |
| "Hardcoding table path is faster than ${ref()}" | Breaks dependency tracking, creates maintenance nightmare | Create source declaration, use ${ref()} (30 seconds) |
| "I'll refactor it later" | Technical debt rarely gets fixed | Do it right the first time (saves time overall) |
| "Correctness over elegance" | Architecture = maintainability, not elegance | Proper structure IS correctness |
| "I'll add tests after" | After = never | Write tests FIRST (TDD), then implementation |
| "I'll add documentation after" | After = never | Add columns: {} in config block immediately |
| "Working late, just need it working" | Exhaustion causes mistakes | Discipline matters MORE when tired |
| "Column docs are optional for internal tables" | All tables become external eventually | Document everything, always |
| "Tests after achieve same result" | Tests-after = checking what it does; tests-first = defining what it should do | TDD catches design flaws early |
If you're thinking any of these thoughts, STOP and follow the skill:
- "I'll skip `--schema-suffix dev` just this once"
- "No time for `--dry-run`"

All of these mean: You're about to create problems. Follow the non-negotiable practices.
```sql
-- WRONG: Direct table reference
FROM `project.external_schema.contacts`

-- CORRECT: Declare source first
FROM ${ref("contacts")}
```
Fix: Create source declaration in definitions/sources/ before using in queries.
```sql
-- WRONG: When source exists
FROM ${ref("dataset_name", "table_name")}

-- CORRECT
FROM ${ref("table_name")}
```
Fix: Use single-argument ref() when source declaration exists. Dataform handles full path resolution.
Symptom: "I'll deploy directly to production because it's urgent"
Fix: --schema-suffix dev takes 30 seconds longer than production deploy. Production failures take hours to fix.
Symptom: 200-line SQLX file with 5 CTEs doing multiple transformations
Fix: Break into intermediate tables. Each table should do ONE transformation clearly.
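For example, the monolithic file can be split so each table does one job. A hedged sketch (file and column names are illustrative):

```sqlx
-- definitions/intermediate/order_totals.sqlx: ONE aggregation step
config { type: "table" }

SELECT customer_id, SUM(order_total) AS total_revenue
FROM ${ref("orders")}
GROUP BY customer_id

-- A downstream definitions/output/reports/ file would then join
-- ${ref("order_totals")} with customer attributes, keeping each
-- file to a single clear transformation
```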
Symptom: Table config without column descriptions
Fix: Add comprehensive columns: {} block to EVERY table with type: "table". Get descriptions from source docs, upstream tables, or business logic.
Symptom: Creating SQLX file, then adding assertions afterward (or never)
Fix: Follow TDD cycle - write assertions first, watch them fail, write implementation, watch tests pass.
Symptom: Creating NEW definitions/sources/sources.js files with declare() functions
Fix: Create .sqlx files in definitions/sources/[system]/[table].sqlx with proper config blocks and column documentation. Existing .js files can remain until they need modification.
Symptom: Using backtick-quoted table paths in queries
```sql
FROM `project.external_api.events`
SELECT * FROM project.source_schema.customers
```
Fix: ALWAYS use ${ref()} after creating source declarations
```sql
FROM ${ref("events")}
SELECT * FROM ${ref("customers")}
```
Why critical: Hardcoded paths break dependency tracking, prevent --schema-suffix from working, and make refactoring impossible.
Symptom: Operations or test files with explicit schema configuration
```sqlx
config {
  type: "operations",
  schema: "dataform",  // Wrong!
}
```
Fix: Remove schema: config - operations and tests use default schemas from workflow_settings.yaml
When under extreme time pressure (board meeting in 2 hours, production down, stakeholder waiting):
- Still use `--schema-suffix dev` (adds ~30 seconds)
- Still run `--dry-run` (catches errors before they cost hours)
- Still create source declarations and use `${ref()}` (~30 seconds each)
- Still verify output with row counts before sharing
The bottom line: Safety practices save time. Skipping them wastes time. Even under pressure.
RECOMMENDED APPROACH: When encountering ANY bug, test failure, or unexpected behavior, use superpowers:systematic-debugging before attempting fixes. For errors deep in execution or cascading failures, use superpowers:root-cause-tracing to identify the original trigger.
Official Reference: For Dataform-specific errors, configuration issues, or syntax questions, consult https://cloud.google.com/dataform/docs
Quick fixes:
- Verify the source declaration exists in `definitions/sources/`
- Run `dataform compile` to see resolved SQL

If issue persists: Use superpowers:systematic-debugging for structured root cause investigation.
Quick fixes:
- Use `${ref("table_name")}`, not direct table references

If issue persists: Use superpowers:root-cause-tracing to trace the dependency chain back to the source of the cycle.
Quick fixes:
- Check `partitionBy` and `clusterBy` settings on large tables
- Use `--dry-run` to estimate bytes scanned before executing

If issue persists: Use superpowers:systematic-debugging to investigate query performance systematically.
Scenario: "Quick" report created without source declarations, skipping dev testing.
Cost:
- Broken dependency tracking when upstream tables change
- Failures discovered in production by stakeholders, not by tests
- Hours of debugging that minutes of dev testing would have prevented

With proper practices:
- Source declarations and ${ref()}: seconds per table
- Dev testing with --schema-suffix dev and --dry-run: a few minutes
- Correct results, intact dependency graph, no rework
Takeaway: Discipline is faster than shortcuts.