Audit and improve data observability coverage across your pipeline including monitoring, alerting, freshness, and test coverage gaps. Use when assessing observability maturity, responding to data incidents, or implementing a monitoring strategy. Triggers: 'observability audit', 'data reliability', 'monitor data', 'data health check', 'monitoring coverage', 'data downtime', 'pipeline reliability'.
I'll audit your current observability setup, identify gaps, and produce a prioritized remediation plan.
Read .claude/data-stack-context.md. Key inputs: observability tool, dbt tests in place, alerting channels, recent incidents.
Run these three CLI tools first to get a complete picture before recommending changes:
- `node tools/clis/manifest-coverage.js --manifest target/manifest.json` — identify test gaps across all models.
- `node tools/clis/source-freshness.js --results target/sources.json` — identify stale sources.
- `node tools/clis/test-results.js --results target/run_results.json` — review recent test failures.

If `target/manifest.json` doesn't exist, run `dbt compile` first to generate it.
| Level | What you have | What's missing |
|---|---|---|
| 0 — Reactive | No monitoring; issues found by users | Everything |
| 1 — Basic | Source freshness checks, PK tests | Volume, distribution, alerting |
| 2 — Proactive | Automated tests + Slack alerts | Anomaly detection, root cause tools |
| 3 — Predictive | Anomaly detection, lineage-aware alerts | ML-based forecasting, incident correlation |
| 4 — Automated | Self-healing pipelines, auto-triage | Rare; requires significant investment |
Most teams should target Level 2-3.
Run these checks to assess your current state:
```bash
# Count test coverage by model
# (--output-keys ensures depends_on is present in the JSON output)
dbt ls --resource-type test --output json --output-keys "name depends_on" \
| python3 -c "
import json, sys, collections
counts = collections.defaultdict(int)
for line in sys.stdin:
    try:
        obj = json.loads(line)
        parent = obj.get('depends_on', {}).get('nodes', [''])[0]
        # Replace my_project with your dbt project name
        model = parent.replace('model.my_project.', '')
        counts[model] += 1
    except (json.JSONDecodeError, IndexError):
        pass
for model, count in sorted(counts.items(), key=lambda x: x[1]):
    print(f'{count:3d} tests: {model}')
"
```
```bash
# Marts models without an enforced contract (schema drift risk)
dbt ls --resource-type model --output json --output-keys "name config fqn" \
| python3 -c "
import json, sys
for line in sys.stdin:
    try:
        obj = json.loads(line)
        if not obj.get('config', {}).get('contract', {}).get('enforced') \
                and 'marts' in obj.get('fqn', []):
            print(obj['name'], '— no contract enforced')
    except json.JSONDecodeError:
        pass
"
```
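To find the models with zero tests directly, you can parse `target/manifest.json` instead of shelling out to `dbt ls`. A minimal sketch, assuming the standard manifest layout (nodes keyed by `unique_id`, tests listing their parents under `depends_on.nodes`):

```python
import json


def models_without_tests(manifest: dict) -> list[str]:
    """Return unique_ids of models with no attached tests."""
    models, tested = set(), set()
    for unique_id, node in manifest.get("nodes", {}).items():
        rtype = node.get("resource_type")
        if rtype == "model":
            models.add(unique_id)
        elif rtype == "test":
            # A test's parents include the model(s) it checks
            for parent in node.get("depends_on", {}).get("nodes", []):
                if parent.startswith("model."):
                    tested.add(parent)
    return sorted(models - tested)


# Usage (from the dbt project root, after dbt compile):
# print(models_without_tests(json.load(open("target/manifest.json"))))
```

This walks the manifest once, so it works even for models whose tests live in a different package.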
What to check:
- `unique` + `not_null` on the primary key
- `relationships` test
- `accepted_values` tests
- `accepted_range` tests

Quick audit SQL (Snowflake):
```sql
-- Models missing primary key tests
-- (Run in your warehouse against information_schema)
select
    table_name,
    column_name
from information_schema.columns
where table_schema = 'MARTS'
  -- escape the underscore: a bare '_' in LIKE matches any single character
  and column_name like '%!_id' escape '!'
  and column_name not in (
      -- List your tested PK columns here
      select column_name from your_dbt_test_results
      where test_type = 'unique'
  )
```
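Gaps surfaced by the audit are closed by declaring tests in the model's YAML. A minimal sketch, with placeholder model and column names:

```yaml
models:
  - name: fct_orders            # placeholder model name
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```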
What to check:
- `sources.yml` entries have `freshness` + `loaded_at_field`
- `dbt source freshness` runs in CI/CD

Fix:
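A minimal freshness block in `sources.yml` looks like this (source, table, and column names are placeholders):

```yaml
sources:
  - name: raw_shop              # placeholder source name
    loaded_at_field: _loaded_at # timestamp column written by your loader
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```

Running `dbt source freshness` then writes results to `target/sources.json`, which the CLI tools above consume.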