Validate daily orchestration pipeline health
You are performing a comprehensive daily validation of the NBA stats scraper pipeline. This is NOT a rigid script - you should investigate issues intelligently and adapt based on what you find.
Validate that the daily orchestration pipeline is healthy and ready for predictions. Check all phases (2-5), run data quality spot checks, investigate any issues found, and provide a clear, actionable summary.
Important: For yesterday's results validation, data spans TWO calendar dates:
┌────────────────────────────────┬────────────────────────────┐
│ Jan 25th (GAME_DATE)           │ Jan 26th (PROCESSING_DATE) │
├────────────────────────────────┼────────────────────────────┤
│ • Games played (7-11 PM)       │ • Box score scrapers run   │
│ • Player performances          │ • Phase 3 analytics run    │
│ • Predictions made (pre-game)  │ • Predictions graded       │
│                                │ • Cache updated            │
│                                │ • YOU RUN VALIDATION       │
└────────────────────────────────┴────────────────────────────┘
Use the correct date for each query:
- GAME_DATE - for game data queries
- PROCESSING_DATE - for processing status queries

If the user invoked the skill without specific parameters, ask them what they want to check:
Use the AskUserQuestion tool to gather preferences:
Question 1: "What would you like to validate?"
Options:
- "Today's pipeline (pre-game check)" - Check if today's data is ready before games start
- "Yesterday's results (post-game check)" - Verify yesterday's games processed correctly
- "Specific date" - Validate a custom date
- "Quick health check only" - Just run health check script, no deep investigation
Question 2: "How thorough should the validation be?"
Options:
- "Standard (Recommended)" - Priority 1 + Priority 2 checks
- "Quick" - Priority 1 only (critical checks)
- "Comprehensive" - All priorities including spot checks
Based on their answers, determine scope:
| Mode | Thoroughness | Checks Run |
|---|---|---|
| Today pre-game | Standard | Health check + validation + spot checks |
| Today pre-game | Quick | Health check only |
| Yesterday results | Standard | P1 (box scores, grading) + P2 (analytics, cache) |
| Yesterday results | Quick | P1 only (box scores, grading) |
| Yesterday results | Comprehensive | P1 + P2 + P3 (spot checks, accuracy) |
If the user already provided parameters (e.g., specific date in their message), skip the questions and proceed with those parameters.
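The scope matrix above can be encoded as a simple lookup. A minimal Python sketch - the mode keys and check names here are invented for illustration, not actual pipeline identifiers:

```python
# Sketch of the scope matrix; keys and check names are illustrative only.
CHECKS = {
    ("today-pregame", "standard"): ["health_check", "validation", "spot_checks"],
    ("today-pregame", "quick"): ["health_check"],
    ("yesterday-results", "standard"): ["p1_box_scores", "p1_grading", "p2_analytics", "p2_cache"],
    ("yesterday-results", "quick"): ["p1_box_scores", "p1_grading"],
    ("yesterday-results", "comprehensive"): [
        "p1_box_scores", "p1_grading", "p2_analytics", "p2_cache",
        "p3_spot_checks", "p3_accuracy",
    ],
}

def checks_for(mode, thoroughness):
    """Return the list of checks to run for a mode/thoroughness pair."""
    return CHECKS[(mode, thoroughness)]
```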
After determining what to validate, set the target dates:
If "Today's pipeline (pre-game check)":
- GAME_DATE = TODAY (games scheduled for tonight)
- PROCESSING_DATE = TODAY (data should be ready now)

If "Yesterday's results (post-game check)":
- GAME_DATE = YESTERDAY (games that were played)
- PROCESSING_DATE = TODAY (scrapers ran after midnight)

If "Specific date":
- GAME_DATE = USER_PROVIDED_DATE
- PROCESSING_DATE = DAY_AFTER(USER_PROVIDED_DATE)

```bash
# Set dates in bash for queries
GAME_DATE=$(date -d "yesterday" +%Y-%m-%d)   # For yesterday's results
PROCESSING_DATE=$(date +%Y-%m-%d)            # Today (when processing ran)

# Or for a pre-game check
GAME_DATE=$(date +%Y-%m-%d)                  # Today's games
PROCESSING_DATE=$(date +%Y-%m-%d)            # Today
```
Critical: Use GAME_DATE for game data queries, PROCESSING_DATE for processing status queries.
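The date rules above can be captured in one small helper. This is a sketch - the function and mode names are ours, not part of the pipeline:

```python
from datetime import date, timedelta

def target_dates(mode, user_date=None, today=None):
    """Return (GAME_DATE, PROCESSING_DATE) for a validation mode.

    Mirrors the rules above: pre-game uses today for both dates;
    post-game validates yesterday's games processed today; a specific
    date is processed the following day.
    """
    today = today or date.today()
    if mode == "today-pregame":
        return today, today
    if mode == "yesterday-results":
        return today - timedelta(days=1), today
    if mode == "specific-date":
        if user_date is None:
            raise ValueError("specific-date mode requires user_date")
        return user_date, user_date + timedelta(days=1)
    raise ValueError(f"unknown mode: {mode}")
```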
First: Determine current time and game schedule context
Key Timing Rules:
IMPORTANT: Check BigQuery quotas FIRST to prevent cascading failures.
```bash
# Check current quota usage for partition modifications
bq show --format=prettyjson nba-props-platform | grep -A 10 "quotaUsed"

# Or check recent quota errors in logs
gcloud logging read "resource.type=bigquery_resource AND protoPayload.status.message:quota" \
  --limit=10 --format="table(timestamp,protoPayload.status.message)"
```
What to look for:
Common cause: pipeline_logger writing too many events to partitioned run_history table
IMPORTANT: Verify all critical services are deployed with latest code.
Why this matters: Bug fixes committed but not deployed cause issues to persist. Sessions 82, 81, and 64 had critical bugs that were fixed in code but not deployed for hours/days.
Real Examples:
What to check:
```bash
# Run deployment drift check
./bin/check-deployment-drift.sh --verbose
```
Expected Result:
✓ Up to date, or acceptable drift (<24 hours)

If STALE DEPLOYMENT detected:
| Commits Behind | Severity | Action |
|---|---|---|
| 1-2 commits | P2 | Deploy when convenient |
| 3-5 commits | P1 | Deploy today |
| 6+ commits | P0 CRITICAL | Deploy immediately |
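The drift table maps directly to a small classifier. A sketch (the function name is ours):

```python
def drift_severity(commits_behind):
    """Map deployment drift (commits behind HEAD) to the severity
    buckets in the table above."""
    if commits_behind >= 6:
        return "P0"  # deploy immediately
    if commits_behind >= 3:
        return "P1"  # deploy today
    if commits_behind >= 1:
        return "P2"  # deploy when convenient
    return "OK"
```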
Critical Services (must be up-to-date):
- prediction-worker - Generates predictions
- prediction-coordinator - Orchestrates predictions
- nba-grading-service - Grades predictions
- nba-phase3-analytics-processors - Analytics processing
- nba-scrapers - Data collection

Investigation if drift detected:

```bash
# See what changed since deployment
SERVICE="prediction-worker"
DEPLOYED_SHA=$(gcloud run services describe $SERVICE --region=us-west2 \
  --format="value(metadata.labels.commit-sha)")
git log --oneline $DEPLOYED_SHA..HEAD -- predictions/worker/

# If critical fixes found, deploy immediately
./bin/deploy-service.sh $SERVICE
```
Reference: Sessions 82, 81, 64 handoffs
Purpose: Verify the live-export Cloud Function has its BDL_API_KEY env var intact. On Feb 22, a --set-env-vars deployment wiped BDL_API_KEY, causing live-grading to silently regenerate stale data for an entire evening.
Why this matters: live-export is a Cloud Function (not Cloud Run), so standard deployment drift checks don't cover it. Without BDL_API_KEY, BDL live box score lookups fail silently and the live-grading JSON gets regenerated every 3 minutes with all-pending, zero-actual data.
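That failure mode ("all-pending, zero-actual" output regenerated every few minutes) can be spot-checked with a simple heuristic. This is a sketch only - the record shape (`status`, `actual` fields) is an assumption, not the actual live-grading JSON schema:

```python
def looks_stale(predictions):
    """Heuristic for the failure mode above: if every prediction is
    still 'pending' and none has an actual value, the live-grading
    feed is likely regenerating without BDL data. Field names
    ('status', 'actual') are illustrative; adapt to the real schema.
    """
    if not predictions:
        return False  # nothing to judge
    all_pending = all(p.get("status") == "pending" for p in predictions)
    zero_actuals = all(p.get("actual") is None for p in predictions)
    return all_pending and zero_actuals
```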
What to check:
```bash
# Check BDL_API_KEY is present in live-export env vars
gcloud functions describe live-export \
  --region=us-west2 \
  --project=nba-props-platform \
  --format="json" | jq '.environmentVariables | keys'

# Run the full env var drift detector (supports Cloud Functions)
./bin/monitoring/verify-env-vars-preserved.sh live-export
```
Expected result:
- BDL_API_KEY present and non-empty in env vars
- GCP_PROJECT present
- verify-env-vars-preserved.sh shows ALL REQUIRED VARIABLES PRESENT

Alert thresholds:
If BDL_API_KEY missing:
```bash
# Immediate fix: re-add BDL_API_KEY from Secret Manager
BDL_API_KEY=$(gcloud secrets versions access latest --secret=BDL_API_KEY --project=nba-props-platform)
gcloud functions deploy live-export \
  --region=us-west2 \
  --project=nba-props-platform \
  --update-env-vars="BDL_API_KEY=$BDL_API_KEY"
```
Also check live-freshness-monitor:
```bash
./bin/monitoring/verify-env-vars-preserved.sh live-freshness-monitor
```
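The presence check that verify-env-vars-preserved.sh performs amounts to a few lines. A sketch (the helper name is ours; the required-variable list comes from the expected results above):

```python
def missing_required_vars(env, required=("BDL_API_KEY", "GCP_PROJECT")):
    """Return required env vars that are absent or empty in a
    name->value mapping (e.g. parsed from `gcloud functions describe`)."""
    return [name for name in required if not env.get(name)]
```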
Reference: Session 302 (BDL_API_KEY wiped by --set-env-vars deployment)
IMPORTANT: Check Firestore heartbeat collection for document proliferation.
Why this matters: Heartbeat documents should be ONE per processor (one doc gets updated). If processors create NEW documents for each run, the collection grows unbounded (100k+ docs), causing performance degradation and Firestore costs.
What to check:
```bash
# Check Firestore heartbeat document count
python3 -c "
from google.cloud import firestore

db = firestore.Client(project='nba-props-platform')
docs = list(db.collection('processor_heartbeats').stream())
bad = [d for d in docs if '_None_' in d.id or '_202' in d.id]
total = len(docs)
bad_count = len(bad)
print(f'Total heartbeat documents: {total}')
print(f'Bad format (old pattern): {bad_count}')
print('Expected: ~30-50 documents')
if bad_count > 0:
    print(f'\n⚠️ WARNING: {bad_count} old format documents detected!')
    print('Sample bad documents:')
    for doc in bad[:5]:
        print(f'  {doc.id}')
    print('\nAction: Run bin/cleanup-heartbeat-docs.py')
if total > 100:
    print(f'\n⚠️ WARNING: Too many documents ({total})!')
    print('Expected: ~30-50 (one per active processor)')
    print('Possible cause: Heartbeat code creating new docs instead of updating')
    print('Action: Check shared/monitoring/processor_heartbeat.py')
"
```
Expected result:
If issues detected:
| Issue | Severity | Action |
|---|---|---|
| Bad format docs > 0 | P2 | Run bin/cleanup-heartbeat-docs.py to clean up |
| Total docs > 100 | P1 | Investigate which service creating bad docs, redeploy |
| Total docs > 500 | P0 CRITICAL | Immediate cleanup + service fix |
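The severity table above, expressed as a small function (a sketch; the name is illustrative):

```python
def heartbeat_severity(total_docs, bad_format_docs):
    """Map heartbeat-collection counts to the severity table above.
    Higher total-count thresholds take precedence over bad-format docs."""
    if total_docs > 500:
        return "P0"  # immediate cleanup + service fix
    if total_docs > 100:
        return "P1"  # investigate which service is creating bad docs
    if bad_format_docs > 0:
        return "P2"  # run bin/cleanup-heartbeat-docs.py
    return "OK"
```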
Cleanup command:
```bash
# Preview cleanup
python bin/cleanup-heartbeat-docs.py --dry-run

# Execute cleanup
python bin/cleanup-heartbeat-docs.py
```
Investigation command (if proliferation detected):
```bash
# Find which processors created docs in last hour
python3 -c "
from google.cloud import firestore
from datetime import datetime, timedelta

db = firestore.Client(project='nba-props-platform')
docs = list(db.collection('processor_heartbeats').stream())
cutoff = datetime.now() - timedelta(hours=1)
recent_bad = []
for doc in docs:
    data = doc.to_dict()
    last_hb = data.get('last_heartbeat')
    if '_None_' in doc.id or '_202' in doc.id:
        if last_hb and hasattr(last_hb, 'replace') and last_hb.replace(tzinfo=None) > cutoff:
            recent_bad.append(doc.id)
if recent_bad:
    print(f'⚠️ {len(recent_bad)} bad documents created in last hour')
    print('Offending processors:')
    for doc_id in set([d.split('_None_')[0].split('_202')[0] for d in recent_bad]):
        print(f'  {doc_id}')
"
```
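The doc-id parsing buried in that snippet can be pulled out as a reusable helper. Same string logic as above; the function name is ours:

```python
def processor_from_doc_id(doc_id):
    """Extract the processor name from an old-format heartbeat doc id
    by stripping '_None_...' and '_202...' (timestamp) suffixes."""
    return doc_id.split("_None_")[0].split("_202")[0]
```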