Debug the Lightdash app using PM2 logs, Spotlight traces, and browser automation. Use when investigating issues, tracking down bugs, understanding request flow, or correlating frontend actions with backend behavior.
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST.
Do not edit application code until you have a confirmed root cause hypothesis backed by evidence (logs, traces, reproduction). Fixing symptoms creates whack-a-mole debugging.
Before forming any hypothesis, collect facts.
Read error messages, stack traces, and reproduction steps. If the user hasn't provided enough context, ask ONE clarifying question.
git log --oneline -20 -- <affected-files>
Was this working before? A regression means the root cause is in the diff.
pnpm exec pm2 logs lightdash-api --lines 50 --nostream
Then use Spotlight MCP:
mcp__spotlight__search_errors with {"timeWindow": 300} for recent errors with stack tracesmcp__spotlight__search_traces with {"timeWindow": 300} for recent request tracesmcp__spotlight__get_traces with a trace ID for full span breakdownCan you trigger the bug deterministically? Use browser automation or curl to reproduce. If you can't reproduce, gather more evidence before proceeding.
Trace the code path from the symptom back to potential causes. Use Grep to find all references, Read to understand the logic.
Output: State your root cause hypothesis — a specific, testable claim about what is wrong and why.
Check if the bug matches a known Lightdash pattern:
| Pattern | Signature | Where to look |
|---|---|---|
| Permission error | 403 Forbidden | CASL abilities in projectMemberAbility.ts, organizationMemberAbility.ts |
| Slow query / N+1 | Response >1s, many db spans | Spotlight span breakdown — count db.* spans, check for loops |
| Stale explore cache | Shows old columns/metrics | cached_explores table, CompileService |
| Migration issue | Column not found, schema mismatch | packages/backend/src/database/migrations/, check rollback |
| Scheduler job failure | Job not running, stuck | PM2 scheduler logs, graphile_worker.jobs table |
| Frontend stale data | UI shows old data after mutation | TanStack Query cache invalidation, check queryClient.invalidateQueries |
| Race condition | Intermittent, timing-dependent | Concurrent access to shared state, missing await |
| Cross-service file issue | File not found in worker/headless | Must use S3 via FileStorageClient, not local filesystem |
| Config drift | Works locally, fails in CI/staging | Env vars, .env.development.local vs container env |
| Null propagation | TypeError, cannot read property | Missing guards on optional values, Knex .first() returning undefined |
Also check:
git log for prior fixes in the same area — recurring bugs in the same files are an architectural smellBefore writing ANY fix, verify your hypothesis.
Confirm: Add a temporary log, assertion, or use Spotlight/PM2 to check the suspected root cause. Reproduce. Does evidence match?
If wrong: Return to Phase 1. Gather more evidence. Do not guess.
3-strike rule: If 3 hypotheses fail, STOP. Ask the user:
Once root cause is confirmed:
Fix the root cause, not the symptom. Smallest change that eliminates the actual problem.
Minimal diff. Fewest files touched, fewest lines changed. Do not refactor adjacent code.
Write a regression test that fails without the fix and passes with it.
Run the relevant test suite. Paste output.
Blast radius check: If fix touches >5 files, pause and confirm with the user before proceeding.
Reproduce the original bug scenario and confirm it's fixed. This is not optional.
Output a structured debug report:
DEBUG REPORT
════════════════════════════════════════
Symptom: [what the user observed]
Root cause: [what was actually wrong]
Fix: [what was changed, with file:line references]
Evidence: [test output, trace showing fix works]
Regression test: [file:line of the new test]
Status: DONE | DONE_WITH_CONCERNS | BLOCKED
════════════════════════════════════════
Status definitions:
Ensure the development environment is running:
/docker-dev # Start Docker services (postgres, minio, etc.)
pnpm pm2:start # Start all PM2 processes including Spotlight
pnpm pm2:status # Verify all processes are online
Services:
pnpm pm2:logs # Stream all logs (Ctrl+C to exit)
pnpm pm2:logs:api # API server logs only
pnpm pm2:logs:scheduler # Background job scheduler logs
pnpm pm2:logs:frontend # Vite dev server logs
pnpm pm2:logs:spotlight # Spotlight sidecar logs
For non-blocking log viewing (last N lines):
pnpm exec pm2 logs lightdash-api --lines 20 --nostream
Lightdash logs include a 32-character trace ID for correlation:
[700aa55e784a437aa97de9b8c5c3ed3a] GET /api/v1/health 200 - 37 ms
[4d40816e5a7d433e88f99008de5f4be5] GET /api/v1/user/login-options 200 - 2 ms
Use the first 8 characters (e.g., 700aa55e) to look up traces in Spotlight.
Use the Spotlight MCP tools to query telemetry programmatically:
mcp__spotlight__search_traces with filters: {"timeWindow": 300}
Returns summary of recent traces with trace ID, endpoint, duration, span count.
mcp__spotlight__get_traces with traceId: "700aa55e"
Returns the full span tree with timing breakdown:
GET /api/v1/health [816224a9 · 39ms]
├─ select * from "knex_migrations" [db · 8ms]
├─ select * from "organizations" [db · 2ms]
├─ session [middleware.express · 1ms]
└─ /api/v1/health [request_handler.express · 0ms]
mcp__spotlight__search_errors with filters: {"timeWindow": 300}
Returns runtime errors and exceptions with stack traces.
mcp__spotlight__search_logs with filters: {"timeWindow": 300}
Returns application log entries.
Use the Chrome DevTools MCP tools for browser automation:
mcp__chrome-devtools__new_page with url: "http://localhost:3000/login"
mcp__chrome-devtools__navigate_page with url: "http://localhost:3000", type: "url"
mcp__chrome-devtools__take_snapshot
Returns a text snapshot of the page based on the accessibility tree with unique uid identifiers for each element.
mcp__chrome-devtools__click with uid: "1_5"
mcp__chrome-devtools__fill with uid: "1_4", value: "[email protected]"
mcp__chrome-devtools__hover with uid: "1_3"
mcp__chrome-devtools__take_screenshot
mcp__chrome-devtools__take_screenshot with fullPage: true
mcp__chrome-devtools__take_screenshot with filePath: "/tmp/debug.png"
mcp__chrome-devtools__list_console_messages
mcp__chrome-devtools__list_network_requests
mcp__chrome-devtools__get_network_request with reqid: 123
mcp__chrome-devtools__list_pages
mcp__chrome-devtools__select_page with pageId: 1
mcp__chrome-devtools__close_page with pageId: 2
# Check table schema
psql -c "\d <table_name>"
# Query data directly
psql -c "SELECT * FROM <table> WHERE <condition> LIMIT 5;"
# Check scheduler jobs
psql -c "SELECT id, task_identifier, attempts, last_error FROM graphile_worker.jobs WHERE last_error IS NOT NULL LIMIT 10;"
Traces contain contextual attributes beyond timing:
| Attribute | Example | Description |
|---|---|---|
http.route | /api/v1/projects/:projectUuid | Route pattern |
http.status_code | 200 | Response status |
http.method | GET | HTTP method |
sentry.op | db, http.server | Operation type |
db.system | postgresql | Database type |
| Symptom | What to check |
|---|---|
| 401 Unauthorized | Trace auth middleware, check session/JWT spans |
| 403 Forbidden | Check user ability/permissions in trace attributes, review CASL abilities |
| 404 Not Found | Verify route exists, check resource lookup spans |
| 400 Bad Request | Look for validation errors in trace/error logs |
| Slow response | Check span breakdown for slow db.* or external calls, count db spans for N+1 |
| Empty results | Verify query parameters, check db query spans |
| 500 Server Error | Use search_errors for stack trace and context |
| UI not updating | Check TanStack Query devtools, verify cache invalidation after mutations |
| Scheduler not running | Check pnpm pm2:logs:scheduler, query graphile_worker.jobs table |
| Action | Command |
|---|---|
| View all logs | pnpm pm2:logs |
| View API logs (non-blocking) | pnpm exec pm2 logs lightdash-api --lines 20 --nostream |
| Check process status | pnpm pm2:status |
| Restart API | pnpm pm2:restart:api |
| Recent traces | mcp__spotlight__search_traces {"timeWindow": 300} |
| Trace details | mcp__spotlight__get_traces "<8-char-prefix>" |
| Recent errors | mcp__spotlight__search_errors {"timeWindow": 300} |
| Browser snapshot | mcp__chrome-devtools__take_snapshot |
| Open Spotlight UI | http://localhost:8969 |
Use another AI agent to validate your findings, challenge your conclusions, and provide independent evidence. This is not just for when you're stuck — actively seek validation throughout the debugging process to ensure your analysis is sound.
which claude 2>/dev/null && echo "HAS_CLAUDE=true" || echo "HAS_CLAUDE=false"
which codex 2>/dev/null && echo "HAS_CODEX=true" || echo "HAS_CODEX=false"
codex exec "Given this context from a Lightdash debugging session:
<context>
[paste relevant logs, traces, error messages, or code snippets]
</context>
<question>
[your specific question — e.g., 'Does this trace confirm that the query cache is being bypassed?' or 'Given this stack trace, is my conclusion that the middleware short-circuits before auth correct?']
</question>"
claude -p "Given this context from a Lightdash debugging session:
<context>
[paste relevant logs, traces, error messages, or code snippets]
</context>
<question>
[your specific question]
</question>"
codex exec via the Bash tool, set the Bash timeout to 300000ms (5 minutes) to avoid premature termination. Do NOT pass --timeout to codex exec itself — it doesn't support that flag.For testing authenticated flows:
Prefer this method over using fetch() via Chrome DevTools MCP for API calls. curl with the PAT is faster, more reliable, and doesn't require a browser session or page context. Reserve browser MCP tools for UI interaction and visual debugging only.
A dev PAT is auto-provisioned by the seed data. Use it for direct API calls without browser login:
# Source the env file to get LDPAT and LIGHTDASH_API_URL
source .env.development.local
# Example: list projects
curl -s -H "Authorization: ApiKey $LDPAT" "$LIGHTDASH_API_URL/api/v1/org/projects" | jq
# Example: get user info
curl -s -H "Authorization: ApiKey $LDPAT" "$LIGHTDASH_API_URL/api/v1/user" | jq
The token (ldpat_deadbeefdeadbeefdeadbeefdeadbeef) is defined in SEED_PAT from @lightdash/common and inserted during database seeding. It belongs to the admin user ([email protected]) and never expires.