Debug stuck Hawk/Inspect AI evaluations. Use when user mentions "stuck eval", "eval not progressing", "eval hanging", "samples not completing", "eval set frozen", "runner stuck", "500 errors in eval", "retry loop", "eval timeout", or asks why an evaluation isn't finishing.
Before debugging, confirm you're authenticated:

```sh
hawk auth access-token > /dev/null || echo "Run 'hawk login' first"
```

Core commands:

- `hawk status <eval-set-id>` - JSON report with pod state, logs, and metrics
- `hawk logs <eval-set-id>`, or `hawk logs -f` for follow mode
- `hawk list samples <eval-set-id>` - see completion status

| Log Pattern | Meaning | Resolution |
|---|---|---|
| `[uuid task/id/epoch model] Retrying request to /responses` | OpenAI SDK retry with sample context | Test the API directly with curl to see the real error |
| `[uuid task/id/epoch model] -> model retry N ... [ErrorType code]` | Inspect retry with error summary | Check the error type; use curl for full details |
| `500 - Internal server error` | API issue | Download the buffer, find the failing request, test through middleman AND directly against the provider |
| `400 - invalid_request_error` | Token/context limit exceeded | Check message count and the model's context window |
| `Pod UID mismatch` | Sandbox pod was killed and restarted | No fix needed; the sample errored out and Inspect will retry |
| Empty output, `pending: true` | API returned malformed response | Restart the eval (buffer resumes) |
| `OOMKilled` in pod status | Memory exhaustion | Increase pod memory limits |
- Retry log lines carry a `[sample_uuid task/sample_id/epoch model]` prefix. Inspect's own retries also include a compact error-summary suffix like `[RateLimitError 429 rate_limit_exceeded]`. The OpenAI SDK's internal retry messages still don't show the actual error; use curl for full details.
- Download the sample `.buffer/` from S3 rather than accessing the runner pod directly.
- Use `from inspect_ai.log import read_eval_log` instead of manually extracting zips.
- Middleman is the auth proxy. If middleman fails but direct provider calls work, it's a middleman issue.
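A sketch of reading a log with Inspect's reader rather than unzipping the `.eval` file by hand. The log path is a placeholder, and the `.status`/`.samples`/`.error` field names follow Inspect's `EvalLog` model as I understand it; verify against your installed `inspect_ai` version.

```python
def count_errored(samples) -> int:
    """Count samples that recorded an error (intended for EvalLog.samples)."""
    return sum(1 for s in samples if getattr(s, "error", None) is not None)

if __name__ == "__main__":
    # Assumed API: read_eval_log parses a .eval log into an EvalLog object.
    from inspect_ai.log import read_eval_log

    log = read_eval_log("logs/2024-.../task.eval")  # placeholder path
    print(f"status={log.status}, errored samples={count_errored(log.samples or [])}")
```

A high errored-sample count alongside `status=started` is a hint that the eval is churning through retries rather than progressing.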
```sh
TOKEN=$(hawk auth access-token)

# Test through middleman
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 100, "messages": [{"role": "user", "content": "Say hello"}]}'

# Test the OpenAI-compatible endpoint
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 100}'
```
```sh
# Delete the stuck eval and restart it
hawk delete <eval-set-id>
hawk eval-set <config.yaml>
```
The sample buffer in S3 lets Inspect resume from where it left off (unless you pass `--no-resume`).
Task progress logs include `HTTP retries: X`. High retry counts indicate API instability even while tasks complete.
Severity: retry count × wait time = stuck duration. E.g., 45 retries × 1800 s ≈ 22.5 hours stuck.
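The severity formula above as a worked example; `stuck_hours` is a hypothetical helper name, and the values mirror the 45-retries-at-1800-second example in the text.

```python
def stuck_hours(retries: int, wait_seconds: float) -> float:
    """Estimate how long a sample has been stuck: retries x per-retry wait, in hours."""
    return retries * wait_seconds / 3600

# Mirrors the example above: 45 retries at an 1800 s wait.
print(stuck_hours(45, 1800))  # 22.5
```

Anything in the tens of hours is a signal to kill and restart the eval rather than wait out the backoff.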
See docs/debugging-stuck-evals.md for more detail.