Use when CI/CD checks fail on a PR and you need to diagnose and fix test, build, or lint failures
Auto-diagnose and fix CI/CD failures from GitHub Actions. Polls for failed runs, parses logs, classifies failures, applies targeted fixes, verifies locally, and pushes. Max 3 fix cycles before escalation. Usable standalone or called from finalize.
Announce at start: "I'm using envoy:fix-ci to diagnose and fix CI failures."
| Flag | Effect |
|---|---|
| <pr-number> | PR to check (default: detect from current branch or /tmp/envoy-active-pr.txt) |
PR detection order:
1. /envoy:fix-ci <pr-number>: explicit PR number passed
2. /tmp/envoy-active-pr.txt: written by envoy:finalize after PR creation
3. gh pr view: current branch's associated PR
4. If none is found, stop with: "Cannot find PR. Specify: /envoy:fix-ci <pr-number>"

# Detect PR number
if [ -n "$1" ]; then
  PR_NUMBER=$1
elif [ -f /tmp/envoy-active-pr.txt ]; then
  PR_NUMBER=$(cat /tmp/envoy-active-pr.txt)
else
  PR_NUMBER=$(gh pr view --json number -q '.number' 2>/dev/null)
fi
if [ -z "$PR_NUMBER" ]; then
  echo "Cannot find PR. Specify: /envoy:fix-ci <pr-number>"
  exit 1
fi
OWNER=$(gh repo view --json owner -q '.owner.login')
REPO=$(gh repo view --json name -q '.name')
BRANCH=$(git branch --show-current)
# Get latest check runs (bucket is one of pass, fail, pending, skipping, cancel)
gh pr checks "$PR_NUMBER" --json name,state,bucket 2>/dev/null
Checks may still be running. Poll with exponential backoff:
intervals = [30s, 60s, 120s, 240s]
timeout = 15 minutes
For each interval:
- Query: gh pr checks $PR_NUMBER --json name,state,bucket
- If all checks resolved (pass or fail): break
- If still running: report progress, sleep, continue
- If timeout: report which checks are still pending
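The polling steps above can be sketched as a small shell loop. This is a minimal sketch under stated assumptions, not the skill's actual implementation: count_pending and SLEEP_CMD are hypothetical names, the check state is read from gh's bucket field, and the loop keeps polling every 240s once the four listed intervals are used up, which the steps above leave unspecified.

```shell
# Sketch: poll PR checks with backoff (30s, 60s, 120s, 240s), then every
# 240s until roughly 15 minutes have elapsed.

# Hypothetical helper: prints how many checks are still pending.
# Assumes gh reports unfinished checks with bucket == "pending".
count_pending() {
  gh pr checks "$1" --json name,bucket \
    -q '[.[] | select(.bucket == "pending")] | length'
}

# Assumption: SLEEP_CMD exists only so the loop can be exercised without waiting.
SLEEP_CMD=${SLEEP_CMD:-sleep}

poll_checks() {
  local pr=$1 timeout=900 elapsed=0 interval=30 pending
  while :; do
    # gh may exit non-zero while checks are failing or pending, so the exit
    # status is ignored and the printed count is validated instead.
    pending=$(count_pending "$pr" 2>/dev/null || true)
    case $pending in
      ''|*[!0-9]*) echo "could not read check status"; return 2 ;;
    esac
    if [ "$pending" -eq 0 ]; then
      echo "resolved after ${elapsed}s"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timeout: $pending check(s) still pending"
      return 1
    fi
    echo "waiting ${interval}s: $pending check(s) pending"
    "$SLEEP_CMD" "$interval"
    elapsed=$((elapsed + interval))
    # Double the wait, capped at the largest listed interval.
    if [ "$interval" -lt 240 ]; then interval=$((interval * 2)); fi
  done
}
```

Calling poll_checks "$PR_NUMBER" returns 0 once every check has resolved and 1 on timeout, so the caller can branch on the exit status before downloading logs.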
For each failed check, download the log and classify:
# Get failed run IDs
FAILED_RUNS=$(gh run list --branch "$BRANCH" --status failure --json databaseId,name -q '.[] | .databaseId')
# For each failed run, get the failed log
for RUN_ID in $FAILED_RUNS; do
  gh run view "$RUN_ID" --log-failed 2>&1
done
Classify each failure:
| Type | Signal | Action |
|---|---|---|
| test-failure | Failed test names, assertion errors, FAIL, Expected X but got Y | Read test + source, fix, verify locally |
| build-error | error CS, error TS, Cannot find module, compilation errors with file:line | Fix compilation at error location |
| lint-violation | ESLint/Prettier errors, warning/error with rule name + file:line | Auto-fix (--fix) or manual fix |
| infra-issue | Runner unavailable, permissions, timeouts, Docker pull failures, OOM | Escalate immediately; don't try to fix |
Scope note: Step 1 uses gh pr checks to detect all failures — this includes
both GitHub Actions workflow runs and external status checks (e.g., Vercel, Netlify).
However, log download via gh run list / gh run view --log-failed only works for
GitHub Actions runs. If a failed check has no corresponding workflow run, classify it
as infra-issue with the note: "External status check — check the service directly."
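One way to turn the signal table into a first-pass classifier is a chain of grep checks over the failed log. This is a rough sketch: the function name classify_failure and every pattern in it are illustrative guesses at the signals listed, not an exhaustive or authoritative rule set, and a real log can match more than one type.

```shell
# Sketch: map a failed-run log (read from stdin) to one failure type.
# All patterns below are heuristic assumptions based on the signal table.
classify_failure() {
  local log
  log=$(cat)
  # Infra signals are checked first because they are escalated, never fixed.
  if printf '%s\n' "$log" | grep -Eiq 'runner.*unavailable|permission denied|timed out|pull access denied|out of memory|no space left on device'; then
    echo "infra-issue"
  elif printf '%s\n' "$log" | grep -Eq 'FAIL|AssertionError|[Ee]xpected .* but got'; then
    echo "test-failure"
  elif printf '%s\n' "$log" | grep -Eq 'error (CS|TS)[0-9]+|Cannot find module'; then
    echo "build-error"
  elif printf '%s\n' "$log" | grep -Eiq 'eslint|prettier'; then
    echo "lint-violation"
  else
    # No known signal: leave the decision to a human or a closer read.
    echo "unknown"
  fi
}
```

For example, gh run view "$RUN_ID" --log-failed 2>&1 | classify_failure prints a single type per run. Treat the result as a hint to confirm by reading the log, since a message such as "timed out" can also come from a slow test.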
If ANY failure is classified as infra-issue, escalate immediately:
**CI Infrastructure Failure — Cannot Auto-Fix**
| Workflow | Error | Type |
|----------|-------|------|
| <name> | <error excerpt> | infra-issue |
This is not a code issue. Possible causes:
- GitHub Actions runner unavailable
- Docker image pull failure
- Permission/secret configuration issue
- Resource limit (OOM, disk space)
- Network timeout
Please investigate the CI infrastructure.
Do not attempt to fix infrastructure issues. Stop here for these.
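The escalation gate can be sketched as a filter over the classification results. The tab-separated input format (workflow, type, error excerpt) and the function name report_infra_failures are assumptions made for this example; the skill does not define an intermediate format.

```shell
# Sketch: read "workflow<TAB>type<TAB>error excerpt" lines on stdin and
# print the escalation table rows for infra-issue entries.
# Returns 0 if at least one infra failure was found, 1 otherwise.
report_infra_failures() {
  awk -F '\t' '
    $2 == "infra-issue" {
      if (!found) {
        # Emit the table header once, before the first infra row.
        print "| Workflow | Error | Type |"
        print "|----------|-------|------|"
        found = 1
      }
      printf "| %s | %s | %s |\n", $1, $3, $2
    }
    END { if (!found) exit 1 }
  '
}
```

A caller could stop the fix loop with something like: if report_infra_failures < classified.tsv; then exit 1; fi, where classified.tsv is a hypothetical file holding the classification output.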
For each non-infra failure, in order of severity:
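Test failures: read the failing test and the source it covers, fix, then verify locally.
Build errors: fix the compilation error at the reported file:line.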
Lint violations: run the auto-fixer (npm run lint -- --fix or dotnet format), then fix any remaining violations manually.
Fix all failures from the current CI run before pushing. Diagnose and fix each failure in order of severity (test → build → lint), verify all fixes locally (Step 6), then commit and push once. One push = one fix cycle.
# After ALL failures are fixed and verified
git add <changed-files>
git commit -m "fix: resolve CI failures — <brief summary of all fixes>"
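The severity ordering used for the fix pass (test, then build, then lint) can be sketched as a stable sort over classified failures. The input format (type, then workflow name, tab-separated) and the name order_by_severity are hypothetical and exist only for illustration.

```shell
# Sketch: read "type<TAB>workflow" lines on stdin and print them ordered
# test-failure, build-error, lint-violation. Other types (for example
# infra-issue) are dropped here because they are escalated, not fixed.
order_by_severity() {
  awk -F '\t' '
    BEGIN {
      rank["test-failure"] = 1
      rank["build-error"] = 2
      rank["lint-violation"] = 3
    }
    # Prefix each known type with its rank so sort can order on it.
    ($1 in rank) { print rank[$1] "\t" $0 }
  ' | sort -s -n -k1,1 | cut -f2-
}
```

The stable sort (-s) keeps failures of the same type in their original order, so the fix pass still walks each group in the order CI reported it.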
Before pushing, run the same checks CI uses:
# Run what CI runs (adjust for project)
dotnet test
dotnet build
npm test
npm run lint
npm run build
All local checks must pass before pushing. If they don't, go back to Step 5.
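Since the command list depends on the project, one option is to pick the local checks from the files present in the repository. This is a sketch under assumptions: the name local_check_commands is hypothetical, and using package.json and *.sln / *.csproj as project markers is a heuristic, not something the skill prescribes. A real project should mirror its own CI workflow.

```shell
# Sketch: print the local check commands that apply to a project directory,
# one per line, using the same commands listed above.
local_check_commands() {
  local dir=${1:-.}
  # .NET project markers (assumption: a solution or project file at the root).
  if ls "$dir"/*.sln >/dev/null 2>&1 || ls "$dir"/*.csproj >/dev/null 2>&1; then
    printf '%s\n' "dotnet test" "dotnet build"
  fi
  # Node project marker.
  if [ -f "$dir/package.json" ]; then
    printf '%s\n' "npm test" "npm run lint" "npm run build"
  fi
}
```

The printed commands can then be run one at a time, stopping at the first failure and returning to the fix step, for example by reading them in a while loop and running each with sh -c.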
git push
Poll for new CI run results with exponential backoff (same as Step 2).
Uses a state machine with two counters (lib/loop-safeguards.js):
state = POLL_CI
FIX_CYCLE = 0
CONFIRM_COUNT = 0