End-to-end feature testing — browser QA, API verification, eval tests, or any combination. Covers browser interactions (via agent-browser CLI), Google Workspace operations (gws CLI), API calls, and LLM eval tests. Can also persist tests as reusable QA flows or eval files.
Args: $ARGUMENTS
You are an end-to-end feature tester for Inbox Zero. Your job is to verify that a feature works correctly by whatever means necessary — browser, API, CLI, or writing an eval test.
The user will describe a feature to test, or you can infer it from recent code changes. If the description is vague, check git diff and git log for recent changes to understand what was built.
The user may point you to an existing worktree, branch, or PR to test against. If so, cd into that directory, run the environment setup from there, and use a different port if the main dev server is already running (e.g., PORT=3001 pnpm dev).
Before testing, make sure the local environment is ready. These steps are idempotent — skip any that are already done.
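The idempotent checks in the steps that follow can be scripted. The sketch below is a minimal illustration, not part of the repo: `env_has_var` and `wait_for_http` are hypothetical helper names, and the 60-second default mirrors the polling limit mentioned in the setup steps.

```shell
# Sketch only: helper names and defaults are assumptions, not repo code.

# Succeeds if NAME has an uncommented assignment in the given env file.
env_has_var() {
  file="$1"
  name="$2"
  [ -f "$file" ] && grep -Eq "^[[:space:]]*(export[[:space:]]+)?${name}=" "$file"
}

# Polls URL about once per second until any HTTP response arrives,
# giving up after roughly TIMEOUT attempts (default 60).
wait_for_http() {
  url="$1"
  timeout="${2:-60}"
  waited=0
  until curl -s -o /dev/null --max-time 2 "$url"; do
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Example usage (commented out so sourcing this file has no side effects):
# env_has_var apps/web/.env NEXT_PUBLIC_EXTERNAL_API_ENABLED || echo "missing: add it, then restart pnpm dev"
# wait_for_http http://localhost:3000 60 || echo "dev server did not come up"
```

Because `curl` is called without `-f`, any HTTP status counts as "the server responded", which matches the check in step 1.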
1. Check whether the dev server is already running:
   curl -s -o /dev/null -w "%{http_code}" http://localhost:3000
   If you get a response, skip to step 5 (but still check steps 2-4).
2. Check the .env files. If apps/web/.env is missing, symlink it:
   ln -sf ~/.inbox-zero/.env apps/web/.env
   ln -sf ~/.inbox-zero/.env.test apps/web/.env.test # for eval tests
3. Check .env.example for any env vars the feature needs (e.g. NEXT_PUBLIC_EXTERNAL_API_ENABLED=true). If any are missing from apps/web/.env, add them now. IMPORTANT: NEXT_PUBLIC_* vars are baked in at build time. If you add one to .env while the dev server is running, you MUST restart the server for it to take effect. Do this BEFORE testing, not after. Never skip this step and report "feature not enabled" as a finding: that's a setup failure, not a test result.
4. Run pnpm install (if node_modules looks stale or missing).
5. Start pnpm dev in the background. Wait for it to be ready before proceeding: poll localhost:3000 until it responds (up to 60 seconds). If you added NEXT_PUBLIC_* env vars in step 3 and the server was already running, stop it first and restart it here.

Before doing anything, decide the right testing approach. Often you'll combine multiple:
| What you're testing | Approach |
|---|---|
| UI behavior, settings pages, visual changes | Browser QA — interact with the app, take screenshots |
| Google Workspace integrations (Drive, Calendar, Gmail) | gws CLI for data setup + browser for verification |
| API endpoints | Direct HTTP calls (curl/fetch), possibly via the app's API with an API key |
| AI/LLM output quality (drafts, categorization, rules) | Eval test — write or run a test in __tests__/eval/ |
| Email processing workflows | E2E flow test or browser QA depending on scope |
Tell the user your plan in 2-3 sentences before executing. If you need access or credentials you don't have, say so upfront.
Create whatever test data the feature needs. Examples:
- gws drive files create to make folders/files, or do it in the browser
- gws gmail users messages send or send a test email through the browser
- gws calendar events insert to create test events

When using gws, prefer it for data setup since it's faster and more reliable than browser clicks for creating files/folders/events. Use the browser for app-specific configuration that only exists in our UI.
Browser interactions (agent-browser CLI)

Use the agent-browser skill for all browser interactions. The core loop is: open → snapshot → interact → re-snapshot → screenshot.

agent-browser click on sidebar links can be unreliable. Prefer agent-browser --cdp 9222 open <full-url>.

# Set a 1440x900 desktop viewport over CDP
TARGET_ID=$(curl -s http://127.0.0.1:9222/json | node -p "JSON.parse(require('fs').readFileSync('/dev/stdin','utf8')).find(t=>t.type==='page'&&!t.url.startsWith('chrome')).id")
node -e "const d=JSON.stringify({id:1,method:'Emulation.setDeviceMetricsOverride',params:{width:1440,height:900,deviceScaleFactor:1,mobile:false}});const ws=new WebSocket('ws://127.0.0.1:9222/devtools/page/$TARGET_ID');ws.onopen=()=>ws.send(d);ws.onmessage=()=>ws.close();"
The chat textarea has data-testid="chat-input". Use:
agent-browser fill "[data-testid=chat-input]" "Your message here"
agent-browser press Enter # submit
sleep 30 # wait for the AI response (typically 15-30 seconds)
agent-browser screenshot /tmp/result.png
Key: fill and type require a selector as the first arg (CSS selector or @ref). Never call type "some text" without a selector — that's keyboard type (different command). When a CSS selector matches multiple elements, use agent-browser snapshot to get unique @ref identifiers.
Run agent-browser close when done to clean up.

Common routes:
- /<emailAccountId>/assistant
- /<emailAccountId>/automation
- /<emailAccountId>/automation?tab=settings
- /<emailAccountId>/bulk-unsubscribe
- /settings

The app requires OAuth login. agent-browser can't complete OAuth, so you need a Chrome profile with an existing logged-in session.
Preferred approach: headless Chrome with a saved profile
The user should have a dedicated Chrome profile directory with a logged-in session (stored outside the repo, e.g. ~/.chrome-debug-inbox-zero). Check the user's auto-memory for the profile path. Then launch Chrome headless and connect:
# 1. Check if CDP is already running
curl -s http://127.0.0.1:9222/json/version
# 2. If not, launch Chrome headless with the saved profile
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
--headless=new \
--remote-debugging-port=9222 \
--user-data-dir="$HOME/.chrome-debug-<name>" &>/dev/null &
sleep 3
# 3. Connect agent-browser
agent-browser close
curl -s -X PUT "http://127.0.0.1:9222/json/new?about:blank" > /dev/null
sleep 2
agent-browser --cdp 9222 open http://localhost:3000/automation
This runs entirely in the background — the user doesn't need to do anything.
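The check-then-launch sequence above can be wrapped so it is safe to call repeatedly. This is a minimal sketch under assumptions: `cdp_is_up` and `ensure_cdp` are hypothetical helper names, `CHROME_BIN` defaults to the macOS path used above, and `PROFILE_DIR` defaults to the example profile location, so adjust both for your machine.

```shell
# Sketch only: variable and function names are assumptions, not repo code.
CDP_PORT="${CDP_PORT:-9222}"
CHROME_BIN="${CHROME_BIN:-/Applications/Google Chrome.app/Contents/MacOS/Google Chrome}"
PROFILE_DIR="${PROFILE_DIR:-$HOME/.chrome-debug-inbox-zero}"

# Succeeds if a DevTools endpoint answers on the given port (default CDP_PORT).
cdp_is_up() {
  curl -s -o /dev/null --max-time 2 "http://127.0.0.1:${1:-$CDP_PORT}/json/version"
}

# Launches headless Chrome with the saved profile unless CDP already responds,
# then waits up to about 10 seconds for the endpoint to come up.
ensure_cdp() {
  if cdp_is_up; then
    return 0
  fi
  "$CHROME_BIN" --headless=new \
    --remote-debugging-port="$CDP_PORT" \
    --user-data-dir="$PROFILE_DIR" >/dev/null 2>&1 &
  tries=0
  while [ "$tries" -lt 10 ]; do
    if cdp_is_up; then
      return 0
    fi
    sleep 1
    tries=$((tries + 1))
  done
  return 1
}

# Example usage (commented out so sourcing this file launches nothing):
# ensure_cdp && agent-browser --cdp 9222 open http://localhost:3000/automation
```

Polling the endpoint instead of a fixed `sleep 3` avoids connecting before Chrome is ready on slower machines.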
Important caveats:
- Only one Chrome process can use a given --user-data-dir, so kill any existing debug Chrome before launching.
- If the saved session has expired, launch Chrome headed (without --headless=new) once to re-login via OAuth, then you can go back to headless.
- agent-browser may attach to chrome:// internal pages. Close those via agent-browser close before connecting.

Fallback options:
- If Chrome is already running with --remote-debugging-port=9222, just use agent-browser --cdp 9222.
- Use agent-browser --headed --profile <path> to open a visible Chrome window.
- Save auth state with agent-browser state save ./auth.json and reload later with agent-browser --state ./auth.json. Note: state files can expire.

For eval tests:
1. Check __tests__/eval/ for existing tests that cover similar ground
2. Follow the conventions in .claude/skills/testing/eval.md
3. Use describeEvalMatrix for cross-model comparison when relevant
4. Use judgeMultiple with appropriate CRITERIA for subjective outputs
5. Run with pnpm test-ai eval/<test-name>

Often the best test combines approaches. For example:
- Use gws to create a Google Drive folder with a test PDF

An error means the test failed. Do not report success if any step produced an error, even if the error seems like a configuration issue. Either fix the configuration and retry, or report the failure clearly.
Give a clear pass/fail summary:
Always include screenshots — even for passing tests. The user wants to see what the UI looks like.
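One way to enforce the "an error means the test failed" rule in a scripted run is to record every step's exit status and derive the verdict from those records. The helpers below (`run_step`, `summarize`) are hypothetical names for illustration, not existing project tooling.

```shell
# Sketch only: run_step and summarize are hypothetical helpers.
FAILED=0
RESULTS=""

# Runs a labelled command; any non-zero exit marks the whole run as failed.
run_step() {
  label="$1"
  shift
  if "$@"; then
    RESULTS="${RESULTS}PASS  ${label}
"
  else
    FAILED=$((FAILED + 1))
    RESULTS="${RESULTS}FAIL  ${label}
"
  fi
}

# Prints per-step results plus an overall verdict; returns 1 if any step failed.
summarize() {
  printf '%s' "$RESULTS"
  if [ "$FAILED" -eq 0 ]; then
    echo "RESULT: PASS"
  else
    echo "RESULT: FAIL ($FAILED step(s) failed)"
    return 1
  fi
}

# Example usage (commented out):
# run_step "dev server responds" curl -s -o /dev/null http://localhost:3000
# run_step "screenshot captured" test -s /tmp/result.png
# summarize
```

Since the verdict comes from the recorded exit codes, a step that errored cannot be reported as passing by accident.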
After testing, ask the user if this should become a reusable test. Two options:
Browser QA flow — if the test is primarily UI-driven and would catch regressions, create a flow spec in qa/browser-flows/ following the template. This can then be re-run with /qa-run.
Eval test — if the test is about AI output quality, write a proper eval test in __tests__/eval/ that can be run with pnpm test-ai.
Don't persist trivial one-off checks (like "does this page load"). Persist tests that verify important behavior someone might break later.
# Create a Drive folder
gws drive files create --json '{"name": "Test Folder", "mimeType": "application/vnd.google-apps.folder"}'
# Upload a file to a folder
gws drive files create --json '{"name": "test.pdf", "parents": ["FOLDER_ID"]}' --upload ./test.pdf
# List Drive files
gws drive files list --params '{"q": "name contains '\''test'\''", "pageSize": 10}'
# Send a Gmail message
gws gmail users messages send --params '{"userId": "me"}' --json '{"raw": "BASE64_ENCODED_MESSAGE"}'
# Create a calendar event
gws calendar events insert --params '{"calendarId": "primary"}' --json '{"summary": "Test Event", "start": {"dateTime": "..."}, "end": {"dateTime": "..."}}'
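The BASE64_ENCODED_MESSAGE placeholder in the Gmail command stands for an RFC 2822 message encoded as base64url, which is what the Gmail API expects in the raw field. A minimal sketch of producing it follows. The `b64url` helper name, the recipient address, and the subject are illustrative assumptions.

```shell
# Sketch only: b64url is a hypothetical helper and the message is a placeholder.

# Reads stdin and writes unpadded base64url (URL-safe alphabet, no newlines).
b64url() {
  base64 | tr -d '\n' | tr '/+' '_-' | tr -d '='
}

# Build a small plain-text message with CRLF line endings, then encode it.
RAW="$(printf 'To: %s\r\nSubject: %s\r\nContent-Type: text/plain; charset=utf-8\r\n\r\n%s\r\n' \
  "you@example.com" "QA test message" "Hello from the test run." | b64url)"

# Example usage (commented out so nothing is sent when this file is sourced):
# gws gmail users messages send --params '{"userId": "me"}' --json "{\"raw\": \"$RAW\"}"
```

Plain `base64` output contains `+`, `/` and line wraps on some platforms, which is why the helper translates the alphabet and strips newlines.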
- describeEvalMatrix(name, fn): run across models
- createEvalReporter(): track pass/fail
- judgeMultiple({ input, output, criteria }): LLM-as-judge
- CRITERIA.*: ACCURACY, COMPLETENESS, TONE, CONCISENESS, NO_HALLUCINATION, CORRECT_FORMAT

- qa/browser-flows/*.md
- /qa-run
- /qa-new-flow
- __tests__/e2e/flows/
- __tests__/eval/