This skill should be used when the user asks to "test the harness", "run integration tests", "validate features with real API", "test with real model calls", "run agent loop tests", "verify end-to-end", or needs to verify OpenHarness features on a real codebase with actual LLM calls.
Validate OpenHarness features by running real agent loops against an unfamiliar codebase with actual LLM API calls. Every test exercises the full stack: API client → model → tool calls → execution → result.
Clone an unfamiliar project (do not use OpenHarness itself as the eval target):

```bash
git clone https://github.com/HKUDS/AutoAgent /tmp/eval-workspace
export ANTHROPIC_API_KEY=sk-xxx
export ANTHROPIC_BASE_URL=https://api.moonshot.cn/anthropic  # or any Anthropic-compatible provider
export ANTHROPIC_MODEL=kimi-k2.5
```
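Before spending API budget, a quick preflight can confirm the variables above are actually set. A minimal sketch — the `missing_env` helper and the `REQUIRED`/`RECOMMENDED` tuples are this sketch's, not OpenHarness API:

```python
import os

REQUIRED = ("ANTHROPIC_API_KEY",)
# Base URL and model fall back to provider defaults when unset.
RECOMMENDED = ("ANTHROPIC_BASE_URL", "ANTHROPIC_MODEL")

def missing_env(required=REQUIRED) -> list[str]:
    """Return the required eval variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

# Fail fast before burning API budget on a misconfigured run.
if missing_env():
    print("set these before running evals:", missing_env())
```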
For long-running real evals, do not artificially lower max_turns. Use the product default (200) unless the user explicitly wants a tighter bound.
If the task is validating sandbox behavior, install and verify the actual runtime before running agent loops:

```bash
npm install -g @anthropic-ai/sandbox-runtime
sudo apt-get update
sudo apt-get install -y bubblewrap ripgrep
which srt
which bwrap
which rg
srt --version
```
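The `which` checks above can also be scripted from Python with `shutil.which`, which is convenient inside a test harness. A minimal sketch — the helper name and `SANDBOX_DEPS` tuple are this sketch's own:

```python
import shutil

# Binaries installed above; a missing one is an environment/setup
# failure, not an OpenHarness feature regression.
SANDBOX_DEPS = ("srt", "bwrap", "rg")

def missing_sandbox_deps(deps=SANDBOX_DEPS) -> list[str]:
    """Return the sandbox dependencies that are not on PATH."""
    return [dep for dep in deps if shutil.which(dep) is None]
```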
Then run a minimal smoke check through OpenHarness, not just raw `srt`, so you verify the real adapter path:

```python
from pathlib import Path

from openharness.config.settings import Settings, SandboxSettings, save_settings
from openharness.tools.bash_tool import BashTool

cfg = Path("/tmp/openharness-sandbox-settings.json")
save_settings(Settings(sandbox=SandboxSettings(enabled=True, fail_if_unavailable=True)), cfg)
# Point the config loader at this file, then run BashTool on a tiny command such as `pwd`.
```
If sandbox dependencies are missing, treat that as an environment/setup failure, not a feature regression.
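One way to enforce that rule in pytest-based suites is a `skipif` gate, so a missing runtime becomes a skip rather than a red result. A sketch — the test name and gating variable are hypothetical, but `pytest.mark.skipif` is the standard mechanism:

```python
import shutil

import pytest

SRT_AVAILABLE = shutil.which("srt") is not None

# Environment failures become skips, not failures, so a missing sandbox
# runtime cannot masquerade as a feature regression.
@pytest.mark.skipif(not SRT_AVAILABLE, reason="sandbox runtime (srt) not installed")
def test_sandbox_smoke():
    # The real body would run BashTool through the sandbox-enabled
    # settings file created above; the skip gate is the point here.
    assert SRT_AVAILABLE
```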
Each test follows this pattern:

```python
# Inside an async test function:
engine = make_engine(system_prompt="...", cwd=UNFAMILIAR_PROJECT)

evs1 = [ev async for ev in engine.submit_message("Read X, analyze Y")]
r1 = collect(evs1)  # summarizes text, tools, turns, tokens

evs2 = [ev async for ev in engine.submit_message("Based on what you found...")]
r2 = collect(evs2)

assert "grep" in r1["tools"]  # verify tools actually ran
```
For detailed code templates and the make_engine/collect helpers, consult references/test-patterns.md.
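Purely to show the shape of the summary that pattern relies on, here is a minimal sketch of a `collect`-style aggregator. The event shapes (objects with `type`, `text`, and `tool_name` attributes) are assumptions for illustration, not the real OpenHarness event model — use the helpers in references/test-patterns.md for actual tests:

```python
def collect(events) -> dict:
    """Aggregate engine events into a run summary.

    Illustrative only: assumes each event carries a `type` attribute
    plus optional `text` / `tool_name` fields.
    """
    result = {"text": "", "tools": [], "turns": 0}
    for ev in events:
        kind = getattr(ev, "type", None)
        if kind == "text":
            result["text"] += getattr(ev, "text", "")
        elif kind == "tool_call":
            result["tools"].append(getattr(ev, "tool_name", "?"))
        elif kind == "turn_end":
            result["turns"] += 1
    return result
```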
For meaningful end-to-end validation, prefer unfamiliar-repo tasks that force multiple turns, context reuse, and mixed tool usage.
Recommended pattern:
| Repo | max_turns | Typical duration |
|---|---|---|
| AutoAgent | 200 | 240-600s |

Recommended long-horizon scenarios:

- `architecture_multiturn` — multi-turn architecture analysis of the repo; passes when `bash`, `glob`, `grep`, and `read_file` all appear, with no timeout and no MaxTurnsExceeded
- `hook_block_and_recover` — a hook blocks `bash`; passes when the agent recovers via `glob`/`grep`/`read_file`
- `sandbox_multiturn` — sandbox enabled with `fail_if_unavailable=true`, running commands like `pwd && ls -la`; passes when `bash` executes via the sandbox, non-shell tools continue the task, and the agent recovers from incidental repo errors

When a scenario fails, classify it before changing code:
- MaxTurnsExceeded: likely eval-harness misconfiguration if `max_turns` was manually lowered
- timeout: the task is too broad or the per-prompt timeout is too small
- missing `srt`, `bwrap`, or `rg`: an environment/setup failure, not a feature regression

Run the existing suites with:

```bash
python tests/test_merged_prs_on_autoagent.py   # PR feature tests
python tests/test_real_large_tasks.py          # large multi-step tasks
python tests/test_hooks_skills_plugins_real.py # hooks/skills/plugins
python -m pytest tests/ -q -k "not autoagent"  # unit tests (no API)
```
For ad hoc long-horizon validation, it is acceptable to run a temporary Python driver script, as long as it exercises the same real API path and keeps the product-default `max_turns`. Interpret its results with the same table:
| Result | Meaning | Action |
|---|---|---|
| PASS with tool calls | Feature works end-to-end | Done |
| PASS without tool calls | Model answered from knowledge | Rewrite prompt to force tool use |
| FAIL with exception | Code bug | Read traceback |
| FAIL with wrong output | Model behavior issue | Check system prompt and tool schemas |
| Timeout | Task too complex | Increase max_turns or simplify prompt |
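The "PASS without tool calls" row deserves an example. A sketch of the difference between a prompt the model can answer from prior knowledge and one that forces tool use, plus a guard helper — both prompts and the `assert_tools_ran` helper are hypothetical illustrations:

```python
# A prompt the model may answer from prior knowledge (tool use optional):
weak_prompt = "What does AutoAgent do?"

# A prompt that cannot be answered without inspecting the checkout,
# so a genuine pass must include tool calls:
forcing_prompt = (
    "List every file under /tmp/eval-workspace that defines a class "
    "whose name ends in 'Agent', quoting the `class` line of each."
)

def assert_tools_ran(result: dict) -> None:
    """Fail loudly when a run 'passed' without exercising any tools."""
    if not result.get("tools"):
        raise AssertionError(
            "model answered without tool calls; rewrite the prompt so it "
            "requires reading the repo"
        )
```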
For long-running real evals, refine the timeout guidance:

- MaxTurnsExceeded first suggests that `max_turns` was manually set too low.
- If `max_turns=200` and the run still fails, the next suspect is wall-clock timeout, not turn count.
- A missing `srt`/`bwrap`/`rg` is an eval environment issue, not a product bug.

Additional rules of thumb:

- Do not artificially lower `max_turns` during real evals; it can create false failures that do not reflect product defaults.
- Parameterize the eval repo path with a `WORKSPACE` variable, and skip in CI with `pytest.mark.skipif`.
- Do not stop at raw `srt` checks — verify the OpenHarness adapter path too.

Reference material:

- references/test-patterns.md — Complete code templates for make_engine, collect, and each feature category
- references/feature-matrix.md — Detailed test cases for every OpenHarness module

Working test suites in the repo:

- tests/test_merged_prs_on_autoagent.py — PR feature validation
- tests/test_real_large_tasks.py — Large multi-step tasks
- tests/test_hooks_skills_plugins_real.py — Hooks/skills/plugins in agent loops
- tests/test_untested_features.py — Module-level integration tests