Run one iteration of AgentHarness evolution for Terminal-Bench 2.
Run ONE iteration of agent scaffold evolution.
You do NOT run benchmarks. You analyze results + failed trajectories, propose agent variants, and implement them. The outer loop (meta_harness.py) handles benchmarking.
You are evolving the AgentHarness agent scaffold for Terminal-Bench 2. It is located in agents/baseline_kira.py.
The search space is arbitrary Python code. You can override any method, call any library, make raw API calls, add new tools, change how the LLM is called, rewrite command execution, intercept and transform observations -- anything that's expressible in Python is fair game. The only constraint is that the agent must subclass harbor.agents.terminus_2.terminus_2.Terminus2 in the same way as baseline_kira.py does (for compatibility with the eval harness).
Model: Claude Opus 4.6. Evaluation uses the official TB2 hard split. meta_harness.py chooses the trial count; the shipped smoke/default path uses 2 trials per task. The released reference run starts from a 28.1% KIRA baseline and reaches 46.5% on this split.
Key files to read:
agents/baseline_kira.py - the full baseline implementation. Read to understand overridable methods.You should copy over one of the agents in agents/ as a starting point. You can rewrite or override ANY method. Never import from other candidate agents. Copy any code you want to reuse.
Key methods:
_call_llm_with_tools - makes the litellm API call. Override to change tools, add parameters, adjust retries._parse_tool_calls - converts raw tool call dicts to commands. Override to add new tools._execute_commands - runs commands on tmux. Override to change execution behavior._run_agent_loop - main episode loop. Override for structural changes._get_completion_confirmation_message - what to ask on the "are you sure?" step._get_prompt_template_path - path to system prompt. Override to use a custom prompt._summarize_context - summarizes history on context overflow. Override to change summarization._execute_image_read - handles image_read tool. Override to change multimodal behavior.Each candidate is a single Python file at agents/<name>.py containing the full agent class. No subdirectories needed.
The agent class must be named AgentHarness and subclass harbor.agents.terminus_2.terminus_2.Terminus2.
The agent will be loaded and evaluated through Harbor via --agent-import-path "agents.<name>:AgentHarness".
agents/<name>.py file freely.prompt-templates/<name>.txt and point _get_prompt_template_path to it.meta_harness.py, or claude_wrapper.py.Launch ONE Agent subagent (subagent_type: "general-purpose"). It should:
Read state files:
frontier_val.json - current best agent per task (path given in task prompt)evolution_summary.jsonl - what's been tried, what worked/didn't, plus rollout_metrics (path given in task prompt)Deep-read failed AND successful trajectories. Most important step.
Use the current run outputs under jobs/ and logs/ as the primary source of truth.
Read agent implementations in agents/*.py
Return:
STATE: <5-line summary: current scores, what's been tried, avg tokens/turns/cost>
HYPOTHESIS: "<falsifiable claim about what will improve scores>"
CANDIDATE: name=<snake_case>, changes="<specific changes>", prediction="<expected pass rate improvement AND expected token/turn impact>"
Launch 1 Agent subagent (subagent_type: "general-purpose"). The prompt must include the candidate name, class name, specific changes from Step 1, and the working directory.
The subagent should:
agents/ as a starting point: agents/<snake_case_name>.py. Your final agent should be a subclass of harbor.agents.terminus_2.terminus_2.Terminus2.uv run python -c "from agents.<name> import *; print('OK')")Write pending_eval.json to the path specified in the task prompt:
{
"iteration": <N>,
"candidates": [
{
"name": "<name>",
"import_path": "agents.<name>:AgentHarness",
"hypothesis": "<falsifiable claim>",
"changes": "<what was changed>",
"expected_efficiency": "<expected token/turn impact>"
}
]
}
Output: CANDIDATES: <name1>
AgentHarness in candidate files. The import path is always agents.<name>:AgentHarness.AgentHarness are async. Use await when calling super() methods.AgentHarness calls litellm.acompletion directly -- it does NOT use harbor's Chat class for LLM calls. The Chat object is passed in but only used for message history / token counting.TOOLS constant defines the native tool schema. You can extend it with new tools by overriding _call_llm_with_tools to pass tools=TOOLS + [new_tool] and _parse_tool_calls to handle the new tool.{instruction} and {terminal_state} placeholders (standard Python .format())._limit_output_length truncates at 30KB (KIRA's limit, vs 10KB in Terminus2).