This skill should be used when the user asks to "test the harness", "run integration tests", "validate features with real API", "test with real model calls", "run agent loop tests", "verify end-to-end", or needs to verify OpenHarness features on a real codebase with actual LLM calls.
Validate OpenHarness features by running real agent loops against an unfamiliar codebase with actual LLM API calls. Every test exercises the full stack: API client → model → tool calls → execution → result.
Clone an unfamiliar project (do not use OpenHarness itself as the eval target):

```bash
git clone https://github.com/HKUDS/AutoAgent /tmp/eval-workspace
export ANTHROPIC_API_KEY=sk-xxx
export ANTHROPIC_BASE_URL=https://api.moonshot.cn/anthropic  # or any Anthropic-compatible provider
export ANTHROPIC_MODEL=kimi-k2.5
```
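Before spending API budget, a quick preflight can confirm the variables above are actually set. A minimal sketch — the `missing_env` helper and the `REQUIRED`/`RECOMMENDED` tuples are this sketch's, not OpenHarness API:

```python
import os

REQUIRED = ("ANTHROPIC_API_KEY",)
# Base URL and model fall back to provider defaults when unset.
RECOMMENDED = ("ANTHROPIC_BASE_URL", "ANTHROPIC_MODEL")

def missing_env(required=REQUIRED) -> list[str]:
    """Return the required eval variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

# Fail fast before burning API budget on a misconfigured run.
if missing_env():
    print("set these before running evals:", missing_env())
```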
For long-running real evals, do not artificially lower max_turns. Use the product default (200) unless the user explicitly wants a tighter bound.
If the task is validating sandbox behavior, install and verify the actual runtime before running agent loops:

```bash
npm install -g @anthropic-ai/sandbox-runtime
sudo apt-get update
sudo apt-get install -y bubblewrap ripgrep
which srt
which bwrap
which rg
srt --version
```
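The `which` checks above can also be scripted from Python with `shutil.which`, which is convenient inside a test harness. A minimal sketch — the helper name and `SANDBOX_DEPS` tuple are this sketch's own:

```python
import shutil

# Binaries installed above; a missing one is an environment/setup
# failure, not an OpenHarness feature regression.
SANDBOX_DEPS = ("srt", "bwrap", "rg")

def missing_sandbox_deps(deps=SANDBOX_DEPS) -> list[str]:
    """Return the sandbox dependencies that are not on PATH."""
    return [dep for dep in deps if shutil.which(dep) is None]
```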
Then run a minimal smoke check through OpenHarness, not just raw `srt`, so you verify the real adapter path:

```python
from pathlib import Path

from openharness.config.settings import Settings, SandboxSettings, save_settings
from openharness.tools.bash_tool import BashTool

cfg = Path("/tmp/openharness-sandbox-settings.json")
save_settings(Settings(sandbox=SandboxSettings(enabled=True, fail_if_unavailable=True)), cfg)
# Point the config loader at this file, then run BashTool on a tiny command such as `pwd`.
```
If sandbox dependencies are missing, treat that as an environment/setup failure, not a feature regression.
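One way to enforce that rule in pytest-based suites is a `skipif` gate, so a missing runtime becomes a skip rather than a red result. A sketch — the test name and gating variable are hypothetical, but `pytest.mark.skipif` is the standard mechanism:

```python
import shutil

import pytest

SRT_AVAILABLE = shutil.which("srt") is not None

# Environment failures become skips, not failures, so a missing sandbox
# runtime cannot masquerade as a feature regression.
@pytest.mark.skipif(not SRT_AVAILABLE, reason="sandbox runtime (srt) not installed")
def test_sandbox_smoke():
    # The real body would run BashTool through the sandbox-enabled
    # settings file created above; the skip gate is the point here.
    assert SRT_AVAILABLE
```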
Each test follows this pattern:

```python
# Inside an async test function:
engine = make_engine(system_prompt="...", cwd=UNFAMILIAR_PROJECT)

evs1 = [ev async for ev in engine.submit_message("Read X, analyze Y")]
r1 = collect(evs1)  # summarizes text, tools, turns, tokens

evs2 = [ev async for ev in engine.submit_message("Based on what you found...")]
r2 = collect(evs2)

assert "grep" in r1["tools"]  # verify tools actually ran
```
For detailed code templates and the make_engine/collect helpers, consult references/test-patterns.md.
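Purely to show the shape of the summary that pattern relies on, here is a minimal sketch of a `collect`-style aggregator. The event shapes (objects with `type`, `text`, and `tool_name` attributes) are assumptions for illustration, not the real OpenHarness event model — use the helpers in references/test-patterns.md for actual tests:

```python
def collect(events) -> dict:
    """Aggregate engine events into a run summary.

    Illustrative only: assumes each event carries a `type` attribute
    plus optional `text` / `tool_name` fields.
    """
    result = {"text": "", "tools": [], "turns": 0}
    for ev in events:
        kind = getattr(ev, "type", None)
        if kind == "text":
            result["text"] += getattr(ev, "text", "")
        elif kind == "tool_call":
            result["tools"].append(getattr(ev, "tool_name", "?"))
        elif kind == "turn_end":
            result["turns"] += 1
    return result
```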
For meaningful end-to-end validation, prefer unfamiliar-repo tasks that force multiple turns, context reuse, and mixed tool usage.
Recommended pattern:
| Repo | max_turns | Typical duration |
|---|---|---|
| AutoAgent | 200 | 240-600s |

Recommended long-horizon scenarios:

- `architecture_multiturn` — multi-turn architecture analysis of the repo; passes when `bash`, `glob`, `grep`, and `read_file` all appear, with no timeout and no MaxTurnsExceeded
- `hook_block_and_recover` — a hook blocks `bash`; passes when the agent recovers via `glob`/`grep`/`read_file`
- `sandbox_multiturn` — sandbox enabled with `fail_if_unavailable=true`, running commands like `pwd && ls -la`; passes when `bash` executes via the sandbox, non-shell tools continue the task, and the agent recovers from incidental repo errors

When a scenario fails, classify it before changing code:
- MaxTurnsExceeded: likely eval-harness misconfiguration if `max_turns` was manually lowered
- timeout: the task is too broad or the per-prompt timeout is too small
- missing `srt`, `bwrap`, or `rg`: an environment/setup failure, not a feature regression

Run the existing suites with:

```bash
python tests/test_merged_prs_on_autoagent.py   # PR feature tests
python tests/test_real_large_tasks.py          # large multi-step tasks
python tests/test_hooks_skills_plugins_real.py # hooks/skills/plugins
python -m pytest tests/ -q -k "not autoagent"  # unit tests (no API)
```
For ad hoc long-horizon validation, it is acceptable to run a temporary Python driver script, as long as it exercises the same real API path and keeps the product-default `max_turns`. Interpret its results with the same table:
| Result | Meaning | Action |
|---|---|---|
| PASS with tool calls | Feature works end-to-end | Done |
| PASS without tool calls | Model answered from knowledge | Rewrite prompt to force tool use |
| FAIL with exception | Code bug | Read traceback |
| FAIL with wrong output | Model behavior issue | Check system prompt and tool schemas |
| Timeout | Task too complex | Increase max_turns or simplify prompt |
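The "PASS without tool calls" row deserves an example. A sketch of the difference between a prompt the model can answer from prior knowledge and one that forces tool use, plus a guard helper — both prompts and the `assert_tools_ran` helper are hypothetical illustrations:

```python
# A prompt the model may answer from prior knowledge (tool use optional):
weak_prompt = "What does AutoAgent do?"

# A prompt that cannot be answered without inspecting the checkout,
# so a genuine pass must include tool calls:
forcing_prompt = (
    "List every file under /tmp/eval-workspace that defines a class "
    "whose name ends in 'Agent', quoting the `class` line of each."
)

def assert_tools_ran(result: dict) -> None:
    """Fail loudly when a run 'passed' without exercising any tools."""
    if not result.get("tools"):
        raise AssertionError(
            "model answered without tool calls; rewrite the prompt so it "
            "requires reading the repo"
        )
```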
For long-running real evals, refine the timeout guidance:

- MaxTurnsExceeded first suggests that `max_turns` was manually set too low.
- If `max_turns=200` and the run still fails, the next suspect is wall-clock timeout, not turn count.
- A missing `srt`/`bwrap`/`rg` is an eval environment issue, not a product bug.

Additional rules of thumb:

- Do not artificially lower `max_turns` during real evals; it can create false failures that do not reflect product defaults.
- Parameterize the eval repo path with a `WORKSPACE` variable, and skip in CI with `pytest.mark.skipif`.
- Do not stop at raw `srt` checks — verify the OpenHarness adapter path too.

Reference material:

- references/test-patterns.md — Complete code templates for make_engine, collect, and each feature category
- references/feature-matrix.md — Detailed test cases for every OpenHarness module

Working test suites in the repo:

- tests/test_merged_prs_on_autoagent.py — PR feature validation
- tests/test_real_large_tasks.py — Large multi-step tasks
- tests/test_hooks_skills_plugins_real.py — Hooks/skills/plugins in agent loops
- tests/test_untested_features.py — Module-level integration tests