Methodology for analyzing factory telemetry and proposing evidence-based improvements. Query patterns, evidence standards, and risk classification. Injected into Oracle's context.
This guides how you analyze the factory's performance and propose changes. Every proposal must be backed by evidence from the telemetry database — not intuition, not best practices, not "I think this would be better."
The telemetry database is at `eval/factory.db` (SQLite). Use Bash to query it.
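A minimal sketch of the sqlite3 CLI invocation, assuming the `sqlite3` binary is installed. To keep the example self-contained it builds a throwaway database with the `agent_runs` columns used by the queries below and hypothetical rows; against the real factory you would point sqlite3 at `eval/factory.db` directly.

```shell
# Self-contained sketch: throwaway DB with the agent_runs columns used
# below (hypothetical rows), queried the same way as eval/factory.db.
db=$(mktemp -u).db
sqlite3 "$db" "CREATE TABLE agent_runs (id INTEGER PRIMARY KEY, agent TEXT, verdict TEXT);"
sqlite3 "$db" "INSERT INTO agent_runs (agent, verdict) VALUES
               ('wonder-woman','pass'), ('wonder-woman','fail'), ('oracle','pass');"
# -header/-column for reading in a terminal; use -csv when piping to other tools.
sqlite3 -header -column "$db" \
  "SELECT agent, COUNT(*) AS total_runs FROM agent_runs GROUP BY agent ORDER BY agent;"
rm -f "$db"
```

`-csv` output is the easier form to post-process when you feed results into the PR body.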
Agent failure rates:

```sql
SELECT agent,
       COUNT(*) AS total_runs,
       SUM(CASE WHEN verdict = 'fail' THEN 1 ELSE 0 END) AS failures,
       ROUND(100.0 * SUM(CASE WHEN verdict = 'fail' THEN 1 ELSE 0 END) / COUNT(*), 1) AS fail_rate
FROM agent_runs
GROUP BY agent
ORDER BY fail_rate DESC;
```
Token usage by agent (cost optimization):

```sql
SELECT agent, model,
       AVG(output_tokens) AS avg_tokens,
       MIN(output_tokens) AS min_tokens,
       MAX(output_tokens) AS max_tokens
FROM agent_runs
GROUP BY agent, model;
```
Duration trends:

```sql
SELECT agent,
       AVG(duration_ms) AS avg_duration,
       MAX(duration_ms) AS max_duration
FROM agent_runs
GROUP BY agent
ORDER BY avg_duration DESC;
```
Failed run transcripts (for root cause analysis):

```sql
SELECT ar.agent, ar.verdict, at.prompt_text, at.response_text
FROM agent_runs ar
JOIN agent_transcripts at ON ar.id = at.agent_run_id
WHERE ar.verdict = 'fail'
ORDER BY ar.started_at DESC;
```
Evidence standards:
- Every proposal must reference specific data.
- When metrics show a pattern, read the transcripts to understand WHY before proposing a change.
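One way to follow a metric into the transcripts is to narrow the failed-run transcript join to the single agent the metrics flagged. A self-contained sketch with a throwaway database and hypothetical rows (schema inferred from the transcript query above); against the real factory, run only the final SELECT against `eval/factory.db`.

```shell
# Throwaway DB mimicking the two telemetry tables (hypothetical rows).
db=$(mktemp -u).db
sqlite3 "$db" <<'SQL'
CREATE TABLE agent_runs (id INTEGER PRIMARY KEY, agent TEXT, verdict TEXT, started_at TEXT);
CREATE TABLE agent_transcripts (agent_run_id INTEGER, prompt_text TEXT, response_text TEXT);
INSERT INTO agent_runs VALUES (1, 'wonder-woman', 'fail', '2024-01-01T00:00:00');
INSERT INTO agent_runs VALUES (2, 'oracle', 'pass', '2024-01-02T00:00:00');
INSERT INTO agent_transcripts VALUES (1, 'build the widget', 'gave up early');
SQL
# Drill-down: transcripts for one flagged agent's recent failures only.
sqlite3 "$db" \
  "SELECT at.prompt_text, at.response_text
     FROM agent_runs ar
     JOIN agent_transcripts at ON ar.id = at.agent_run_id
    WHERE ar.verdict = 'fail' AND ar.agent = 'wonder-woman'
    ORDER BY ar.started_at DESC LIMIT 5;"
rm -f "$db"
```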
Risk classification:
- **Safe** (apply; list under "Applied Changes"): changes that can only help, never hurt.
- **Needs review** (propose only; list under "Proposed Changes"): changes that could affect other agents.
- **Dangerous** (do not apply; list under "Flagged Concerns"): changes that weaken safety.
Your PR should be structured for easy human review:

Title: `Oracle: [N] improvements based on [M] factory runs`

Body:

```markdown
## Applied Changes (safe)
- [Change 1]: [rationale] — Evidence: [citation]
- [Change 2]: [rationale] — Evidence: [citation]

## Proposed Changes (needs review)
- [Change 3]: [rationale] — Evidence: [citation]

## Flagged Concerns (dangerous)
- [Change 4]: [rationale] — Evidence: [citation]

## Telemetry Summary
- Runs analyzed: [N]
- Overall pass rate: [X]%
- Most frequent failures: [agent] ([rate]%)
```
Your output must conform to `.claude/schemas/improvement.schema.json`:

```json
{
  "run_count_analyzed": 8,
  "patterns_detected": ["description of each pattern"],
  "proposals": [
    {
      "target_agent": "wonder-woman",
      "change_type": "prompt",
      "current_value": "current text",
      "proposed_value": "proposed text",
      "rationale": "why this change helps",
      "evidence": ["run #3: ...", "run #5: ..."],
      "risk_level": "safe"
    }
  ]
}
```
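Before opening the PR, a quick parse-and-keys sanity check can catch a malformed report early. A self-contained sketch (the stand-in report written here is hypothetical, and python3's stdlib `json` module does the parsing); real validation is against `.claude/schemas/improvement.schema.json`.

```shell
# Write a minimal stand-in report (hypothetical), then confirm it parses
# and carries the schema's required top-level keys.
report=$(mktemp -u).json
cat > "$report" <<'JSON'
{"run_count_analyzed": 8, "patterns_detected": [], "proposals": []}
JSON
python3 - "$report" <<'PY'
import json, sys
doc = json.load(open(sys.argv[1]))
for key in ("run_count_analyzed", "patterns_detected", "proposals"):
    assert key in doc, "missing key: " + key
print("report keys ok")
PY
rm -f "$report"
```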