Run one iteration of memory system evolution. Called by meta_harness.py or interactively via /meta-harness.
Run ONE iteration of memory system evolution. Do all work in the main session — do NOT delegate to subagents. Constraints get lost when you delegate, leading to parameter-only changes and skipped prototyping.
You do NOT run benchmarks. You analyze results + prediction traces, prototype changes, and implement new systems. The outer loop (meta_harness.py) handles benchmarking separately.
The most common failure mode is creating systems that are just parameter variants of existing ones. Check evolution_summary.jsonl for what's been tried — parameter sweeps (pool sizes, retrieval counts, context budgets, similarity metrics) almost always regress or tie.
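Scanning that history can be done mechanically before proposing anything new. A minimal sketch, assuming the per-line fields shown in the evolution_summary.jsonl format at the end of this document:

```python
import json

def summarize_history(jsonl_lines):
    """Collect system names, component tags, and axes already evaluated."""
    seen = {"systems": set(), "components": set(), "axes": []}
    for line in jsonl_lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        seen["systems"].add(rec["system"])
        seen["components"].update(rec.get("components", []))
        seen["axes"].append(rec.get("axis"))
    return seen

# Toy history in the evolution_summary.jsonl line format:
history = [
    '{"iteration": 1, "system": "example_system", "axis": "exploitation", "components": ["tag1", "tag2"]}',
    '{"iteration": 2, "system": "other_system", "axis": "exploitation", "components": ["tag3"]}',
]
seen = summarize_history(history)
```

If a proposed candidate's component tags are all already in `seen["components"]`, that is a hint it may be a parameter variant rather than a new mechanism.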
Good candidates change a fundamental mechanism.
Bad candidates just tune numbers. If the logic in predict() and learn_from_batch() is identical to the base except for constants, it's a parameter variant. Rewrite with a genuinely novel mechanism.
Combining systems is valid. Take the retrieval strategy from system A and the memory format from system B, or draw on published approaches (DSPy, OPRO, Reflexion, CEIL, etc.).
Exploitation axes: A=Prompt template, B=Memory content, C=Selection algorithm, D=Memory sizing, E=Learning trigger, F=LLM usage in learning. If last 3 iterations explored the same axis, pick different ones.
Do ALL steps yourself in the main session.
Check the reports directory (path in the task prompt's "Run directories" section). For each past iteration that has results in evolution_summary.jsonl but NO report, write one. Each report should be <=30 lines covering: what changed, which datasets improved/regressed and why, and a takeaway for future iterations.
Read all state files:
- evolution_summary.jsonl — what's been tried (one JSON per candidate)
- frontier_val.json — current best per dataset (val accuracy)
- config.yaml — current datasets and baselines
- logs/<dataset>/<agent>/<model>/log.jsonl — traces, if they exist

Formulate 3 hypotheses — each must be falsifiable and target a different mechanism.
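When forming hypotheses, it helps to compute where each dataset stands relative to the frontier. A sketch, assuming frontier_val.json maps dataset names to val accuracy (the exact shape, the helper name, and the numbers below are illustrative assumptions):

```python
def delta_vs_frontier(frontier, candidate):
    """Per-dataset delta of a candidate's val accuracy over the current frontier."""
    return {ds: acc - frontier.get(ds, 0.0) for ds, acc in candidate.items()}

# Hypothetical numbers for illustration only:
frontier = {"dataset_a": 88.0, "dataset_b": 72.5}
candidate = {"dataset_a": 89.1, "dataset_b": 71.0}
deltas = delta_vs_frontier(frontier, candidate)
```

Datasets with the largest negative deltas are natural targets for the next hypothesis.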
You MUST prototype your mechanism before writing the final system. Do NOT skip this step. Candidates that skip prototyping tend to have bugs or produce no improvement.
For each candidate:
- Write a small prototype script in /tmp/ that exercises the core retrieval/learning logic in isolation.
- Use existing logs/<dataset>/<memory>/<model>/log.jsonl to test against.

For each of the 3 candidates:
- Copy the closest existing system to agents/<name>.py, then make targeted modifications. This copy-then-edit approach ensures correct imports and proven patterns.
- If the logic in predict() and learn_from_batch() is identical to the base except for numbers, REWRITE with a truly novel mechanism.
- Verify the import: uv run python -c "from text_classification.agents.<name> import *; print('OK')"
- Do not edit config.yaml just to register candidates. The benchmark auto-discovers files in agents/.
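The prototyping step above can be as small as replaying logged examples through the candidate's retrieval logic and measuring whether useful memories surface. A self-contained sketch with toy data — the `replay` helper and the `recent_k` baseline are hypothetical, and a real prototype would read a logs/<dataset>/<memory>/<model>/log.jsonl trace instead:

```python
import json

def replay(log_lines, retrieve):
    """Replay logged examples in order, asking the candidate retrieval
    function for a pool before each one, then storing the example."""
    hits = total = 0
    memory = []
    for line in log_lines:
        rec = json.loads(line)
        pool = retrieve(memory, rec["input"])
        hits += any(m["ground_truth"] == rec["ground_truth"] for m in pool)
        total += 1
        memory.append({"input": rec["input"], "ground_truth": rec["ground_truth"]})
    return hits / max(total, 1)

# Baseline to beat: just return the k most recent memories.
def recent_k(memory, query, k=3):
    return memory[-k:]

# Toy trace standing in for a real log.jsonl file:
trace = [json.dumps({"input": f"x{i}", "ground_truth": "A" if i % 2 else "B"})
         for i in range(6)]
hit_rate = replay(trace, recent_k)
```

A candidate retrieval function that cannot beat `recent_k` on replayed traces is unlikely to beat it in the full benchmark.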
Write to the path specified in the task prompt (NOT hardcoded — it may be in a run-specific subdirectory):
{
"iteration": <N>,
"candidates": [
{
"name": "<snake_case_name>",
"file": "agents/<name>.py",
"hypothesis": "<falsifiable claim>",
"axis": "exploitation|exploration",
"base_system": "<what it builds on>",
"components": ["tag1", "tag2", "..."]
}
]
}
Output: CANDIDATES: <name1>, <name2>, <name3>
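Before emitting the CANDIDATES line, it is cheap to sanity-check the file against the schema above. A minimal sketch (the helper name and the strictness are assumptions, not part of the harness):

```python
import json

REQUIRED = {"name", "file", "hypothesis", "axis", "base_system", "components"}

def check_candidates(text):
    """Parse a candidates file and return the candidate names,
    raising if any entry is missing a required key."""
    doc = json.loads(text)
    names = []
    for cand in doc["candidates"]:
        missing = REQUIRED - cand.keys()
        if missing:
            raise ValueError(f"{cand.get('name', '?')} missing {sorted(missing)}")
        names.append(cand["name"])
    return names

# Hypothetical candidates file for illustration:
sample = json.dumps({"iteration": 1, "candidates": [{
    "name": "example_candidate", "file": "agents/example_candidate.py",
    "hypothesis": "...", "axis": "exploration",
    "base_system": "example", "components": ["tag1"]}]})
names = check_candidates(sample)
```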
class MemorySystem(ABC):
def __init__(self, llm: LLMCallable): ...
def predict(self, input: str) -> tuple[str, dict[str, Any]]: ...
def learn_from_batch(self, batch_results: list[dict[str, Any]]) -> None: ...
def get_state(self) -> str: ... # JSON-serializable
def set_state(self, state: str) -> None: ...
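A minimal cold-start-safe subclass of this interface might look like the sketch below. The base class and LLMCallable are re-declared as local stand-ins so the sketch runs on its own; in the repo you would import the real ones as described next, and the store-correct-examples mechanism here is purely illustrative:

```python
import json
from typing import Any, Callable

LLMCallable = Callable[[str], str]  # stand-in for the real ..llm.LLMCallable

class MemorySystem:  # stand-in for the real ABC in ..memory_system
    def __init__(self, llm: LLMCallable):
        self._llm = llm
    def call_llm(self, prompt: str) -> str:
        return self._llm(prompt)

class ExampleStore(MemorySystem):
    """Stores correct examples and prepends the most recent few as shots."""
    def __init__(self, llm: LLMCallable):
        super().__init__(llm)
        self.examples: list[dict[str, Any]] = []

    def predict(self, input: str) -> tuple[str, dict[str, Any]]:
        shots = self.examples[-3:]  # empty on cold start, by design
        lines = [f"{e['input']} -> {e['ground_truth']}" for e in shots]
        lines.append(f"{input} ->")
        answer = self.call_llm("\n".join(lines))
        return answer, {"shots_used": len(shots)}

    def learn_from_batch(self, batch_results: list[dict[str, Any]]) -> None:
        self.examples.extend(r for r in batch_results if r["was_correct"])

    def get_state(self) -> str:
        return json.dumps(self.examples)  # JSON-serializable, as required

    def set_state(self, state: str) -> None:
        self.examples = json.loads(state)

# Cold-start check with a dummy LLM:
store = ExampleStore(lambda prompt: "A")
answer, meta = store.predict("hello")
```

The cold-start call above must succeed with an empty memory, which is the contract every candidate has to satisfy.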
Imports and contract:
- MemorySystem from ..memory_system
- LLMCallable from ..llm, extract_json_field from ..memory_system
- Use extract_json_field(response, "final_answer") for answer extraction (NOT custom regex)
- Use self.call_llm(prompt) for LLM calls (NOT self._llm directly)
- predict must work without any prior learning (cold start)
- learn_from_batch receives a list of dicts with keys: input, prediction, ground_truth, was_correct, metadata

Files written per run:
- logs/<dataset>/<memory>/<model>/val.json (accuracy field)
- logs/<dataset>/<memory>/<model>/log.jsonl
- logs/<dataset>/<memory>/<model>/memory.json
- results/<dataset>/<memory>/<model>/test.json (separate dir, never exposed during evolution)

One JSON object per line, one line per evaluated candidate:
{"iteration": 1, "system": "example_system", "avg_val": 45.0, "axis": "exploitation", "hypothesis": "...", "delta": +2.1, "outcome": "45.0% (+2.1)", "components": ["tag1", "tag2", "tag3"]}
Treat evolution_summary.jsonl, frontier_val.json, and recent training traces as the only shipped history sources in this trimmed repo.