Build production-style multi-agent systems that plan, implement, test, and evaluate autonomously. Use when building orchestrated agent pipelines, breaking product ideas into milestones, or creating systems with planning, execution, testing, and evaluation layers.
Build production-grade multi-agent systems with orchestration, planning, implementation, testing, and evaluation layers. Transform product ideas into working prototypes through autonomous agent collaboration.
                +-------------------+
                |   Orchestrator    |
                |  (milestone loop) |
                +---------+---------+
                          |
    +----------+----------+----------+----------+
    |          |          |          |          |
+---v----+ +---v----+ +---v----+ +---v----+ +---v----+
|  Spec  | |  Arch  | |  Impl  | |  Test  | |  Eval  |
| Agent  | | Agent  | | Agent  | | Agent  | | Agent  |
+--------+ +--------+ +--------+ +--------+ +--------+
Each agent has a single responsibility. The Orchestrator sequences them, handles retries, and decides when a milestone is complete.
The Orchestrator is the control loop. It manages the pipeline from start to finish.
class Orchestrator:
    def __init__(self, product_idea: str, max_retries: int = 3):
        self.product_idea = product_idea
        self.max_retries = max_retries
        self.milestones = []
        self.results = {}

    def run(self):
        # Phase 1: Break idea into milestones
        self.milestones = self.plan_milestones(self.product_idea)
        # Phase 2: Run each milestone through the agent pipeline
        for milestone in self.milestones:
            self.execute_milestone(milestone)
        return self.compile_final_report()

    def execute_milestone(self, milestone: dict):
        retries = 0
        while retries < self.max_retries:
            spec = SpecAgent().run(milestone)
            arch = ArchAgent().run(spec)
            impl = ImplAgent().run(arch)
            test_results = TestAgent().run(impl)
            evaluation = EvalAgent().run(impl, test_results)
            if evaluation["pass"]:
                self.results[milestone["id"]] = {
                    "status": "complete",
                    "code": impl,
                    "tests": test_results,
                    "score": evaluation["score"],
                }
                return
            # Failed: feed the evaluation feedback into the next attempt
            retries += 1
            milestone["feedback"] = evaluation["feedback"]
        self.results[milestone["id"]] = {"status": "failed_after_retries"}
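The agents the Orchestrator calls can share one small interface so every stage is invoked the same way. A minimal sketch, assuming nothing beyond the pipeline above (the `Agent` base class and the stubbed `SpecAgent.run` body are illustrative, not part of the original design):

```python
from abc import ABC, abstractmethod


class Agent(ABC):
    """Shared interface so the Orchestrator treats all stages uniformly."""

    @abstractmethod
    def run(self, *inputs):
        """Consume upstream artifacts and return this stage's artifact."""


class SpecAgent(Agent):
    def run(self, milestone: dict) -> str:
        # A real implementation would call an LLM with the spec prompt;
        # this stub only demonstrates the shape of the contract.
        return f"Specification for: {milestone['title']}"
```

Keeping `run` as the single entry point makes retries trivial: the Orchestrator re-invokes the same method with an updated milestone.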
Each milestone is a self-contained deliverable:
{
    "id": "milestone-1-mvp",
    "title": "Core API with authentication",
    "description": "REST API with user registration, login, and JWT auth",
    "acceptance_criteria": [
        "POST /register creates a user",
        "POST /login returns a JWT",
        "Protected endpoints reject invalid tokens",
    ],
    "dependencies": [],
    "feedback": None,  # Populated on retry
}
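Because every downstream agent reads this structure, it is worth sanity-checking before the pipeline starts. A hedged sketch (the `validate_milestone` helper and its rules are assumptions, not part of the original design):

```python
REQUIRED_KEYS = {"id", "title", "description", "acceptance_criteria", "dependencies"}


def validate_milestone(milestone: dict) -> list:
    """Return a list of problems; an empty list means the milestone is usable."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - milestone.keys())]
    if "acceptance_criteria" in milestone and not milestone["acceptance_criteria"]:
        problems.append("acceptance_criteria must not be empty")
    return problems
```

Rejecting a malformed milestone up front is cheaper than discovering the gap after a full Spec-to-Eval pass.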
The Spec Agent transforms a milestone into a detailed specification.
You are a product specification agent. Given a milestone description
and acceptance criteria, produce a detailed specification.
Milestone: {milestone.title}
Description: {milestone.description}
Acceptance Criteria: {milestone.acceptance_criteria}
Previous Feedback: {milestone.feedback or "None (first iteration)"}
Output a specification document with these sections:
1. PRD (2-3 paragraphs)
2. User Stories (3-5 stories in standard format)
3. API Contracts (endpoint, method, request body, response body, status codes)
4. Data Model (entities with fields and types)
5. Non-Functional Requirements (performance, security, scalability)
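One way the Spec Agent might assemble this prompt from a milestone dict, sketched under the assumption that plain `str.format` templating is enough (`SPEC_PROMPT` and `build_spec_prompt` are illustrative names, not part of the original design):

```python
SPEC_PROMPT = """\
You are a product specification agent. Given a milestone description
and acceptance criteria, produce a detailed specification.

Milestone: {title}
Description: {description}
Acceptance Criteria: {criteria}
Previous Feedback: {feedback}
"""


def build_spec_prompt(milestone: dict) -> str:
    # .get() keeps the first iteration working before any feedback exists
    return SPEC_PROMPT.format(
        title=milestone["title"],
        description=milestone["description"],
        criteria="; ".join(milestone["acceptance_criteria"]),
        feedback=milestone.get("feedback") or "None (first iteration)",
    )
```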
The Arch Agent designs the system structure from the specification.
project/
  src/
    api/
      routes.py        # Endpoint definitions
      middleware.py    # Auth, rate limiting, error handling
      schemas.py       # Request/response Pydantic models
    core/
      services.py      # Business logic
      models.py        # Database models
      exceptions.py    # Custom exception classes
    db/
      connection.py    # Database connection management
      migrations/      # Schema migration files
  tests/
    test_api.py        # Endpoint integration tests
    test_services.py   # Business logic unit tests
    conftest.py        # Shared fixtures
  config.py            # Environment and app configuration
  main.py              # Application entry point
  requirements.txt     # Dependencies
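A layout like this can be generated mechanically before the Impl Agent fills in file contents. A small sketch (the `scaffold` helper and `LAYOUT` list are assumptions that mirror the tree above):

```python
from pathlib import Path

LAYOUT = [
    "src/api/routes.py",
    "src/api/middleware.py",
    "src/api/schemas.py",
    "src/core/services.py",
    "src/core/models.py",
    "src/core/exceptions.py",
    "src/db/connection.py",
    "src/db/migrations/",
    "tests/test_api.py",
    "tests/test_services.py",
    "tests/conftest.py",
    "config.py",
    "main.py",
    "requirements.txt",
]


def scaffold(root: str) -> None:
    """Create the project skeleton; entries ending in '/' are directories."""
    for entry in LAYOUT:
        path = Path(root) / entry
        if entry.endswith("/"):
            path.mkdir(parents=True, exist_ok=True)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
```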
The Impl Agent writes production-quality code from the architecture.
The Test Agent generates and runs tests against the implementation.
- Unit Tests: Test individual functions and methods in isolation
- Integration Tests: Test component interactions
- Contract Tests: Verify the implementation matches the spec
{
    "total": 24,
    "passed": 22,
    "failed": 2,
    "coverage": 87.3,
    "failures": [
        {
            "test": "test_login_with_expired_token",
            "error": "Expected 401, got 500",
            "file": "tests/test_api.py:45",
        },
    ],
}
The Eval Agent scores the implementation and decides if the milestone passes.
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 30% | All tests pass, acceptance criteria met |
| Code Quality | 20% | Readable, modular, well-named, documented |
| Security | 20% | Auth, input validation, no secrets, injection-safe |
| Maintainability | 15% | Easy to modify, extend, and debug |
| Scalability | 15% | Handles growth in data and traffic |
Each dimension is scored on a 1-5 scale.
Pass threshold: weighted average >= 3.5 AND no dimension below 2.
{
    "pass": True,
    "score": 4.1,
    "breakdown": {
        "correctness": 5,
        "code_quality": 4,
        "security": 4,
        "maintainability": 3,
        "scalability": 4,
    },
    "feedback": [
        "Add input length validation on the /register endpoint",
        "Extract database connection string to environment variable",
        "Add index on users.email for login query performance",
    ],
}
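The pass rule is mechanical to implement. A sketch (the `decide` function name is an assumption; the weights come from the rubric table above):

```python
WEIGHTS = {
    "correctness": 0.30,
    "code_quality": 0.20,
    "security": 0.20,
    "maintainability": 0.15,
    "scalability": 0.15,
}


def decide(breakdown: dict) -> tuple:
    """Apply the pass rule: weighted average >= 3.5 AND no dimension below 2."""
    score = sum(WEIGHTS[dim] * breakdown[dim] for dim in WEIGHTS)
    passed = score >= 3.5 and min(breakdown.values()) >= 2
    return passed, score
```

Note that the second condition matters: a breakdown can clear the weighted threshold while a single critical dimension (say, security at 1) still forces a retry.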
When a milestone fails evaluation, the Orchestrator writes evaluation.feedback back into the milestone and reruns the pipeline. The Impl Agent then receives both the previous code and the feedback, so it makes incremental corrections rather than starting from scratch.
Every agent call should be logged for debugging and auditability:
import logging
from datetime import datetime, timezone
logger = logging.getLogger("multi_agent")
def log_agent_call(
    agent_name: str,
    milestone_id: str,
    input_summary: str,
    output_summary: str,
    duration_ms: int,
):
    logger.info(
        "agent_call",
        extra={
            "agent": agent_name,
            "milestone": milestone_id,
            "input": input_summary[:200],
            "output": output_summary[:200],
            "duration_ms": duration_ms,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
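To capture duration_ms consistently, each agent call can go through a small timing wrapper. A hedged sketch (`timed_call` and the injected `log_fn` are illustrative; in practice `log_fn` could be a partial application of log_agent_call above):

```python
import time


def timed_call(agent_run, log_fn, **log_fields):
    """Run one agent step and report wall-clock duration through log_fn."""
    start = time.monotonic()
    output = agent_run()
    duration_ms = int((time.monotonic() - start) * 1000)
    log_fn(duration_ms=duration_ms, **log_fields)
    return output
```

Injecting the logger keeps the wrapper testable and lets the Orchestrator swap in structured logging, metrics, or a no-op during unit tests.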
When the Orchestrator completes all milestones, deliver: