Evaluate solutions through multi-round debate between independent judges until consensus is reached.
Key benefits: independent initial analysis prevents groupthink, evidence-grounded debate drives scores toward convergence, and automated consensus detection flags persistent disagreements for human review.

This command implements iterative multi-judge debate:
```
Phase 0: Setup
    mkdir -p .specs/reports
        |
Phase 0.5: Dispatch Meta-Judge
    Meta-Judge (Opus)
        |
    Evaluation Specification YAML
        |
Phase 1: Independent Analysis (3 judges in parallel)
             +- Judge 1 -> {name}.1.md -+
    Solution +- Judge 2 -> {name}.2.md -+
             +- Judge 3 -> {name}.3.md -+
        |
Phase 2: Debate Round (iterative) <-----+
    Each judge reads others' reports    |
        |                               |
    Argue + Defend + Challenge          |
    (grounded in eval specification)    |
        |                               |
    Revise if convinced                 |
        |                               |
    Check consensus                     |
        +- Yes -> Final Report          |
        +- No  -> Next Round -----------+
```
Before starting evaluation, ensure the reports directory exists:

```bash
mkdir -p .specs/reports
```
Report naming convention: `.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md`

Where:
- `{solution-name}` - Derived from the solution filename (e.g., `users-api` from `src/api/users.ts`)
- `{YYYY-MM-DD}` - Current date
- `[1|2|3]` - Judge number

### Phase 0.5: Dispatch Meta-Judge

Before independent analysis, dispatch a meta-judge agent to generate a tailored evaluation specification. The meta-judge runs ONCE and produces rubrics, checklists, and scoring criteria that ALL judges will use across ALL rounds.
**Meta-judge prompt template:**

```markdown
## Task
Generate an evaluation specification yaml for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that multiple judge agents will use to evaluate the solution through independent analysis and multi-round debate.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{task description - what the solution was supposed to accomplish}
## Context
{Any relevant context about the solution being evaluated}
## Artifact Type
{code | documentation | configuration | etc.}
## Evaluation Mode
Multi-judge debate with consensus-seeking across rounds
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support both independent analysis and debate-based refinement.
```
**Dispatch:**
Use Task tool:
- description: "Meta-judge: generate evaluation specification for {solution-name}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for the meta-judge to complete and extract the evaluation specification YAML from its output before proceeding to Phase 1.
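A minimal sketch of that extraction step, assuming the meta-judge wraps its specification in a fenced `yaml` code block (the function name is illustrative, not part of this command):

```python
import re
import yaml  # PyYAML; used here only to validate the extracted spec

FENCE = "`" * 3  # triple backtick, built indirectly to keep this doc renderable

def extract_eval_spec(meta_judge_output: str) -> dict:
    """Pull the evaluation specification out of a fenced yaml block."""
    pattern = re.compile(FENCE + r"yaml\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(meta_judge_output)
    if match is None:
        raise ValueError("meta-judge output contains no fenced yaml block")
    return yaml.safe_load(match.group(1))
```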
### Phase 1: Independent Analysis

Launch 3 independent judge agents in parallel (Opus for rigor). Each judge writes its report to `.specs/reports/{solution-name}-{date}.[1|2|3].md`.

**Key principle:** Independence in initial analysis prevents groupthink.
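A minimal sketch of deriving these report paths, assuming `{solution-name}` is built from the file stem plus its parent directory (an inference from the `users-api` example above, not a rule this command defines):

```python
from datetime import date
from pathlib import Path

def report_paths(solution_path: str, judges: int = 3) -> list[Path]:
    """Derive the per-judge report paths for a solution file."""
    p = Path(solution_path)
    # Assumed rule: '<stem>-<parent dir>' gives 'users-api'
    # for 'src/api/users.ts'.
    name = f"{p.stem}-{p.parent.name}"
    today = date.today().isoformat()  # YYYY-MM-DD
    return [Path(f".specs/reports/{name}-{today}.{n}.md")
            for n in range(1, judges + 1)]
```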
**Prompt template for initial judges:**

````markdown
You are Judge {N} evaluating a solution independently against an evaluation specification produced by the meta-judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Solution
{path to solution file(s)}
## Task Description
{what the solution was supposed to accomplish}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

## Report Location
Write your report to: .specs/reports/{solution-name}-{date}.{N}.md
Follow your full judge process as defined in your agent instructions!
Additional instructions:
- Begin your report with: Done by Judge {N}
````
**Dispatch each judge:**
Use Task tool: launch three calls in parallel, one per judge.
### Phase 2: Debate Rounds (Iterative)
For each debate round (max 3 rounds):
Launch **3 debate agents in parallel**:
1. Each judge agent receives:
- Path to their own previous report (`.specs/reports/{solution-name}-{date}.{N}.md`)
- Paths to the other judges' reports (`.specs/reports/{solution-name}-{date}.[1|2|3].md`)
- The original solution
- The meta-judge's evaluation specification YAML
2. Each judge:
- Identifies disagreements with other judges (>1 point score gap on any criterion)
- Defends their own ratings with evidence from the solution and evaluation specification
- Challenges other judges' ratings they disagree with
- Considers counter-arguments
- Revises their assessment if convinced
3. Each judge updates their report file with a new section: `## Debate Round {R}`
4. After all judges reply, check for consensus; if they have reached agreement, move to Phase 3: Consensus Report
**Key principle:** Judges communicate only through the filesystem. The orchestrator doesn't mediate and shouldn't read the report files itself; they can overflow its context.
**Prompt template for debate judges:**
````markdown
You are Judge {N} in debate round {R}.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Your Previous Report
{path to .specs/reports/{solution-name}-{date}.{N}.md}
## Other Judges' Reports
Judge 1: .specs/reports/{solution-name}-{date}.1.md
...
## Task Description
{what the solution was supposed to accomplish}
## Solution
{path to solution}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

## Report Location
Append to your existing report file: .specs/reports/{solution-name}-{date}.{N}.md
Follow your full judge process as defined in your agent instructions!
Additional debate instructions (CRITICAL):
````
**Dispatch each debate judge:**
Use Task tool: launch three calls in parallel, one per debate judge.
### Consensus Check
After each debate round, check for consensus:
**Consensus achieved if:**
- All judges' overall scores within 0.5 points of each other
- No criterion has >1 point disagreement across any two judges
- All judges explicitly state they accept the consensus
**If no consensus after 3 rounds:**
- Report persistent disagreements
- Provide all judge reports for human review
- Flag that automated evaluation couldn't reach consensus
**Orchestration Instructions:**
**Step 1: Dispatch Meta-Judge (Phase 0.5)**
1. Launch meta-judge agent
2. Wait for meta-judge to complete
3. Extract the evaluation specification YAML from meta-judge output
**Step 2: Run Independent Analysis (Phase 1)**
1. Launch 3 judge agents in parallel (Judge 1, 2, 3) with the evaluation specification YAML
2. Each writes their independent assessment to `.specs/reports/{solution-name}-{date}.[1|2|3].md`
3. Wait for all 3 agents to complete
**Step 3: Check for Consensus**
Let's work through this systematically to ensure accurate consensus detection.
Read all three reports and extract:
- Each judge's overall weighted score
- Each judge's score for every criterion
Check consensus step by step:
1. First, extract all overall scores from each report and list them explicitly
2. Calculate the difference between the highest and lowest overall scores
- If difference <= 0.5 points -> overall consensus achieved
- If difference > 0.5 points -> no consensus yet
3. Next, for each criterion, list all three judges' scores side by side
4. For each criterion, calculate the difference between highest and lowest scores
- If any criterion has difference > 1.0 point -> no consensus on that criterion
5. Finally, verify consensus is achieved only if BOTH conditions are met:
- Overall scores within 0.5 points
- All criterion scores within 1.0 point
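A minimal sketch of this check in code, assuming the overall and per-criterion scores have already been extracted from the three reports (the data shapes here are illustrative):

```python
def check_consensus(overall: list[float],
                    criteria: dict[str, list[float]]) -> bool:
    """Return True only if BOTH consensus conditions hold.

    overall:  one overall weighted score per judge, e.g. [4.2, 4.4, 4.5]
    criteria: per-criterion scores per judge,
              e.g. {"security": [3, 5, 4], "correctness": [4, 4, 5]}
    """
    # Condition 1: overall scores within 0.5 points of each other.
    if max(overall) - min(overall) > 0.5:
        return False
    # Condition 2: no criterion with more than a 1.0-point spread.
    return all(max(s) - min(s) <= 1.0 for s in criteria.values())
```

With the example walkthrough at the end of this document (security scored 3, 5, and 4 in Phase 1), the security spread is 2.0 points, so this check fails and a debate round is triggered.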
**Step 4: Decision Point**
- **If consensus achieved**: Go to Step 6 (Generate Consensus Report)
- **If no consensus AND round < 3**: Go to Step 5 (Run Debate Round)
- **If no consensus AND round = 3**: Go to Step 7 (Report No Consensus)
**Step 5: Run Debate Round**
1. Increment round counter (round = round + 1)
2. Launch 3 judge agents in parallel with the same evaluation specification YAML
3. Each agent reads:
- Their own previous report from filesystem
- Other judges' reports from filesystem
- Original solution
4. Each agent appends "Debate Round {R}" section to their own report file
5. Wait for all 3 agents to complete
6. Go back to Step 3 (Check for Consensus)
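Reusing `check_consensus` from the sketch above, the control flow of Steps 2 through 7 can be summarized as follows; the dispatch and parsing callables are injected because the real dispatches are Task tool calls, not Python:

```python
from typing import Callable

def run_evaluation(dispatch_judges: Callable[[], None],
                   dispatch_debate_round: Callable[[int], None],
                   read_scores: Callable[[], tuple],
                   max_rounds: int = 3) -> str:
    dispatch_judges()                           # Step 2: independent analysis
    for debate_round in range(max_rounds + 1):  # consensus check after Phase 1
        if check_consensus(*read_scores()):     # Step 3: ...and after each round
            return "consensus"                  # Step 6: generate consensus report
        if debate_round == max_rounds:
            return "no-consensus"               # Step 7: flag for human review
        dispatch_debate_round(debate_round + 1) # Step 5: next debate round
```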
**Step 6: Reply with Report**
Let's synthesize the evaluation results step by step.
1. Read all final reports carefully
2. Before generating the report, analyze the following:
- What is the consensus status (achieved or not)?
- What were the key points of agreement across all judges?
- What were the main areas of disagreement, if any?
- How did the debate rounds change the evaluations?
3. Reply to user with a report that contains:
   - If there is consensus:
     - Consensus scores (average of all judges)
     - Consensus strengths/weaknesses
     - Number of rounds to reach consensus
     - Final recommendation with clear justification
   - If there is no consensus:
     - All judges' final scores showing disagreements
     - Specific criteria where consensus wasn't reached
     - Analysis of why consensus couldn't be reached
     - Flag for human review
4. Command complete
**Step 7: Report No Consensus**
- Report persistent disagreements
- Provide all judge reports for human review
- Flag that automated evaluation couldn't reach consensus
### Phase 3: Consensus Report
If consensus achieved, synthesize the final report by working through each section methodically:
```markdown
# Consensus Evaluation Report
Let's compile the final consensus by analyzing each component systematically.
## Consensus Scores
First, let's consolidate all judges' final scores:
| Criterion | Judge 1 | Judge 2 | Judge 3 | Final |
|-----------|---------|---------|---------|-------|
| {Name} | {X}/5 | {X}/5 | {X}/5 | {X}/5 |
...
**Consensus Overall Score**: {avg}/5.0
## Consensus Strengths
[Review each judge's identified strengths and extract the common themes that all judges agreed upon]
## Consensus Weaknesses
[Review each judge's identified weaknesses and extract the common themes that all judges agreed upon]
## Debate Summary
Let's trace how consensus was reached:
- Rounds to consensus: {N}
- Initial disagreements: {list with specific criteria and score gaps}
- How resolved: {for each disagreement, explain what evidence or argument led to resolution}
## Final Recommendation
Based on the consensus scores and the key strengths/weaknesses identified:
{Pass/Fail/Needs Revision with clear justification tied to the evidence}
```

**Outputs:**
- `.specs/reports/` (created if it does not exist)
- `.specs/reports/{solution-name}-{date}.1.md`, `.specs/reports/{solution-name}-{date}.2.md`, `.specs/reports/{solution-name}-{date}.3.md`

**Example:**

`/judge-with-debate Implement REST API for user management --solution "src/api/users.ts"`
Phase 0.5 - Meta-Judge (assuming date 2025-01-15):
Phase 1 - Independent Analysis (3 judges receive specification):
- `.specs/reports/users-api-2025-01-15.1.md` - Judge 1 scores correctness 4/5, security 3/5
- `.specs/reports/users-api-2025-01-15.2.md` - Judge 2 scores correctness 4/5, security 5/5
- `.specs/reports/users-api-2025-01-15.3.md` - Judge 3 scores correctness 5/5, security 4/5

Disagreement detected: security scores range from 3 to 5, a 2-point spread that exceeds the 1-point consensus threshold
Phase 2 - Debate Round 1 (judges reference evaluation specification):
Debate Round 1 outputs:
Debate Round 2 (same evaluation specification):
Final consensus:
- Correctness: 4.3/5
- Design: 4.5/5
- Security: 4.0/5 (2 debate rounds to consensus)
- Performance: 4.7/5
- Documentation: 4.0/5
- Overall: 4.3/5 - PASS (here the unweighted mean, (4.3 + 4.5 + 4.0 + 4.7 + 4.0) / 5 = 4.3; the evaluation specification may define criterion weights)