How to evaluate research samples using structured JSON output from claude -p. Covers criteria writing, the core judging pattern, and practical examples.
Evaluate samples by passing criteria + sample text to claude -p with structured JSON output. One call per sample, no agent file, no tools — just a clean prompt and a schema.
Each judgment is a single claude -p call:
```bash
echo "$SAMPLE_TEXT" | env -u CLAUDECODE -u ANTHROPIC_API_KEY \
  claude -p --model haiku \
    --setting-sources local \
    --no-session-persistence \
    --tools "" \
    --strict-mcp-config \
    --system-prompt-file judging/criteria.md \
    --output-format json \
    --json-schema "$(cat judging/schema.json)" \
  | jq '.structured_output'
```
**Impartiality:** The judge must be blind to experimental conditions. Never pass metadata like condition labels, steering coefficients, group assignments, or any information about which experimental arm a sample belongs to. The judge should receive only what it needs to score: the text to evaluate and (if needed) what traits to look for. Leaking experimental metadata biases scores toward expected outcomes.
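As a concrete sketch, assuming a hypothetical sample record with `condition` and `steering_coef` fields (adapt the names to your data), only the text should ever reach the judge:

```python
def blind(record: dict) -> str:
    """Pass only the text to the judge; drop every experimental field."""
    return record["text"]  # never record["condition"] or record["steering_coef"]

record = {
    "id": "s1",
    "condition": "steered",   # must NOT reach the judge
    "steering_coef": 4.0,     # must NOT reach the judge
    "text": "Here are three options you could try.",
}
print(blind(record))
```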
Two files:

```
judging/
  criteria.md   # Evaluation criteria (passed as system prompt)
  schema.json   # JSON schema for structured output
```
### criteria.md

Natural-language instructions for the judge. This is the only place the model learns what your scores mean — the schema enforces structure, but the model doesn't see constraints like min/max values. Include:

- The valid range and the meaning of each score band
- Expected formats for any qualitative fields
Example:

```markdown
# Judging Criteria

You are evaluating AI assistant responses for quality.

## Scores

### helpfulness (0-10)
- **0-2**: Irrelevant, dismissive, or harmful. Doesn't address the question.
- **3-4**: Partially related but missing key information or giving incorrect guidance.
- **5-6**: Addresses the question but incomplete, generic, or requires significant follow-up.
- **7-8**: Solid answer that covers the main points with actionable information.
- **9-10**: Comprehensive, actionable, anticipates follow-up needs. Goes beyond the minimum.

### clarity (0-10)
- **0-2**: Incoherent, contradictory, or impossible to follow.
- **3-4**: Understandable but disorganized, excessive jargon, or buries the answer.
- **5-6**: Gets the point across but verbose or poorly structured.
- **7-8**: Clear and well-organized. Easy to follow.
- **9-10**: Concise, logical flow, states the answer then explains why.

## Qualitative
- **summary**: One sentence describing the response style
- **red_flags**: List any concerning patterns, or "none"
```
### schema.json

A JSON Schema matching the criteria. Guarantees validated output.
Example (matching the criteria above):

```json
{
  "type": "object",
  "properties": {
    "scores": {
      "type": "object",
      "properties": {
        "helpfulness": { "type": "integer", "minimum": 0, "maximum": 10 },
        "clarity": { "type": "integer", "minimum": 0, "maximum": 10 }
      },
      "required": ["helpfulness", "clarity"]
    },
    "qualitative": {
      "type": "object",
      "properties": {
        "summary": { "type": "string" },
        "red_flags": { "type": "string" }
      },
      "required": ["summary", "red_flags"]
    }
  },
  "required": ["scores", "qualitative"]
}
```
## How `--json-schema` Works

The `--json-schema` flag does not inject the raw schema into the model's context. Instead, the CLI:
1. Defines an internal `StructuredOutput` tool whose parameters match your schema
2. Forces the model to call that tool
3. Returns the validated arguments as `structured_output` in the result envelope

**Implication:** The model sees field names and types (from the tool definition), but does not see `minimum`/`maximum` constraints, `enum` values, or `description` annotations from your schema. Those are enforced at validation time only — if the model outputs an out-of-range value, the call fails rather than the model self-correcting.
This is why the criteria file must fully describe your rubric — valid ranges, score meanings, expected formats. Don't rely on schema constraints to guide the model's reasoning.
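To make this concrete, here is a dependency-free sketch of the min/max check that happens only at validation time; the model itself never sees these bounds (field names follow the example schema above):

```python
import json

schema = json.loads("""
{
  "type": "object",
  "properties": {
    "scores": {
      "type": "object",
      "properties": {
        "helpfulness": { "type": "integer", "minimum": 0, "maximum": 10 },
        "clarity": { "type": "integer", "minimum": 0, "maximum": 10 }
      }
    }
  }
}
""")

def out_of_range(judgment: dict) -> list[str]:
    """Flag score values outside the schema bounds the model cannot see."""
    errors = []
    for name, spec in schema["properties"]["scores"]["properties"].items():
        value = judgment["scores"][name]
        if not spec["minimum"] <= value <= spec["maximum"]:
            errors.append(f"{name}={value} not in [{spec['minimum']}, {spec['maximum']}]")
    return errors

print(out_of_range({"scores": {"helpfulness": 15, "clarity": 7}}))
```

A judgment like `{"helpfulness": 15}` is well-typed from the model's point of view but fails at this stage, which is exactly why the rubric text must restate the ranges.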
When testing claude -p calls directly from the Bash tool, pipe output to a file — the Bash tool doesn't reliably capture claude -p output otherwise:
```bash
echo "$SAMPLE" | env -u CLAUDECODE -u ANTHROPIC_API_KEY \
  claude -p --model haiku ... \
  > /ephemeral/c.dumas/judge_output.txt 2>&1
cat /ephemeral/c.dumas/judge_output.txt | jq '.structured_output'
```
From Python subprocess.run(capture_output=True), stdout works normally — no piping needed.
| Flag | Purpose |
|---|---|
| `env -u CLAUDECODE -u ANTHROPIC_API_KEY` | Required when calling from inside Claude Code. `CLAUDECODE` blocks nested sessions; unsetting `ANTHROPIC_API_KEY` uses plan credentials (higher rate limits). |
| `--setting-sources local` | Suppresses loading of `~/.claude/CLAUDE.md`. Without this, global instructions bias the judge. |
| `--no-session-persistence` | Don't persist sessions to disk. Without this, each call creates a session entry that floods your project. |
| `--tools ""` | Removes all tools. The judge only needs to produce structured output, no tool use. |
| `--strict-mcp-config` | Disables MCP servers. Prevents any configured MCPs from loading. |
| `--system-prompt-file` | Replaces the default system prompt with your criteria file. |
| `--output-format json` | Returns a structured JSON envelope (use with `--json-schema`). |
| `--json-schema` | Structures output via forced tool use and validates against your schema. The model sees field names/types but not constraints like `minimum`/`maximum`. Extract the result with `jq '.structured_output'`. |
| `--model haiku` | Fast/cheap default. Use `sonnet` for nuanced judging. |
Always run judgments in parallel. Each claude -p call is independent and takes 10-30s. Running 100 samples sequentially = 30+ minutes. Running 30-wide parallel = ~2 minutes. Cap at 30 concurrent to avoid rate limits.
```bash
mkdir -p judgments

for sample in samples/*.txt; do
  cat "$sample" | env -u CLAUDECODE -u ANTHROPIC_API_KEY \
    claude -p --model haiku \
      --setting-sources local \
      --no-session-persistence \
      --tools "" \
      --strict-mcp-config \
      --system-prompt-file judging/criteria.md \
      --output-format json \
      --json-schema "$(cat judging/schema.json)" \
    | jq '.structured_output' > "judgments/$(basename "$sample" .txt).json" &

  # Rate limit: max 30 concurrent
  [ "$(jobs -r | wc -l)" -ge 30 ] && wait -n
done
wait
```
The Python runner below includes a 60s timeout per call and exponential-backoff retry (5 attempts). `claude -p` calls can hang indefinitely — without a timeout, a single stuck call will block a thread forever and your job will never finish.
```python
import json
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

MAX_CONCURRENT = 30
TIMEOUT_S = 60
MAX_RETRIES = 5

# Resolve env once at startup
ENV = {
    "PATH": subprocess.check_output(["bash", "-c", "echo $PATH"], text=True).strip(),
    "HOME": str(Path.home()),
}

def judge_one(sample_path: Path, criteria: str, schema: str) -> tuple[Path, dict | None]:
    """Judge a single sample with timeout and exponential backoff retry."""
    text = sample_path.read_text()
    for attempt in range(MAX_RETRIES):
        try:
            result = subprocess.run(
                [
                    "claude", "-p", "--model", "haiku",
                    "--setting-sources", "local",
                    "--no-session-persistence",
                    "--tools", "",
                    "--strict-mcp-config",
                    "--system-prompt-file", criteria,
                    "--output-format", "json",
                    "--json-schema", schema,
                ],
                input=text,
                capture_output=True, text=True,
                timeout=TIMEOUT_S,
                env=ENV,
            )
            if result.returncode != 0:
                raise RuntimeError(f"exit {result.returncode}: {result.stderr[:200]}")
            envelope = json.loads(result.stdout)
            return sample_path, envelope.get("structured_output", envelope)
        except (subprocess.TimeoutExpired, RuntimeError, json.JSONDecodeError) as e:
            wait = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            print(f"  RETRY {attempt+1}/{MAX_RETRIES} ({e.__class__.__name__}) {sample_path.name}, waiting {wait}s")
            time.sleep(wait)
    print(f"  FAIL (all retries exhausted): {sample_path.name}")
    return sample_path, None

samples = sorted(Path("samples").glob("*.txt"))
schema_str = Path("judging/schema.json").read_text()

# Skip already-completed judgments
Path("judgments").mkdir(exist_ok=True)
done = {p.stem for p in Path("judgments").glob("*.json")}
remaining = [s for s in samples if s.stem not in done]

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    futures = {
        pool.submit(judge_one, s, "judging/criteria.md", schema_str): s
        for s in remaining
    }
    ok, fail = 0, 0
    for future in as_completed(futures):
        sample_path, judgment = future.result()
        if judgment:
            out = Path("judgments") / f"{sample_path.stem}.json"
            out.write_text(json.dumps(judgment, indent=2))
            print(f"OK {sample_path.name}: {judgment['scores']}")
            ok += 1
        else:
            fail += 1

print(f"\nDone: {ok} ok, {fail} failed, {len(done)} skipped (already done)")
```
Writing each judgment to a separate file means results are saved as they come in — if a run fails halfway through, you keep everything that completed. On resume, skip files that already exist in judgments/.
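Once judgments are on disk, aggregation is a few lines. A sketch assuming the example schema's `scores` fields:

```python
import json
from pathlib import Path
from statistics import mean

def summarize(judgments: list[dict]) -> dict[str, float]:
    """Mean of each score field across completed judgments."""
    fields = judgments[0]["scores"].keys()
    return {f: mean(j["scores"][f] for j in judgments) for f in fields}

# Usage: works on partial runs too, since each judgment is its own file.
# judgments = [json.loads(p.read_text()) for p in Path("judgments").glob("*.json")]
# print(summarize(judgments))
```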
When samples contain metadata (e.g. JSON with an id, condition, text field), write a small script that extracts the text and passes it to the CLI call.
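A minimal sketch of such a script, assuming hypothetical `id` and `text` fields in a JSONL file; it writes one blind `.txt` per record for the loop above, leaving all metadata behind:

```python
import json
from pathlib import Path

def extract(jsonl_path: Path, out_dir: Path) -> int:
    """Write each record's text to out_dir/<id>.txt; condition labels never leave the JSONL."""
    out_dir.mkdir(exist_ok=True)
    n = 0
    for line in jsonl_path.read_text().splitlines():
        record = json.loads(line)
        (out_dir / f"{record['id']}.txt").write_text(record["text"])
        n += 1
    return n

# Usage: extract(Path("data/samples.jsonl"), Path("samples"))
```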
Never scale before validating your rubric.
This catches: