Name: Agent Testing
Author: JNZader

Core Principle — The Agent Testing Pyramid

        /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\
       /    Evaluation     \     ← Expensive, comprehensive, nightly
      /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\
     /     Scenarios         \    ← Medium cost, workflow coverage, on PR
    /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\
   /      Unit Tests           \   ← Fast, focused, many, every push
  /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\

Layer	Tests	Cost	Speed	LLM Calls
Unit	Prompt rendering, output parsing, tool selection, schema validation	Low	Fast	None (mocked)
Scenario	Conversation replay, state machines, error recovery, handoffs	Medium	Moderate	Cheap model
Evaluation	Quality scoring, regression detection, A/B comparison, safety	High

Unit Testing Checklist

What to Test	How
Prompt rendering	Assert variables substituted, no unresolved `{placeholders}`
Output parsing	JSON extraction, markdown fence handling, malformed input
Template injection	Parametrized payloads (DAN, system override, role escape)
Tool selection	Query → tool mapping, max tools, no-tools for simple chat
Schema validation	Pydantic/Zod — valid, missing fields, out-of-range values
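The first two rows of the checklist above can be exercised without any LLM call. The following is a minimal sketch, assuming prompts use `{name}`-style placeholders and that model replies may wrap JSON in a markdown fence. The helper names (`render_prompt`, `unresolved_placeholders`, `extract_json`) are hypothetical and not taken from the original skill.

```python
import json
import re
from typing import Any

# Assumption: placeholders look like {identifier}.
PLACEHOLDER_RE = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")
FENCE = "`" * 3  # markdown code fence, built programmatically
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def render_prompt(template: str, **variables: str) -> str:
    """Substitute known placeholders; leave unknown ones in place."""
    def substitute(match: re.Match) -> str:
        name = match.group(0)[1:-1]
        return str(variables.get(name, match.group(0)))
    return PLACEHOLDER_RE.sub(substitute, template)


def unresolved_placeholders(rendered: str) -> list[str]:
    """Return every placeholder that survived rendering."""
    return PLACEHOLDER_RE.findall(rendered)


def extract_json(raw: str) -> Any:
    """Parse JSON from a reply, tolerating a surrounding markdown fence.

    Returns None for malformed input instead of raising.
    """
    match = FENCE_RE.search(raw)
    candidate = match.group(1) if match else raw
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None
```

A unit test would then assert that `unresolved_placeholders(render_prompt(...))` is empty for a fully specified prompt, and that `extract_json` returns `None` rather than crashing on malformed output.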

Scenario Testing Checklist

What to Test	How
Conversation replay	JSON fixtures with turns, tool call assertions, content checks
State machine	Transitions: IDLE → THINKING → TOOL_CALLING → RESPONDING → FALLBACK
Error recovery	Retry on failure, graceful degradation, invalid LLM output, context overflow
Agent handoffs	Multi-agent orchestration, data passing, rejection → rewrite loops
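The state machine row above could be checked by validating a recorded trace of states against a table of allowed transitions. This is a sketch only: the state names come from the table, but the `ALLOWED` transition table and the `first_illegal_transition` helper are assumptions made for illustration, since the source lists the states without specifying which moves are legal.

```python
from enum import Enum, auto


class AgentState(Enum):
    IDLE = auto()
    THINKING = auto()
    TOOL_CALLING = auto()
    RESPONDING = auto()
    FALLBACK = auto()


# Hypothetical transition table; adjust it to the real agent's design.
ALLOWED = {
    AgentState.IDLE: {AgentState.THINKING},
    AgentState.THINKING: {
        AgentState.TOOL_CALLING, AgentState.RESPONDING, AgentState.FALLBACK,
    },
    AgentState.TOOL_CALLING: {
        AgentState.THINKING, AgentState.RESPONDING, AgentState.FALLBACK,
    },
    AgentState.RESPONDING: {AgentState.IDLE, AgentState.FALLBACK},
    AgentState.FALLBACK: {AgentState.IDLE},
}


def first_illegal_transition(trace):
    """Return the first (src, dst) pair not allowed by ALLOWED, or None."""
    for src, dst in zip(trace, trace[1:]):
        if dst not in ALLOWED[src]:
            return (src, dst)
    return None
```

A scenario test can record the states an agent passes through during a replayed conversation and assert that `first_illegal_transition(trace)` is `None`.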

Evaluation (LLM-as-Judge)

What to Test	How
Quality	Judge scores responses on accuracy, clarity, completeness, relevance, safety
Regression	Compare against known-good baselines, detect >10% score drops
A/B testing	Compare prompt versions; the new version must win or tie in >= 60% of comparisons
Safety	Harmful request refusal, PII handling, role-breaking, benign edge cases
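The regression and A/B thresholds in the table above reduce to two small gates once judge scores are available. The sketch below assumes the 10% figure means a relative drop against a per-case baseline score, and that the 60% figure counts wins plus ties over all comparisons. Both readings are assumptions, as are the function names and the dictionary and list input shapes.

```python
def regressions(baseline: dict[str, float],
                current: dict[str, float],
                max_drop: float = 0.10) -> list[str]:
    """Return case ids whose score fell by more than max_drop (relative)."""
    flagged = []
    for case_id, base in baseline.items():
        score = current.get(case_id)
        if score is None or base <= 0:
            continue  # assumption: cases that cannot be compared are skipped
        if (base - score) / base > max_drop:
            flagged.append(case_id)
    return flagged


def ab_passes(outcomes: list[str], min_rate: float = 0.60) -> bool:
    """outcomes holds 'win', 'tie' or 'loss' from the new prompt's side."""
    if not outcomes:
        return False
    favourable = sum(1 for outcome in outcomes if outcome in ("win", "tie"))
    return favourable / len(outcomes) >= min_rate
```

A nightly job could fail the build when `regressions(...)` is non-empty or `ab_passes(...)` is false.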

CI/CD Integration

Job	Trigger	Model	Cost
Unit tests	Every push	None (mocked)	Free
Scenario tests	On PR (after unit pass)	gpt-4o-mini	~$2/PR
Evaluation suite	Nightly / manual	gpt-4o	~$50/night
Prompt change detection	On PR	None	Free
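The source does not say how the prompt change detection job works. One possible approach that needs no model call is to compare a hash of each prompt against a stored manifest. The sketch below assumes prompts are available as plain strings keyed by name, and the `fingerprint` and `changed_prompts` helpers are hypothetical.

```python
import hashlib


def fingerprint(prompt_text: str) -> str:
    """SHA-256 hex digest of a prompt's exact text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def changed_prompts(manifest: dict[str, str],
                    prompts: dict[str, str]) -> list[str]:
    """Names of prompts that are new or whose hash differs from the manifest."""
    return sorted(
        name for name, text in prompts.items()
        if manifest.get(name) != fingerprint(text)
    )
```

A PR check could use a non-empty result to decide that the scenario tests need to run against the edited prompts.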

Agent Testing

Core Principle — The Agent Testing Pyramid

Unit Testing Checklist

Scenario Testing Checklist

Evaluation (LLM-as-Judge)

Prompt Versioning

CI/CD Integration

Anti-Patterns

Project Structure

Commands
