Evaluate an agent's performance on a specific task or set of tasks, providing actionable feedback and improvement recommendations.
EvalKit is a conversational evaluation framework for AI agents that guides you through creating robust evaluations using the Strands Evals SDK. Through natural conversation, you can plan evaluations, generate test data, execute evaluations, and analyze results.
The agent path (e.g., ./chatbot-agent, /path/to/my-agent)

Constraints for parameter acquisition:
When a user requests evaluation (any phase), first validate the environment:
Folder Structure:
All evaluation artifacts MUST be created in the eval/ folder at the same level as the target agent folder:
<agent-evaluation-project>/        # Example name - can be any name for user's evaluation project
├── <target-agent-folder>/         # Example name - this is the agent you are evaluating
│   └── [agent source code]        # Existing agent code
└── eval/                          # All evaluation files go here (sibling to target-agent-folder)
    ├── eval-plan.md
    ├── test-cases.jsonl
    ├── results/
    ├── run_evaluation.py
    ├── eval-report.md
    └── README.md
Note:
- The eval/ folder is a sibling directory to the user's agent folder, not nested inside it
- agent-evaluation-project and target-agent-folder are placeholder names - the user may use any names that fit their project

Constraints:
When to Trigger: User requests evaluation planning or mentions creating/designing an evaluation
User Intent Recognition:
Execution Flow:
Parse user request: Extract agent path, evaluation focus, and specific requirements from natural language
Navigate to evaluation project directory:
cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
Create evaluation directory structure:
mkdir -p eval
Follow this execution flow:
Write the complete evaluation plan to eval/eval-plan.md using the template structure (see Appendix A: Evaluation Plan Template), replacing placeholders with concrete details derived from the analysis while preserving section order and headings.
Report completion with evaluation plan file path, and suggest next step: "Would you like me to generate test cases based on this plan?"
High-Level Design (What & Why):
Low-Level Implementation (How):
Evaluation metrics must be:
Key Principles:
eval/ directory structure

Examples of reasonable defaults:
Constraints:
When to Trigger: User requests test case generation or mentions creating test data
User Intent Recognition:
Execution Flow:
Parse user request: Extract any specific requirements (e.g., "focus on edge cases", "10 test cases")
Navigate to evaluation project directory:
cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
Load the current evaluation plan (eval/eval-plan.md) to understand evaluation areas and test data requirements.
Follow this execution flow:
Write the generated test cases to eval/test-cases.jsonl
Report completion with test case count, coverage summary, and suggest next step: "Would you like me to run the evaluation with these test cases?"
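As a sketch of what the generated file might contain, the snippet below writes JSON Lines test cases. The field names (id, input, expected_behavior, category) are illustrative assumptions, not a fixed schema; use whatever structure the evaluation plan specifies:

```python
import json
from pathlib import Path

# Hypothetical test-case schema; field names are illustrative only --
# follow whatever structure eval/eval-plan.md actually defines.
test_cases = [
    {"id": "tc-001", "input": "How do I reset my password?",
     "expected_behavior": "Explains the reset flow", "category": "basic"},
    {"id": "tc-002", "input": "Cancel order #12345",
     "expected_behavior": "Calls the order-cancellation tool", "category": "tool_call"},
    {"id": "tc-003", "input": "",
     "expected_behavior": "Asks for clarification", "category": "edge_case"},
]

out = Path("eval/test-cases.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w", encoding="utf-8") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")  # one JSON object per line
```

Each line is an independent JSON object, which is what makes the file valid JSONL rather than a single JSON array.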
Constraints:
When to Trigger: User requests evaluation execution or mentions running tests
User Intent Recognition:
Execution Flow:
Parse user request: Extract any specific requirements (e.g., "run on subset", "verbose output")
Navigate to evaluation project directory:
cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
Load the current evaluation plan (eval/eval-plan.md) to understand evaluation requirements and agent architecture.
Follow this execution flow:
- Update requirements.txt at the repository root, adding Strands Evals SDK dependencies
- Use uv to create a virtual environment, activate it, and install requirements.txt
- Implement eval/run_evaluation.py using Strands Evals SDK patterns with Case objects, the Experiment class, and appropriate evaluators
- Write evaluation outputs to the eval/results/ directory
- Create eval/README.md with running instructions for users
- Report completion with an evaluation results summary and suggest the next step: "Would you like me to analyze these results and provide recommendations?"
CRITICAL - Always Create a Minimal Working Version: Implement the most basic version that works first.
CRITICAL REQUIREMENT - Getting Latest Documentation: Before implementing evaluation code, you MUST retrieve the latest Strands Evals SDK documentation and API usage examples. This is NOT optional. You MUST NOT proceed with implementation without either context7 access or the source code. This ensures you're using the most current patterns and avoiding deprecated APIs.
Step 1: Check Context7 MCP Availability: First, check if context7 MCP server is available by attempting to use it. If you receive an error indicating context7 is not available, proceed to Step 3.
Step 2: Primary Method - Using Context7 (If Available):
Step 3: Fallback Method - REQUIRED If Context7 Is Not Available: If context7 MCP is not installed or doesn't have Strands Evals SDK documentation, you MUST STOP and prompt the user to take one of these actions:
REQUIRED USER ACTION - Choose ONE of the following:
Option 1: Install Context7 MCP Server (Recommended)
Please install the context7 MCP server in your coding assistant to access the latest Strands Evals SDK documentation. Installation steps vary by assistant:
Package: @upstash/context7-mcp
Note: If you're unsure how to install MCP servers in your coding assistant, please consult your assistant's support resources or choose Option 2 below (clone source code).
After installation, you'll be able to query: "Get documentation for strands-agents-evals focusing on Case, Experiment, and Evaluator classes"
Option 2: Clone Strands Evals SDK Source Code
If you cannot install context7 MCP or prefer to work with source code directly:
cd <your-evaluation-project>
git clone https://github.com/strands-agents/evals strands-agents-evals-source
IMPORTANT: You MUST NOT proceed with implementation until the user has completed one of these options. Do NOT attempt to implement evaluation code using only the reference examples in Appendix C, as they may be outdated.
After the user confirms they've completed one of the above options:
If Context7 was installed:
If source code was cloned:
Review the cloned source to learn current API patterns:
- strands-agents-evals-source/src/strands_evals/
- strands-agents-evals-source/examples/

Core Components:
Check Existing Requirements: Verify requirements.txt exists in repository root
# Check if requirements.txt exists
ls requirements.txt
Add Strands Evals SDK Dependencies: Update existing requirements.txt with Strands evaluation dependencies
# Add Strands Evals SDK and related dependencies
grep -q "strands-agents-evals" requirements.txt || echo "strands-agents-evals" >> requirements.txt
# Add other evaluation-specific dependencies as needed based on evaluation plan
Installation: Use uv for dependency management
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
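After installation, a quick import check can confirm the dependency resolved. The module name `strands_evals` is assumed here from the cloned source layout (src/strands_evals/) and may differ; adjust it to whatever the installed package actually exposes:

```python
import importlib.util

# Module names to verify; "strands_evals" is assumed from the source
# layout (src/strands_evals/) -- adjust if the SDK uses another name.
for mod in ("strands_evals",):
    spec = importlib.util.find_spec(mod)
    print(f"{mod}: {'installed' if spec else 'MISSING'}")
```

Run this inside the activated virtual environment; a MISSING result means the requirements.txt install step did not take effect there.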
Constraints:
When to Trigger: User requests results analysis or mentions generating a report
User Intent Recognition:
Execution Flow:
Parse user request: Extract any specific analysis focus (e.g., "focus on failures", "prioritize critical issues")
Navigate to evaluation project directory:
cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
Load and analyze the evaluation results from eval/results/
Follow this execution flow:
Results Analysis Process:
a. Data Validation: Ensure results are from real execution:
b. Results Analysis: Analyze evaluation outcomes:
c. Insights Generation: Identify key findings:
Improvement Recommendations: Generate specific, actionable recommendations:
a. Prioritized Recommendations: Based on evaluation findings:
Critical Issues (Immediate attention required)
Quality Improvements (Medium-term enhancements)
Enhancement Opportunities (Future improvements)
b. Evidence-Based Recommendations: All recommendations must cite specific data:
Advisory Report Generation: Create focused report using the template structure (see Appendix B: Evaluation Report Template) with:
IMPORTANT: Follow all HTML comment instructions (<!-- ACTION REQUIRED: ... -->) in the template when generating content, then remove these comment instructions from the final report - they are template guidance only and should not appear in the generated report.
Report completion with key findings and ask: "Would you like me to help implement any of these recommendations?"
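As an illustrative sketch of the analysis step, loading result files and computing a pass rate might look like the following. The record schema (a list of objects with a boolean "passed" field) is a hypothetical assumption; adapt it to whatever run_evaluation.py actually serializes:

```python
import json
from pathlib import Path

def summarize_results(results_dir: str) -> dict:
    """Aggregate pass/fail counts across all JSON result files.

    Assumes each file holds a list of records with a boolean "passed"
    field -- a hypothetical schema; adapt to the real output format.
    """
    records = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        records.extend(json.loads(path.read_text(encoding="utf-8")))
    total = len(records)
    passed = sum(1 for r in records if r.get("passed"))
    return {"total": total, "passed": passed,
            "pass_rate": passed / total if total else 0.0}
```

Pointing this at eval/results/ walks all timestamped subdirectories, which is why `rglob` is used rather than a flat listing.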
Always check for these indicators of simulated results:
Good Recommendations:
Poor Recommendations:
Ensure your advisory report:
Evaluation Report Template: See Appendix B: Evaluation Report Template
Constraints:
Finalize the evaluation and prepare deliverables.
Constraints:
<your-evaluation-project>/         # Your chosen project name
├── <your-agent-folder>/           # Your chosen agent folder name
│   └── [agent source code]
└── eval/
    ├── eval-plan.md
    ├── test-cases.jsonl
    ├── results/
    ├── run_evaluation.py
    ├── eval-report.md
    └── README.md
agent_path: "./chatbot-agent"
evaluation_focus: "response quality and tool calling accuracy"
Complete Evaluation Flow:
Phase 1 - Planning:
User: "I need to evaluate my customer support chatbot at ./chatbot-agent. Focus on response quality and tool calling accuracy."
Assistant: "I'll create an evaluation plan for your customer support chatbot..."
[Creates eval/eval-plan.md with 2 key metrics and 3 test scenarios]
Phase 2 - Data Generation:
User: "Yes, generate 5 test cases"
Assistant: "I'll generate 5 test cases covering the scenarios..."
[Creates eval/test-cases.jsonl with 2 basic queries, 2 tool-calling scenarios, 1 edge case]
Phase 3 - Evaluation Execution:
User: "Run the evaluation"
Assistant: "I'll implement and execute the evaluation using Strands Evals SDK..."
[Creates eval/run_evaluation.py, runs evaluation]
Results: Overall success rate: 80%, Response Quality: 4.2/5, Tool Call Accuracy: 75%
Phase 4 - Analysis:
User: "Yes, analyze the results"
Assistant: "I'll analyze the evaluation results and generate recommendations..."
[Creates eval/eval-report.md]
Key findings: Strong performance on basic queries (100% success), Tool calling needs improvement (25% failure rate)
User: "Create an evaluation plan for my agent at ./my-agent"
Assistant: [Creates initial plan in eval/eval-plan.md]
User: "Add more focus on error handling"
Assistant: "I'll update the evaluation plan to include error handling metrics..."
[Updates eval/eval-plan.md]
User: "Generate test cases with more edge cases"
Assistant: "I'll generate test cases with additional edge case coverage..."
[Updates eval/test-cases.jsonl]
After running all phases, your agent repository will have the following structure:
<your-evaluation-project>/         # Your chosen project name (e.g., my-chatbot-eval)
├── <your-agent-folder>/           # Your chosen agent folder name (e.g., chatbot-agent)
│   └── [existing agent files]
└── eval/                          # All evaluation files (sibling to agent folder)
    ├── eval-plan.md               # Complete evaluation specification and plan
    ├── test-cases.jsonl           # Generated test scenarios
    ├── README.md                  # Running instructions and usage examples
    ├── run_evaluation.py          # Strands Evals SDK evaluation implementation
    ├── results/                   # Evaluation outputs
    │   └── [timestamp]/           # Timestamped evaluation results
    └── eval-report.md             # Analysis and recommendations
Note:
Folder names (<your-evaluation-project>, <your-agent-folder>) are placeholders - use any names that fit your project

EvalKit automatically manages phase dependencies:
If a user requests a phase without prerequisites:
Example: User says "run the evaluation" but no test cases exist
Response: "I don't see any test cases yet. Would you like me to:
After completing each phase, suggest the logical next step:
Issue: User requests evaluation but no agent path provided
Issue: Evaluation plan doesn't exist when user requests test generation
Issue: Test cases don't exist when user requests evaluation
Issue: Test data generation fails
Validate the file with python -m json.tool --json-lines < eval/test-cases.jsonl (Python 3.8+); plain python -m json.tool expects a single JSON document and fails on multi-line JSONL
Issue: Evaluation implementation fails with Strands Evals SDK errors
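Because `python -m json.tool` (without flags) parses a single JSON document, a JSONL file needs per-line validation to pinpoint the bad record. A minimal check:

```python
import json
from pathlib import Path

def validate_jsonl(path: str) -> list[str]:
    """Return error messages for lines that fail to parse as JSON."""
    errors = []
    for i, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {i}: {exc}")
    return errors
```

Calling `validate_jsonl("eval/test-cases.jsonl")` returns an empty list when the file is valid; otherwise each message names the offending line number.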
Issue: Import errors for evaluation dependencies
Activate the virtual environment (source .venv/bin/activate) and reinstall dependencies (uv pip install -r requirements.txt)
Issue: Agent execution fails during evaluation
Issue: User is unsure what to do next
The following template is used for creating eval-plan.md:
# Evaluation Plan for [AGENT NAME]
## 1. Evaluation Requirements
<!--
ACTION REQUIRED: User input and interpreted evaluation requirements. Defaults to 1-2 key metrics if unspecified.
-->
- **User Input:** `"$ARGUMENTS"` or "No Input"
- **Interpreted Evaluation Requirements:** [Parsed from user input - highest priority]
---
## 2. Agent Analysis
| **Attribute** | **Details** |
| :-------------------- | :---------------------------------------------------------- |
| **Agent Name** | [Agent name] |
| **Purpose** | [Primary purpose and use case in 1-2 sentences] |
| **Core Capabilities** | [Key functionalities the agent provides] |
| **Input** | [Short description, Data types, schemas] |
| **Output** | [Short description, Response types, schemas] |
| **Agent Framework** | [e.g., CrewAI, LangGraph, AutoGen, Custom/None] |
| **Technology Stack** | [Programming language, frameworks, libraries, dependencies] |
**Agent Architecture Diagram:**
[Mermaid diagram illustrating:
- Agent components and their relationships
- Data flow between components
- External integrations (APIs, databases, tools)
- User interaction points]
**Key Components:**
- **[Component Name 1]:** [Brief description of purpose and functionality]
- **[Component Name 2]:** [Brief description of purpose and functionality]
- [Additional components as needed]
**Available Tools:**
- **[Tool Name 1]:** [Purpose and usage]
- **[Tool Name 2]:** [Purpose and usage]
- [Additional tools as needed]
**Observability Status**
- **Tracing Framework** [Fully/Partially/Not Instrumented, Framework name, version]
- **Custom Attributes** [Yes/No, Key custom attributes if present]
---
## 3. Evaluation Metrics
<!--
ACTION REQUIRED: If no specific user requirements are provided, use a minimal number of metrics (1-2 metrics) focusing on the most critical aspects of agent performance.
-->
### [Metric Name 1]
- **Evaluation Area:** [Final response quality/tool call accuracy/...]
- **Description:** [What is measured and why]
- **Method:** [Code-based | LLM-as-Judge ]
### [Metric Name 2]
[Repeat for each metric]
---
## 4. Test Data Generation
<!--
ACTION REQUIRED: Keep scenarios minimal and focused. Do not propose more than 3 scenarios.
-->
- **[Test Scenario 1]**: [Description and purpose, complexity]
- **[Test scenario 2]**: [Description and purpose, complexity]
- **Total number of test cases**: [SHOULD NOT exceed 3]
---
## 5. Evaluation Implementation Design
### 5.1 Evaluation Code Structure
<!--
ACTION REQUIRED: The code structure below will be adjusted based on your evaluation requirements and existing agent codebase. This is the recommended starting structure. Only adjust it if necessary.
-->
./                                 # Repository root directory
├── requirements.txt               # Consolidated dependencies
├── .venv/                         # Python virtual environment (created by uv)
│
└── eval/                          # Evaluation workspace
    ├── README.md                  # Running instructions and usage examples (always present)
    ├── run_evaluation.py          # Strands Evals SDK evaluation implementation (always present)
    ├── results/                   # Evaluation outputs (always present)
    ├── eval-plan.md               # This evaluation specification and plan (always present)
    └── test-cases.jsonl           # Generated test cases (from evalkit.data)
### 5.2 Recommended Evaluation Technical Stack
| **Component** | **Selection** |
| :----------------------- | :------------------------------------------------------------ |
| **Language/Version** | [e.g., Python 3.11, Node.js 18+] |
| **Evaluation Framework** | [Strands Evals SDK (default)] |
| **Evaluators** | [OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator] |
| **Agent Integration** | [e.g., Direct import, API] |
| **Results Storage** | [e.g., JSON files (default)] |
---
## 6. Progress Tracking
### 6.1 User Requirements Log
| **Timestamp** | **Phase** | **Requirement** |
| :----------------- | :-------- | :------------------------------------------------------------------- |
| [YYYY-MM-DD HH:MM] | Planning | [User input from $ARGUMENTS, or "No specific requirements provided"] |
### 6.2 Evaluation Progress
| **Timestamp** | **Component** | **Status** | **Notes** |
| :----------------- | :--------------- | :------------------------------ | :--------------------------------------------- |
| [YYYY-MM-DD HH:MM] | [Component name] | [In Progress/Completed/Blocked] | [Technical details, blockers, or achievements] |
The following template is used for creating eval-report.md:
# Agent Evaluation Report for [AGENT NAME]
## Executive Summary
<!--
ACTION REQUIRED: Provide high-level evaluation results and key findings. Focus on actionable insights for stakeholders.
-->
- **Test Scale**: [N] test cases
- **Success Rate**: [XX.X%]
- **Status**: [Excellent/Good/Poor]
- **Strengths**: [Specific capability or metric] [Performance highlight] [Reliability aspect]
- **Critical Issues**: [Blocking issue + impact] [Performance bottleneck] [Safety/compliance concern]
- **Action Priority**: [Critical fixes] [Improvements] [Enhancements]
---
## Evaluation Results
### Test Case Coverage
<!--
ACTION REQUIRED: List all test scenarios that were evaluated, providing context for the results.
-->
- **[Test Scenario 1]**: [Description and coverage]
- **[Test Scenario 2]**: [Description and coverage]
- [Additional scenarios as needed]
### Results
| **Metric** | **Score** | **Target** | **Status** |
| :-------------- | :-------- | :--------- | :---------- |
| [Metric Name 1] | [XX.X%] | [XX%] | [Pass/Fail] |
| [Metric Name 2] | [X.X/5] | [4.0+] | [Pass/Fail] |
| [Metric Name 3] | [XX.X%] | [95%+] | [Pass/Fail] |
### Results Summary
[Brief description of overall performance and findings across metrics]
---
## Agent Success Analysis
<!--
ACTION REQUIRED: Focus on what the agent does well. Provide specific evidence and contributing factors for successful performance.
-->
### Strengths
- **[Strength Name 1]**: [What the agent does exceptionally well]
- **Evidence**: [Specific metrics and examples]
- **Contributing Factors**: [Why this works well]
- **[Strength Name 2]**: [What the agent does exceptionally well]
- **Evidence**: [Specific metrics and examples]
- **Contributing Factors**: [Why this works well]
[Repeat pattern for additional strengths]
### High-Performing Scenarios
- **[Scenario Type 1]**: [Category of tasks where agent excels]
- **Key Characteristics**: [What makes these scenarios successful]
- **[Scenario Type 2]**: [Category of tasks where agent excels]
- **Key Characteristics**: [What makes these scenarios successful]
[Repeat pattern for additional scenarios]
---
## Agent Failure Analysis
<!--
ACTION REQUIRED: Analyze failures systematically. Provide root cause analysis and specific improvement recommendations with expected impact.
-->
### Issue 1 - [Priority Level]
- **Issue**: [Clear problem statement with evaluation metrics]
- **Root Cause**: [Technical analysis of why this occurred — path/to/file.py:START-END]
- **Evidence**: [Specific data points from results]
- **Impact**: [Effect on overall performance]
- **Priority Fixes**:
- P1 — [Fix name]: [One-line solution] → Expected gain: [Metric +X]
- P2 — [Fix name]: [One-line solution] → Expected gain: [Metric improvement]
### Issue 2 - [Priority Level]
[Repeat structure for additional issues]
---
## Action Items & Recommendations
<!--
ACTION REQUIRED: Provide specific, implementable tasks with clear steps. Prioritize by impact and effort required.
-->
### [Item Name] - Priority [Number] ([Critical/Enhancement])
- **Description**: [Description of this item]
- **Actions**:
- [ ] [Specific task with implementation steps]
- [ ] [Specific task with implementation steps]
- [ ] [Additional tasks as needed]
### [Additional Item Name] - Priority [Number] ([Critical/Enhancement])
[Repeat structure for additional action items]
---
## Artifacts & Reproduction
### Reference Materials
- **Agent Code**: [Path to agent implementation]
- **Test Cases**: [Path to test cases]
- **Traces**: [Path to traces]
- **Results**: [Path to results files]
- **Evaluation Code**: [Path to evaluation implementation]
---
## Evaluation Limitations and Improvement
<!--
ACTION REQUIRED: Identify limitations in the current evaluation approach and suggest improvements for future iterations.
-->
### Test Data Improvement
- **Current Limitations**: [Evaluation scope limitations]
- **Recommended Improvements**: [Specific suggestions for test data enhancement]
### Evaluation Code Enhancement
- **Current Limitations**: [Limitations of evaluation implementation and metrics]
- **Recommended Improvements**: [Specific suggestions for evaluation code improvement]
### [Additional Improvement Area]
[Repeat structure for other evaluation improvement areas]