Evaluate an agent's performance on a specific task or set of tasks, providing actionable feedback and improvement recommendations.
EvalKit is a conversational evaluation framework for AI agents that guides you through creating robust evaluations using the Strands Evals SDK. Through natural conversation, you can plan evaluations, generate test data, execute evaluations, and analyze results.
The agent path (e.g., ./chatbot-agent, /path/to/my-agent)

Constraints for parameter acquisition:
When a user requests evaluation (any phase), first validate the environment:
Folder Structure:
All evaluation artifacts MUST be created in the eval/ folder at the same level as the target agent folder:
<agent-evaluation-project>/        # Example name - can be any name for user's evaluation project
├── <target-agent-folder>/         # Example name - this is the agent you are evaluating
│   └── [agent source code]        # Existing agent code
└── eval/                          # All evaluation files go here (sibling to target-agent-folder)
    ├── eval-plan.md
    ├── test-cases.jsonl
    ├── results/
    ├── run_evaluation.py
    ├── eval-report.md
    └── README.md
Note:
- The eval/ folder is a sibling directory to the user's agent folder, not nested inside it
- agent-evaluation-project and target-agent-folder are placeholder names - the user may use any names that fit their project

Constraints:
When to Trigger: User requests evaluation planning or mentions creating/designing an evaluation
User Intent Recognition:
Execution Flow:
Parse user request: Extract agent path, evaluation focus, and specific requirements from natural language
Navigate to evaluation project directory:
cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
Create evaluation directory structure:
mkdir -p eval
Follow this execution flow:
Write the complete evaluation plan to eval/eval-plan.md using the template structure (see Appendix A: Evaluation Plan Template), replacing placeholders with concrete details derived from the analysis while preserving section order and headings.
Report completion with evaluation plan file path, and suggest next step: "Would you like me to generate test cases based on this plan?"
High-Level Design (What & Why):
Low-Level Implementation (How):
Evaluation metrics must be:
Key Principles:
eval/ directory structure

Examples of reasonable defaults:
Constraints:
When to Trigger: User requests test case generation or mentions creating test data
User Intent Recognition:
Execution Flow:
Parse user request: Extract any specific requirements (e.g., "focus on edge cases", "10 test cases")
Navigate to evaluation project directory:
cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
Load the current evaluation plan (eval/eval-plan.md) to understand evaluation areas and test data requirements.
Follow this execution flow:
Write the generated test cases to eval/test-cases.jsonl
Report completion with test case count, coverage summary, and suggest next step: "Would you like me to run the evaluation with these test cases?"
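As a sketch of what the generated file might contain, the snippet below writes JSON Lines test cases. The field names (id, input, expected_behavior, category) are illustrative assumptions, not a fixed schema; use whatever structure the evaluation plan specifies:

```python
import json
from pathlib import Path

# Hypothetical test-case schema; field names are illustrative only --
# follow whatever structure eval/eval-plan.md actually defines.
test_cases = [
    {"id": "tc-001", "input": "How do I reset my password?",
     "expected_behavior": "Explains the reset flow", "category": "basic"},
    {"id": "tc-002", "input": "Cancel order #12345",
     "expected_behavior": "Calls the order-cancellation tool", "category": "tool_call"},
    {"id": "tc-003", "input": "",
     "expected_behavior": "Asks for clarification", "category": "edge_case"},
]

out = Path("eval/test-cases.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w", encoding="utf-8") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")  # one JSON object per line
```

Each line is an independent JSON object, which is what makes the file valid JSONL rather than a single JSON array.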
Constraints:
When to Trigger: User requests evaluation execution or mentions running tests
User Intent Recognition:
Execution Flow:
Parse user request: Extract any specific requirements (e.g., "run on subset", "verbose output")
Navigate to evaluation project directory:
cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
Load the current evaluation plan (eval/eval-plan.md) to understand evaluation requirements and agent architecture.
Follow this execution flow:
- Update requirements.txt at the repository root, adding Strands Evals SDK dependencies
- Use uv to create a virtual environment, activate it, and install requirements.txt
- Implement eval/run_evaluation.py using Strands Evals SDK patterns with Case objects, the Experiment class, and appropriate evaluators
- Write evaluation outputs to the eval/results/ directory
- Create eval/README.md with running instructions for users
- Report completion with an evaluation results summary and suggest the next step: "Would you like me to analyze these results and provide recommendations?"
CRITICAL - Always Create a Minimal Working Version: Implement the most basic version that works first.
CRITICAL REQUIREMENT - Getting Latest Documentation: Before implementing evaluation code, you MUST retrieve the latest Strands Evals SDK documentation and API usage examples. This is NOT optional. You MUST NOT proceed with implementation without either context7 access or the source code. This ensures you're using the most current patterns and avoiding deprecated APIs.
Step 1: Check Context7 MCP Availability: First, check if context7 MCP server is available by attempting to use it. If you receive an error indicating context7 is not available, proceed to Step 3.
Step 2: Primary Method - Using Context7 (If Available):
Step 3: Fallback Method - REQUIRED If Context7 Is Not Available: If context7 MCP is not installed or doesn't have Strands Evals SDK documentation, you MUST STOP and prompt the user to take one of these actions:
REQUIRED USER ACTION - Choose ONE of the following:
Option 1: Install Context7 MCP Server (Recommended)
Please install the context7 MCP server in your coding assistant to access the latest Strands Evals SDK documentation. Installation steps vary by assistant:
Package: @upstash/context7-mcp
Note: If you're unsure how to install MCP servers in your coding assistant, please consult your assistant's support resources or choose Option 2 below (clone source code).
After installation, you'll be able to query: "Get documentation for strands-agents-evals focusing on Case, Experiment, and Evaluator classes"
Option 2: Clone Strands Evals SDK Source Code
If you cannot install context7 MCP or prefer to work with source code directly:
cd <your-evaluation-project>
git clone https://github.com/strands-agents/evals strands-agents-evals-source
IMPORTANT: You MUST NOT proceed with implementation until the user has completed one of these options. Do NOT attempt to implement evaluation code using only the reference examples in Appendix C, as they may be outdated.
After the user confirms they've completed one of the above options:
If Context7 was installed:
If source code was cloned:
Review the cloned source to learn current API patterns:
- strands-agents-evals-source/src/strands_evals/
- strands-agents-evals-source/examples/

Core Components:
Check Existing Requirements: Verify requirements.txt exists in repository root
# Check if requirements.txt exists
ls requirements.txt
Add Strands Evals SDK Dependencies: Update existing requirements.txt with Strands evaluation dependencies
# Add Strands Evals SDK and related dependencies
grep -q "strands-agents-evals" requirements.txt || echo "strands-agents-evals" >> requirements.txt
# Add other evaluation-specific dependencies as needed based on evaluation plan
Installation: Use uv for dependency management
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
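After installation, a quick import check can confirm the dependency resolved. The module name `strands_evals` is assumed here from the cloned source layout (src/strands_evals/) and may differ; adjust it to whatever the installed package actually exposes:

```python
import importlib.util

# Module names to verify; "strands_evals" is assumed from the source
# layout (src/strands_evals/) -- adjust if the SDK uses another name.
for mod in ("strands_evals",):
    spec = importlib.util.find_spec(mod)
    print(f"{mod}: {'installed' if spec else 'MISSING'}")
```

Run this inside the activated virtual environment; a MISSING result means the requirements.txt install step did not take effect there.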
Constraints:
When to Trigger: User requests results analysis or mentions generating a report
User Intent Recognition:
Execution Flow:
Parse user request: Extract any specific analysis focus (e.g., "focus on failures", "prioritize critical issues")
Navigate to evaluation project directory:
cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
Load and analyze the evaluation results from eval/results/
Follow this execution flow:
Results Analysis Process:
a. Data Validation: Ensure results are from real execution:
b. Results Analysis: Analyze evaluation outcomes:
c. Insights Generation: Identify key findings:
Improvement Recommendations: Generate specific, actionable recommendations:
a. Prioritized Recommendations: Based on evaluation findings:
Critical Issues (Immediate attention required)
Quality Improvements (Medium-term enhancements)
Enhancement Opportunities (Future improvements)
b. Evidence-Based Recommendations: All recommendations must cite specific data:
Advisory Report Generation: Create focused report using the template structure (see Appendix B: Evaluation Report Template) with:
IMPORTANT: Follow all HTML comment instructions (<!-- ACTION REQUIRED: ... -->) in the template when generating content, then remove these comment instructions from the final report - they are template guidance only and should not appear in the generated report.
Report completion with key findings and ask: "Would you like me to help implement any of these recommendations?"
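As an illustrative sketch of the analysis step, loading result files and computing a pass rate might look like the following. The record schema (a list of objects with a boolean "passed" field) is a hypothetical assumption; adapt it to whatever run_evaluation.py actually serializes:

```python
import json
from pathlib import Path

def summarize_results(results_dir: str) -> dict:
    """Aggregate pass/fail counts across all JSON result files.

    Assumes each file holds a list of records with a boolean "passed"
    field -- a hypothetical schema; adapt to the real output format.
    """
    records = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        records.extend(json.loads(path.read_text(encoding="utf-8")))
    total = len(records)
    passed = sum(1 for r in records if r.get("passed"))
    return {"total": total, "passed": passed,
            "pass_rate": passed / total if total else 0.0}
```

Pointing this at eval/results/ walks all timestamped subdirectories, which is why `rglob` is used rather than a flat listing.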
Always check for these indicators of simulated results:
Good Recommendations:
Poor Recommendations:
Ensure your advisory report:
Evaluation Report Template: See Appendix B: Evaluation Report Template
Constraints:
Finalize the evaluation and prepare deliverables.
Constraints:
<your-evaluation-project>/         # Your chosen project name
├── <your-agent-folder>/           # Your chosen agent folder name
│   └── [agent source code]
└── eval/
    ├── eval-plan.md
    ├── test-cases.jsonl
    ├── results/
    ├── run_evaluation.py
    ├── eval-report.md
    └── README.md
agent_path: "./chatbot-agent"
evaluation_focus: "response quality and tool calling accuracy"
Complete Evaluation Flow:
Phase 1 - Planning:
User: "I need to evaluate my customer support chatbot at ./chatbot-agent. Focus on response quality and tool calling accuracy."
Assistant: "I'll create an evaluation plan for your customer support chatbot..."
[Creates eval/eval-plan.md with 2 key metrics and 3 test scenarios]
Phase 2 - Data Generation:
User: "Yes, generate 5 test cases"
Assistant: "I'll generate 5 test cases covering the scenarios..."
[Creates eval/test-cases.jsonl with 2 basic queries, 2 tool-calling scenarios, 1 edge case]
Phase 3 - Evaluation Execution:
User: "Run the evaluation"
Assistant: "I'll implement and execute the evaluation using Strands Evals SDK..."
[Creates eval/run_evaluation.py, runs evaluation]
Results: Overall success rate: 80%, Response Quality: 4.2/5, Tool Call Accuracy: 75%
Phase 4 - Analysis:
User: "Yes, analyze the results"
Assistant: "I'll analyze the evaluation results and generate recommendations..."
[Creates eval/eval-report.md]
Key findings: Strong performance on basic queries (100% success), Tool calling needs improvement (25% failure rate)
User: "Create an evaluation plan for my agent at ./my-agent"
Assistant: [Creates initial plan in eval/eval-plan.md]
User: "Add more focus on error handling"
Assistant: "I'll update the evaluation plan to include error handling metrics..."
[Updates eval/eval-plan.md]
User: "Generate test cases with more edge cases"
Assistant: "I'll generate test cases with additional edge case coverage..."
[Updates eval/test-cases.jsonl]
After running all phases, your agent repository will have the following structure:
<your-evaluation-project>/         # Your chosen project name (e.g., my-chatbot-eval)
├── <your-agent-folder>/           # Your chosen agent folder name (e.g., chatbot-agent)
│   └── [existing agent files]
└── eval/                          # All evaluation files (sibling to agent folder)
    ├── eval-plan.md               # Complete evaluation specification and plan
    ├── test-cases.jsonl           # Generated test scenarios
    ├── README.md                  # Running instructions and usage examples
    ├── run_evaluation.py          # Strands Evals SDK evaluation implementation
    ├── results/                   # Evaluation outputs
    │   └── [timestamp]/           # Timestamped evaluation results
    └── eval-report.md             # Analysis and recommendations
Note:
Folder names (<your-evaluation-project>, <your-agent-folder>) are placeholders - use any names that fit your project

EvalKit automatically manages phase dependencies:
If a user requests a phase without prerequisites:
Example: User says "run the evaluation" but no test cases exist
Response: "I don't see any test cases yet. Would you like me to:
After completing each phase, suggest the logical next step:
Issue: User requests evaluation but no agent path provided
Issue: Evaluation plan doesn't exist when user requests test generation
Issue: Test cases don't exist when user requests evaluation
Issue: Test data generation fails
Validate the file with python -m json.tool --json-lines < eval/test-cases.jsonl (Python 3.8+); plain python -m json.tool expects a single JSON document and fails on multi-line JSONL
Issue: Evaluation implementation fails with Strands Evals SDK errors
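Because `python -m json.tool` (without flags) parses a single JSON document, a JSONL file needs per-line validation to pinpoint the bad record. A minimal check:

```python
import json
from pathlib import Path

def validate_jsonl(path: str) -> list[str]:
    """Return error messages for lines that fail to parse as JSON."""
    errors = []
    for i, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {i}: {exc}")
    return errors
```

Calling `validate_jsonl("eval/test-cases.jsonl")` returns an empty list when the file is valid; otherwise each message names the offending line number.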
Issue: Import errors for evaluation dependencies
Activate the virtual environment (source .venv/bin/activate) and reinstall dependencies (uv pip install -r requirements.txt)
Issue: Agent execution fails during evaluation
Issue: User is unsure what to do next
The following template is used for creating eval-plan.md:
# Evaluation Plan for [AGENT NAME]
## 1. Evaluation Requirements
<!--
ACTION REQUIRED: User input and interpreted evaluation requirements. Defaults to 1-2 key metrics if unspecified.
-->
- **User Input:** `"$ARGUMENTS"` or "No Input"
- **Interpreted Evaluation Requirements:** [Parsed from user input - highest priority]
---
## 2. Agent Analysis
| **Attribute** | **Details** |
| :-------------------- | :---------------------------------------------------------- |
| **Agent Name** | [Agent name] |
| **Purpose** | [Primary purpose and use case in 1-2 sentences] |
| **Core Capabilities** | [Key functionalities the agent provides] |
| **Input** | [Short description, Data types, schemas] |
| **Output** | [Short description, Response types, schemas] |
| **Agent Framework** | [e.g., CrewAI, LangGraph, AutoGen, Custom/None] |
| **Technology Stack** | [Programming language, frameworks, libraries, dependencies] |
**Agent Architecture Diagram:**
[Mermaid diagram illustrating:
- Agent components and their relationships
- Data flow between components
- External integrations (APIs, databases, tools)
- User interaction points]
**Key Components:**
- **[Component Name 1]:** [Brief description of purpose and functionality]
- **[Component Name 2]:** [Brief description of purpose and functionality]
- [Additional components as needed]
**Available Tools:**
- **[Tool Name 1]:** [Purpose and usage]
- **[Tool Name 2]:** [Purpose and usage]
- [Additional tools as needed]
**Observability Status**
- **Tracing Framework** [Fully/Partially/Not Instrumented, Framework name, version]
- **Custom Attributes** [Yes/No, Key custom attributes if present]
---
## 3. Evaluation Metrics
<!--
ACTION REQUIRED: If no specific user requirements are provided, use a minimal number of metrics (1-2 metrics) focusing on the most critical aspects of agent performance.
-->
### [Metric Name 1]
- **Evaluation Area:** [Final response quality/tool call accuracy/...]
- **Description:** [What is measured and why]
- **Method:** [Code-based | LLM-as-Judge ]
### [Metric Name 2]
[Repeat for each metric]
---
## 4. Test Data Generation
<!--
ACTION REQUIRED: Keep scenarios minimal and focused. Do not propose more than 3 scenarios.
-->
- **[Test Scenario 1]**: [Description and purpose, complexity]
- **[Test scenario 2]**: [Description and purpose, complexity]
- **Total number of test cases**: [SHOULD NOT exceed 3]
---
## 5. Evaluation Implementation Design
### 5.1 Evaluation Code Structure
<!--
ACTION REQUIRED: The code structure below will be adjusted based on your evaluation requirements and existing agent codebase. This is the recommended starting structure. Only adjust it if necessary.
-->
./                                 # Repository root directory
├── requirements.txt               # Consolidated dependencies
├── .venv/                         # Python virtual environment (created by uv)
│
└── eval/                          # Evaluation workspace
    ├── README.md                  # Running instructions and usage examples (always present)
    ├── run_evaluation.py          # Strands Evals SDK evaluation implementation (always present)
    ├── results/                   # Evaluation outputs (always present)
    ├── eval-plan.md               # This evaluation specification and plan (always present)
    └── test-cases.jsonl           # Generated test cases (from evalkit.data)
### 5.2 Recommended Evaluation Technical Stack
| **Component** | **Selection** |
| :----------------------- | :------------------------------------------------------------ |
| **Language/Version** | [e.g., Python 3.11, Node.js 18+] |
| **Evaluation Framework** | [Strands Evals SDK (default)] |
| **Evaluators** | [OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator] |
| **Agent Integration** | [e.g., Direct import, API] |
| **Results Storage** | [e.g., JSON files (default)] |
---
## 6. Progress Tracking
### 6.1 User Requirements Log
| **Timestamp** | **Phase** | **Requirement** |
| :----------------- | :-------- | :------------------------------------------------------------------- |
| [YYYY-MM-DD HH:MM] | Planning | [User input from $ARGUMENTS, or "No specific requirements provided"] |
### 6.2 Evaluation Progress
| **Timestamp** | **Component** | **Status** | **Notes** |
| :----------------- | :--------------- | :------------------------------ | :--------------------------------------------- |
| [YYYY-MM-DD HH:MM] | [Component name] | [In Progress/Completed/Blocked] | [Technical details, blockers, or achievements] |
The following template is used for creating eval-report.md:
# Agent Evaluation Report for [AGENT NAME]
## Executive Summary
<!--
ACTION REQUIRED: Provide high-level evaluation results and key findings. Focus on actionable insights for stakeholders.
-->
- **Test Scale**: [N] test cases
- **Success Rate**: [XX.X%]
- **Status**: [Excellent/Good/Poor]
- **Strengths**: [Specific capability or metric] [Performance highlight] [Reliability aspect]
- **Critical Issues**: [Blocking issue + impact] [Performance bottleneck] [Safety/compliance concern]
- **Action Priority**: [Critical fixes] [Improvements] [Enhancements]
---
## Evaluation Results
### Test Case Coverage
<!--
ACTION REQUIRED: List all test scenarios that were evaluated, providing context for the results.
-->
- **[Test Scenario 1]**: [Description and coverage]
- **[Test Scenario 2]**: [Description and coverage]
- [Additional scenarios as needed]
### Results
| **Metric** | **Score** | **Target** | **Status** |
| :-------------- | :-------- | :--------- | :---------- |
| [Metric Name 1] | [XX.X%] | [XX%] | [Pass/Fail] |
| [Metric Name 2] | [X.X/5] | [4.0+] | [Pass/Fail] |
| [Metric Name 3] | [XX.X%] | [95%+] | [Pass/Fail] |
### Results Summary
[Brief description of overall performance and findings across metrics]
---
## Agent Success Analysis
<!--
ACTION REQUIRED: Focus on what the agent does well. Provide specific evidence and contributing factors for successful performance.
-->
### Strengths
- **[Strength Name 1]**: [What the agent does exceptionally well]
- **Evidence**: [Specific metrics and examples]
- **Contributing Factors**: [Why this works well]
- **[Strength Name 2]**: [What the agent does exceptionally well]
- **Evidence**: [Specific metrics and examples]
- **Contributing Factors**: [Why this works well]
[Repeat pattern for additional strengths]
### High-Performing Scenarios
- **[Scenario Type 1]**: [Category of tasks where agent excels]
- **Key Characteristics**: [What makes these scenarios successful]
- **[Scenario Type 2]**: [Category of tasks where agent excels]
- **Key Characteristics**: [What makes these scenarios successful]
[Repeat pattern for additional scenarios]
---
## Agent Failure Analysis
<!--
ACTION REQUIRED: Analyze failures systematically. Provide root cause analysis and specific improvement recommendations with expected impact.
-->
### Issue 1 - [Priority Level]
- **Issue**: [Clear problem statement with evaluation metrics]
- **Root Cause**: [Technical analysis of why this occurred — path/to/file.py:START-END]
- **Evidence**: [Specific data points from results]
- **Impact**: [Effect on overall performance]
- **Priority Fixes**:
- P1 — [Fix name]: [One-line solution] → Expected gain: [Metric +X]
- P2 — [Fix name]: [One-line solution] → Expected gain: [Metric improvement]
### Issue 2 - [Priority Level]
[Repeat structure for additional issues]
---
## Action Items & Recommendations
<!--
ACTION REQUIRED: Provide specific, implementable tasks with clear steps. Prioritize by impact and effort required.
-->
### [Item Name] - Priority [Number] ([Critical/Enhancement])
- **Description**: [Description of this item]
- **Actions**:
- [ ] [Specific task with implementation steps]
- [ ] [Specific task with implementation steps]
- [ ] [Additional tasks as needed]
### [Additional Item Name] - Priority [Number] ([Critical/Enhancement])
[Repeat structure for additional action items]
---
## Artifacts & Reproduction
### Reference Materials
- **Agent Code**: [Path to agent implementation]
- **Test Cases**: [Path to test cases]
- **Traces**: [Path to traces]
- **Results**: [Path to results files]
- **Evaluation Code**: [Path to evaluation implementation]
---
## Evaluation Limitations and Improvement
<!--
ACTION REQUIRED: Identify limitations in the current evaluation approach and suggest improvements for future iterations.
-->
### Test Data Improvement
- **Current Limitations**: [Evaluation scope limitations]
- **Recommended Improvements**: [Specific suggestions for test data enhancement]
### Evaluation Code Enhancement
- **Current Limitations**: [Limitations of evaluation implementation and metrics]
- **Recommended Improvements**: [Specific suggestions for evaluation code improvement]
### [Additional Improvement Area]
[Repeat structure for other evaluation improvement areas]