Use this when you need to IMPROVE or OPTIMIZE an existing LLM agent's performance - including improving tool selection accuracy, answer quality, reducing costs, or fixing issues where the agent gives wrong/incomplete responses. Evaluates agents systematically using MLflow evaluation with datasets, scorers, and tracing. Covers end-to-end evaluation workflow or individual components (tracing setup, dataset creation, scorer definition, evaluation execution).
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing
Evaluation workflow in 4 steps:
1. Select and register scorers
2. Prepare an evaluation dataset
3. Generate traces and run the evaluation
4. Analyze results
Always use uv run for MLflow and Python commands:
```bash
uv run mlflow --version        # MLflow CLI commands
uv run python scripts/xxx.py   # Python script execution
uv run python -c "..."         # Python one-liners
```
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
```bash
# WRONG - mixes progress bars and logs with JSON output
uv run mlflow traces evaluate ... --output json > results.json

# CORRECT - separates stderr from JSON output
uv run mlflow traces evaluate ... --output json 2>/dev/null > results.json

# ALTERNATIVE - save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
```
When to separate streams:
- When using the `--output json` flag
- When piping output to parsing tools (`jq`, `grep`, etc.)

When NOT to separate:
- When running a command interactively, where progress bars and logs help with debugging
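When the redirect is done correctly, the output file should parse as pure JSON. A small stdlib sketch (the demo filename and error message are illustrative, not part of MLflow) can catch a polluted file early:

```python
import json
from pathlib import Path

def load_clean_json(path):
    """Load a results file, failing loudly if log lines leaked into it."""
    text = Path(path).read_text()
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # A parse error usually means stderr (progress bars, logs) was
        # mixed into the file - re-run the command with `2>/dev/null`.
        raise SystemExit(
            f"{path} is not valid JSON (error near line {err.lineno}); "
            "redirect stderr with 2>/dev/null and retry"
        )

# Demo: a file polluted by a stray progress-bar line fails fast
Path("demo_results.json").write_text('Evaluating: 100%|####|\n{"results": []}')
try:
    load_clean_json("demo_results.json")
except SystemExit as e:
    print(e)
```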
All MLflow documentation must be accessed through llms.txt:
https://mlflow.org/docs/latest/llms.txt

This applies to all steps, especially scorer discovery and evaluation API usage.
Validate environment before starting:
```bash
uv run mlflow --version  # Should be >=3.8.0
uv run python -c "import mlflow; print(f'MLflow {mlflow.__version__} installed')"
```
If MLflow is missing or version is <3.8.0, see Setup Overview below.
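The version gate can also be checked in plain Python. This sketch is a hypothetical helper, not part of MLflow; real scripts may prefer `packaging.version` for full PEP 440 handling:

```python
def version_tuple(v: str) -> tuple:
    """Turn '3.8.0' (ignoring suffixes like '.dev0') into a comparable tuple."""
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed: str, required: str = "3.8.0") -> bool:
    """True when the installed version satisfies the >=3.8.0 requirement."""
    return version_tuple(installed) >= version_tuple(required)

print(meets_minimum("3.10.1"))  # True - tuple compare, not string compare
print(meets_minimum("3.7.4"))   # False
```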
Each project has unique structure. Use dynamic exploration instead of assumptions:
```bash
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"

# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null

# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py"  # FastAPI/Flask
grep -r "def.*route" . --include="*.py"

# Find autolog calls
grep -r "mlflow.*autolog" . --include="*.py"

# Find trace decorators
grep -r "@mlflow.trace" . --include="*.py"

# Check imports
grep -r "import mlflow" . --include="*.py"

# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"

# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100

# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
```
Before evaluation, complete these three setup steps:
1. Environment setup - see references/setup-guide.md Steps 1-2
2. Tracing integration - see references/tracing-integration.md, the authoritative tracing guide
3. Validation - run scripts/validate_agent_tracing.py after implementing tracing

⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
- MLflow >=3.8.0 is installed
- Environment variables are configured
- The agent produces traces
Validation scripts:
```bash
uv run python scripts/validate_environment.py  # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py         # Test authentication before expensive operations
```
For complete setup instructions: See references/setup-guide.md
Discover built-in scorers using documentation protocol:
- Consult https://mlflow.org/docs/latest/llms.txt for "What built-in LLM judges or scorers are available?"
- Do not rely on `mlflow scorers list -b` - use documentation instead for accurate information

Check registered scorers in your experiment:
```bash
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
```
Identify quality dimensions for your agent and select appropriate scorers
Register scorers and test on sample trace before full evaluation
For scorer selection and registration: See references/scorers.md
For CLI constraints (yes/no format, template variables): See references/scorers-constraints.md
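As an illustration of the yes/no constraint, a judge prompt template might look like the sketch below. The template variable names (`{{ inputs }}`, `{{ outputs }}`) are an assumption here - confirm the supported variables in references/scorers-constraints.md before registering:

```
You are judging whether the agent's answer addresses the user's question.

Question: {{ inputs }}
Answer: {{ outputs }}

Does the answer directly and completely address the question?
Respond with exactly "yes" or "no".
```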
ALWAYS discover existing datasets first to prevent duplicate work:
Run dataset discovery (mandatory):
```bash
uv run python scripts/list_datasets.py                 # Lists, compares, recommends datasets
uv run python scripts/list_datasets.py --format json   # Machine-readable output
uv run python scripts/list_datasets.py --help          # All options
```
Present the findings to the user and ask whether an existing dataset should be reused.

Create a new dataset only if the user declined the existing ones:
```bash
# Generates dataset creation script from test cases file
uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
uv run python scripts/create_dataset_template.py --help  # See all options
```
Generated code uses mlflow.genai.datasets APIs - review and execute the script.
IMPORTANT: Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For complete dataset guide: See references/dataset-preparation.md
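For orientation, evaluation records are typically input/expectation pairs. The field names and example content below follow a common convention but are an assumption - confirm the exact schema in references/dataset-preparation.md:

```python
# Each record pairs agent inputs with optional expectations.
# Field names ("inputs", "expectations") are assumed, not verified
# against the generated script - check the dataset guide.
records = [
    {
        "inputs": {"question": "What is our refund policy?"},
        "expectations": {"expected_facts": ["30-day window", "receipt required"]},
    },
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"expected_facts": ["settings page", "email link"]},
    },
]

# Basic sanity check before handing records to the dataset script
assert all("inputs" in r for r in records)
print(f"{len(records)} records ready")
```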
Generate traces:
```bash
# Generates evaluation script (auto-detects agent module, entry point, dataset)
uv run python scripts/run_evaluation_template.py
uv run python scripts/run_evaluation_template.py --help  # Override auto-detection
```
Generated script uses mlflow.genai.evaluate() - review and execute it.
Apply scorers:
```bash
# IMPORTANT: Redirect stderr to avoid mixing logs with JSON output
uv run mlflow traces evaluate \
  --trace-ids <comma_separated_trace_ids> \
  --scorers <scorer1>,<scorer2>,... \
  --output json 2>/dev/null > evaluation_results.json
```
Analyze results:
```bash
# Pattern detection, failure analysis, recommendations
uv run python scripts/analyze_results.py evaluation_results.json
```
Generates evaluation_report.md with pass rates and improvement suggestions.
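The kind of aggregation such a report performs can be sketched in a few lines. The flattened `(scorer, passed)` shape here is hypothetical - it is not the actual format of evaluation_results.json:

```python
from collections import Counter

def pass_rates(assessments):
    """Aggregate per-scorer pass rates from (scorer_name, passed) pairs."""
    totals, passes = Counter(), Counter()
    for scorer, passed in assessments:
        totals[scorer] += 1
        if passed:
            passes[scorer] += 1
    return {s: passes[s] / totals[s] for s in totals}

# Hypothetical flattened assessments for two scorers
sample = [
    ("relevance", True),
    ("relevance", True),
    ("groundedness", False),
    ("groundedness", True),
]
print(pass_rates(sample))  # {'relevance': 1.0, 'groundedness': 0.5}
```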
Detailed guides are in references/ - load them as needed.
Scripts are self-documenting - run them with --help for usage details.