Use this when you need to IMPROVE or OPTIMIZE an existing LLM agent's performance - including improving tool selection accuracy, answer quality, reducing costs, or fixing issues where the agent gives wrong/incomplete responses. Evaluates agents systematically using MLflow evaluation with datasets, scorers, and tracing. Covers end-to-end evaluation workflow or individual components (tracing setup, dataset creation, scorer definition, evaluation execution).
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing
Evaluation workflow in 4 steps:
1. Select and register scorers
2. Prepare an evaluation dataset
3. Generate traces and run the evaluation
4. Analyze results
Always use uv run for MLflow and Python commands:
```bash
uv run mlflow --version        # MLflow CLI commands
uv run python scripts/xxx.py   # Python script execution
uv run python -c "..."         # Python one-liners
```
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
```bash
# WRONG - mixes progress bars and logs with JSON output
uv run mlflow traces evaluate ... --output json > results.json

# CORRECT - separates stderr from JSON output
uv run mlflow traces evaluate ... --output json 2>/dev/null > results.json

# ALTERNATIVE - save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
```
When to separate streams:
- When using the `--output json` flag
- When piping output to parsing tools (`jq`, `grep`, etc.)

When NOT to separate:
- When running a command interactively, where progress bars and logs help with debugging
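When the redirect is done correctly, the output file should parse as pure JSON. A small stdlib sketch (the demo filename and error message are illustrative, not part of MLflow) can catch a polluted file early:

```python
import json
from pathlib import Path

def load_clean_json(path):
    """Load a results file, failing loudly if log lines leaked into it."""
    text = Path(path).read_text()
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # A parse error usually means stderr (progress bars, logs) was
        # mixed into the file - re-run the command with `2>/dev/null`.
        raise SystemExit(
            f"{path} is not valid JSON (error near line {err.lineno}); "
            "redirect stderr with 2>/dev/null and retry"
        )

# Demo: a file polluted by a stray progress-bar line fails fast
Path("demo_results.json").write_text('Evaluating: 100%|####|\n{"results": []}')
try:
    load_clean_json("demo_results.json")
except SystemExit as e:
    print(e)
```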
All MLflow documentation must be accessed through llms.txt:
https://mlflow.org/docs/latest/llms.txt

This applies to all steps, especially scorer discovery and evaluation API usage.
Validate environment before starting:
```bash
uv run mlflow --version  # Should be >=3.8.0
uv run python -c "import mlflow; print(f'MLflow {mlflow.__version__} installed')"
```
If MLflow is missing or version is <3.8.0, see Setup Overview below.
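The version gate can also be checked in plain Python. This sketch is a hypothetical helper, not part of MLflow; real scripts may prefer `packaging.version` for full PEP 440 handling:

```python
def version_tuple(v: str) -> tuple:
    """Turn '3.8.0' (ignoring suffixes like '.dev0') into a comparable tuple."""
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed: str, required: str = "3.8.0") -> bool:
    """True when the installed version satisfies the >=3.8.0 requirement."""
    return version_tuple(installed) >= version_tuple(required)

print(meets_minimum("3.10.1"))  # True - tuple compare, not string compare
print(meets_minimum("3.7.4"))   # False
```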
Each project has unique structure. Use dynamic exploration instead of assumptions:
```bash
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"

# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null

# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py"  # FastAPI/Flask
grep -r "def.*route" . --include="*.py"

# Find autolog calls
grep -r "mlflow.*autolog" . --include="*.py"

# Find trace decorators
grep -r "@mlflow.trace" . --include="*.py"

# Check imports
grep -r "import mlflow" . --include="*.py"

# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"

# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100

# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
```
Before evaluation, complete these three setup steps:
1. Environment setup - see references/setup-guide.md Steps 1-2
2. Tracing integration - see references/tracing-integration.md, the authoritative tracing guide
3. Validation - run scripts/validate_agent_tracing.py after implementing tracing

⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
- MLflow >=3.8.0 is installed
- Environment variables are configured
- The agent produces traces
Validation scripts:
```bash
uv run python scripts/validate_environment.py  # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py         # Test authentication before expensive operations
```
For complete setup instructions: See references/setup-guide.md
Discover built-in scorers using documentation protocol:
- Consult https://mlflow.org/docs/latest/llms.txt for "What built-in LLM judges or scorers are available?"
- Do not rely on `mlflow scorers list -b` - use documentation instead for accurate information

Check registered scorers in your experiment:
```bash
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
```
Identify quality dimensions for your agent and select appropriate scorers
Register scorers and test on sample trace before full evaluation
For scorer selection and registration: See references/scorers.md
For CLI constraints (yes/no format, template variables): See references/scorers-constraints.md
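As an illustration of the yes/no constraint, a judge prompt template might look like the sketch below. The template variable names (`{{ inputs }}`, `{{ outputs }}`) are an assumption here - confirm the supported variables in references/scorers-constraints.md before registering:

```
You are judging whether the agent's answer addresses the user's question.

Question: {{ inputs }}
Answer: {{ outputs }}

Does the answer directly and completely address the question?
Respond with exactly "yes" or "no".
```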
ALWAYS discover existing datasets first to prevent duplicate work:
Run dataset discovery (mandatory):
```bash
uv run python scripts/list_datasets.py                 # Lists, compares, recommends datasets
uv run python scripts/list_datasets.py --format json   # Machine-readable output
uv run python scripts/list_datasets.py --help          # All options
```
Present the findings to the user and ask whether an existing dataset should be reused.

Create a new dataset only if the user declined the existing ones:
```bash
# Generates dataset creation script from test cases file
uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
uv run python scripts/create_dataset_template.py --help  # See all options
```
Generated code uses mlflow.genai.datasets APIs - review and execute the script.
IMPORTANT: Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For complete dataset guide: See references/dataset-preparation.md
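For orientation, evaluation records are typically input/expectation pairs. The field names and example content below follow a common convention but are an assumption - confirm the exact schema in references/dataset-preparation.md:

```python
# Each record pairs agent inputs with optional expectations.
# Field names ("inputs", "expectations") are assumed, not verified
# against the generated script - check the dataset guide.
records = [
    {
        "inputs": {"question": "What is our refund policy?"},
        "expectations": {"expected_facts": ["30-day window", "receipt required"]},
    },
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"expected_facts": ["settings page", "email link"]},
    },
]

# Basic sanity check before handing records to the dataset script
assert all("inputs" in r for r in records)
print(f"{len(records)} records ready")
```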
Generate traces:
```bash
# Generates evaluation script (auto-detects agent module, entry point, dataset)
uv run python scripts/run_evaluation_template.py
uv run python scripts/run_evaluation_template.py --help  # Override auto-detection
```
Generated script uses mlflow.genai.evaluate() - review and execute it.
Apply scorers:
```bash
# IMPORTANT: Redirect stderr to avoid mixing logs with JSON output
uv run mlflow traces evaluate \
  --trace-ids <comma_separated_trace_ids> \
  --scorers <scorer1>,<scorer2>,... \
  --output json 2>/dev/null > evaluation_results.json
```
Analyze results:
```bash
# Pattern detection, failure analysis, recommendations
uv run python scripts/analyze_results.py evaluation_results.json
```
Generates evaluation_report.md with pass rates and improvement suggestions.
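The kind of aggregation such a report performs can be sketched in a few lines. The flattened `(scorer, passed)` shape here is hypothetical - it is not the actual format of evaluation_results.json:

```python
from collections import Counter

def pass_rates(assessments):
    """Aggregate per-scorer pass rates from (scorer_name, passed) pairs."""
    totals, passes = Counter(), Counter()
    for scorer, passed in assessments:
        totals[scorer] += 1
        if passed:
            passes[scorer] += 1
    return {s: passes[s] / totals[s] for s in totals}

# Hypothetical flattened assessments for two scorers
sample = [
    ("relevance", True),
    ("relevance", True),
    ("groundedness", False),
    ("groundedness", True),
]
print(pass_rates(sample))  # {'relevance': 1.0, 'groundedness': 0.5}
```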
Detailed guides are in references/ - load them as needed.
Scripts are self-documenting - run them with --help for usage details.