Create a new built-in classification evaluator for Phoenix evals. Use this skill whenever the user asks to create a new eval, build a new metric, add a new builtin evaluator, create an LLM-as-a-judge metric, or add a new classification evaluator to Phoenix.
A built-in evaluator is a YAML config (source of truth) that gets compiled into Python and TypeScript code, wrapped in evaluator classes, benchmarked, and documented. The whole pipeline is linear — follow these steps in order.
Before writing anything, clarify with the user:
- Which template placeholders the prompt needs ({{input}}, {{output}}, {{reference}}, {{tool_definitions}}). If the user is vague, ask follow-up questions — the placeholders are the contract between the evaluator and the caller.
- Whether the evaluator should be a promoted_dataset_evaluator.

Create prompts/classification_evaluator_configs/{NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml.
Read an existing config to match the current schema. Start with CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml for a simple example, or TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml if your evaluator needs structured span data.
- choices — Maps label strings to numeric scores. For binary evaluators, use positive/negative labels (e.g., correct: 1.0 / incorrect: 0.0). The labels you pick here flow through to the Python class, TS factory, and benchmarks.
- optimization_direction — Use maximize when the positive label is the desired outcome (most evaluators). Use minimize only if the metric measures something undesirable (e.g., hallucination). This affects how Phoenix displays the metric in the UI.
- labels — Optional list. Add promoted_dataset_evaluator only if this evaluator should appear in the dataset experiments UI sidebar.
- substitutions — Only needed if the evaluator is a promoted_dataset_evaluator and works with structured span data (tool definitions, tool calls, message arrays). These reference formatter snippets defined in prompts/formatters/server.yaml. Read that file if you need substitutions — it defines what structured data formats are available. Most evaluators that only use simple text fields (input, output, reference) don't need substitutions.
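Putting those fields together, a minimal config might look like the sketch below. This is illustrative only — the evaluator name, labels, and template text are made up, and the exact schema should be copied from an existing config such as CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml:

```yaml
# Illustrative sketch — match the exact field names to an existing config file.
name: politeness
choices:
  polite: 1.0      # positive label maps to score 1.0
  impolite: 0.0    # negative label maps to score 0.0
optimization_direction: maximize  # polite is the desired outcome
labels:
  - promoted_dataset_evaluator    # only if it belongs in the experiments UI
template: |
  You are evaluating whether a response is polite.

  <input>{{input}}</input>
  <output>{{output}}</output>

  Respond with "polite" or "impolite".
```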
In the prompt template, use:

- XML-style tags (<context>, <output>) for clear data formatting
- {{placeholder}} (Mustache syntax) for template variables

Then run make codegen-prompts.
This generates code in three places:
- packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs/ (Python)
- src/phoenix/__generated__/classification_evaluator_configs/ (Python, server copy)
- js/packages/phoenix-evals/src/__generated__/default_templates/ (TypeScript)

Verify the generated files look correct before moving on.
Create packages/phoenix-evals/src/phoenix/evals/metrics/{name}.py.
Read correctness.py in that directory — it's the canonical example. Your evaluator follows the same pattern: subclass ClassificationEvaluator, pull constants from the generated config, define a Pydantic input schema with fields matching your template placeholders.
After creating the file, add it to the exports in metrics/__init__.py — both the import and the __all__ list. Read the current __init__.py to see the existing pattern.
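The key invariant in this step is that the input schema's field names mirror the template's Mustache placeholders exactly. A self-contained sketch of that contract (the template text and field set here are hypothetical, not Phoenix's actual config; the real schema is a Pydantic model as in correctness.py):

```python
import re

# Hypothetical template mirroring a generated config (names are illustrative).
TEMPLATE = """You are evaluating correctness.

<input>{{input}}</input>
<output>{{output}}</output>
<reference>{{reference}}</reference>

Answer "correct" or "incorrect"."""


def template_placeholders(template: str) -> set[str]:
    """Extract the {{placeholder}} names from a Mustache-style template."""
    return set(re.findall(r"\{\{(\w+)\}\}", template))


# The input schema's field names must match the placeholders one-to-one;
# this check mirrors the contract the Pydantic schema enforces at call time.
SCHEMA_FIELDS = {"input", "output", "reference"}
assert template_placeholders(TEMPLATE) == SCHEMA_FIELDS
```

If a placeholder and a schema field drift apart, the evaluator silently renders an empty slot in the prompt, so a check like this is worth keeping in mind while writing the model.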
Create js/packages/phoenix-evals/src/llm/create{Name}Evaluator.ts.
Read createCorrectnessEvaluator.ts — it's the canonical example. The pattern is a factory function that wraps createClassificationEvaluator with defaults from the generated config.
Then:
- Add the export to js/packages/phoenix-evals/src/llm/index.ts
- Read createFaithfulnessEvaluator.test.ts for the test pattern
- Run cd js && pnpm build
Fix any TypeScript errors before proceeding.
Create js/benchmarks/evals-benchmarks/src/{name}_benchmark.ts.
Read existing benchmarks in that directory to match the current patterns:
- tool_invocation_benchmark.ts — confusion matrix printing, multi-category analysis

The task function must return input and output text in its result so the failed examples printer has access to them.
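For reference, confusion-matrix printing boils down to tallying (expected, predicted) label pairs. The benchmarks themselves are TypeScript; this Python sketch (with made-up results) just illustrates the tally:

```python
from collections import Counter

# Hypothetical benchmark results: (expected, predicted) label pairs.
results = [
    ("correct", "correct"),
    ("correct", "incorrect"),
    ("incorrect", "incorrect"),
    ("incorrect", "incorrect"),
]

# Each (expected, predicted) pair is one cell of the confusion matrix.
matrix = Counter(results)
labels = ["correct", "incorrect"]
for expected in labels:
    row = [matrix[(expected, predicted)] for predicted in labels]
    print(expected, row)
```

Off-diagonal cells are the misclassifications worth inspecting in Step 7.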
Consider using a separate agent session for synthetic dataset generation if the examples need realistic domain-specific content — this keeps the dataset creation focused and avoids context-switching.
# Terminal 1: Start Phoenix
PHOENIX_WORKING_DIR=/tmp/phoenix-test phoenix serve
# Terminal 2: Run the benchmark
cd js/benchmarks/evals-benchmarks
pnpm tsx src/{name}_benchmark.ts
Target >80% accuracy. If accuracy is low, look at the failed examples output to decide whether to adjust the prompt (Step 1) or the benchmark examples (Step 6). Iterate until accuracy is acceptable.
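Accuracy here is plain exact-match agreement between expected and predicted labels, and inspecting the misses is what drives iteration. A minimal sketch with made-up rows (field names are illustrative, but note each row carries the input/output text the task function returned):

```python
# Hypothetical benchmark rows: each carries the text the task function
# returned plus the expected and predicted labels.
rows = [
    {"input": "2+2?", "output": "4",
     "expected": "correct", "predicted": "correct"},
    {"input": "Capital of France?", "output": "Lyon",
     "expected": "incorrect", "predicted": "correct"},
]

# Exact-match accuracy over all rows.
accuracy = sum(r["expected"] == r["predicted"] for r in rows) / len(rows)
failed = [r for r in rows if r["expected"] != r["predicted"]]

print(f"accuracy={accuracy:.0%}")
for r in failed:
    # The failed-examples printer needs the input/output text — hence the
    # Step 6 requirement that the task function return them.
    print(r["input"], "->", r["output"])
```

When accuracy is low, the failed rows tell you whether the judge prompt is misreading the data (fix Step 1) or the examples themselves are ambiguous (fix Step 6).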
Create docs/phoenix/evaluation/pre-built-metrics/{name}.mdx.
Read faithfulness.mdx in that directory — it's the template. Follow the same section structure.
After creating the docs page, update these three files:
- docs.json — add the page to the Evaluation > Pre-built Metrics nav group
- docs/phoenix/evaluation/pre-built-metrics.mdx — add a card to the landing page grid
- docs/phoenix/sitemap.xml — add the new URL

Read each file to see the existing pattern before editing.
Before calling it done, verify:
- make codegen-prompts ran successfully
- Python evaluator exported from metrics/__init__.py
- TypeScript factory exported from llm/index.ts
- TypeScript build passes (cd js && pnpm build)
- docs.json nav updated

After completing the workflow, verify these instructions matched reality:
- Did make codegen-prompts generate to different locations than listed above?

If anything drifted, update this SKILL.md before finishing so the next person (or agent) doesn't hit the same surprises.