Create a new built-in classification evaluator for Phoenix evals. Use this skill whenever the user asks to create a new eval, build a new metric, add a new builtin evaluator, create an LLM-as-a-judge metric, or add a new classification evaluator to Phoenix.
A built-in evaluator is a YAML config (source of truth) that gets compiled into Python and TypeScript code, wrapped in evaluator classes, benchmarked, and documented. The whole pipeline is linear — follow these steps in order.
Before writing anything, clarify with the user:
- Which template placeholders the prompt needs ({{input}}, {{output}}, {{reference}}, {{tool_definitions}}). If the user is vague, ask follow-up questions — the placeholders are the contract between the evaluator and the caller.
- Whether the evaluator should be a promoted_dataset_evaluator.

Create prompts/classification_evaluator_configs/{NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml.
Read an existing config to match the current schema. Start with CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml for a simple example, or TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml if your evaluator needs structured span data.
- choices — Maps label strings to numeric scores. For binary evaluators, use positive/negative labels (e.g., correct: 1.0 / incorrect: 0.0). The labels you pick here flow through to the Python class, TS factory, and benchmarks.
- optimization_direction — Use maximize when the positive label is the desired outcome (most evaluators). Use minimize only if the metric measures something undesirable (e.g., hallucination). This affects how Phoenix displays the metric in the UI.
- labels — Optional list. Add promoted_dataset_evaluator only if this evaluator should appear in the dataset experiments UI sidebar.
- substitutions — Only needed if the evaluator is a promoted_dataset_evaluator and works with structured span data (tool definitions, tool calls, message arrays). These reference formatter snippets defined in prompts/formatters/server.yaml. Read that file if you need substitutions — it defines what structured data formats are available. Most evaluators that only use simple text fields (input, output, reference) don't need substitutions.
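Putting those fields together, a minimal config might look like the sketch below. This is illustrative only — the evaluator name, labels, and template text are made up, and the exact schema should be copied from an existing config such as CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml:

```yaml
# Illustrative sketch — match the exact field names to an existing config file.
name: politeness
choices:
  polite: 1.0      # positive label maps to score 1.0
  impolite: 0.0    # negative label maps to score 0.0
optimization_direction: maximize  # polite is the desired outcome
labels:
  - promoted_dataset_evaluator    # only if it belongs in the experiments UI
template: |
  You are evaluating whether a response is polite.

  <input>{{input}}</input>
  <output>{{output}}</output>

  Respond with "polite" or "impolite".
```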
In the prompt template, use:

- XML-style tags (<context>, <output>) for clear data formatting
- {{placeholder}} (Mustache syntax) for template variables

Then run make codegen-prompts.
This generates code in three places:
- packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs/ (Python)
- src/phoenix/__generated__/classification_evaluator_configs/ (Python, server copy)
- js/packages/phoenix-evals/src/__generated__/default_templates/ (TypeScript)

Verify the generated files look correct before moving on.
Create packages/phoenix-evals/src/phoenix/evals/metrics/{name}.py.
Read correctness.py in that directory — it's the canonical example. Your evaluator follows the same pattern: subclass ClassificationEvaluator, pull constants from the generated config, define a Pydantic input schema with fields matching your template placeholders.
After creating the file, add it to the exports in metrics/__init__.py — both the import and the __all__ list. Read the current __init__.py to see the existing pattern.
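The key invariant in this step is that the input schema's field names mirror the template's Mustache placeholders exactly. A self-contained sketch of that contract (the template text and field set here are hypothetical, not Phoenix's actual config; the real schema is a Pydantic model as in correctness.py):

```python
import re

# Hypothetical template mirroring a generated config (names are illustrative).
TEMPLATE = """You are evaluating correctness.

<input>{{input}}</input>
<output>{{output}}</output>
<reference>{{reference}}</reference>

Answer "correct" or "incorrect"."""


def template_placeholders(template: str) -> set[str]:
    """Extract the {{placeholder}} names from a Mustache-style template."""
    return set(re.findall(r"\{\{(\w+)\}\}", template))


# The input schema's field names must match the placeholders one-to-one;
# this check mirrors the contract the Pydantic schema enforces at call time.
SCHEMA_FIELDS = {"input", "output", "reference"}
assert template_placeholders(TEMPLATE) == SCHEMA_FIELDS
```

If a placeholder and a schema field drift apart, the evaluator silently renders an empty slot in the prompt, so a check like this is worth keeping in mind while writing the model.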
Create js/packages/phoenix-evals/src/llm/create{Name}Evaluator.ts.
Read createCorrectnessEvaluator.ts — it's the canonical example. The pattern is a factory function that wraps createClassificationEvaluator with defaults from the generated config.
Then:
- Add the export to js/packages/phoenix-evals/src/llm/index.ts
- Read createFaithfulnessEvaluator.test.ts for the test pattern
- Run cd js && pnpm build
Fix any TypeScript errors before proceeding.
Create js/benchmarks/evals-benchmarks/src/{name}_benchmark.ts.
Read existing benchmarks in that directory to match the current patterns:
- tool_invocation_benchmark.ts — confusion matrix printing, multi-category analysis

The task function must return input and output text in its result so the failed examples printer has access to them.
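For reference, confusion-matrix printing boils down to tallying (expected, predicted) label pairs. The benchmarks themselves are TypeScript; this Python sketch (with made-up results) just illustrates the tally:

```python
from collections import Counter

# Hypothetical benchmark results: (expected, predicted) label pairs.
results = [
    ("correct", "correct"),
    ("correct", "incorrect"),
    ("incorrect", "incorrect"),
    ("incorrect", "incorrect"),
]

# Each (expected, predicted) pair is one cell of the confusion matrix.
matrix = Counter(results)
labels = ["correct", "incorrect"]
for expected in labels:
    row = [matrix[(expected, predicted)] for predicted in labels]
    print(expected, row)
```

Off-diagonal cells are the misclassifications worth inspecting in Step 7.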
Consider using a separate agent session for synthetic dataset generation if the examples need realistic domain-specific content — this keeps the dataset creation focused and avoids context-switching.
# Terminal 1: Start Phoenix
PHOENIX_WORKING_DIR=/tmp/phoenix-test phoenix serve
# Terminal 2: Run the benchmark
cd js/benchmarks/evals-benchmarks
pnpm tsx src/{name}_benchmark.ts
Target >80% accuracy. If accuracy is low, look at the failed examples output to decide whether to adjust the prompt (Step 1) or the benchmark examples (Step 6). Iterate until accuracy is acceptable.
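Accuracy here is plain exact-match agreement between expected and predicted labels, and inspecting the misses is what drives iteration. A minimal sketch with made-up rows (field names are illustrative, but note each row carries the input/output text the task function returned):

```python
# Hypothetical benchmark rows: each carries the text the task function
# returned plus the expected and predicted labels.
rows = [
    {"input": "2+2?", "output": "4",
     "expected": "correct", "predicted": "correct"},
    {"input": "Capital of France?", "output": "Lyon",
     "expected": "incorrect", "predicted": "correct"},
]

# Exact-match accuracy over all rows.
accuracy = sum(r["expected"] == r["predicted"] for r in rows) / len(rows)
failed = [r for r in rows if r["expected"] != r["predicted"]]

print(f"accuracy={accuracy:.0%}")
for r in failed:
    # The failed-examples printer needs the input/output text — hence the
    # Step 6 requirement that the task function return them.
    print(r["input"], "->", r["output"])
```

When accuracy is low, the failed rows tell you whether the judge prompt is misreading the data (fix Step 1) or the examples themselves are ambiguous (fix Step 6).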
Create docs/phoenix/evaluation/pre-built-metrics/{name}.mdx.
Read faithfulness.mdx in that directory — it's the template. Follow the same section structure.
After creating the docs page, update these three files:
- docs.json — add the page to the Evaluation > Pre-built Metrics nav group
- docs/phoenix/evaluation/pre-built-metrics.mdx — add a card to the landing page grid
- docs/phoenix/sitemap.xml — add the new URL

Read each file to see the existing pattern before editing.
Before calling it done, verify:
- make codegen-prompts ran successfully
- Python evaluator exported from metrics/__init__.py
- TypeScript factory exported from llm/index.ts
- TypeScript build passes (cd js && pnpm build)
- docs.json nav updated

After completing the workflow, verify these instructions matched reality:
- Did make codegen-prompts generate to different locations than listed above?

If anything drifted, update this SKILL.md before finishing so the next person (or agent) doesn't hit the same surprises.