스킬 파일

Eval Bootstrap — Generate Evaluator Code from Production Traces

Name: Eval Bootstrap — Generate Evaluator Code from Production Traces
Author: managementmaars-art

Bootstrap SDK-based evaluators from production traces. Use when user says "bootstrap evaluators", "generate evaluators", "create evals from traces", "eval bootstrap", "write evaluators", "build eval suite", or wants to generate BaseEvaluator/LLMJudge code from production LLM trace data. Works with ml_app and optional RCA report or failure hypothesis.

managementmaars-art0 스타2026. 4. 16.

카테고리: 테스팅

스킬 내용

Given a sample of production LLM traces, analyze input/output patterns and quality dimensions, then generate ready-to-use evaluator code using the Datadog Evals SDK. The output is a .py file containing BaseEvaluator subclasses and/or LLMJudge instances that the user can run in LLM Experiments.

Usage

/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only]

Arguments: $ARGUMENTS

Inputs

Input	Required	Default	Description
`ml_app`	Yes	—	ML application to scope traces
`timeframe`	No	`now-7d`	How far back to look
`rca_report`	No	—	Failure taxonomy from skill, or a free-text failure hypothesis

관련 스킬

Eval Bootstrap — Generate Evaluator Code from Production Traces | Skills Pool

eval-trace-rca

Tool	Purpose
`search_llmobs_spans`	Find spans by eval presence, tags, span kind, query syntax. Paginate with cursor.
`get_llmobs_span_details`	Metadata, evaluations (scores, labels, reasoning), and `content_info` map showing available fields + sizes.
`get_llmobs_span_content`	Actual content for a span field. Supports JSONPath via `path` param for targeted extraction.
`get_llmobs_trace`	Full trace hierarchy as span tree with span counts by kind.
`get_llmobs_agent_loop`	Chronological agent execution timeline (LLM calls, tool invocations, decisions).
`list_llmobs_evals`	List all evaluators (OOTB + custom) configured for `ml_app`, with `enabled` status. Call once in Phase 0 to map existing coverage before proposing new evaluators.
`get_llmobs_eval_config`	Full configuration (prompt, model, structured output) for a custom/BYOP evaluator. Use in Phase 0 to understand what a custom eval measures. Not supported for `source=ootb` — skip those.

Field	Path	What you get
`messages`	`$.messages[0]`	System prompt (first message, usually `system` role)
`messages`	`$.messages[-1]`	Last assistant response
`messages`	(no path)	Full conversation including tool calls
`input` / `output`	—	Span I/O
`documents`	—	Retrieved documents (RAG apps)
`metadata`	—	Custom metadata (prompt versions, feature flags, user segments)

# Core classes
from ddtrace.llmobs._experiment import BaseEvaluator, EvaluatorContext, EvaluatorResult

# LLM-as-judge
from ddtrace.llmobs._evaluators.llm_judge import (
    LLMJudge,
    BooleanStructuredOutput,
    ScoreStructuredOutput,
    CategoricalStructuredOutput,
)

# Built-in evaluators (use only if needed)
from ddtrace.llmobs._evaluators.format import JSONEvaluator, LengthEvaluator
from ddtrace.llmobs._evaluators.string_matching import StringCheckEvaluator, RegexMatchEvaluator

@dataclass(frozen=True)
class EvaluatorContext:
    input_data: dict[str, Any]          # Task inputs (from dataset record, NOT from span)
    output_data: Any                     # Task output (from task function return, NOT from span)
    expected_output: Optional[JSONType] = None  # Ground truth (if available)
    metadata: dict[str, Any] = {}        # Additional metadata
    span_id: Optional[str] = None        # LLMObs span ID
    trace_id: Optional[str] = None       # LLMObs trace ID

EvaluatorResult(
    value=...,                    # Required. JSONType (str, int, float, bool, None, list, dict)
    reasoning="...",              # Optional. Explanation string
    assessment="pass" or "fail",  # Optional. Pass/fail assessment
    metadata={...},              # Optional. Evaluation metadata dict
    tags={...},                  # Optional. Tags dict
)

judge = LLMJudge(
    user_prompt="...",              # Required. Supports {{template_vars}}
    system_prompt="...",            # Optional. Does NOT support template vars
    structured_output=...,          # Optional. Boolean/Score/Categorical output, or a dict for custom JSON schema
    provider="openai",              # "openai" | "anthropic" | "azure_openai" | "vertexai" | "bedrock"
    model="gpt-4o",                # Model identifier
    model_params={"temperature": 0.0},  # Optional. Passed to LLM API
    name="eval_name",              # Optional. Must match ^[a-zA-Z0-9_-]+$
)

BooleanStructuredOutput(
    description="Whether the response is factually accurate",
    reasoning=True,                    # Include reasoning field in LLM response
    reasoning_description=None,        # Optional custom description for reasoning field
    pass_when=True,                    # True → pass when true, False → pass when false, None → no assessment
)

ScoreStructuredOutput(
    description="Helpfulness score",
    min_score=1,                       # Minimum possible score
    max_score=10,                      # Maximum possible score
    reasoning=True,
    reasoning_description=None,
    min_threshold=7,                   # Scores >= 7 pass (optional)
    max_threshold=None,                # Scores <= N pass (optional)
)

CategoricalStructuredOutput(
    categories={
        "correct": "The response correctly answers the question",
        "partially_correct": "The response is partially correct but missing key information",
        "incorrect": "The response is factually wrong or irrelevant",
    },
    reasoning=True,
    reasoning_description=None,
    pass_values=["correct"],           # Which categories count as passing (optional)
)

# Pass a raw dict as structured_output — used as the JSON schema directly
structured_output={
    "type": "object",
    "properties": {
        "relevance": {"type": "boolean", "description": "Whether the response addresses the question"},
        "confidence": {"type": "number", "description": "Confidence score (0.0 to 1.0)"},
        "reasoning": {"type": "string", "description": "Explanation for the evaluation"},
    },
    "required": ["relevance", "confidence", "reasoning"],
    "additionalProperties": False,
}

class MyEvaluator(BaseEvaluator):
    def __init__(self, name=None, ...custom_params...):
        super().__init__(name=name)
        self._param = ...  # Store config as private attrs

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # Access: context.input_data, context.output_data, context.expected_output, context.metadata
        # Must NOT modify self attributes (thread safety)
        passed = ...  # Your logic here
        return EvaluatorResult(
            value=passed,
            reasoning="...",
            assessment="pass" if passed else "fail",
        )

# Validate JSON syntax + optional required keys
JSONEvaluator(required_keys=["name", "age"], output_extractor=None, name=None)

# Validate length (characters, words, or lines)
LengthEvaluator(count_by="words", min_length=10, max_length=500, output_extractor=None, name=None)
# count_by: "characters" | "words" | "lines"

# String matching
StringCheckEvaluator(operation="contains", expected="success", case_sensitive=False, name=None)
# operation: "eq" | "ne" | "contains" | "icontains"

# Regex matching
RegexMatchEvaluator(pattern=r"\d{4}-\d{2}-\d{2}", match_mode="search", name=None)
# match_mode: "search" | "match" | "fullmatch"

Signal	Evaluator Type
Output must be valid JSON	`JSONEvaluator`
Output must match a regex pattern	`RegexMatchEvaluator`
Output has length constraints	`LengthEvaluator`
Output must contain/not contain specific strings	`StringCheckEvaluator`
Semantic quality judgment (tone, accuracy, completeness)	`LLMJudge` + `BooleanStructuredOutput`
Graded quality on a scale	`LLMJudge` + `ScoreStructuredOutput`
Classification into categories	`LLMJudge` + `CategoricalStructuredOutput`
Multi-dimensional judgment (evaluate several aspects at once)	`LLMJudge` + custom JSON schema `dict`
Complex domain logic combining multiple checks	`BaseEvaluator` subclass

Mode	Signal	Behavior
Cold Start	Only `ml_app` provided (no RCA, no hypothesis)	Full open discovery — understand what the app does, identify quality dimensions worth measuring, propose evals for coverage
From RCA	Conversation contains an RCA report or user provides a failure hypothesis	Skip open discovery — use existing failure taxonomy as eval targets

Sample the app: search_llmobs_spans(ml_app=<ml_app>, root_spans_only=true, limit=50, from=<timeframe>, query="@status:ok"). Filter by @status:ok — error spans have no output to evaluate.
Profile the app and identify evaluation target spans: Call get_llmobs_span_details for span_ids grouped by trace_id. Inspect content_info to classify:

Signal App Profile
content_info has messages LLM/chat app
content_info has documents RAG app
Spans include agent kind Agent app
content_info has metadata Has custom metadata

For agent/multi-step apps, also call get_llmobs_trace on 2-3 traces to see the full span hierarchy. Compare content_info between the root span and its sub-spans (especially LLM sub-spans). The root span typically has a summary view (user query → final answer), while LLM sub-spans have the full picture (system prompt, tool call results, reasoning chain). Note which span level has the richest signal for each quality dimension — this determines the evaluation target span for each evaluator.

Extract content and identify targets: Call get_llmobs_span_content for representative spans. Fetch fields based on app profile:

App Profile	Fields to Fetch
LLM/chat	`messages` (`path=$.messages[0]` for system prompt), `output`
RAG	`documents`, `input`, `output`
Agent	`get_llmobs_agent_loop` for the agent span, then `messages` for detail
Any with metadata	`metadata`

Issue all calls in a single message. As you read, note quality patterns: what does "success" look like? What variance exists across outputs? Each observed quality dimension becomes an eval target, with the traces you've just read as evidence. Also look for safety signals — scope violations, sensitive data in outputs, out-of-character responses — and propose a safety evaluator if you find them.

## Proposed Evaluator Suite

**App profile**: {LLM | RAG | Agent | Multi-agent}
**Entry mode**: {cold_start | from_rca}

| # | Name | Type | Measures | Pass Criteria |
|---|------|------|----------|---------------|
| 1 | task_completion | LLMJudge (Boolean) | Whether the task was completed | pass_when=True |
| 2 | ... | ... | ... | ... |

For each evaluator:
- **{name}**: {what it measures}
  - Target span: {which span's data it was designed for}
  - Rationale: {which quality dimension it covers and why}
  - Evidence: [Trace {id_short}](https://app.datadoghq.com/llm/traces?query=trace_id:{full_id})

Ground prompts in traces: LLMJudge system prompts and user prompts must reference patterns actually observed in production traces. Never write generic prompts like "evaluate whether the response is good" — ground them in the app's domain, observed failure patterns, and success criteria.
Keep template variables generic, add comments for context: Use {{input_data}} and {{output_data}} as top-level placeholders in prompts — do NOT reference nested span paths like {{input_data.messages[-1].content}}. The evaluator's data comes from the user's dataset and task function, not directly from spans. Instead, add a comment above each evaluator describing what data it was designed for and what the user should adapt:
```
# Designed for: input_data = user query, output_data = assistant response text
# Observed from: root agent span (input.value → output.value)
# If your dataset uses a different structure, adapt the prompt references below.
```
Use the narrowest evaluator type: If a check can be done with JSONEvaluator, RegexMatchEvaluator, StringCheckEvaluator, or LengthEvaluator, do NOT use an LLMJudge. Code-based evaluators are faster, cheaper, and deterministic.
BaseEvaluator subclasses:
- Call super().__init__(name=name) in __init__
- Return EvaluatorResult from evaluate()
- Do NOT modify instance attributes in evaluate() (thread safety)
Names: Must match ^[a-zA-Z0-9_-]+$. Use snake_case descriptive names.
Imports: Consolidate at the top of the file. Only import classes that are actually used.
Evaluator list: Collect all evaluators into an evaluators list at the bottom of the file.
Anonymize PII: Strip emails, names, and sensitive data from any trace content included in LLMJudge prompts or the header comment.

## Generated Evaluators

Wrote {N} evaluators to `{output_path}`:

| # | Name | Type | Covers |
|---|------|------|--------|
| 1 | ... | ... | ... |

### Next Steps

1. **Review**: Check the generated prompts and criteria match your expectations
2. **Test offline**: Use `LLMObs.experiment(evaluators=evaluators)` to batch-evaluate against a labeled dataset and verify scores

"""
Auto-generated evaluators for {ml_app}
Generated: {YYYY-MM-DD} by eval-bootstrap

App profile: {LLM | RAG | Agent | Multi-agent}

Quality dimensions covered:
  - {target_name}: {description}
    Evidence: https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}
  ...

Usage:
    from ddtrace.llmobs import LLMObs

    experiment = LLMObs.experiment(
        name="my-experiment",
        task=my_task_fn,
        dataset=dataset,
        evaluators=evaluators,
    )
    experiment.run()
"""

{imports — only what is used}


# --- Outcome Evaluators ---

{evaluator code}


# --- Format Evaluators ---

{evaluator code}


# --- Safety Evaluators ---

{evaluator code}


# --- Evaluator Suite ---

evaluators = [
    {eval_1_variable_name},
    {eval_2_variable_name},
    ...
]

{
  "schema_version": "1",
  "generated_at": "<ISO 8601 UTC>",
  "generated_by": "eval-bootstrap",
  "app": {
    "ml_app": "<string>",
    "app_type": "LLM | RAG | Agent | Multi-agent",
    "trace_window": "<timeframe param, e.g. now-7d>",
    "trace_count": "<integer>"
  },
  "evaluators": [
    {
      "name": "snake_case_name",
      "category": "outcome | format | safety",
      "type": "llm_judge | code_check",
      "description": "<1-2 sentence plain-language description>",
      "target_span": "<which span: root, llm sub-span, etc.>",
      "scoring": {
        "scale": "boolean | score_1_10 | categorical",
        "categories": ["<only present when scale=categorical>"],
        "pass_criteria": "<human-readable: true, >= 7, in [correct], etc.>"
      },
      "rubric": "<full prompt text for llm_judge; null for code_check>",
      "implementation_hints": {
        "type_if_code_check": "json_valid | regex | contains | length_words | null",
        "pattern_if_code_check": "<pattern string or null>",
        "notes": "<optional framework-agnostic implementation guidance>"
      },
      "evidence": [
        {
          "trace_id": "<32-char hex>",
          "span_id": "<16-char hex>",
          "url": "https://app.datadoghq.com/llm/traces?query=trace_id:<trace_id>",
          "observation": "<why this trace illustrates the evaluator>"
        }
      ]
    }
  ],
  "sample_records": [
    {
      "trace_id": "<string>",
      "span_id": "<string>",
      "input": {},
      "output": "<string>",
      "suggested_labels": {
        "<evaluator_name>": "pass | fail | <score>"
      }
    }
  ]
}

## Generated Eval Spec

Wrote `./evals/{ml_app}_eval_spec.json`:

- **{N} evaluators** ({outcome_count} outcome, {format_count} format, {safety_count} safety)
- **{M} sample records** with suggested labels

| # | Name | Category | Type | Pass Criteria |
|---|------|----------|------|---------------|
| 1 | ... | ... | ... | ... |

### Next Steps

1. **Review**: Open `./evals/{ml_app}_eval_spec.json` and verify the rubrics match your expectations
2. **Implement**: Use the `rubric` field to configure evaluators in your framework of choice:
   - OpenAI Evals: use `rubric` as a model-graded criterion
   - Braintrust: create an LLM scorer with the rubric text
   - Promptfoo: use as an `llm-rubric` assertion
   - Custom code: call your LLM API with the rubric and parse the structured output
3. **Label**: `suggested_labels` are Claude's best guesses from trace inspection — verify against ground truth before using as training data

Signal	App Profile
`content_info` has `messages`	LLM/chat app
`content_info` has `documents`	RAG app
Spans include `agent` kind	Agent app
`content_info` has `metadata`	Has custom metadata

Eval Bootstrap — Generate Evaluator Code from Production Traces

Usage

Inputs

Eval Bootstrap — Generate Evaluator Code from Production Traces

Usage

Inputs

Available Tools

Key get_llmobs_span_content Patterns

How to Use search_llmobs_spans

Parallelization Rules

Evaluator SDK Reference

Imports

EvaluatorContext (what evaluate() receives)

EvaluatorResult (what evaluate() returns)

LLMJudge — LLM-as-Judge Evaluator

Structured Output Types

LLMJudge Prompt Guidelines

BaseEvaluator — Custom Code-Based Evaluator

Built-in Evaluators

Evaluator Type Decision Matrix

Source Verification

Workflow

Phase 0: Resolve Inputs & Entry Mode

Phase 1: Explore Traces & Identify Eval Targets

Cold Start Path

From RCA Path

Phase 2: Propose Evaluator Suite

Deduplication Against Existing Coverage

MANDATORY CHECKPOINT

Phase 3: Generate Output

Phase 3A: Generate & Write Evaluator Code

Code Generation Rules

Write the file

Output Format

Phase 3B: Generate & Write Eval Spec JSON

JSON Schema

Field Notes

Writing Instructions

Operating Rules

Test

Feature Flags

Unit Tests

Integration Tests

Write Frontend Tests

Golang Testing

Key `get_llmobs_span_content` Patterns

How to Use `search_llmobs_spans`

EvaluatorContext (what `evaluate()` receives)

EvaluatorResult (what `evaluate()` returns)