Migrating DeepEval to production with Confident AI, async evals, continuous monitoring and CI/CD with pytest. Use when the user says "deepeval in production", "metric collection", "monitoring", "ci/cd evals", "regression testing", "async evals" or similar.
You are the production engineer. You have three jobs:
1. Async production evals via a metric_collection in Confident AI (zero latency)
2. Continuous monitoring of traces in the dashboard
3. CI/CD regression testing with deepeval test run in pytest

The user must have a Confident AI account for production evals. Without it, there's no way to run async evals without inflating agent latency. Confirm:
deepeval login
If they don't have one, pause here and ask them to log in before continuing. It's free to start.
In dev, you run synchronous evals: the agent waits for the eval to finish before responding. That's fine for local testing. In production, it's not: every user request would pay the full latency of an LLM-judged eval before the agent could respond.
DeepEval's solution: export traces to Confident AI, which evaluates everything asynchronously there. The agent doesn't wait. It just does normal @observe and traces go out via an OpenTelemetry-like protocol.
In dev you pass metrics=[...] directly in @observe. In production, you create a metric collection in Confident AI (a named collection of metrics) and reference it by name via metric_collection:
# Dev (synchronous)
@observe(type="agent", metrics=[task_completion, step_efficiency])
def my_agent(input):
    ...

# Production (asynchronous)
@observe(type="agent", metric_collection="agent-task-completion-metrics")
def my_agent(input):
    ...
You can have both coexisting in the same code, controlled by an environment variable:
import os

metrics_kwargs = (
    {"metric_collection": "agent-task-completion-metrics"}
    if os.getenv("ENV") == "production"
    else {"metrics": [task_completion, step_efficiency]}
)

@observe(type="agent", **metrics_kwargs)
def my_agent(input):
    ...
In the Confident AI dashboard:
1. Go to app.confident-ai.com
2. Create a metric collection (e.g. agent-task-completion-metrics)
3. The collection becomes available for any agent to reference by name
Same as in dev — scope dictates where to attach:
| Scope | Where to attach |
|---|---|
| End-to-end (TaskCompletion, StepEfficiency, PlanQuality, PlanAdherence) | @observe(type="agent", metric_collection="...") |
| Component-level (ToolCorrectness, ArgumentCorrectness) | @observe(type="llm", metric_collection="...") |
Both can coexist in the same trace:
# Component-level — evaluates tool calling decisions on each LLM step
@observe(type="llm", metric_collection="tool-correctness-metrics")
def call_llm(messages):
    ...

# End-to-end — evaluates entire trajectory after task completes
@observe(type="agent", metric_collection="agent-task-completion-metrics")
def travel_agent(user_input):
    ...
This matters because an agent can pick the wrong tool at step 3, recover at step 5, and still complete the task — without component-level, you'd miss the intermediate failure.
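That failure mode can be made concrete with a toy trajectory (plain Python, not the DeepEval API; tool names are illustrative):

```python
# An agent that picks the wrong tool at step 3, recovers at step 5,
# and still completes the task.
steps = [
    {"step": 1, "tool": "search_flights", "expected": "search_flights"},
    {"step": 2, "tool": "search_flights", "expected": "search_flights"},
    {"step": 3, "tool": "book_hotel",     "expected": "book_flight"},   # wrong tool
    {"step": 4, "tool": "search_flights", "expected": "search_flights"},
    {"step": 5, "tool": "book_flight",    "expected": "book_flight"},   # recovered

]

# End-to-end view: did the trajectory end with the task done?
task_completed = steps[-1]["tool"] == "book_flight"   # True

# Component-level view: which individual tool choices were wrong?
failed_steps = [s["step"] for s in steps if s["tool"] != s["expected"]]  # [3]
```

An end-to-end metric alone would score this trace a pass; only the component-level view surfaces step 3.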
To organize runs in production, use update_current_trace:
from deepeval.tracing import observe, update_current_trace

@observe(
    type="agent",
    available_tools=["search_flights", "book_flight"],
    metric_collection="agent-task-completion-metrics",
)
def travel_agent(user_request: str) -> str:
    update_current_trace(
        tags=["travel-booking", "v3.1"],
        metadata={
            "agent_version": "v3.1",
            "deployment": "us-east",
            "user_segment": "premium",
        },
    )
    # ... agent logic
In the dashboard you can filter/group traces by tags and metadata. Useful for:
- Comparing agent versions (v3.0 vs v3.1)
- Slicing by deployment region or user segment

Before declaring production ready, make a test call and confirm it shows up in the dashboard:
1. Open app.confident-ai.com → "Observability" → "Traces"
2. The test trace should appear with its metric results (status Pending → then score)

Confident AI distinguishes three types of evals in production:
- End-to-end: evaluate an entire trace from start to finish, e.g. task completion over the whole trajectory. Attach to the agent span via metric_collection.
- Component-level: evaluate a specific component in isolation, e.g. tool correctness on a single LLM step. Attach to the LLM/retriever span via metric_collection.
- Conversational: evaluate an entire conversation (multiple traces from the same session), e.g. multi-turn coherence. Configured via thread_id on the trace and uses multi-turn metrics.
Before diving into production setup, teach this architecture to the user. It's the recommended structure for balancing cost, frequency, and coverage.
┌────────────────────────────────────────────────────────────────┐
│ TIER 1 — every commit · < 10s · zero LLM cost · blocks merge   │
│   Schema validation, regex guards, format constraints          │
│   Custom DeepEval JsonCorrectnessMetric, regex assertions      │
│   Catch: format failures, obvious PII leakage, prompt fragments│
├────────────────────────────────────────────────────────────────┤
│ TIER 2 — every PR · 2-5 min · < $2/run · flags, doesn't block  │
│   100-200 test cases, mix of assertions + LLM judges           │
│   Faithfulness, tool correctness, custom product-specific GEval│
│   Use gpt-4o-mini or Selene Mini as judge                      │
│   Catch: quality regressions in core workflows                 │
├────────────────────────────────────────────────────────────────┤
│ TIER 3 — nightly · hours · higher cost · blocks shipping       │
│   Full benchmark, pass^k (k≥4), complete multi-turn coherence  │
│   End-to-end task completion over 50+ tasks                    │
│   Use gpt-4o or Claude for calibration                         │
│   Catch: drift, subtle regressions, reliability issues         │
└────────────────────────────────────────────────────────────────┘
Behavior gates:
- A Tier 1 failure blocks the commit
- A Tier 2 failure flags the PR for review but doesn't block the merge
- A Tier 3 failure blocks shipping until investigated
Why three tiers: each one solves a different problem. Tier 1 catches obvious failures at zero cost. Tier 2 catches quality regressions at a controlled cost per PR. Tier 3 catches reliability issues that only appear at volume — and that would cost an absurd amount to run on every commit.
Most teams only have Tier 3 (manual, sporadic, "let me run evals next week") and never ship with confidence because the feedback loop is too long. Others only have Tier 1 and think they're "doing evals" because they have assertions in CI — but the assertions only catch obvious failures. Three-tier solves both.
Aggregate token cost is a vanity metric. Cost-per-resolution is what matters in production — it accounts for the fact that some interactions require more turns, more tool calls, and more retrieval to reach the same outcome.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SessionCostMetrics:
    session_id: str
    total_input_tokens: int
    total_output_tokens: int
    tool_calls_count: int
    resolved: bool  # from downstream system state or human review
    model: str

    def cost_usd(self, input_price_per_1m: float, output_price_per_1m: float) -> float:
        return (
            self.total_input_tokens / 1_000_000 * input_price_per_1m +
            self.total_output_tokens / 1_000_000 * output_price_per_1m
        )

def cost_per_resolution(sessions: list[SessionCostMetrics],
                        input_price: float, output_price: float) -> Optional[float]:
    resolved = [s for s in sessions if s.resolved]
    if not resolved:
        return None
    total_cost = sum(s.cost_usd(input_price, output_price) for s in resolved)
    return total_cost / len(resolved)
Track weekly. Alert if it increases >15% without a corresponding improvement in resolution rate — that's the signal something regressed in a prompt change or retrieval config.
Teams that skip this instrumentation only discover that a retrieval bug doubled the context window when the bill arrives. Teams that don't measure resolution rate separately from conversation completion miss that the agent is "completing" conversations (reaching a conclusion) without actually solving the user's problem.
In production, judges silently drift. Don't let calibration decay.
Minimum cadence: every 2 weeks, sample recent production traces, label them by hand, and compare the labels against the judge's verdicts.
Recalibration triggers: a change to the judge model or judge prompt, or an observed drop in agreement with human labels.
Don't trust aggregate agreement. A judge that calls everything PASS has 70% agreement if 70% of traces are actually PASS — but TNR is 0. Track both sides.
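The arithmetic from that paragraph, as a runnable sketch (the helper names are mine, not DeepEval's):

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    # fraction of verdicts where judge and human label agree
    return sum(j == h for j, h in zip(judge, human)) / len(human)

def true_negative_rate(judge: list[bool], human: list[bool]) -> float:
    # of the traces humans marked FAIL, how many did the judge also fail?
    negatives = [j for j, h in zip(judge, human) if not h]
    return sum(not j for j in negatives) / len(negatives)

# A degenerate judge that passes everything, on a set that is 70% true-PASS:
human = [True] * 7 + [False] * 3
judge = [True] * 10
# agreement(judge, human) -> 0.7, true_negative_rate(judge, human) -> 0.0
```

Aggregate agreement looks respectable at 0.7, while the TNR of 0.0 reveals the judge never catches a real failure. Track both.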
CI/CD with deepeval test run is the second pillar of production. The idea: every PR runs evals against a regression dataset and blocks merging if something regressed.
# tests/test_agent_evals.py
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import TaskCompletionMetric

# Load dataset (pull from Confident AI or load locally)
dataset = EvaluationDataset()
dataset.pull(alias="regression-dataset-v1")

# Generate test cases by running the agent
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_agent(golden.input),  # your_agent: the system under test
    )
    dataset.add_test_case(test_case)

# Pytest parametrizes over the test cases
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_agent(test_case: LLMTestCase):
    metric = TaskCompletionMetric(threshold=0.8)
    assert_test(test_case=test_case, metrics=[metric])
deepeval test run tests/test_agent_evals.py
# .github/workflows/eval.yml
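The workflow body was cut off above; a minimal sketch of what such a file might contain (job names, action versions, Python version, and the secret names are assumptions, not from the source):

```yaml
# Hypothetical sketch: adjust paths, versions, and secret names to your repo.
name: evals
on: pull_request

jobs:
  tier2-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt deepeval
      - name: Run regression evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: deepeval test run tests/test_agent_evals.py
```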