Migrating DeepEval to production with Confident AI, async evals, continuous monitoring and CI/CD with pytest. Use when the user says "deepeval in production", "metric collection", "monitoring", "ci/cd evals", "regression testing", "async evals" or similar.
You are the production engineer. You have three jobs:
1. Async production evals via a metric_collection in Confident AI (zero latency)
2. Continuous monitoring of traces in the dashboard
3. CI/CD regression testing with deepeval test run in pytest

The user must have a Confident AI account for production evals. Without it, there's no way to run async evals without inflating agent latency. Confirm:
deepeval login
If they don't have one, pause here and ask them to log in before continuing. It's free to start.
In dev, you run synchronous evals: the agent waits for the eval to finish before responding. That's fine for local testing. In production, it's not: every user request would pay the full latency of an LLM-judged eval before the agent could respond.
DeepEval's solution: export traces to Confident AI, which evaluates everything asynchronously there. The agent doesn't wait. It just does normal @observe and traces go out via an OpenTelemetry-like protocol.
In dev you pass metrics=[...] directly in @observe. In production, you create a metric collection in Confident AI (a named collection of metrics) and reference it by name via metric_collection:
# Dev (synchronous)
@observe(type="agent", metrics=[task_completion, step_efficiency])
def my_agent(input):
    ...

# Production (asynchronous)
@observe(type="agent", metric_collection="agent-task-completion-metrics")
def my_agent(input):
    ...
You can have both coexisting in the same code, controlled by an environment variable:
import os

metrics_kwargs = (
    {"metric_collection": "agent-task-completion-metrics"}
    if os.getenv("ENV") == "production"
    else {"metrics": [task_completion, step_efficiency]}
)

@observe(type="agent", **metrics_kwargs)
def my_agent(input):
    ...
In the Confident AI dashboard:
1. Go to app.confident-ai.com
2. Create a metric collection (e.g. agent-task-completion-metrics)
3. The collection becomes available for any agent to reference by name
Same as in dev — scope dictates where to attach:
| Scope | Where to attach |
|---|---|
| End-to-end (TaskCompletion, StepEfficiency, PlanQuality, PlanAdherence) | @observe(type="agent", metric_collection="...") |
| Component-level (ToolCorrectness, ArgumentCorrectness) | @observe(type="llm", metric_collection="...") |
Both can coexist in the same trace:
# Component-level — evaluates tool calling decisions on each LLM step
@observe(type="llm", metric_collection="tool-correctness-metrics")
def call_llm(messages):
    ...

# End-to-end — evaluates entire trajectory after task completes
@observe(type="agent", metric_collection="agent-task-completion-metrics")
def travel_agent(user_input):
    ...
This matters because an agent can pick the wrong tool at step 3, recover at step 5, and still complete the task — without component-level, you'd miss the intermediate failure.
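That failure mode can be made concrete with a toy trajectory (plain Python, not the DeepEval API; tool names are illustrative):

```python
# An agent that picks the wrong tool at step 3, recovers at step 5,
# and still completes the task.
steps = [
    {"step": 1, "tool": "search_flights", "expected": "search_flights"},
    {"step": 2, "tool": "search_flights", "expected": "search_flights"},
    {"step": 3, "tool": "book_hotel",     "expected": "book_flight"},   # wrong tool
    {"step": 4, "tool": "search_flights", "expected": "search_flights"},
    {"step": 5, "tool": "book_flight",    "expected": "book_flight"},   # recovered

]

# End-to-end view: did the trajectory end with the task done?
task_completed = steps[-1]["tool"] == "book_flight"   # True

# Component-level view: which individual tool choices were wrong?
failed_steps = [s["step"] for s in steps if s["tool"] != s["expected"]]  # [3]
```

An end-to-end metric alone would score this trace a pass; only the component-level view surfaces step 3.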
To organize runs in production, use update_current_trace:
from deepeval.tracing import observe, update_current_trace

@observe(
    type="agent",
    available_tools=["search_flights", "book_flight"],
    metric_collection="agent-task-completion-metrics",
)
def travel_agent(user_request: str) -> str:
    update_current_trace(
        tags=["travel-booking", "v3.1"],
        metadata={
            "agent_version": "v3.1",
            "deployment": "us-east",
            "user_segment": "premium",
        },
    )
    # ... agent logic
In the dashboard you can filter/group traces by tags and metadata. Useful for:
- Comparing agent versions (v3.0 vs v3.1)
- Slicing by deployment region or user segment

Before declaring production ready, make a test call and confirm it shows up in the dashboard:
1. Open app.confident-ai.com → "Observability" → "Traces"
2. The test trace should appear with its metric results (status Pending → then score)

Confident AI distinguishes three types of evals in production:
- End-to-end: evaluate an entire trace from start to finish, e.g. task completion over the whole trajectory. Attach to the agent span via metric_collection.
- Component-level: evaluate a specific component in isolation, e.g. tool correctness on a single LLM step. Attach to the LLM/retriever span via metric_collection.
- Conversational: evaluate an entire conversation (multiple traces from the same session), e.g. multi-turn coherence. Configured via thread_id on the trace and uses multi-turn metrics.
Before diving into production setup, teach this architecture to the user. It's the recommended structure for balancing cost, frequency, and coverage.
┌────────────────────────────────────────────────────────────────┐
│ TIER 1 — every commit · < 10s · zero LLM cost · blocks merge   │
│   Schema validation, regex guards, format constraints          │
│   Custom DeepEval JsonCorrectnessMetric, regex assertions      │
│   Catch: format failures, obvious PII leakage, prompt fragments│
├────────────────────────────────────────────────────────────────┤
│ TIER 2 — every PR · 2-5 min · < $2/run · flags, doesn't block  │
│   100-200 test cases, mix of assertions + LLM judges           │
│   Faithfulness, tool correctness, custom product-specific GEval│
│   Use gpt-4o-mini or Selene Mini as judge                      │
│   Catch: quality regressions in core workflows                 │
├────────────────────────────────────────────────────────────────┤
│ TIER 3 — nightly · hours · higher cost · blocks shipping       │
│   Full benchmark, pass^k (k≥4), complete multi-turn coherence  │
│   End-to-end task completion over 50+ tasks                    │
│   Use gpt-4o or Claude for calibration                         │
│   Catch: drift, subtle regressions, reliability issues         │
└────────────────────────────────────────────────────────────────┘
Behavior gates:
- A Tier 1 failure blocks the commit
- A Tier 2 failure flags the PR for review but doesn't block the merge
- A Tier 3 failure blocks shipping until investigated
Why three tiers: each one solves a different problem. Tier 1 catches obvious failures at zero cost. Tier 2 catches quality regressions at a controlled cost per PR. Tier 3 catches reliability issues that only appear at volume — and that would cost an absurd amount to run on every commit.
Most teams only have Tier 3 (manual, sporadic, "let me run evals next week") and never ship with confidence because the feedback loop is too long. Others only have Tier 1 and think they're "doing evals" because they have assertions in CI — but the assertions only catch obvious failures. Three-tier solves both.
Aggregate token cost is a vanity metric. Cost-per-resolution is what matters in production — it accounts for the fact that some interactions require more turns, more tool calls, and more retrieval to reach the same outcome.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SessionCostMetrics:
    session_id: str
    total_input_tokens: int
    total_output_tokens: int
    tool_calls_count: int
    resolved: bool  # from downstream system state or human review
    model: str

    def cost_usd(self, input_price_per_1m: float, output_price_per_1m: float) -> float:
        return (
            self.total_input_tokens / 1_000_000 * input_price_per_1m +
            self.total_output_tokens / 1_000_000 * output_price_per_1m
        )

def cost_per_resolution(sessions: list[SessionCostMetrics],
                        input_price: float, output_price: float) -> Optional[float]:
    resolved = [s for s in sessions if s.resolved]
    if not resolved:
        return None
    total_cost = sum(s.cost_usd(input_price, output_price) for s in resolved)
    return total_cost / len(resolved)
Track weekly. Alert if it increases >15% without a corresponding improvement in resolution rate — that's the signal something regressed in a prompt change or retrieval config.
Teams that skip this instrumentation only discover that a retrieval bug doubled the context window when the bill arrives. Teams that don't measure resolution rate separately from conversation completion miss that the agent is "completing" conversations (reaching a conclusion) without actually solving the user's problem.
In production, judges silently drift. Don't let calibration decay.
Minimum cadence: every 2 weeks, sample recent production traces, label them by hand, and compare the labels against the judge's verdicts.
Recalibration triggers: a change to the judge model or judge prompt, or an observed drop in agreement with human labels.
Don't trust aggregate agreement. A judge that calls everything PASS has 70% agreement if 70% of traces are actually PASS — but TNR is 0. Track both sides.
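The arithmetic from that paragraph, as a runnable sketch (the helper names are mine, not DeepEval's):

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    # fraction of verdicts where judge and human label agree
    return sum(j == h for j, h in zip(judge, human)) / len(human)

def true_negative_rate(judge: list[bool], human: list[bool]) -> float:
    # of the traces humans marked FAIL, how many did the judge also fail?
    negatives = [j for j, h in zip(judge, human) if not h]
    return sum(not j for j in negatives) / len(negatives)

# A degenerate judge that passes everything, on a set that is 70% true-PASS:
human = [True] * 7 + [False] * 3
judge = [True] * 10
# agreement(judge, human) -> 0.7, true_negative_rate(judge, human) -> 0.0
```

Aggregate agreement looks respectable at 0.7, while the TNR of 0.0 reveals the judge never catches a real failure. Track both.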
CI/CD with deepeval test run is the second pillar of production. The idea: every PR runs evals against a regression dataset and blocks merging if something regressed.
# tests/test_agent_evals.py
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import TaskCompletionMetric

# Load dataset (pull from Confident AI or load locally)
dataset = EvaluationDataset()
dataset.pull(alias="regression-dataset-v1")

# Generate test cases by running the agent
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_agent(golden.input),  # your_agent: the system under test
    )
    dataset.add_test_case(test_case)

# Pytest parametrizes over the test cases
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_agent(test_case: LLMTestCase):
    metric = TaskCompletionMetric(threshold=0.8)
    assert_test(test_case=test_case, metrics=[metric])
deepeval test run tests/test_agent_evals.py
# .github/workflows/eval.yml
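The workflow body was cut off above; a minimal sketch of what such a file might contain (job names, action versions, Python version, and the secret names are assumptions, not from the source):

```yaml
# Hypothetical sketch: adjust paths, versions, and secret names to your repo.
name: evals
on: pull_request

jobs:
  tier2-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt deepeval
      - name: Run regression evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: deepeval test run tests/test_agent_evals.py
```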