This skill should be used when the user asks to "improve a metric", "run labs", "leave feedback on a metric", "add to labs", "fix metric accuracy", "review metric results", "find misaligned metrics", "iterate on metric quality", or discusses the metric improvement cycle, feedback workflow, or labs pipeline in the Cekura platform.
Guide the metric improvement cycle: identify misaligned metric results, leave structured feedback, run the labs improvement pipeline, and validate changes. This workflow transforms metric quality from initial draft to production-ready through systematic iteration.
When metrics have systemic issues (high false-fail rates), do NOT jump straight to labs feedback. Instead, fix the metric prompt directly first.
This avoids wasting labs iterations on issues that are clearly fixable by prompt editing. Labs is for nuanced edge cases, not systemic prompt design flaws.
Use labs for edge-case refinement once manual prompt fixes are validated:
Review recent call evaluations to find suspicious results:
- Use `mcp__cekura__call_logs_list` with agent filters to list recent calls.
- Use `mcp__cekura__call_logs_retrieve` with a call ID to get evaluation results.
Look for results whose pass/fail verdict does not match what actually happened in the call.
To systematically find misalignment, evaluate the target metrics across recent calls with `evaluate_calls` or `list_calls` and review the failures.

Use the `mark_metric_vote` endpoint to leave structured feedback: retrieve the call with `mcp__cekura__call_logs_retrieve`, then submit the vote via the feedback API.
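The review step above can be sketched as a small filter over call logs. The result shape here (`evaluation_results` entries with `metric`, `status`, and `explanation` keys) is an assumption for illustration, not the exact Cekura response schema:

```python
# Sketch: flag suspicious evaluation results in recent call logs so a
# human can check each failure against the transcript.
# ASSUMPTION: the call-log/result field names below are illustrative.

def find_suspicious_results(call_logs, metric_id):
    """Return (call_id, explanation) pairs where the metric failed."""
    flagged = []
    for call in call_logs:
        for result in call.get("evaluation_results", []):
            if result.get("metric") == metric_id and result.get("status") == "fail":
                flagged.append((call["id"], result.get("explanation", "")))
    return flagged

calls = [
    {"id": 1, "evaluation_results": [{"metric": 123, "status": "fail",
                                      "explanation": "No greeting found."}]},
    {"id": 2, "evaluation_results": [{"metric": 123, "status": "pass",
                                      "explanation": "Greeting present."}]},
]
print(find_suspicious_results(calls, 123))  # [(1, 'No greeting found.')]
```

Each flagged pair is a candidate for a `mark_metric_vote` feedback submission.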
Collect at least 6 feedback instances before running auto-improve. This gives labs enough signal to identify patterns in the feedback and make meaningful prompt adjustments.
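A minimal gate for the 6-instance threshold might look like this; the `feedback_by_metric` mapping (metric ID to a list of feedback dicts) is an assumed local bookkeeping structure, not a Cekura API shape:

```python
# Sketch: gate the auto-improve call on having enough accumulated feedback.
# ASSUMPTION: feedback_by_metric is local bookkeeping, metric_id -> votes.

MIN_FEEDBACK = 6

def ready_for_auto_improve(feedback_by_metric, metric_id):
    """True once a metric has at least MIN_FEEDBACK feedback instances."""
    return len(feedback_by_metric.get(metric_id, [])) >= MIN_FEEDBACK

votes = {123: [{"call": i, "vote": "down"} for i in range(6)]}
print(ready_for_auto_improve(votes, 123))  # True
print(ready_for_auto_improve(votes, 456))  # False
```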
Track feedback progress:
Once 6+ feedback instances are accumulated:
Use mcp__cekura__metrics_run_reviews_create with the metric ID to trigger auto-improvement.
Labs analyzes the feedback and suggests changes to the metric prompt. Review the suggested changes carefully before applying them.
Each evaluation costs the client real money. Before triggering any bulk evaluation:
- Count the affected calls first (fetch with page_size=1 and read the response count).
- Use the page_size parameter (up to 200) instead of paginating through multiple pages.
- Use server-side filters (agent_id, project, timestamp__gte/timestamp__lte) to scope calls before evaluating.
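The scoping guidance above can be captured in a small query-parameter builder. The filter names mirror those mentioned in this section; treat the exact parameter names as assumptions to verify against the API:

```python
# Sketch: build scoped query parameters for the call-logs list endpoint
# so a bulk evaluation touches only the calls it must.
# ASSUMPTION: parameter names mirror the filters named in this doc.

def build_call_log_params(agent_id, since=None, until=None, page_size=200):
    params = {"agent": agent_id, "page_size": min(page_size, 200)}  # cap at 200
    if since:
        params["timestamp__gte"] = since
    if until:
        params["timestamp__lte"] = until
    return params

print(build_call_log_params(42, since="2024-06-01", page_size=500))
# {'agent': 42, 'page_size': 200, 'timestamp__gte': '2024-06-01'}
```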
Re-run the improved metric on the same calls that had misaligned results:
Use mcp__cekura__call_logs_rerun_evaluation_create with the call IDs and metric ID.
Check that formerly misaligned results now match expectations, and that previously correct results have not regressed.
If validation fails, leave additional feedback and iterate.
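The validation check can be sketched as a before/after comparison against the human verdicts gathered during feedback. The mappings below (call ID to "pass"/"fail") are illustrative shapes, not an API contract:

```python
# Sketch: compare pre- and post-improvement results on the same calls.
# ASSUMPTION: before/after/expected map call_id -> "pass"/"fail" verdicts.

def validation_report(before, after, expected):
    """List calls the improvement fixed, and calls it broke."""
    fixed = [c for c in expected if before[c] != expected[c] and after[c] == expected[c]]
    regressed = [c for c in expected if before[c] == expected[c] and after[c] != expected[c]]
    return {"fixed": fixed, "regressed": regressed}

before = {1: "fail", 2: "pass", 3: "fail"}
after = {1: "pass", 2: "fail", 3: "fail"}
expected = {1: "pass", 2: "pass", 3: "pass"}
print(validation_report(before, after, expected))
# {'fixed': [1], 'regressed': [2]}
```

Any nonempty `regressed` list means the iteration needs more feedback before shipping.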
Once the metric prompt is validated through labs, consider converting to a Pythonic custom_code metric for production:
- Keep the labs-refined prompt in the metric's description field.
- The metric then has both description (the prompt) and custom_code (the Python wrapper).

This gives the benefit of the labs-refined prompt with the performance advantage of targeted context extraction.
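A custom_code metric body might look roughly like the sketch below: extract only the relevant transcript slice, then apply the refined check. The function name, transcript shape, and return contract are assumptions — adapt them to Cekura's actual custom_code interface:

```python
# Sketch of a Pythonic custom_code metric: targeted context extraction
# plus a focused check.
# ASSUMPTION: evaluate(), the transcript shape, and the return dict are
# illustrative, not Cekura's real custom_code signature.

def evaluate(transcript):
    """Pass if the agent greets the caller within its first two turns."""
    agent_turns = [t["text"].lower() for t in transcript if t.get("role") == "agent"][:2]
    greeted = any("hello" in t or "hi " in t or "thank you for calling" in t
                  for t in agent_turns)
    return {"result": "pass" if greeted else "fail",
            "explanation": "Greeting found in opening turns." if greeted
                           else "No greeting in the first two agent turns."}

sample = [{"role": "agent", "text": "Hello, thanks for calling support."},
          {"role": "user", "text": "Hi, I need help."}]
print(evaluate(sample)["result"])  # pass
```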
When the user wants to simulate the labs workflow interactively:
| Endpoint | Purpose |
|---|---|
| `GET /observability/v1/call-logs-external/?agent=ID` | List calls |
| `GET /observability/v1/call-logs-external/{id}/` | Get call details + evaluation results |
| `POST /observability/v1/call-logs-external/{id}/mark_metric_vote/` | Leave feedback |
| `POST /test_framework/metric-reviews/process_feedbacks/` | Run labs auto-improve (see below) |
| `GET /test_framework/metric-reviews/process_feedbacks_progress/` | Poll improvement progress |
| `POST /observability/v1/call-logs/evaluate_metrics/` | Evaluate specific metrics on calls |
| `POST /observability/v1/call-logs/rerun_evaluation/` | Re-run evaluation on calls |
| `POST /test_framework/test-sets/create_from_call_log/` | Create test set from call log |
```
POST /test_framework/metric-reviews/process_feedbacks/
{
  "metric_id": 123,
  "test_set_ids": [456, 789]
}
```
Optional fields: metric_type (default "llm_judge"), skip_evaluation (bool), simplified_prompt (string).
Returns {"progress_id": "<uuid>"}. Poll at GET /test_framework/metric-reviews/process_feedbacks_progress/?progress_id=<uuid>.
The response includes improved description and evaluation_trigger when complete — you must PATCH the metric to apply changes (they are not auto-applied).
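The trigger-then-poll flow above can be sketched as a loop. Here `fetch` stands in for an HTTP GET of the `process_feedbacks_progress` endpoint and is injected so the loop can be exercised without a network call; the `status`/`description` fields on the response are assumptions:

```python
# Sketch: poll the progress endpoint until the improvement completes.
# ASSUMPTION: fetch(progress_id) performs the GET and returns parsed JSON
# with a "status" field; field names are illustrative.

import time

def poll_until_complete(fetch, progress_id, interval=0, max_attempts=30):
    for _ in range(max_attempts):
        status = fetch(progress_id)
        if status.get("status") == "complete":
            return status  # carries the improved description/evaluation_trigger
        time.sleep(interval)
    raise TimeoutError("labs auto-improve did not finish in time")

# Simulated responses: two in-progress polls, then completion.
responses = iter([{"status": "running"}, {"status": "running"},
                  {"status": "complete", "description": "improved prompt"}])
result = poll_until_complete(lambda pid: next(responses), "abc-123")
print(result["description"])  # improved prompt
```

Remember: the returned `description` still has to be PATCHed onto the metric by hand.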
```
POST /test_framework/test-sets/create_from_call_log/
{
  "call_log_id": 3358270,
  "metrics": [{"metric": 123, "feedback": "The metric incorrectly failed this call because..."}]
}
```
Note: metrics must be an array of objects [{"metric": <id>, "feedback": "<text>"}], NOT bare metric IDs. Passing bare IDs returns 500.
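A small payload builder can enforce the object form and avoid that 500. The helper name and `default_feedback` parameter are hypothetical conveniences, not part of the API:

```python
# Sketch: normalize the `metrics` field so bare metric IDs become the
# [{"metric": <id>, "feedback": "<text>"}] object form the endpoint
# requires, avoiding the 500 described above.
# ASSUMPTION: build_create_from_call_log_payload is a local helper.

def build_create_from_call_log_payload(call_log_id, metrics, default_feedback=""):
    normalized = []
    for m in metrics:
        if isinstance(m, dict):
            normalized.append(m)          # already in object form
        else:
            normalized.append({"metric": m, "feedback": default_feedback})
    return {"call_log_id": call_log_id, "metrics": normalized}

payload = build_create_from_call_log_payload(
    3358270, [123, {"metric": 456, "feedback": "Incorrect fail: agent did confirm."}])
print(payload["metrics"][0])  # {'metric': 123, 'feedback': ''}
```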
references/feedback-examples.md — Examples of good feedback for different metric types