This skill should be used when the user asks to "improve a metric", "run labs", "leave feedback on a metric", "add to labs", "fix metric accuracy", "review metric results", "find misaligned metrics", "iterate on metric quality", or discusses the metric improvement cycle, feedback workflow, or labs pipeline in the Cekura platform.
Guide the metric improvement cycle: identify misaligned metric results, leave structured feedback, run the labs improvement pipeline, and validate changes. This workflow transforms metric quality from initial draft to production-ready through systematic iteration.
When metrics have systemic issues (high false-fail rates), do NOT jump straight to labs feedback. Instead, fix the metric prompt directly first.
This avoids wasting labs iterations on issues that are clearly fixable by prompt editing. Labs is for nuanced edge cases, not systemic prompt design flaws.
Use labs for edge-case refinement once manual prompt fixes are validated:
Review recent call evaluations to find suspicious results:
- Use `mcp__cekura__call_logs_list` with agent filters to list recent calls.
- Use `mcp__cekura__call_logs_retrieve` with a call ID to get evaluation results.
Look for results whose pass/fail verdict does not match what actually happened in the call.
To systematically find misalignment, evaluate the target metrics across recent calls with `evaluate_calls` or `list_calls` and review the failures.

Use the `mark_metric_vote` endpoint to leave structured feedback: retrieve the call with `mcp__cekura__call_logs_retrieve`, then submit the vote via the feedback API.
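The review step above can be sketched as a small filter over call logs. The result shape here (`evaluation_results` entries with `metric`, `status`, and `explanation` keys) is an assumption for illustration, not the exact Cekura response schema:

```python
# Sketch: flag suspicious evaluation results in recent call logs so a
# human can check each failure against the transcript.
# ASSUMPTION: the call-log/result field names below are illustrative.

def find_suspicious_results(call_logs, metric_id):
    """Return (call_id, explanation) pairs where the metric failed."""
    flagged = []
    for call in call_logs:
        for result in call.get("evaluation_results", []):
            if result.get("metric") == metric_id and result.get("status") == "fail":
                flagged.append((call["id"], result.get("explanation", "")))
    return flagged

calls = [
    {"id": 1, "evaluation_results": [{"metric": 123, "status": "fail",
                                      "explanation": "No greeting found."}]},
    {"id": 2, "evaluation_results": [{"metric": 123, "status": "pass",
                                      "explanation": "Greeting present."}]},
]
print(find_suspicious_results(calls, 123))  # [(1, 'No greeting found.')]
```

Each flagged pair is a candidate for a `mark_metric_vote` feedback submission.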
Collect at least 6 feedback instances before running auto-improve. This gives labs enough signal to identify patterns in the feedback and make meaningful prompt adjustments.
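A minimal gate for the 6-instance threshold might look like this; the `feedback_by_metric` mapping (metric ID to a list of feedback dicts) is an assumed local bookkeeping structure, not a Cekura API shape:

```python
# Sketch: gate the auto-improve call on having enough accumulated feedback.
# ASSUMPTION: feedback_by_metric is local bookkeeping, metric_id -> votes.

MIN_FEEDBACK = 6

def ready_for_auto_improve(feedback_by_metric, metric_id):
    """True once a metric has at least MIN_FEEDBACK feedback instances."""
    return len(feedback_by_metric.get(metric_id, [])) >= MIN_FEEDBACK

votes = {123: [{"call": i, "vote": "down"} for i in range(6)]}
print(ready_for_auto_improve(votes, 123))  # True
print(ready_for_auto_improve(votes, 456))  # False
```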
Track feedback progress:
Once 6+ feedback instances are accumulated:
Use mcp__cekura__metrics_run_reviews_create with the metric ID to trigger auto-improvement.
Labs analyzes the feedback and suggests changes to the metric prompt. Review the suggested changes carefully before applying them.
Each evaluation costs the client real money. Before triggering any bulk evaluation:
- Count the affected calls first (fetch with page_size=1 and read the response count).
- Use the page_size parameter (up to 200) instead of paginating through multiple pages.
- Use server-side filters (agent_id, project, timestamp__gte/timestamp__lte) to scope calls before evaluating.
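The scoping guidance above can be captured in a small query-parameter builder. The filter names mirror those mentioned in this section; treat the exact parameter names as assumptions to verify against the API:

```python
# Sketch: build scoped query parameters for the call-logs list endpoint
# so a bulk evaluation touches only the calls it must.
# ASSUMPTION: parameter names mirror the filters named in this doc.

def build_call_log_params(agent_id, since=None, until=None, page_size=200):
    params = {"agent": agent_id, "page_size": min(page_size, 200)}  # cap at 200
    if since:
        params["timestamp__gte"] = since
    if until:
        params["timestamp__lte"] = until
    return params

print(build_call_log_params(42, since="2024-06-01", page_size=500))
# {'agent': 42, 'page_size': 200, 'timestamp__gte': '2024-06-01'}
```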
Re-run the improved metric on the same calls that had misaligned results:
Use mcp__cekura__call_logs_rerun_evaluation_create with the call IDs and metric ID.
Check that formerly misaligned results now match expectations, and that previously correct results have not regressed.
If validation fails, leave additional feedback and iterate.
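The validation check can be sketched as a before/after comparison against the human verdicts gathered during feedback. The mappings below (call ID to "pass"/"fail") are illustrative shapes, not an API contract:

```python
# Sketch: compare pre- and post-improvement results on the same calls.
# ASSUMPTION: before/after/expected map call_id -> "pass"/"fail" verdicts.

def validation_report(before, after, expected):
    """List calls the improvement fixed, and calls it broke."""
    fixed = [c for c in expected if before[c] != expected[c] and after[c] == expected[c]]
    regressed = [c for c in expected if before[c] == expected[c] and after[c] != expected[c]]
    return {"fixed": fixed, "regressed": regressed}

before = {1: "fail", 2: "pass", 3: "fail"}
after = {1: "pass", 2: "fail", 3: "fail"}
expected = {1: "pass", 2: "pass", 3: "pass"}
print(validation_report(before, after, expected))
# {'fixed': [1], 'regressed': [2]}
```

Any nonempty `regressed` list means the iteration needs more feedback before shipping.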
Once the metric prompt is validated through labs, consider converting to a Pythonic custom_code metric for production:
- Keep the labs-refined prompt in the metric's description field.
- The metric then has both description (the prompt) and custom_code (the Python wrapper).

This gives the benefit of the labs-refined prompt with the performance advantage of targeted context extraction.
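A custom_code metric body might look roughly like the sketch below: extract only the relevant transcript slice, then apply the refined check. The function name, transcript shape, and return contract are assumptions — adapt them to Cekura's actual custom_code interface:

```python
# Sketch of a Pythonic custom_code metric: targeted context extraction
# plus a focused check.
# ASSUMPTION: evaluate(), the transcript shape, and the return dict are
# illustrative, not Cekura's real custom_code signature.

def evaluate(transcript):
    """Pass if the agent greets the caller within its first two turns."""
    agent_turns = [t["text"].lower() for t in transcript if t.get("role") == "agent"][:2]
    greeted = any("hello" in t or "hi " in t or "thank you for calling" in t
                  for t in agent_turns)
    return {"result": "pass" if greeted else "fail",
            "explanation": "Greeting found in opening turns." if greeted
                           else "No greeting in the first two agent turns."}

sample = [{"role": "agent", "text": "Hello, thanks for calling support."},
          {"role": "user", "text": "Hi, I need help."}]
print(evaluate(sample)["result"])  # pass
```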
When the user wants to simulate the labs workflow interactively:
| Endpoint | Purpose |
|---|---|
| `GET /observability/v1/call-logs-external/?agent=ID` | List calls |
| `GET /observability/v1/call-logs-external/{id}/` | Get call details + evaluation results |
| `POST /observability/v1/call-logs-external/{id}/mark_metric_vote/` | Leave feedback |
| `POST /test_framework/metric-reviews/process_feedbacks/` | Run labs auto-improve (see below) |
| `GET /test_framework/metric-reviews/process_feedbacks_progress/` | Poll improvement progress |
| `POST /observability/v1/call-logs/evaluate_metrics/` | Evaluate specific metrics on calls |
| `POST /observability/v1/call-logs/rerun_evaluation/` | Re-run evaluation on calls |
| `POST /test_framework/test-sets/create_from_call_log/` | Create test set from call log |
```
POST /test_framework/metric-reviews/process_feedbacks/
{
  "metric_id": 123,
  "test_set_ids": [456, 789]
}
```
Optional fields: metric_type (default "llm_judge"), skip_evaluation (bool), simplified_prompt (string).
Returns {"progress_id": "<uuid>"}. Poll at GET /test_framework/metric-reviews/process_feedbacks_progress/?progress_id=<uuid>.
The response includes improved description and evaluation_trigger when complete — you must PATCH the metric to apply changes (they are not auto-applied).
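The trigger-then-poll flow above can be sketched as a loop. Here `fetch` stands in for an HTTP GET of the `process_feedbacks_progress` endpoint and is injected so the loop can be exercised without a network call; the `status`/`description` fields on the response are assumptions:

```python
# Sketch: poll the progress endpoint until the improvement completes.
# ASSUMPTION: fetch(progress_id) performs the GET and returns parsed JSON
# with a "status" field; field names are illustrative.

import time

def poll_until_complete(fetch, progress_id, interval=0, max_attempts=30):
    for _ in range(max_attempts):
        status = fetch(progress_id)
        if status.get("status") == "complete":
            return status  # carries the improved description/evaluation_trigger
        time.sleep(interval)
    raise TimeoutError("labs auto-improve did not finish in time")

# Simulated responses: two in-progress polls, then completion.
responses = iter([{"status": "running"}, {"status": "running"},
                  {"status": "complete", "description": "improved prompt"}])
result = poll_until_complete(lambda pid: next(responses), "abc-123")
print(result["description"])  # improved prompt
```

Remember: the returned `description` still has to be PATCHed onto the metric by hand.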
```
POST /test_framework/test-sets/create_from_call_log/
{
  "call_log_id": 3358270,
  "metrics": [{"metric": 123, "feedback": "The metric incorrectly failed this call because..."}]
}
```
Note: metrics must be an array of objects [{"metric": <id>, "feedback": "<text>"}], NOT bare metric IDs. Passing bare IDs returns 500.
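A small payload builder can enforce the object form and avoid that 500. The helper name and `default_feedback` parameter are hypothetical conveniences, not part of the API:

```python
# Sketch: normalize the `metrics` field so bare metric IDs become the
# [{"metric": <id>, "feedback": "<text>"}] object form the endpoint
# requires, avoiding the 500 described above.
# ASSUMPTION: build_create_from_call_log_payload is a local helper.

def build_create_from_call_log_payload(call_log_id, metrics, default_feedback=""):
    normalized = []
    for m in metrics:
        if isinstance(m, dict):
            normalized.append(m)          # already in object form
        else:
            normalized.append({"metric": m, "feedback": default_feedback})
    return {"call_log_id": call_log_id, "metrics": normalized}

payload = build_create_from_call_log_payload(
    3358270, [123, {"metric": 456, "feedback": "Incorrect fail: agent did confirm."}])
print(payload["metrics"][0])  # {'metric': 123, 'feedback': ''}
```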
references/feedback-examples.md — Examples of good feedback for different metric types