Human-In-The-Loop gate that presents the action plan with full context, collects an informed approval/modification/rejection decision, and records the outcome. Trigger when the user says "run stage 6", "HITL review", "approve action plan", or when invoked by the kayba-pipeline orchestrator. Requires eval/action_plan.md and eval/baseline_metrics.md to exist.
Present the action plan with enough context for an informed decision, collect the user's approval, and record the outcome.
The goal is not rubber-stamping. The user must receive enough information to genuinely evaluate, modify, or reject the plan -- even if they have not seen Stages 1-5.
- eval/action_plan.md -- the prioritized action plan from Stage 5
- eval/baseline_metrics.md -- the evaluation rubric with baseline values
- eval/baseline_metrics.json -- raw metric data (for exact numerator/denominator counts)
- eval/stage1_insights_summary.md -- original insights (for trace evidence references)

Read all four files before starting.
Compute and present the following counts from the action plan: raw and deduplicated insight totals, actionable fixes broken down by type, and discarded insights with reasons.
Format:
EXECUTIVE SUMMARY
-----------------
Insights analyzed: 19 (raw) -> 12 distinct after dedup
Actionable: 9 (8 prompt fixes, 1 code fix)
Discarded: 3 (reasons listed below)
Discards:
- 5ac7f4ce (Upfront Info Collection): conflicts with higher-priority turn discipline
- fe2d51cb (Proactive Reservation Lookup): already default behavior, no failure evidence
- 1fa1b826 (Cancellation Denial Enumeration): subsumed into cancellation checklist
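The counts above can be tallied mechanically. A minimal sketch, assuming the deduplicated insights have been parsed into dicts with `status` and `fix_type` fields (both hypothetical names, not fixed by this pipeline):

```python
from collections import Counter

def summarize(insights):
    """Tally actionable/discarded counts for the executive summary block."""
    status = Counter(i["status"] for i in insights)
    # Break actionable fixes down by type (e.g., "prompt fix" vs. "code fix").
    by_type = Counter(i["fix_type"] for i in insights if i["status"] == "actionable")
    return {"actionable": status["actionable"],
            "discarded": status["discarded"],
            "by_type": dict(by_type)}

insights = [
    {"status": "actionable", "fix_type": "prompt fix"},
    {"status": "actionable", "fix_type": "prompt fix"},
    {"status": "actionable", "fix_type": "code fix"},
    {"status": "discarded"},
]
print(summarize(insights))
# → {'actionable': 3, 'discarded': 1, 'by_type': {'prompt fix': 2, 'code fix': 1}}
```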
For each of the top 3 fixes by priority, present:
Before/after behavior -- use concrete examples from actual traces referenced in the insights. Quote the specific agent behavior that was wrong (before) and describe what the agent should do instead (after). Reference the trace task ID.
Target metric delta -- which metric(s) this fix targets, the current baseline value, and the expected direction. Do not fabricate precise target numbers. Use the format: "M1: 41.4% -> higher (target: 90%+)" only when the action plan provides a target; otherwise use "M1: 41.4% -> up".
Risk rating -- assess each fix:
- Low -- additive prompt instruction, no behavioral side effects expected
- Medium -- changes existing behavior, could affect adjacent workflows
- High -- modifies code/infrastructure, or could degrade a metric while improving another

Format each as a numbered block:
#1: Turn Discipline (covers 55c00c40, d9683144)
Type: prompt fix
Metrics: M1 (41.4% -> up), M2 (20.7% -> up)
Risk: Low
BEFORE (task_1, task_5, task_7, ...):
Agent batches 2-3 tool calls per turn (e.g., get_reservation + get_flight_status
in a single response). Also includes user-facing text alongside tool calls.
AFTER:
Exactly one tool call per response. No user-facing content in tool-call turns.
Agent processes each result before making the next call.
Display all non-discarded fixes in a table:
| Priority | Fix Name | Type | Target Metrics | Risk | Effort |
|----------|-----------------------------------|------------|-----------------|--------|--------|
| 1 | Turn Discipline | prompt fix | M1, M2 | Low | Low |
| 2 | Post-Confirmation Execution | prompt fix | M3 | Low | Low |
| 3 | Cancellation Checklist | prompt fix | M5 | Low | Low |
| ... | ... | ... | ... | ... | ... |
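The priority table can be generated from fix metadata rather than typed by hand. A sketch, assuming each fix is a dict with hypothetical `priority`, `name`, `type`, `metrics`, `risk`, and `effort` fields:

```python
def render_fix_table(fixes):
    """Render non-discarded fixes as a markdown table, in priority order."""
    rows = ["| Priority | Fix Name | Type | Target Metrics | Risk | Effort |",
            "|---|---|---|---|---|---|"]
    for f in sorted(fixes, key=lambda f: f["priority"]):
        rows.append(f"| {f['priority']} | {f['name']} | {f['type']} | "
                    f"{', '.join(f['metrics'])} | {f['risk']} | {f['effort']} |")
    return "\n".join(rows)

table = render_fix_table([
    {"priority": 2, "name": "Post-Confirmation Execution", "type": "prompt fix",
     "metrics": ["M3"], "risk": "Low", "effort": "Low"},
    {"priority": 1, "name": "Turn Discipline", "type": "prompt fix",
     "metrics": ["M1", "M2"], "risk": "Low", "effort": "Low"},
])
print(table)
```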
Effort ratings:
- Low -- single prompt addition, under 5 lines
- Medium -- multiple prompt additions or minor code change
- High -- significant code changes, new metric implementation, or architectural changes

List every discarded insight with its ID, name, and discard reason.
This section exists so the user can override a discard if they disagree.
Any metric with denominator < 5 must be explicitly called out:
LOW-CONFIDENCE METRICS (small sample size):
- M5 (Cancellation Policy Compliance): based on 2 observations -- directional only
- M6 (Compensation Execution Rate): based on 1 observation -- directional only
Fixes targeting these metrics (Cancellation Checklist, Compensation Rules) are
still recommended because the policy violations are clear from trace evidence,
but the measured improvement may not be statistically meaningful until the
trace corpus grows.
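The denominator check is scriptable against eval/baseline_metrics.json. A sketch, assuming the file maps metric IDs to objects with `numerator` and `denominator` fields (field names are assumptions, not guaranteed by this pipeline):

```python
def low_confidence(metrics, min_n=5):
    """Return (metric_id, sample_size) pairs below the confidence threshold."""
    return sorted((mid, m["denominator"])
                  for mid, m in metrics.items()
                  if m["denominator"] < min_n)

# In practice, load this dict via json.load(open("eval/baseline_metrics.json")).
metrics = {
    "M1": {"numerator": 12, "denominator": 29},
    "M5": {"numerator": 1, "denominator": 2},
    "M6": {"numerator": 0, "denominator": 1},
}
print(low_confidence(metrics))  # → [('M5', 2), ('M6', 1)]
```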
Also flag any fix where the action plan notes uncertainty or partial evidence.
For each fix, present the chain: insight -> metric -> fix -> expected improvement. This can be a compact list or a table. The purpose is to let the user verify that nothing was lost or invented between stages.
TRACEABILITY:
55c00c40 (Tool Call Discipline) -> M1, M2 -> Skill 1 (Turn Discipline) -> M1 up, M2 up
6ea141e1 (Execution Discipline) -> M3 -> Skill 2 (Post-Confirmation) -> M3 up
0f4a952b + 6ce88ebb (Cancellation) -> M5 -> Skill 3 (Cancellation Checklist) -> M5 up
...
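Each traceability line follows a fixed shape, so it can be formatted from structured data. A sketch with a hypothetical helper:

```python
def chain_line(ids, insight, metrics, fix, deltas):
    """Format one insight -> metric -> fix -> expected-improvement line."""
    return (f"{' + '.join(ids)} ({insight}) -> {', '.join(metrics)}"
            f" -> {fix} -> {', '.join(deltas)}")

line = chain_line(["55c00c40"], "Tool Call Discipline",
                  ["M1", "M2"], "Skill 1 (Turn Discipline)", ["M1 up", "M2 up"])
print(line)
# → 55c00c40 (Tool Call Discipline) -> M1, M2 -> Skill 1 (Turn Discipline) -> M1 up, M2 up
```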
Present exactly three options:
OPTIONS:
[A] Approve all -- implement all 9 fixes as described
[B] Approve with modifications -- review each fix individually
[C] Reject -- return to Stage 5 with feedback
Use the appropriate mechanism to collect the user's choice (direct question or AskUserQuestion if available).
Record the decision and proceed. No further interaction needed.
Walk through each fix individually, in priority order. For each fix, present the same detail shown in the per-fix blocks above (type, target metrics, risk, and before/after behavior), then ask: "Approve / Skip / Modify?"
After walking through all fixes, present a summary of changes:
Ask for final confirmation: "Proceed with this modified plan?"
Then update eval/action_plan.md to reflect the approved, modified, and skipped fixes.
Ask the user for specific feedback on why the plan was rejected and what should change when Stage 5 is re-run.
Record the feedback in eval/stage6_decision.md and signal that Stage 5 should be re-run with the user's feedback incorporated.
Write this file regardless of which option was selected.
# Stage 6: HITL Decision Record
## Date
[timestamp]
## Decision
[Approve all | Approve with modifications | Reject]
## What was presented
- Total insights: N (M distinct after dedup)
- Actionable fixes: X (Y prompt, Z code)
- Discarded: W
- Metrics: [list metric IDs and baselines]
- Low-confidence flags: [list metrics with small denominators]
## Top 3 changes presented
1. [fix name] -- [type] -- targets [metrics] -- risk [rating]
2. ...
3. ...
## Decision details
### If Approve all:
User approved all N fixes without modification.
Reasoning: [any reasoning the user provided, or "No additional reasoning provided"]
### If Approve with modifications:
| Fix | Original Status | Decision | Reason |
|-----|----------------|----------|--------|
| Turn Discipline | Priority 1 | Approved | -- |
| Compensation Rules | Priority 5 | Modified | User changed wording to... |
| Cabin Change Rules | Priority 8 | Skipped | User considers low priority |
Modifications detail:
- [Fix name]: Original: "..." -> Modified: "..." -- User rationale: "..."
### If Reject:
User feedback: [verbatim feedback]
Specific concerns: [list]
Re-run instructions for Stage 5: [what to change]
## Traceability snapshot
[Copy of the traceability chain from step 6, so the decision record is self-contained]
If the user selected [B] and made changes, apply them to eval/action_plan.md. Do not modify eval/action_plan.md unless the user explicitly requests modifications.

Outputs:
- eval/stage6_decision.md -- full record of what was presented, decided, and why
- eval/action_plan.md -- updated only if the user selected "Approve with modifications"
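Writing the decision record can be sketched as follows; the `decision` and `body` arguments are hypothetical, with the caller assembling the remaining template sections:

```python
from datetime import datetime, timezone
from pathlib import Path

def write_decision_record(decision, body, path="eval/stage6_decision.md"):
    """Write the Stage 6 decision record; runs for every outcome (A, B, or C)."""
    text = "\n".join([
        "# Stage 6: HITL Decision Record",
        "## Date",
        datetime.now(timezone.utc).isoformat(),
        "## Decision",
        decision,  # "Approve all" | "Approve with modifications" | "Reject"
        body,      # remaining sections, assembled by the caller
    ]) + "\n"
    Path(path).write_text(text)
    return text
```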