Human-In-The-Loop gate that presents the action plan with full context, collects an informed approval/modification/rejection decision, and records the outcome. Trigger when the user says "run stage 6", "HITL review", "approve action plan", or when invoked by the kayba-pipeline orchestrator. Requires eval/action_plan.md and eval/baseline_metrics.md to exist.
Present the action plan with enough context for an informed decision, collect the user's approval, and record the outcome.
The goal is not rubber-stamping. The user must receive enough information to genuinely evaluate, modify, or reject the plan -- even if they have not seen Stages 1-5.
- eval/action_plan.md -- the prioritized action plan from Stage 5
- eval/baseline_metrics.md -- the evaluation rubric with baseline values
- eval/baseline_metrics.json -- raw metric data (for exact numerator/denominator counts)
- eval/stage1_insights_summary.md -- original insights (for trace evidence references)

Read all four files before starting.
Compute and present the following counts from the action plan: raw and deduplicated insight totals, actionable fixes broken down by type, and discarded insights with reasons.
Format:
EXECUTIVE SUMMARY
-----------------
Insights analyzed: 19 (raw) -> 12 distinct after dedup
Actionable: 9 (8 prompt fixes, 1 code fix)
Discarded: 3 (reasons listed below)
Discards:
- 5ac7f4ce (Upfront Info Collection): conflicts with higher-priority turn discipline
- fe2d51cb (Proactive Reservation Lookup): already default behavior, no failure evidence
- 1fa1b826 (Cancellation Denial Enumeration): subsumed into cancellation checklist
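The counts above can be tallied mechanically. A minimal sketch, assuming the deduplicated insights have been parsed into dicts with `status` and `fix_type` fields (both hypothetical names, not fixed by this pipeline):

```python
from collections import Counter

def summarize(insights):
    """Tally actionable/discarded counts for the executive summary block."""
    status = Counter(i["status"] for i in insights)
    # Break actionable fixes down by type (e.g., "prompt fix" vs. "code fix").
    by_type = Counter(i["fix_type"] for i in insights if i["status"] == "actionable")
    return {"actionable": status["actionable"],
            "discarded": status["discarded"],
            "by_type": dict(by_type)}

insights = [
    {"status": "actionable", "fix_type": "prompt fix"},
    {"status": "actionable", "fix_type": "prompt fix"},
    {"status": "actionable", "fix_type": "code fix"},
    {"status": "discarded"},
]
print(summarize(insights))
# → {'actionable': 3, 'discarded': 1, 'by_type': {'prompt fix': 2, 'code fix': 1}}
```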
For each of the top 3 fixes by priority, present:
Before/after behavior -- use concrete examples from actual traces referenced in the insights. Quote the specific agent behavior that was wrong (before) and describe what the agent should do instead (after). Reference the trace task ID.
Target metric delta -- which metric(s) this fix targets, the current baseline value, and the expected direction. Do not fabricate precise target numbers. Use the format: "M1: 41.4% -> higher (target: 90%+)" only when the action plan provides a target; otherwise use "M1: 41.4% -> up".
Risk rating -- assess each fix:
- Low -- additive prompt instruction, no behavioral side effects expected
- Medium -- changes existing behavior, could affect adjacent workflows
- High -- modifies code/infrastructure, or could degrade a metric while improving another

Format each as a numbered block:
#1: Turn Discipline (covers 55c00c40, d9683144)
Type: prompt fix
Metrics: M1 (41.4% -> up), M2 (20.7% -> up)
Risk: Low
BEFORE (task_1, task_5, task_7, ...):
Agent batches 2-3 tool calls per turn (e.g., get_reservation + get_flight_status
in a single response). Also includes user-facing text alongside tool calls.
AFTER:
Exactly one tool call per response. No user-facing content in tool-call turns.
Agent processes each result before making the next call.
Display all non-discarded fixes in a table:
| Priority | Fix Name | Type | Target Metrics | Risk | Effort |
|----------|-----------------------------------|------------|-----------------|--------|--------|
| 1 | Turn Discipline | prompt fix | M1, M2 | Low | Low |
| 2 | Post-Confirmation Execution | prompt fix | M3 | Low | Low |
| 3 | Cancellation Checklist | prompt fix | M5 | Low | Low |
| ... | ... | ... | ... | ... | ... |
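The priority table can be generated from fix metadata rather than typed by hand. A sketch, assuming each fix is a dict with hypothetical `priority`, `name`, `type`, `metrics`, `risk`, and `effort` fields:

```python
def render_fix_table(fixes):
    """Render non-discarded fixes as a markdown table, in priority order."""
    rows = ["| Priority | Fix Name | Type | Target Metrics | Risk | Effort |",
            "|---|---|---|---|---|---|"]
    for f in sorted(fixes, key=lambda f: f["priority"]):
        rows.append(f"| {f['priority']} | {f['name']} | {f['type']} | "
                    f"{', '.join(f['metrics'])} | {f['risk']} | {f['effort']} |")
    return "\n".join(rows)

table = render_fix_table([
    {"priority": 2, "name": "Post-Confirmation Execution", "type": "prompt fix",
     "metrics": ["M3"], "risk": "Low", "effort": "Low"},
    {"priority": 1, "name": "Turn Discipline", "type": "prompt fix",
     "metrics": ["M1", "M2"], "risk": "Low", "effort": "Low"},
])
print(table)
```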
Effort ratings:
- Low -- single prompt addition, under 5 lines
- Medium -- multiple prompt additions or minor code change
- High -- significant code changes, new metric implementation, or architectural changes

List every discarded insight with its ID, name, and discard reason.
This section exists so the user can override a discard if they disagree.
Any metric with denominator < 5 must be explicitly called out:
LOW-CONFIDENCE METRICS (small sample size):
- M5 (Cancellation Policy Compliance): based on 2 observations -- directional only
- M6 (Compensation Execution Rate): based on 1 observation -- directional only
Fixes targeting these metrics (Cancellation Checklist, Compensation Rules) are
still recommended because the policy violations are clear from trace evidence,
but the measured improvement may not be statistically meaningful until the
trace corpus grows.
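The denominator check is scriptable against eval/baseline_metrics.json. A sketch, assuming the file maps metric IDs to objects with `numerator` and `denominator` fields (field names are assumptions, not guaranteed by this pipeline):

```python
def low_confidence(metrics, min_n=5):
    """Return (metric_id, sample_size) pairs below the confidence threshold."""
    return sorted((mid, m["denominator"])
                  for mid, m in metrics.items()
                  if m["denominator"] < min_n)

# In practice, load this dict via json.load(open("eval/baseline_metrics.json")).
metrics = {
    "M1": {"numerator": 12, "denominator": 29},
    "M5": {"numerator": 1, "denominator": 2},
    "M6": {"numerator": 0, "denominator": 1},
}
print(low_confidence(metrics))  # → [('M5', 2), ('M6', 1)]
```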
Also flag any fix where the action plan notes uncertainty or partial evidence.
For each fix, present the chain: insight -> metric -> fix -> expected improvement. This can be a compact list or a table. The purpose is to let the user verify that nothing was lost or invented between stages.
TRACEABILITY:
55c00c40 (Tool Call Discipline) -> M1, M2 -> Skill 1 (Turn Discipline) -> M1 up, M2 up
6ea141e1 (Execution Discipline) -> M3 -> Skill 2 (Post-Confirmation) -> M3 up
0f4a952b + 6ce88ebb (Cancellation) -> M5 -> Skill 3 (Cancellation Checklist) -> M5 up
...
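Each traceability line follows a fixed shape, so it can be formatted from structured data. A sketch with a hypothetical helper:

```python
def chain_line(ids, insight, metrics, fix, deltas):
    """Format one insight -> metric -> fix -> expected-improvement line."""
    return (f"{' + '.join(ids)} ({insight}) -> {', '.join(metrics)}"
            f" -> {fix} -> {', '.join(deltas)}")

line = chain_line(["55c00c40"], "Tool Call Discipline",
                  ["M1", "M2"], "Skill 1 (Turn Discipline)", ["M1 up", "M2 up"])
print(line)
# → 55c00c40 (Tool Call Discipline) -> M1, M2 -> Skill 1 (Turn Discipline) -> M1 up, M2 up
```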
Present exactly three options:
OPTIONS:
[A] Approve all -- implement all 9 fixes as described
[B] Approve with modifications -- review each fix individually
[C] Reject -- return to Stage 5 with feedback
Use the appropriate mechanism to collect the user's choice (direct question or AskUserQuestion if available).
Record the decision and proceed. No further interaction needed.
Walk through each fix individually, in priority order. For each fix, present the same detail shown in the per-fix blocks above (type, target metrics, risk, and before/after behavior), then ask: "Approve / Skip / Modify?"
After walking through all fixes, present a summary of changes:
Ask for final confirmation: "Proceed with this modified plan?"
Then update eval/action_plan.md to reflect the approved, modified, and skipped fixes.
Ask the user for specific feedback on why the plan was rejected and what should change when Stage 5 is re-run.
Record the feedback in eval/stage6_decision.md and signal that Stage 5 should be re-run with the user's feedback incorporated.
Write this file regardless of which option was selected.
# Stage 6: HITL Decision Record
## Date
[timestamp]
## Decision
[Approve all | Approve with modifications | Reject]
## What was presented
- Total insights: N (M distinct after dedup)
- Actionable fixes: X (Y prompt, Z code)
- Discarded: W
- Metrics: [list metric IDs and baselines]
- Low-confidence flags: [list metrics with small denominators]
## Top 3 changes presented
1. [fix name] -- [type] -- targets [metrics] -- risk [rating]
2. ...
3. ...
## Decision details
### If Approve all:
User approved all N fixes without modification.
Reasoning: [any reasoning the user provided, or "No additional reasoning provided"]
### If Approve with modifications:
| Fix | Original Status | Decision | Reason |
|-----|----------------|----------|--------|
| Turn Discipline | Priority 1 | Approved | -- |
| Compensation Rules | Priority 5 | Modified | User changed wording to... |
| Cabin Change Rules | Priority 8 | Skipped | User considers low priority |
Modifications detail:
- [Fix name]: Original: "..." -> Modified: "..." -- User rationale: "..."
### If Reject:
User feedback: [verbatim feedback]
Specific concerns: [list]
Re-run instructions for Stage 5: [what to change]
## Traceability snapshot
[Copy of the traceability chain from step 6, so the decision record is self-contained]
If the user selected [B] and made changes, apply them to eval/action_plan.md. Do not modify eval/action_plan.md unless the user explicitly requests modifications.

Outputs:
- eval/stage6_decision.md -- full record of what was presented, decided, and why
- eval/action_plan.md -- updated only if the user selected "Approve with modifications"
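Writing the decision record can be sketched as follows; the `decision` and `body` arguments are hypothetical, with the caller assembling the remaining template sections:

```python
from datetime import datetime, timezone
from pathlib import Path

def write_decision_record(decision, body, path="eval/stage6_decision.md"):
    """Write the Stage 6 decision record; runs for every outcome (A, B, or C)."""
    text = "\n".join([
        "# Stage 6: HITL Decision Record",
        "## Date",
        datetime.now(timezone.utc).isoformat(),
        "## Decision",
        decision,  # "Approve all" | "Approve with modifications" | "Reject"
        body,      # remaining sections, assembled by the caller
    ]) + "\n"
    Path(path).write_text(text)
    return text
```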