Stage 4: Rubric Definition

Organize metrics into a tiered evaluation rubric. Detect and resolve redundancy quantitatively. Ensure every insight is accounted for.

Inputs

Read all four files before starting.

Before tiering, check every pair of metrics for overlap. Two metrics are redundancy candidates if ANY of the following hold:

Denominator overlap >70%: compute |denom_events(A) ∩ denom_events(B)| / min(|denom_events(A)|, |denom_events(B)|). If the result exceeds 0.70, the two metrics are candidates. To compute this, trace through the detector functions and determine which trace events (turns, calls, threads) each denominator iterates over. When the denominators are identical sets (same loop, same filter), overlap is 100%.
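The overlap check can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual implementation: the function names, the metric names, and the representation of each denominator as a set of event IDs are all assumptions, and treating an empty denominator as zero overlap is a choice made here to avoid division by zero.

```python
from itertools import combinations


def denominator_overlap(events_a, events_b):
    """Return |A ∩ B| / min(|A|, |B|) for two collections of event IDs."""
    a, b = set(events_a), set(events_b)
    if not a or not b:
        # Assumption: an empty denominator shares nothing with any other.
        return 0.0
    return len(a & b) / min(len(a), len(b))


def redundancy_candidates(denominators, threshold=0.70):
    """List metric pairs whose denominator overlap exceeds the threshold.

    `denominators` maps a (hypothetical) metric name to the set of trace
    event IDs its denominator iterates over.
    """
    pairs = []
    for name_a, name_b in combinations(sorted(denominators), 2):
        score = denominator_overlap(denominators[name_a], denominators[name_b])
        if score > threshold:
            pairs.append((name_a, name_b, score))
    return pairs


# Hypothetical example: metric_a and metric_b share the same loop and filter.
denominators = {
    "metric_a": {"t1", "t2", "t3", "t4"},
    "metric_b": {"t1", "t2", "t3", "t4"},
    "metric_c": {"t3", "t4", "t5", "t6", "t7"},
}
candidates = redundancy_candidates(denominators)
```

Dividing by the smaller denominator means a metric whose events are a strict subset of another's still scores 100%, which matches the intent of flagging nested denominators as redundancy candidates.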

Tier	Purpose	Moves when...	Diagnostic signal
Leading	Behaviors a single skill directly changes	Skill is adopted	If leading moves but lagging doesn't → skill adopted but not solving the right problem
Lagging	Aggregate outcomes requiring multiple skills	Multiple skills coordinate	If lagging moves but leading doesn't → something else improved, not your skills
Quality	Requires domain understanding, not just instruction-following	Agent reasons correctly	If quality moves but lagging doesn't → agent got lucky or metric is mis-tiered
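The diagnostic column of the table can be read as a small decision rule. The sketch below is one possible encoding, assuming each tier's movement has already been reduced to a boolean; the function name and its arguments are hypothetical, and how "moved" is decided (threshold, significance test) is left open.

```python
def diagnose(leading_moved, lagging_moved, quality_moved):
    """Return the diagnostic signals from the tier table that apply."""
    signals = []
    if leading_moved and not lagging_moved:
        signals.append("skill adopted but not solving the right problem")
    if lagging_moved and not leading_moved:
        signals.append("something else improved, not your skills")
    if quality_moved and not lagging_moved:
        signals.append("agent got lucky or metric is mis-tiered")
    return signals


# Leading metrics moved, but aggregate outcomes did not.
print(diagnose(leading_moved=True, lagging_moved=False, quality_moved=False))
```

More than one signal can fire at once, for example when both leading and quality move while lagging stays flat, so the function returns a list rather than a single label.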