Name: Validate Evaluator
Author: hamelsmu

from sklearn.model_selection import train_test_split

# First split: separate test set
train_dev, test = train_test_split(
    labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
)
# Second split: separate training examples from dev set
train, dev = train_test_split(
    train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
)
# Result: ~15% train, ~45% dev, ~40% test

TPR = (judge says Pass AND human says Pass) / (human says Pass)

TNR = (judge says Fail AND human says Fail) / (human says Fail)

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(human_labels, evaluator_labels,
                                   labels=['Fail', 'Pass']).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)

Disagreement Type	Judge	Human	Fix
False Pass	Pass	Fail	Judge is too lenient. Strengthen Fail definitions or add edge-case examples.
False Fail	Fail	Pass	Judge is too strict. Clarify Pass definitions or adjust examples.
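The two disagreement types in the table above can be pulled out of the dev-set results with a few lines of code. This is a minimal sketch, assuming each result is a dict with hypothetical `input`, `human`, and `judge` keys holding `'Pass'` or `'Fail'`; the record layout and the function name are illustrative, not part of the original workflow.

```python
def split_disagreements(records):
    """Group judge/human disagreements by type.

    Each record is assumed (hypothetically) to be a dict with
    'human' and 'judge' keys set to 'Pass' or 'Fail'.
    """
    # False Pass: judge passed something the human failed (judge too lenient)
    false_pass = [r for r in records
                  if r["judge"] == "Pass" and r["human"] == "Fail"]
    # False Fail: judge failed something the human passed (judge too strict)
    false_fail = [r for r in records
                  if r["judge"] == "Fail" and r["human"] == "Pass"]
    return false_pass, false_fail


# Illustrative dev-set results (made-up data)
dev_results = [
    {"input": "a", "human": "Pass", "judge": "Pass"},
    {"input": "b", "human": "Fail", "judge": "Pass"},
    {"input": "c", "human": "Pass", "judge": "Fail"},
    {"input": "d", "human": "Fail", "judge": "Fail"},
]
false_pass, false_fail = split_disagreements(dev_results)
```

Reading the two lists separately makes it easier to tell whether the Fail definitions or the Pass definitions need work.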

Problem	Solution
TPR and TNR both low	Use a more capable LLM for the judge
One metric low, one acceptable	Inspect disagreements for the low metric specifically
Both plateau below target	Decompose the criterion into smaller, more atomic checks
Consistently wrong on certain input types	Add targeted few-shot examples from training set
Labels themselves seem inconsistent	Re-examine human labels; the rubric may need refinement

theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
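As a worked example of the correction formula above, the sketch below applies it to a single observed pass rate. The `tpr=0.90` and `tnr=0.85` values are hypothetical, chosen only for illustration; the function name is also an assumption. The result is clipped to [0, 1], as in the bootstrap code in this document.

```python
def corrected_success_rate(p_obs, tpr, tnr):
    """Rogan-Gladen correction: theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)."""
    denom = tpr + tnr - 1
    if abs(denom) < 1e-6:
        # TPR + TNR close to 1 means the judge carries no usable signal
        raise ValueError("TPR + TNR - 1 is ~0; the correction is undefined")
    theta = (p_obs + tnr - 1) / denom
    # Clip to a valid proportion
    return min(1.0, max(0.0, theta))


# Hypothetical judge accuracy: TPR = 0.90, TNR = 0.85; observed pass rate 0.80
theta_hat = corrected_success_rate(p_obs=0.80, tpr=0.90, tnr=0.85)
print(f"Corrected rate: {theta_hat:.3f}")  # (0.65 / 0.75) prints 0.867
```

Here the judge's false-fail rate outweighs its false-pass rate at this prevalence, so the corrected rate sits above the raw 0.80.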

import numpy as np

def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
    """Bootstrap 95% CI for corrected success rate."""
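    # Convert once, outside the loop, so each resample is a cheap index lookup
    h_all = np.asarray(human_labels)
    e_all = np.asarray(eval_labels)
    n = len(h_all)
    estimates = []
    for _ in range(n_bootstrap):
        # Resample (human, evaluator) label pairs with replacement
        idx = np.random.choice(n, size=n, replace=True)
        h = h_all[idx]
        e = e_all[idx]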

        tp = ((h == 'Pass') & (e == 'Pass')).sum()
        fn = ((h == 'Pass') & (e == 'Fail')).sum()
        tn = ((h == 'Fail') & (e == 'Fail')).sum()
        fp = ((h == 'Fail') & (e == 'Pass')).sum()

        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
        denom = tpr_b + tnr_b - 1

        if abs(denom) < 1e-6:
            continue
        theta = (p_obs + tnr_b - 1) / denom
        estimates.append(np.clip(theta, 0, 1))

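    if not estimates:
        # Every resample had TPR + TNR - 1 ~ 0, so no estimate could be formed
        raise ValueError("No valid bootstrap resamples; cannot compute a CI")
    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)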

lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")

from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")

Split	Size	Purpose	Rules
Training	10-20% (~10-20 examples)	Source of few-shot examples for the judge prompt	Only clear-cut Pass and Fail cases. Used directly in the prompt.
Dev	40-45% (~40-45 examples)	Iterative evaluator refinement	Never include in the prompt. Evaluate against repeatedly.
Test	40-45% (~40-45 examples)	Final unbiased accuracy measurement	Do NOT look at during development. Used once at the end.
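The rules in the table above only hold if no example leaks between splits. A quick check can catch that before any measurement is taken. This is a sketch under the assumption that each labeled example carries a unique identifier; the function name and the id lists are hypothetical.

```python
def check_no_overlap(train_ids, dev_ids, test_ids):
    """Raise ValueError if any example id appears in more than one split."""
    splits = {
        "train": set(train_ids),
        "dev": set(dev_ids),
        "test": set(test_ids),
    }
    names = list(splits)
    # Compare every pair of splits exactly once
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = splits[a] & splits[b]
            if shared:
                raise ValueError(f"{a} and {b} share ids: {sorted(shared)}")
    return True


# Illustrative ids (made-up data)
ok = check_no_overlap(train_ids=[1, 2], dev_ids=[3, 4], test_ids=[5, 6])
```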

Validate Evaluator

Overview

Prerequisites

Core Instructions

Step 1: Create Data Splits

Step 2: Run Evaluator on Dev Set

Step 3: Measure TPR and TNR

Step 4: Inspect Disagreements

Step 5: Iterate

Step 6: Final Measurement on Test Set

Step 7 (Optional): Estimate True Success Rate (Rogan-Gladen Correction)

Step 8: Confidence Interval

Practical Guidance

Anti-Patterns
