Train your ability to estimate uncertainty accurately using confidence interval and probability exercises. Use when you want to improve your estimation skills, check your calibration before a measurement project, or when another measure-anything skill suggests calibration. Tracks your calibration history over time in .measure/calibration.json.
Calibration is the ability to assess your own uncertainty accurately. A well-calibrated person who says "I'm 90% confident" is right about 90% of the time. Research shows most people start out overconfident — their 90% ranges contain the true answer only 50-60% of the time. This skill trains you to close that gap.
Check for the .measure/ directory and required files. Create anything missing:
If .measure/ does not exist, create it.
If .measure/config.json does not exist, create it with:
{
"default_decomposition_depth": 3,
"default_simulation_iterations": 10000,
"default_confidence_level": 0.90,
"created_at": "<current UTC timestamp>"
}
If .measure/calibration.json does not exist, create it with:
{
"summary": {
"total_questions_answered": 0,
"interval": { "total": 0, "hit_rate": null, "target_hit_rate": 0.90, "assessment": null, "trend": null },
"binary": { "total": 0, "brier_score": null, "assessment": null, "trend": null },
"last_session": null
},
"history": []
}
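The file checks above can be sketched in Python (a sketch only; `ensure_measure_files` is a hypothetical helper name, and the JSON contents mirror the defaults shown above):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def ensure_measure_files(root: Path) -> None:
    """Create .measure/ plus config.json and calibration.json if missing."""
    measure = root / ".measure"
    measure.mkdir(exist_ok=True)

    config = measure / "config.json"
    if not config.exists():
        config.write_text(json.dumps({
            "default_decomposition_depth": 3,
            "default_simulation_iterations": 10000,
            "default_confidence_level": 0.90,
            # Current UTC timestamp, per the template above
            "created_at": datetime.now(timezone.utc).isoformat(),
        }, indent=2))

    calibration = measure / "calibration.json"
    if not calibration.exists():
        calibration.write_text(json.dumps({
            "summary": {
                "total_questions_answered": 0,
                "interval": {"total": 0, "hit_rate": None,
                             "target_hit_rate": 0.90,
                             "assessment": None, "trend": None},
                "binary": {"total": 0, "brier_score": None,
                           "assessment": None, "trend": None},
                "last_session": None,
            },
            "history": [],
        }, indent=2))
```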
If .measure/question_bank/ does not exist, create it and copy the question files from .claude/skills/calibration-trainer/assets/interval_questions.json and .claude/skills/calibration-trainer/assets/binary_questions.json into .measure/question_bank/. When copying, add "asked": false to each question object.
Ask the user:
Would you like to practice interval estimation (90% confidence intervals), binary probability assessment, or both?
If no preference, default to both.
Briefly explain each mode: interval estimation asks for a 90% confidence range on a numeric question, while binary assessment asks for a probability (0-100%) that a statement is true.
Then ask:
How many questions would you like? (default: 10, minimum: 5)
Before presenting questions, check the available pool in .measure/question_bank/.
Count the unanswered questions (where "asked": false) for the selected mode(s).
If fewer unanswered questions remain than the user requested, ask:
"Only N unanswered [type] questions remain. Would you like to proceed with N, or reset the question pool? Resetting marks all questions as unanswered — your calibration history is preserved."
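The pool check above can be sketched as follows (`count_unanswered` is a hypothetical helper, and it assumes each question bank file holds a JSON array of question objects; the actual file structure may differ):

```python
import json
from pathlib import Path

def count_unanswered(bank_dir: Path, mode: str) -> int:
    """Count questions still marked "asked": false for one mode.

    Assumes files are named <mode>_questions.json (e.g.
    interval_questions.json) and contain a JSON array of objects.
    """
    path = bank_dir / f"{mode}_questions.json"
    questions = json.loads(path.read_text())
    return sum(1 for q in questions if not q.get("asked", False))
```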
When mode is "both," split questions roughly evenly between interval and binary.
This step is critical, especially for first-time users. Before presenting any interval questions, explain:
About 90% confidence intervals: For each question, give a LOW and HIGH value such that you are 90% confident the true answer falls between them. This means:
- Only about 1 in 10 answers should fall outside your range
- Most people naturally give ranges that are much too narrow (closer to 50% confidence)
- The equivalent bet test: Would you bet $1,000 at 9:1 odds that the answer is inside your range? If that bet feels risky, widen your range.
If the user has calibration history showing overconfidence (hit rate < 80%), add:
Your calibration history suggests your ranges tend to be too narrow. Try making your ranges about 30-50% wider than feels natural.
IMPORTANT: Present exactly ONE question per message, then STOP and wait for the user's response before presenting the next question. Do not list multiple questions. Do not batch questions. One question, one message, wait for the answer.
Do NOT reveal the correct answer after each question — wait until all questions are answered to avoid anchoring effects.
For each question, send a single message in this format:
Interval:
Question N of M: What is [question]?
Give your 90% confidence interval:
- Low (5th percentile — 95% chance the true answer is above this):
- High (95th percentile — 95% chance the true answer is below this):
Binary:
Question N of M: True or false: [statement]
What is your probability (0-100%) that this statement is true?
Then stop and wait for the user to respond. After they answer, present the next question. Repeat until all questions are answered.
Construct a JSON payload with the user's responses:
{
"interval_responses": [
{"question": "...", "answer": <true_answer>, "user_low": <low>, "user_high": <high>}
],
"binary_responses": [
{"statement": "...", "answer": <true/false>, "user_probability": <0-100>}
],
"history": [<previous sessions from .measure/calibration.json>]
}
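Assembling this payload can be sketched as (`build_payload` is a hypothetical helper; the response dicts use the field names shown above):

```python
import json
from pathlib import Path

def build_payload(interval_responses, binary_responses, cal_path):
    """Combine this session's responses with prior sessions from calibration.json."""
    history = json.loads(Path(cal_path).read_text())["history"]
    return {
        "interval_responses": interval_responses,
        "binary_responses": binary_responses,
        "history": history,
    }
```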
Run the scoring script:
echo '<json_payload>' | uv run python .claude/skills/calibration-trainer/scripts/score_calibration.py
Parse the JSON output.
Interval results:
Interval Score: N out of M correct (X% hit rate). Target: 90%.
Show average CI width and surprise index (how far misses were from the range).
Binary results:
Brier Score: X (0 = perfect, 0.25 = random guessing, lower is better)
Per-question reveal: Show each question with the correct answer, the user's response, and whether it was a hit or miss.
Trend (if calibration history exists):
Previous session: X% hit rate. This session: Y%. Over N sessions: [improving / stable / declining].
If this is the first session:
This is your first calibration session. We're establishing your baseline.
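For reference, the headline metrics above can be defined like this (a sketch; the scoring script is the source of truth and may compute additional statistics such as the surprise index):

```python
def interval_hit_rate(responses):
    """Fraction of intervals that contain the true answer (target: 0.90)."""
    hits = sum(1 for r in responses
               if r["user_low"] <= r["answer"] <= r["user_high"])
    return hits / len(responses)

def brier_score(responses):
    """Mean squared error between stated probability and outcome.

    0 = perfect; always answering 50% yields 0.25.
    """
    return sum((r["user_probability"] / 100 - (1 if r["answer"] else 0)) ** 2
               for r in responses) / len(responses)
```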
In .measure/question_bank/, mark each presented question as "asked": true.
Append a session entry to the .measure/calibration.json history array:
{
"session_id": "<from scoring output>",
"timestamp": "<from scoring output>",
"type": "<interval|binary|both>",
"questions": [<per-question details from scoring output>],
"session_hit_rate": <hit_rate or null>,
"brier_score": <brier_score or null>
}
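Persisting the session can be sketched as (`record_session` is a hypothetical helper; field names follow the schemas above):

```python
import json
from pathlib import Path

def record_session(cal_path, session_entry, n_questions):
    """Append the session entry to history and refresh the summary fields."""
    path = Path(cal_path)
    data = json.loads(path.read_text())
    data["history"].append(session_entry)
    data["summary"]["total_questions_answered"] += n_questions
    data["summary"]["last_session"] = session_entry["timestamp"]
    path.write_text(json.dumps(data, indent=2))
```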
Update the .measure/calibration.json summary:
- Increment total_questions_answered by the number of questions answered this session
- Set last_session to the new session timestamp
Based on the results, provide targeted advice:
If overconfident (hit rate < 80%):
To improve: Before finalizing each range, apply the equivalent bet test — would you bet $1,000 at 9:1 odds that the answer is inside your range? If not, widen it.
Also try thinking about reasons you might be wrong: "What if my assumptions are completely off? What would the answer be then?"
If underconfident (hit rate > 95%):
To improve: You know more than you think. Try starting with your best guess, adding a range that feels "just barely wide enough," then leaving it there instead of widening further.
If well-calibrated:
Your calibration is good. This means your uncertainty estimates on real measurement projects will be more reliable, and value-of-information calculations will be more accurate.
For deeper methodology, read references/scoring-methodology.md.