Measurement Validity Experimental Design Lens

Philosophical Mode: Psychometric
Primary Question: "Do measurements justify the interpretation?"
Focus: Metric-Construct Alignment, Proxy Validity, Reliability, Sensitivity, Consequential Validity

When to Use

Metrics may not measure what they claim to; proxy metrics used instead of true objectives
Evaluation scores treated as "truth" without validation
Metric choice is contested or under-specified
User invokes /exp-lens-measurement-validity or /make-experiment-diag measurement
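
One way to check whether a proxy metric tracks the objective it stands in for is to compare how the two rank the same set of systems. The sketch below computes a Spearman rank correlation using only the standard library. The function names, the example scores, and the idea of using human ratings as the reference are all assumptions made for illustration; they are not part of this skill.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: an automatic proxy score and a human rating per system.
proxy_scores = [0.91, 0.85, 0.60, 0.40]
human_ratings = [4.5, 4.0, 2.0, 3.0]
print(f"rank agreement: {spearman(proxy_scores, human_ratings):.2f}")
```

A low or negative correlation suggests the proxy orders systems differently from the reference, which is one signal that scores should not be read as "truth" without further validation. A high correlation on one dataset does not rule out proxy collapse under distribution shift.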

Critical Constraints

NEVER:

Modify any source code or experimental artifacts
Litter the codebase with useless comments, TODO markers, or explanatory annotations; the skill output and diagram speak for themselves

ALWAYS:

# Measurement Validity Analysis: {Experiment Name}

**Lens:** Measurement Validity (Psychometric)
**Question:** Do measurements justify the interpretation?
**Date:** {YYYY-MM-DD}
**Scope:** {What was analyzed}

## Metric Inventory

| Metric | Construct Claimed | Computation | Reliability | Sensitivity |
|--------|-------------------|-------------|-------------|-------------|
| {metric name} | {what it claims to measure} | {aggregation/formula} | {stable / unstable / unknown} | {high / low / saturated} |

## Validity Arguments

### {Metric Name}

**Construct claimed:** {The property this metric is presented as measuring}

**Evidence for alignment:**
- {Supporting argument or citation}

**Evidence against alignment / known failure modes:**
- {Failure mode 1: e.g., gameable by surface pattern matching}
- {Failure mode 2: e.g., proxy collapses when distribution shifts}

**Reliability assessment:** {Stable under reruns? Sensitive to seed?}

**Sensitivity assessment:** {Can it distinguish meaningful differences in the relevant range?}

**Verdict:** {Strong / Partial / Weak / Unsupported}

---

## Proxy Collapse Risks

| Metric | Proxy For | Collapse Condition | Consequence |
|--------|-----------|--------------------|-------------|
| {metric} | {true construct} | {when proxy diverges from construct} | {what is falsely concluded} |

## Gaming Vulnerabilities

| Metric | Gaming Strategy | Detection Method |
|--------|----------------|-----------------|
| {metric} | {how to maximize score without improving construct} | {how to detect gaming} |

## Optional: Metric-Construct Mapping Diagram

{Include only if it clarifies alignment; omit if argument tables are sufficient}

```mermaid
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart LR
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef integration fill:#c62828,stroke:#ef9a9a,stroke-width:2px,color:#fff;

    subgraph Constructs ["Intended Constructs"]
        C1["Construct A ━━━━━━━━━━ The property claimed"]
        C2["Construct B ━━━━━━━━━━ Another property"]
    end

    subgraph Metrics ["Measured Metrics"]
        M1["Metric X ━━━━━━━━━━ Computation method"]
        M2["Metric Y ━━━━━━━━━━ Computation method"]
    end

    subgraph Gaps ["Weak / Missing Alignments"]
        G1["Proxy Collapse Risk ━━━━━━━━━━ Condition for divergence"]
    end

    %% ALIGNMENTS %%
    C1 -->|"strong alignment"| M1
    C2 -->|"proxy (weak)"| M2
    M2 -.->|"diverges under"| G1

    %% CLASS ASSIGNMENTS %%
    class C1,C2 cli;
    class M1,M2 output;
    class G1 gap;
```

| Color | Category | Description |
|-------|----------|-------------|
| Dark Blue | Construct | Intended properties being claimed |
| Dark Teal | Metric | What is actually computed and reported |
| Yellow | Gap | Weak alignment, proxy collapse, or missing evidence |
| Orange | Proxy | Intermediate proxy relationships |
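
The Reliability and Sensitivity columns of the Metric Inventory can be filled in from rerun data. The following is a minimal sketch of one possible heuristic, assuming scores on a 0 to 1 scale and several reruns per system. The function names, the 5% coefficient-of-variation threshold, the saturation margin, and the "spread greater than twice the seed noise" rule are all illustrative assumptions, not values prescribed by this lens.

```python
from statistics import mean, stdev


def reliability_label(scores_by_seed, max_cv=0.05):
    """Label rerun stability using the coefficient of variation (assumed threshold)."""
    if len(scores_by_seed) < 2:
        return "unknown"  # a single run says nothing about stability
    avg = mean(scores_by_seed)
    if avg == 0:
        return "unknown"  # coefficient of variation is undefined at zero mean
    cv = stdev(scores_by_seed) / abs(avg)
    return "stable" if cv <= max_cv else "unstable"


def sensitivity_label(system_means, seed_noise, ceiling=1.0, saturation_margin=0.02):
    """Label whether the metric can separate systems in the relevant range."""
    # Saturated: every system sits within the margin of the metric's ceiling.
    if all(ceiling - score <= saturation_margin for score in system_means):
        return "saturated"
    # Otherwise compare the spread between systems with the rerun noise.
    spread = max(system_means) - min(system_means)
    return "high" if spread > 2 * seed_noise else "low"


# Hypothetical reruns of one system under three seeds.
print(reliability_label([0.80, 0.81, 0.79]))
# Hypothetical per-system means, with a seed-level standard deviation of 0.01.
print(sensitivity_label([0.990, 0.995, 0.985], seed_noise=0.01))
```

Labels produced this way are only a starting point for the validity argument; the thresholds should be justified for the metric and range under analysis rather than reused as-is.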

Analysis Workflow

Step 1: Launch Parallel Exploration Subagents

Step 2: Build Validity Arguments

Step 3: Analyze Metric-Construct Alignment

Step 4: Optional Metric-Construct Mapping Diagram

Step 5: Write Output

Output Template

Summary Verdict
