Validate data and analysis before sharing - methodology, accuracy, bias, and data quality checks
Validates data files and analyses for accuracy, methodology, and potential biases before sharing with stakeholders, and generates a confidence assessment with improvement suggestions.
Cowork note: If relative paths don't resolve, call mcp__qsv__qsv_get_working_dir and mcp__qsv__qsv_set_working_dir to sync the working directory.
The input can be a data file, an analysis, or both.
If a data file is provided, run these checks using qsv:
a. Index and profile: Run mcp__qsv__qsv_index, then mcp__qsv__qsv_stats with cardinality: true and stats_jsonl: true to understand the data. Run mcp__qsv__qsv_sniff to verify format and encoding.
b. Completeness: Read .stats.csv — check nullcount and sparsity for each column. Flag columns with sparsity > 0.5.
c. Uniqueness: Compare cardinality to row count from mcp__qsv__qsv_count. Flag key columns (ID, email) where cardinality < row count. Run mcp__qsv__qsv_command with command: "dedup" and options: {"dupes-output": "dupes.csv"} to find exact duplicates.
d. Validity: Check type column in stats — flag String columns that should be numeric. Run mcp__qsv__qsv_command with command: "validate" against a JSON Schema if available.
e. Consistency: Run mcp__qsv__qsv_frequency with limit: 20 on categorical columns — look for case variants ("NYC" vs "nyc"), inconsistent formats, unexpected values.
f. Accuracy: Read .stats.csv for min, max, mean, stddev — flag implausible ranges (negative ages, latitude > 90, future dates). Run mcp__qsv__qsv_moarstats with advanced: true — check outliers_percentage > 5%, kurtosis > 10 (extreme outliers).
g. Distribution sanity: Read moarstats columns for deeper validation:
- median_mean_ratio — if < 0.8 or > 1.2, distribution is significantly skewed; verify the mean isn't misleading
- winsorized_mean_25pct vs mean — large divergence (> 10%) confirms outliers are distorting the average
- mad (median absolute deviation) — more robust than stddev for outlier detection; if mad_stddev_ratio > 0.8, stddev is reasonably reliable
- jarque_bera_pvalue — if < 0.05, data is NOT normally distributed; flag any analysis that assumes normality
- mode_count — if mode accounts for > 50% of values, investigate whether this reflects a data entry default or missing value masking

h. Join integrity (if multiple files): Run mcp__qsv__qsv_joinp with left_anti: true to find orphaned foreign keys.
i. Injection screening: Run mcp__qsv__qsv_command with command: "searchset", regexset-file: "${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt", and flag: "injection_match" to scan for malicious payloads.
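The distribution-sanity thresholds in step g can be sketched in plain Python. The numbers below are hypothetical; in practice these figures come straight from qsv's moarstats output rather than being recomputed, and the winsorization shown (clamping to the 25th/75th cut values) is one common convention.

```python
import statistics

def distribution_flags(values, winsor_pct=0.25):
    """Recompute the step-g distribution checks for one numeric column."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)

    # median/mean ratio — outside [0.8, 1.2] suggests a skewed distribution
    median_mean_ratio = median / mean

    # winsorized mean: clamp the lowest/highest 25% of values to the cut points
    s = sorted(values)
    k = int(len(s) * winsor_pct)
    lo, hi = s[k], s[-k - 1]
    winsorized_mean = statistics.fmean(min(max(v, lo), hi) for v in s)

    # MAD: median absolute deviation, robust to outliers (unlike stddev)
    mad = statistics.median(abs(v - median) for v in values)

    return {
        "skewed": not (0.8 <= median_mean_ratio <= 1.2),
        "outliers_distort_mean": abs(winsorized_mean - mean) / mean > 0.10,
        "mad_stddev_ratio": mad / stdev,
    }

# A column of values with one extreme outlier (hypothetical data)
flags = distribution_flags([40, 42, 45, 48, 50, 52, 55, 1000])
print(flags)
```

With the outlier present, both the skew flag and the distorted-mean flag fire, while the low MAD-to-stddev ratio shows stddev is dominated by the single extreme value.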
Examine the analysis for methodological soundness — approach, data selection, and metric definitions — and systematically review against these pitfalls:
| Pitfall | How to Detect with qsv | Red Flag |
|---|---|---|
| Join explosion | mcp__qsv__qsv_count before and after join | Row count increased after join |
| Survivorship bias | mcp__qsv__qsv_frequency on status/lifecycle columns | Missing churned/deleted/failed entities |
| Incomplete period | mcp__qsv__qsv_sqlp to check date ranges | Partial periods compared to full periods |
| Denominator shifting | mcp__qsv__qsv_sqlp to verify denominator consistency | Definition changed between periods |
| Average of averages | mcp__qsv__qsv_sqlp to recalculate from raw data | Pre-aggregated averages with unequal group sizes |
| Selection bias | mcp__qsv__qsv_frequency on segment definitions | Segments defined by the outcome being measured |
Spot-check using mcp__qsv__qsv_sqlp:

```sql
SELECT SUM(subtotal) AS check_total FROM data;
SELECT SUM(pct) FROM data;
```

| Metric Type | Sanity Check via qsv |
|---|---|
| Counts | mcp__qsv__qsv_count — does it match known figures? |
| Sums/averages | mcp__qsv__qsv_stats — are min/max/mean in plausible range? |
| Rates | mcp__qsv__qsv_sqlp — are values between 0% and 100%? |
| Distributions | mcp__qsv__qsv_frequency — do segment percentages sum to ~100%? |
| Growth rates | mcp__qsv__qsv_sqlp — is 50%+ MoM growth realistic? |
| Outliers | mcp__qsv__qsv_moarstats — outliers_percentage, kurtosis |
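Once the figures are extracted from the qsv output, the table's sanity checks reduce to simple arithmetic. A minimal sketch with hypothetical numbers:

```python
def sanity_check(rates, segment_pcts, expected_total, computed_total):
    """Flag basic metric-sanity violations; returns a list of issue strings."""
    issues = []
    # Rates must fall between 0% and 100%
    if any(not 0.0 <= r <= 1.0 for r in rates):
        issues.append("rate outside [0, 1]")
    # Segment percentages should sum to ~100% (allow rounding slack)
    if abs(sum(segment_pcts) - 100.0) > 1.0:
        issues.append("segment percentages do not sum to ~100%")
    # Reported total should match the recomputed sum of subtotals
    if abs(expected_total - computed_total) > 1e-6:
        issues.append("total does not match sum of subtotals")
    return issues

issues = sanity_check(
    rates=[0.12, 0.95, 1.40],          # a 140% conversion rate is impossible
    segment_pcts=[40.0, 35.0, 15.0],   # sums to 90% — a segment is missing
    expected_total=1000.0,
    computed_total=940.0,              # subtotals don't reach the headline total
)
print(issues)
```

All three hypothetical inputs are deliberately broken, so all three checks fire.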
If the analysis includes charts, review whether:
Provide specific, actionable suggestions:
Rate the analysis on a 3-level scale:
- **Ready to share** — Analysis is methodologically sound, calculations verified, caveats noted. Minor suggestions for improvement but nothing blocking.
- **Share with caveats** — Analysis is largely correct but has specific limitations or assumptions that must be communicated to stakeholders. List the required caveats.
- **Needs revision** — Found specific errors, methodological issues, or missing analyses that should be addressed before sharing. List the required changes in priority order.
Quick reference:

- mcp__qsv__qsv_sniff to verify format and encoding
- nullcount/sparsity in .stats.csv
- mcp__qsv__qsv_count before/after joins, dedup --dupes-output
- mcp__qsv__qsv_count after joins
- mcp__qsv__qsv_sqlp
- min/max in .stats.csv
- mcp__qsv__qsv_sqlp to check period-over-period

**Join explosion:** A many-to-many join silently multiplies rows, inflating counts and sums. Detect: mcp__qsv__qsv_count before and after join — if count increased, investigate the join relationship. Prevent: Use COUNT(DISTINCT id) instead of COUNT(*) when counting entities through joins.
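The join-explosion effect is easy to reproduce in miniature. A sketch with hypothetical orders and shipments tables, standing in for the before/after mcp__qsv__qsv_count comparison:

```python
# Orders joined to a shipments table where one order has two shipment rows
orders = [{"id": 1, "amount": 100}, {"id": 2, "amount": 200}]
shipments = [{"order_id": 1, "carrier": "A"},
             {"order_id": 1, "carrier": "B"},   # second match — the row multiplies
             {"order_id": 2, "carrier": "A"}]

joined = [{**o, **s} for o in orders for s in shipments if s["order_id"] == o["id"]]

rows_before, rows_after = len(orders), len(joined)     # 2 vs 3: explosion detected
naive_total = sum(r["amount"] for r in joined)         # order 1 counted twice
distinct_ids = len({r["id"] for r in joined})          # the COUNT(DISTINCT id) fix

print(rows_before, rows_after, naive_total, distinct_ids)
```

The naive sum over the joined rows inflates to 400 even though the true order total is 300, while counting distinct ids still returns the correct entity count.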
**Survivorship bias:** Analyzing only entities that exist today, ignoring churned/deleted/failed ones. Detect: mcp__qsv__qsv_frequency on status columns — are all lifecycle states represented? Prevent: Ask "who is NOT in this dataset?" before drawing conclusions.
**Incomplete period:** Comparing a partial period to a full period. Detect: mcp__qsv__qsv_sqlp to check min/max dates per period. Prevent: Filter to complete periods or compare same number of days.
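The partial-period check can be sketched in a few lines: take the max date per period (as a qsv_sqlp query would report it) and test whether it reaches the end of its month. The dates below are hypothetical.

```python
from datetime import date
import calendar

def period_complete(max_date: date) -> bool:
    """A month is only comparable if its data runs through the last day."""
    last_day = calendar.monthrange(max_date.year, max_date.month)[1]
    return max_date.day == last_day

# max(date) per period, e.g. from: SELECT MAX(date) FROM data GROUP BY month
print(period_complete(date(2024, 1, 31)))  # full January — safe to compare
print(period_complete(date(2024, 2, 18)))  # February cut off mid-month — flag it
```

calendar.monthrange handles month lengths and leap years, so February 29 in a leap year counts as complete.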
**Denominator shifting:** The denominator changes between periods, making rates incomparable. Detect: mcp__qsv__qsv_sqlp to verify denominator definition consistency. Prevent: Use consistent definitions across all compared periods.
**Average of averages:** Averaging pre-computed averages gives wrong results when group sizes differ. Detect: Compare mcp__qsv__qsv_stats mean against mcp__qsv__qsv_sqlp weighted average. Prevent: Always aggregate from raw data.
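The divergence between an average of averages and the true mean shows up whenever group sizes differ. A minimal sketch with hypothetical scores for two unequal groups:

```python
from statistics import fmean

# Two groups with very different sizes (hypothetical scores)
groups = {"A": [90] * 2, "B": [60] * 8}

# Wrong: averaging the pre-computed group means — (90 + 60) / 2
avg_of_avgs = fmean(fmean(v) for v in groups.values())

# Right: aggregating from the raw data, which weights by group size
weighted = fmean(x for v in groups.values() for x in v)

print(avg_of_avgs, weighted)
```

The unweighted figure (75.0) overstates the true mean (66.0) because the small high-scoring group gets the same weight as a group four times its size.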
**Simpson's paradox:** Trend reverses when data is aggregated vs. segmented. Detect: mcp__qsv__qsv_sqlp GROUP BY at different granularity levels — does the conclusion change? Prevent: Always check results at segment level before aggregating.
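The reversal can be reproduced with a tiny hypothetical A/B dataset: variant X wins in both segments, yet loses in aggregate because the two variants have opposite traffic mixes.

```python
# (conversions, visitors) per (variant, segment) — hypothetical A/B data
data = {
    ("X", "mobile"): (50, 1000), ("X", "desktop"): (20, 100),
    ("Y", "mobile"): (4, 100),   ("Y", "desktop"): (180, 1000),
}

def rate(pairs):
    """Pooled conversion rate over a list of (conversions, visitors) pairs."""
    return sum(c for c, _ in pairs) / sum(v for _, v in pairs)

# Segment level: X wins in BOTH segments
x_mobile = rate([data[("X", "mobile")]])    # 0.05 vs Y's 0.04
y_mobile = rate([data[("Y", "mobile")]])
x_desktop = rate([data[("X", "desktop")]])  # 0.20 vs Y's 0.18
y_desktop = rate([data[("Y", "desktop")]])

# Aggregate level: the trend reverses — Y looks better overall
x_overall = rate([data[("X", "mobile")], data[("X", "desktop")]])
y_overall = rate([data[("Y", "mobile")], data[("Y", "desktop")]])

print(x_mobile > y_mobile, x_desktop > y_desktop, x_overall > y_overall)
```

This is exactly what the GROUP-BY-at-different-granularities check catches: the segment-level and aggregate-level conclusions disagree, so neither can be reported without explaining the mix shift.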
## Validation Report
### Overall Assessment: [Ready to share | Share with caveats | Needs revision]
### Data Quality Summary
- File: [format, rows, columns, encoding]
- Completeness: [null rates, gaps found]
- Uniqueness: [duplicates found, cardinality issues]
- Validity: [type mismatches, schema violations]
- Accuracy: [outliers, implausible ranges]
### Methodology Review
[Findings about approach, data selection, definitions]
### Issues Found
1. [Severity: High/Medium/Low] [Issue description and impact]
2. ...
### Calculation Spot-Checks
- [Metric]: [Verified / Discrepancy found]
- ...
### Visualization Review
[Any issues with charts or visual presentation]
### Suggested Improvements
1. [Improvement and why it matters]
2. ...
### Required Caveats for Stakeholders
- [Caveat that must be communicated]
- ...
If you only need a profile of the data, run /data-profile instead. If data quality issues are found, run /data-clean to fix them and re-validate.