Name: Experiment Audit
Author: ZhangHanbo

Experiment Audit | Skills Pool

python scripts/audit_stats.py <exp_dir> --venue RSS

PYTHONPATH=src python -c "
from alpha_review.apis import search_all
import json, sys
# Search for methods the paper didn't cite
results = search_all(sys.argv[1], limit_per_source=15, year_lo=2022)
results.sort(key=lambda r: r.get('citationCount', 0), reverse=True)
print(json.dumps([{
    'title': r['title'],
    'year': r['year'],
    'cites': r.get('citationCount', 0),
    'venue': r.get('venue', ''),
} for r in results[:10]], indent=2))
" "<problem_keywords> <approach_keywords>"

Pattern	Detection rule
Generality overclaim	Claim scope > test scope by >1 level. E.g., "manipulation" claimed, 3 objects tested.
Novelty overclaim	Claim is "novel framework" but actual delta is "application to new domain".
Comparison overclaim	Claim is "outperforms all baselines" but the paper reports only the metrics on which its own method wins.
Learning overclaim	"Robot learns to X" when robot executes a policy trained on X (attribution of agency).
Robustness overclaim	"Robust to perturbations" with only 3 perturbation types tested.
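
The generality rule above ("claim scope > test scope by >1 level") can be read as a comparison on an ordered scale. The sketch below is only an illustration of that reading: the `SCOPE_LEVELS` ladder and the function name `is_generality_overclaim` are hypothetical and do not come from the skill or from review_guideline.md, so the actual levels used by a reviewer may differ.

```python
# Hypothetical scope ladder, ordered narrowest to broadest.
# These level names are an assumption for illustration only.
SCOPE_LEVELS = ["instance", "object set", "object category", "task family", "domain"]


def is_generality_overclaim(claimed: str, tested: str) -> bool:
    """Return True when the claimed scope exceeds the tested scope by more than one level."""
    gap = SCOPE_LEVELS.index(claimed) - SCOPE_LEVELS.index(tested)
    return gap > 1


# "manipulation" claimed (mapped here to "domain"), 3 objects tested (mapped to "object set").
print(is_generality_overclaim("domain", "object set"))
# A claim one level above what was tested is not flagged by this rule.
print(is_generality_overclaim("object category", "object set"))
```

Mapping a paper's wording onto a level remains a judgment call; the code only makes the ">1 level" comparison explicit.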

PYTHONPATH=src python -c "
from alpha_research.records.jsonl import append_record
from pathlib import Path
import json, sys
rid = append_record(Path(sys.argv[1]), 'audit', json.loads(sys.stdin.read()))
print(rid)
" "<project_dir>" <<< '<audit_json>'

{
  "input": "arxiv:2501.12345" or "logs/minimal_run/",
  "venue_target": "RSS",
  "statistical_audit": {
    "trials_per_condition": 8,
    "mean_success_rate": 0.62,
    "std_across_seeds": 0.11,
    "ci_95": [0.51, 0.73],
    "venue_threshold_met": false,
    "severity": "serious"
  },
  "baselines": {
    "present": ["BC", "DiffusionPolicy"],
    "missing_by_category": {
      "simple": false,
      "sota": true,
      "oracle": true
    },
    "strongest_missing": {
      "name": "RT-2 fine-tuned",
      "year": 2024,
      "rationale": "Published 8 months before submission, same task class, cited 200+ times"
    }
  },
  "ablation": {
    "claimed_contribution": "tactile feedback",
    "ablation_row_present": true,
    "performance_drop_without": 0.02,
    "isolation_verdict": "weak",
    "backward_trigger": "t15"
  },
  "overclaiming": [
    {
      "pattern": "generality overclaim",
      "claimed": "manipulation of deformable objects",
      "tested": "3 cloth samples in 1 environment",
      "severity": "serious"
    }
  ],
  "overall_verdict": "serious weaknesses — 2 findings",
  "human_review_required": false
}
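
The `statistical_audit` block above reports a success rate with a 95% interval. The source does not show how `scripts/audit_stats.py` computes it, so the following is a stand-in sketch rather than the script's method: it uses the Wilson score interval for a binomial proportion, and the function name `wilson_ci_95` and its inputs (pooled success and trial counts) are assumptions.

```python
import math


def wilson_ci_95(successes: int, trials: int) -> tuple[float, float]:
    """Wilson score interval (95%) for a success rate from pooled trial counts."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = 1.96  # two-sided 95% normal quantile
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example: 5 successes in 8 trials for one condition.
low, high = wilson_ci_95(5, 8)
print(f"success rate {5 / 8:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

With few trials per condition the interval is wide, which is the practical reason a low trial count is flagged against the venue threshold.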

Venue	Min trials/condition	CI required?	Real robot?
IJRR, T-RO, RSS, CoRL	≥ 20	Yes	Yes (strongly)
RA-L	≥ 15	Yes	Yes
ICRA, IROS	≥ 10	Preferred	Preferred
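
As a rough illustration, the thresholds in the table above could be encoded as a lookup so that a field like `venue_threshold_met` is derived mechanically. The numbers are copied from the table; the `VenueThreshold` class, the `VENUE_THRESHOLDS` dict, and the helper name are hypothetical and are not claimed to match `scripts/audit_stats.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VenueThreshold:
    min_trials: int    # minimum trials per condition
    ci_required: str   # "Yes" or "Preferred", as in the table
    real_robot: str    # "Yes (strongly)", "Yes", or "Preferred"


# Values copied from the venue table; the data layout is an assumption.
VENUE_THRESHOLDS = {
    **{v: VenueThreshold(20, "Yes", "Yes (strongly)") for v in ("IJRR", "T-RO", "RSS", "CoRL")},
    "RA-L": VenueThreshold(15, "Yes", "Yes"),
    **{v: VenueThreshold(10, "Preferred", "Preferred") for v in ("ICRA", "IROS")},
}


def venue_threshold_met(venue: str, trials_per_condition: int) -> bool:
    """True when the trial count reaches the venue's minimum from the table."""
    return trials_per_condition >= VENUE_THRESHOLDS[venue].min_trials


# 8 trials per condition against the RSS minimum of 20.
print(venue_threshold_met("RSS", 8))
```

An unknown venue raises `KeyError` here, which seems preferable to silently applying a default threshold.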

Experiment Audit

When to use

Venue-calibrated thresholds (review_plan.md §1.6)

Process

Step 1 — Identify the input

Step 2 — Run deterministic statistical audit

Step 3 — Check baseline strength

Step 4 — Name the strongest MISSING baseline

Step 5 — Check ablation isolation

Step 6 — Detect overclaiming patterns (review_guideline.md §3.5.3)

Step 7 — Persist

Output format

Honesty protocol

References
