Classify a paper's demonstrated capability into the three-tier frontier (reliable / sometimes / can't_yet) defined in research_guideline.md §5.1 Axis 3. This is a small skill invoked in a loop by the frontier-mapping pipeline, once per paper: given a single paper's task chain, reported success rate, and scope of evaluation, it classifies the demonstrated capability into one of the three tiers. The skill makes NO tool calls; it is pure reasoning on the provided input.
Input (one paper per invocation):

```json
{
  "paper_id": "arxiv:2501.12345",
  "task_chain": {
    "task": "Contact-rich peg-in-hole insertion of 3 object types",
    "problem": "...",
    "challenge": "...",
    "approach": "..."
  },
  "reported_results": {
    "success_rate": 0.85,
    "n_objects_tested": 3,
    "n_environments": 1,
    "real_robot": true,
    "perturbation_tested": false,
    "ablation_strong": true
  },
  "venue": "RSS",
  "year": 2025,
  "domain": "contact-rich manipulation"
}
```
Work through these questions in order:

1. Is there a demonstrated success rate in a realistic setting? If no → can't_yet.
2. Does the demonstration span diverse objects / environments / conditions? If no → sometimes (at best).
3. Is the reported success rate ≥ 90% with confidence intervals? If no → sometimes (or can't_yet if < 50%).
4. Has the capability been independently reproduced by other groups? If yes → reliable is possible; if no → sometimes.
5. Are there known "can't-yet" capabilities from research_guideline.md §5.1 Axis 3 that overlap with this task? If yes → sometimes or can't_yet unless the evidence is overwhelming.

Output:

```json
{
  "paper_id": "arxiv:2501.12345",
  "capability_description": "Contact-rich peg-in-hole insertion (3 peg types)",
  "tier": "sometimes",
  "evidence": [
    "Success rate 85% on 3 objects, single environment (§5.1 of paper)",
    "No perturbation tests reported",
    "Single-group result, no independent reproduction"
  ],
  "rationale": "Reported in a realistic setting with reasonable success, but the demonstration scope is narrow: 3 objects, 1 environment, no perturbation tests. Meets the 'sometimes' criteria but falls short of 'reliable', which requires diverse conditions and independent reproduction.",
  "confidence": "high",
  "boundary_notes": "If the authors extend evaluation to 10+ objects with perturbation tests, upgrade to borderline reliable."
}
```
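The decision procedure above can be sketched in Python. This is a minimal illustration, not part of the skill itself (the skill is pure reasoning, not code). The field names follow the input example; the `independently_reproduced` field and the `known_cant_yet_overlap` flag are NOT in the input schema and stand in for checks the classifier performs by reading the paper and the guideline. The diversity thresholds (10+ objects, 2+ environments) echo the boundary_notes example and are otherwise illustrative.

```python
def classify_tier(reported_results, known_cant_yet_overlap=False):
    """Sketch of the §5.1 Axis 3 decision procedure (illustrative only)."""
    r = reported_results

    # Q1: no demonstrated success rate in a realistic setting -> can't_yet.
    if r.get("success_rate") is None or not r.get("real_robot"):
        return "can't_yet"

    # Q3 (lower bound): a success rate below 50% is can't_yet outright.
    if r["success_rate"] < 0.5:
        return "can't_yet"

    # Q2: narrow scope (few objects, one environment, no perturbation
    # tests) caps the tier at "sometimes". Thresholds are assumptions.
    diverse = (
        r.get("n_objects_tested", 0) >= 10
        and r.get("n_environments", 0) >= 2
        and r.get("perturbation_tested", False)
    )

    # Q3-Q5: "reliable" additionally needs >= 90% success, independent
    # reproduction, and no overlap with known can't-yet capabilities.
    if (
        diverse
        and r["success_rate"] >= 0.9
        and r.get("independently_reproduced", False)
        and not known_cant_yet_overlap
    ):
        return "reliable"

    return "sometimes"
```

Running this on the example input above yields "sometimes" for the same reason the example rationale gives: 85% success on 3 objects in 1 environment passes the realistic-setting gate but fails the diversity gate.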
tier is one of "reliable", "sometimes", "can't_yet".
confidence indicates how confident the classification is:

- high — clear-cut case, fits the heuristics well
- medium — borderline between two tiers
- low — insufficient data to classify; caller should flag for human review

You CAN classify with reasonable confidence based on the reported scope and success rate alone. This is a LIGHTWEIGHT skill: the frontier-mapping pipeline calls you many times and aggregates the results. It is not a substitute for the researcher's own judgment of the field.
You CANNOT assess:
When in doubt between two tiers, pick the more conservative (can't_yet
or sometimes rather than reliable). False-positive "reliable" claims
are more damaging to the frontier map than conservative "sometimes"
claims.
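The conservative tie-break can be stated mechanically: order the tiers from most to least conservative and take the lower of the two candidates. A tiny sketch (the ordering is from the tier definitions; the helper name is hypothetical):

```python
# Tiers ordered from most conservative to least conservative.
TIER_ORDER = ["can't_yet", "sometimes", "reliable"]

def more_conservative(tier_a, tier_b):
    """Return whichever of two candidate tiers sits lower in TIER_ORDER."""
    return min(tier_a, tier_b, key=TIER_ORDER.index)
```

So a borderline reliable/sometimes call resolves to "sometimes", and a borderline sometimes/can't_yet call resolves to "can't_yet".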
- guidelines/doctrine/research_guideline.md §5.1 Axis 3 — capability frontier (primary source for tier definitions)
- guidelines/doctrine/research_guideline.md §1.5 — the long tail of the physical world (context for why "sometimes" is a large category)