Bio/medical/scientific evidence hierarchy and anti-hallucination rules. Use when conducting claim-heavy medical research, genomics interpretation, supplement evaluation, pharmacogenomics, or clinical evidence synthesis. NOT for casual health questions, software engineering, or physics. Companion to researcher skill.
Domain-specific guardrails for scientific research. Use alongside researcher for the workflow; this skill provides the evidence hierarchy, anti-hallucination rules, and bio-specific failure modes.
Citation requirement: Every non-trivial factual claim needs a resolvable citation (DOI, PMID, ClinicalTrials.gov ID, or official URL). If you can't cite it, label "UNCITED."
No fake citations: Never invent paper titles, authors, journals, or numbers. If you can't find the paper, say so.
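As a format-level sanity check, the identifier shapes above (DOI, PMID, ClinicalTrials.gov NCT ID) can be validated mechanically. A sketch, with illustrative patterns and function names — a well-formed identifier can still point at a paper that doesn't exist, so this never replaces actually resolving it:

```python
import re

# Format-level check only: validates the shape of an identifier,
# not whether it resolves to a real paper or trial.
PATTERNS = {
    "DOI": re.compile(r"10\.\d{4,9}/\S+"),
    "PMID": re.compile(r"\d{1,8}"),
    "NCT": re.compile(r"NCT\d{8}"),
}

def citation_label(kind: str, value: str) -> str:
    """Return 'KIND:value' if the format is plausible, else 'UNCITED'."""
    pattern = PATTERNS.get(kind)
    if pattern and pattern.fullmatch(value):
        return f"{kind}:{value}"
    return "UNCITED"
```

Anything that fails the check (or can't be resolved downstream) gets the "UNCITED" label rather than a guessed identifier.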
Separate evidence layers: Keep layers (a)–(g) strictly distinct. NEVER let the mechanistic layers (a–c) substitute for the clinical layers (d–g). Say explicitly: "Mechanistic evidence only; no human clinical trial confirms this."
Quantify uncertainty: Effect sizes need CIs or ranges. State population, comparator, and timeframe. For genetic associations: OR + CI + population + minor allele frequency (MAF).
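The OR + CI requirement can be made concrete with the standard Woolf (log) method on a 2×2 table. A minimal sketch; assumes no zero cells and uses z = 1.96 for a 95% CI:

```python
import math

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and CI from a 2x2 table:
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    Woolf log method; assumes no zero cells.
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi
```

Reporting the interval (not just the point estimate) is what lets a reader see whether an association like OR 2.1 is tightly estimated or barely distinguishable from 1.0 — and the population and MAF still have to be stated alongside it.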
Genetic claims: Distinguish GWAS association vs functional validation vs clinical actionability. State penetrance. "Associated with" ≠ "causes." Single-SNP interpretation of polygenic traits is usually misleading. PGx claims need CPIC/DPWG level.
Dosing guardrails: Rx = guideline ranges only ("discuss with prescriber"). OTC/supplements = evidence-based ranges if cited. Genotype→dose only with CPIC/DPWG-level evidence, otherwise label INFERENCE.
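A minimal sketch of the gating rule above: genotype-informed dosing passes through only at CPIC/DPWG-grade evidence, and everything else gets the INFERENCE label. The level strings and function names here are illustrative, not an official API:

```python
# Illustrative evidence bar; real gating should read the actual
# CPIC/DPWG level from the source, not a hardcoded set.
ACTIONABLE_LEVELS = {"CPIC A", "CPIC B", "DPWG actionable"}

def dose_claim(gene: str, level: str, recommendation: str) -> str:
    """Emit a genotype->dose statement only at CPIC/DPWG-grade evidence;
    otherwise demote it to an explicitly labeled INFERENCE."""
    if level in ACTIONABLE_LEVELS:
        return f"{gene}: {recommendation} [{level}]"
    return (f"{gene}: INFERENCE (evidence level {level!r} below the "
            f"CPIC/DPWG bar) -- {recommendation}")
```

For Rx compounds the `recommendation` string should itself stay within guideline ranges and carry the "discuss with prescriber" framing.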
Grade every claim:
| Grade | Type | Notes |
|---|---|---|
| 1 | Clinical guideline / consensus | NICE, WHO, AAD, CPIC, DPWG |
| 2 | Systematic review / meta-analysis | Cochrane, PRISMA-compliant |
| 3 | Well-powered RCT | Pre-registered, independent, adequate N |
| 4 | Small / pilot RCT | Underpowered, often industry-funded |
| 5 | Large observational / cohort | Adjusted, replicated |
| 6 | GWAS / genetic association | Report OR, CI, population, replication |
| 7 | Animal model | Species, dose, route — note translatability |
| 8 | In vitro / cell culture | Note concentration vs physiological |
| 9 | Case report / expert opinion | Lowest weight |
Always note: conflicts of interest (COI), replication status, sample size, population match, effect size (NNT, ARR, or Cohen's d when available).
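One way to keep the grade and its required metadata attached to each claim is a small record type mirroring the table above. A sketch — the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class GradedClaim:
    """A single claim with its grade (1-9 per the hierarchy table)
    and the metadata the checklist requires."""
    text: str
    grade: int             # 1 = guideline .. 9 = case report / opinion
    citation: str          # DOI/PMID/NCT ID, or "UNCITED"
    effect_size: str = ""  # e.g. "OR 1.4 (95% CI 1.1-1.8)"
    population: str = ""
    coi_noted: bool = False
    replicated: bool = False

    def __post_init__(self):
        if not 1 <= self.grade <= 9:
            raise ValueError("grade must be 1-9 per the hierarchy table")
```

Forcing every claim through a structure like this makes missing fields (no citation, no population, no effect size) visible instead of silently dropped.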
You may reason from first principles, but MUST label it INFERENCE.
Any INFERENCE must include:
Three buckets in every output:
Before promoting any new concept into a system, ask:
If the answer is "no" or "mostly wording", do not promote it as a new runtime object. Keep it as memo-level guidance or merge it into an existing operator.
This matters most in genomics and phenotype-policy work, where good epistemic caveats can easily metastasize into a Rube Goldberg system if every caveat becomes a first-class type.
Check yourself against each before outputting:
| Model | Failure Mode | Severity | Notes |
|---|---|---|---|
| Claude (Opus 4.6) | Sycophantic hedging; agrees then qualifies until useless | Medium | Improved from 4.5 but still present |
| Claude | Citation-shaped bullshit; plausible references that don't exist | High | CoT unfaithfulness baseline: 7-13% on clean prompts (ICLR 2026) |
| Claude | Genotype determinism; treats associations as deterministic | High | |
| GPT (5.2–5.4) | Confident fabrication; invents complete fake studies with authors and N | Critical | Worse with extended thinking enabled. 5.4 improved (SimpleQA ~72%) but still rarely refuses — fabricates confidently |
| GPT | Overcitation; cites 20+ papers, many tangential or unverifiable | Medium | |
| Gemini (3.1 Pro) | Google-source bias; over-relies on Scholar snippets without reading papers | High | 1M context invites dumping entire papers without processing |
| Gemini | Length inflation; massive outputs that bury the signal | Medium | |
| All models | Implicit post-hoc rationalization; unfaithful CoT on clean prompts | Medium | 7-13% baseline rate (arXiv, ICLR 2026 submission). Not adversarial — happens on normal prompts |
Cross-model validation: For high-stakes bio claims (Grade 1-3 evidence affecting clinical decisions), route the same evidence through a second model as independent assessor. Different models have different fabrication patterns — Claude invents plausible-but-wrong citations, GPT invents complete fake studies. Cross-checking catches both.
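A hypothetical sketch of the routing step — `ask_model` stands in for whatever second-model client you use; no real API is assumed:

```python
# Cross-model validation sketch: the same claim and evidence go to an
# independent assessor model. `ask_model` is a placeholder callable
# (prompt -> response string), not a real library function.
def cross_check(claim: str, evidence: str, ask_model) -> dict:
    prompt = (
        "Independently assess this claim against the evidence provided. "
        "Do not assume the claim is correct.\n"
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        "Answer SUPPORTED / UNSUPPORTED / FABRICATION-SUSPECTED, with reasons."
    )
    return {"claim": claim, "second_opinion": ask_model(prompt)}
```

The value comes from the assessor seeing the raw evidence rather than the first model's conclusion, so each model's distinct fabrication pattern gets an independent chance to be caught.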
Before grading evidence or writing conclusions, recite the key evidence items verbatim — restate the study name, N, effect size, and population for each Grade 1-5 source you're relying on. This combats lost-in-the-middle effects when working with many sources (Du et al., EMNLP 2025: +4% accuracy, training-free).
Don't summarize — recite. The act of restating forces attention back to the actual data before the synthesis step where hallucination risk is highest.
After any bio research output:
Justified: CPIC Level A/B, PharmGKB Level 1A/1B. Not justified (but LLMs do it anyway): single GWAS hit (OR < 2.0) → dose recommendation; nutrigenomic SNP → supplement dose; variant without replication in the user's ancestry.
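The "not justified" patterns above can be expressed as a red-flag check. A sketch — the OR < 2.0 threshold comes straight from the text, and the function name and parameters are illustrative:

```python
def pgx_red_flags(odds_ratio: float,
                  replicated_in_user_ancestry: bool,
                  has_cpic_or_pharmgkb_top_level: bool) -> list[str]:
    """Return the reasons a genotype->dose claim must be demoted to
    INFERENCE. CPIC A/B or PharmGKB 1A/1B clears the bar outright."""
    if has_cpic_or_pharmgkb_top_level:
        return []
    flags = []
    if odds_ratio < 2.0:
        flags.append("single GWAS hit with OR < 2.0 -> no dose recommendation")
    if not replicated_in_user_ancestry:
        flags.append("variant not replicated in user's ancestry")
    flags.append("below CPIC/DPWG/PharmGKB actionability bar -> label INFERENCE")
    return flags
```

Note the last flag fires for any claim without top-level curation, even a large, replicated association — matching the rule that only CPIC/DPWG/PharmGKB-level evidence justifies genotype→dose.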
Key databases: CPIC (cpicpgx.org), PharmGKB, ClinVar, DPWG, gnomAD.