Parse a natural-language data request into a structured request.json that drives discover/score/deliver. No wiki access. No web search. Pure parsing + minimal inference.
You take a natural language request from a buyer and produce a request.json that the rest of the query layer reads. You do NOT search the wiki, you do NOT call APIs, you do NOT score anything. Your only job is to turn a sentence into a structured brief with explicit gaps surfaced.
- request_text: the raw user message, exactly as typed.
- out_path: absolute path where you write request.json. Conventionally store/queries/{date}_{slug}/request.json.
- interactive: if true, you may ask the user a single clarifying question (one round only). If false, you must commit to your best inference and surface gaps in gaps[] instead.

The user is a busy researcher or biotech operator. They will not write a perfect spec. They will write half a sentence and expect the system to do the rest. Your job is to extract everything that IS stated, infer the obvious, and then explicitly flag the gaps you guessed at so downstream skills know what was inferred and what was given.
Three rules:
- Keep the user's exact words in original_text. Never paraphrase, never normalize, never expand abbreviations into the original_text field.
- Mark every inferred value with <field>_inferred: true. Downstream agents need to know what is load-bearing user intent vs. what is your best guess.
- Record every guess and every unknown in gaps[]. Discovery uses gaps to decide whether to ask the user, fall back to defaults, or proceed with low confidence.

Walk the request and pull out:
- modality (examples per the locked rotation, .claude/rules/example-rotation.md): A plasma, serum, CSF, metabolomics, lipidomics, proteomics. B FFPE tissue, fresh-frozen tumor, bulk RNA-seq, WES, methylation array. C stool, 16S, shotgun metagenomics, metatranscriptomics.
- use_case_type: one of model_training, biomarker_validation, pilot_exploratory, feasibility_check, pricing_only, competitive_intel, unspecified. Infer from verbs ("validate", "train", "explore", "scope", "price", "see who else").
- intent: access if the buyer wants existing data (signal verbs: "find existing", "what cohorts have data", "download", "access data"). commission if the buyer wants specimens for running new assays (signal verbs/phrases: "for running [assay]", "to run [assay]", "source specimens", "find a provider for", "procure", "banked samples", "I need samples to [verb]"). mixed if both or ambiguous. Default to mixed when unclear; downstream skills handle both paths.
- specimen_type_needed: the physical specimen, kept separate from modality, which captures the assay/platform. "Find CSF for running methylation" → specimen_type_needed: ["CSF"], modality: ["DNA methylation"]. If no specimen type is explicitly named, infer from modality if obvious (e.g. "stool shotgun" → stool), otherwise null + gap.

Single JSON file at out_path. The schema is constant across domains; the field values rotate per A/B/C.
{
  "request_id": "<slug>",
  "original_text": "exact user text, unmodified",
  "received_at": "<ISO datetime>",
  "indication": ["..."],
  "indication_inferred": false,
  "modality": ["..."],
  "modality_inferred": false,
  "use_case_type": "model_training | biomarker_validation | pilot_exploratory | feasibility_check | pricing_only | competitive_intel | unspecified",
  "use_case_inferred": false,
  "intent": "access | commission | mixed",
  "intent_inferred": false,
  "specimen_type_needed": ["CSF", "plasma", "..."],
  "specimen_type_needed_inferred": false,
  "n_target": 0,
  "n_target_inferred": false,
  "longitudinal_required": false,
  "longitudinal_inferred": false,
  "commercial_use": false,
  "commercial_use_inferred": false,
  "budget": null,
  "timeline": null,
  "hard_negatives": ["..."],
  "required_fields": ["..."],
  "nice_to_have_fields": ["..."],
  "scope_notes": "<one or two sentences in your own words>",
  "gaps": [
    {"field": "<field_name>", "reason": "<one sentence>"}
  ],
  "filter_for_discover": {
    "indication_match": ["..."],
    "modality_match": ["..."],
    "intent": "access | commission | mixed",
    "candidate_types": ["cohort", "data_opportunity"],
    "specimen_type_match": ["CSF", "plasma", "..."],
    "longitudinal_required": false,
    "min_n_usable": null,
    "commercial_use_required": false,
    "disease_area_or_modality": false
  }
}
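The intent field and the candidate_types list that depends on it can be sketched in a few lines. This is a minimal illustration, not the parser's actual method: infer_intent and candidate_types_for are hypothetical helpers, the phrase lists are copied from the intent rule in this document, and plain substring matching is an assumption that will misfire on unusual wording.

```python
# Hypothetical helpers. Phrase lists mirror the signal phrases in the intent rule.
ACCESS_SIGNALS = ("find existing", "what cohorts have data", "download", "access data")
COMMISSION_SIGNALS = (
    "for running", "to run", "source specimens", "find a provider for",
    "procure", "banked samples", "i need samples to",
)

# Mapping taken from the candidate_types rule for filter_for_discover.
CANDIDATE_TYPES = {
    "access": ["cohort", "data_opportunity"],
    "commission": ["institution", "cohort", "data_opportunity"],
    "mixed": ["institution", "cohort", "data_opportunity"],
}


def infer_intent(request_text: str) -> str:
    """Classify intent by naive, case-insensitive substring matching (an assumption)."""
    text = request_text.lower()
    wants_access = any(signal in text for signal in ACCESS_SIGNALS)
    wants_commission = any(signal in text for signal in COMMISSION_SIGNALS)
    if wants_access and not wants_commission:
        return "access"
    if wants_commission and not wants_access:
        return "commission"
    # Both kinds of signal, or neither: use the documented default.
    return "mixed"


def candidate_types_for(intent: str) -> list:
    """Return a fresh list; unknown intents fall back to mixed (an assumption)."""
    return list(CANDIDATE_TYPES.get(intent, CANDIDATE_TYPES["mixed"]))
```

A request with no signal phrase at all comes back as mixed, which is the case where intent_inferred should be true and a gap entry is warranted.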
Example A — neuro fluid biomarker. Request: "Looking for longitudinal plasma metabolomics cohorts in Alzheimer's disease, at least 200 patients, for biomarker validation. Minimise statin effects."
{ "indication": ["Alzheimer's disease"],
"modality": ["plasma", "metabolomics"],
"use_case_type": "biomarker_validation",
"n_target": 200,
"longitudinal_required": true,
"hard_negatives": ["n_total_less_than_30", "commercial_use_allowed_false", "statin_confounded"],
"filter_for_discover": {
"indication_match": ["Alzheimer's disease", "AD"],
"modality_match": ["plasma", "metabolomics", "lipidomics", "untargeted metabolomics"],
"longitudinal_required": true,
"min_n_usable": 200 } }
Example B — oncology tissue genomics. Request: "Need NSCLC FFPE blocks with paired RNA-seq for a tumor microenvironment signature; ~150 stage I-III cases, no neoadjuvant treatment."
{ "indication": ["non-small cell lung cancer", "lung adenocarcinoma"],
"modality": ["FFPE tissue", "bulk RNA-seq"],
"use_case_type": "model_training",
"n_target": 150,
"longitudinal_required": false,
"hard_negatives": ["n_total_less_than_30", "neoadjuvant_treated", "block_age_over_5y"],
"filter_for_discover": {
"indication_match": ["non-small cell lung cancer", "NSCLC", "lung adenocarcinoma", "LUAD"],
"modality_match": ["FFPE", "fresh-frozen tumor", "bulk RNA-seq", "RNA-seq"],
"longitudinal_required": false,
"min_n_usable": 150 } }
Example C — microbiome stool sequencing. Request: "Stool from IBD patients pre and post biologic initiation, both timepoints sequenced at shotgun depth, cold chain documented."
{ "indication": ["inflammatory bowel disease", "Crohn's disease", "ulcerative colitis"],
"modality": ["stool", "shotgun metagenomics"],
"use_case_type": "biomarker_validation",
"n_target": null,
"longitudinal_required": true,
"hard_negatives": ["n_total_less_than_30", "recent_antibiotics", "cold_chain_undocumented"],
"filter_for_discover": {
"indication_match": ["inflammatory bowel disease", "IBD", "Crohn's disease", "ulcerative colitis"],
"modality_match": ["stool", "fecal", "shotgun metagenomics", "metagenomic"],
"longitudinal_required": true,
"min_n_usable": null } }
Example A commission — neuro fluid biomarker. Request: "Find CSF from Alzheimer's patients for running DNA methylation assays, need a provider too."
{ "indication": ["Alzheimer's disease"],
"modality": ["DNA methylation", "EPIC array"],
"intent": "commission",
"specimen_type_needed": ["CSF"],
"use_case_type": "biomarker_validation",
"n_target": null, "n_target_inferred": false,
"hard_negatives": ["commercial_use_allowed_false"],
"filter_for_discover": {
"indication_match": ["Alzheimer's disease", "AD"],
"modality_match": ["DNA methylation", "methylation array", "EPIC", "450K"],
"intent": "commission",
"candidate_types": ["institution", "cohort", "data_opportunity"],
"specimen_type_match": ["CSF", "cerebrospinal fluid"],
"longitudinal_required": false,
"min_n_usable": null } }
Example B commission — oncology tissue genomics. Request: "Source FFPE blocks from NSCLC patients for running spatial transcriptomics, at least 100 cases."
{ "indication": ["non-small cell lung cancer", "NSCLC"],
"modality": ["spatial transcriptomics", "Visium"],
"intent": "commission",
"specimen_type_needed": ["FFPE tissue"],
"use_case_type": "model_training",
"n_target": 100,
"hard_negatives": ["block_age_over_10y"],
"filter_for_discover": {
"indication_match": ["NSCLC", "lung adenocarcinoma", "LUAD"],
"modality_match": ["spatial transcriptomics", "Visium", "10x Genomics"],
"intent": "commission",
"candidate_types": ["institution", "cohort", "data_opportunity"],
"specimen_type_match": ["FFPE", "FFPE tissue", "FFPE blocks"],
"longitudinal_required": false,
"min_n_usable": 100 } }
Example C commission — microbiome stool sequencing. Request: "I need stool from IBD patients to run shotgun metagenomics, pre and post biologic, cold chain documented."
{ "indication": ["inflammatory bowel disease", "IBD"],
"modality": ["shotgun metagenomics"],
"intent": "commission",
"specimen_type_needed": ["stool"],
"use_case_type": "biomarker_validation",
"n_target": null,
"hard_negatives": ["no_cold_chain_documentation"],
"filter_for_discover": {
"indication_match": ["inflammatory bowel disease", "IBD", "Crohn", "ulcerative colitis"],
"modality_match": ["shotgun metagenomics", "metagenomics", "WGS"],
"intent": "commission",
"candidate_types": ["institution", "cohort", "data_opportunity"],
"specimen_type_match": ["stool", "fecal"],
"longitudinal_required": true,
"min_n_usable": null } }
The three examples deliberately span maximally different pre-analytical, confounder, and access landscapes. If your filter for a stool request lists "lipidomics" or "FFPE" as a modality_match synonym, you have leaked an example into a different request.
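A lint pass over modality_match is one way to catch that kind of leak. The sketch below is an assumption, not part of the skill: ROTATION_TERMS and leaked_terms are hypothetical names, the vocabularies are the A/B/C terms this document lists (plus the bare "FFPE" named in the warning above), and matching is exact and case-insensitive.

```python
# Hypothetical per-domain vocabularies, lowercase, taken from the rotation terms.
ROTATION_TERMS = {
    "A": {"plasma", "serum", "csf", "metabolomics", "lipidomics", "proteomics"},
    "B": {"ffpe", "ffpe tissue", "fresh-frozen tumor", "bulk rna-seq", "wes",
          "methylation array"},
    "C": {"stool", "16s", "shotgun metagenomics", "metatranscriptomics"},
}


def leaked_terms(modality_match: list, domain: str) -> list:
    """Return the terms that appear only in another domain's vocabulary."""
    own = ROTATION_TERMS[domain]
    # Union of every vocabulary except the request's own domain.
    foreign = set().union(*(terms for key, terms in ROTATION_TERMS.items() if key != domain))
    return [
        term for term in modality_match
        if term.lower() in foreign and term.lower() not in own
    ]
```

Treat hits as warnings rather than errors. A real request can legitimately cross vocabularies: the CSF methylation commission example pairs an A specimen with "methylation array", which this check would flag.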
Only include fields the use case demands. Do NOT list all 8 fields for every request.
- model_training → n_total, n_by_group, sample_types, longitudinal are required. endpoints and co_modalities are nice to have.
- biomarker_validation → endpoints, n_total, sample_types are required. longitudinal is nice to have.
- pilot_exploratory → sample_types is required. Everything else is nice to have.
- feasibility_check → indication, sample_types are required. n_total is nice to have.
- pricing_only → modality, n_target are required.
- competitive_intel → indication is the only requirement.

Always include in hard_negatives:
- n_total_less_than_30 (no cohort with n<30 is worth surfacing)

Include if commercial:
- commercial_use_allowed_false (skip cohorts that explicitly forbid commercial reuse)

Include only if the user explicitly stated. Examples per the locked rotation:
- A: statin_confounded, omega3_confounded, non_fasting, no_apoe4_enriched.
- B: neoadjuvant_treated, block_age_over_5y, low_tumor_purity, no_ffpe.
- C: recent_antibiotics, cold_chain_undocumented, recent_ppi, single_timepoint_only.
- Not tied to a domain: no_broker, no_longitudinal, commercial_use_allowed_false.

filter_for_discover is the structured filter that query/discover will use to scan the wiki indices. It is a denormalised projection of the request into the exact fields discover knows how to filter by. Keep it tight: discover does set membership and threshold checks against this block, not natural-language matching.
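To make "set membership and threshold checks" concrete, here is a rough sketch of how a consumer might apply the block. It is an illustration only: passes_filter is a hypothetical function, the record keys (disease_area, modalities, longitudinal, n_usable) are assumed names for wiki index fields, and the real discover logic may differ.

```python
def passes_filter(record: dict, flt: dict) -> bool:
    """Exact-string membership plus a numeric threshold; no fuzzy matching."""
    # Indication: the record's label must literally appear in the list.
    if record["disease_area"] not in flt["indication_match"]:
        return False
    # Modality: at least one exact string in common.
    if not set(record["modalities"]) & set(flt["modality_match"]):
        return False
    if flt["longitudinal_required"] and not record["longitudinal"]:
        return False
    # A null min_n_usable means "do not filter on N".
    min_n = flt["min_n_usable"]
    if min_n is not None and record["n_usable"] < min_n:
        return False
    return True
```

Under this kind of matching a record labelled "AD" passes only if "AD" is literally in indication_match, so the synonym lists have to carry every spelling you expect to meet.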
- indication_match: list of strings, including common abbreviations and synonyms. Cross-domain: A "Alzheimer's disease" + "AD"; "amyotrophic lateral sclerosis" + "ALS" + "motor neuron disease". B "non-small cell lung cancer" + "NSCLC" + "lung adenocarcinoma" + "LUAD". C "inflammatory bowel disease" + "IBD" + "Crohn's disease" + "ulcerative colitis". Always include both the long form and the abbreviation so the wiki's disease_area field matches both styles.
- modality_match: list of normalized sample/assay strings. Be inclusive on naming conventions because the wiki uses several. A: plasma → plasma; metabolomics → ["metabolomics", "lipidomics", "untargeted metabolomics"]. B: tissue → ["FFPE", "fresh-frozen tumor"]; bulk RNA-seq → ["bulk RNA-seq", "RNA-seq", "transcriptomics"]. C: stool → ["stool", "fecal"]; shotgun → ["shotgun metagenomics", "metagenomic", "WGS metagenomics"].
- longitudinal_required: bool.
- min_n_usable: integer or null. If null, discover does not filter on N.
- candidate_types: list of entity types discover should scan as primary candidates. Flexes on intent. Access intent → ["cohort", "data_opportunity"] (the buyer wants existing data from published studies). Commission intent → ["institution", "cohort", "data_opportunity"] (the buyer wants specimens, so institutions with biobanks/biorepositories are primary candidates, not just linked entities of cohorts). Mixed → ["institution", "cohort", "data_opportunity"]. This is the key structural difference: for commission, an institution with banked AD blood IS the answer, not a published cohort that happens to mention specimen retention.
- disease_area_or_modality: bool, default false. If true, discover keeps cohorts that match indication OR modality instead of requiring both. Use only when the user is in feasibility/scoping mode.

For every field where you guessed, append a gap entry with the field name and a one-sentence reason. Discover decides what to do with each gap.
If interactive: true, ask the user one clarifying question. If interactive: false, leave the field null and let discover return wider results.

- Read request_text carefully. Note the actual sentences, not your interpretation.
- Build filter_for_discover from the resolved fields.
- Build gaps[] from the inferred and null fields.
- Write scope_notes as the one or two sentence brief in your own words.
- Derive request_id from the indication + modality + use case (e.g. ad-plasma-metabolomics-validation).
- Create out_path's parent directory if needed.
- Describe what filter_for_discover looks like at a high level. Do NOT return the JSON.

- Always fill in original_text and scope_notes. Both are load-bearing for downstream skills.
- Never read store/wiki/.

The three example requests in the schema section above (A neuro fluid biomarker / B oncology tissue genomics / C microbiome stool sequencing) cover the locked rotation. Use them as your mental model when parsing a new request, and rotate which example you anchor to so you do not silently apply A's framing to B and C inputs.
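The closing mechanics (request_id, gaps[], directory creation, writing the file) can be sketched as follows. This is a minimal illustration under stated assumptions: slugify, build_gaps and write_request are hypothetical helpers, the gap reasons are placeholder wording, and FLAG_TO_FIELD exists only because two schema flags (use_case_inferred, longitudinal_inferred) do not share their field's full name.

```python
import json
import re
from pathlib import Path

# Schema flags whose prefix differs from the field they describe.
FLAG_TO_FIELD = {
    "use_case_inferred": "use_case_type",
    "longitudinal_inferred": "longitudinal_required",
}


def slugify(*parts: str) -> str:
    """Join parts into a lowercase, hyphen-separated slug."""
    joined = "-".join(part for part in parts if part)
    return re.sub(r"[^a-z0-9]+", "-", joined.lower()).strip("-")


def build_gaps(request: dict) -> list:
    """Emit one gap entry per field that was inferred or left null."""
    gaps = []
    for key, value in request.items():
        if key.endswith("_inferred"):
            if value is True:
                field = FLAG_TO_FIELD.get(key, key[: -len("_inferred")])
                gaps.append({"field": field,
                             "reason": f"{field} was inferred, not stated by the user."})
        elif value is None:
            gaps.append({"field": key,
                         "reason": f"{key} was not stated and could not be inferred."})
    return gaps


def write_request(request: dict, out_path: str) -> Path:
    """Create the parent directory if needed, then write request.json."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(request, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

For instance, slugify("AD", "plasma metabolomics", "validation") yields ad-plasma-metabolomics-validation, matching the request_id example above. Choosing the short parts ("AD", "validation") is still a judgment call made before the slug is built.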