Parses natural language perturbation biology questions into structured query objects. Use when a user asks about gene knockouts, drug effects, CRISPR screens, perturbation experiments, or says "analyze this perturbation question" or "search for perturb-seq data". This is the pipeline entry point.
Every user question flows through this workflow first.
Invoke this workflow whenever a user asks a perturbation biology question. The output feeds into paper-search-workflow and perturbation-type-router.
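Identify biological entities from the user's question: genes, drugs or compounds, cell lines or cell types, diseases, organisms (default Homo sapiens), and specific perturbation constructs if mentioned.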
Classify the question into one of these categories:
| Type | Description | Example |
|---|---|---|
| mechanism | How does a perturbation work? | "How does KRAS knockout affect downstream signaling?" |
| comparison | Compare perturbations or conditions | "Compare dexamethasone vs prednisolone in A549 cells" |
| dose-response | Dose or time-dependent effects | "What happens to TP53 targets at different nutlin-3a doses?" |
| screening | Large-scale perturbation screen results | "What are the top hits from a genome-wide CRISPR screen in K562?" |
| dataset-search | Find relevant datasets | "Find CRISPR screens in lung cancer cell lines" |
| analysis | Analyze a specific dataset | "Run differential expression on this perturbation dataset" |
Classify the perturbation type:
| Type | Indicators |
|---|---|
| chemical | Drug names, compound IDs, dose mentions, MOA references |
| genetic_crispr | CRISPR, Cas9, sgRNA, guide RNA, knockout, KO |
| genetic_rnai | RNAi, shRNA, siRNA, knockdown, KD |
| combinatorial | Multiple perturbations, combinations, synergy, interaction |
| unknown | Insufficient information to classify |
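
The indicator table above can be turned into a simple keyword heuristic. The following is a minimal sketch, not the workflow's actual classifier: the function name `classify_perturbation_type`, the `drugs` argument (already-extracted drug entities, since drug names cannot be keyword-matched), and the tie-breaking rule (cues for more than one type count as combinatorial) are all assumptions made for illustration.

```python
import re

# Indicator keywords copied from the table above. Drug names and compound IDs
# cannot be keyword-matched, so extracted drug entities are passed in instead.
INDICATORS = {
    "chemical": ["drug", "compound", "dose", "moa"],
    "genetic_crispr": ["crispr", "cas9", "sgrna", "guide rna", "knockout", "ko"],
    "genetic_rnai": ["rnai", "shrna", "sirna", "knockdown", "kd"],
    "combinatorial": ["combination", "synergy", "interaction"],
}


def classify_perturbation_type(question, drugs=()):
    """Return a perturbation type label for a question (illustrative heuristic)."""
    text = question.lower()
    hits = set()
    for label, keywords in INDICATORS.items():
        for kw in keywords:
            # Whole-word match, tolerating a plural "s" (e.g. "doses").
            if re.search(r"\b" + re.escape(kw) + r"s?\b", text):
                hits.add(label)
                break
    if drugs:
        hits.add("chemical")
    # Assumption: an explicit combination cue, or cues for more than one
    # type, is treated as a combinatorial perturbation.
    if "combinatorial" in hits or len(hits) > 1:
        return "combinatorial"
    if hits:
        return next(iter(hits))
    return "unknown"


print(classify_perturbation_type("What are the effects of KRAS knockout in A549 cells?"))
# prints: genetic_crispr
```

A keyword match like this is only a first pass; questions with no indicator fall through to unknown, which matches the last row of the table.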
Output this JSON structure:

    {
      "raw_query": "<original user question>",
      "entities": {
        "genes": ["<gene symbols>"],
        "drugs": ["<drug/compound names>"],
        "cell_types": ["<cell lines or types>"],
        "diseases": ["<disease names>"],
        "organisms": ["<species, default 'Homo sapiens'>"],
        "perturbation_agents": ["<specific constructs if mentioned>"]
      },
      "question_type": "<mechanism|comparison|dose-response|screening|dataset-search|analysis>",
      "perturbation_type": "<chemical|genetic_crispr|genetic_rnai|combinatorial|unknown>",
      "search_terms": ["<derived search keywords for paper/dataset retrieval>"],
      "filters": {
        "organism": "<species filter>",
        "data_availability": "<true if user wants downloadable data>",
        "year_range": [null, null]
      },
      "confidence": {
        "entity_extraction": "<0.0-1.0>",
        "question_classification": "<0.0-1.0>",
        "perturbation_classification": "<0.0-1.0>"
      }
    }
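
A downstream consumer may want to check that a parsed query actually follows this structure before using it. Below is a minimal sketch of such a check; the function name `validate_structured_query` and the wording of the problem messages are hypothetical, while the key names and allowed values come from the structure above.

```python
QUESTION_TYPES = {
    "mechanism", "comparison", "dose-response",
    "screening", "dataset-search", "analysis",
}
PERTURBATION_TYPES = {
    "chemical", "genetic_crispr", "genetic_rnai", "combinatorial", "unknown",
}
TOP_LEVEL_KEYS = (
    "raw_query", "entities", "question_type", "perturbation_type",
    "search_terms", "filters", "confidence",
)
ENTITY_KEYS = (
    "genes", "drugs", "cell_types", "diseases", "organisms", "perturbation_agents",
)
CONFIDENCE_KEYS = (
    "entity_extraction", "question_classification", "perturbation_classification",
)


def validate_structured_query(query):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in TOP_LEVEL_KEYS:
        if key not in query:
            problems.append(f"missing top-level key: {key}")
    if query.get("question_type") not in QUESTION_TYPES:
        problems.append("question_type is not one of the allowed values")
    if query.get("perturbation_type") not in PERTURBATION_TYPES:
        problems.append("perturbation_type is not one of the allowed values")
    entities = query.get("entities") or {}
    for key in ENTITY_KEYS:
        # Every entity field is a list, possibly empty.
        if not isinstance(entities.get(key), list):
            problems.append(f"entities.{key} must be a list")
    confidence = query.get("confidence") or {}
    for key in CONFIDENCE_KEYS:
        value = confidence.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            problems.append(f"confidence.{key} must be a number in [0.0, 1.0]")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything that is wrong with a query in a single pass.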
If confidence in any field is below 0.6:
Return the structured query JSON to the calling workflow (typically the main orchestrator or paper-search-workflow).
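
The 0.6 confidence threshold mentioned above can be checked mechanically before the query is returned. This is a minimal sketch under assumptions: the helper name `low_confidence_fields` is hypothetical, and it only reports which confidence fields fall below the threshold, leaving the follow-up action to the caller.

```python
CONFIDENCE_THRESHOLD = 0.6  # threshold stated in the workflow text


def low_confidence_fields(query, threshold=CONFIDENCE_THRESHOLD):
    """Return the names of confidence fields strictly below the threshold."""
    confidence = query.get("confidence", {})
    return [name for name, score in confidence.items() if float(score) < threshold]


query = {
    "confidence": {
        "entity_extraction": 0.95,
        "question_classification": 0.5,
        "perturbation_classification": 0.6,
    }
}
print(low_confidence_fields(query))
# prints: ['question_classification']
```

A score of exactly 0.6 is not flagged, since the text says "below 0.6".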
Input: "What are the effects of KRAS knockout in A549 cells?"

Output:

    {
      "raw_query": "What are the effects of KRAS knockout in A549 cells?",
      "entities": {
        "genes": ["KRAS"],
        "drugs": [],
        "cell_types": ["A549"],
        "diseases": [],
        "organisms": ["Homo sapiens"],
        "perturbation_agents": []
      },
      "question_type": "mechanism",
      "perturbation_type": "genetic_crispr",
      "search_terms": ["KRAS", "knockout", "A549", "CRISPR"],
      "filters": {
        "organism": "Homo sapiens",
        "data_availability": true,
        "year_range": [null, null]
      },
      "confidence": {
        "entity_extraction": 0.95,
        "question_classification": 0.85,
        "perturbation_classification": 0.90
      }
    }
When LanceDB is available, validate extracted entities against the curated registries:
- Genes: look up each symbol in the genes table (GeneSchema) to get gene_index and the canonical ensembl_id. Uses the gene-resolver skill for standardization (Bionty ontologies, Ensembl prefix → organism mapping).
- Drugs: look up each compound in the molecules table (MoleculeSchema) to get pubchem_cid and sample_uid. Uses the molecule-resolver skill for PubChem resolution.
- Add resolved_gene_indices and resolved_molecule_uids to the structured query for direct DB filtering downstream.

This step is optional: the pipeline works without it and falls back to text-based search. When available, it enables precise perturbation-level queries against the gene_expression table's perturbation_search_string field.
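
The way resolved identifiers are attached to the query can be sketched with in-memory stand-ins for the registries. Everything here is an assumption for illustration: plain dictionaries replace the LanceDB tables, the function name `attach_resolved_ids` is hypothetical, resolved_molecule_uids is assumed to hold sample_uid values, and the registry values are placeholders rather than real identifiers.

```python
from copy import deepcopy


def attach_resolved_ids(query, gene_registry, molecule_registry):
    """Return a copy of the query with resolved identifier lists added.

    gene_registry maps a gene symbol to a record with "gene_index";
    molecule_registry maps a drug name to a record with "sample_uid".
    Entities missing from a registry are skipped, so downstream steps
    can still fall back to text-based search for them.
    """
    resolved = deepcopy(query)
    entities = resolved.get("entities", {})
    resolved["resolved_gene_indices"] = [
        gene_registry[symbol]["gene_index"]
        for symbol in entities.get("genes", [])
        if symbol in gene_registry
    ]
    resolved["resolved_molecule_uids"] = [
        molecule_registry[name]["sample_uid"]
        for name in entities.get("drugs", [])
        if name in molecule_registry
    ]
    return resolved


# Placeholder registry contents, not real identifiers.
genes = {"KRAS": {"gene_index": 0, "ensembl_id": "<ensembl id>"}}
query = {"entities": {"genes": ["KRAS"], "drugs": []}}
print(attach_resolved_ids(query, genes, {})["resolved_gene_indices"])
# prints: [0]
```

Working on a copy keeps the original structured query unchanged, which is convenient when the optional resolution step is skipped or retried.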
- Feeds into: paper-search-workflow, perturbation-type-router
- Uses: lancedb-query (optional, for identifier resolution), gene-resolver, molecule-resolver (via src/ych/skills/)