Retrieves gene expression and omics datasets from ArrayExpress and BioStudies with gene disambiguation, experiment quality assessment, and structured reports. Creates comprehensive dataset profiles with metadata, sample information, and download links. Use when users need expression data, omics datasets, or mention ArrayExpress (E-MTAB, E-GEOD) or BioStudies (S-BSST) accessions.
Retrieve gene expression experiments and multi-omics datasets with disambiguation and quality assessment.
IMPORTANT: Always use English terms in tool calls. Respond in the user's language.
LOOK UP DON'T GUESS: Never assume which datasets exist or their accessions. Always search to confirm.
Before retrieving, determine: organism, tissue, experimental design (case-control/time-series/dose-response). These affect which database to search and how to interpret results. RNA-seq provides wider dynamic range; microarray has extensive legacy data. Prioritize experiments with >=3 biological replicates, complete annotations, and both raw+processed data.
Phase 0: Clarify (if ambiguous) → Phase 1: Disambiguate → Phase 2: Search & Retrieve → Phase 3: Report
Ask ONLY if: gene name ambiguous, tissue/condition unclear, organism not specified. Skip for: specific accessions (E-MTAB-*, E-GEOD-*, S-BSST*), clear disease/tissue+organism, explicit platform requests.
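The accession check that lets Phase 0 be skipped could be sketched as below. This is a minimal illustration, not part of the toolset: `find_accessions` is a hypothetical helper, only the three prefixes named above are covered, and the numeric-suffix format is an assumption.

```python
import re

# Prefixes come from the text above; the "digits only" suffix is an assumption.
ACCESSION_PATTERNS = {
    "arrayexpress": re.compile(r"\bE-(?:MTAB|GEOD)-\d+\b", re.IGNORECASE),
    "biostudies": re.compile(r"\bS-BSST\d+\b", re.IGNORECASE),
}


def find_accessions(text):
    """Return (source, accession) pairs found in a user query (hypothetical helper)."""
    hits = []
    for source, pattern in ACCESSION_PATTERNS.items():
        for match in pattern.finditer(text):
            # Normalize case so the accession can be passed on unchanged.
            hits.append((source, match.group(0).upper()))
    return hits
```

If the returned list is non-empty, the query names a specific accession and can go straight to direct retrieval.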
Resolve official gene symbol (HGNC for human, MGI for mouse). Note common aliases for search expansion.
| User Query Type | Search Strategy |
|---|---|
| Specific accession | Direct retrieval |
| Gene + condition | "[gene] [condition]" + species filter |
| Disease only | "[disease]" + species filter |
| Technology-specific | Add platform keywords |
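One way to turn the strategy table into search arguments is sketched here. `build_search_kwargs` is a hypothetical helper; the keys `keywords`, `species`, and `limit` follow the ArrayExpress parameters listed in this document, and the default limit of 20 simply mirrors the search call shown further down.

```python
def build_search_kwargs(gene=None, condition=None, disease=None,
                        platform=None, species=None, limit=20):
    """Assemble keyword arguments for a search call (illustrative only)."""
    # Join whichever terms were supplied, e.g. "[gene] [condition]" or "[disease]".
    terms = [t for t in (gene, condition, disease, platform) if t]
    if not terms:
        raise ValueError("at least one search term is required")
    kwargs = {"keywords": " ".join(terms), "limit": limit}
    if species:
        # Species is applied as a filter, not as part of the free text.
        kwargs["species"] = species
    return kwargs
```

The resulting dict could then be unpacked into the search tool, for example `arrayexpress_search_experiments(**kwargs)`.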
Search silently. Do NOT narrate the process.
# ArrayExpress search
result = tu.tools.arrayexpress_search_experiments(keywords="[gene/disease]", species="[species]", limit=20)
# Get experiment details, samples, files
details = tu.tools.arrayexpress_get_experiment(accession=accession)
samples = tu.tools.arrayexpress_get_experiment_samples(accession=accession)
files = tu.tools.arrayexpress_get_experiment_files(accession=accession)
# BioStudies for multi-omics
biostudies = tu.tools.biostudies_search(query="[keywords]", limit=10)
study = tu.tools.biostudies_get_study(accession=study_accession)
study_files = tu.tools.biostudies_get_study_files(accession=study_accession)
| Primary | Fallback |
|---|---|
| ArrayExpress search | BioStudies search |
| arrayexpress_get_experiment | biostudies_get_study |
| arrayexpress_get_experiment_files | Note "Files unavailable" |
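The primary/fallback pairing above can be sketched as a small ordered-attempt helper. This is an assumption-laden illustration: `first_successful` is hypothetical, the tool calls are passed in as zero-argument callables, and treating any exception or empty result as "try the fallback" is a design choice made here, not documented tool behavior.

```python
def first_successful(steps):
    """Run (label, callable) pairs in order and return the first non-empty result.

    Returns (label, result), or (None, None) if every step fails or is empty.
    """
    for label, call in steps:
        try:
            result = call()
        except Exception:
            # Assumption: any failure means the next source should be tried.
            continue
        if result:
            return label, result
    return None, None
```

For example, `steps` could hold a lambda wrapping the ArrayExpress search followed by one wrapping the BioStudies search; when file retrieval returns `(None, None)`, the report would note "Files unavailable".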
Present as a Dataset Search Report. Hide search process. Include:
| Tier | Symbol | Criteria |
|---|---|---|
| High | ●●● | >=3 bio replicates, complete metadata, processed data available |
| Medium | ●●○ | 2-3 replicates OR some metadata gaps |
| Low | ●○○ | No replicates, sparse metadata, or access issues |
| Caution | ○○○ | Single sample, no replication, outdated platform |
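The tier table could be applied programmatically along the following lines. This is only one possible reading: where the criteria overlap (for example, three replicates, or "no replicates" versus "no replication"), the function checks the most cautious tier first. `evidence_tier` and its parameters are hypothetical names.

```python
def evidence_tier(bio_replicates, metadata="complete", processed_data=True,
                  single_sample=False, outdated_platform=False, access_ok=True):
    """Map dataset properties to a tier label; metadata is 'complete', 'gaps', or 'sparse'."""
    unreplicated = bio_replicates <= 1
    # Most cautious tier first: single sample, or unreplicated on an outdated platform.
    if single_sample or (unreplicated and outdated_platform):
        return "Caution ○○○"
    if unreplicated or metadata == "sparse" or not access_ok:
        return "Low ●○○"
    if bio_replicates >= 3 and metadata == "complete" and processed_data:
        return "High ●●●"
    # Everything else: 2 replicates, or >=3 with metadata gaps or no processed data.
    return "Medium ●●○"
```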
Dataset quality: Prioritize >=3 biological replicates, complete annotations, both raw+processed data. Single-replicate experiments can inform but not be sole evidence.
Platform comparison: RNA-seq = wider dynamic range, novel transcripts. Microarray = probe-limited but extensive legacy data. Cross-platform combining requires batch correction.
Metadata scoring: Score 0-5, one point for each criterion met: (1) sample annotations, (2) design documented, (3) pipeline described, (4) raw data deposited, (5) publication linked. Score <=2 warrants caution.
GEO vs ArrayExpress: GEO has broader coverage (older studies); ArrayExpress enforces stricter metadata. BioStudies captures multi-omics. Search both GEO and ArrayExpress.
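The scoring rule can be sketched as a count of satisfied criteria, assuming one point per criterion. The flag names below are hypothetical labels for the five criteria, and how each flag is derived from the retrieved metadata is left open.

```python
# Hypothetical flag names, one per criterion listed above.
CRITERIA = (
    "sample_annotations",
    "design_documented",
    "pipeline_described",
    "raw_data_deposited",
    "publication_linked",
)


def metadata_score(flags):
    """Return (score, needs_caution) for a dict of criterion -> bool."""
    score = sum(1 for name in CRITERIA if flags.get(name))
    # A score of 2 or below warrants caution.
    return score, score <= 2
```

For instance, a dataset with documented design and deposited raw data but nothing else would score 2 and be flagged for caution.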
| Error | Response |
|---|---|
| "No experiments found" | Broaden keywords, remove species filter, try synonyms |
| "Accession not found" | Verify format, check if withdrawn |
| "Files not available" | Note: "Data files restricted by submitter" |
| "API timeout" | Retry once, note "(metadata retrieval incomplete)" |
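The "retry once" rule for timeouts might look like the sketch below. It assumes the tool call surfaces a timeout as Python's built-in `TimeoutError`, which is not confirmed by this document; `call_with_one_retry` is a hypothetical wrapper around any zero-argument callable.

```python
INCOMPLETE_NOTE = "(metadata retrieval incomplete)"


def call_with_one_retry(call):
    """Run call(); on timeout retry once, then give up with a note.

    Returns (result, note); note is None on success.
    """
    try:
        return call(), None
    except TimeoutError:
        pass  # first attempt timed out; fall through to the single retry
    try:
        return call(), None
    except TimeoutError:
        # Second timeout: report the gap instead of retrying further.
        return None, INCOMPLETE_NOTE
```

The returned note can be placed next to the affected dataset in the report so the gap stays visible.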
ArrayExpress: arrayexpress_search_experiments (search), arrayexpress_get_experiment (metadata), arrayexpress_get_experiment_files (downloads), arrayexpress_get_experiment_samples (annotations)
BioStudies: biostudies_search (search), biostudies_get_study (metadata+sections), biostudies_get_study_files (files)
Additional Sources:
GEO_search_rnaseq_datasets / geo_search_datasets -- GEO (largest RNA-seq repo)
OmicsDI_search_datasets -- cross-repository aggregation (GEO+ArrayExpress+PRIDE+MassIVE)
GTEx_get_expression_summary -- baseline tissue expression (54 normal tissues, param: gene_symbol)
ENAPortal_search_studies -- sequencing studies (param: query with description="...")
CxGDisc_search_datasets -- single-cell datasets (needs exact disease ontology terms)
PubMed_search_articles -- dataset discovery via publications
ArrayExpress: keywords (free text), species (scientific name), array (platform filter), limit
BioStudies: query (free text), limit