Systematic strategies for searching, retrieving, and analyzing scientific literature across PubMed, arXiv, Google Scholar, and AI-assisted tools. Covers the PICO framework for clinical question formulation, three-tiered search strategy (database-specific, AI-assisted, content extraction), PubMed field tags and MeSH vocabulary, boolean query construction, and full-text extraction workflows. Consult this guide when planning a literature search, constructing database queries, or deciding which search tier to use for a given research question.
Scientific literature search is the foundation of evidence-based research. A well-executed search maximizes recall (finding all relevant papers) while maintaining precision (avoiding irrelevant results). This guide provides a systematic approach that combines database-specific query strategies, AI-assisted synthesis, and direct content extraction, organized into a three-tiered framework that scales from targeted lookups to comprehensive landscape reviews.
For clinical and biomedical questions, structure queries using the PICO framework: Population (who is being studied), Intervention (the treatment or exposure), Comparison (the alternative, such as placebo or standard of care), and Outcome (the effect being measured).
PICO queries can be combined with publication type filters to target specific evidence levels:
"Diabetes Mellitus"[MeSH] AND "Metformin"[MeSH] AND "Cardiovascular Diseases"[MeSH] AND ("clinical trial"[Publication Type] OR "meta-analysis"[Publication Type])
Literature search is most effective when approached in tiers of increasing breadth:
Tier 1 -- Database-Specific Searches (Most Reliable)
Query established academic databases (PubMed, arXiv, Google Scholar) for peer-reviewed, indexed content. This is the most reliable tier and should always be the starting point.
- query_pubmed: Primary database for biomedical and life science literature. Supports MeSH controlled vocabulary and advanced field tags.
- query_arxiv: Preprint server for physics, mathematics, computer science, and quantitative biology. Results appear faster than peer-reviewed journals.
- query_scholar: Broadest coverage across all academic disciplines. Note: has aggressive rate limits on automated queries.

Best for: finding specific papers, systematic reviews, clinical evidence, preprints.
Tier 2 -- AI-Assisted Web Search (Comprehensive)
Use AI tools (advanced_web_search_claude) to synthesize broader context, identify research trends, and surface recent developments not yet indexed in databases. Also use general web search (search_google) for protocols, tutorials, and software documentation.
Best for: understanding the research landscape, complex multi-faceted questions, finding recent developments, identifying key researchers.
Avoid for: specific paper lookups (use Tier 1), citation counts (use Google Scholar), systematic reviews requiring reproducibility, searches where exact query terms must be documented.
Tier 3 -- Direct Content Extraction (Deep Dive)
Extract and analyze full-text content, PDFs, and supplementary materials from identified papers using extract_url_content, extract_pdf_content, and fetch_supplementary_info_from_doi.
Best for: detailed methodology extraction, data retrieval, protocol identification, supplementary data access.
PubMed supports field-specific searching to improve precision:
| Tag | Description | Example |
|---|---|---|
| [MeSH] | Medical Subject Heading (controlled vocabulary) | "Neoplasms"[MeSH] |
| [Title] | Title field only | "CRISPR"[Title] |
| [Title/Abstract] | Title or abstract | "gene therapy"[Title/Abstract] |
| [Author] | Author name | "Zhang F"[Author] |
| [Journal] | Journal name | "Nature"[Journal] |
| [Publication Type] | Article type filter | "Review"[Publication Type] |
| [Date - Publication] | Publication date range | "2020/01/01"[Date - Publication]:"2024/12/31"[Date - Publication] |
| [MeSH Major Topic] | MeSH term as major focus of the article | "CRISPR-Cas Systems"[MeSH Major Topic] |
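Field tags can be combined into a single precise query. The sketch below is illustrative; the author, topic, and date values are examples, and the assembled string would be passed to query_pubmed:

```python
# Illustrative: combine field tags into one precise PubMed query.
author = '"Zhang F"[Author]'
topic = '"CRISPR-Cas Systems"[MeSH Major Topic]'
window = '"2020/01/01"[Date - Publication]:"2024/12/31"[Date - Publication]'
query = f"{topic} AND {author} AND ({window})"
# e.g. results = query_pubmed(query, max_papers=20)
```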
Boolean operators control how search terms combine:
# AND: All terms must be present -- narrows results
results = query_pubmed("CRISPR AND cancer AND therapy")
# OR: Any term can be present -- broadens results (use for synonyms)
results = query_pubmed("(tumor OR tumour OR neoplasm) AND immunotherapy")
# NOT: Exclude terms -- use sparingly to avoid losing relevant papers
results = query_pubmed("cancer immunotherapy NOT review")
Use parentheses to group OR terms together before combining with AND.
arXiv organizes preprints by subject category. Biology-related categories include:
| Category | Description |
|---|---|
| q-bio.BM | Biomolecules |
| q-bio.CB | Cell Behavior |
| q-bio.GN | Genomics |
| q-bio.MN | Molecular Networks |
| q-bio.NC | Neurons and Cognition |
| q-bio.QM | Quantitative Methods |
| cs.AI | Artificial Intelligence |
| cs.LG | Machine Learning |
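Categories can restrict an arXiv search, assuming query_arxiv forwards the native arXiv API syntax (cat:, all:, ti: field prefixes) -- an assumption worth verifying against the tool's documentation:

```python
# Sketch: restrict an arXiv search to a subject category, assuming the
# query is forwarded to the arXiv API in its native syntax.
category = "q-bio.GN"
keywords = '"gene regulatory network"'
query = f"cat:{category} AND all:{keywords}"
# e.g. results = query_arxiv(query, max_papers=10)
```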
Use this tree to determine which search tier and database to start with:
What type of question are you answering?
├── Clinical / biomedical question
│ ├── Specific drug or treatment → Tier 1: PubMed with PICO query
│ ├── Disease mechanism → Tier 1: PubMed with MeSH terms
│ └── Clinical trial evidence → Tier 1: PubMed filtered by Publication Type
├── Computational / quantitative methods
│ ├── ML model or algorithm → Tier 1: arXiv (cs.LG, cs.AI)
│ ├── Computational biology method → Tier 1: arXiv (q-bio.*) + PubMed
│ └── Software tool or pipeline → Tier 2: AI-assisted web search
├── Broad research landscape
│ ├── Current state of a field → Tier 2: AI-assisted web search
│ ├── Recent developments (last 6 months) → Tier 2: AI-assisted web search
│ └── Cross-disciplinary question → Tier 1: Google Scholar + Tier 2
├── Specific paper or data
│ ├── Known paper details → Tier 1: any database by title/author/DOI
│ ├── Methodology or protocol → Tier 3: full-text extraction
│ └── Supplementary data → Tier 3: DOI-based supplementary fetch
└── Protocols / reagents
├── Lab protocol → Tier 2: web search for protocols.io, etc.
└── Validated reagents → Tier 2: AI-assisted web search
| Scenario | Recommended Tier and Database | Rationale |
|---|---|---|
| Systematic review of clinical evidence | Tier 1: PubMed with MeSH + publication type filters | Reproducible, documented search strategy required |
| Finding a preprint on a new ML method | Tier 1: arXiv with category and keyword search | Preprints appear on arXiv before journals |
| Understanding the research landscape | Tier 2: AI-assisted web search | Requires synthesis across many sources |
| Extracting a specific protocol from a paper | Tier 3: PDF content extraction | Need full-text access to methods section |
| Finding papers across disciplines | Tier 1: Google Scholar | Broadest coverage across fields |
| Identifying key researchers in a niche area | Tier 2: AI-assisted web search | Requires contextual synthesis |
| Downloading supplementary data tables | Tier 3: DOI-based supplementary fetch | Direct access to supplementary files |
Use controlled vocabulary (MeSH) for PubMed searches: Free-text searches miss papers that use different terminology. MeSH terms map synonyms to a single concept, improving recall without sacrificing precision.
# Free text misses synonyms
query_pubmed("heart attack treatment")
# MeSH captures all synonyms
query_pubmed('"Myocardial Infarction"[MeSH] AND "Drug Therapy"[MeSH]')
Include synonyms and alternative terms with OR: Scientific concepts often have multiple names (e.g., tumor/tumour/neoplasm). Group synonyms with OR inside parentheses to avoid missing relevant papers.
query_pubmed("(myocardial infarction OR heart attack) AND (treatment OR therapy)")
Use phrase searching for multi-word concepts: Quoting exact phrases prevents the search engine from splitting terms and matching them independently.
query_pubmed('"single cell RNA sequencing" AND methods')
Filter by publication type when seeking specific evidence: Clinical trials, systematic reviews, and meta-analyses each answer different questions. Use [Publication Type] to target the evidence level you need.
query_pubmed("COVID-19 vaccine efficacy AND clinical trial[Publication Type]")
Start broad, then narrow iteratively: Begin with core concepts (2-3 terms) and review initial results. Add specificity based on what you find -- more terms, date ranges, field tags, or publication types.
# Step 1: Broad
results = query_pubmed("CRISPR base editing iPSC", max_papers=20)
# Step 2: Add MeSH and specificity
results = query_pubmed(
'"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency',
max_papers=20
)
# Step 3: Filter by date
results = query_pubmed(
'"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency AND ("2022"[Date - Publication]:"2024"[Date - Publication])',
max_papers=20
)
Cross-reference multiple databases: No single database covers all literature. Use PubMed for biomedical content, arXiv for computational preprints, and Google Scholar for cross-disciplinary coverage.
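Results from different databases will overlap. A minimal deduplication sketch, assuming each result is a dict with a "title" key (the real return shape of the query_* helpers may differ):

```python
# Sketch: merge results from several databases, deduplicating by
# whitespace-normalized, lowercased title. The dict shape is assumed,
# not the documented return type of the query_* helpers.
def merge_results(*result_lists):
    seen, merged = set(), []
    for results in result_lists:
        for paper in results:
            key = " ".join(paper["title"].lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(paper)
    return merged

pubmed_hits = [{"title": "Base Editing in iPSCs"}]
arxiv_hits = [{"title": "base  editing in iPSCs"}, {"title": "Prime Editing Review"}]
papers = merge_results(pubmed_hits, arxiv_hits)  # duplicates collapse to one entry
```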
Assess result quality systematically: Evaluate papers for source reliability (peer-reviewed journal), author credentials, recency, study design appropriateness, sample size adequacy, reproducibility, declared conflicts of interest, and citation count.
Overly long and specific queries: Packing too many terms into a single query causes missed results because all terms must match simultaneously.
# Too specific -- misses relevant papers
query_pubmed("CRISPR Cas9 gene editing HEK293T cells 2024 efficiency optimization delivery")
# Better -- core concepts only
query_pubmed("CRISPR Cas9 gene editing optimization efficiency")
Relying on a single database: PubMed has biomedical focus, arXiv covers preprints, Google Scholar spans disciplines. Using only one database guarantees blind spots.
Ignoring publication dates: Scientific knowledge evolves rapidly. Foundational papers remain relevant, but methods and clinical evidence may be superseded.
Skipping title and abstract review before deep-diving: Not all search results that match keywords are actually relevant. Downloading and reading full texts without screening wastes time.
Using NOT operators too aggressively: The NOT operator can inadvertently exclude relevant papers that mention the excluded term in a different context.
Ignoring Google Scholar rate limits: Google Scholar aggressively rate-limits automated queries, which can block further searches.
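A simple mitigation is to space out automated queries. The sketch below is illustrative; the delay values are examples, not documented limits:

```python
import random
import time

# Sketch: space out Google Scholar queries to stay under the rate limit.
# base_delay is an illustrative starting point, not a documented threshold.
def throttled_queries(queries, run_query, base_delay=10.0):
    results = []
    for i, q in enumerate(queries):
        if i > 0:
            # Randomized spacing looks less like automated traffic.
            time.sleep(base_delay * (1 + random.random() * 0.5))
        results.append(run_query(q))
    return results

# e.g. throttled_queries(["scRNA-seq methods", "spatial transcriptomics"], query_scholar)
```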
Not documenting the search strategy: For systematic reviews and reproducible research, an undocumented search cannot be verified or reproduced.
Step 1: Define the research question
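The question can be captured in structured PICO form before any queries are run (field names are illustrative):

```python
# Sketch: record the question in PICO form so each query term can be
# traced back to a component. Field names are illustrative.
question = {
    "population": "adults with type 2 diabetes",
    "intervention": "metformin",
    "comparison": "placebo",
    "outcome": "cardiovascular events",
}
```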
Step 2: Construct and execute database queries (Tier 1)
from biomni.tool.literature import query_pubmed, query_arxiv, query_scholar
# PubMed: biomedical literature
results = query_pubmed(
'"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]',
max_papers=20
)
# arXiv: computational biology preprints
results = query_arxiv("protein structure prediction", max_papers=10)
# Google Scholar: broad cross-disciplinary coverage
result = query_scholar("single cell RNA sequencing analysis methods")
Step 3: Supplement with AI-assisted search (Tier 2)
from biomni.tool.literature import advanced_web_search_claude
results = advanced_web_search_claude(
"What are the latest developments in CAR-T cell therapy for solid tumors in 2024?",
max_searches=3
)
Step 4: Evaluate and filter results
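An initial screen can be automated before full-text review. The sketch assumes each result dict carries "title" and "year" fields, which may differ from the actual return shape of the query_* helpers:

```python
# Sketch: screen results by recency and title keywords before deep-diving.
# The "title"/"year" dict fields are assumed, not a documented return shape.
def screen(results, min_year=2020, must_contain=()):
    kept = []
    for paper in results:
        if paper.get("year", 0) < min_year:
            continue
        title = paper.get("title", "").lower()
        if all(term.lower() in title for term in must_contain):
            kept.append(paper)
    return kept

hits = [
    {"title": "CRISPR base editing efficiency", "year": 2023},
    {"title": "Zinc finger nucleases", "year": 2012},
]
shortlist = screen(hits, min_year=2020, must_contain=("base editing",))
```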
Step 5: Deep dive into key papers (Tier 3)
from biomni.tool.literature import extract_url_content, extract_pdf_content, fetch_supplementary_info_from_doi
# Extract article content from URL
content = extract_url_content("https://www.nature.com/articles/nature12373")
# Extract text from PDF
content = extract_pdf_content("https://arxiv.org/pdf/1706.03762.pdf")
# Download supplementary files using DOI
log = fetch_supplementary_info_from_doi(
"10.1038/nature12373",
output_dir="./supplementary_materials"
)
Step 6: Document and iterate
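A minimal search-log sketch for reproducibility (the file name and fields are illustrative):

```python
import json
from datetime import date

# Sketch: append each executed query to a JSON-lines log so the search
# strategy can be reproduced. File name and fields are illustrative.
def log_search(query, database, n_results, path="search_log.jsonl"):
    entry = {
        "date": date.today().isoformat(),
        "database": database,
        "query": query,
        "results_returned": n_results,
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")

# e.g. log_search('"Gene Editing"[MeSH]', "pubmed", 20)
```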
The following scenarios illustrate how to combine the three tiers for typical research questions.
Start with PubMed for published methodology papers, then supplement with web search for step-by-step protocols from resources like protocols.io.
from biomni.tool.literature import query_pubmed, search_google
# Search for methodology papers in PubMed
results = query_pubmed(
'"Western Blotting"[MeSH] AND (protocol OR method OR technique)',
max_papers=10
)
# Check web for step-by-step protocols
results = search_google("Western blot protocol for membrane proteins", num_results=5)
Begin with review articles for a broad overview, then drill into specific mechanistic studies.
# Find review articles first for an overview
results = query_pubmed(
'"Alzheimer Disease"[MeSH] AND pathophysiology AND review[Publication Type]',
max_papers=10
)
# Then find specific mechanistic studies
results = query_pubmed(
'"Alzheimer Disease"[MeSH] AND ("amyloid beta"[MeSH] OR tau) AND mechanism',
max_papers=20
)
Use publication type filters to separate clinical trial evidence from systematic reviews.
# Clinical trials for a specific drug-condition pair
results = query_pubmed(
'"Drug Name"[Substance Name] AND "Condition"[MeSH] AND clinical trial[Publication Type]',
max_papers=20
)
# Systematic reviews and meta-analyses
results = query_pubmed(
'"Drug Name" AND "Condition" AND (systematic review[Publication Type] OR meta-analysis[Publication Type])',
max_papers=10
)
Combine AI-assisted search for synthesis with database searches for recent indexed publications.
from biomni.tool.literature import advanced_web_search_claude, query_pubmed
# AI-assisted synthesis of recent advances
results = advanced_web_search_claude(
"What are the most significant advances in CAR-T cell therapy in 2024?",
max_searches=3
)
# Supplement with recent PubMed results
results = query_pubmed(
'"Chimeric Antigen Receptor T-Cell Therapy"[MeSH] AND "2024"[Date - Publication]',
max_papers=20
)
Use AI-assisted search for validated reagent recommendations, supplemented by general web search.
from biomni.tool.literature import advanced_web_search_claude, search_google
# Search for validated reagents
results = advanced_web_search_claude(
"validated antibodies for Western blot detection of p53 protein",
max_searches=2
)
# Search supplier databases
results = search_google("p53 antibody Western blot validated", num_results=5)
Use AI-assisted search for synthesized comparisons of techniques or tools.
from biomni.tool.literature import advanced_web_search_claude
# Compare approaches with AI synthesis
results = advanced_web_search_claude(
"Compare different CRISPR delivery methods for in vivo gene editing: viral vectors vs lipid nanoparticles",
max_searches=5
)
When evaluating search results, apply these criteria: peer-review status and source reliability, author credentials, recency, study design appropriateness, sample size adequacy, reproducibility, declared conflicts of interest, and citation count.
- pubmed-database -- Direct PubMed API access for programmatic literature retrieval
- scientific-manuscript-writing -- Structuring literature review sections within manuscripts
- research-question-formulation -- Frameworks for defining answerable research questions