Define extraction schema, extract study data from full texts, and store it in a structured database for meta-analysis. Use when moving from full-text collection to statistical analysis.
Extract consistent data, capture provenance, and build a clean analysis dataset.
Inputs:

- 04_fulltext/manifest.csv
- 01_protocol/outcomes.md

Outputs:

- 05_extraction/extraction.sqlite
- 05_extraction/extraction.csv
- 05_extraction/llm_suggestions.jsonl (optional)
- 05_extraction/data-dictionary.md
- 05_extraction/extraction-log.md
- 05_extraction/study_map.csv (optional if record_id is not in the extraction CSV)
- 05_extraction/source.csv (optional source references)
- 05_extraction/source_validation.md (optional)

⚠️ Default approach: run web-based extraction FIRST, then use PDFs only for gaps.
1. Create the data dictionary at 05_extraction/data-dictionary.md (use references/data-dictionary-template.md).
2. Initialize the database: run scripts/init_extraction_db.py via uv run. This creates 05_extraction/extraction.sqlite.
3. Populate the studies table in 05_extraction/extraction.sqlite from 03_screening/round-01/included.bib or 04_fulltext/manifest.csv, marking any web-sourced values with [web] in the notes column.
4. Run scripts/llm_extract.py via uv run on available PDFs.
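As an illustration, a minimal sketch of what the schema initialization could look like — the table and column names here are assumptions, not the actual schema produced by scripts/init_extraction_db.py:

```python
import sqlite3


def init_extraction_db(path: str = "05_extraction/extraction.sqlite") -> sqlite3.Connection:
    """Create a minimal extraction schema (illustrative tables/columns)."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS studies (
            study_id     TEXT PRIMARY KEY,
            record_id    TEXT,
            first_author TEXT,
            year         INTEGER,
            notes        TEXT  -- e.g. 'n_total from PubMed abstract [web]'
        );
        CREATE TABLE IF NOT EXISTS outcomes (
            study_id            TEXT REFERENCES studies(study_id),
            outcome             TEXT,
            n_total             INTEGER,
            events_intervention INTEGER,
            events_control      INTEGER,
            mean                REAL,
            sd                  REAL
        );
    """)
    conn.commit()
    return conn
```

Keeping one row per study in `studies` and one row per outcome in `outcomes` avoids duplicating study-level fields across outcomes.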
scripts/llm_extract.py reads 04_fulltext/*.pdf and writes suggestions to 05_extraction/llm_suggestions.jsonl. Review each suggestion before accepting it into 05_extraction/extraction.sqlite, and record decisions in 05_extraction/extraction-log.md.
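The suggestion file format is not specified here; a plausible JSONL record shape (the field names are assumptions, not the actual output of scripts/llm_extract.py) could be appended like this:

```python
import json


def write_suggestion(jsonl_path, study_id, field, value, confidence, quote):
    """Append one extraction suggestion as a JSON line for later human review."""
    record = {
        "study_id": study_id,
        "field": field,              # e.g. "n_total"
        "value": value,
        "confidence": confidence,    # model's self-reported confidence, 0-1
        "supporting_quote": quote,   # verbatim PDF text, for human verification
    }
    with open(jsonl_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Carrying a verbatim supporting quote with every suggestion makes the human-review step much faster than re-reading the PDF.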
Document extraction decisions in 05_extraction/extraction-log.md, export 05_extraction/extraction.csv from SQLite, then optionally record source references in 05_extraction/source.csv and validate with scripts/validate_sources.py.
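Exporting the analysis CSV from SQLite needs nothing beyond the standard library; a sketch (the table name is an assumption about the schema):

```python
import csv
import sqlite3


def export_table_to_csv(db_path: str, table: str, csv_path: str) -> int:
    """Dump one SQLite table to CSV with a header row; return the row count."""
    conn = sqlite3.connect(db_path)
    # Table name must come from trusted config, never user input (f-string SQL).
    cur = conn.execute(f"SELECT * FROM {table}")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])
        rows = cur.fetchall()
        writer.writerows(rows)
    conn.close()
    return len(rows)
```

Usage might look like `export_table_to_csv("05_extraction/extraction.sqlite", "studies", "05_extraction/extraction.csv")`.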
- 05_extraction/source.csv (use references/source-template.csv)
- scripts/validate_sources.py → 05_extraction/source_validation.md
- scripts/init_extraction_db.py initializes a standard extraction schema.
- scripts/llm_extract.py provides LLM-assisted extraction suggestions.
- scripts/validate_sources.py validates extraction against source references.
- references/data-dictionary-template.md provides a dictionary scaffold.
- references/study-map-template.csv maps record_id to study_id if needed.
- references/source-template.csv provides a template for source references.
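In miniature, "validates extraction against source references" might look like the following — the column names and the (study_id, field) matching rule are assumptions, not the actual behavior of scripts/validate_sources.py:

```python
def validate_sources(extraction_rows, source_rows):
    """Flag extracted values that disagree with their recorded source value.

    Each row is a dict; rows are matched on (study_id, field).
    Returns a list of human-readable discrepancy messages.
    """
    sources = {(r["study_id"], r["field"]): r["value"] for r in source_rows}
    problems = []
    for row in extraction_rows:
        key = (row["study_id"], row["field"])
        if key in sources and sources[key] != row["value"]:
            problems.append(
                f"{row['study_id']}.{row['field']}: extracted {row['value']!r} "
                f"!= source {sources[key]!r}"
            )
    return problems
```

Fields with no source reference are simply skipped, matching the "leave NULL, document gap" rule rather than failing validation.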
Note: llm_extract.py requires a PDF parser such as pdfplumber or pypdf (install via uv add).

⚠️ Web-based extraction is the DEFAULT first step — Claude Code should run it BEFORE attempting PDF-based extraction. No scripts or API keys required.
1. Scan extraction.csv for NULL or empty cells in critical columns (e.g., n_total, events_intervention, events_control, mean, sd).
2. Use WebSearch with one of the query patterns:
   - "<first_author> <year> <journal> <intervention> <outcome> results"
   - "<DOI>"
   - "PMID:<pmid> abstract"
3. Use WebFetch on high-value URLs:
   - https://pubmed.ncbi.nlm.nih.gov/<pmid>/
   - https://clinicaltrials.gov/study/<nct_id>
   - https://europepmc.org/article/MED/<pmid>
4. Record recovered values in extraction.csv, marking each with [web] in the notes column (e.g., n_total from PubMed abstract [web]).

Assign a confidence to each web-sourced value:

| Source | Confidence | Action |
|---|---|---|
| PubMed structured abstract | 0.90 | Accept |
| ClinicalTrials.gov registry | 0.85 | Accept |
| Journal webpage / press release | 0.70 | Accept with note |
| Conference abstract only | 0.60 | Flag for verification |
| No source found | — | Leave NULL, document gap |
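The gap scan and the source-confidence rules above can be sketched together — the CSV column names and the source labels are illustrative, not a fixed vocabulary:

```python
import csv

CRITICAL_COLUMNS = ["n_total", "events_intervention", "events_control", "mean", "sd"]

# Confidence/action table from above; the keys are illustrative source labels.
SOURCE_RULES = {
    "pubmed_abstract":     (0.90, "accept"),
    "clinicaltrials_gov":  (0.85, "accept"),
    "journal_webpage":     (0.70, "accept with note"),
    "conference_abstract": (0.60, "flag for verification"),
}


def find_gaps(csv_path):
    """Return (study_id, missing columns) for rows with NULL/empty critical cells."""
    gaps = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            missing = [c for c in CRITICAL_COLUMNS
                       if (row.get(c) or "").strip() in ("", "NULL")]
            if missing:
                gaps.append((row.get("study_id", "?"), missing))
    return gaps


def action_for(source_label):
    """Map a web source to (confidence, action); unknown sources leave the cell NULL."""
    return SOURCE_RULES.get(source_label, (None, "leave NULL, document gap"))
```

Each gap then drives one WebSearch/WebFetch round, and `action_for` decides whether the recovered value is written back with a [web] note.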
Run scripts/validate_sources.py when sources are available.

Pipeline navigation:

| Step | Skill | Stage |
|---|---|---|
| Prev | /ma-fulltext-management | 04 Full-text Management |
| Next | /ma-meta-analysis | 06 Statistical Analysis |
| All | /ma-end-to-end | Full pipeline orchestration |