Corpus retrieval — searches academic, policy, media, and report sources to build a curated corpus
Search for and collect relevant academic, policy, media, and report sources based on the approved scoping report. Build a structured corpus of source records for audit and downstream analysis.
This skill implements PRD Stage 2 (Section 17.3) — retrieval portion.
$ARGUMENTS[0]: workspace path (e.g., workspaces/ar-2026-03-22-a1b2c3d4)
Read from the workspace:
{workspace}/analysis/scoping/scoping_report.json: research question, sub-questions, scope boundaries, track
{workspace}/status.json: paper_type, topic_seed
From the scoping report, extract:
Run 15-25 WebSearch calls organised into 4 source categories:
Academic sources (5-8 searches):
"{research_question}" peer-reviewed journal
"{topic}" systematic review OR literature review
"{sub_question}" academic research
"{topic}" theoretical framework {discipline}
"{topic}" methodology {discipline} research
Policy sources (3-5 searches, more for policy_analysis track):
"{topic}" government report official
"{topic}" policy document white paper
"{topic}" official statement institutional
"{topic}" regulatory framework
Media sources (3-4 searches):
"{topic}" news analysis commentary
"{topic}" public debate discourse
"{topic}" media coverage {year_range}
Reports and working papers (3-4 searches):
"{topic}" think tank report
"{topic}" research brief working paper
"{topic}" {region} report (if geographic scope is set)
Adjust emphasis by track:
"{topic}" {case_name})
For each unique result found, create a SourceRecord JSON file:
Generate a source ID: .venv/bin/python3 utils/schemas.py id src
Populate fields:
source_id: the generated ID
title: exact title from search result
authors: author names. If the initial search result does not include author names, run an additional targeted WebSearch, "{title}" authors published {year}, to find them. Use [] only as a last resort, after this additional search also fails. NEVER use placeholder values like "Various" or "Unknown".
year: publication year if available
source_type: classify as journal / policy / media / book / archive / dataset
venue: journal name, publisher, or outlet
url_or_path: URL from search result
jurisdiction: geographic jurisdiction if relevant
peer_reviewed: true if from a peer-reviewed journal
primary_or_secondary: classify as primary or secondary
reliability_score: 0.0 (will be scored by auditor)
verification_status: "unverified" (will be verified by auditor)
Write to {workspace}/sources/{category}/src-{id}.json where category is academic, policy, or media
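The field list above can be sketched as a small Python helper. This is a minimal illustration, not the project's actual schema code: the function names build_source_record and write_source_record are hypothetical, and the sketch assumes the generated source ID already carries the src- prefix, so the file name is the ID plus .json.

```python
import json
from pathlib import Path

# Allowed values taken from the field descriptions above.
SOURCE_TYPES = {"journal", "policy", "media", "book", "archive", "dataset"}
CATEGORIES = {"academic", "policy", "media"}
PLACEHOLDER_AUTHORS = {"various", "unknown"}


def build_source_record(source_id, title, url, source_type, authors=None,
                        year=None, venue=None, jurisdiction=None,
                        peer_reviewed=False, primary_or_secondary=None):
    """Assemble a SourceRecord dict (hypothetical helper)."""
    authors = list(authors or [])
    # Reject placeholder author names such as "Various" or "Unknown".
    if any(a.strip().lower() in PLACEHOLDER_AUTHORS for a in authors):
        raise ValueError("placeholder author names are not allowed")
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {source_type}")
    return {
        "source_id": source_id,
        "title": title,
        "authors": authors,
        "year": year,
        "source_type": source_type,
        "venue": venue,
        "url_or_path": url,
        "jurisdiction": jurisdiction,
        "peer_reviewed": peer_reviewed,
        "primary_or_secondary": primary_or_secondary,
        "reliability_score": 0.0,              # scored later by the auditor
        "verification_status": "unverified",   # verified later by the auditor
    }


def write_source_record(workspace, category, record):
    """Write the record under {workspace}/sources/{category}/.

    Assumption: source_id already includes the "src-" prefix.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    target_dir = Path(workspace) / "sources" / category
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{record['source_id']}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Validating at construction time keeps placeholder authors and unknown source types out of the corpus before the auditor ever sees a record.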
Check for duplicate sources by comparing:
Remove duplicates, keeping the first occurrence.
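A minimal sketch of the first-occurrence rule, assuming duplicates are detected by matching normalised title or URL. Those comparison fields are an assumption for illustration, and the function names are hypothetical; the actual criteria may differ.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(str(text or "").lower().split())


def deduplicate(records):
    """Return records with duplicates removed, keeping the first occurrence.

    Assumption: two records are duplicates when their normalised titles
    match or their URLs match after dropping a trailing slash.
    """
    seen_titles, seen_urls = set(), set()
    unique = []
    for record in records:
        title = normalise(record.get("title"))
        url = normalise(record.get("url_or_path")).rstrip("/")
        if (title and title in seen_titles) or (url and url in seen_urls):
            continue  # a matching record was already kept
        if title:
            seen_titles.add(title)
        if url:
            seen_urls.add(url)
        unique.append(record)
    return unique
```

Iterating in input order and skipping later matches is what preserves the first occurrence.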
Write {workspace}/sources/corpus_manifest.json:
{
  "total_sources": 0,
  "academic_count": 0,
  "policy_count": 0,
  "media_count": 0,
  "report_count": 0,
  "search_queries_used": ["query1", "query2"],
  "source_ids": ["src-a1b2c3d4", "src-e5f6g7h8"]
}
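One way to produce this manifest is to derive every count from the collected source IDs rather than maintaining counters by hand. The sketch below assumes the IDs are already grouped by category in a dict; build_manifest and write_manifest are hypothetical names, not part of the project's utilities.

```python
import json
from pathlib import Path

# Category order mirrors the *_count keys in the manifest.
MANIFEST_CATEGORIES = ("academic", "policy", "media", "report")


def build_manifest(source_ids_by_category, search_queries):
    """Build the manifest dict from source IDs grouped by category."""
    source_ids = [
        sid
        for category in MANIFEST_CATEGORIES
        for sid in source_ids_by_category.get(category, [])
    ]
    manifest = {"total_sources": len(source_ids)}
    for category in MANIFEST_CATEGORIES:
        manifest[f"{category}_count"] = len(source_ids_by_category.get(category, []))
    manifest["search_queries_used"] = list(search_queries)
    manifest["source_ids"] = source_ids
    return manifest


def write_manifest(workspace, manifest):
    """Write the manifest to {workspace}/sources/corpus_manifest.json."""
    path = Path(workspace) / "sources" / "corpus_manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path
```

Computing total_sources from the same list that fills source_ids keeps the two fields from drifting apart.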
Update {workspace}/status.json:
Set state to CORPUS_BUILDING
Add "retrieval" to stages_completed if not present
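The status update can be sketched as an idempotent read-modify-write. This assumes state and stages_completed are top-level keys in status.json and that stages_completed is a list; mark_retrieval_complete is a hypothetical helper name.

```python
import json
from pathlib import Path


def mark_retrieval_complete(workspace):
    """Set state to CORPUS_BUILDING and record the retrieval stage once."""
    status_path = Path(workspace) / "status.json"
    status = json.loads(status_path.read_text(encoding="utf-8"))
    status["state"] = "CORPUS_BUILDING"
    # Assumption: stages_completed is a top-level list; create it if missing.
    stages = status.setdefault("stages_completed", [])
    if "retrieval" not in stages:
        stages.append("retrieval")
    status_path.write_text(json.dumps(status, indent=2), encoding="utf-8")
    return status
```

Because the stage is appended only when absent, running the skill twice on the same workspace leaves a single "retrieval" entry and preserves the other fields such as paper_type and topic_seed.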