Find and assess datasets for your research question. Two Explorer agents search in parallel across data source categories; an Explorer-Critic then stress-tests each candidate against the research design.
Input: $ARGUMENTS — a research topic, or `from spec` to read the research question from the most recent spec in quality_reports/.
Find the most recent quality_reports/project_spec_*.md or quality_reports/specs/*.md — extract: the research question, empirical strategy, variables (treatment, outcome, controls), time period, geography, and unit of observation.
Read references/domain-profile.md if it exists — extract the Common Datasets section (domain-specific datasets to check first).
If no research spec exists, extract the variables and strategy from $ARGUMENTS directly. If the request is too vague to identify a treatment, an outcome, and a rough time period, ask the user to clarify before proceeding.
Read agents/explorer.md once before spawning the Explorer subagents.
Split the source categories between two Explorer subagents so the searches run in parallel.
Explorer A — Institutional Data:
Subagent prompt: "You are an Explorer agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Domain datasets (check first): [list from domain-profile if available].
Your source categories to search:
1. Public microdata (CPS, ACS, NHIS, MEPS, SIPP, QWI)
2. Administrative data (Medicare/Medicaid, IRS, SSA, vital statistics, court records)
3. Survey panels (PSID, HRS, Add Health, NLSY97/79, BHPS/UKHLS)
For each dataset found, produce the full Explorer report format.
Follow the Explorer agent instructions."
Explorer B — Broader and Alternative Sources:
Subagent prompt: "You are an Explorer agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Domain datasets (check first): [list from domain-profile if available].
Your source categories to search:
1. International data (World Bank, OECD, Eurostat, IMF, IPUMS International)
2. Novel/alternative (satellite, web scraping, proprietary, RCT registries)
3. Any field-specific datasets not covered by Explorer A
For each dataset found, produce the full Explorer report format.
Follow the Explorer agent instructions."
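The fan-out above can be sketched as concurrent dispatch — a minimal illustration, where `spawn_explorer` is a hypothetical stand-in for the real subagent call (here it just echoes its assignment so the pattern is runnable):

```python
from concurrent.futures import ThreadPoolExecutor


def spawn_explorer(name: str, categories: list[str]) -> str:
    # Placeholder for the actual subagent invocation; the real call
    # would send the full Explorer prompt and return its report.
    return f"{name} searched: {', '.join(categories)}"


assignments = {
    "Explorer A": ["public microdata", "administrative data", "survey panels"],
    "Explorer B": ["international data", "novel/alternative sources", "other field-specific datasets"],
}

# Launch both Explorers concurrently and collect their reports.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(spawn_explorer, name, cats) for name, cats in assignments.items()]
    reports = [f.result() for f in futures]

combined = "\n".join(reports)
```

The `combined` text is what gets pasted into the Explorer-Critic prompt in the next step.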
Read agents/explorer-critic.md before spawning the critic subagent.
After both Explorer subagents complete, spawn the Explorer-Critic with the full combined dataset list.
Subagent prompt: "You are an Explorer-Critic agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Here is the combined dataset list from the Explorer agents:
[paste all Explorer findings]
Apply the 5-point critique to each dataset:
1. Measurement validity
2. Sample selection
3. External validity
4. Identification compatibility
5. Known issues
Produce adjusted feasibility grades and deal-breaker flags.
Follow the Explorer-Critic agent instructions."
After the Explorer-Critic completes, compile the final ranked report:
Save to quality_reports/data_exploration_[sanitized_topic].md:
# Data Exploration: [Topic]
**Date:** [YYYY-MM-DD]
**Research question:** [one sentence]
**Empirical strategy:** [method]
**Variables sought:** Treatment = [X], Outcome = [Y], Controls = [list]
---
## Top Candidates (Grade A–B)
### 1. [Dataset Name] — Grade: A/B
**Provider:** [Name] | **Access:** [Public/Restricted/etc.] | **URL:** [link]
**Coverage:** [time period] | [geography] | [unit of observation] | N ≈ [size]
**Key Variables:**
- Treatment proxy: [variable]
- Outcome: [variable]
- Controls available: [list]
**Explorer-Critic Assessment:**
- Measurement validity: [1-2 sentences]
- Sample selection: [1-2 sentences]
- External validity: [1-2 sentences]
- Identification compatibility: [1-2 sentences on fit with the proposed strategy]
- Known issues: [specific documented problems]
**Bottom line:** [1-2 sentences — viable and under what conditions]
---
[Repeat for all A and B grade datasets]
---
## Accessible With Effort (Grade C)
[Brief summaries — name, access path, main limitation, why C not B]
---
## Rejection Table
| Dataset | Reason for Rejection | Deal-breaker? |
|---------|---------------------|---------------|
| [Name] | [Explorer-Critic finding] | YES/NO |
---
## Recommended Path Forward
1. **Best dataset:** [Name] — [one sentence why]
2. **Fallback if [best] unavailable:** [Name] — [why it's second choice]
3. **Access steps for [best]:** [specific actions needed — download link, application URL, IRB requirements]
---
## Next Steps
- **`data-analysis [dataset]`** — begin analysis with the recommended dataset
- **`lit-review [topic]`** — check which published papers use these datasets (helps validate the choice)
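The `[sanitized_topic]` slug in the report filename can be derived with a minimal sketch — assuming the convention is lowercase with underscores for non-alphanumeric runs (the helper name is illustrative):

```python
import re


def sanitize_topic(topic: str) -> str:
    """Lowercase the topic and collapse non-alphanumeric runs to underscores."""
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    return slug or "untitled"


path = f"quality_reports/data_exploration_{sanitize_topic('Medicaid & infant health')}.md"
# → quality_reports/data_exploration_medicaid_infant_health.md
```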