Find, download, and match empirical datasets to lecture content. Knows major economics/conflict data sources (World Bank, UCDP, ACLED, V-Dem, Penn World Table, etc.). Auto-scans slide content to suggest fitting data for charts, exercises, and examples. Use when the user says 'find data', 'download dataset', 'what data fits', 'add empirical evidence', 'update statistics'.
Automatically find, assess, and organize empirical datasets that match lecture content. Designed for economics of conflict and geoeconomics courses, but extensible to any applied micro topic.
| Source | URL | Key Variables | API | Format |
|---|---|---|---|---|
| UCDP/PRIO Armed Conflict Dataset | ucdp.uu.se | Conflict events, battle deaths, actors, geography | REST API | CSV, XLSX |
| ACLED (Armed Conflict Location & Event Data) | acleddata.com | Geo-coded events, protests, violence against civilians | REST API (key required) | CSV, JSON |
| GTD (Global Terrorism Database) | start.umd.edu/gtd | Terrorist attacks, casualties, targets, weapons | Download | CSV |
| UCDP GED (Georeferenced Event Dataset) | ucdp.uu.se/downloads | Geo-coded conflict events with coordinates | REST API | CSV |
| Correlates of War | correlatesofwar.org | Interstate wars, MIDs, alliances, trade, capabilities | Download | CSV |
| PRIO GRID | grid.prio.org | Cell-level conflict, climate, population, economic data | Download | CSV, NetCDF |
| ICB (International Crisis Behavior) | sites.duke.edu/icbdata | International crises, triggers, outcomes | Download | CSV |
| Source | URL | Key Variables | API | Format |
|---|---|---|---|---|
| V-Dem (Varieties of Democracy) | v-dem.net | Democracy indices, institutional features, 400+ indicators | Download / R package (vdemdata) | CSV, RDS |
| Polity5 | systemicpeace.org | Regime type, political competition, executive constraints | Download | XLSX |
| Worldwide Governance Indicators | info.worldbank.org/governance/wgi | Rule of law, corruption, government effectiveness | WB API | CSV |
| ICRG (International Country Risk Guide) | prsgroup.com | Political risk, economic risk, composite risk | Subscription | XLSX |
| Fragile States Index | fragilestatesindex.org | State fragility, 12 dimensions | Download | CSV |
| Source | URL | Key Variables | API | Format |
|---|---|---|---|---|
| World Bank WDI | data.worldbank.org | GDP, trade, poverty, education, health (1400+ indicators) | wbgapi Python / REST | CSV, JSON |
| Penn World Table | rug.nl/ggdc/productivity/pwt | Real GDP, capital stock, TFP, exchange rates | Download | XLSX |
| IMF WEO | imf.org/en/Publications/WEO | GDP forecasts, inflation, fiscal balances | REST API | CSV |
| UN Comtrade | comtrade.un.org | Bilateral trade flows by commodity | REST API | CSV, JSON |
| CEPII (BACI, Gravity) | cepii.fr | Bilateral trade, gravity variables, GeoDist | Download | CSV |
| Maddison Project | rug.nl/ggdc/historicaldevelopment/maddison | Historical GDP per capita (1 AD–present) | Download | XLSX |
| Source | URL | Key Variables | API | Format |
|---|---|---|---|---|
| World Bank Commodity Prices | worldbank.org/commodities | Monthly/annual prices for 70+ commodities | WB API | CSV |
| USGS Minerals Data | usgs.gov/centers/national-minerals-information-center | Mineral production, reserves, trade | Download | XLSX |
| BP Statistical Review (now Energy Institute) | energyinst.org | Energy production, consumption, CO2, renewables | Download | XLSX |
| EITI (Extractive Industries Transparency) | eiti.org | Revenue from extractive industries by country | Download | PDF, CSV |
| Source | URL | Key Variables | API | Format |
|---|---|---|---|---|
| CRU TS (Climatic Research Unit) | crudata.uea.ac.uk | Temperature, precipitation, monthly gridded | Download | NetCDF |
| ERA5 (ECMWF Reanalysis) | cds.climate.copernicus.eu | Temperature, wind, precipitation, global gridded | API | NetCDF, GRIB |
| EM-DAT (Emergency Events Database) | emdat.be | Natural disasters, deaths, damage, affected population | Download | XLSX |
| SPEI Global Drought Monitor | spei.csic.es | Drought index, monthly global | Download | NetCDF |
| Source | URL | Key Variables | API | Format |
|---|---|---|---|---|
| UNHCR Refugee Statistics | unhcr.org/refugee-statistics | Refugees, asylum seekers, IDPs, stateless persons | REST API | CSV, JSON |
| IDMC (Internal Displacement Monitoring) | internal-displacement.org | Internal displacement by country, cause, year | REST API | CSV |
| IOM DTM (Displacement Tracking Matrix) | dtm.iom.int | Flow monitoring, displacement tracking | Download | XLSX |
| Source | URL | Key Variables | API | Format |
|---|---|---|---|---|
| GSDB (Global Sanctions Database) | globalsanctionsdb.com | Sanctions episodes, type, sender, target | Download | XLSX |
| UN Security Council Sanctions | un.org/securitycouncil/sanctions | Active sanctions regimes, listed entities | Download | XML |
| OFAC SDN List | ofac.treasury.gov | Sanctioned individuals and entities (US) | REST API | CSV, XML |
| Correlates of War Trade | correlatesofwar.org | Bilateral trade 1870–2014 (v4.0) | Download | CSV |
When running inside the pipeline (R1b), the data-sourcer receives the case file as input, which contains:
The data-sourcer does NOT freestyle — it follows a 3-step routing logic.
| Figure role | Expected tier | Reasoning |
|---|---|---|
| Motivator (trends, magnitudes, opening hooks) | Tier 1 (API download) | Time-series/aggregate data is usually available from APIs |
| Context (current events, background) | Tier 1 or Tier 2 | Try API first; fall back to published if specific |
| Evidence (paper regression results, coefficients) | Tier 2 (published) | Paper-specific coefficients — no API exists |
| Mechanism (DAG, flow chart) | Conceptual (TikZ) | No data needed — skip |
| Identification (first stage, IV diagnostics) | Tier 2 (published) | Paper-specific methodology outputs |
| Synthesis (comparison table) | N/A | Text/tabular — no external data |
Rule: If a figure is classified as Motivator or Context, the data-sourcer MUST attempt Tier 1 before accepting Tier 2. Document the attempt in the pipeline logbook.
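The role-to-tier table and the Tier 1 rule can be sketched as a small lookup. This is an illustrative sketch only: the names `Tier`, `ROLE_TIERS`, `tiers_to_try`, and `must_attempt_tier1` are hypothetical and not part of the pipeline, and the ordering for Motivator (Tier 1, then Tier 2) is an assumption read off the rule above.

```python
from enum import Enum


class Tier(Enum):
    TIER_1 = "Tier 1 (API download)"
    TIER_2 = "Tier 2 (published)"
    CONCEPTUAL = "Conceptual (TikZ)"
    NOT_APPLICABLE = "N/A"


# Ordered tiers to try for each figure role, mirroring the table above.
# Assumption: Motivator may fall back to Tier 2 only after a Tier 1 attempt.
ROLE_TIERS = {
    "motivator": [Tier.TIER_1, Tier.TIER_2],
    "context": [Tier.TIER_1, Tier.TIER_2],
    "evidence": [Tier.TIER_2],
    "mechanism": [Tier.CONCEPTUAL],
    "identification": [Tier.TIER_2],
    "synthesis": [Tier.NOT_APPLICABLE],
}


def tiers_to_try(role: str) -> list[Tier]:
    """Return the ordered list of tiers for a figure role (case-insensitive)."""
    key = role.strip().lower()
    if key not in ROLE_TIERS:
        raise ValueError(f"Unknown figure role: {role!r}")
    return list(ROLE_TIERS[key])


def must_attempt_tier1(role: str) -> bool:
    """True when the rule requires a Tier 1 attempt before any fallback."""
    return tiers_to_try(role)[0] is Tier.TIER_1
```

Keeping the order in a list makes the "Tier 1 before Tier 2" rule a property of the data rather than of scattered conditionals.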
The data-sourcer extracts keywords from the case file (papers, lecture title, figure descriptions) and routes to sources:
| Keywords in case file | Data domain | Primary source (tested, fastest) | Fallback |
|---|---|---|---|
| conflict, civil war, battle deaths, violence | Conflict events | OWID catalog (war topic) | UCDP CSV bulk download |
| GDP, growth, income, poverty, development | Development | WB WDI (wbgapi) | Global Macro Database |
| commodities, oil, coffee, prices, pink sheet | Commodity prices | WB Pink Sheet (Excel URL) | OWID catalog |
| food prices, agriculture, FAO | Food/agriculture | FAO bulk ZIP | WB Pink Sheet |
| refugees, displacement, IDPs, asylum | Displacement | UNHCR API (REST, no key) | OWID catalog |
| child mortality, nutrition, education, WASH | Human development | UNICEF (unicefdata) | OWID catalog |
| democracy, institutions, governance, polity | Political institutions | V-Dem (R: vdemdata) | WB WGI via wbgapi |
| trade, exports, imports, tariffs, sanctions | Trade flows | WB WDI (wbgapi) | UN Comtrade |
| climate, temperature, rainfall, drought | Climate | CRU TS (NetCDF) | OWID catalog |
| GDP historical, macro panel, inflation | Macro panel | Global Macro Database (global_macro_data) | Penn World Table |
| maps, geography, spatial, borders | Geospatial | Natural Earth (shapefile) + sf in R | — |
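A minimal sketch of the keyword routing, assuming whole-word matching and ranking by number of keyword hits. Both choices are assumptions, as are the names `ROUTES`, `count_hits`, and `route`. Only four rows of the table are reproduced here to keep the example short.

```python
import re

# (keywords, data domain, primary source, fallback), copied from rows of the table above.
ROUTES = [
    (("conflict", "civil war", "battle deaths", "violence"),
     "Conflict events", "OWID catalog (war topic)", "UCDP CSV bulk download"),
    (("gdp", "growth", "income", "poverty", "development"),
     "Development", "WB WDI (wbgapi)", "Global Macro Database"),
    (("commodities", "oil", "coffee", "prices", "pink sheet"),
     "Commodity prices", "WB Pink Sheet (Excel URL)", "OWID catalog"),
    (("refugees", "displacement", "idps", "asylum"),
     "Displacement", "UNHCR API (REST, no key)", "OWID catalog"),
]


def count_hits(text: str, keywords) -> int:
    """Count keywords that occur as whole words/phrases in the text."""
    lowered = text.lower()
    return sum(
        1 for kw in keywords
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    )


def route(text: str) -> list[dict]:
    """Return matching domains, strongest match first (ties keep table order)."""
    scored = []
    for keywords, domain, primary, fallback in ROUTES:
        hits = count_hits(text, keywords)
        if hits:
            scored.append((hits, domain, primary, fallback))
    scored.sort(key=lambda row: row[0], reverse=True)  # stable sort
    return [
        {"domain": d, "primary": p, "fallback": f, "hits": h}
        for h, d, p, f in scored
    ]
```

Whole-word matching avoids false positives such as "oil" inside "turmoil"; a figure description that mentions several domains returns several routes, which fits figures that merge sources.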
| Access pattern | When to use | Code template |
|---|---|---|
| Python API | WDI, OWID, UNICEF, UNHCR | import wbgapi as wb; df = wb.data.DataFrame(indicator, countries, time) |
| R API | WDI, V-Dem | WDI(country, indicator, start, end) or vdemdata::vdem |
| Direct file | Pink Sheet, IMF WEO, PWT | pd.read_excel(url) or pd.read_stata(url) |
| Bulk ZIP | UCDP, FAO, Natural Earth | Download ZIP → extract → pd.read_csv() |
| Full dataset | GMD | from global_macro_data import gmd; df = gmd() (cache after first call) |
R vs Python: Use R when the figure will be rendered with ggplot2. Use Python when the figure uses matplotlib. The data-sourcer outputs download scripts in whichever language the data-visualizer will use.
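The "Bulk ZIP" access pattern (download ZIP, extract, read CSV) could look roughly like the following, using only the standard library. The helper names `fetch_bytes` and `read_first_csv` are hypothetical; in a real download script `pd.read_csv()` would likely replace the `csv` module, and the choice to read the first CSV member is an assumption.

```python
import csv
import io
import zipfile
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 60.0) -> bytes:
    """Download a URL into memory (fine for archives of moderate size)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def read_first_csv(zip_bytes: bytes) -> list[dict]:
    """Read the first .csv member of a ZIP archive into a list of row dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = [n for n in zf.namelist() if n.lower().endswith(".csv")]
        if not names:
            raise ValueError("ZIP archive contains no CSV file")
        with zf.open(names[0]) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Usage would be `rows = read_first_csv(fetch_bytes(url))`, with `url` pointing at one of the bulk archives listed in the table.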
The data-sourcer produces L{XX}_data_sources.md:
# Data Sources — L{XX}
## Figure-by-Figure Routing
| Fig# | Role | Topic keywords | Tier | Source | Access pattern | Script |
|------|------|---------------|------|--------|---------------|--------|
| fig1 | Motivator | commodity prices, coffee, oil | Tier 1 | WB Pink Sheet | Direct Excel | `fig1_download.py` |
| fig2 | Evidence | MSS coefficients, IV results | Tier 2 | MSS (2004) Table 3 | Manual entry | hardcoded in `fig2_plot.py` |
| fig5 | Evidence | DV coefficients | Tier 2 | DV (2013) Table 4 | Manual entry | hardcoded in `fig5_plot.py` |
| fig7 | Context | food prices, Arab Spring | Tier 1 | FAO bulk ZIP | Bulk download | `fig7_download.py` |
## Tier 1 Attempts Log
| Figure | Source tried | Result | Fallback |
|--------|-------------|--------|----------|
| fig1 | WB Pink Sheet Excel | SUCCESS | — |
| fig7 | FAO API | TIMEOUT | FAO bulk ZIP: SUCCESS |
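One way to produce the entries of the Tier 1 attempts log is to try sources in order and record each outcome. This is a sketch under assumptions: `try_sources` is a hypothetical helper, each loader is any zero-argument callable, and the result labels other than SUCCESS and TIMEOUT are illustrative.

```python
from typing import Any, Callable


def try_sources(
    figure: str,
    sources: list[tuple[str, Callable[[], Any]]],
) -> tuple[Any, list[tuple[str, str, str]]]:
    """Try loaders in order; return (data, log) where log holds
    (figure, source name, result) for every attempt made."""
    log = []
    for name, loader in sources:
        try:
            data = loader()
        except TimeoutError:
            log.append((figure, name, "TIMEOUT"))
        except Exception as exc:  # any other failure also triggers the fallback
            log.append((figure, name, f"FAILED ({type(exc).__name__})"))
        else:
            log.append((figure, name, "SUCCESS"))
            return data, log
    return None, log  # every source failed
```

The returned log preserves attempt order, so the first entry fills the "Source tried" column and later entries fill "Fallback".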
The orchestrator reads the syllabus (or course structure) during Phase A.1 and extracts:
The data-sourcer does NOT need to read the syllabus directly. The case file is its interface — all routing information is already extracted by the orchestrator. This means the same data-sourcer works for any course: change the syllabus, the orchestrator produces a different case file, and the data-sourcer routes accordingly.
Read a .qmd lecture file and identify where empirical data could strengthen the content.
For each identified data need, match against the Known Data Sources table:
Matching rules:
For each match, classify as:
Chart/Figure opportunity: Data that could become a visual on a slide
Exercise data: Data students could work with
Updated statistics: Fresher numbers for claims in slides
Trend illustration: Long-run patterns supporting lecture narrative
# Data Source Scan: [Lecture Filename]
**Date:** [YYYY-MM-DD]
## Slide-by-Slide Data Opportunities
### Slide: "[Title]" (slide N)
**Current content:** [Brief description]
**Data opportunity:** [What data could add]
**Recommended source:** [Source name + specific indicator]
**Type:** Chart / Exercise / Updated Stat / Trend
**Effort:** Quick (API call) / Medium (download + clean) / Large (merge multiple sources)
[Repeat for each opportunity]
## Summary
- Total data opportunities: N
- Quick wins (API-downloadable): N
- Recommended priority: [Top 3 to implement first]
## Download Commands
[Pre-written commands for top recommendations]
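The Summary block of the scan report can be computed from the per-slide entries. The sketch below is illustrative: `Opportunity` and `summarize` are hypothetical names, and ranking the priority list by effort (quickest first) is an assumption, since the template does not say how the top three are chosen.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    slide: str
    source: str
    kind: str    # "Chart", "Exercise", "Updated Stat", or "Trend"
    effort: str  # "Quick", "Medium", or "Large"


EFFORT_ORDER = {"Quick": 0, "Medium": 1, "Large": 2}


def summarize(opps: list[Opportunity], top_n: int = 3) -> dict:
    """Count opportunities and pick the lowest-effort ones as priorities."""
    ranked = sorted(opps, key=lambda o: EFFORT_ORDER[o.effort])  # stable sort
    return {
        "total": len(opps),
        "quick_wins": sum(1 for o in opps if o.effort == "Quick"),
        "priority": [o.slide for o in ranked[:top_n]],
    }
```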
Search for a specific dataset, indicator, or data source.
# Data Search: [Query]
## Best Match
- **Source:** [Name]
- **URL:** [Link]
- **Coverage:** [Countries × Years]
- **Variables:** [Key indicators available]
- **Format:** [CSV/XLSX/API]
- **Citation:** [How to cite this data]
- **Download:** [Command or URL]
## Alternatives
[Other sources that partially match]
Download specific datasets and organize them in the course folder.
Based on the source, use the appropriate download method:
World Bank WDI (via wbgapi or direct URL):
# pip install wbgapi
import wbgapi as wb
# GDP per capita for conflict-affected countries
data = wb.data.DataFrame('NY.GDP.PCAP.CD', economy=['SOM', 'SSD', 'COD', 'AFG', 'YEM'], time=range(2000, 2025))
data.to_csv('data/wdi_gdppc_conflict.csv')
World Bank (direct CSV):
# Download specific indicator via API
curl -o data/indicator.csv "https://api.worldbank.org/v2/country/all/indicator/NY.GDP.PCAP.CD?format=csv&date=2000:2024"
UCDP (REST API):
# Battle-related deaths
curl -o data/ucdp_brd.json "https://ucdpapi.pcr.uu.se/api/battledeaths/24.1?pagesize=1000"
UCDP (direct download):
curl -o data/ged241-csv.zip "https://ucdp.uu.se/downloads/ged/ged241-csv.zip"
unzip data/ged241-csv.zip -d data/ucdp/
V-Dem:
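# Full dataset is large (~500MB). The URL below is the dataset archive listing page,
# not a direct file link: replace it with the versioned CSV link before running.
curl -L -o data/vdem.csv "https://v-dem.net/data/dataset-archive/"
# Or load selected indicators in R via the vdemdata package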
World Bank Commodity Prices:
curl -o data/commodity_prices.xlsx "https://thedocs.worldbank.org/en/doc/5d903e848db1d1b83e0ec8f744e55570-0350012021/related/CMO-Historical-Data-Monthly.xlsx"
Target locations: `Data/raw/`, `Data/processed/`, `Data/README.md`
Data/
├── raw/ # Original downloads
│ ├── wdi_gdppc_2024.csv
│ ├── ucdp_ged_24.1.csv
│ └── vdem_v14.csv
├── processed/ # Cleaned, subsetted
│ ├── L01_conflict_trends.csv
│ ├── L02_resources_conflict.csv
│ └── L03_shocks_panel.csv
├── scripts/ # Download/cleaning scripts
│ ├── download_wdi.py
│ ├── download_ucdp.py
│ └── clean_merge.py
└── README.md # Data dictionary
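The folder layout above can be created idempotently with `pathlib`. This is a small sketch, not the skill's actual setup step: `create_data_skeleton` is a hypothetical helper, and the initial README content is an assumption.

```python
from pathlib import Path


def create_data_skeleton(root: str) -> Path:
    """Create raw/, processed/, scripts/ and a README.md under root.

    Safe to call repeatedly: existing folders and README are left untouched.
    """
    base = Path(root)
    for sub in ("raw", "processed", "scripts"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    readme = base / "README.md"
    if not readme.exists():
        readme.write_text("# Course Data Dictionary\n", encoding="utf-8")
    return base
```

Calling `create_data_skeleton("Data")` from the course folder gives the download and cleaning scripts a predictable place to write.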
# Course Data Dictionary
## L01: Foundations — Conflict Trends
**File:** `processed/L01_conflict_trends.csv`
**Sources:** UCDP/PRIO ACD v24.1, World Bank WDI
**Variables:**
- `country`: Country name
- `year`: Year (1946–2024)
- `conflict_active`: Binary indicator for active conflict
- `battle_deaths`: Best estimate of battle-related deaths
- `gdp_pc`: GDP per capita (constant 2015 USD)
**Coverage:** 195 countries, 1946–2024
**Citation:** Gleditsch et al. (2002); Sundberg & Melander (2013)
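The Variables and Coverage fields of a dictionary entry can be derived from the processed file rather than typed by hand. The sketch below assumes a country-year panel with `country` and `year` columns, as in the example entry; `describe_panel` is a hypothetical helper and uses the standard `csv` module in place of pandas.

```python
import csv
import io


def describe_panel(csv_text: str) -> dict:
    """Summarize a country-year CSV: column names, country count, year range."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    years = [int(r["year"]) for r in rows]
    countries = {r["country"] for r in rows}
    return {
        "variables": list(reader.fieldnames or []),
        "n_countries": len(countries),
        "year_min": min(years) if years else None,
        "year_max": max(years) if years else None,
    }
```

For a file on disk, pass `Path(path).read_text(encoding="utf-8")` as `csv_text`.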
Pre-computed recommendations for common lecture topics in the conflict/geoeconomics course:
Add \source{Dataset (Year)} to slides using downloaded data
# Scan a lecture for data opportunities
/data-sourcer scan lecture_01.qmd
# Search for specific data
/data-sourcer search "rainfall anomalies Sub-Saharan Africa"
# Download a specific dataset
/data-sourcer download "World Bank WDI GDP per capita conflict countries 2000-2024"