Skill File

Data Sourcer

Name: Data Sourcer
Author: KaliGi

Find, download, and match empirical datasets to lecture content. Knows the major economics and conflict data sources (World Bank, UCDP, ACLED, V-Dem, Penn World Table, etc.). Automatically scans slide content to suggest data that fits charts, exercises, and examples. Use when the user says 'find data', 'download dataset', 'what data fits', 'add empirical evidence', or 'update statistics'.

KaliGi · 0 stars · Apr 7, 2026

Occupation
Categories: Data Analysis

Skill Content

Data Sourcer — Empirical Data Pipeline for Lectures

Automatically find, assess, and organize empirical datasets that match lecture content. Designed for economics of conflict and geoeconomics courses, but extensible to any applied micro topic.

Mode Selection

- "find data for", "what datasets match", "scan slides for data needs" → scan mode
- "search for [specific dataset/indicator]", "World Bank data on X" → search mode
- "download [dataset]", "get the data", "pull indicators" → download mode
- If the user provides a .qmd file → default to scan mode
- If the user names a specific source or indicator → default to search mode
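
The routing rules above can be sketched as a small function. This is a minimal illustration, not the skill's actual implementation: the name `select_mode`, the substring matching, and the precedence (explicit trigger phrases first, then the `.qmd` default) are assumptions. The last rule, detecting a named source or indicator, is left out because it would need the source catalog.

```python
from typing import Optional

# Trigger phrases taken from the rules above; matching is a simple
# case-insensitive substring test (an assumption, not the skill's real parser).
TRIGGERS = {
    "scan": ["find data for", "what datasets match", "scan slides for data needs"],
    "search": ["search for", "world bank data on"],
    "download": ["download", "get the data", "pull indicators"],
}


def select_mode(request: str, attached_file: Optional[str] = None) -> Optional[str]:
    """Return 'scan', 'search', or 'download', or None if nothing matches."""
    text = request.lower()
    for mode, phrases in TRIGGERS.items():
        if any(phrase in text for phrase in phrases):
            return mode
    # Default: a .qmd lecture file with no explicit trigger means scan mode.
    if attached_file and attached_file.lower().endswith(".qmd"):
        return "scan"
    return None
```

For example, `select_mode("World Bank data on poverty")` resolves to search mode, while a request with no trigger phrase but an attached `lecture_01.qmd` falls through to scan mode.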

═══════════════════════════════════════════════════

KNOWN DATA SOURCES

═══════════════════════════════════════════════════

Conflict & Security

| Source | URL | Key Variables | API | Format |
|--------|-----|---------------|-----|--------|
| UCDP/PRIO Armed Conflict Dataset | ucdp.uu.se | Conflict events, battle deaths, actors, geography | REST API | CSV, XLSX |
| ACLED (Armed Conflict Location & Event Data) | acleddata.com | Geo-coded events, protests, violence against civilians | REST API (key required) | CSV, JSON |
| GTD (Global Terrorism Database) | start.umd.edu/gtd | Terrorist attacks, casualties, targets, weapons | Download | CSV |
| UCDP GED (Georeferenced Event Dataset) | ucdp.uu.se/downloads | Geo-coded conflict events with coordinates | REST API | CSV |
| Correlates of War | correlatesofwar.org | Interstate wars, MIDs, alliances, trade, capabilities | Download | CSV |
| PRIO GRID | grid.prio.org | Cell-level conflict, climate, population, economic data | Download | CSV, NetCDF |
| ICB (International Crisis Behavior) | sites.duke.edu/icbdata | International crises, triggers, outcomes | Download | CSV |

Governance & Institutions

| Source | URL | Key Variables | API | Format |
|--------|-----|---------------|-----|--------|
| V-Dem (Varieties of Democracy) | v-dem.net | Democracy indices, institutional features, 400+ indicators | Download / R package `vdemdata` | CSV, RDS |
| Polity5 | systemicpeace.org | Regime type, political competition, executive constraints | Download | XLSX |
| World Governance Indicators | info.worldbank.org/governance/wgi | Rule of law, corruption, government effectiveness | WB API | CSV |
| ICRG (International Country Risk Guide) | prsgroup.com | Political risk, economic risk, composite risk | Subscription | XLSX |
| Fragile States Index | fragilestatesindex.org | State fragility, 12 dimensions | Download | CSV |

Economic

| Source | URL | Key Variables | API | Format |
|--------|-----|---------------|-----|--------|
| World Bank WDI | data.worldbank.org | GDP, trade, poverty, education, health (1400+ indicators) | `wbgapi` Python / REST | CSV, JSON |
| Penn World Table | rug.nl/ggdc/productivity/pwt | Real GDP, capital stock, TFP, exchange rates | Download | XLSX |
| IMF WEO | imf.org/en/Publications/WEO | GDP forecasts, inflation, fiscal balances | REST API | CSV |
| UN Comtrade | comtrade.un.org | Bilateral trade flows by commodity | REST API | CSV, JSON |
| CEPII (BACI, Gravity) | cepii.fr | Bilateral trade, gravity variables, GeoDist | Download | CSV |
| Maddison Project | rug.nl/ggdc/historicaldevelopment/maddison | Historical GDP per capita (1 AD–present) | Download | XLSX |

Resources & Commodities

| Source | URL | Key Variables | API | Format |
|--------|-----|---------------|-----|--------|
| World Bank Commodity Prices | worldbank.org/commodities | Monthly/annual prices for 70+ commodities | WB API | CSV |
| USGS Minerals Data | usgs.gov/centers/national-minerals-information-center | Mineral production, reserves, trade | Download | XLSX |
| BP Statistical Review (now Energy Institute) | energyinst.org | Energy production, consumption, CO2, renewables | Download | XLSX |
| EITI (Extractive Industries Transparency) | eiti.org | Revenue from extractive industries by country | Download | PDF, CSV |

Climate & Environment

| Source | URL | Key Variables | API | Format |
|--------|-----|---------------|-----|--------|
| CRU TS (Climatic Research Unit) | crudata.uea.ac.uk | Temperature, precipitation, monthly gridded | Download | NetCDF |
| ERA5 (ECMWF Reanalysis) | cds.climate.copernicus.eu | Temperature, wind, precipitation, global gridded | API | NetCDF, GRIB |
| EM-DAT (Emergency Events Database) | emdat.be | Natural disasters, deaths, damage, affected population | Download | XLSX |
| SPEI Global Drought Monitor | spei.csic.es | Drought index, monthly global | Download | NetCDF |

Displacement & Migration

| Source | URL | Key Variables | API | Format |
|--------|-----|---------------|-----|--------|
| UNHCR Refugee Statistics | unhcr.org/refugee-statistics | Refugees, asylum seekers, IDPs, stateless persons | REST API | CSV, JSON |
| IDMC (Internal Displacement Monitoring) | internal-displacement.org | Internal displacement by country, cause, year | REST API | CSV |
| IOM DTM (Displacement Tracking Matrix) | dtm.iom.int | Flow monitoring, displacement tracking | Download | XLSX |

Sanctions & Geopolitics

| Source | URL | Key Variables | API | Format |
|--------|-----|---------------|-----|--------|
| GSDB (Global Sanctions Database) | globalsanctionsdb.com | Sanctions episodes, type, sender, target | Download | XLSX |
| UN Security Council Sanctions | un.org/securitycouncil/sanctions | Active sanctions regimes, listed entities | Download | XML |
| OFAC SDN List | ofac.treasury.gov | Sanctioned individuals and entities (US) | REST API | CSV, XML |
| Correlates of War Trade | correlatesofwar.org | Bilateral trade 1870–present | Download | CSV |

Step 1: Figure Role → Data Tier (automatic)

| Figure role | Expected tier | Reasoning |
|-------------|---------------|-----------|
| Motivator (trends, magnitudes, opening hooks) | Tier 1 (API download) | Time-series and aggregate data are usually available from APIs |
| Context (current events, background) | Tier 1 or Tier 2 | Try the API first; fall back to published sources if the figure is too specific for an API |
| Evidence (paper regression results, coefficients) | Tier 2 (published) | Paper-specific coefficients; no API exists |
| Mechanism (DAG, flow chart) | Conceptual (TikZ) | No data needed; skip |
| Identification (first stage, IV diagnostics) | Tier 2 (published) | Paper-specific methodology outputs |
| Synthesis (comparison table) | N/A | Text/tabular; no external data |
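
The role-to-tier mapping above amounts to a lookup table. A minimal sketch, assuming lowercase role names as keys and short tier labels (`tier1`, `tier2`, and so on) that are illustrative rather than part of the skill:

```python
# Figure role -> expected data tier, transcribed from the table above.
# The tier labels are hypothetical shorthand, not names used by the skill.
ROLE_TO_TIER = {
    "motivator": "tier1",
    "context": "tier1_or_tier2",
    "evidence": "tier2",
    "mechanism": "conceptual",
    "identification": "tier2",
    "synthesis": "none",
}


def needs_download(role: str) -> bool:
    """True when the role may be served by a Tier 1 API download."""
    return ROLE_TO_TIER[role.lower()] in {"tier1", "tier1_or_tier2"}
```

Only Motivator and Context figures trigger a download attempt; the other roles are filled from published tables or need no data at all.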

Step 2: Topic Keywords → Data Domain → Source (automatic)

| Keywords in case file | Data domain | Primary source (tested, fastest) | Fallback |
|-----------------------|-------------|----------------------------------|----------|
| conflict, civil war, battle deaths, violence | Conflict events | OWID catalog (`war` topic) | UCDP CSV bulk download |
| GDP, growth, income, poverty, development | Development | WB WDI (`wbgapi`) | Global Macro Database |
| commodities, oil, coffee, prices, pink sheet | Commodity prices | WB Pink Sheet (Excel URL) | OWID catalog |
| food prices, agriculture, FAO | Food/agriculture | FAO bulk ZIP | WB Pink Sheet |
| refugees, displacement, IDPs, asylum | Displacement | UNHCR API (REST, no key) | OWID catalog |
| child mortality, nutrition, education, WASH | Human development | UNICEF (`unicefdata`) | OWID catalog |
| democracy, institutions, governance, polity | Political institutions | V-Dem (R: `vdemdata`) | WB WGI via `wbgapi` |
| trade, exports, imports, tariffs, sanctions | Trade flows | WB WDI (`wbgapi`) | UN Comtrade |
| climate, temperature, rainfall, drought | Climate | CRU TS (NetCDF) | OWID catalog |
| GDP historical, macro panel, inflation | Macro panel | Global Macro Database (`global_macro_data`) | Penn World Table |
| maps, geography, spatial, borders | Geospatial | Natural Earth (shapefile) + `sf` in R | — |
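
One way to read the keyword table is as a scoring rule: pick the row with the most keyword hits in the slide text. The sketch below is an assumption about how such routing could work, not the skill's documented algorithm. `ROUTES` holds only four of the rows above, and `route` is a hypothetical helper name.

```python
# (keywords, domain, primary source, fallback) rows, abridged from the table above.
ROUTES = [
    ({"conflict", "civil war", "battle deaths", "violence"},
     "Conflict events", "OWID catalog", "UCDP CSV bulk download"),
    ({"gdp", "growth", "income", "poverty", "development"},
     "Development", "WB WDI (wbgapi)", "Global Macro Database"),
    ({"refugees", "displacement", "idps", "asylum"},
     "Displacement", "UNHCR API", "OWID catalog"),
    ({"climate", "temperature", "rainfall", "drought"},
     "Climate", "CRU TS (NetCDF)", "OWID catalog"),
]


def route(text):
    """Return (domain, primary, fallback) for the row with the most
    distinct keyword matches in text, or None when nothing matches."""
    lowered = text.lower()
    best, best_hits = None, 0
    for keywords, domain, primary, fallback in ROUTES:
        hits = sum(1 for kw in keywords if kw in lowered)
        if hits > best_hits:  # ties keep the earlier row
            best, best_hits = (domain, primary, fallback), hits
    return best
```

Substring matching is crude (a short keyword can match inside an unrelated word), so a real implementation would likely tokenize the text first.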

Step 3: Access Pattern Selection

| Access pattern | When to use | Code template |
|----------------|-------------|---------------|
| Python API | WDI, OWID, UNICEF, UNHCR | `import wbgapi as wb; df = wb.data.DataFrame(indicator, countries, time)` |
| R API | WDI, V-Dem | `WDI(country, indicator, start, end)` or `vdemdata::vdem` |
| Direct file | Pink Sheet, IMF WEO, PWT | `pd.read_excel(url)` or `pd.read_stata(url)` |
| Bulk ZIP | UCDP, FAO, Natural Earth | Download ZIP → extract → `pd.read_csv()` |
| Full dataset | GMD | `from global_macro_data import gmd; df = gmd()` (cache after first call) |

# Data Sources — L{XX}

## Figure-by-Figure Routing

| Fig# | Role | Topic keywords | Tier | Source | Access pattern | Script |
|------|------|---------------|------|--------|---------------|--------|
| fig1 | Motivator | commodity prices, coffee, oil | Tier 1 | WB Pink Sheet | Direct Excel | `fig1_download.py` |
| fig2 | Evidence | MSS coefficients, IV results | Tier 2 | MSS (2004) Table 3 | Manual entry | hardcoded in `fig2_plot.py` |
| fig5 | Evidence | DV coefficients | Tier 2 | DV (2013) Table 4 | Manual entry | hardcoded in `fig5_plot.py` |
| fig7 | Context | food prices, Arab Spring | Tier 1 | FAO bulk ZIP | Bulk download | `fig7_download.py` |

## Tier 1 Attempts Log
| Figure | Source tried | Result | Fallback |
|--------|-------------|--------|----------|
| fig1 | WB Pink Sheet Excel | SUCCESS | — |
| fig7 | FAO API | TIMEOUT | FAO bulk ZIP: SUCCESS |
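
The attempts log above implies a try-primary-then-fallback loop. A minimal sketch of that pattern follows; `fetch_with_fallback` and the loader callables are hypothetical names, and recording the exception class name as the result is an assumption about what the log should contain.

```python
def fetch_with_fallback(figure, attempts):
    """Try each (source_name, loader) pair in order.

    Returns (data, log_rows), where each log row is
    (figure, source_name, result) and result is either "SUCCESS"
    or the name of the exception class that was raised.
    """
    log = []
    for source_name, loader in attempts:
        try:
            data = loader()
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            log.append((figure, source_name, type(exc).__name__))
            continue
        log.append((figure, source_name, "SUCCESS"))
        return data, log
    return None, log  # every source failed
```

Each log row maps onto one line of the attempts table, so the table can be regenerated from the returned list instead of being maintained by hand.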

# Data Source Scan: [Lecture Filename]
**Date:** [YYYY-MM-DD]

## Slide-by-Slide Data Opportunities

### Slide: "[Title]" (slide N)
**Current content:** [Brief description]
**Data opportunity:** [What data could add]
**Recommended source:** [Source name + specific indicator]
**Type:** Chart / Exercise / Updated Stat / Trend
**Effort:** Quick (API call) / Medium (download + clean) / Large (merge multiple sources)

[Repeat for each opportunity]

## Summary
- Total data opportunities: N
- Quick wins (API-downloadable): N
- Recommended priority: [Top 3 to implement first]

## Download Commands
[Pre-written commands for top recommendations]
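
The Summary block of the scan report can be generated from the per-slide entries. The sketch below is illustrative only: the `Opportunity` fields and the rule "rank by effort, then slide order" are assumptions, since the template does not say how the top three are chosen.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    slide: int
    title: str
    source: str
    effort: str  # "Quick", "Medium", or "Large", as in the template


# Hypothetical ranking: cheapest effort first.
EFFORT_ORDER = {"Quick": 0, "Medium": 1, "Large": 2}


def summarize(opps, top_n=3):
    """Return the Markdown lines of the report's Summary section."""
    ranked = sorted(opps, key=lambda o: (EFFORT_ORDER[o.effort], o.slide))
    quick = sum(1 for o in opps if o.effort == "Quick")
    priority = "; ".join(f"slide {o.slide} ({o.source})" for o in ranked[:top_n])
    return [
        "## Summary",
        f"- Total data opportunities: {len(opps)}",
        f"- Quick wins (API-downloadable): {quick}",
        f"- Recommended priority: {priority}",
    ]
```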

# Data Search: [Query]

## Best Match
- **Source:** [Name]
- **URL:** [Link]
- **Coverage:** [Countries × Years]
- **Variables:** [Key indicators available]
- **Format:** [CSV/XLSX/API]
- **Citation:** [How to cite this data]
- **Download:** [Command or URL]

## Alternatives
[Other sources that partially match]

# pip install wbgapi
import os
import wbgapi as wb

# GDP per capita (current US$) for five conflict-affected countries, 2000-2024
os.makedirs('data', exist_ok=True)  # to_csv does not create missing directories
data = wb.data.DataFrame('NY.GDP.PCAP.CD', economy=['SOM', 'SSD', 'COD', 'AFG', 'YEM'], time=range(2000, 2025))
data.to_csv('data/wdi_gdppc_conflict.csv')

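# Download a specific indicator via the World Bank API (the v2 endpoint returns JSON or XML; per_page raises the default page size)
curl -o data/indicator.json "https://api.worldbank.org/v2/country/all/indicator/NY.GDP.PCAP.CD?format=json&date=2000:2024&per_page=20000"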
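
When the World Bank v2 endpoint is queried with `format=json`, the response is a two-element array: paging metadata first, then the list of observations. A minimal sketch of flattening it, assuming annual data (so `date` is a plain year). `parse_wb_json` is a hypothetical helper, and the sample payload is hand-written with made-up values.

```python
import json


def parse_wb_json(payload):
    """Flatten a World Bank v2 JSON response into (iso3, year, value) rows."""
    meta, records = json.loads(payload)
    rows = [
        (rec["countryiso3code"], int(rec["date"]), rec["value"])
        for rec in records
        if rec["value"] is not None  # missing observations come back as null
    ]
    return meta, rows


# Tiny hand-written payload in the shape the API returns (values are made up).
SAMPLE = json.dumps([
    {"page": 1, "pages": 1, "per_page": 50, "total": 2},
    [
        {"countryiso3code": "SOM", "date": "2020", "value": 100.0},
        {"countryiso3code": "SOM", "date": "2019", "value": None},
    ],
])

meta, rows = parse_wb_json(SAMPLE)
print(meta["total"], rows)
```

If `meta["pages"]` is greater than 1, the remaining pages have to be requested separately with the `page` parameter.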

# Battle-related deaths (UCDP API, version 24.1)
curl -o data/ucdp_brd.json "https://ucdpapi.pcr.uu.se/api/battledeaths/24.1?pagesize=1000"

# UCDP GED bulk download: save the archive under its real name, then extract it
curl -L -o data/ged241-csv.zip "https://ucdp.uu.se/downloads/ged/ged241-csv.zip"
unzip data/ged241-csv.zip -d data/ucdp/
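
The extract step can also be done from Python, which keeps the whole download script in one language. A minimal sketch using only the standard library; `extract_csvs` is a hypothetical helper, and skipping non-CSV members is a design choice, not something the source prescribes.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, out_dir):
    """Extract only the .csv members of zip_path into out_dir; return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # ZipFile.extract returns the path of the file it created
                extracted.append(Path(archive.extract(name, out)))
    return extracted
```

Usage would look like `extract_csvs("data/ged241-csv.zip", "data/ucdp")`, mirroring the `unzip` command above.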

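# V-Dem full dataset (large, ~500MB)
# The archive URL is a landing page, not a direct file link, so curl would save HTML.
# Choose a version there and download it manually:
#   https://v-dem.net/data/dataset-archive/
# Or load the data in R via the `vdemdata` package (vdemdata::vdem)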

# World Bank commodity prices (Pink Sheet), monthly historical data
curl -L -o data/commodity_prices.xlsx "https://thedocs.worldbank.org/en/doc/5d903e848db1d1b83e0ec8f744e55570-0350012021/CMO-Historical-Data-Monthly.xlsx"

data/
├── raw/                          # Original downloads
│   ├── wdi_gdppc_2024.csv
│   ├── ucdp_ged_24.1.csv
│   └── vdem_v14.csv
├── processed/                    # Cleaned, subsetted
│   ├── L01_conflict_trends.csv
│   ├── L02_resources_conflict.csv
│   └── L03_shocks_panel.csv
├── scripts/                      # Download/cleaning scripts
│   ├── download_wdi.py
│   ├── download_ucdp.py
│   └── clean_merge.py
└── README.md                     # Data dictionary
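
The layout above can be created idempotently before any download runs. A minimal sketch with `pathlib`; `init_data_dir` is a hypothetical helper, and the stub README text is an assumption.

```python
from pathlib import Path

SUBDIRS = ("raw", "processed", "scripts")


def init_data_dir(root):
    """Create root/raw, root/processed, root/scripts and a stub README.md."""
    base = Path(root)
    for sub in SUBDIRS:
        (base / sub).mkdir(parents=True, exist_ok=True)
    readme = base / "README.md"
    if not readme.exists():  # never overwrite an existing data dictionary
        readme.write_text("# Course Data Dictionary\n", encoding="utf-8")
    return base
```

Calling it a second time is harmless: existing folders are left alone and an existing README is not touched.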

# Course Data Dictionary

## L01: Foundations — Conflict Trends
**File:** `processed/L01_conflict_trends.csv`
**Sources:** UCDP/PRIO ACD v24.1, World Bank WDI
**Variables:**
- `country`: Country name
- `year`: Year (1946–2024)
- `conflict_active`: Binary indicator for active conflict
- `battle_deaths`: Best estimate of battle-related deaths
- `gdp_pc`: GDP per capita (constant 2015 USD)
**Coverage:** 195 countries, 1946–2024
**Citation:** Gleditsch et al. (2002); Sundberg & Melander (2013)
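
Part of a dictionary entry like the one above can be derived from the processed file itself. A minimal sketch using the standard `csv` module; `describe_csv` is a hypothetical helper, and it assumes the file has `country` and `year` columns, as in the example entry. Sources and citations still have to be written by hand.

```python
import csv


def describe_csv(path, descriptions):
    """Return Markdown lines for the File, Variables, and Coverage fields."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = reader.fieldnames or []
    lines = [f"**File:** `{path}`", "**Variables:**"]
    for col in columns:
        # Undescribed columns are flagged rather than silently skipped.
        lines.append(f"- `{col}`: {descriptions.get(col, 'TODO: describe')}")
    if rows:
        years = sorted({int(r["year"]) for r in rows})
        countries = {r["country"] for r in rows}
        lines.append(f"**Coverage:** {len(countries)} countries, {years[0]}–{years[-1]}")
    return lines
```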

# Scan a lecture for data opportunities
/data-sourcer scan lecture_01.qmd

# Search for specific data
/data-sourcer search "rainfall anomalies Sub-Saharan Africa"

# Download a specific dataset
/data-sourcer download "World Bank WDI GDP per capita conflict countries 2000-2024"

Data Sourcer

Data Sourcer — Empirical Data Pipeline for Lectures

Mode Selection

═══════════════════════════════════════════════════

KNOWN DATA SOURCES

═══════════════════════════════════════════════════

Conflict & Security

Governance & Institutions

Economic

Resources & Commodities

Climate & Environment

Displacement & Migration

Sanctions & Geopolitics

═══════════════════════════════════════════════════

PIPELINE INTEGRATION: Automatic Data Source Routing

═══════════════════════════════════════════════════

Step 1: Figure Role → Data Tier (automatic)

Step 2: Topic Keywords → Data Domain → Source (automatic)

Step 3: Access Pattern Selection

Step 4: Output Format

How This Connects to the Syllabus

═══════════════════════════════════════════════════

MODE 1: SCAN — Auto-Match Data to Slide Content

═══════════════════════════════════════════════════

Steps

S1: Parse Lecture Content

S2: Match Against Known Sources

S3: Classify Data Opportunities

S4: Generate Data Report

═══════════════════════════════════════════════════

MODE 2: SEARCH — Find Specific Data

═══════════════════════════════════════════════════

Steps

Output Format

═══════════════════════════════════════════════════

MODE 3: DOWNLOAD — Retrieve Data

═══════════════════════════════════════════════════

Steps

D1: Identify Source and Method

D2: Clean and Organize

D3: Folder Structure

D4: Generate Data Dictionary

═══════════════════════════════════════════════════

TOPIC-SPECIFIC RECOMMENDATIONS

═══════════════════════════════════════════════════

L01: Foundations of Conflict Economics

L02: Greed, Grievance, and State Capacity

L03: Economic Shocks and Conflict

L04–L06 (future): Resources, Displacement, Historical Conflict

L07–L10 (future): Geoeconomics

Important Principles

Execution Command
