Use when building or rerunning bibliometric data preprocessing pipelines: WoS txt parsing, institution-country extraction, author disambiguation, schema validation, and UT-key dataset preparation. Portable — no hardcoded paths. Keywords: preprocess, clean, parse WoS, data pipeline, UT linkage, country extraction, author extraction, 数据处理, 清洗, 解析WOS, 作者消歧, 国家机构提取.
Build a reproducible preprocessing layer that transforms raw WoS exports into canonical, UT-keyed datasets for downstream analysis. This skill is fully portable: it uses workspace-relative paths only, so it works on any machine with zero path editing.
All paths are relative to the workspace root (the folder that contains this .github/ directory). When the project is copied to a new machine, everything works as-is.
{workspace}/
├── data/
│ ├── raw/ # ← USER DROPS WoS .txt FILES HERE (the ONLY manual input)
│ ├── processed/ # ← All processing outputs land here
│ └── external/ # Optional external lookups
├── src/ # Processing and analysis scripts
├── reports/ # Downstream report artifacts (populated by analysis skill)
├── models/ # Optional model artifacts (e.g. BERTopic)
├── figures/ # Optional standalone figure outputs
├── requirements.txt # Python dependencies
└── .github/skills/ # Skill definitions (this file)
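The workspace-relative convention above can be sketched as a small path-resolution header shared by scripts in `src/` (the `RAW_DIR`/`PROCESSED_DIR` variable names are illustrative, not mandated by this skill):

```python
import os
from pathlib import Path

# Workspace root: an explicit PROJECT_ROOT env var wins; otherwise go two
# levels up from this script (assumes the script lives in src/ under the root).
ROOT = Path(os.environ.get("PROJECT_ROOT", Path(__file__).resolve().parent.parent))

RAW_DIR = ROOT / "data" / "raw"              # user-supplied WoS exports
PROCESSED_DIR = ROOT / "data" / "processed"  # all pipeline outputs
```

Because nothing is hardcoded, copying the workspace to another machine requires no edits.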
Key portability rules (enforced by the agent before every script run):
- Scripts read inputs only from data/raw/ or data/processed/ and write outputs to data/processed/ or reports/.
- Scripts use Path(__file__).resolve().parent.parent to derive the workspace root, or accept paths via CLI arguments.

Pre-flight Interactive Inquiry (MANDATORY before first script run): The agent MUST ask the user:
- Whether to set the PROJECT_ROOT environment variable to override the workspace root.
- Which input is available, which determines the entry stage:
  - Raw WoS .txt files in data/raw/ → start from Stage A.
  - An existing wos_merged.csv → skip Stage A, start from Stage B.
  - An existing wos_cleaned.csv → skip Stages A+B, start from Stage C.

Every script resolves the workspace root as ROOT = Path(os.environ.get("PROJECT_ROOT", Path(__file__).resolve().parent.parent)).

Stage plan:
- Stage A parses the raw WoS .txt exports in data/raw/ into a paper-level canonical CSV at data/processed/wos_merged.csv. SKIP if the user provided a CSV/Excel file that already has UT + base columns.
- Stage B extracts country/institution information, writing data/processed/wos_cleaned.csv. SKIP if the input already has country/institution fields.
- Stage C extracts authors, writing data/processed/paper_authors.csv and data/processed/authors_summary.csv. SKIP if the input already has author fields.

| Stage | Script | Input | Output |
|---|---|---|---|
| A | src/wos_txt_to_csv.py | data/raw/*.txt | data/processed/wos_merged.csv |
| B | src/extract_country_institution.py | data/processed/wos_merged.csv | data/processed/wos_cleaned.csv |
| C | src/extract_authors.py | data/processed/wos_cleaned.csv | data/processed/paper_authors.csv, data/processed/authors_summary.csv |
Before running any script, the agent MUST verify the portability rules above and confirm that the stage's input file exists.
All outputs in data/processed/:
| Artifact | File | Stage |
|---|---|---|
| Canonical paper table | data/processed/wos_merged.csv | A |
| Enriched paper table | data/processed/wos_cleaned.csv | B |
| Paper-author pairs | data/processed/paper_authors.csv | C |
| Author summary | data/processed/authors_summary.csv | C |
Quality gates (all must pass after the pipeline runs):
- data/processed/wos_merged.csv exists and is readable.
- The UT column exists, is non-empty, and is unique at the paper level.
- data/processed/wos_cleaned.csv exists and includes the country/region collaboration fields.
- data/processed/paper_authors.csv exists and contains the UT linkage.

Analysis can start only when all quality gates pass. The paper-processing-analysis-handoff skill is the mandatory next step.
1. Create and activate a virtual environment: python -m venv .venv && .venv/Scripts/activate (Windows) or source .venv/bin/activate (Linux/Mac).
2. Install dependencies: pip install -r requirements.txt.
3. Drop input files into data/raw/ (supports .txt, .csv, .xlsx).
4. Find all outputs in data/processed/.