Use when building or rerunning bibliometric data preprocessing pipelines: WoS txt parsing, institution-country extraction, author disambiguation, schema validation, and UT-key dataset preparation. Portable — no hardcoded paths. Keywords: preprocess, clean, parse WoS, data pipeline, UT linkage, country extraction, author extraction, 数据处理, 清洗, 解析WOS, 作者消歧, 国家机构提取.
Build a reproducible preprocessing layer that transforms raw WoS exports into canonical, UT-keyed datasets for downstream analysis. This skill is fully portable: it uses workspace-relative paths only, so it works on any machine with zero path editing.
All paths are relative to the workspace root (the folder that contains this .github/ directory). When the project is copied to a new machine, everything works as-is.
{workspace}/
├── data/
│ ├── raw/ # ← USER DROPS WoS .txt FILES HERE (the ONLY manual input)
│ ├── processed/ # ← All processing outputs land here
│ └── external/ # Optional external lookups
├── src/ # Processing and analysis scripts
├── reports/ # Downstream report artifacts (populated by analysis skill)
├── models/ # Optional model artifacts (e.g. BERTopic)
├── figures/ # Optional standalone figure outputs
├── requirements.txt # Python dependencies
└── .github/skills/ # Skill definitions (this file)
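The workspace-relative convention above can be sketched as a small path-resolution header shared by scripts in `src/` (the `RAW_DIR`/`PROCESSED_DIR` variable names are illustrative, not mandated by this skill):

```python
import os
from pathlib import Path

# Workspace root: an explicit PROJECT_ROOT env var wins; otherwise go two
# levels up from this script (assumes the script lives in src/ under the root).
ROOT = Path(os.environ.get("PROJECT_ROOT", Path(__file__).resolve().parent.parent))

RAW_DIR = ROOT / "data" / "raw"              # user-supplied WoS exports
PROCESSED_DIR = ROOT / "data" / "processed"  # all pipeline outputs
```

Because nothing is hardcoded, copying the workspace to another machine requires no edits.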
Key portability rules (enforced by the agent before every script run):
- Scripts read inputs only from data/raw/ or data/processed/ and write outputs to data/processed/ or reports/.
- Scripts use Path(__file__).resolve().parent.parent to derive the workspace root, or accept paths via CLI arguments.

Pre-flight Interactive Inquiry (MANDATORY before first script run): The agent MUST ask the user:
- Whether to set the PROJECT_ROOT environment variable to override the workspace root.
- Which input is available, which determines the entry stage:
  - Raw WoS .txt files in data/raw/ → start from Stage A.
  - An existing wos_merged.csv → skip Stage A, start from Stage B.
  - An existing wos_cleaned.csv → skip Stages A+B, start from Stage C.

Every script resolves the workspace root as ROOT = Path(os.environ.get("PROJECT_ROOT", Path(__file__).resolve().parent.parent)).

Stage plan:
- Stage A parses the raw WoS .txt exports in data/raw/ into a paper-level canonical CSV at data/processed/wos_merged.csv. SKIP if the user provided a CSV/Excel file that already has UT + base columns.
- Stage B extracts country/institution information, writing data/processed/wos_cleaned.csv. SKIP if the input already has country/institution fields.
- Stage C extracts authors, writing data/processed/paper_authors.csv and data/processed/authors_summary.csv. SKIP if the input already has author fields.

| Stage | Script | Input | Output |
|---|---|---|---|
| A | src/wos_txt_to_csv.py | data/raw/*.txt | data/processed/wos_merged.csv |
| B | src/extract_country_institution.py | data/processed/wos_merged.csv | data/processed/wos_cleaned.csv |
| C | src/extract_authors.py | data/processed/wos_cleaned.csv | data/processed/paper_authors.csv, data/processed/authors_summary.csv |
Before running any script, the agent MUST verify the portability rules above and confirm that the stage's input file exists.
All outputs in data/processed/:
| Artifact | File | Stage |
|---|---|---|
| Canonical paper table | data/processed/wos_merged.csv | A |
| Enriched paper table | data/processed/wos_cleaned.csv | B |
| Paper-author pairs | data/processed/paper_authors.csv | C |
| Author summary | data/processed/authors_summary.csv | C |
Quality gates (all must pass after the pipeline runs):
- data/processed/wos_merged.csv exists and is readable.
- The UT column exists, is non-empty, and is unique at the paper level.
- data/processed/wos_cleaned.csv exists and includes the country/region collaboration fields.
- data/processed/paper_authors.csv exists and contains the UT linkage.

Analysis can start only when all quality gates pass. The paper-processing-analysis-handoff skill is the mandatory next step.
1. Create and activate a virtual environment: python -m venv .venv && .venv/Scripts/activate (Windows) or source .venv/bin/activate (Linux/Mac).
2. Install dependencies: pip install -r requirements.txt.
3. Drop input files into data/raw/ (supports .txt, .csv, .xlsx).
4. Find all outputs in data/processed/.