Python data analysis, statistics, and data science workflows. Use when the goal involves loading data, cleaning, aggregating, computing statistics, building models, or producing plots and reports.
Use for:
machine learning — scikit-learn, training and evaluation
time series analysis
generating reports or notebooks with analysis narratives
reading data from an API, database, or file and producing insights
Do NOT use for:
web applications or UIs (use webdeveloper)
generic scripting not about data
pure math without data (just use compute directly)
Planning Guidance
Data science work is usually a small number of scripts with rich internal logic, not a sprawling file tree like a webapp. The right shape is:
One or two data-loading + cleaning scripts (ingestion, normalization).
One or two analysis scripts (the actual computation).
A plotting or reporting script (figures, tables, or a markdown summary).
Optionally a shared utilities module if multiple scripts reuse helpers.
A typical data analysis task is 3–6 files, not 20. Don't over-decompose. But each file should be substantive — real loading, real cleaning, real computation, real visualization — not three-line stubs.
Planning notes:
Use compute with mode:"deep" and skill:"data_science" for any non-trivial analysis (more than one step). Use mode:"shallow" only for single self-contained computations.
Pass the data source in the goal or via context — file path, URL, database query, or API endpoint.
If the data needs to be fetched first (HTTP or database), plan a fetch step BEFORE compute so the file is on disk and the compute can focus on analysis.
If the user wants a report, include a final "generate report" task that produces either a markdown file or a rendered HTML/PDF from a notebook.
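If the plan ends in a report task, the final script can stay small. A minimal sketch of a markdown report writer (file paths and the summary string are illustrative; a notebook rendered via nbconvert is the alternative for HTML/PDF):

```python
from pathlib import Path

def write_report(summary_table: str, dest: str = "outputs/report.md") -> Path:
    # Assemble a plain markdown report from a pre-rendered summary table,
    # e.g. df.describe().to_string() from the analysis step.
    out = Path(dest)
    out.parent.mkdir(parents=True, exist_ok=True)
    body = "\n".join([
        "# Analysis Report",
        "",
        "## Summary statistics",
        "",
        summary_table,
    ])
    out.write_text(body, encoding="utf-8")
    return out
```

The report step only formats what earlier steps computed; it should not redo analysis.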
Architect Guidance
Decompose data science projects into small, focused Python scripts. Each script does one job well. The pipeline is linear: ingest → clean → analyze → visualize → report.
Typical layout:
project/
  data/
    raw/        (original, immutable inputs)
    processed/  (cleaned, ready for analysis)
    outputs/    (figures, tables, reports)
  src/
    load.py     (read raw data, handle encodings, parse dates)
    clean.py    (drop nulls, fix types, normalize, handle outliers)
    analyze.py  (the main computation — groupby, stats, model)
    plot.py     (figures — matplotlib or seaborn)
    report.py   (assemble final output — markdown or HTML)
    utils.py    (shared helpers — only if genuinely reused)
  requirements.txt
  README.md
For smaller analyses, collapse layers — a single analysis.py that does load+clean+analyze, plus a plot.py, is fine.
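A collapsed analysis.py might look like the following sketch, assuming a hypothetical CSV with date, category, and value columns (pandas-based; adapt to the real schema):

```python
import pandas as pd

def load(path: str) -> pd.DataFrame:
    # Read the raw CSV; dates and encodings are the usual pain points.
    return pd.read_csv(path, parse_dates=["date"], encoding="utf-8")

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names, drop rows missing the key measure, fix types.
    df = df.rename(columns=lambda c: c.strip().lower())
    df = df.dropna(subset=["value"])
    df["value"] = df["value"].astype(float)
    return df

def analyze(df: pd.DataFrame) -> pd.DataFrame:
    # The actual computation: per-category summary statistics.
    return df.groupby("category")["value"].agg(["mean", "std", "count"])
```

Each function maps to one pipeline stage, so splitting into separate files later is mechanical if the analysis grows.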
Scaffolding. Include pip install for the dependencies the scripts will use:
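For example (the package list is illustrative; include only what the scripts actually import, and pin versions only if the task demands reproducibility):

```shell
pip install pandas numpy matplotlib seaborn scikit-learn
```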
Interfaces. In data science, interfaces are less about API contracts and more about data schemas. Use the interfaces field to lock the column structure between pipeline stages:
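One lightweight way to express such a contract — a sketch, not a required format, with hypothetical column names — is a dtype map that each stage checks at its boundary:

```python
import pandas as pd

# Expected hand-off schema from clean.py to analyze.py (illustrative columns).
CLEANED_SCHEMA = {
    "date": "datetime64[ns]",
    "category": "object",
    "value": "float64",
}

def check_schema(df: pd.DataFrame, schema: dict = CLEANED_SCHEMA) -> None:
    # Fail fast if a stage hands off unexpected columns or dtypes,
    # so schema drift surfaces at the boundary rather than deep in analysis.
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, dtype in schema.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
```

Calling check_schema at the top of each downstream script turns a silent column rename into an immediate, named failure.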