Codex data-analysis use-case skills. Covers workflows for dataset analysis, report generation, and scoring loops. Trigger: activate when Codex is needed for data analysis, exploratory analysis, model building, or report output.
Data analysis is one of Codex's strongest application areas. With an eval-driven loop, Codex can automatically clean data, run exploratory analysis, build models, and iterate until quality targets are met.
This category contains 1 use case:
| Step | Action | Notes |
|---|---|---|
| 1 | Read AGENTS.md | Understand project constraints and conventions |
| 2 | Inventory data files | Identify formats, sizes, schemas |
| 3 | Create eval script | Scoring dimensions: clarity, accuracy, completeness (0–100) |
| 4 | Import and clean | pandas: handle missing values, fix dtypes, deduplicate |
| 5 | Exploratory analysis | Charts, correlations, distributions |
| 6 | Build models | statsmodels / scikit-learn |
| 7 | Score and iterate | Repeat steps 4–6 until score > 90% |
| 8 | Output report | Final scores, iteration log, remaining risks |
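The eval script from step 3 can be sketched as follows. The scoring heuristics below (section checks, word-count thresholds, the `outputs/` check) are illustrative assumptions, not the actual rubric; a real eval would use stronger checks or an LLM grader.

```python
# eval.py -- a minimal sketch of the step-3 scoring script.
# All heuristics here are illustrative assumptions.
import json
import sys
from pathlib import Path

def score_report(text: str) -> dict:
    """Score a report on clarity, accuracy, completeness (0-100 each)."""
    sections = ["summary", "findings", "model"]
    # completeness: share of expected sections present
    completeness = 100 * sum(s in text.lower() for s in sections) // len(sections)
    # clarity: crude proxy -- penalize very short reports
    clarity = min(100, len(text.split()) // 5)
    # accuracy: placeholder -- a real eval would verify figures against the data
    accuracy = 100 if "outputs/" in text else 50
    overall = (clarity + accuracy + completeness) / 3
    return {"clarity": clarity, "accuracy": accuracy,
            "completeness": completeness, "overall": round(overall, 1)}

if __name__ == "__main__":
    target = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("outputs/report.md")
    if target.exists():
        print(json.dumps(score_report(target.read_text()), indent=2))
```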
Role: You are a senior data analyst working inside Codex.
Context: This workspace contains one or more data files for analysis. An AGENTS.md may define project-specific rules.
Task:
1. Read AGENTS.md (if present) and follow every instruction.
2. Inventory all data files in the workspace — note formats, sizes, and schemas.
3. If no eval script exists, create `eval.py` that scores: clarity (0–100), accuracy (0–100), completeness (0–100).
4. Execute this analysis loop:
a. Import and tidy the data (handle missing values, fix dtypes, deduplicate).
b. Run exploratory analysis — produce at least 5 charts (distributions, correlations, time trends).
c. Fit an appropriate model (regression, classification, clustering) based on the question.
d. Re-run eval.py after each step; log scores to `iteration_log.md`.
e. Use `view_image` to inspect every chart before proceeding.
5. Continue iterating until overall score AND LLM average are both > 90%.
Constraints:
- Use pandas for data manipulation, matplotlib/seaborn for visualization.
- Use statsmodels or scikit-learn for modeling.
- Save all outputs to `outputs/` directory.
- Never overwrite raw data files.
Output Format:
- `outputs/report.md`: Executive summary, key findings, charts, model results.
- `outputs/iteration_log.md`: Score progression with timestamps.
- `outputs/eval_results.json`: Final scored metrics.
- Print: final scores, iteration count, remaining risks.
Self-Verification:
Before finishing, re-run eval.py one final time and confirm both scores > 90%.
If any chart is unreadable, regenerate it with higher DPI (300+).
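Step 4a of the prompt above (import and tidy) can be sketched like this; the dtype-coercion threshold and median imputation are illustrative choices, not part of the prompt:

```python
# A minimal sketch of step 4a: handle missing values, fix dtypes,
# deduplicate. The 0.9 threshold and median imputation are assumptions.
import pandas as pd

def tidy(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, coerce numeric-looking columns, impute numeric NaNs."""
    df = df.drop_duplicates()
    # fix dtypes: convert object columns that are mostly numeric
    for col in df.select_dtypes(include="object"):
        converted = pd.to_numeric(df[col], errors="coerce")
        if converted.notna().mean() > 0.9:
            df[col] = converted
    # impute numeric NaNs with the median; document this choice in the report
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df
```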
Role: You are a data engineering assistant.
Task: Set up the analysis environment for {{project_name}}.
Steps:
1. List all files in `data/` — report format, size, encoding for each.
2. Install missing Python packages: pandas, matplotlib, seaborn, scikit-learn, statsmodels.
3. Load every CSV/Parquet into DataFrames.
4. Print: shape, dtypes, first 5 rows, null percentage per column.
5. Flag any data quality issues (mixed types, >30% nulls, suspicious outliers).
Output: A `data_profile.md` summarizing all datasets with quality flags.
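The profiling steps above can be sketched as a small helper. The >30% null threshold comes from step 5; the flag wording and return shape are illustrative assumptions:

```python
# A sketch of the profiling pass (steps 4-5): per-column stats plus
# quality flags. Flag wording is an illustrative assumption.
import pandas as pd

def profile(df: pd.DataFrame, name: str = "dataset") -> dict:
    """Summarize shape, dtypes, null percentages, and quality flags."""
    null_pct = (df.isna().mean() * 100).round(1)
    flags = []
    for col in df.columns:
        if null_pct[col] > 30:
            flags.append(f"{col}: {null_pct[col]}% nulls (>30%)")
        if df[col].dtype == object:
            # mixed types: more than one Python type among non-null values
            if df[col].dropna().map(type).nunique() > 1:
                flags.append(f"{col}: mixed types")
    return {"name": name, "shape": df.shape,
            "dtypes": df.dtypes.astype(str).to_dict(),
            "null_pct": null_pct.to_dict(), "flags": flags}
```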
# Data Analysis Conventions
- Always validate column types before analysis
- Use seaborn for statistical visualizations, matplotlib for custom charts
- Log all hyperparameters and model configurations
- Save outputs to `outputs/` — never overwrite source data
- Charts: 300 DPI minimum, include title and axis labels
- Missing values: document strategy (drop/impute) with justification
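The missing-values convention can be sketched as a per-column decision with a logged justification. The 5%/40% thresholds below are illustrative assumptions, not fixed rules:

```python
# A sketch of the missing-value convention: choose drop vs. impute per
# column and record why. The 5%/40% thresholds are assumptions.
import pandas as pd

def handle_missing(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    df = df.copy()
    log = []
    for col in list(df.columns):
        frac = df[col].isna().mean()
        if frac == 0:
            continue
        if frac < 0.05:
            df = df.dropna(subset=[col])
            log.append(f"{col}: dropped rows ({frac:.1%} missing, negligible)")
        elif frac < 0.40 and pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
            log.append(f"{col}: median-imputed ({frac:.1%} missing)")
        else:
            df = df.drop(columns=[col])
            log.append(f"{col}: dropped column ({frac:.1%} missing, too sparse)")
    return df, log
```

Writing `log` into the report satisfies the "document strategy with justification" convention.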
Tips:
- `view_image` lets Codex see the charts it produces and improve them.
- The Codex environment has memory limits. Recommended: explore a subset first with `sample()`, or read large files in chunks with `pd.read_csv(path, chunksize=10000)`.
- To use a specific library, name it explicitly in the prompt, for example: "Use plotly for interactive charts". Codex will install it automatically.
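The chunked-read tip can be sketched as a per-chunk aggregation, so the full file never sits in memory at once (the row count here stands in for any real aggregation):

```python
# Sketch of memory-bounded loading: aggregate per chunk rather than
# reading the entire CSV at once.
import io
import pandas as pd

def row_count_by_chunks(csv_source, chunksize: int = 10_000) -> int:
    """Count rows without holding the full file in memory."""
    total = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += len(chunk)  # replace with a real per-chunk aggregation
    return total
```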
| Technology | Suggested version | Purpose |
|---|---|---|
| Python | 3.11+ | Runtime |
| pandas | 2.0+ | Data manipulation |
| matplotlib | 3.7+ | Base plotting |
| seaborn | 0.12+ | Statistical visualization |
| scikit-learn | 1.3+ | Machine-learning models |
| statsmodels | 0.14+ | Statistical models |
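The suggested minimums can be checked at runtime via each package's `__version__` attribute; this sketch only compares major.minor and reports missing packages rather than failing:

```python
# Quick runtime check against the suggested minimum versions above.
import importlib

MINIMUMS = {"pandas": "2.0", "matplotlib": "3.7", "seaborn": "0.12",
            "sklearn": "1.3", "statsmodels": "0.14"}

def version_tuple(v: str) -> tuple:
    """'2.1.3' -> (2, 1); compares major.minor only."""
    return tuple(int(p) for p in v.split(".")[:2] if p.isdigit())

def check_versions() -> dict:
    """Return {package: (installed_version, meets_minimum)}."""
    report = {}
    for pkg, minimum in MINIMUMS.items():
        try:
            mod = importlib.import_module(pkg)
            installed = getattr(mod, "__version__", "0")
            report[pkg] = (installed, version_tuple(installed) >= version_tuple(minimum))
        except ImportError:
            report[pkg] = ("missing", False)
    return report
```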