Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Use when you need to analyze datasets, perform statistical tests, create visualizations, or build predictive models with reproducible, code-based workflows.
Generate publication-quality charts, statistical reports, and actionable insights from data files or databases.
When to use this skill
You need to analyze datasets to understand patterns, trends, or relationships.
You want to perform statistical tests or build predictive models.
You need data visualizations (charts, graphs, dashboards) to communicate findings.
You're doing exploratory data analysis (EDA) to understand data structure and quality.
You need to clean, transform, or merge datasets for analysis.
You want reproducible analysis with documented methodology and code.
Key capabilities
Unlike point-solution data analysis tools:
Full Python ecosystem: Access to pandas, numpy, scikit-learn, statsmodels, matplotlib, seaborn, plotly, and more.
Runs locally: Your data stays on your machine; no uploads to third-party services.
Reproducible: All analysis is code-based and version controllable.
Customizable: Extend with any Python library or custom analysis logic.
Publication-quality output: Generate professional charts and reports.
Statistical rigor: Access to comprehensive statistical and ML libraries.
Inputs
Data sources: CSV files, Excel files, JSON, Parquet, or database connections.
Analysis goals: Questions to answer or hypotheses to test.
Variables of interest: Specific columns, metrics, or dimensions to focus on.
Output preferences: Chart types, report format, statistical tests needed.
Context: Business domain, data dictionary, or known data quality issues.
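As a minimal sketch of handling the supported file-based data sources, the loader below dispatches on file extension to the matching pandas reader. The function name and the dispatch table are illustrative assumptions, not part of the skill itself:

```python
# Hypothetical loader sketch: pick the pandas reader that matches the
# file extension. Database connections would use pd.read_sql instead.
from pathlib import Path

import pandas as pd

READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}


def load_data(path: str) -> pd.DataFrame:
    """Load a tabular data file into a DataFrame based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    return READERS[suffix](path)
```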
Out of scope
Real-time streaming data analysis (use appropriate streaming tools).
Extremely large datasets requiring distributed computing (use Spark/Dask instead).
Production ML model deployment (use ML ops tools and infrastructure).
Live dashboarding (use BI tools like Tableau/Looker for operational dashboards).
Conventions and best practices
Python environment
Use virtual environments to isolate dependencies.
Install only necessary packages for the specific analysis.
Document all dependencies in requirements.txt or environment.yml.
Code structure
Write self-contained scripts that can be re-run by others.
Use clear variable names and add comments for complex logic.
Separate concerns: data loading, cleaning, analysis, visualization.
Save intermediate results to files when analysis is multi-stage.
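The structure above can be sketched as a small script skeleton with one function per concern. The column names ("revenue", "region") and the output file name are hypothetical placeholders:

```python
# Skeleton separating loading, cleaning, analysis, and output. Each stage
# is a pure function over DataFrames so the script can be re-run end to end.
import pandas as pd


def load(path: str) -> pd.DataFrame:
    return pd.read_csv(path)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop exact duplicates and rows missing the target metric.
    return df.drop_duplicates().dropna(subset=["revenue"])


def analyze(df: pd.DataFrame) -> pd.DataFrame:
    # Aggregate the metric by a grouping dimension.
    return df.groupby("region", as_index=False)["revenue"].mean()


def main(path: str) -> pd.DataFrame:
    summary = analyze(clean(load(path)))
    # Save the intermediate result so later stages can pick it up.
    summary.to_csv("summary.csv", index=False)
    return summary
```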
Data handling
Never modify source data files – work on copies or in-memory dataframes.
Document data transformations clearly in code comments.
Handle missing values explicitly and document approach.
Validate data quality before analysis (check for nulls, outliers, duplicates).
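An explicit pre-analysis quality pass might look like the sketch below; the 1.5x-IQR outlier rule is a common but illustrative choice, not a fixed requirement:

```python
# Data-quality check covering the three items above: nulls, duplicates,
# and outliers. Works on a copy-free, read-only view of the DataFrame.
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize nulls, duplicate rows, and IQR-based outliers per numeric column."""
    report = {
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outliers": {},
    }
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        report["outliers"][col] = int(mask.sum())
    return report
```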
Visualization best practices
Choose appropriate chart types for the data and question.
Use clear labels, titles, and legends on all charts.
Apply appropriate color schemes (colorblind-friendly when possible).
Include sample sizes and confidence intervals where relevant.
Save visualizations in high-resolution formats (PNG 300 DPI, SVG for vector graphics).
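The practices above can be combined in one small plotting helper; the data, color choice (an Okabe-Ito palette value, which is colorblind-friendly), and file name are placeholders:

```python
# Chart export sketch: labeled axes, a title carrying the sample size,
# a colorblind-friendly color, and a 300 DPI PNG on disk.
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt


def save_bar_chart(labels, values, path="chart.png"):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, values, color="#E69F00")  # Okabe-Ito orange
    ax.set_xlabel("Category")
    ax.set_ylabel("Mean value")
    ax.set_title(f"Mean value by category (n={len(values)})")
    fig.tight_layout()
    fig.savefig(path, dpi=300)  # high-resolution PNG; use .svg for vector output
    plt.close(fig)
```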
Statistical analysis
State assumptions for statistical tests clearly.
Check assumptions before applying tests (normality, homoscedasticity, etc.).
Report effect sizes, not just p-values.
Use appropriate corrections for multiple comparisons.
Explain practical significance in addition to statistical significance.
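A minimal sketch of a two-group comparison following these rules: check the normality assumption first, fall back to a non-parametric test if it fails, and report an effect size alongside the p-value. The function name and threshold are assumptions:

```python
# Group comparison with an assumption check (Shapiro-Wilk), a Welch's
# t-test or Mann-Whitney fallback, and Cohen's d as the effect size.
import numpy as np
from scipy import stats


def compare_groups(a, b, alpha=0.05):
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Check normality before choosing the test.
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if normal:
        test = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    else:
        test = stats.mannwhitneyu(a, b)
    # Cohen's d with a pooled standard deviation.
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (a.mean() - b.mean()) / pooled
    return {"p_value": test.pvalue, "cohens_d": d, "assumed_normal": normal}
```

Reporting both `p_value` and `cohens_d` lets the reader judge practical significance, not just whether the difference is statistically detectable.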
Required behavior
Understand the question: Clarify what insights or decisions the analysis should support.
Explore the data: Check structure, types, missing values, distributions, outliers.
Clean and prepare: Handle missing data, outliers, and transformations appropriately.
Analyze systematically: Apply appropriate statistical methods or ML techniques.
Visualize effectively: Create clear, informative charts that answer the question.
Generate insights: Translate statistical findings into actionable business insights.
Document thoroughly: Explain methodology, assumptions, limitations, and conclusions.
Make reproducible: Ensure others can re-run the analysis and get the same results.
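For the reproducibility step, fixing random seeds at the top of the script is the minimal requirement; the seed value below is arbitrary:

```python
# Reproducibility sketch: seed every source of randomness once, up front,
# so re-running the script yields identical results.
import random

import numpy as np

SEED = 42
random.seed(SEED)
rng = np.random.default_rng(SEED)  # pass this rng explicitly to analysis functions

sample = rng.normal(size=5)  # identical on every run with the same SEED
```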
Required artifacts
Analysis script(s): Well-documented Python code performing the analysis.
Visualizations: Charts saved as high-quality image files (PNG/SVG).
Analysis report: Markdown or text document summarizing:
Research question and methodology
Data description and quality assessment
Key findings with supporting statistics
Visualizations with interpretations
Limitations and caveats
Recommendations or next steps
Requirements file: requirements.txt with all dependencies.
Sample data (if appropriate and non-sensitive): Small sample for reproducibility.
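A requirements file for a typical analysis might look like the fragment below; the exact packages and pinned versions are illustrative, and should reflect whatever the script actually imports:

```text
pandas==2.2.2
numpy==1.26.4
scipy==1.13.0
matplotlib==3.8.4
seaborn==0.13.2
scikit-learn==1.4.2
```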
Implementation checklist
1. Data exploration and preparation
Load data and inspect structure (shape, columns, types)
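The first checklist item can be sketched as a small inspection helper that collects the structural facts to review before any analysis; the function name is an assumption:

```python
# Structural inspection of a freshly loaded DataFrame: shape, column
# names, and dtypes, gathered into one dict for quick review or logging.
import pandas as pd


def inspect(df: pd.DataFrame) -> dict:
    """Collect basic structure facts: shape, columns, and column types."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }
```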