Analyze, explore, clean, and visualize datasets with statistical rigor. Use when user asks to analyze data, find patterns, compute statistics, create visualizations, clean messy data, or explore a dataset. Trigger when user says things like "analyze this data", "what trends do you see", "find patterns in", "create a chart", "clean this dataset", "run statistics on", "what does this data tell us", or provides CSV/Excel/JSON data for exploration. Also trigger for A/B test analysis, cohort analysis, and data quality assessments. Do NOT trigger for simple data format conversions, database query writing without analysis, or ETL pipeline design.
Explore, clean, analyze, and communicate findings from data. The goal is always to answer a question — start with what the user wants to know and work backward to the analysis that answers it.
Before touching the data, clarify:
If the user says "analyze this data" without a specific question, help them formulate one:
Run `profile_data.py` with a data file path to get a quick profile of any CSV, Excel, or JSON dataset — it reports shape, types, missing values, stats, and value distributions. Run with `--help` for usage.

Consult `statistical-tests-guide.md` when choosing statistical tests — it has a decision matrix for test selection, effect size interpretation tables, and sample size guidelines.

Before analysis, get your bearings:
```python
# First look at any dataset
import pandas as pd

df = pd.read_csv("data.csv")  # or read_excel, read_json, etc.
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst rows:\n{df.head()}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
```
Data quality determines analysis quality. Don't skip this.
```python
# Document your decisions
missing_pct = df.isnull().sum() / len(df) * 100
print("Missing data percentage per column:")
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```
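Once you know where the gaps are, record how you handle them. A minimal sketch of one defensible approach — the 50% drop threshold, the median imputation, and the toy DataFrame are all illustrative assumptions, not rules:

```python
import numpy as np
import pandas as pd

# Toy data with one mostly-empty column (illustrative)
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["NY", "LA", None, "NY", "SF"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

missing_pct = df.isnull().mean() * 100

# Drop columns that are mostly missing (the 50% cutoff is a judgment call; document it)
to_drop = missing_pct[missing_pct > 50].index.tolist()
df = df.drop(columns=to_drop)
print(f"Dropped: {to_drop}")

# Impute the rest, and keep a flag column so the imputation stays traceable
df["age_was_missing"] = df["age"].isnull()
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```

Keeping the `age_was_missing` flag means downstream analysis can still distinguish observed values from imputed ones.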
```python
# IQR method for outlier detection
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR)]
print(f"Found {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
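Detection is only half the job: decide what to do with the outliers and document why. A sketch of two common options, flagging versus capping at the IQR fences — the toy data and the choice of 1.5×IQR fences are illustrative:

```python
import pandas as pd

# Toy series with one extreme value (illustrative)
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 95]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: flag outliers but keep them (safest; preserves the raw data for review)
df["is_outlier"] = ~df["value"].between(lower, upper)

# Option 2: cap extreme values at the fences (winsorizing; changes the data, so say so)
df["value_capped"] = df["value"].clip(lower, upper)

print(df)
```

Which option is right depends on whether the extreme values are data errors or genuine signal; that judgment belongs in the write-up.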
Start broad, then focus on what's interesting.
Always start here — understand the basics before going deeper.
```python
# Numerical columns
print(df.describe())

# Categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))
```
Understanding shape matters for choosing the right tests.
```python
import matplotlib.pyplot as plt

# Distribution of key metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(['metric_a', 'metric_b', 'metric_c']):
    df[col].hist(ax=axes[i], bins=30)
    axes[i].set_title(col)
    axes[i].axvline(df[col].median(), color='red', linestyle='--', label='median')
    axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png")
```
Look for patterns between variables.
```python
# Correlation matrix for numerical columns
corr = df.select_dtypes(include='number').corr()
print("Strong correlations (|r| > 0.5):")
for i in range(len(corr.columns)):
    for j in range(i + 1, len(corr.columns)):
        if abs(corr.iloc[i, j]) > 0.5:
            print(f"  {corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i, j]:.3f}")
```
If the data has a time dimension, always look at trends.
```python
import matplotlib.pyplot as plt

# Time series analysis
df['date'] = pd.to_datetime(df['date'])
daily = df.groupby('date')['metric'].agg(['mean', 'count'])
daily['mean'].plot(figsize=(12, 4), title='Daily Average')
plt.savefig("trend.png")
```
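Daily series are often noisy, and smoothing makes the trend readable. A minimal sketch with synthetic data — the 7-day window is an assumption that suits day-of-week noise, and the weekly resample gives a coarser view:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a slow trend plus noise (illustrative)
dates = pd.date_range("2024-01-01", periods=90, freq="D")
rng = np.random.default_rng(1)
daily = pd.Series(np.sin(np.arange(90) / 10) + rng.normal(0, 0.3, 90), index=dates)

# 7-day rolling mean smooths day-of-week noise; min_periods=1 avoids leading NaNs
smooth = daily.rolling(window=7, min_periods=1).mean()

# Weekly resample for a coarser view of the same trend
weekly = daily.resample("W").mean()
print(f"{len(daily)} daily points, {len(weekly)} weekly points")
```

Plot the raw and smoothed series together so readers can judge how much structure the smoothing removed.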
Choose the right method for the question.
Use the right statistical test: check its assumptions (normality, equal variances, independence) first, and fall back to a non-parametric alternative when they fail. Always report effect sizes and confidence intervals alongside p-values, since statistical significance alone says nothing about practical importance.
```python
from scipy import stats

# Example: comparing two groups
group_a = df[df['variant'] == 'A']['metric']
group_b = df[df['variant'] == 'B']['metric']

# Check normality first (Shapiro-Wilk is only reliable up to ~5000 samples)
_, p_normal_a = stats.shapiro(group_a.sample(min(5000, len(group_a))))
_, p_normal_b = stats.shapiro(group_b.sample(min(5000, len(group_b))))

if p_normal_a > 0.05 and p_normal_b > 0.05:
    stat, p_value = stats.ttest_ind(group_a, group_b)
    test_name = "t-test"
```
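Alongside the test statistic, report an effect size so readers can judge practical magnitude. A minimal sketch of Cohen's d using the pooled standard deviation — one common definition among several; the simulated groups and their means are illustrative:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Simulated A/B groups with a true standardized difference of 0.5 (illustrative)
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, 500)
group_b = rng.normal(0.5, 1.0, 500)
print(f"Cohen's d: {cohens_d(group_b, group_a):.2f}")
```

By the usual rule of thumb, |d| around 0.2 is small, 0.5 medium, and 0.8 large; pair it with a confidence interval when the stakes warrant it.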