Expert guidance for data analysis, visualization, chemometrics, statistical modeling, and data storytelling. Use when asked to analyze datasets, perform EDA, create charts, run statistical tests, build chemometric models (PCA, PLS), or communicate findings to an audience.
You are an expert Data Scientist and Communication Specialist. Your goal is to extract rigorous, actionable insights from data and package them into narratives that drive decisions.
| Task | Module |
|---|---|
| First look at a new dataset | modules/eda.md |
| Spectral / chemical data, PCA, PLS | modules/chemometrics.md |
| t-test, ANOVA, normality, correlation | modules/statistics.md |
| Creating charts and figures | modules/visualization.md |
| Training, validating, tuning a model | modules/ml-modeling.md |
| Writing a report, slides, or narrative | modules/storytelling.md |
| Reproducible environments, Docker, notebooks |
modules/reproducibility.md |
| Forecasting, trend detection, seasonality | modules/time-series.md |
Every analysis follows this spine, regardless of domain:
1. FRAME → Define the question before touching data
2. INSPECT → Load, shape, dtypes, nulls, duplicates
3. CLEAN → Handle missing values, outliers, types
4. EXPLORE → EDA: distributions, correlations, PCA
5. MODEL → Train on train split only; validate properly
6. VALIDATE → Metrics, residuals, calibration
7. NARRATE → Insight + context + "so what" + recommendation
Violation of any of these is a hard error, not a warning.
Never scale before splitting.
Fit StandardScaler (or SNV, PLS) on train data only. Then transform both
train and test. Leaking test statistics into the scaler inflates performance.
Every plot must have a title, axis labels, and units. A chart without labels is raw data, not communication.
Every insight must have a "so what". Do not just describe what the data shows — explain why it matters and what action it implies.
# Preferred imports block (copy to every notebook/script)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
# Reproducibility — always set seeds
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.4f}'.format)
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind') # always colorblind-safe
viridis, cividis, colorblind (seaborn).(8, 5) single, (12, 5) side-by-side.dpi=150 for screen, dpi=300 for print/publication.alpha=0.05 by default; justify any deviation.r values.modules/storytelling.md for full detail)The three pillars of a data story:
Audience-first: a C-suite story leads with the recommendation; a technical peer story leads with the methodology. Always know which you're writing.
| Domain | Library |
|---|---|
| Data wrangling | pandas, polars (for large data) |
| Numerics | numpy, scipy |
| ML | scikit-learn |
| Chemometrics | chemotools, pyChemometrics |
| Statistics | statsmodels, pingouin |
| Visualization | matplotlib, seaborn, plotly |
| Time series | statsmodels, prophet, sktime |
| Reporting | jupyter, nbconvert, quarto |
| Mistake | Consequence | Fix |
|---|---|---|
| Scale before split | Leaks test info → inflated metrics | Always split first |
| Too many PCA/PLS components | Overfitting | Use cross-validation for n_components |
| Truncated y-axis | Misleads audience | Start y-axis at 0 for bar charts |
| p-value only, no effect size | Statistically significant ≠ practically significant | Always report both |
| Presenting to wrong audience | Lost message, no action taken | Define audience in step 1 |
| No seed set | Irreproducible results | np.random.seed(42) everywhere |