End-to-end biomarker discovery workflow from expression data to validated biomarker panels. Covers feature selection with Boruta/LASSO, classifier training with nested CV, and SHAP interpretation. Use when building and validating diagnostic or prognostic biomarker signatures from omics data.
Complete pipeline from expression data to a validated biomarker panel and a trained classifier.
Expression matrix + Metadata
|
v
[1. Data Preparation] -----> StandardScaler, train/test split
|
v
[2. Feature Selection] ----> Boruta or LASSO stability selection
|
v
[3. Model Training] -------> RandomForest/XGBoost with nested CV
|
v
[4. Model Interpretation] -> SHAP values, feature importance
|
v
[5. Validation] -----------> Hold-out test, bootstrap CI
|
v
Validated biomarker panel + classifier
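Before the step-by-step code that follows, the nested-CV pattern from step 3 can be sketched on synthetic data: an inner loop tunes hyperparameters, an outer loop estimates generalization performance. The hyperparameter grid and fold counts here are illustrative assumptions, not fixed recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for a filtered expression matrix
X, y = make_classification(n_samples=150, n_features=30,
                           n_informative=5, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: hyperparameter search (grid is illustrative)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, None], "n_estimators": [100]},
    cv=inner_cv, scoring="roc_auc",
)

# Outer loop: unbiased performance estimate of the whole tuning procedure
scores = cross_val_score(grid, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested-CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The key point is that the outer folds never see the data used to pick hyperparameters, so the reported AUC is not optimistically biased by the search.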
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
expr = pd.read_csv('expression.csv', index_col=0)
meta = pd.read_csv('metadata.csv', index_col=0)
X = expr.T # samples x genes
y = meta.loc[X.index, 'condition'].values
# test_size=0.2: Standard 80/20 split; use 0.3 for <100 samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Fit scaler on training only to prevent data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
QC Checkpoint 1: Check class balance, sample counts per group
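A minimal sketch of this checkpoint; the helper name, thresholds, and toy labels are illustrative assumptions, and in practice you would pass the `y_train`/`y_test` arrays produced above.

```python
import numpy as np

def qc_class_balance(y_train, y_test, min_per_class=10, max_imbalance=3.0):
    """Report per-split class counts; warn on tiny groups or severe imbalance."""
    report = {}
    for name, y in (("train", np.asarray(y_train)), ("test", np.asarray(y_test))):
        labels, counts = np.unique(y, return_counts=True)
        report[name] = dict(zip(labels.tolist(), counts.tolist()))
        if counts.min() < min_per_class:
            print(f"WARNING [{name}]: smallest class has only {counts.min()} samples")
        if counts.max() / counts.min() > max_imbalance:
            print(f"WARNING [{name}]: imbalance ratio {counts.max() / counts.min():.1f}x")
    return report

# Toy example with a 2:1 train imbalance and a small test group
rep = qc_class_balance(["tumor"] * 40 + ["normal"] * 20,
                       ["tumor"] * 10 + ["normal"] * 5)
print(rep)
```

Failing this checkpoint early (e.g. a class with fewer than ~10 samples) is the cue to use a larger test fraction or switch to repeated cross-validation instead of a single hold-out split.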
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
# Pre-filter with a univariate ANOVA F-test if >10k features, to keep
# Boruta/LASSO tractable; define the filtered matrices in both branches
if X_train_scaled.shape[1] > 10000:
    selector = SelectKBest(f_classif, k=5000)
    selector.fit(X_train_scaled, y_train)
    feature_mask = selector.get_support()
    X_train_filt = X_train_scaled[:, feature_mask]
    X_test_filt = X_test_scaled[:, feature_mask]
else:
    feature_mask = slice(None)  # keep all features
    X_train_filt, X_test_filt = X_train_scaled, X_test_scaled
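Boruta requires the third-party `boruta` package; the LASSO stability-selection alternative named in the diagram can be sketched with scikit-learn alone. The function name, bootstrap count, penalty strength `C`, and selection threshold below are illustrative assumptions to tune for your data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def lasso_stability_selection(X, y, n_boot=50, C=0.1, threshold=0.6, seed=0):
    """Select features whose L1-penalized logistic-regression coefficient is
    nonzero in at least `threshold` fraction of bootstrap resamples."""
    rng = np.random.RandomState(seed)
    hits = np.zeros(X.shape[1])
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng.randint(1_000_000))
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(Xb, yb)
        hits += (np.abs(clf.coef_).max(axis=0) > 1e-8)
    freq = hits / n_boot
    return freq >= threshold, freq

# Toy data: features 0 and 1 carry the signal, the rest are noise
rng = np.random.RandomState(42)
X_toy = rng.randn(120, 20)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
mask, freq = lasso_stability_selection(X_toy, y_toy)
print("selected features:", np.where(mask)[0])
```

Stability selection trades one arbitrary regularization path for a selection frequency, which is a more honest basis for a biomarker panel than a single LASSO fit on noisy, high-dimensional expression data.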