End-to-end biomarker discovery workflow from expression data to validated biomarker panels. Covers feature selection with Boruta/LASSO, classifier training with nested CV, and SHAP interpretation. Use when building and validating diagnostic or prognostic biomarker signatures from omics data.
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+, scanpy 1.10+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package>, then help(module.function) to check signatures.
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
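A quick way to do that introspection from Python is `inspect.signature`; a minimal sketch, using scikit-learn's `train_test_split` as an arbitrary example target:

```python
import inspect
from sklearn.model_selection import train_test_split

# Print the installed function's actual signature before relying on
# a code pattern that may assume a different library version.
sig = inspect.signature(train_test_split)
print(sig)

# Programmatic check: confirm a keyword argument exists before using it
assert 'stratify' in sig.parameters
```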
"Build a validated biomarker panel from my omics data" → Orchestrate feature selection (Boruta/LASSO), nested cross-validation classifier training, and SHAP interpretation to produce a robust, validated biomarker signature.
Complete pipeline from expression data to a validated biomarker panel plus trained classifier.
Expression matrix + Metadata
|
v
[1. Data Preparation] -----> StandardScaler, train/test split
|
v
[2. Feature Selection] ----> Boruta or LASSO stability selection
|
v
[3. Model Training] -------> RandomForest/XGBoost with nested CV
|
v
[4. Model Interpretation] -> SHAP values, feature importance
|
v
[5. Validation] -----------> Hold-out test, bootstrap CI
|
v
Validated biomarker panel + classifier
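The nested CV in step 3 can be sketched with scikit-learn alone. This is a minimal illustration, not the pipeline's actual configuration: the synthetic data, parameter grid, and fold counts are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for a prepared expression matrix
X, y = make_classification(n_samples=120, n_features=30, n_informative=8,
                           random_state=42)

# Inner loop tunes hyperparameters; outer loop estimates generalization
# on folds the tuning never saw, avoiding optimistic bias.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}
clf = GridSearchCV(RandomForestClassifier(random_state=42),
                   param_grid, cv=inner_cv, scoring='roc_auc')

scores = cross_val_score(clf, X, y, cv=outer_cv, scoring='roc_auc')
print(f"Nested CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```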
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
expr = pd.read_csv('expression.csv', index_col=0)
meta = pd.read_csv('metadata.csv', index_col=0)
X = expr.T # samples x genes
y = meta.loc[X.index, 'condition'].values
# test_size=0.2: Standard 80/20 split; use 0.3 for <100 samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Fit scaler on training only to prevent data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
QC Checkpoint 1: Check class balance, sample counts per group
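A minimal sketch of that checkpoint, assuming `y_train`/`y_test` from the split above (placeholder labels shown here so the snippet runs standalone; the 20% imbalance threshold is illustrative):

```python
import numpy as np
import pandas as pd

# Placeholder labels standing in for y_train / y_test from the split above
y_train = np.array(['case'] * 40 + ['control'] * 40)
y_test = np.array(['case'] * 10 + ['control'] * 10)

for name, labels in [('train', y_train), ('test', y_test)]:
    counts = pd.Series(labels).value_counts()
    print(f"{name}: n={len(labels)} samples, per-class counts:\n{counts}")
    # Flag severe imbalance (minority class under ~20% of samples)
    if counts.min() / len(labels) < 0.2:
        print(f"WARNING: {name} set is imbalanced; consider class weights "
              "or stratified resampling")
```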
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
# Pre-filter if >10k features
if X_train_scaled.shape[1] > 10000:
    selector = SelectKBest(f_classif, k=5000)
    selector.fit(X_train_scaled, y_train)
    feature_mask = selector.get_support()
    X_train_filt = X_train_scaled[:, feature_mask]
    X_test_filt = X_test_scaled[:, feature_mask]  # apply same mask to test set
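For the LASSO stability selection alternative named above, a minimal sklearn-only sketch: synthetic data stands in for `X_train_scaled`/`y_train`, and the 60% selection-frequency threshold, 50 bootstrap resamples, and `C=0.5` penalty strength are illustrative choices, not prescribed values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for X_train_scaled / y_train
X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=42)
rng = np.random.default_rng(42)

n_boot = 50
selection_counts = np.zeros(X.shape[1])
for _ in range(n_boot):
    # Refit an L1-penalized model on each bootstrap resample and record
    # which coefficients survive the penalty
    idx = rng.choice(len(y), size=len(y), replace=True)
    lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
    lasso.fit(X[idx], y[idx])
    selection_counts += (lasso.coef_.ravel() != 0)

# Keep features selected in >=60% of resamples: stable under perturbation
stable_mask = selection_counts / n_boot >= 0.6
print(f"Stable features: {stable_mask.sum()} of {X.shape[1]}")
```

Stability selection trades a single LASSO fit for many perturbed fits, so the surviving features are those robust to sampling noise rather than artifacts of one split.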