Name: Bio Metabolomics Statistical Analysis
Author: GPTomics

Type	Meaning	Detection	Action
TPMV (technical)	Below detection limit	Random missingness within detected features	Impute: half-minimum per feature or KNN
BPMV (biological)	Metabolite truly absent	Structured: all zeros in one group	Leave as-is or use two-part test

log2_data = np.log2(intensities.replace(0, np.nan))

log2_matrix <- log2(intensity_matrix)
log2_matrix[!is.finite(log2_matrix)] <- NA
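The table above separates random (technical) missingness from structured (biological) absence. A minimal Python sketch of that triage follows, assuming a features x samples table and a sample-to-group mapping. The function name `impute_half_minimum` and the toy data are hypothetical, and the rule "absent from every sample of one group but detected in another" is one possible reading of "structured".

```python
import numpy as np
import pandas as pd

def impute_half_minimum(intensities, groups):
    """intensities: features x samples; groups: Series mapping sample id -> group label."""
    data = intensities.replace(0, np.nan).astype(float)
    # Fraction of samples in which each feature was detected, per group
    detection = data.notna().astype(float).T.groupby(groups).mean().T
    # Structured missingness: absent from every sample of one group, detected in another
    structured = (detection == 0).any(axis=1) & (detection > 0).any(axis=1)
    # Half of the smallest observed intensity, computed per feature
    half_min = data.min(axis=1) / 2
    imputed = data.copy()
    random_missing = structured.index[~structured]
    # Impute only features with random missingness; structured features stay NaN
    imputed.loc[random_missing] = data.loc[random_missing].T.fillna(half_min).T
    return imputed, structured

# Toy data (hypothetical): f1 has one random gap, f2 is absent in all controls
toy = pd.DataFrame(
    {'s1': [100.0, 50.0, 10.0], 's2': [0.0, 60.0, 12.0],
     's3': [80.0, 0.0, 11.0], 's4': [90.0, 0.0, 13.0]},
    index=['f1', 'f2', 'f3'])
toy_groups = pd.Series({'s1': 'case', 's2': 'case', 's3': 'control', 's4': 'control'})
imputed, structured = impute_half_minimum(toy, toy_groups)
```

Imputation happens on the raw intensity scale, before the log2 transform, so the imputed value stays positive and finite after transformation.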

Method	When to use	Metabolomics note
PQN	Default for untargeted LC-MS metabolomics	More robust than TIC/median to dominant features
QC-RSC (LOESS)	Multi-batch LC-MS with pooled QC samples	Gold standard for batch correction; metabolomics-specific
VSN	High zero rate or heteroscedastic data	Handles zeros via arcsinh; replaces separate log2 step
TIC	Quick exploration; NMR data	Distorted by dominant features; avoid for LC-MS
Cyclic loess	Asymmetric DE (more up- than down-regulated)	Robust to assumption violations
None	IS-corrected data; single-batch balanced design	Check PCA on raw log2 data first

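# PQN on log2-scale data: quotients become differences, so subtract rather than divide
reference = log2_data.median(axis=1)               # per-feature median across samples
log_quotients = log2_data.sub(reference, axis=0)   # log2(sample / reference)
norm_factors = log_quotients.median(axis=0)        # per-sample log2 dilution factor
normalized = log2_data.sub(norm_factors, axis=1)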

# PQN on log2-scale data: quotients become differences, so subtract rather than divide
reference <- apply(log2_matrix, 1, median, na.rm = TRUE)
log_quotients <- sweep(log2_matrix, 1, reference, '-')
norm_factors <- apply(log_quotients, 2, median, na.rm = TRUE)
normalized <- sweep(log2_matrix, 2, norm_factors, '-')
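PQN is a ratio-based method: on the linear scale each sample is divided by its median quotient against a reference profile, which corresponds to a subtraction once the data are log2-transformed. The toy example below is a sketch on the linear scale. The function name `pqn_linear`, the base profile, and the dilution factors are made up for illustration.

```python
import numpy as np
import pandas as pd

def pqn_linear(intensities):
    """Probabilistic quotient normalization on linear-scale intensities (features x samples)."""
    reference = intensities.median(axis=1)           # per-feature reference profile
    quotients = intensities.div(reference, axis=0)   # feature-wise ratio to the reference
    factors = quotients.median(axis=0)               # per-sample dilution estimate
    return intensities.div(factors, axis=1), factors

# Toy data (hypothetical): one base profile measured at three dilutions
base = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
dilutions = {'s1': 1.0, 's2': 2.0, 's3': 0.5}
toy = pd.DataFrame({s: base * d for s, d in dilutions.items()})

pqn_normalized, pqn_factors = pqn_linear(toy)
```

In this idealized case the estimated factors equal the dilutions and every normalized column collapses back to the base profile. Real data will only approximate this, because some features genuinely differ between samples.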

Scenario	Recommended	Rationale
Small n (3-10/group), default	limma `eBayes(trend=TRUE, robust=TRUE)`	Borrows variance across features; adds ~10-20 effective df
Large n (>10/group), Python-only	Welch's t-test + BH	Per-feature variance reliable; limma converges to ordinary t-test
Non-normal after log transform	Wilcoxon rank-sum	Cannot reach p < 0.05 with n < 4/group
Zero-inflated (many BPMVs)	Two-part test	Separately tests presence/absence and abundance
Paired/repeated measures	Paired t-test or limma with blocking	`duplicateCorrelation()` in limma for repeated measures
3+ groups	Welch's ANOVA or limma F-test	Post-hoc: Games-Howell or Tukey HSD
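The two-part test in the table has no accompanying code, so here is a rough Python sketch of one common formulation: a pooled two-proportion z-test on detection rates combined with Welch's t-test on the detected values, summed as chi-square terms. The function name `two_part_test` is hypothetical, and converting the t-test p-value into a 1-df chi-square term is an assumption of this sketch, not something the source specifies.

```python
import numpy as np
from scipy import stats

def two_part_test(case_vals, ctrl_vals):
    """Return a combined p-value; NaN marks a non-detected value."""
    case_vals = np.asarray(case_vals, dtype=float)
    ctrl_vals = np.asarray(ctrl_vals, dtype=float)
    case_obs = case_vals[~np.isnan(case_vals)]
    ctrl_obs = ctrl_vals[~np.isnan(ctrl_vals)]
    n1, n2 = len(case_vals), len(ctrl_vals)
    chi2_total, df = 0.0, 0

    # Part 1: presence/absence, pooled two-proportion z-test
    pooled = (len(case_obs) + len(ctrl_obs)) / (n1 + n2)
    if 0 < pooled < 1:
        se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (len(case_obs) / n1 - len(ctrl_obs) / n2) / se
        chi2_total += z ** 2
        df += 1

    # Part 2: abundance among detected values, Welch's t-test
    if len(case_obs) >= 2 and len(ctrl_obs) >= 2:
        _, p_cont = stats.ttest_ind(case_obs, ctrl_obs, equal_var=False)
        if np.isfinite(p_cont):
            # Assumption: map the p-value to an equivalent 1-df chi-square statistic
            chi2_total += stats.chi2.isf(p_cont, 1)
            df += 1

    if df == 0:
        return np.nan
    return float(stats.chi2.sf(chi2_total, df))

# Toy feature (hypothetical): detected in all cases, in only 2 of 6 controls
p_two_part = two_part_test([10, 11, 12, 10.5, 11.5, 10.8],
                           [np.nan, np.nan, np.nan, np.nan, 9.0, 9.2])
```

When every value is detected in both groups, only the abundance part contributes and the result reduces to the Welch p-value.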

library(limma)

design <- model.matrix(~0 + condition, data = sample_info)
colnames(design) <- levels(factor(sample_info$condition))

fit <- lmFit(normalized_matrix, design)
contrast_matrix <- makeContrasts(Treatment - Control, levels = design)
fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2, trend = TRUE, robust = TRUE)

results <- topTable(fit2, coef = 1, number = Inf, adjust.method = 'BH')

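# Add batch as a covariate; strip the 'condition' prefix so makeContrasts(Treatment - Control) still resolves
design <- model.matrix(~0 + condition + batch, data = sample_info)
colnames(design) <- sub('^condition', '', colnames(design))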

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

intensities = pd.read_csv('feature_table.tsv', sep='\t', index_col=0)
metadata = pd.read_csv('sample_info.tsv', sep='\t')
case_samples = metadata.loc[metadata['group'] == 'case', 'sample_id'].tolist()
ctrl_samples = metadata.loc[metadata['group'] == 'control', 'sample_id'].tolist()

log2_data = np.log2(intensities.replace(0, np.nan))

pvalues, log2fcs = [], []
for feature in log2_data.index:
    case_vals = log2_data.loc[feature, case_samples].dropna().values.astype(float)
    ctrl_vals = log2_data.loc[feature, ctrl_samples].dropna().values.astype(float)
    if len(case_vals) >= 2 and len(ctrl_vals) >= 2:
        _, pval = ttest_ind(case_vals, ctrl_vals, equal_var=False)  # Welch's -- scipy defaults to Student's
        pvalues.append(pval)
        log2fcs.append(case_vals.mean() - ctrl_vals.mean())
    else:
        pvalues.append(np.nan)
        log2fcs.append(np.nan)

results = pd.DataFrame({'feature_id': log2_data.index, 'log2fc': log2fcs, 'pvalue': pvalues})
results = results.dropna(subset=['pvalue'])
_, results['padj'], _, _ = multipletests(results['pvalue'], method='fdr_bh')  # default is Holm-Sidak, not BH
results['significant'] = results['padj'] < 0.05
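To make the `method='fdr_bh'` step less of a black box, the Benjamini-Hochberg adjustment can be written out by hand. This is an illustrative numpy sketch (the function name `bh_adjust` is hypothetical), not a replacement for `multipletests`. It assumes NaN p-values have already been dropped, as in the workflow above, because the number of tests sets the correction.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

example_padj = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

For these four p-values the adjusted values are 0.02, 0.04, 0.04, and 0.02. The two smallest share an adjusted value because of the monotonicity step.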

Scenario	Test	Notes
2 groups, n >= 5/group	Welch's t-test	Always prefer over Student's; unequal variance is the norm
2 groups, non-normal after log	Mann-Whitney U	Cannot reach p < 0.05 with n < 4/group
2 groups, n < 5/group	limma moderated t	`eBayes(trend=TRUE)` borrows variance across features
Paired samples	Paired t-test	Pre/post, matched case-control
3+ groups	Welch's ANOVA	Post-hoc: Games-Howell or Dunn's test
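The note that a rank test cannot reach p < 0.05 with n < 4 per group follows from counting arrangements: with no ties, the most extreme two-sided exact p-value is 2 divided by the number of ways to split the pooled samples into the two groups. A short check (the function name is hypothetical):

```python
from math import comb

def min_two_sided_rank_p(n1, n2):
    """Smallest exact two-sided Mann-Whitney p-value attainable, assuming no ties."""
    return 2 / comb(n1 + n2, n1)

min_p_3v3 = min_two_sided_rank_p(3, 3)   # 2 / 20
min_p_4v4 = min_two_sided_rank_p(4, 4)   # 2 / 70
```

With 3 vs 3 the floor is 0.1, and with 4 vs 4 it is about 0.029. That floor applies to the raw p-value; after multiple-testing correction across many features the attainable adjusted p-value is higher still.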

log2fc = log2_data.loc[:, case_samples].mean(axis=1) - log2_data.loc[:, ctrl_samples].mean(axis=1)

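# na.rm = TRUE matches pandas, whose mean() skips missing values by default
log2fc <- rowMeans(normalized[, case_samples], na.rm = TRUE) - rowMeans(normalized[, ctrl_samples], na.rm = TRUE)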
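For reporting, a log2 fold change is often easier to read as a ratio or a percent change. Because it is a difference of log-scale means, it back-transforms to a ratio of geometric means, not arithmetic means. A small sketch with made-up intensities (the function name `summarize_log2fc` is hypothetical):

```python
import numpy as np

def summarize_log2fc(log2fc):
    """Convert log2 fold change to a linear ratio and a percent change."""
    ratio = 2.0 ** np.asarray(log2fc, dtype=float)
    percent_change = (ratio - 1.0) * 100.0
    return ratio, percent_change

# Toy feature (hypothetical): each case value is double its control counterpart
case_log2 = np.log2([200.0, 400.0, 800.0])
ctrl_log2 = np.log2([100.0, 200.0, 400.0])
toy_log2fc = case_log2.mean() - ctrl_log2.mean()
toy_ratio, toy_percent = summarize_log2fc(toy_log2fc)

# A 1.5-fold threshold expressed on the log2 scale
lfc_threshold = np.log2(1.5)
```

Here the log2 fold change is 1, which is a 2-fold ratio or a 100% increase. A 1.5-fold cutoff corresponds to roughly 0.585 on the log2 scale.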

library(ashr)

se <- sqrt(fit2$s2.post) * fit2$stdev.unscaled[, 1]
shrunk <- ash(fit2$coefficients[, 1], se, mixcompdist = 'normal')

shrunken_fc <- shrunk$result$PosteriorMean
lfsr <- shrunk$result$lfsr

fit2 <- treat(fit2, lfc = log2(1.5))
results <- topTreat(fit2, coef = 1, number = Inf)

import matplotlib.pyplot as plt

results['sig_and_fc'] = (results['padj'] < 0.05) & (results['log2fc'].abs() > 1)
colors = ['red' if s else 'gray' for s in results['sig_and_fc']]

plt.figure(figsize=(8, 6))
# Plot adjusted p-values so the dashed line matches the padj < 0.05 cutoff used for coloring
plt.scatter(results['log2fc'], -np.log10(results['padj']), c=colors, alpha=0.6, s=20)
plt.axhline(-np.log10(0.05), linestyle='--', color='gray')
plt.axvline(-1, linestyle='--', color='gray')
plt.axvline(1, linestyle='--', color='gray')
plt.xlabel('Log2 Fold Change')
plt.ylabel('-log10(adjusted p-value)')
plt.savefig('volcano_plot.png', dpi=150, bbox_inches='tight')

library(ggplot2)

results$sig_and_fc <- results$adj.P.Val < 0.05 & abs(results$logFC) > 1

ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = sig_and_fc)) +
    geom_point(alpha = 0.6) +
    scale_color_manual(values = c('FALSE' = 'gray60', 'TRUE' = 'firebrick')) +
    geom_hline(yintercept = -log10(0.05), linetype = 'dashed') +
    geom_vline(xintercept = c(-1, 1), linetype = 'dashed') +
    labs(x = 'Log2 Fold Change', y = '-Log10 Adjusted P-value') +
    theme_bw()

library(pcaMethods)

pca_result <- pca(data, nPcs = 5, method = 'ppca')

scores <- as.data.frame(scores(pca_result))
scores$group <- groups

ggplot(scores, aes(x = PC1, y = PC2, color = group)) +
    geom_point(size = 3) +
    stat_ellipse(level = 0.95) +
    labs(x = paste0('PC1 (', round(pca_result@R2[1] * 100, 1), '%)'),
         y = paste0('PC2 (', round(pca_result@R2[2] * 100, 1), '%)')) +
    theme_bw()

loadings <- as.data.frame(loadings(pca_result))
top_pc1 <- loadings[order(abs(loadings$PC1), decreasing = TRUE)[1:20], ]
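For a Python-only workflow, the same scores, loadings, and variance-explained quantities can be sketched with a plain SVD. Unlike the `ppca` method above, this version assumes a complete samples x features matrix with no missing values. The function name `pca_svd` and the simulated data are hypothetical.

```python
import numpy as np

def pca_svd(data, n_components=2):
    """PCA via SVD on a complete samples x features matrix."""
    X = np.asarray(data, dtype=float)
    centered = X - X.mean(axis=0)                 # center each feature
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components].T                # features x components
    explained = (S ** 2) / np.sum(S ** 2)         # fraction of variance per component
    return scores, loadings, explained[:n_components]

# Simulated data (hypothetical): one dominant direction plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=20)
toy_matrix = np.outer(latent, [1.0, 2.0, 3.0, 4.0]) + 0.01 * rng.normal(size=(20, 4))
pca_scores, pca_loadings, pca_explained = pca_svd(toy_matrix)
```

The explained-variance fractions play the same role as `pca_result@R2` in the axis labels of the R plot.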

library(mixOmics)

plsda_result <- plsda(as.matrix(data), groups, ncomp = 3)

perf_plsda <- perf(plsda_result, validation = 'Mfold', folds = 5, nrepeat = 50)
plot(perf_plsda, col = color.mixo(5:7))

ncomp_opt <- perf_plsda$choice.ncomp['BER', 'centroids.dist']

final_plsda <- plsda(as.matrix(data), groups, ncomp = ncomp_opt)
plotIndiv(final_plsda, group = groups, ellipse = TRUE, legend = TRUE)

vip <- vip(final_plsda)
top_vip <- sort(vip[, ncomp_opt], decreasing = TRUE)[1:20]

tune_splsda <- tune.splsda(as.matrix(data), groups, ncomp = 3,
                            validation = 'Mfold', folds = 5, nrepeat = 50,
                            test.keepX = c(5, 10, 20, 50, 100))

optimal_keepX <- tune_splsda$choice.keepX

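# keepX was tuned for 3 components; keep only the first ncomp_opt entries so lengths match
splsda_result <- splsda(as.matrix(data), groups, ncomp = ncomp_opt, keepX = optimal_keepX[1:ncomp_opt])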

selected_features <- selectVar(splsda_result, comp = 1)$name

library(ropls)

oplsda <- opls(data, groups, predI = 1, orthoI = NA)
plot(oplsda, typeVc = 'x-score')
plot(oplsda, typeVc = 'x-loading')

vip_scores <- getVipVn(oplsda)
top_vip <- sort(vip_scores, decreasing = TRUE)[1:20]

library(randomForest)

# y must be a factor for classification; a numeric y would trigger regression
rf_model <- randomForest(x = data, y = factor(groups), importance = TRUE, ntree = 500)
importance <- importance(rf_model)
top_features <- rownames(importance)[order(importance[, 'MeanDecreaseAccuracy'], decreasing = TRUE)[1:20]]
varImpPlot(rf_model, n.var = 20)

library(pROC)

top_feature <- 'feature_123'
roc_result <- roc(groups, data[, top_feature])
plot(roc_result, main = paste('AUC =', round(auc(roc_result), 3)))

biomarkers <- c('feature_1', 'feature_2', 'feature_3')
for (feat in biomarkers) {
    roc_i <- roc(groups, data[, feat])
    cat(feat, ': AUC =', round(auc(roc_i), 3), '\n')
}
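The single-feature AUC has a direct interpretation: it is the probability that a randomly chosen case has a higher value than a randomly chosen control, with ties counted as half. A numpy sketch of that definition follows (the function name `auc_pairwise` and the values are hypothetical). As far as I know, pROC's default `direction = 'auto'` picks the comparison direction automatically, so its reported AUC may be the complement of this value when cases tend to be lower.

```python
import numpy as np

def auc_pairwise(case_vals, ctrl_vals):
    """AUC as P(case > control) + 0.5 * P(tie), computed over all case/control pairs."""
    case = np.asarray(case_vals, dtype=float)[:, None]
    ctrl = np.asarray(ctrl_vals, dtype=float)[None, :]
    wins = (case > ctrl).sum() + 0.5 * (case == ctrl).sum()
    return wins / (case.size * ctrl.size)

# Toy values (hypothetical): 8 of the 9 case/control pairs are ordered case > control
toy_auc = auc_pairwise([3.0, 4.0, 5.0], [1.0, 2.0, 3.5])
```

This all-pairs form is quadratic in sample size, which is fine for typical metabolomics cohorts but slow for very large ones.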

library(pheatmap)

top_features <- rownames(sig_features)[1:50]
data_top <- data[, top_features]
# After transposing, samples are columns, so the group annotation is a column annotation
annotation_col <- data.frame(Group = groups)
rownames(annotation_col) <- rownames(data)

pheatmap(t(data_top), annotation_col = annotation_col,
         scale = 'row', clustering_method = 'ward.D2',
         filename = 'heatmap.png', width = 10, height = 12)

Bio Metabolomics Statistical Analysis

Version Compatibility

Metabolomics Statistical Analysis

Processing Pipeline

Zero and Missing Value Handling

Normalization

Method Selection

limma Workflow (R)

Welch's t-test Workflow (Python)

Fold Change Calculation

Fold Change Reporting

Fold Change Shrinkage (ashr)

Minimum Fold Change Testing

Common Pitfalls

Volcano Plot

PCA

PLS-DA

sPLS-DA (Sparse)

OPLS-DA

Random Forest

ROC Analysis

Heatmap

Related Skills

Deep Research

Data Analyst

Academic Researcher

Data Scientist

Biopython

Binary Analysis Patterns