Integrate and analyze multiple omics datasets (transcriptomics, proteomics, epigenomics, genomics, metabolomics) for systems biology and precision medicine. Performs cross-omics correlation, multi-omics clustering (MOFA+, NMF), pathway-level integration, and sample matching. Coordinates ToolUniverse skills for expression data (RNA-seq), epigenomics (methylation, ChIP-seq), variants (SNVs, CNVs), protein interactions, and pathway enrichment. Use when analyzing multi-omics datasets, performing integrative analysis, discovering multi-omics biomarkers, studying disease mechanisms across molecular layers, or conducting systems biology research that requires coordinated analysis of transcriptome, genome, epigenome, proteome, and metabolome data.
Coordinate and integrate multiple omics datasets for comprehensive systems biology analysis. This skill orchestrates specialized ToolUniverse skills to perform cross-omics correlation, multi-omics clustering, pathway-level integration, and unified interpretation across molecular layers.
Triggers:
Example Questions This Skill Solves:
| Capability | Description |
|---|---|
| Data Integration | Match samples across omics, handle missing data, normalize scales |
| Cross-Omics Correlation | Correlate features across molecular layers (gene expression vs protein, methylation vs expression) |
| Multi-Omics Clustering | MOFA+, NMF, joint clustering to identify omics-driven subtypes |
| Pathway Integration | Combine omics evidence at pathway level for unified biological interpretation |
| Biomarker Discovery | Identify multi-omics signatures with improved predictive power |
| Skill Coordination | Orchestrate RNA-seq, epigenomics, variant-analysis, protein-interactions, gene-enrichment skills |
| Visualization | Circos plots, integrated heatmaps, network visualizations |
| Reporting | Unified multi-omics reports with cross-layer insights |
Input: Multiple Omics Datasets
|
v
Phase 1: Data Loading & QC
|-- Load RNA-seq (expression matrix)
|-- Load proteomics (protein abundance)
|-- Load methylation (beta values or M-values)
|-- Load variants (CNV, SNV from VCF)
|-- Load metabolomics (metabolite abundance)
|-- Quality control per omics type
|
v
Phase 2: Sample Matching
|-- Match samples across omics by ID
|-- Identify common samples
|-- Handle batch effects
|-- Normalize sample identifiers
|
v
Phase 3: Feature Mapping
|-- Map features to common identifier space (genes, proteins, metabolites)
|-- Link CpG sites to genes (promoter, gene body)
|-- Map variants to genes
|-- Create unified feature matrix
|
v
Phase 4: Cross-Omics Correlation
|-- Gene expression vs protein abundance (translation efficiency)
|-- Promoter methylation vs expression (epigenetic regulation)
|-- CNV vs expression (dosage effect)
|-- eQTL variants vs expression (genetic regulation)
|-- Metabolite vs enzyme expression (metabolic flux)
|
v
Phase 5: Multi-Omics Clustering
|-- MOFA+ (Multi-Omics Factor Analysis) for latent factors
|-- NMF (Non-negative Matrix Factorization) for patient subtypes
|-- Joint clustering across omics
|-- Identify omics-specific vs shared variation
|
v
Phase 6: Pathway-Level Integration
|-- Aggregate omics to pathway level
|-- Score pathway dysregulation (combined evidence)
|-- Use ToolUniverse enrichment tools (Reactome, KEGG, GO)
|-- Identify driver pathways across omics
|
v
Phase 7: Biomarker Discovery
|-- Feature selection across omics
|-- Multi-omics signatures for classification
|-- Cross-validation and performance
|-- Interpretation and biological validation
|
v
Phase 8: Generate Integrated Report
|-- Summary statistics per omics
|-- Cross-omics correlation results
|-- Multi-omics clusters and subtypes
|-- Top dysregulated pathways
|-- Multi-omics biomarkers
|-- Biological interpretation
Objective: Load multiple omics datasets and perform quality control.
Supported omics types:
Data formats:
Quality control per omics:
# RNA-seq QC
- Filter low-count genes (mean counts < threshold)
- Normalize (TPM, FPKM, or DESeq2)
- Log-transform for correlation
# Proteomics QC
- Filter proteins with high missing values
- Impute missing values (minimum, KNN)
- Normalize (median, quantile)
# Methylation QC
- Remove failed probes
- Correct for batch effects (ComBat)
- Filter cross-reactive probes
# Variants QC
- Use variant-analysis skill for VCF QC
- CNV segmentation validation
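The RNA-seq and proteomics QC steps above can be sketched with pandas. This is a minimal illustration, assuming features-by-samples DataFrames and log-scale protein abundances. The function names and the default thresholds (`min_mean_count`, `max_missing_frac`) are hypothetical choices, not values prescribed by this skill, and the simple log2(x + 1) transform and median centering stand in for DESeq2 or quantile normalization.

```python
import numpy as np
import pandas as pd


def qc_rnaseq(counts: pd.DataFrame, min_mean_count: float = 10.0) -> pd.DataFrame:
    """Filter low-count genes and log-transform a genes x samples count matrix."""
    # Keep genes whose mean count across samples reaches the threshold
    keep = counts.mean(axis=1) >= min_mean_count
    # log2(x + 1) keeps zero counts finite
    return np.log2(counts.loc[keep] + 1.0)


def qc_proteomics(abundance: pd.DataFrame, max_missing_frac: float = 0.5) -> pd.DataFrame:
    """Filter, impute, and median-center a proteins x samples abundance matrix."""
    # Drop proteins missing in more than max_missing_frac of samples
    keep = abundance.isna().mean(axis=1) <= max_missing_frac
    filtered = abundance.loc[keep]
    # Minimum imputation: fill each protein's gaps with its own observed minimum
    imputed = filtered.apply(lambda row: row.fillna(row.min()), axis=1)
    # Median normalization: subtract each sample's median (assumes log-scale values)
    return imputed - imputed.median(axis=0)
```

KNN imputation or quantile normalization can replace the corresponding steps without changing the function signatures.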
Objective: Identify common samples across omics datasets.
Sample ID harmonization:
def match_samples_across_omics(omics_data_dict):
    """
    Match samples across multiple omics datasets.

    Parameters:
        omics_data_dict: {
            'rnaseq': DataFrame (genes x samples),
            'proteomics': DataFrame (proteins x samples),
            'methylation': DataFrame (CpGs x samples),
            'cnv': DataFrame (genes x samples)
        }

    Returns:
        - common_samples: Sorted list of sample IDs present in all omics
        - matched_data: Dict of DataFrames with common samples only
    """
    # Nothing to match if no datasets were provided
    if not omics_data_dict:
        return [], {}

    # Extract sample IDs (column names) from each omics
    sample_ids = {
        omics_type: set(df.columns)
        for omics_type, df in omics_data_dict.items()
    }

    # Find common samples (intersection), sorted so every omics shares one column order
    common_samples = sorted(set.intersection(*sample_ids.values()))

    # Subset each omics to common samples
    matched_data = {
        omics_type: df[common_samples]
        for omics_type, df in omics_data_dict.items()
    }

    return common_samples, matched_data
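Matching by intersection only works when the same sample carries the same ID in every dataset. The sketch below normalizes column names before matching. The normalization rule (trim whitespace, upper-case, underscores to hyphens, optional platform suffix removal) and both function names are assumptions for illustration; real cohorts often need a dedicated ID mapping table instead.

```python
import pandas as pd


def normalize_sample_id(sample_id, strip_suffix: str = "") -> str:
    """Apply a simple, hypothetical normalization rule to one sample ID."""
    cleaned = str(sample_id).strip().upper().replace("_", "-")
    # Optionally drop a platform-specific suffix such as "-PROT"
    if strip_suffix and cleaned.endswith(strip_suffix.upper()):
        cleaned = cleaned[: -len(strip_suffix)]
    return cleaned


def harmonize_columns(df: pd.DataFrame, strip_suffix: str = "") -> pd.DataFrame:
    """Rename the sample columns of a features x samples DataFrame."""
    renamed = df.rename(columns=lambda c: normalize_sample_id(c, strip_suffix))
    # Two raw IDs collapsing to one normalized ID would silently merge samples
    if renamed.columns.duplicated().any():
        raise ValueError("Sample IDs collide after normalization")
    return renamed
```

Running each dataset through `harmonize_columns` before the intersection step avoids losing samples to trivial formatting differences.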
Handling missing omics:
Objective: Map features from different omics to common gene-level identifiers.
Gene-centric integration:
# Map all features to genes
feature_mapping = {
    'rnaseq': 'gene_symbol',        # Already gene-level
    'proteomics': 'gene_symbol',    # Map protein to gene
    'methylation': 'gene_symbol',   # Map CpG to gene (promoter)
    'cnv': 'gene_symbol',           # CNV regions to overlapping genes
    'metabolomics': 'enzyme_gene'   # Metabolite to enzyme gene
}
CpG to gene mapping:
CNV to gene mapping:
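Both mappings can be sketched with pandas joins. This is an illustrative sketch, not the skill's actual implementation: the annotation column names (`cpg_id`, `gene_symbol`, `region`, `chrom`, `start`, `end`, `log2_ratio`) are assumed, promoter CpGs are averaged per gene, and each gene receives the mean log2 ratio of the segments that overlap it.

```python
import pandas as pd


def cpg_to_gene(beta: pd.DataFrame, annotation: pd.DataFrame) -> pd.DataFrame:
    """Average promoter CpG beta values per gene.

    beta: CpGs x samples. annotation: one row per CpG with the assumed
    columns 'cpg_id', 'gene_symbol', and 'region'.
    """
    promoter = annotation[annotation["region"] == "promoter"]
    merged = promoter.merge(beta, left_on="cpg_id", right_index=True, how="inner")
    return merged.groupby("gene_symbol")[list(beta.columns)].mean()


def cnv_segments_to_genes(segments: pd.DataFrame, genes: pd.DataFrame) -> pd.DataFrame:
    """Assign each gene the mean log2 ratio of the CNV segments overlapping it.

    segments: 'chrom', 'start', 'end', 'log2_ratio' for one sample.
    genes: 'gene_symbol', 'chrom', 'start', 'end'.
    """
    # Pair every gene with every segment on the same chromosome
    pairs = genes.merge(segments, on="chrom", suffixes=("_gene", "_seg"))
    # Two intervals overlap when each starts before the other ends
    overlaps = pairs[
        (pairs["start_seg"] <= pairs["end_gene"])
        & (pairs["end_seg"] >= pairs["start_gene"])
    ]
    return overlaps.groupby("gene_symbol")["log2_ratio"].mean().to_frame()
```

The per-chromosome join is quadratic in the worst case; an interval tree or a dedicated genomic-interval library would scale better for genome-wide segment files.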
Objective: Correlate features across molecular layers to understand regulation.
Example analyses:
from scipy.stats import pearsonr, spearmanr


def correlate_rna_protein(rnaseq_data, proteomics_data):
    """
    Correlate mRNA and protein levels for each gene.

    Both inputs are genes x samples with identical sample order.
    Expected: Positive correlation (r ~ 0.4-0.6 typical)
    Discordance indicates post-transcriptional regulation
    """
    # Find common genes
    common_genes = set(rnaseq_data.index) & set(proteomics_data.index)

    correlations = {}
    for gene in sorted(common_genes):
        rna = rnaseq_data.loc[gene]
        protein = proteomics_data.loc[gene]

        # Spearman correlation (robust to outliers)
        r, p = spearmanr(rna, protein)
        correlations[gene] = {'r': r, 'p': p}

    # Identify discordant genes (low RNA-protein correlation)
    discordant = {g: v for g, v in correlations.items() if abs(v['r']) < 0.2}

    return correlations, discordant


def correlate_methylation_expression(methylation_data, rnaseq_data):
    """
    Correlate promoter methylation with gene expression.

    Expected: Negative correlation (increased methylation → decreased expression)
    """
    # For each gene with promoter methylation
    results = {}
    for gene in methylation_data.index:
        if gene in rnaseq_data.index:
            meth = methylation_data.loc[gene]  # Average promoter beta
            expr = rnaseq_data.loc[gene]

            r, p = spearmanr(meth, expr)
            results[gene] = {'r': r, 'p': p, 'direction': 'repressive' if r < 0 else 'activating'}

    # Identify genes with strong methylation-expression anticorrelation
    regulated = {g: v for g, v in results.items() if v['r'] < -0.5 and v['p'] < 0.01}

    return results, regulated


def correlate_cnv_expression(cnv_data, rnaseq_data):
    """
    Correlate copy number with gene expression.

    Expected: Positive correlation (gene dosage effect)
    """
    results = {}
    for gene in cnv_data.index:
        if gene in rnaseq_data.index:
            cnv = cnv_data.loc[gene]  # log2 ratio
            expr = rnaseq_data.loc[gene]

            r, p = pearsonr(cnv, expr)
            results[gene] = {'r': r, 'p': p}

    # Genes with dosage effect (CNV drives expression)
    dosage_genes = {g: v for g, v in results.items() if v['r'] > 0.5 and v['p'] < 0.01}

    return results, dosage_genes
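The per-gene correlations above assume that both matrices list samples in the same order. The following sketch, with a hypothetical function name and toy values, aligns samples by ID before correlating and returns the result as a table, which protects against silently correlating mismatched columns.

```python
import pandas as pd
from scipy.stats import spearmanr


def rna_protein_correlation_table(rna: pd.DataFrame, protein: pd.DataFrame) -> pd.DataFrame:
    """Per-gene Spearman correlation between two genes x samples matrices."""
    common_genes = rna.index.intersection(protein.index)
    # Align on sample IDs so column order in either input does not matter
    common_samples = rna.columns.intersection(protein.columns)

    rows = []
    for gene in common_genes:
        r, p = spearmanr(rna.loc[gene, common_samples], protein.loc[gene, common_samples])
        rows.append({"gene": gene, "r": r, "p": p})
    return pd.DataFrame(rows, columns=["gene", "r", "p"]).set_index("gene")


# Toy data: protein columns are deliberately stored in reverse sample order
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
rna = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], [3, 1, 2, 6, 5, 4]],
    index=["GENE_A", "GENE_B", "GENE_C"], columns=samples,
)
protein = pd.DataFrame(
    [[2, 4, 6, 8, 10, 12], [6, 5, 4, 3, 2, 1]],
    index=["GENE_A", "GENE_B"], columns=samples,
)[samples[::-1]]

table = rna_protein_correlation_table(rna, protein)
```

In this toy input `GENE_A` is perfectly concordant, `GENE_B` is perfectly anticorrelated, and `GENE_C` is dropped because it has no protein measurement.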
Objective: Identify patient subtypes using integrated omics data.
Method 1: MOFA+ (Multi-Omics Factor Analysis)
MOFA+ identifies latent factors that explain variation across omics.
# Conceptual workflow (uses R's MOFA2 package or Python implementation)
# 1. Prepare multi-omics data as list of matrices
# 2. Run MOFA+ to identify factors
# 3. Inspect factor variance explained per omics
# 4. Cluster samples based on factor scores
# Example interpretation:
# Factor 1: Explains 40% variance in RNA-seq, 30% in proteomics → Cell proliferation
# Factor 2: Explains 50% variance in methylation → Epigenetic subtype
# Factor 3: Explains 20% variance in CNV → Genomic instability
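Step 4 of the conceptual workflow (clustering samples on factor scores) can be sketched independently of how the factors were obtained. The example assumes the factor scores have already been exported from a MOFA+ run into a samples-by-factors DataFrame; the function name, the candidate range of k, and the use of the silhouette score to pick k are illustrative assumptions rather than part of MOFA+ itself.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def cluster_factor_scores(factors: pd.DataFrame, k_range=range(2, 6), random_state: int = 42):
    """Cluster samples on latent factor scores, choosing k by silhouette score.

    factors: samples x factors matrix (assumed to come from a prior MOFA+ run).
    Returns (cluster labels as a Series, chosen k, silhouette score).
    """
    best = None
    for k in k_range:
        # The silhouette score needs fewer clusters than samples
        if k >= len(factors):
            break
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(factors.values)
        score = silhouette_score(factors.values, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)

    if best is None:
        raise ValueError("Not enough samples for the requested range of k")

    k, score, labels = best
    return pd.Series(labels, index=factors.index, name="cluster"), k, score
```

Hierarchical clustering or a Gaussian mixture model would be reasonable substitutes for k-means at this step.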
Method 2: Joint NMF (Non-negative Matrix Factorization)
Decompose multi-omics matrices into shared latent components.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF


def joint_nmf_clustering(omics_data_dict, n_clusters=3):
    """
    Perform joint NMF across omics for clustering.

    Each matrix must be features x samples, share the same sample order,
    and contain only non-negative values (a requirement of NMF).
    Returns patient cluster assignments based on shared factors.
    """
    # Stack omics matrices along the feature axis (after normalization)
    combined_matrix = np.vstack([
        omics_data_dict['rnaseq'].values,
        omics_data_dict['proteomics'].values,
        omics_data_dict['methylation'].values
    ])

    # Run NMF
    model = NMF(n_components=n_clusters, init='nndsvd', random_state=42)
    W = model.fit_transform(combined_matrix)  # Feature loadings (features x components)
    H = model.components_                     # Sample coefficients (components x samples)

    # Cluster samples based on their coefficients in H
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(H.T)

    return clusters, W, H
Method 3: Similarity Network Fusion (SNF)
Integrate omics through patient similarity networks.
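The idea of integrating through patient similarity networks can be illustrated with a deliberately simplified sketch. It builds one RBF affinity matrix per omics layer, averages them, and runs spectral clustering on the result. This is not full SNF, which iteratively diffuses each network through the others; the plain average, the kernel choice, and the function name are all assumptions made for brevity.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler


def fused_similarity_clustering(omics_matrices, n_clusters=3, random_state=42):
    """Cluster patients on an averaged patient-similarity network.

    omics_matrices: list of arrays shaped (samples x features), all in the
    same sample order. Returns (cluster labels, fused affinity matrix).
    """
    affinities = []
    for X in omics_matrices:
        # Standardize features so no single feature dominates the distances
        Z = StandardScaler().fit_transform(X)
        # RBF affinity between patients, with gamma scaled by feature count
        affinities.append(rbf_kernel(Z, gamma=1.0 / Z.shape[1]))

    # Simplification: average the networks instead of SNF's cross-diffusion
    fused = np.mean(affinities, axis=0)

    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=random_state
    ).fit_predict(fused)
    return labels, fused
```

Note that this sketch expects samples in rows, the transpose of the features-by-samples layout used elsewhere in this document.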
Objective: Aggregate multi-omics evidence at the pathway level.
Approach: Score pathway dysregulation using combined evidence from multiple omics.
import numpy as np


def integrate_pathway_evidence(omics_results, pathway_genes):
    """
    Score pathway dysregulation across omics.

    omics_results: {
        'rnaseq': {'gene': fold_change},
        'proteomics': {'gene': fold_change},
        'methylation': {'gene': methylation_diff},
        'cnv': {'gene': log2_copy_ratio}
    }
    pathway_genes: List of genes in pathway
    """
    pathway_scores = []
    omics_with_evidence = set()  # Omics layers that contribute to at least one gene

    # For each gene in pathway
    for gene in pathway_genes:
        gene_score = 0
        evidence_count = 0

        # RNA-seq, proteomics, methylation, and CNV evidence.
        # Absolute values are used, so changes in either direction count
        # as dysregulation (including the repressive sign of methylation).
        for omics_type in ('rnaseq', 'proteomics', 'methylation', 'cnv'):
            values = omics_results.get(omics_type, {})
            if gene in values:
                gene_score += abs(values[gene])
                evidence_count += 1
                omics_with_evidence.add(omics_type)

        if evidence_count > 0:
            pathway_scores.append(gene_score / evidence_count)

    # Aggregate pathway score (mean of gene scores)
    pathway_score = np.mean(pathway_scores) if pathway_scores else 0

    return {
        'pathway_score': pathway_score,
        'n_genes_with_evidence': len(pathway_scores),
        # Counted over the whole pathway, not just the last gene visited
        'n_omics_types': len(omics_with_evidence)
    }
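A small worked example makes the scoring arithmetic concrete. The sketch below is a vectorized variant with a hypothetical function name and toy effect sizes: each gene's score is the mean absolute effect over the omics layers that measured it, and the pathway score is the mean of those gene scores.

```python
import pandas as pd


def pathway_score_table(omics_results, pathway_genes):
    """Vectorized pathway scoring: mean absolute effect per gene, then per pathway."""
    # Rows are genes, columns are omics layers; unmeasured genes become NaN
    evidence = pd.DataFrame(omics_results).abs().reindex(pathway_genes)
    # Mean over available layers; genes with no evidence at all are dropped
    gene_scores = evidence.mean(axis=1).dropna()
    return {
        "pathway_score": float(gene_scores.mean()) if len(gene_scores) else 0.0,
        "n_genes_with_evidence": int(len(gene_scores)),
        "n_omics_types": int(evidence.notna().any(axis=0).sum()),
    }


# Toy effect sizes (illustrative values only)
toy = {
    "rnaseq": {"CDK1": 2.0, "CCNB1": -1.0},
    "proteomics": {"CDK1": 1.0},
    "methylation": {"CCNB1": -0.4},
}
result = pathway_score_table(toy, ["CDK1", "CCNB1", "PLK1"])
```

Here CDK1 scores (2.0 + 1.0) / 2 = 1.5, CCNB1 scores (1.0 + 0.4) / 2 = 0.7, and PLK1 has no evidence, so the pathway score is (1.5 + 0.7) / 2 = 1.1 across two genes and three omics layers.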
Use ToolUniverse enrichment tools:
# Get pathways for gene set
from tooluniverse import ToolUniverse

tu = ToolUniverse()

# Enrichment for genes dysregulated in ANY omics
all_dysregulated_genes = set()
all_dysregulated_genes.update(rnaseq_degs)
all_dysregulated_genes.update(diff_proteins)
all_dysregulated_genes.update(methylation_dmgs)

# Run enrichment
enrichment = tu.run_one_function({
    "name": "enrichr_enrich",
    "arguments": {
        "gene_list": ",".join(all_dysregulated_genes),
        "library": "KEGG_2021_Human"
    }
})

# Score each pathway with multi-omics evidence
for pathway in enrichment['data']['results']:
    pathway_genes = pathway['genes']
    pathway['multi_omics_score'] = integrate_pathway_evidence(
        omics_results, pathway_genes
    )
Objective: Identify multi-omics signatures for disease classification.
Feature selection across omics:
def select_multiomics_features(X_dict, y, n_features=50):
    """
    Select top features across omics for classification.

    X_dict: {
        'rnaseq': DataFrame (samples x genes),
        'proteomics': DataFrame (samples x proteins),
        'methylation': DataFrame (samples x CpGs)
    }
    y: Target labels (disease vs control)

    Returns: Selected features per omics
    """
    from sklearn.feature_selection import SelectKBest, f_classif

    selected_features = {}

    for omics_type, X in X_dict.items():
        selector = SelectKBest(f_classif, k=min(n_features, X.shape[1]))
        selector.fit(X, y)

        # Get selected feature names
        selected_idx = selector.get_support()
        selected_features[omics_type] = X.columns[selected_idx].tolist()

    return selected_features
Multi-omics classification:
import pandas as pd


def multiomics_classification(X_dict, y, selected_features):
    """
    Train classifier using multi-omics features.
    """
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Concatenate selected features from each omics (samples x features)
    X_combined = []
    for omics_type, features in selected_features.items():
        X_combined.append(X_dict[omics_type][features])

    X_combined = pd.concat(X_combined, axis=1)

    # Train classifier and estimate performance with 5-fold cross-validation
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X_combined, y, cv=5, scoring='roc_auc')

    return {
        'mean_auc': scores.mean(),
        'std_auc': scores.std(),
        'n_features': X_combined.shape[1],
        'features_per_omics': {k: len(v) for k, v in selected_features.items()}
    }
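When features are selected on all samples before cross-validation, label information from the held-out folds can leak into the selection and inflate the AUC. One common way to avoid this is to wrap selection and classifier in a single scikit-learn `Pipeline`, so that selection is refit inside each training fold. The sketch below assumes a single concatenated samples-by-features matrix; the function name and defaults are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def nested_selection_auc(X, y, n_features=50, n_splits=5, random_state=42):
    """Cross-validated AUC with feature selection refit inside each fold.

    X: samples x features (all omics concatenated). y: binary labels.
    Returns (mean AUC, standard deviation of AUC across folds).
    """
    k = min(n_features, X.shape[1])
    pipeline = Pipeline([
        # Selection sees only the training portion of each fold
        ("select", SelectKBest(f_classif, k=k)),
        ("clf", RandomForestClassifier(n_estimators=100, random_state=random_state)),
    ])
    # Stratified folds keep the disease/control ratio stable across splits
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std()
```

This variant selects across the concatenated matrix rather than per omics layer; a per-omics selector could be substituted with a `ColumnTransformer` if balanced representation of each layer matters.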
Generate comprehensive multi-omics report:
# Multi-Omics Integration Report
## Dataset Summary
- **Omics Types**: RNA-seq, Proteomics, Methylation, CNV
- **Common Samples**: 45 patients (30 disease, 15 control)
- **Features**: 15,000 genes, 5,000 proteins, 450K CpGs, 20K CNV regions
## Cross-Omics Correlation
### RNA-Protein Correlation
- **Overall correlation**: r = 0.52 (expected: 0.4-0.6)
- **Highly correlated**: 3,245 genes (45%)
- **Discordant genes**: 890 genes (post-transcriptional regulation)
### Methylation-Expression
- **Promoter methylation**: Anticorrelation r = -0.41
- **Epigenetically regulated genes**: 1,256 genes (p < 0.01)
- **Example**: BRCA1 promoter hypermethylation → 3-fold reduced expression
### CNV-Expression Dosage Effect
- **Genes with dosage effect**: 445 genes (r > 0.5, p < 0.01)
- **Example**: MYC amplification (3 copies) → 2.8-fold increased expression
## Multi-Omics Clustering
### MOFA+ Analysis
- **Factor 1** (25% variance): Cell cycle genes (RNA + protein)
- **Factor 2** (18% variance): Immune signature (RNA + methylation)
- **Factor 3** (15% variance): Metabolic reprogramming (RNA + metabolites)
### Patient Subtypes
- **Subtype 1** (n=18): High proliferation, MYC amplification
- **Subtype 2** (n=15): Immune-enriched, hypomethylation
- **Subtype 3** (n=12): Metabolic dysregulation, mitochondrial dysfunction
## Pathway Integration
### Top Dysregulated Pathways (Multi-Omics Score)
1. **Cell Cycle** (score: 8.5) - RNA (↑), Protein (↑), CNV (amplification)
2. **Immune Response** (score: 7.2) - RNA (↑), Methylation (hypo)
3. **Glycolysis** (score: 6.8) - RNA (↑), Metabolites (↑)
## Multi-Omics Biomarkers
### Classification Performance
- **AUC**: 0.92 ± 0.04 (5-fold CV)
- **Features**: 50 total (20 RNA, 15 protein, 10 methylation, 5 CNV)
- **Top biomarkers**:
  - MYC expression (RNA)
  - CDK1 protein abundance
  - BRCA1 promoter methylation
  - TP53 CNV status
## Biological Interpretation
The multi-omics analysis reveals three distinct disease subtypes driven by different molecular mechanisms:
1. **Proliferative subtype**: Characterized by MYC amplification driving coordinated upregulation of cell cycle genes at both RNA and protein levels.
2. **Immune subtype**: Hypomethylation of immune genes leading to increased expression and T-cell infiltration.
3. **Metabolic subtype**: Shift from oxidative phosphorylation to glycolysis, with concordant changes in enzyme expression and metabolite levels.
These subtypes may respond differently to targeted therapies.
This skill orchestrates multiple specialized skills:
| Skill | Used For | Phase |
|---|---|---|
| tooluniverse-rnaseq-deseq2 | Load and analyze RNA-seq data | Phase 1, 4 |
| tooluniverse-epigenomics | Methylation analysis, ChIP-seq peaks | Phase 1, 4 |
| tooluniverse-variant-analysis | CNV and SNV processing | Phase 1, 3, 4 |
| tooluniverse-protein-interactions | Protein network context | Phase 6 |
| tooluniverse-gene-enrichment | Pathway enrichment | Phase 6 |
| tooluniverse-expression-data-retrieval | Public omics data retrieval | Phase 1 |
| tooluniverse-target-research | Gene/protein annotation | Phase 3, 8 |
Question: "Integrate TCGA breast cancer RNA-seq, proteomics, methylation, and CNV data"
Workflow:
Question: "How do GWAS variants affect gene expression through methylation?"
Workflow:
Question: "Predict drug response using multi-omics profiles"
Workflow:
For precision medicine applications where patient stratification is the goal.
Build integrated networks combining PPI, co-expression, and regulatory interactions.
Longitudinal multi-omics data (time-series or treatment response).
Spatial transcriptomics + proteomics for tissue architecture.
| Component | Requirement |
|---|---|
| Omics types | At least 2 omics datasets |
| Common samples | At least 10 samples across omics |
| Cross-correlation | Pearson/Spearman correlation computed |
| Clustering | At least one method (MOFA+, NMF, or SNF) |
| Pathway integration | Enrichment with multi-omics evidence scores |
| Report | Summary, correlations, clusters, pathways, biomarkers |
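The first two rows of the table lend themselves to an automated pre-flight check. The sketch below is a hypothetical helper, not part of any ToolUniverse skill, that reports which minimum requirements are unmet for a dictionary of features-by-samples DataFrames.

```python
from typing import Dict, List

import pandas as pd


def check_minimum_requirements(
    omics_data_dict: Dict[str, pd.DataFrame],
    min_omics: int = 2,
    min_common_samples: int = 10,
) -> List[str]:
    """Return a list of unmet requirements; an empty list means the inputs pass."""
    problems = []

    if len(omics_data_dict) < min_omics:
        problems.append(
            f"Need at least {min_omics} omics datasets, got {len(omics_data_dict)}"
        )

    # Samples are columns; count those shared by every dataset
    sample_sets = [set(df.columns) for df in omics_data_dict.values()]
    common = set.intersection(*sample_sets) if sample_sets else set()
    if len(common) < min_common_samples:
        problems.append(
            f"Need at least {min_common_samples} common samples, got {len(common)}"
        )

    return problems
```

Running this check before Phase 4 avoids computing correlations or clusters on a cohort that is too small to interpret.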
Methods:
ToolUniverse Skills: