$3c
Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse. Designed for both full scRNA-seq workflows (raw counts to annotated cell types) and targeted expression-level analyses (per-cell-type DE, correlation, ANOVA, clustering).
IMPORTANT: This skill handles complex multi-workflow analysis. Most implementation details have been moved to references/ for progressive disclosure. This document focuses on high-level decision-making and workflow orchestration.
Apply when users:
BixBench Coverage: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)
NOT for (use other skills instead):
tooluniverse-rnaseq-deseq2tooluniverse-gene-enrichmenttooluniverse-variant-analysistooluniverse-statistical-modeling# Core (MUST be installed)
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.multitest import multipletests
# Enrichment (for GO/KEGG/Reactome follow-up)
import gseapy as gp
# Optional
import harmonypy # batch correction
Installation:
pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels
START: User question about scRNA-seq data
│
├─ Q1: What type of analysis is needed?
│ │
│ ├─ FULL PIPELINE (raw counts → annotated clusters)
│ │ └─ Workflow: QC → Normalize → HVG → PCA → Cluster → Annotate → DE
│ │ See: references/scanpy_workflow.md
│ │
│ ├─ DIFFERENTIAL EXPRESSION (per-cell-type comparison)
│ │ └─ Workflow: Load → Normalize → Per-CT DE → Report
│ │ Pattern: Most common BixBench pattern (bix-33)
│ │ See: Section "Per-Cell-Type Differential Expression" below
│ │
│ ├─ CORRELATION ANALYSIS (gene property vs expression)
│ │ └─ Workflow: Load → Filter genes → Compute correlation
│ │ Pattern: Gene length vs expression (bix-22)
│ │ See: Section "Statistical Analysis on Expression Data" below
│ │
│ ├─ CLUSTERING & PCA (expression matrix analysis)
│ │ └─ Workflow: Load → Transform → PCA/Cluster → Report
│ │ See: references/clustering_guide.md
│ │
│ ├─ CELL COMMUNICATION (ligand-receptor interactions)
│ │ └─ Workflow: Load → Get L-R pairs → Score → Identify signaling
│ │ See: references/cell_communication.md (DETAILED)
│ │
│ └─ TRAJECTORY ANALYSIS (pseudotime)
│ └─ Workflow: Load → Normalize → Trajectory → Pseudotime
│ See: references/trajectory_analysis.md
│
├─ Q2: What data format is available?
│ ├─ h5ad file → sc.read_h5ad() → Check contents (counts, metadata, clusters)
│ ├─ 10X files → sc.read_10x_mtx() or sc.read_10x_h5()
│ ├─ CSV/TSV → pd.read_csv() → Convert to AnnData (check orientation!)
│ └─ Other → See: references/scanpy_workflow.md "Data Loading"
│
└─ Q3: Are there pre-computed results to use?
├─ Has cell type annotations → Skip clustering, go to analysis
├─ Has PCA/UMAP → Skip dimensionality reduction
├─ Has DE results → Skip DE, analyze results
└─ Raw counts only → Full pipeline needed
Question: "Which immune cell type has the most DEGs after treatment?"
Workflow:
import scanpy as sc
# Load and normalize
adata = sc.read_h5ad("data.h5ad")
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Per-cell-type DE
cell_types = adata.obs['cell_type'].unique()
de_results = {}
for ct in cell_types:
adata_ct = adata[adata.obs['cell_type'] == ct].copy()
# Check sufficient cells
n_treat = (adata_ct.obs['condition'] == 'treatment').sum()
n_ctrl = (adata_ct.obs['condition'] == 'control').sum()
if n_treat < 3 or n_ctrl < 3:
continue
# Run DE
sc.tl.rank_genes_groups(adata_ct, groupby='condition',
groups=['treatment'], reference='control',
method='wilcoxon')
df = sc.get.rank_genes_groups_df(adata_ct, group='treatment')
# Count significant
sig = df[df['pvals_adj'] < 0.05]
de_results[ct] = {'n_sig': len(sig), 'results': df}
print(f"{ct}: {len(sig)} DEGs")
# Answer: Which has most?
top_ct = max(de_results, key=lambda x: de_results[x]['n_sig'])
print(f"Answer: {top_ct} ({de_results[top_ct]['n_sig']} DEGs)")
BixBench: bix-33
See: references/scanpy_workflow.md "Differential Expression"
Question: "What is the Pearson correlation between gene length and expression in CD4 T cells?"
Workflow:
import scanpy as sc
import pandas as pd
import numpy as np
from scipy import stats
from scipy.sparse import issparse
# Load data
adata = sc.read_h5ad("data.h5ad")
# Load gene annotations
gene_info = pd.read_csv("gene_info.tsv", sep='\t', index_col=0)
common = adata.var_names.intersection(gene_info.index)
adata.var['gene_length'] = gene_info.loc[common, 'gene_length'].reindex(adata.var_names)
adata.var['gene_type'] = gene_info.loc[common, 'gene_type'].reindex(adata.var_names)
# Filter to protein-coding genes
mask = adata.var['gene_type'] == 'protein_coding'
adata_pc = adata[:, mask].copy()
# Per-cell-type correlation
cell_types = ['CD4 T cells', 'CD8 T cells', 'CD14 Monocytes'] # etc.
for ct in cell_types:
adata_ct = adata_pc[adata_pc.obs['cell_type'] == ct]
# Mean expression per gene
X = adata_ct.X.toarray() if issparse(adata_ct.X) else adata_ct.X
mean_expr = np.mean(X, axis=0)
gene_lengths = adata_ct.var['gene_length'].values
# Remove NaN
valid = ~np.isnan(gene_lengths) & ~np.isnan(mean_expr)
# Pearson correlation
r, p = stats.pearsonr(gene_lengths[valid], mean_expr[valid])
print(f"{ct}: r = {r:.6f}, p = {p:.2e}, n = {valid.sum()} genes")
BixBench: bix-22
See: SKILL_OLD.md "Phase 6: Statistical Analysis on Expression Data"
Question: "What percentage of variance is explained by PC1 after log10 transform?"
Workflow:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
# Load expression matrix
df = pd.read_csv("expression.csv", index_col=0)
# Orient: samples as rows, genes as columns
if df.shape[0] > df.shape[1] * 5:
df = df.T # Genes were rows, transpose
# Log10 transform with pseudocount
X = np.log10(df.values + 1)
# Run PCA
n_components = min(X.shape[0], X.shape[1])
pca = PCA(n_components=n_components)
pca.fit(X)
# Variance explained
print(f"PC1: {pca.explained_variance_ratio_[0]*100:.2f}% variance")
print(f"PC1+PC2: {sum(pca.explained_variance_ratio_[:2])*100:.2f}%")
print(f"Top 10 PCs: {sum(pca.explained_variance_ratio_[:10])*100:.2f}%")
BixBench: bix-27
See: references/clustering_guide.md "PCA Analysis"
Question: "What is the t-statistic comparing LFCs between CD4/CD8 and other cell types?"
Workflow:
from scipy import stats
# After running per-cell-type DE (Pattern 1):
# Extract LFCs for different cell type groups
# Group 1: CD4/CD8 cells
cd4_lfc = de_results['CD4 T cells']['results']['log2FoldChange'].values
cd8_lfc = de_results['CD8 T cells']['results']['log2FoldChange'].values
cd4_cd8_lfc = np.concatenate([cd4_lfc, cd8_lfc])
# Group 2: Other cells
other_lfc = []
for ct in ['CD14 Monocytes', 'NK cells', 'B cells']:
other_lfc.append(de_results[ct]['results']['log2FoldChange'].values)
other_lfc = np.concatenate(other_lfc)
# Welch's t-test (unequal variances)
t_stat, p_val = stats.ttest_ind(cd4_cd8_lfc, other_lfc, equal_var=False)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_val:.4e}")
BixBench: bix-31
See: SKILL_OLD.md "Phase 6.3: T-Tests Between Groups"
Question: "What is the F-statistic for miRNA expression across immune cell types?"
Workflow:
import pandas as pd
from scipy import stats
# Load miRNA expression
df = pd.read_csv("mirna_expr.csv", index_col=0)
meta = pd.read_csv("metadata.csv", index_col=0)
# Exclude PBMCs
meta_filtered = meta[meta['cell_type'] != 'PBMC']
df_filtered = df[meta_filtered.index]
# Group by cell type
cell_types = meta_filtered['cell_type'].unique()
groups = {}
for ct in cell_types:
samples = meta_filtered[meta_filtered['cell_type'] == ct].index
groups[ct] = df_filtered[samples].values.flatten()
# One-way ANOVA
f_stat, p_val = stats.f_oneway(*groups.values())
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_val:.4e}")
BixBench: bix-36
See: SKILL_OLD.md "Phase 6.4: ANOVA Across Groups"
Question: "Which ligand-receptor interactions are strongest between tumor and T cells?"
Workflow:
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()
# Step 1: Get ligand-receptor pairs from OmniPath
result = tu.run_tool(
"OmniPath_get_ligand_receptor_interactions",
databases="CellPhoneDB,CellChatDB"
)
lr_pairs = pd.DataFrame(result['data']['interactions'])
# Step 2: Filter to expressed pairs
# (genes present in dataset, mean expression > 0.05)
expressed_lr = lr_pairs[
lr_pairs['source_genesymbol'].isin(adata.var_names) &
lr_pairs['target_genesymbol'].isin(adata.var_names)
]
# Step 3: Score communication between cell types
# (mean ligand expr in sender * mean receptor expr in receiver)
communication_scores = score_cell_communication(
adata, expressed_lr, cell_type_col='cell_type'
)
# Step 4: Filter to tumor-T cell interactions
tumor_tcell = communication_scores[
((communication_scores['sender'] == 'Tumor') &
(communication_scores['receiver'].str.contains('T cell'))) |
((communication_scores['receiver'] == 'Tumor') &
(communication_scores['sender'].str.contains('T cell')))
]
# Step 5: Top interactions
top_interactions = tumor_tcell.nlargest(20, 'score')
print(top_interactions[['sender', 'receiver', 'ligand', 'receptor', 'score']])
See: references/cell_communication.md (COMPLETE workflow with all helper functions)
For users familiar with Seurat (R):
| Operation | Seurat (R) | Scanpy (Python) |
|---|---|---|
| Load data | Read10X() | sc.read_10x_mtx() |
| Normalize | NormalizeData() | sc.pp.normalize_total() + sc.pp.log1p() |
| Find HVGs | FindVariableFeatures() | sc.pp.highly_variable_genes() |
| Scale | ScaleData() | sc.pp.scale() |
| PCA | RunPCA() | sc.tl.pca() |
| Neighbors | FindNeighbors() | sc.pp.neighbors() |
| Cluster | FindClusters() | sc.tl.leiden() or sc.tl.louvain() |
| UMAP | RunUMAP() | sc.tl.umap() |
| Find markers | FindMarkers() | sc.tl.rank_genes_groups() |
| DE test | FindMarkers(test.use="wilcox") | method='wilcoxon' |
| Batch correction | RunHarmony() | harmonypy.run_harmony() |
See: references/seurat_workflow.md for complete Seurat → Scanpy translation
See: references/cell_communication.md for complete OmniPath integration examples
AnnData expects: cells/samples as rows (obs), genes as columns (var)
import scanpy as sc
import pandas as pd
import anndata as ad
# Load h5ad (already oriented)
adata = sc.read_h5ad("data.h5ad")
# Load CSV/TSV (check orientation!)
df = pd.read_csv("counts.csv", index_col=0)
# Heuristic: If genes > samples by 5x, transpose
if df.shape[0] > df.shape[1] * 5:
print("Transposing: genes were rows")
df = df.T
adata = ad.AnnData(df)
meta = pd.read_csv("metadata.csv", index_col=0)
# Align indices
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
for col in meta.columns:
adata.obs[col] = meta.loc[common, col]
See: references/scanpy_workflow.md "Phase 1: Data Loading"
# QC metrics
adata.var['mt'] = adata.var_names.str.startswith(('MT-', 'mt-'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
# Filter cells
sc.pp.filter_cells(adata, min_genes=200) # Min genes per cell
adata = adata[adata.obs['pct_counts_mt'] < 20].copy() # Max mito %
# Filter genes
sc.pp.filter_genes(adata, min_cells=3) # Min cells per gene
print(f"After QC: {adata.n_obs} cells x {adata.n_vars} genes")
See: references/scanpy_workflow.md "Phase 2: Quality Control"
Q: What type of DE analysis?
Single-Cell DE (many cells per condition):
├─ Use: sc.tl.rank_genes_groups()
├─ Methods: wilcoxon (default), t-test, logreg
├─ Best for: Per-cell-type DE, marker gene finding
└─ See: references/scanpy_workflow.md "Differential Expression"
Pseudo-Bulk DE (aggregate counts by sample):
├─ Use: DESeq2 via PyDESeq2
├─ Best for: Sample-level comparisons, replicates
└─ See: SKILL_OLD.md "Phase 5.3: DESeq2-based DE"
Statistical Tests Only:
├─ Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
├─ Best for: Correlation, ANOVA, t-tests on summaries
└─ See: "Statistical Analysis on Expression Data" below
For BixBench questions requiring specific statistical tests:
from scipy import stats
# Gene property vs expression
r, p = stats.pearsonr(gene_lengths, mean_expression)
r_s, p_s = stats.spearmanr(gene_lengths, mean_expression)
# Welch's t-test (unequal variance)
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)
# Student's t-test (equal variance)
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=True)
# One-way ANOVA across multiple groups
f_stat, p_val = stats.f_oneway(group1, group2, group3, ...)
from statsmodels.stats.multitest import multipletests
# Benjamini-Hochberg (FDR)
reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
# Bonferroni
reject, pvals_adj, _, _ = multipletests(pvals, method='bonferroni')
See: SKILL_OLD.md "Phase 6: Statistical Analysis on Expression Data" for complete examples
# Find marker genes for each cluster
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
# Get results for cluster 0
markers = sc.get.rank_genes_groups_df(adata, group='0')
top_markers = markers.head(10)
# Annotate clusters using markers
marker_dict = {
'T cells': ['CD3D', 'CD3E', 'CD8A'],
'B cells': ['CD19', 'MS4A1', 'CD79A'],
'Monocytes': ['CD14', 'LYZ', 'S100A9'],
}
# Score and assign cell types
See: references/marker_identification.md for complete workflow
import harmonypy
# After PCA
sc.tl.pca(adata, n_comps=50)
# Run Harmony on PCA
ho = harmonypy.run_harmony(
adata.obsm['X_pca'][:, :30],
adata.obs,
'batch', # batch column
random_state=0
)
# Store corrected PCs
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
# Re-cluster on corrected PCs
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
See: references/scanpy_workflow.md "Batch Correction"
Always extract the specific answer to the user's question:
# Example: "Which cell type has the most DEGs?"
report = f"""
# Analysis Results
## Per-Cell-Type Differential Expression
| Cell Type | Significant DEGs (padj < 0.05) |
|-----------|-------------------------------|
{chr(10).join([f"| {ct} | {res['n_sig']} |" for ct, res in de_results.items()])}
## Answer
**{top_ct}** has the highest number of significantly differentially expressed
genes with **{de_results[top_ct]['n_sig']} DEGs** (Wilcoxon test, BH-corrected
p < 0.05).
"""
| Issue | Solution |
|---|---|
ModuleNotFoundError: leidenalg | pip install leidenalg |
| Sparse matrix errors | Use .toarray(): X = adata.X.toarray() if issparse(adata.X) else adata.X |
| Wrong matrix orientation | Check: more genes than samples? Transpose if needed |
| NaN in correlation | Filter: valid = ~np.isnan(x) & ~np.isnan(y) |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Gene names don't match | Use MyGene for ID conversion |
| Memory error (large datasets) | Use sc.pp.highly_variable_genes() to reduce features |
See: references/troubleshooting.md for detailed solutions
Core Workflows:
Advanced Topics:
Utility Scripts:
import scanpy as sc
# 1. Load data
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
# 2. QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# 3. Normalize
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
# 4. HVG + PCA
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata, n_comps=50)
# 5. Cluster
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
# 6. Find markers
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
markers = sc.get.rank_genes_groups_df(adata, group='0')
# 7. Annotate (manual or automatic)
# 8. Per-cell-type DE (if conditions present)
# 9. Cell communication analysis (if needed)
See: references/scanpy_workflow.md for detailed explanations of each step
This skill provides:
BixBench Coverage: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)
For detailed workflows, see references/ directory.