Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse. Designed for both full scRNA-seq workflows (raw counts to annotated cell types) and targeted expression-level analyses (per-cell-type DE, correlation, ANOVA, clustering).

IMPORTANT: This skill handles complex multi-workflow analysis. Most implementation details have been moved to references/ for progressive disclosure. This document focuses on high-level decision-making and workflow orchestration.

When to Use This Skill

Apply when users:

Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
Ask about cell type identification, clustering, or annotation
Need differential expression analysis by cell type or condition
Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
Ask about PCA, UMAP, t-SNE for expression data
Need Leiden/Louvain clustering on expression matrices
Want statistical comparisons between cell types (t-test, ANOVA, fold change)
Ask about marker genes for cell populations
Need batch correction (Harmony, ComBat)

START: User question about scRNA-seq data
│
├─ Q1: What type of analysis is needed?
│  │
│  ├─ FULL PIPELINE (raw counts → annotated clusters)
│  │  └─ Workflow: QC → Normalize → HVG → PCA → Cluster → Annotate → DE
│  │     See: references/scanpy_workflow.md
│  │
│  ├─ DIFFERENTIAL EXPRESSION (per-cell-type comparison)
│  │  └─ Workflow: Load → Normalize → Per-CT DE → Report
│  │     Pattern: Most common BixBench pattern (bix-33)
│  │     See: Section "Per-Cell-Type Differential Expression" below
│  │
│  ├─ CORRELATION ANALYSIS (gene property vs expression)
│  │  └─ Workflow: Load → Filter genes → Compute correlation
│  │     Pattern: Gene length vs expression (bix-22)
│  │     See: Section "Statistical Analysis on Expression Data" below
│  │
│  ├─ CLUSTERING & PCA (expression matrix analysis)
│  │  └─ Workflow: Load → Transform → PCA/Cluster → Report
│  │     See: references/clustering_guide.md
│  │
│  ├─ CELL COMMUNICATION (ligand-receptor interactions)
│  │  └─ Workflow: Load → Get L-R pairs → Score → Identify signaling
│  │     See: references/cell_communication.md (DETAILED)
│  │
│  └─ TRAJECTORY ANALYSIS (pseudotime)
│     └─ Workflow: Load → Normalize → Trajectory → Pseudotime
│        See: references/trajectory_analysis.md
│
├─ Q2: What data format is available?
│  ├─ h5ad file → sc.read_h5ad() → Check contents (counts, metadata, clusters)
│  ├─ 10X files → sc.read_10x_mtx() or sc.read_10x_h5()
│  ├─ CSV/TSV → pd.read_csv() → Convert to AnnData (check orientation!)
│  └─ Other → See: references/scanpy_workflow.md "Data Loading"
│
└─ Q3: Are there pre-computed results to use?
   ├─ Has cell type annotations → Skip clustering, go to analysis
   ├─ Has PCA/UMAP → Skip dimensionality reduction
   ├─ Has DE results → Skip DE, analyze results
   └─ Raw counts only → Full pipeline needed
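
The Q3 branch of the decision tree can be sketched as a small inspection helper. This is a minimal illustration, not part of the skill's reference code: the function name `precomputed_results` and the candidate annotation column names are assumptions, while `X_pca`, `X_umap`, and `rank_genes_groups` are the keys Scanpy writes by default. The helper is duck-typed, so it works on any object exposing `.obs.columns`, `.obsm`, and `.uns` the way an AnnData object does.

```python
# Hypothetical column names that might hold cell type annotations;
# real datasets vary, so adjust this tuple to the data at hand.
ANNOTATION_COLUMNS = ("cell_type", "celltype", "cell_types", "annotation")


def precomputed_results(adata, annotation_columns=ANNOTATION_COLUMNS):
    """Report which pipeline stages can be skipped for an AnnData-like object."""
    obs_columns = set(adata.obs.columns)
    return {
        # Cell type labels already present: skip clustering, go to analysis
        "skip_clustering": any(col in obs_columns for col in annotation_columns),
        # Scanpy stores embeddings in .obsm under these default keys
        "skip_dim_reduction": "X_pca" in adata.obsm or "X_umap" in adata.obsm,
        # sc.tl.rank_genes_groups() writes its output to .uns under this key
        "skip_de": "rank_genes_groups" in adata.uns,
    }
```

If all three flags are `False`, the object most likely holds raw counts only and the full pipeline applies.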

Operation	Seurat (R)	Scanpy (Python)
Load data	`Read10X()`	`sc.read_10x_mtx()`
Normalize	`NormalizeData()`	`sc.pp.normalize_total() + sc.pp.log1p()`
Find HVGs	`FindVariableFeatures()`	`sc.pp.highly_variable_genes()`
Scale	`ScaleData()`	`sc.pp.scale()`
PCA	`RunPCA()`	`sc.tl.pca()`
Neighbors	`FindNeighbors()`	`sc.pp.neighbors()`
Cluster	`FindClusters()`	`sc.tl.leiden()` or `sc.tl.louvain()`
UMAP	`RunUMAP()`	`sc.tl.umap()`
Find markers	`FindMarkers()`	`sc.tl.rank_genes_groups()`
DE test	`FindMarkers(test.use="wilcox")`	`sc.tl.rank_genes_groups(..., method='wilcoxon')`
Batch correction	`RunHarmony()`	`harmonypy.run_harmony()`
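
The Scanpy calls in the table above can be chained into a single pass. The following is a rough sketch under stated assumptions rather than the skill's reference workflow: the function name `run_basic_pipeline` and every threshold (`min_genes=200`, `min_cells=3`, `target_sum=1e4`, 2,000 HVGs, 50 PCs, resolution 1.0) are illustrative choices, and the code assumes `scanpy` and `leidenalg` are installed. Scanpy is imported inside the function so the definition itself loads without it.

```python
def run_basic_pipeline(adata, n_top_genes=2000, n_pcs=50, resolution=1.0):
    """Raw counts -> Leiden clusters with marker genes (illustrative defaults)."""
    import scanpy as sc  # lazy import; assumes scanpy and leidenalg are installed

    # Basic QC filters, applied in place (thresholds are assumptions)
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # NormalizeData() equivalent: scale each cell to a fixed total, then log1p
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    adata.raw = adata  # keep log-normalized values of all genes for marker tests

    # FindVariableFeatures() equivalent, then restrict to the selected genes
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    adata = adata[:, adata.var["highly_variable"]].copy()

    # ScaleData(), RunPCA(), FindNeighbors(), FindClusters(), RunUMAP()
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, n_comps=n_pcs)
    sc.pp.neighbors(adata, n_pcs=n_pcs)
    sc.tl.leiden(adata, resolution=resolution)  # labels land in adata.obs["leiden"]
    sc.tl.umap(adata)

    # FindMarkers(test.use="wilcox") equivalent, one cluster versus the rest
    sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
    return adata
```

Because the HVG subset creates a new object, callers should use the returned value rather than the input. Note that `n_pcs` must be smaller than both the number of cells and the number of selected genes.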

Issue	Solution
`ModuleNotFoundError: leidenalg`	`pip install leidenalg`
Sparse matrix errors	Use `.toarray()`: `X = adata.X.toarray() if issparse(adata.X) else adata.X`
Wrong matrix orientation	Check whether rows are genes or cells; AnnData expects cells × genes, so transpose if genes are on the rows
NaN in correlation	Filter: `valid = ~np.isnan(x) & ~np.isnan(y)`
Too few cells for DE	Need >= 3 cells per condition per cell type
Gene names don't match	Use MyGene for ID conversion
Memory error (large datasets)	Use `sc.pp.highly_variable_genes()` to reduce features
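
Several of the fixes in the table above can be wrapped as small helpers. This is a hedged sketch using only NumPy and SciPy; the function names (`to_dense`, `genes_are_rows`, `safe_correlation`, `groups_with_enough_cells`) are hypothetical, and the gene-list overlap test for orientation is one possible heuristic, not a rule taken from the skill.

```python
import numpy as np
from scipy import sparse, stats


def to_dense(X):
    """Return a dense ndarray whether X is sparse or already dense."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


def genes_are_rows(row_labels, col_labels, known_genes):
    """Heuristic: True if row labels match a known gene list more than columns do."""
    known = set(known_genes)
    row_hits = sum(label in known for label in row_labels)
    col_hits = sum(label in known for label in col_labels)
    return row_hits > col_hits


def safe_correlation(x, y, method="pearson"):
    """Correlate x and y after dropping pairs where either value is NaN.

    Returns (r, p_value, n_pairs_used); r and p are NaN if fewer than 3 pairs remain.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~np.isnan(x) & ~np.isnan(y)
    n = int(valid.sum())
    if n < 3:
        return float("nan"), float("nan"), n
    func = stats.pearsonr if method == "pearson" else stats.spearmanr
    r, p = func(x[valid], y[valid])
    return float(r), float(p), n


def groups_with_enough_cells(labels, min_cells=3):
    """List the group labels that have at least min_cells members."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    return values[counts >= min_cells].tolist()
```

When `genes_are_rows` returns `True` for a CSV loaded as a DataFrame, transposing before building the AnnData object puts cells on the rows. For the DE check, `groups_with_enough_cells` would be called once per condition within each cell type.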

Single-Cell Genomics and Expression Matrix Analysis

When to Use This Skill

Core Principles

Required Python Packages

High-Level Workflow Decision Tree

Common Analysis Patterns (BixBench)

Pattern 1: Per-Cell-Type Differential Expression

Pattern 2: Gene Property vs Expression Correlation

Pattern 3: PCA on Expression Matrix

Pattern 4: Statistical Comparison Between Cell Types

Pattern 5: ANOVA Across Cell Types

Pattern 6: Cell-Cell Communication Analysis

Scanpy vs Seurat Equivalents

When to Use ToolUniverse Tools

Gene Annotation and Validation

Cell-Cell Communication (NEW)

Enrichment Analysis (Post-DE)

Data Loading Best Practices

Critical: Matrix Orientation

Load Metadata

Quality Control Checklist

Differential Expression Decision Tree

Statistical Analysis on Expression Data

Pearson/Spearman Correlation

T-Tests

ANOVA

Multiple Testing Correction

Marker Gene Identification

Batch Correction with Harmony

Report Generation

Troubleshooting Common Issues

Reference Documentation

Complete Workflow Example

Summary
