Production-ready RNA-seq differential expression analysis using PyDESeq2. Performs DESeq2 normalization, dispersion estimation, Wald testing, LFC shrinkage, and result filtering. Handles multi-factor designs, multiple contrasts, batch effects, and integrates with gene enrichment (gseapy) and ToolUniverse annotation tools (UniProt, Ensembl, OpenTargets). Supports CSV/TSV/H5AD input formats and any organism. Use when analyzing RNA-seq count matrices, identifying DEGs, performing differential expression with statistical rigor, or answering questions about gene expression changes.
Comprehensive differential expression analysis of RNA-seq count data using PyDESeq2, with integrated enrichment analysis (gseapy) and gene annotation via ToolUniverse.
BixBench Coverage: Validated on 53 BixBench questions across 15 computational biology projects covering RNA-seq, miRNA-seq, and differential expression analysis tasks.
Apply when users:
# Core (MUST be installed)
import pandas as pd
import numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# Enrichment (optional, for GO/KEGG/Reactome)
import gseapy as gp
# ToolUniverse (optional, for gene annotation)
from tooluniverse import ToolUniverse
Installation:
pip install pydeseq2 gseapy pandas numpy scipy anndata
CRITICAL FIRST STEP: Before writing ANY code, parse the question to identify:
*counts*.csv, *metadata*.csv, *.h5adSee references/question_parsing.md for detailed parsing patterns.
ALWAYS inspect metadata for ALL variables, not just what the question mentions!
Many experiments have hidden batch effects (media conditions, sequencing batches, time points) that MUST be included as covariates. Failing to account for these reduces statistical power and can lead to incorrect results.
Decision process:
~condition (only if no batch variables exist)~batch1 + batch2 + condition (covariates first!)~batch + factor1 + factor2 + factor1:factor2Example (critical real-world case):
Metadata columns: [Strain, Media, Replicate]
Question asks: "strain effects"
❌ WRONG: design="~Strain" (ignores Media!)
✅ CORRECT: design="~Media + Strain" (accounts for media variation)
Why: Even though question doesn't mention media, it systematically
affects expression. DESeq2 must remove media effects to properly test strain.
Rule of thumb: If a column has 2+ levels and represents an experimental condition (not sample ID), include it in design.
Load count matrix and metadata, then validate alignment:
import pandas as pd
from scripts.load_count_matrix import load_count_matrix, validate_inputs
# Load data
counts = load_count_matrix("counts.csv")
metadata = pd.read_csv("metadata.csv", index_col=0)
# Validate and align
counts, metadata, issues = validate_inputs(counts, metadata)
Key considerations:
See references/data_loading.md for detailed data handling.
Before choosing design formula, ALWAYS inspect metadata to identify all experimental factors:
# Print metadata structure
print("Metadata columns and levels:")
for col in metadata.columns:
unique_vals = metadata[col].unique()
print(f" {col}: {len(unique_vals)} levels → {list(unique_vals)[:5]}")
# Example output:
# Strain: 4 levels → ['1', '97', '98', '99']
# Media: 3 levels → ['MMGluFeMinus', 'MMGluFePlus', 'Succinate']
# Replicate: 3 levels → ['A', 'B', 'C']
# Decision:
# - Strain: Biological factor (testing)
# - Media: Batch/covariate (MUST include!)
# - Replicate: Biological replicate (don't include as factor)
# Design: ~Media + Strain
This step prevents missing hidden batch effects that could invalidate your analysis.
Execute differential expression analysis:
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# Setup design (set reference level first)
metadata['condition'] = pd.Categorical(
metadata['condition'],
categories=['control', 'treatment'] # First = reference
)
# Run DESeq2
dds = DeseqDataSet(
counts=counts,
metadata=metadata,
design="~condition",
quiet=True
)
dds.deseq2()
# Extract results
stat_res = DeseqStats(dds, contrast=['condition', 'treatment', 'control'], quiet=True)
stat_res.run_wald_test()
stat_res.summary()
# Apply LFC shrinkage (if needed)
stat_res.lfc_shrink(coeff='condition[T.treatment]')
results = stat_res.results_df
When metadata has multiple experimental variables (e.g., strain + media conditions), ALWAYS include covariates:
# Inspect metadata (from Step 2.5) showed:
# - Strain: 4 levels (factor of interest)
# - Media: 3 levels (batch effect - MUST include!)
# Set reference levels for BOTH factors
metadata['media'] = pd.Categorical(
metadata['media'],
categories=['MMGluFeMinus', 'MMGluFePlus', 'Succinate'] # First = reference
)
metadata['strain'] = pd.Categorical(
metadata['strain'],
categories=['1', '97', '98', '99'] # First = reference (JBX1)
)
# Include covariate in design formula
dds = DeseqDataSet(
counts=counts,
metadata=metadata,
design="~media + strain", # Covariate first, then factor!
quiet=True
)
dds.deseq2()
# Extract strain effect (controlling for media)
stat_res = DeseqStats(dds, contrast=['strain', '98', '1'], quiet=True)
stat_res.run_wald_test()
stat_res.summary()
results = stat_res.results_df
Why this matters: DESeq2 will model media effects separately, removing media-driven variance before testing strain differences. This increases statistical power and prevents false positives/negatives.
When to use what:
See references/pydeseq2_workflow.md for complete PyDESeq2 patterns including batch effects, interaction terms, and complex designs.
Apply thresholds from the question:
# Filter DEGs
sig_genes = results[
(results['padj'] < 0.05) &
(results['log2FoldChange'].abs() > 0.5) &
(results['baseMean'] > 10)
]
# Direction-specific filtering
up_genes = sig_genes[sig_genes['log2FoldChange'] > 0]
down_genes = sig_genes[sig_genes['log2FoldChange'] < 0]
Common filtering patterns:
len(sig_genes)results.loc['GENE_NAME', 'log2FoldChange']See references/result_filtering.md for advanced filtering.
Access dispersion estimates:
# Get dispersion data
disp_data = dds.var # Dispersions stored here
# Common question: "genes below threshold prior to fitting"
genewise = dds.var['genewise_dispersions']
count_below = (genewise < 1e-5).sum()
Dispersion column mapping:
genewise_dispersionsfitted_dispersionsMAP_dispersionsdispersionsSee references/dispersion_analysis.md for diagnostics.
Run pathway enrichment on DEGs:
import gseapy as gp
# Prepare gene list
gene_list = sig_genes.index.tolist()
# Run enrichment
enr = gp.enrich(
gene_list=gene_list,
gene_sets='GO_Biological_Process_2023', # or KEGG, Reactome
background=None, # or specify background
outdir=None,
cutoff=0.05,
no_plot=True,
verbose=False
)
# Extract results
top_pathways = enr.results.head(10)
Library selection:
GO_Biological_Process_2023KEGG_2019_MouseKEGG_2021_HumanReactome_2022See references/enrichment_analysis.md for complete enrichment workflows.
Use ToolUniverse ONLY for gene annotation, not analysis:
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()
# Gene ID conversion
result = tu.tools.MyGene_query_genes(query="TP53")
# Gene details
result = tu.tools.ensembl_lookup_gene(
gene_id="ENSG00000141510",
species="homo_sapiens"
)
Do NOT use ToolUniverse for:
Match the question's requested format:
# Numeric precision
round(value, 2) # "2 decimal points"
f"{value:.2E}" # "scientific notation"
# Percentages
f"{value * 100:.1f}%" # "as percentage"
# Counts (no decimals)
int(len(sig_genes)) # "how many genes"
See references/output_formatting.md for all format patterns.
Question: "How many genes show significant DE (padj < 0.05, |log2FC| > 0.5)?"
degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'].abs() > 0.5)]
answer = len(degs)
Question: "What is the log2FC of gene X?"
answer = round(results.loc['GENE_X', 'log2FoldChange'], 2)
Question: "How many genes are upregulated?"
up_degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'] > 0)]
answer = len(up_degs)
Question: "How many genes are uniquely DE in condition A?"
degs_A = set(results_A[results_A['padj'] < 0.05].index)
degs_B = set(results_B[results_B['padj'] < 0.05].index)
unique_A = degs_A - degs_B
answer = len(unique_A)
Question: "How many genes have dispersion below 1e-5 prior to fitting?"
genewise = dds.var['genewise_dispersions']
answer = (genewise < 1e-5).sum()
See references/bixbench_examples.md for all 10 patterns with examples.
| Error | Solution |
|---|---|
| "No matching samples" | Check if counts need transposing; strip whitespace |
| "Dispersion trend did not converge" | Use fit_type='mean' |
| "Contrast not found" | Check metadata['factor'].unique() for exact names |
| "Non-integer counts" | Round to integers OR use t-test for normalized data |
| "NaN in padj" | Independent filtering removed genes; exclude from counts |
See references/troubleshooting.md for complete debugging guide.
Every analysis MUST include:
Data Loading:
DESeq2 Analysis:
Results:
Quality Checks:
This skill uses PyDESeq2 (Python implementation) for differential expression analysis. While PyDESeq2 faithfully implements the DESeq2 algorithm, numerical differences exist between Python and R implementations:
Dispersion Estimation:
For Most Analyses: PyDESeq2 provides accurate, publication-quality results. The differences are in numerical precision, not statistical validity.
Enrichment analysis uses Python gseapy by default. Results may differ from R clusterProfiler due to:
For R clusterProfiler Compatibility: If you need results matching R clusterProfiler exactly (e.g., for benchmark reproducibility):
# Install rpy2 and R packages first:
# pip install rpy2
# In R: install.packages(c("clusterProfiler", "org.Hs.eg.db", "enrichplot"))
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri, vectors
pandas2ri.activate()
# Load R packages
clusterProfiler = importr('clusterProfiler')
orgdb = importr('org.Hs.eg.db')
# Convert Python gene list to R vector
gene_list = sig_genes.index.tolist()
r_genes = vectors.StrVector(gene_list)
# Run enrichGO (R's clusterProfiler)
enrich_result = clusterProfiler.enrichGO(
gene=r_genes,
OrgDb=orgdb.org_Hs_eg_db,
ont='BP', # Biological Process
pAdjustMethod='BH',
pvalueCutoff=0.05,
qvalueCutoff=0.2
)
# Apply simplify to remove redundant terms
simplified = clusterProfiler.simplify(enrich_result, cutoff=0.7, by='p.adjust')
# Convert R result back to pandas
results_df = pandas2ri.rpy2py(simplified)
Default behavior: Uses gseapy (no R dependencies required). This is sufficient for most analyses and provides valid enrichment results.
When to use R clusterProfiler:
Detailed documentation for advanced topics:
Helper scripts for common tasks: