Quality control pipeline for bulk RNA-seq datasets including normalization, PCA, sample correlation, and differential expression.
Load the bulk RNA-seq count matrix from the data path using scanpy (sc.read_h5ad). Print the shape as genes x samples; because sample metadata lives in adata.obs, adata.shape itself is samples x genes. Show the first few rows and the sample metadata (adata.obs).
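A minimal sketch of this loading step, assuming the .h5ad file stores samples as observations (rows) and genes as variables (columns). The helper names `counts_from_anndata` and `load_counts` are hypothetical and not part of the pipeline; `sc.read_h5ad` and `AnnData.to_df()` are the only library calls relied on.

```python
import pandas as pd


def counts_from_anndata(adata) -> pd.DataFrame:
    """Return a genes x samples DataFrame from an AnnData-like object.

    AnnData keeps observations (assumed here to be samples) as rows,
    so the matrix is transposed to get genes as rows.
    """
    return adata.to_df().T


def load_counts(data_path: str):
    """Hypothetical helper: read the file, print a summary, return both views."""
    # scanpy is imported lazily so counts_from_anndata can be used without it
    import scanpy as sc

    adata = sc.read_h5ad(data_path)
    counts = counts_from_anndata(adata)
    print(f"Shape (genes x samples): {counts.shape}")
    print(counts.head())      # first few genes
    print(adata.obs.head())   # sample metadata
    return adata, counts
```

Calling `adata, counts = load_counts(data_path)` would then give a `counts` DataFrame in the genes x samples orientation that the summary statistics expect.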
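import pandas as pd
import numpy as np
# `counts` is assumed to be a genes x samples pandas DataFrame of raw counts
# (for example adata.to_df().T when samples are stored in adata.obs).
total_counts = counts.sum(axis=0)          # library size per sample
detected_genes = (counts > 0).sum(axis=0)  # genes with at least one read, per sample
print(f"Samples: {counts.shape[1]}, Genes: {counts.shape[0]}")
print(f"Median total counts/sample: {total_counts.median():.0f}")
print(f"Median detected genes/sample: {detected_genes.median():.0f}")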
Apply CPM (counts per million) normalization followed by log2(x+1) transform. Store the normalized matrix. Print the shape. skip_save = True for this instruction.
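The normalization can be sketched as follows, assuming `counts` is a genes x samples pandas DataFrame of raw counts. The function name `log2_cpm` and the toy matrix are illustrative only; the name `counts_normalized` follows the pipeline's declared output.

```python
import numpy as np
import pandas as pd


def log2_cpm(counts: pd.DataFrame) -> pd.DataFrame:
    """CPM-normalize a genes x samples count matrix, then apply log2(x + 1)."""
    lib_sizes = counts.sum(axis=0)              # total counts per sample
    cpm = counts.div(lib_sizes, axis=1) * 1e6   # scale each sample to one million
    return np.log2(cpm + 1)


# Toy data for illustration only (3 genes x 2 samples)
counts = pd.DataFrame(
    {"s1": [10, 90, 0], "s2": [500, 250, 250]},
    index=["geneA", "geneB", "geneC"],
)
counts_normalized = log2_cpm(counts)
print(counts_normalized.shape)
```

A sample with zero total counts would produce NaN values here, so such samples are better removed before this step.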
<!-- mode: instruction -->Perform PCA on the normalized count matrix (samples as observations). Plot PC1 vs PC2 colored by condition using Plotly's Python graphing library. Save the figure.
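One possible way to carry out this step is sketched below. It computes PCA with a plain NumPy SVD rather than a dedicated PCA library, which is a substitution made here for self-containment, and it assumes the normalized matrix is a genes x samples DataFrame. The function names, the toy data, and the output file name `pca_pc1_pc2.html` are hypothetical.

```python
import numpy as np
import pandas as pd


def pca_scores(log_expr: pd.DataFrame, n_components: int = 2):
    """PCA via SVD on a genes x samples matrix, with samples as observations."""
    X = log_expr.T.to_numpy(dtype=float)   # samples x genes
    Xc = X - X.mean(axis=0)                # center each gene
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = pd.DataFrame(
        U[:, :n_components] * S[:n_components],
        index=log_expr.columns,
        columns=[f"PC{i + 1}" for i in range(n_components)],
    )
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]  # variance ratio per PC
    return scores, explained


def plot_pca(scores, condition, explained, out_path="pca_pc1_pc2.html"):
    """Scatter PC1 vs PC2 colored by condition and save as standalone HTML."""
    import plotly.express as px  # imported lazily; only needed for plotting

    df = scores.copy()
    df["condition"] = list(condition)   # one label per sample, in sample order
    df["sample"] = df.index
    fig = px.scatter(
        df, x="PC1", y="PC2", color="condition", hover_name="sample",
        labels={"PC1": f"PC1 ({explained[0]:.1%})",
                "PC2": f"PC2 ({explained[1]:.1%})"},
    )
    fig.write_html(out_path)
    return fig


# Toy normalized matrix for illustration only (4 genes x 4 samples)
toy = pd.DataFrame(
    [[1.0, 1.2, 5.0, 5.1],
     [2.0, 2.1, 0.5, 0.4],
     [3.0, 2.9, 3.1, 3.0],
     [0.0, 0.3, 4.0, 4.2]],
    index=["g1", "g2", "g3", "g4"],
    columns=["ctrl_1", "ctrl_2", "trt_1", "trt_2"],
)
scores, explained = pca_scores(toy)
print(scores.round(2))
```

With Plotly installed, `plot_pca(scores, ["ctrl", "ctrl", "trt", "trt"], explained)` would write the figure to the HTML file. In the real pipeline the condition labels would come from a column of adata.obs, whose name the source does not specify.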