This skill should be used when the user asks to "run initial analysis", "analyze single-cell data", "QC my data", "run bioinformatics pipeline", "generate analysis report", "explore my dataset", "do exploratory data analysis", "initial data analysis", or needs to perform quality control, dimensionality reduction, clustering, or marker analysis on single-cell biology data (CyTOF, scRNA-seq, flow cytometry, proteomics).
Automated 7-step analysis pipeline for high-dimensional single-cell biology data with plain-language report generation.
The pipeline auto-detects input data type:
| Data Type | File Formats | Detection Pattern |
|---|---|---|
| scRNA-seq | .h5ad, .h5 (10X), .mtx + barcodes | Gene names, count matrix |
| CyTOF | .csv, .h5ad | Phospho-markers (p.ERK, p.AKT, etc.) |
| Flow cytometry | .fcs, .csv | Surface markers, scatter channels |
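The detection patterns in the table can be approximated with a small heuristic. The following is a minimal sketch only, not the pipeline's actual detection logic: the function name `guess_data_type`, the scatter-channel prefixes, and the `"unknown"` fallback are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical heuristics; the real pipeline's rules may differ.
SCATTER_PREFIXES = ("FSC", "SSC")  # common flow cytometry scatter channels
PHOSPHO_PREFIX = "p."              # e.g. p.ERK, p.AKT as listed in the table


def guess_data_type(path, column_names=()):
    """Guess 'scrnaseq', 'cytof', or 'flow' from the file suffix and column names."""
    suffix = Path(path).suffix.lower()
    if suffix == ".fcs":
        return "flow"
    if suffix in (".h5", ".mtx"):
        return "scrnaseq"
    # .csv and .h5ad are shared between data types, so inspect the column names.
    if any(name.upper().startswith(SCATTER_PREFIXES) for name in column_names):
        return "flow"
    if any(name.startswith(PHOSPHO_PREFIX) for name in column_names):
        return "cytof"
    # Assumption: an .h5ad file without marker hints is treated as scRNA-seq.
    return "scrnaseq" if suffix == ".h5ad" else "unknown"
```

When the heuristic cannot decide, passing `--data-type` explicitly avoids relying on detection.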
Run the complete 7-step analysis:
python3 scripts/run_pipeline.py <input_path> \
[--data-type auto|cytof|scrnaseq|flow] \
[--subsample 500] \
[--output-dir ./analysis_output] \
[--report-style clinical|technical]
Arguments:
- `input_path`: Path to data file (.h5ad, .csv, .h5) or directory of CSV files
- `--data-type`: Data type override (default: `auto` for auto-detection)
- `--subsample`: Max cells per group for tractable analysis (default: 500)
- `--output-dir`: Output directory (default: `./analysis_output`)
- `--report-style`: `clinical` for plain-language medical summaries, `technical` for bioinformatics detail (default: `clinical`)

Output Files:
analysis_output/
├── figures/ # All generated plots (PNG)
├── processed/
│ └── adata_processed.h5ad # Processed AnnData object
├── report.html # Complete analysis report
└── analysis_summary.json # Machine-readable summary statistics
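After a run, it can be useful to confirm that the output layout shown above was actually produced. The helper below is a hypothetical sketch (the name `missing_outputs` is not part of the pipeline) that checks only for the paths listed in the tree.

```python
from pathlib import Path

# Paths taken from the documented output layout, relative to the output directory.
EXPECTED_OUTPUTS = (
    "figures",
    "processed/adata_processed.h5ad",
    "report.html",
    "analysis_summary.json",
)


def missing_outputs(output_dir="./analysis_output"):
    """Return the expected output paths that do not exist under output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).exists()]
```

An empty list means every documented file and directory is present.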
For custom workflows, import individual step modules:
from step1_load_data import load_data
from step2_qc import run_qc
from step3_normalize import normalize_data
from step4_dim_reduction import run_dim_reduction
from step5_clustering import run_clustering
from step6_marker_analysis import run_marker_analysis
from step7_report import generate_report
Each step function accepts an AnnData object and returns the modified AnnData plus a dictionary of results/figures.
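Because every step follows the same contract (object in, modified object plus a results dictionary out), steps can be chained generically. The sketch below illustrates that contract with stand-in functions operating on a plain list; `run_steps`, `fake_qc`, and `fake_normalize` are hypothetical names, and the real step functions may take additional arguments.

```python
def run_steps(data, steps):
    """Apply each step in order and collect each step's results by function name."""
    all_results = {}
    for step in steps:
        data, results = step(data)
        all_results[step.__name__] = results
    return data, all_results


# Stand-in steps for illustration only; real steps would operate on AnnData.
def fake_qc(values):
    kept = [v for v in values if v >= 0]
    return kept, {"n_removed": len(values) - len(kept)}


def fake_normalize(values):
    total = sum(values) or 1
    return [v / total for v in values], {"total": total}


processed, results = run_steps([3, -1, 1], [fake_qc, fake_normalize])
```

Keeping the results keyed by step name makes it straightforward to pass them on to a reporting step later.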
Load data from the supported formats into AnnData. For directories of CSVs (e.g., CyTOF per-cell-line files), automatically concatenate the files and attach their metadata. Apply subsampling if the dataset is large.
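Per-group subsampling, as controlled by `--subsample`, can be sketched as follows. This is an illustrative version under stated assumptions: the function name `subsample_indices`, the fixed seed, and the choice to return row indices are not taken from the pipeline.

```python
import random
from collections import defaultdict


def subsample_indices(group_labels, max_per_group=500, seed=0):
    """Return sorted row indices, keeping at most max_per_group rows per group."""
    rng = random.Random(seed)  # fixed seed so repeated runs select the same cells
    by_group = defaultdict(list)
    for index, label in enumerate(group_labels):
        by_group[label].append(index)

    kept = []
    for indices in by_group.values():
        if len(indices) > max_per_group:
            indices = rng.sample(indices, max_per_group)
        kept.extend(indices)
    return sorted(kept)
```

Groups smaller than the cap are kept in full, so rare groups are not thinned further.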
Data-type-aware QC:
adata.raw for downstream differential analysis.

PCA with scree plot and loadings analysis, followed by UMAP visualization colored by all available metadata and key markers.
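A scree plot shows the share of total variance captured by each principal component. The values behind such a plot can be computed as in the sketch below, which assumes NumPy is available and uses a plain SVD; the pipeline itself may rely on a library PCA routine instead, and `explained_variance_ratio` is a hypothetical helper name.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance explained by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA operates on mean-centered features
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to per-component variance
    return variances / variances.sum()
```

The ratios are returned in descending order, which is the order in which a scree plot displays them.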
Leiden graph-based clustering at multiple resolutions. Evaluate with ARI, NMI, Silhouette scores if reference labels exist. Visualize cluster composition across metadata categories.
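To make the ARI (adjusted Rand index) comparison concrete, the sketch below computes it from pair counts using only the standard library. It is meant to show what the score measures, not how the pipeline computes it; a library implementation would normally be used in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between reference labels and cluster assignments."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # trivial case: no pairs of cells to compare

    # Count pairs of cells that share a label in both, in the reference, and in the prediction.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (maximum - expected)
```

A score of 1 means the clustering matches the reference labels up to renaming, values near 0 indicate chance-level agreement, and negative values indicate agreement worse than chance.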
Wilcoxon rank-sum differential expression per cluster. Marker correlation heatmap. If treatment/condition metadata exists: treatment response analysis with boxplots and effect heatmaps. If time metadata exists: time-course dynamics.
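The Wilcoxon rank-sum test compares a marker's values in one cluster against all other cells by ranking the pooled values. The sketch below uses the normal approximation without tie correction and only the standard library; it illustrates the statistic under those assumptions and is not the pipeline's implementation. The name `rank_sum_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test (normal approximation, no tie correction)."""
    pooled = list(group_a) + list(group_b)
    ordered = sorted((value, idx) for idx, value in enumerate(pooled))

    # Assign 1-based ranks, giving tied values the average of their positions.
    ranks = [0.0] * len(pooled)
    start = 0
    while start < len(ordered):
        end = start
        while end + 1 < len(ordered) and ordered[end + 1][0] == ordered[start][0]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[ordered[position][1]] = average_rank
        start = end + 1

    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    std = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - expected) / std
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

A negative `z` means the first group tends to have lower values than the second. With many markers and clusters, the resulting p-values would also need multiple-testing correction.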
Generate an HTML report with embedded figures and interpretations. Two styles: clinical (plain-language medical summaries) and technical (bioinformatics detail).
For detailed guidance, consult:
- references/plot_interpretation_guide.md - How to explain each plot type to non-experts
- references/cytof_specifics.md - CyTOF-specific QC, normalization, and markers
- references/scrnaseq_specifics.md - scRNA-seq-specific processing details
- references/statistical_methods.md - Plain-language glossary of statistical methods

nan_to_num after z-score scaling: constant-value features produce NaN.
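The NaN issue arises because a constant feature has zero standard deviation, so z-scoring divides zero by zero. A minimal sketch of the safeguard, assuming NumPy and a hypothetical helper name `zscore_columns`:

```python
import numpy as np


def zscore_columns(X):
    """Z-score each column, replacing NaN/inf from constant columns with 0."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # A constant column has std == 0, so (x - mean) / std is 0/0 = NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        scaled = (X - mean) / std
    return np.nan_to_num(scaled, nan=0.0, posinf=0.0, neginf=0.0)
```

Mapping the constant columns to 0 keeps them neutral in downstream steps such as PCA instead of propagating NaN through the whole matrix.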