A guide to OmicVerse's alignment module: SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines, covering both bulk and single-cell RNA-seq workflows.
OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the ov.alignment module. This skill covers:
prefetch and fqdump (fasterq-dump wrapper)
fastp for adapter trimming and QC reports
STAR aligner with auto-index building
featureCount (subread featureCounts wrapper)
ref and count via kb-python (kallisto/bustools)
parallel_fastq_dump
All functions share a common CLI infrastructure (_cli_utils.py) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.
With auto_install=True (default), missing tools are installed via mamba/conda on demand.
Tools used: prefetch, vdb-validate, fasterq-dump, fastp, STAR, samtools, featureCounts, pigz, gzip.
For the single-cell path, make sure kb-python is installed: pip install kb-python.
SRA data download (ov.alignment.prefetch + ov.alignment.fqdump)
Use prefetch first for reliable downloads with integrity validation (vdb-validate).
Then convert to FASTQ with fqdump. It auto-detects single-end vs paired-end.
fqdump can also work directly from SRR accessions without prefetch.
import omicverse as ov
# Step 1: Prefetch SRA files (optional but recommended)
pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)
# Step 2: Convert to FASTQ
fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'],
                         output_dir='fastq', sra_dir='prefetch',
                         gzip=True, threads=8, jobs=4)
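Before starting a long download, it can help to sanity-check the accession list. The sketch below assumes the inputs are run-level accessions (SRR, ERR or DRR followed by digits), as in the example above; clean_accessions is a hypothetical helper, not part of omicverse.

```python
import re

# Run-level accessions: SRR (NCBI), ERR (ENA) or DRR (DDBJ) followed by digits.
_RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def clean_accessions(accessions):
    """Return unique, validated run accessions in first-seen order."""
    seen = set()
    cleaned = []
    for raw in accessions:
        acc = raw.strip().upper()
        if not _RUN_ACCESSION.match(acc):
            raise ValueError(f"not a run accession: {raw!r}")
        if acc not in seen:
            seen.add(acc)
            cleaned.append(acc)
    return cleaned


accessions = clean_accessions(["SRR1234567", " srr1234568 ", "SRR1234567"])
print(accessions)  # ['SRR1234567', 'SRR1234568']
```

The cleaned list can then be passed to prefetch and fqdump in place of the literal list.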
FASTQ quality control (ov.alignment.fastp)
Samples are tuples of (sample_name, fq1_path, fq2_path_or_None).
samples = [
    ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
    ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
]
clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
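Writing the sample tuples by hand gets tedious with many runs. The following is a minimal sketch that builds them from a directory, assuming the layout used in the example above (one subdirectory per run holding NAME_1.fastq.gz and NAME_2.fastq.gz) and assuming single-end runs are stored as NAME.fastq.gz. collect_samples is a hypothetical helper, not an omicverse function.

```python
from pathlib import Path


def collect_samples(fastq_dir):
    """Build (sample_name, fq1, fq2_or_None) tuples, one per run subdirectory."""
    samples = []
    for run_dir in sorted(Path(fastq_dir).iterdir()):
        if not run_dir.is_dir():
            continue
        name = run_dir.name
        fq1 = run_dir / f"{name}_1.fastq.gz"
        fq2 = run_dir / f"{name}_2.fastq.gz"
        single = run_dir / f"{name}.fastq.gz"  # assumed single-end naming
        if fq1.exists():
            # Paired-end if the mate file exists, otherwise treat as single-end.
            samples.append((name, str(fq1), str(fq2) if fq2.exists() else None))
        elif single.exists():
            samples.append((name, str(single), None))
    return samples


# Usage (paths are illustrative):
# samples = collect_samples('fastq')
```

Here the sample name is the run accession; rename the first tuple element if you prefer labels such as 'S1'.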
STAR alignment (ov.alignment.STAR)
Use auto_index=True (default) with genome_fasta_files and gtf to build the index automatically if it is missing.
Use strict=False (default) for graceful per-sample error handling.
# Prepare samples from fastp output
star_samples = [
    ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
    ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
]
bams = ov.alignment.STAR(
    star_samples,
    genome_dir='star_index',
    output_dir='star_out',
    gtf='genes.gtf',
    genome_fasta_files=['genome.fa'],
    threads=8,
    memory='50G',
)
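Because index building is slow, a cheap check of the FASTA input before calling STAR can save time. The sketch below detects gzip compression from the file's magic bytes and confirms the first line is a FASTA header. check_fasta is a hypothetical helper and says nothing about how omicverse itself validates inputs.

```python
import gzip
from pathlib import Path


def check_fasta(path):
    """Return 'gzip' or 'plain' if the file looks like FASTA, else raise."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"FASTA not found: {p}")
    # gzip streams start with the two bytes 0x1f 0x8b.
    with p.open("rb") as fh:
        is_gz = fh.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gz else open
    with opener(p, "rt") as fh:
        first = fh.readline()
    if not first.startswith(">"):
        raise ValueError(f"{p} does not start with a FASTA header line")
    return "gzip" if is_gz else "plain"


# Usage (path is illustrative):
# for fa in ['genome.fa']:
#     print(fa, check_fasta(fa))
```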
Gene quantification (ov.alignment.featureCount)
auto_fix=True (default) retries with the corrected paired-end flag on error.
gene_mapping=True maps gene_id to gene_name from the GTF.
merge_matrix=True produces a combined count matrix across all samples.
bam_items = [
    ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
    ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
]
counts = ov.alignment.featureCount(
    bam_items,
    gtf='genes.gtf',
    output_dir='counts',
    gene_mapping=True,
    merge_matrix=True,
    threads=8,
)
# counts is a pandas DataFrame (gene_id x samples)
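To make the gene_id to gene_name mapping concrete, here is a rough sketch of how such a table can be read from the attribute column of a GTF file. It illustrates the general idea only and is not omicverse's implementation; gene_names_from_gtf is a hypothetical name.

```python
import re

# GTF attributes look like: gene_id "ENSG..."; gene_name "TP53";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def gene_names_from_gtf(lines):
    """Map gene_id -> gene_name from GTF lines, falling back to gene_id."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header / comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(_ATTR.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        if "gene_name" in attrs:
            mapping[gene_id] = attrs["gene_name"]
        else:
            mapping.setdefault(gene_id, gene_id)
    return mapping


# Usage (path is illustrative):
# with open('genes.gtf') as fh:
#     id_to_name = gene_names_from_gtf(fh)
```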
Single-cell path (ov.alignment.ref + ov.alignment.count)
ref() builds a kallisto index and transcript-to-gene mapping.
count() quantifies single-cell data with barcode/UMI handling.
# Build reference index
ref_result = ov.alignment.ref(
    index_path='kb_ref/index.idx',
    t2g_path='kb_ref/t2g.txt',
    fasta_paths=['genome.fa'],
    gtf_paths=['genes.gtf'],
    threads=8,
)
# Quantify 10x v3 data
count_result = ov.alignment.count(
    index_path='kb_ref/index.idx',
    t2g_path='kb_ref/t2g.txt',
    technology='10XV3',
    fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
    output_path='kb_out',
    h5ad=True,
    filter_barcodes=True,
    threads=8,
)
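When a sample is split across several lanes, the FASTQ list has to keep each R1 next to its R2. The sketch below assumes file names contain _R1 and _R2, and assumes fastq_paths accepts additional pairs in the same R1, R2 order shown above (the order the kb count command line uses). paired_fastq_paths is a hypothetical helper.

```python
from pathlib import Path


def paired_fastq_paths(fastq_dir):
    """Return [R1, R2, R1, R2, ...] for every *_R1* file and its *_R2* mate."""
    ordered = []
    for r1 in sorted(Path(fastq_dir).glob("*_R1*.fastq.gz")):
        # Derive the mate's name by swapping the first _R1 for _R2.
        r2 = r1.with_name(r1.name.replace("_R1", "_R2", 1))
        if not r2.exists():
            raise FileNotFoundError(f"missing mate for {r1.name}: expected {r2.name}")
        ordered.extend([str(r1), str(r2)])
    return ordered


# Usage (directory is illustrative):
# fastq_paths = paired_fastq_paths('sc_fastq')
```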
Wiring fastp output into STAR input
Each fastp result is a dict with the keys sample, clean1, clean2, json, html.
star_samples = [
    (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
    for r in (clean if isinstance(clean, list) else [clean])
]
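The same wiring can be wrapped in a small function so that a single result dict and a list of results are handled alike. This is only a sketch: to_star_samples is a hypothetical helper, and the dicts below are mock data shaped like the keys listed above, not real fastp output.

```python
def to_star_samples(clean):
    """Convert fastp result(s) into (sample, fq1, fq2_or_None) tuples."""
    results = clean if isinstance(clean, list) else [clean]
    # An empty or missing clean2 means single-end, so pass None for the mate.
    return [(r["sample"], r["clean1"], r.get("clean2") or None) for r in results]


# Mock results for illustration only.
mock_clean = [
    {"sample": "S1", "clean1": "fastp/S1/S1_clean_1.fastq.gz",
     "clean2": "fastp/S1/S1_clean_2.fastq.gz"},
    {"sample": "S2", "clean1": "fastp/S2/S2_clean_1.fastq.gz", "clean2": None},
]
for item in to_star_samples(mock_clean):
    print(item)
```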
Wiring STAR output into featureCount input
Each STAR result is a dict with the keys sample and bam (or error).
bam_items = [
    (r['sample'], r['bam'])
    for r in (bams if isinstance(bams, list) else [bams])
    if 'bam' in r
]
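Filtering on 'bam' silently drops failed samples, so it can be useful to collect the failures as well. The following sketch separates the two groups, assuming failed results carry an error entry as described above; split_star_results is a hypothetical helper and the dicts are mock data.

```python
def split_star_results(bams):
    """Return (bam_items, failed) from STAR result dict(s)."""
    results = bams if isinstance(bams, list) else [bams]
    bam_items, failed = [], {}
    for r in results:
        if r.get("bam"):
            bam_items.append((r["sample"], r["bam"]))
        else:
            failed[r["sample"]] = r.get("error", "no BAM produced")
    return bam_items, failed


# Mock results for illustration only.
mock_bams = [
    {"sample": "S1", "bam": "star_out/S1/Aligned.sortedByCoord.out.bam"},
    {"sample": "S2", "error": "alignment failed"},
]
bam_items, failed = split_star_results(mock_bams)
print(bam_items)
print(failed)
```

Reporting the failed dict before quantification makes it obvious which samples are missing from the final count matrix.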
Skipping completed steps
Steps whose outputs already exist are skipped when overwrite=False (default).
Set overwrite=True to force re-execution.
Troubleshooting
Missing tools: check that auto_install=True and that conda/mamba is accessible.
STAR index building: check that genome_fasta_files points to uncompressed or gzip FASTA files.
Paired-end flag errors in featureCount: auto_fix=True handles most cases automatically.
All alignment functions use a consistent sample tuple format:
FASTQ samples: (sample_name, fq1_path, fq2_path_or_None)
BAM items: (sample_name, bam_path) or (sample_name, bam_path, is_paired_bool)
# All functions support these parameters:
auto_install=True # Auto-install missing tools via conda/mamba
overwrite=False # Skip if outputs already exist
threads=8 # Per-tool thread count
jobs=None # Concurrent job count (auto-detected from CPU count)
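Since threads applies per tool invocation and jobs sets how many invocations run at once, the two multiply. If you prefer to set jobs explicitly rather than rely on auto-detection, one simple budgeting rule is sketched below. This rule is an assumption for illustration and is not necessarily how omicverse picks its default; pick_jobs is a hypothetical helper.

```python
import os


def pick_jobs(threads, n_samples, cpu_count=None):
    """Choose a job count so that threads * jobs stays within the CPU count."""
    cpus = cpu_count or os.cpu_count() or 1
    # At least one job; never more jobs than samples or than the CPUs allow.
    return max(1, min(n_samples, cpus // threads))


print(pick_jobs(threads=8, n_samples=10, cpu_count=32))  # 4
print(pick_jobs(threads=8, n_samples=2, cpu_count=4))    # 1
```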
prefetch -> fqdump -> fastp -> STAR -> featureCount -> pandas DataFrame
ref -> count with technology='10XV3' -> h5ad AnnData
fastp -> STAR -> featureCount
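For the missing-tools case under Troubleshooting, a quick look at PATH shows which executables would need to be installed before any workflow starts. The sketch uses the standard library's shutil.which and the tool names listed earlier in this guide; missing_tools is a hypothetical helper, independent of omicverse's own tool resolution.

```python
import shutil

# Executable names as listed in this guide.
TOOLS = [
    "prefetch", "vdb-validate", "fasterq-dump", "fastp",
    "STAR", "samtools", "featureCounts", "pigz", "gzip",
]


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [t for t in tools if which(t) is None]


absent = missing_tools()
if absent:
    print("Not on PATH:", ", ".join(absent))
else:
    print("All tools found.")
```

The which parameter exists only to make the function easy to test with a stand-in lookup.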