Name: Fastq Analysis Pipeline
Author: FreedomIntelligence

SRA data download (ov.alignment.prefetch + ov.alignment.fqdump)

Run prefetch first: it downloads the SRA files and validates their integrity with vdb-validate.
Then convert to FASTQ with fqdump, which auto-detects single-end vs paired-end layout.
fqdump can also work directly from SRR accessions, without a prior prefetch.
Both functions retry on network errors with exponential backoff.

import omicverse as ov

# Step 1: Prefetch SRA files (optional but recommended)
pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)

# Step 2: Convert to FASTQ
fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'],
                          output_dir='fastq', sra_dir='prefetch',
                          gzip=True, threads=8, jobs=4)
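
The retry behaviour mentioned above can be pictured with a small generic sketch. This is not omicverse's implementation; `retry_with_backoff`, `max_attempts`, and `base_delay` are hypothetical names used only to illustrate exponential backoff, where the wait doubles after each failed attempt.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration; not part of omicverse.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except OSError:
            # Network failures (ConnectionError, TimeoutError) subclass OSError.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.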

FASTQ quality control (ov.alignment.fastp)

Runs fastp for adapter trimming, quality filtering, and QC reporting.
Supports single-end and paired-end reads.
Produces per-sample JSON and HTML QC reports.
Sample format: tuple of (sample_name, fq1_path, fq2_path_or_None).

samples = [
    ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
    ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
]
clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
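
Writing sample tuples by hand gets tedious for many accessions. The sketch below builds them from a directory, assuming the `<accession>_1.fastq.gz` / `<accession>_2.fastq.gz` naming seen in the example paths above and treating any other `*.fastq.gz` file as single-end. `build_samples` is a hypothetical helper, not an omicverse function.

```python
from pathlib import Path


def build_samples(fastq_dir):
    """Return (sample_name, fq1_path, fq2_path_or_None) tuples found under fastq_dir.

    Hypothetical helper; assumes the _1/_2 suffix convention for paired-end reads.
    """
    samples = []
    for fq in sorted(Path(fastq_dir).rglob('*.fastq.gz')):
        stem = fq.name[:-len('.fastq.gz')]
        if stem.endswith('_2'):
            continue  # mate files are attached to their _1 partner below
        if stem.endswith('_1'):
            name = stem[:-2]
            mate = fq.with_name(f'{name}_2.fastq.gz')
            # A missing mate means the sample is treated as single-end.
            samples.append((name, str(fq), str(mate) if mate.exists() else None))
        else:
            samples.append((stem, str(fq), None))
    return samples
```

Under these assumptions, the resulting list could be passed as the first argument to ov.alignment.fastp.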

STAR alignment (ov.alignment.STAR)

Aligns FASTQ reads using the STAR aligner.
Auto-index building: with auto_index=True (default), pass genome_fasta_files and gtf so the STAR index is built automatically if it is missing.
Produces coordinate-sorted BAM files.
Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
With strict=False (default), errors are handled gracefully per sample instead of aborting the whole run.

# Prepare samples from fastp output
star_samples = [
    ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
    ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
]
bams = ov.alignment.STAR(
    star_samples,
    genome_dir='star_index',
    output_dir='star_out',
    gtf='genes.gtf',
    genome_fasta_files=['genome.fa'],
    threads=8,
    memory='50G',
)

Gene quantification (ov.alignment.featureCount)

Counts aligned reads per gene using featureCounts (subread).
Auto-detects paired-end from BAM headers (via pysam or samtools).
auto_fix=True (default) retries with corrected paired-end flag on error.
gene_mapping=True maps gene_id to gene_name from the GTF.
merge_matrix=True produces a combined count matrix across all samples.

bam_items = [
    ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
    ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
]
counts = ov.alignment.featureCount(
    bam_items,
    gtf='genes.gtf',
    output_dir='counts',
    gene_mapping=True,
    merge_matrix=True,
    threads=8,
)
# counts is a pandas DataFrame (gene_id x samples)
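
To make the gene_mapping step concrete, here is a minimal sketch of how a gene_id to gene_name lookup can be read from GTF attribute columns. It is an illustration under the standard GTF layout (nine tab-separated fields, attributes written as `key "value";`), not the code omicverse runs; `gene_id_to_name` is a hypothetical name.

```python
import re

# Matches GTF attributes of the form: key "value";
_ATTR = re.compile(r'(\w+) "([^"]*)"')


def gene_id_to_name(gtf_lines):
    """Map gene_id to gene_name from an iterable of GTF lines (hypothetical helper)."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith('#'):
            continue  # header / comment lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(_ATTR.findall(fields[8]))
        gene_id = attrs.get('gene_id')
        if not gene_id:
            continue
        if 'gene_name' in attrs:
            mapping[gene_id] = attrs['gene_name']
        else:
            # Keep the ID itself when no gene_name is annotated.
            mapping.setdefault(gene_id, gene_id)
    return mapping
```

It could be used as `with open('genes.gtf') as fh: names = gene_id_to_name(fh)`; gzip-compressed GTF files would need `gzip.open(..., 'rt')` instead.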

Single-cell path (ov.alignment.ref + ov.alignment.count)

Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
ref() builds a kallisto index and transcript-to-gene mapping.
count() quantifies single-cell data with barcode/UMI handling.
Supports technologies: 10XV2, 10XV3, BULK, and custom.
Output formats: h5ad, loom, cellranger MTX.

# Build reference index
ref_result = ov.alignment.ref(
    index_path='kb_ref/index.idx',
    t2g_path='kb_ref/t2g.txt',
    fasta_paths=['genome.fa'],
    gtf_paths=['genes.gtf'],
    threads=8,
)

# Quantify 10x v3 data
count_result = ov.alignment.count(
    index_path='kb_ref/index.idx',
    t2g_path='kb_ref/t2g.txt',
    technology='10XV3',
    fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
    output_path='kb_out',
    h5ad=True,
    filter_barcodes=True,
    threads=8,
)

Wiring fastp output into STAR input

fastp output is a list of dicts with keys: sample, clean1, clean2, json, html.
Convert to STAR sample tuples:

star_samples = [
    (r['sample'], r['clean1'], r.get('clean2') or None)  # None marks single-end
    for r in (clean if isinstance(clean, list) else [clean])
]

Wiring STAR output into featureCount input

STAR output is a list of dicts with keys: sample, bam (or error).
Convert to featureCount items:

bam_items = [
    (r['sample'], r['bam'])
    for r in (bams if isinstance(bams, list) else [bams])
    if 'bam' in r
]
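
The filter above silently drops samples that failed alignment. A small sketch that also keeps track of them, assuming only the result shape described here (dicts with `sample` and either `bam` or `error`); `split_star_results` is a hypothetical helper, not part of omicverse.

```python
def split_star_results(results):
    """Separate successful and failed samples from STAR-style result dicts.

    Hypothetical helper; relies only on the 'sample', 'bam' and 'error' keys.
    """
    if isinstance(results, dict):
        results = [results]  # a single result is wrapped into a list
    ok = [(r['sample'], r['bam']) for r in results if 'bam' in r]
    failed = {
        r['sample']: r.get('error', 'unknown error')
        for r in results
        if 'bam' not in r
    }
    return ok, failed
```

For example, `bam_items, failed = split_star_results(bams)` gives the featureCount input plus a dict of failures that can be logged or retried.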

# All functions support these parameters:
auto_install=True   # Auto-install missing tools via conda/mamba
overwrite=False     # Skip if outputs already exist
threads=8           # Per-tool thread count
jobs=None           # Concurrent job count (auto-detected from CPU count)
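
Because threads applies per tool and jobs sets how many tools run at once, the two multiply. The source does not say how jobs is derived from the CPU count, so the rule below is only an assumption: divide the available cores by the per-job thread count to avoid oversubscription. `suggest_jobs` is a hypothetical helper.

```python
import os


def suggest_jobs(threads, cpu_count=None):
    """Suggest a concurrent job count so that jobs * threads fits the CPU count.

    Assumed rule for illustration; omicverse's own auto-detection may differ.
    """
    cpus = cpu_count or os.cpu_count() or 1
    return max(1, cpus // max(1, threads))
```

On a 32-core machine, `suggest_jobs(8, cpu_count=32)` gives 4 concurrent jobs of 8 threads each.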
