Use this skill when you need a genomic file toolkit for reading/writing SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences, for extracting regions, or for calculating coverage in reproducible NGS data processing pipelines.
Use this skill when a data analytics task needs a packaged method instead of ad-hoc freeform output.
Use this skill when the user expects a concrete deliverable, validation step, or file-based result.
Use this skill when the documented workflow in this package is the most direct path to complete the request.
Use this skill when you need the pysam package behavior rather than a generic answer.
Key Features
Scope-focused workflow aligned to: Genomic file toolkit. For reading/writing SAM/BAM/CRAM alignment files, VCF/BCF variant files, FASTA/FASTQ sequences, extracting regions, calculating coverage, suitable for NGS data processing pipelines.
Documentation-first workflow with no packaged script requirement.
Reference material available in references/ for task-specific guidance.
Structured execution path designed to keep outputs consistent and reviewable.
Dependencies
Python: 3.10+. Repository baseline for current packaged skills.
Third-party packages: not explicitly version-pinned in this skill package. Add pinned versions if this skill needs stricter environment control.
Example Usage
Skill directory: 20260316/scientific-skills/Data Analytics/pysam
No packaged executable script was detected.
Use the documented workflow in SKILL.md together with the references/assets in this folder.
Example run plan:
Read the skill instructions and collect the required inputs.
Follow the documented workflow exactly.
Use packaged references/assets from this folder when the task needs templates or rules.
Return a structured result tied to the requested deliverable.
Implementation Details
See the Overview section below for related details.
Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
Primary implementation surface: instruction-only workflow in SKILL.md.
Reference guidance: references/ contains supporting rules, prompts, or checklists.
Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
Overview
Pysam is a Python module for reading, manipulating, and writing genomic datasets. It provides a Python-style interface to htslib, supporting reading/writing SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences. It can also query tabix-indexed files, perform pileup analysis for coverage calculation, and execute samtools/bcftools commands.
When to Use This Skill
Use this skill in the following scenarios:
Processing sequencing alignment files (BAM/CRAM)
Analyzing genetic variants (VCF/BCF)
Extracting reference sequences or gene regions
Processing raw sequencing data (FASTQ)
Calculating coverage or sequencing depth
Implementing bioinformatics analysis pipelines
Quality control of sequencing data
Variant calling and annotation workflows
Quick Start
Installation
uv pip install pysam
Basic Examples
Reading alignment files:
import pysam
# Open BAM file and fetch reads in the specified region
# (fetch() on a region needs an index, e.g. example.bam.bai)
samfile = pysam.AlignmentFile("example.bam", "rb")
for read in samfile.fetch("chr1", 1000, 2000):
    print(f"{read.query_name}: {read.reference_start}")
samfile.close()
Reading variant files:
# Open VCF file and iterate through variant sites
vcf = pysam.VariantFile("variants.vcf")
for variant in vcf:
    print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
vcf.close()
Querying reference sequences:
# Open FASTA and extract a sequence (fetch() uses 0-based, half-open coordinates)
fasta = pysam.FastaFile("reference.fasta")
sequence = fasta.fetch("chr1", 1000, 2000)
print(sequence)
fasta.close()
Core Capabilities
1. Alignment File Operations (SAM/BAM/CRAM)
Use the AlignmentFile class to work with aligned sequencing reads. This is suitable for analyzing alignment results, calculating coverage, extracting reads, or quality control.
Common operations:
Open and read BAM/SAM/CRAM files
Fetch reads from specific genomic regions
Filter reads by alignment quality (mapping quality), flag, or other criteria
Write filtered or modified alignment data
Calculate coverage statistics
Perform pileup analysis (per-base coverage)
Access read sequences, quality values, and alignment information
Reference: For detailed documentation, see references/alignment_files.md:
Opening and reading alignment files
AlignedSegment attributes and methods
Region-based extraction using fetch()
Pileup analysis for coverage analysis
Writing and creating BAM files
Coordinate systems and indexing
Performance optimization tips
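The read filtering described above can be sketched as a small predicate over read-like objects. This is a minimal sketch, not the packaged workflow: passes_filters, filter_reads, and make_read are hypothetical helper names, the default threshold of 30 is an assumption, and the demo uses stand-in objects so that it runs without a BAM file. The attributes it reads (mapping_quality, is_unmapped, is_secondary, is_supplementary, is_duplicate) are the ones exposed by pysam's AlignedSegment.

```python
from types import SimpleNamespace


def passes_filters(read, min_mapq=30):
    """Return True if a read-like object is a usable primary alignment."""
    return (
        not read.is_unmapped
        and not read.is_secondary
        and not read.is_supplementary
        and not read.is_duplicate
        and read.mapping_quality >= min_mapq
    )


def filter_reads(reads, min_mapq=30):
    """Yield only the reads that satisfy passes_filters()."""
    for read in reads:
        if passes_filters(read, min_mapq):
            yield read


def make_read(name, mapq, **flags):
    """Hypothetical stand-in for an AlignedSegment (illustration only)."""
    defaults = dict(is_unmapped=False, is_secondary=False,
                    is_supplementary=False, is_duplicate=False)
    defaults.update(flags)
    return SimpleNamespace(query_name=name, mapping_quality=mapq, **defaults)


demo_reads = [
    make_read("r1", 60),
    make_read("r2", 10),                     # low mapping quality
    make_read("r3", 60, is_duplicate=True),  # flagged duplicate
    make_read("r4", 0, is_unmapped=True),    # unmapped
]
kept_names = [r.query_name for r in filter_reads(demo_reads)]
print(kept_names)
```

With a real file, the same generator should accept the iterator returned by samfile.fetch("chr1", 1000, 2000) in place of demo_reads, and the surviving reads can then be written to a second AlignmentFile opened for writing.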
2. Variant File Operations (VCF/BCF)
Use the VariantFile class to work with genetic variants from variant calling pipelines. This is suitable for variant analysis, filtering, annotation, or population genetics studies.
Common operations:
Read/write VCF/BCF files
Query variants in specific regions
Access variant information (position, alleles, quality scores)
Extract genotype data for samples
Filter variants by quality, allele frequency, or other criteria
Annotate variants with additional information
Subset samples or regions
Reference: For detailed documentation, see references/variant_files.md:
Opening and reading variant files
VariantRecord attributes and methods
Accessing INFO and FORMAT fields
Handling genotypes and samples
Creating and writing VCF files
Filtering and extracting variant subsets
Multi-sample VCF operations
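Two recurring details when filtering variants are the coordinate convention and missing quality values. The sketch below assumes that a record's pos is the 1-based VCF position and that qual is None when the QUAL column is missing, which is how pysam's VariantRecord is understood to behave. variant_interval and passes_quality are hypothetical helper names, the threshold of 30 is an assumption, and the demo records are stand-ins rather than real VariantRecord objects.

```python
from types import SimpleNamespace


def variant_interval(pos, ref):
    """Convert a 1-based VCF POS and REF allele to a 0-based, half-open (start, end)."""
    start = pos - 1
    return start, start + len(ref)


def passes_quality(record, min_qual=30.0):
    """Return True if QUAL is present and at least min_qual."""
    return record.qual is not None and record.qual >= min_qual


# Stand-in records for illustration; real ones come from iterating a VariantFile.
demo_records = [
    SimpleNamespace(chrom="chr1", pos=1001, ref="A", alts=("G",), qual=50.0),
    SimpleNamespace(chrom="chr1", pos=2000, ref="CTG", alts=("C",), qual=12.5),
    SimpleNamespace(chrom="chr1", pos=3000, ref="T", alts=("C",), qual=None),
]
kept = [r for r in demo_records if passes_quality(r)]
intervals = [variant_interval(r.pos, r.ref) for r in demo_records]
print([r.pos for r in kept], intervals)
```

The (start, end) pairs use the same 0-based, half-open convention as fetch(), so they can be passed directly to region queries on alignment or FASTA files.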
3. Sequence File Operations (FASTA/FASTQ)
Use FastaFile for random access to reference sequences, and FastxFile for reading raw sequencing data. This is suitable for extracting gene sequences, validating variants against reference, or processing raw reads.
Common operations:
Query reference sequences by genomic coordinates
Extract sequences for genes or regions of interest
Read FASTQ files with quality values
Validate reference alleles for variants
Calculate sequence statistics
Filter reads by quality or length
Convert between FASTA and FASTQ formats
Reference: For detailed documentation, see references/sequence_files.md:
FASTA file access and indexing
Extracting sequences by region
Handling reverse complement sequences for genes
Sequential reading of FASTQ files
Quality score conversion and filtering
Processing tabix-indexed files (BED, GTF, GFF)
Common sequence processing patterns
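Reverse complementing and quality score conversion are plain string operations once a sequence or quality string has been read. The following is a minimal sketch using only the standard library. It assumes Phred+33 encoded quality strings (the common modern FASTQ encoding) and only handles the bases A, C, G, T, and N; the function names are hypothetical.

```python
# Translation table for complementing bases; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def phred33_to_scores(quality):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality]


def mean_quality(quality):
    """Arithmetic mean of the Phred scores (0.0 for an empty string)."""
    scores = phred33_to_scores(quality)
    return sum(scores) / len(scores) if scores else 0.0


print(reverse_complement("ATGC"))
print(phred33_to_scores("I!5"))
print(mean_quality("II"))
```

A read filter could keep entries whose mean_quality(entry.quality) is above a chosen threshold while iterating a FastxFile. Note that the arithmetic mean of Phred scores is a convenient heuristic and is not the same as the mean error probability.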
4. Integrated Bioinformatics Workflows
Pysam excels at integrating multiple file types for comprehensive genomic analysis. Common workflows combine alignment files, variant files, and reference sequences.
Common workflows:
Calculate coverage statistics for specific regions
Verify variants using aligned reads
Annotate variants with coverage information
Extract sequences around variant positions
Filter alignments or variants based on multiple criteria
Generate coverage tracks for visualization
Quality control across multiple data types
Reference: For detailed examples, see references/common_workflows.md:
Quality control workflows (BAM statistics, reference consistency)
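As an illustration of the coverage statistics workflow, the sketch below computes per-base depth and two summary values from read spans. It is a simplified approximation under stated assumptions: each read is reduced to a 0-based, half-open (start, end) span such as (read.reference_start, read.reference_end), so deletions and spliced gaps inside a read are counted as covered. The function names and the demo spans are hypothetical; a pileup-based approach is more precise.

```python
def per_base_depth(spans, region_start, region_end):
    """Count how many spans overlap each position of a 0-based, half-open region."""
    depth = [0] * (region_end - region_start)
    for start, end in spans:
        # Clip each span to the region before counting.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


def summarize_depth(depth, min_depth=1):
    """Return mean depth and breadth (fraction of positions with depth >= min_depth)."""
    n = len(depth)
    if n == 0:
        return {"mean": 0.0, "breadth": 0.0}
    covered = sum(1 for d in depth if d >= min_depth)
    return {"mean": sum(depth) / n, "breadth": covered / n}


# Illustrative spans only; real spans would come from reads returned by fetch().
demo_spans = [(100, 105), (103, 108), (90, 102)]
demo_depth = per_base_depth(demo_spans, 100, 110)
demo_summary = summarize_depth(demo_depth)
print(demo_depth, demo_summary)
```

Reporting both mean depth and breadth helps distinguish a region that is evenly covered from one where a few positions carry most of the reads.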