Production-ready VCF processing and variant annotation skill combining local bioinformatics computation with ToolUniverse database integration. Designed to answer bioinformatics analysis questions about VCF data, mutation classification, variant filtering, and clinical annotation.
VCF quality filtering must come before interpretation. A variant called at 2x read depth is unreliable regardless of its QUAL score, because stochastic sequencing errors at low depth can mimic true variants. The recommended minimums — depth > 10x, QUAL > 20, allele frequency consistent with expected zygosity — are not conservative; they are the floor below which calls cannot be trusted. Applying lenient filters to "keep more variants" sacrifices accuracy for coverage and produces false positives that propagate through all downstream analyses.
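To make the floor concrete, here is a minimal pure-Python sketch that rejects a record unless it clears both minimums from the paragraph above. It assumes read depth is carried in the INFO `DP` tag; the function names are hypothetical and are not part of this skill's scripts. The zygosity check on allele frequency is left out for brevity.

```python
from typing import Optional

MIN_DEPTH = 10    # floor from the text: depth > 10x
MIN_QUAL = 20.0   # floor from the text: QUAL > 20


def info_depth(info: str) -> Optional[int]:
    """Return the INFO DP value, or None when the record has no usable DP tag."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP" and value.isdigit():
            return int(value)
    return None


def passes_quality_floor(vcf_line: str) -> bool:
    """True when a tab-separated VCF data line clears both minimums."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual_text, info = fields[5], fields[7]
    if qual_text == ".":
        return False              # a missing QUAL cannot be checked
    depth = info_depth(info)
    if depth is None:
        return False              # a missing depth cannot be checked either
    return float(qual_text) > MIN_QUAL and depth > MIN_DEPTH


low_depth = "1\t1000\t.\tA\tG\t50\tPASS\tDP=2"
well_covered = "1\t2000\t.\tC\tT\t45.5\tPASS\tDP=35;AF=0.5"
print(passes_quality_floor(low_depth), passes_quality_floor(well_covered))
# prints: False True
```

The first record is rejected despite its QUAL of 50 because it sits at 2x depth, which is the case the paragraph warns about.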
Annotate variants with MyVariant_query_variants or EnsemblVEP_annotate_rsid; never cite ClinVar classifications from memory.
Use ClinGen_dosage_by_gene for dosage sensitivity.
Triggers:
Example Questions:
| Capability | Description |
|---|---|
| VCF Parsing | Pure Python + cyvcf2 parsers. VCF 4.x, gzipped, multi-sample, SNV/indel/SV |
| Mutation Classification | Maps SO terms, SnpEff ANN, VEP CSQ, GATK Funcotator to standard types |
| VAF Extraction | Handles AF, AD, AO/RO, NR/NV, INFO AF formats |
| Filtering | VAF, depth, quality, PASS, variant type, mutation type, consequence, chromosome, SV size |
| Statistics | Ti/Tv ratio, per-sample VAF/depth stats, mutation type distribution, SV size distribution |
| Annotation | MyVariant.info (aggregates ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen) |
| SV/CNV Analysis | gnomAD SV population frequencies, DGVa/dbVar known SVs, ClinGen dosage sensitivity |
| Clinical Interpretation | ACMG/ClinGen CNV pathogenicity classification using haploinsufficiency/triplosensitivity scores |
| DataFrame | Convert to pandas for advanced analytics |
| Reporting | Markdown reports with tables and statistics, SV clinical reports |
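The VAF Extraction row can be illustrated with a small sketch that tries each per-sample encoding in turn. This is an assumption about how such a fallback chain might look, not the skill's actual parser: `extract_vaf` is a hypothetical name, only the first ALT allele is considered, and the INFO-level AF case is omitted.

```python
from typing import Dict, Optional


def extract_vaf(format_keys: str, sample_values: str) -> Optional[float]:
    """Derive the VAF of the first ALT allele from one VCF sample column."""
    sample: Dict[str, str] = dict(
        zip(format_keys.split(":"), sample_values.split(":"))
    )

    def number(key: str, index: int = 0) -> Optional[float]:
        # Return the index-th comma-separated value of a field, if numeric.
        parts = sample.get(key, ".").split(",")
        try:
            return float(parts[index])
        except (IndexError, ValueError):
            return None

    af = number("AF")
    if af is not None:
        return af                                  # caller-reported fraction

    ref_ad, alt_ad = number("AD", 0), number("AD", 1)
    if ref_ad is not None and alt_ad is not None and ref_ad + alt_ad > 0:
        return alt_ad / (ref_ad + alt_ad)          # allelic depths: ref,alt

    ao, ro = number("AO"), number("RO")
    if ao is not None and ro is not None and ao + ro > 0:
        return ao / (ao + ro)                      # alt / reference observations

    nv, nr = number("NV"), number("NR")
    if nv is not None and nr is not None and nr > 0:
        return nv / nr                             # variant reads / covering reads
    return None


print(extract_vaf("GT:AD:DP", "0/1:30,10:40"))
# prints: 0.25
```

At multi-allelic sites the AD denominator here ignores the additional ALT alleles, so a fuller version would sum every AD entry.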
Phase 1: Parse VCF → Extract CHROM/POS/REF/ALT/QUAL/FILTER/INFO, per-sample GT/VAF/depth, annotations (ANN/CSQ/FUNCOTATION). Pure Python or cyvcf2.
Phase 2: Classify → Variant type (SNV/INS/DEL/MNV/SV), mutation type (missense/nonsense/synonymous/frameshift/splice/etc.), impact (HIGH/MODERATE/LOW/MODIFIER).
Phase 3: Filter → VAF range, depth, quality, PASS, variant/mutation type, consequence exclusion, population frequency, chromosome, SV size.
Phase 4: Statistics → Type/mutation/impact/chromosome distributions, Ti/Tv ratio, per-sample VAF/depth, gene mutation counts.
Phase 5: Annotate (optional) → MyVariant.info (ClinVar/dbSNP/gnomAD/CADD), Ensembl VEP consequence prediction.
Phase 6: Report → Markdown tables, direct answers, DataFrame export.
Phase 7: SV/CNV Analysis (if applicable) → gnomAD SV frequencies, ClinGen dosage sensitivity, ACMG pathogenicity classification.
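Phases 2 and 4 can be sketched together: classify each REF/ALT pair by variant type, then compute the Ti/Tv ratio over the single-base substitutions. The function names are hypothetical, and the SV test only recognises symbolic alleles and breakend notation; size-based SV calling on explicit sequences is not attempted.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def variant_type(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair as SNV, MNV, INS, DEL, or SV."""
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "SV"                       # symbolic allele or breakend
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "INS" if len(alt) > len(ref) else "DEL"


def ti_tv_ratio(alleles: Iterable[Tuple[str, str]]) -> float:
    """Transitions (A<->G, C<->T) divided by transversions, over SNVs only."""
    ti = tv = 0
    for ref, alt in alleles:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue                      # skip indels, '*', and non-ACGT alleles
        if {ref, alt} in (PURINES, PYRIMIDINES):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


calls = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A"), ("G", "A"), ("G", "T")]
print(variant_type("AT", "A"), ti_tv_ratio(calls))
# prints: DEL 1.5
```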
Use pandas for:
Use python_implementation tools for:
Key functions:
vcf_data = parse_vcf("input.vcf") # Pure Python (always works)
vcf_data = parse_vcf_cyvcf2("input.vcf") # Fast C-based (if installed)
df = variants_to_dataframe(vcf_data.variants, sample="TUMOR") # For pandas
Automatic classification from annotations:
Mutation types supported: missense, nonsense, synonymous, frameshift, splice_site, splice_region, inframe_insertion, inframe_deletion, intronic, intergenic, UTR_5, UTR_3, upstream, downstream, stop_lost, start_lost
See references/mutation_classification_guide.md for full details
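As a rough illustration of annotation-driven classification, the sketch below maps Sequence Ontology terms found in a SnpEff `ANN` field onto the mutation types listed above. The mapping table, the `other` fallback label, and the function name are assumptions made for this example; the skill's own mapping lives in the referenced guide and may differ. When an entry joins several terms with `&`, the first term with a known mapping wins, which is a simplification.

```python
from typing import List

SO_TO_MUTATION_TYPE = {
    "missense_variant": "missense",
    "stop_gained": "nonsense",
    "synonymous_variant": "synonymous",
    "frameshift_variant": "frameshift",
    "splice_acceptor_variant": "splice_site",
    "splice_donor_variant": "splice_site",
    "splice_region_variant": "splice_region",
    "inframe_insertion": "inframe_insertion",
    "inframe_deletion": "inframe_deletion",
    "intron_variant": "intronic",
    "intergenic_variant": "intergenic",
    "intergenic_region": "intergenic",
    "5_prime_UTR_variant": "UTR_5",
    "3_prime_UTR_variant": "UTR_3",
    "upstream_gene_variant": "upstream",
    "downstream_gene_variant": "downstream",
    "stop_lost": "stop_lost",
    "start_lost": "start_lost",
}


def mutation_types_from_ann(info: str) -> List[str]:
    """Return one mutation type per ANN entry in a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            ann_value = entry[len("ANN="):]
            break
    else:
        return []                          # no ANN tag on this record

    types = []
    for annotation in ann_value.split(","):
        subfields = annotation.split("|")
        if len(subfields) < 2:
            continue
        so_terms = subfields[1].split("&")  # second subfield holds the SO terms
        mapped = next(
            (SO_TO_MUTATION_TYPE[t] for t in so_terms if t in SO_TO_MUTATION_TYPE),
            "other",
        )
        types.append(mapped)
    return types


info = "DP=40;ANN=T|missense_variant|MODERATE|TP53,T|splice_region_variant&intron_variant|LOW|TP53"
print(mutation_types_from_ann(info))
# prints: ['missense', 'splice_region']
```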
Common filtering patterns:
# Somatic-like variants
criteria = FilterCriteria(
    min_vaf=0.05, max_vaf=0.95,
    min_depth=20, pass_only=True,
    exclude_consequences=["intronic", "intergenic", "upstream", "downstream"]
)
# High-confidence germline
criteria = FilterCriteria(
    min_vaf=0.25, min_depth=30, pass_only=True,
    chromosomes=[str(n) for n in range(1, 23)] + ["X", "Y"]  # autosomes 1-22 plus X and Y
)
# Rare pathogenic candidates
criteria = FilterCriteria(
    min_depth=20, pass_only=True,
    mutation_types=["missense", "nonsense", "frameshift"]
)
See references/vcf_filtering.md for all filter options
Use python_implementation for standard stats (Ti/Tv, type distributions, per-sample VAF/depth); pandas for custom aggregations. For annotation, prefer MyVariant.info (batch: ClinVar + dbSNP + gnomAD + CADD); limit to 50-100 variants per batch. Reports include type/mutation/impact/chromosome distributions, VAF stats, clinical significance, and top mutated genes.
See references/annotation_guide.md for detailed examples
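The 50-100 variants per batch guideline amounts to chunking the identifier list before each annotation call. A minimal sketch follows; `batches` is a hypothetical helper, and the annotation call itself is deliberately left as a comment because its exact signature is not shown in this document.

```python
from typing import Iterator, List, Sequence


def batches(variant_ids: Sequence[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size identifiers."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(variant_ids), batch_size):
        yield list(variant_ids[start:start + batch_size])


rsids = [f"rs{n}" for n in range(1, 121)]     # 120 placeholder identifiers
for batch in batches(rsids, batch_size=50):
    # Each batch would be passed to the annotation tool here.
    print(len(batch))
# prints 50, 50, 20 on separate lines
```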
When VCF contains SV calls (SVTYPE=DEL/DUP/INV/BND):
clingen = ClinGen_dosage_by_gene(gene_symbol="BRCA1")
# Returns: haploinsufficiency_score, triplosensitivity_score
gnomad_sv = gnomad_get_sv_by_gene(gene_symbol="BRCA1")
# Returns: SVs with AF, AC, AN
ClinGen dosage score interpretation:
See references/sv_cnv_analysis.md for full SV workflow
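For orientation, the sketch below encodes the publicly documented ClinGen dosage sensitivity scale and picks the score that matters for a given SV type: haploinsufficiency for deletions, triplosensitivity for duplications. It assumes the two scores arrive as integers, which may not match the tool's actual response format, and both function names are hypothetical.

```python
from typing import Optional

# ClinGen dosage sensitivity scale (applies to both HI and TS scores).
CLINGEN_DOSAGE_SCORES = {
    0: "No evidence available",
    1: "Little evidence for dosage pathogenicity",
    2: "Some evidence for dosage pathogenicity",
    3: "Sufficient evidence for dosage pathogenicity",
    30: "Gene associated with autosomal recessive phenotype",
    40: "Dosage sensitivity unlikely",
}


def describe_dosage_score(score: int) -> str:
    """Translate a numeric ClinGen score into its evidence category."""
    return CLINGEN_DOSAGE_SCORES.get(score, "Unrecognized score")


def relevant_dosage_score(svtype: str, hi_score: int, ts_score: int) -> Optional[int]:
    """Pick the score that applies to a copy-number change of this type."""
    if svtype == "DEL":
        return hi_score      # losing a copy: haploinsufficiency
    if svtype == "DUP":
        return ts_score      # gaining a copy: triplosensitivity
    return None              # INV/BND do not change dosage directly


score = relevant_dosage_score("DEL", hi_score=3, ts_score=0)
print(score, describe_dosage_score(score))
# prints: 3 Sufficient evidence for dosage pathogenicity
```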
Question: "What fraction of variants with VAF < X are annotated as Y mutations?"
result = answer_vaf_mutation_fraction(
    vcf_path="input.vcf",
    max_vaf=0.3,
    mutation_type="missense",
    sample="TUMOR"
)
# Returns: fraction, total_below_vaf, matching_mutation_type
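The arithmetic behind this question is a two-stage count. The sketch below reproduces it on plain dictionaries; the record keys `vaf` and `mutation_type` and the function name are assumptions for illustration, while the result keys mirror the return fields named above. It is not the implementation of `answer_vaf_mutation_fraction`.

```python
from typing import Dict, List, Optional, Union

Record = Dict[str, Union[Optional[float], str]]


def vaf_mutation_fraction(
    variants: List[Record], max_vaf: float, mutation_type: str
) -> Dict[str, float]:
    """Fraction of variants below max_vaf that carry the given mutation type."""
    # Stage 1: keep variants with a known VAF strictly below the cutoff.
    below = [v for v in variants if v["vaf"] is not None and v["vaf"] < max_vaf]
    # Stage 2: of those, count the ones with the requested mutation type.
    matching = [v for v in below if v["mutation_type"] == mutation_type]
    fraction = len(matching) / len(below) if below else 0.0
    return {
        "fraction": fraction,
        "total_below_vaf": len(below),
        "matching_mutation_type": len(matching),
    }


variants = [
    {"vaf": 0.10, "mutation_type": "missense"},
    {"vaf": 0.20, "mutation_type": "synonymous"},
    {"vaf": 0.25, "mutation_type": "missense"},
    {"vaf": 0.50, "mutation_type": "missense"},
    {"vaf": None, "mutation_type": "missense"},   # no VAF available: excluded
]
result = vaf_mutation_fraction(variants, max_vaf=0.3, mutation_type="missense")
print(result["matching_mutation_type"], result["total_below_vaf"])
# prints: 2 3
```

The denominator is the number of variants below the VAF cutoff, not the total number of variants in the file.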
Question: "What is the difference in mutation frequency between cohorts?"
result = answer_cohort_comparison(
    vcf_paths=["cohort1.vcf", "cohort2.vcf"],
    mutation_type="missense",
    cohort_names=["Treatment", "Control"]
)
# Returns: cohorts, frequency_difference
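One plausible reading of "mutation frequency" is the share of a cohort's variants that carry the mutation type of interest. The sketch below uses that definition, which is an assumption and may not match how `answer_cohort_comparison` defines it; the function names are hypothetical.

```python
from typing import Sequence


def mutation_frequency(mutation_types: Sequence[str], target: str) -> float:
    """Share of variants in one cohort annotated with the target type."""
    if not mutation_types:
        return 0.0
    return sum(1 for m in mutation_types if m == target) / len(mutation_types)


def frequency_difference(
    cohort_a: Sequence[str], cohort_b: Sequence[str], target: str
) -> float:
    """Frequency in cohort_a minus frequency in cohort_b."""
    return mutation_frequency(cohort_a, target) - mutation_frequency(cohort_b, target)


treatment = ["missense", "missense", "synonymous", "nonsense"]
control = ["missense", "synonymous", "synonymous", "synonymous"]
print(frequency_difference(treatment, control, "missense"))
# prints: 0.25
```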
Question: "After filtering X, how many Y remain?"
result = answer_non_reference_after_filter(
    vcf_path="input.vcf",
    exclude_intronic_intergenic=True
)
# Returns: total_input, non_reference, remaining
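The counting logic can be sketched as a genotype test followed by a consequence filter. The order of the two steps, the tuple layout of each record, and the function names are assumptions for this example; the result keys follow the return fields listed above.

```python
import re
from typing import Dict, Iterable, List, Tuple


def is_non_reference(gt: str) -> bool:
    """True when a GT string such as '0/1' or '1|1' carries any ALT allele."""
    alleles = re.split(r"[/|]", gt)
    return any(a.isdigit() and int(a) > 0 for a in alleles)


def count_after_filter(
    records: List[Tuple[str, str]],
    excluded: Iterable[str] = ("intronic", "intergenic"),
) -> Dict[str, int]:
    """Count non-reference calls, then those left after dropping excluded types."""
    excluded = set(excluded)
    non_ref = [r for r in records if is_non_reference(r[0])]
    remaining = [r for r in non_ref if r[1] not in excluded]
    return {
        "total_input": len(records),
        "non_reference": len(non_ref),
        "remaining": len(remaining),
    }


records = [            # (genotype, mutation type)
    ("0/1", "missense"),
    ("0/0", "missense"),
    ("1|1", "intronic"),
    ("./.", "synonymous"),
    ("0/2", "synonymous"),
]
print(count_after_filter(records))
# prints: {'total_input': 5, 'non_reference': 3, 'remaining': 2}
```

Missing genotypes (`./.`) are treated as not non-reference here, so they drop out at the first step.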
| Tool | When to Use | Parameters | Response |
|---|---|---|---|
| MyVariant_query_variants | Batch annotation | query (rsID/HGVS) | ClinVar, dbSNP, gnomAD, CADD |
| dbsnp_get_variant_by_rsid | Population frequencies | rsid | Frequencies, clinical significance |
| gnomad_get_variant | gnomAD metadata | variant_id (CHR-POS-REF-ALT) | Basic variant info |
| EnsemblVEP_annotate_rsid | Consequence prediction | variant_id (rsID) | Transcript impact |
| Tool | When to Use | Parameters | Response |
|---|---|---|---|
| gnomad_get_sv_by_gene | SV population frequency | gene_symbol | SVs with AF, AC, AN |
| gnomad_get_sv_by_region | Regional SV search | chrom, start, end | SVs in region |
| ClinGen_dosage_by_gene | Dosage sensitivity | gene_symbol | HI/TS scores, disease |
| ClinGen_dosage_region_search | Dosage-sensitive genes in region | chromosome, start, end | All genes with HI/TS scores |
| ensembl_get_structural_variants | Known SVs from DGVa/dbVar | chrom, start, end, species | Clinical significance |
See references/annotation_guide.md for detailed tool usage examples
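Since `gnomad_get_variant` takes a `variant_id` in CHR-POS-REF-ALT form, a VCF record has to be reshaped before lookup. The helper below is a hypothetical sketch of that step; it assumes the `chr` prefix should be stripped and that a multi-allelic ALT column yields one identifier per allele.

```python
from typing import List


def gnomad_variant_ids(chrom: str, pos: int, ref: str, alt_column: str) -> List[str]:
    """Build one CHR-POS-REF-ALT identifier per ALT allele of a VCF record."""
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]                 # 'chr17' -> '17'
    return [
        f"{chrom}-{pos}-{ref}-{alt}"
        for alt in alt_column.split(",")
        if alt not in (".", "*")          # skip missing and spanning-deletion alleles
    ]


print(gnomad_variant_ids("chr17", 43094464, "G", "A,T"))
# prints: ['17-43094464-G-A', '17-43094464-G-T']
```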
# Quick summary
report = variant_analysis_pipeline("input.vcf", output_file="report.md")
# Filtered analysis
report = variant_analysis_pipeline(
    "input.vcf",
    filters=FilterCriteria(min_vaf=0.1, min_depth=20, pass_only=True),
)
# Annotated report (top 50 variants with ClinVar/gnomAD/CADD)
report = variant_analysis_pipeline("input.vcf", annotate=True, max_annotate=50)
pandas vs python_implementation: Use python_implementation for parsing/classification/annotation, then convert to DataFrame for custom aggregations:
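vcf_data = parse_vcf("input.vcf")
criteria = FilterCriteria(min_depth=20, pass_only=True)  # example thresholds; adjust per analysis
passing, _ = filter_variants(vcf_data.variants, criteria)  # second return value is unused here
df = variants_to_dataframe(passing, sample="TUMOR")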
scripts/parse_vcf.py, scripts/filter_variants.py, scripts/annotate_variants.py
QUICK_START.md