Detect copy number variations from whole genome sequencing data and generate publication-quality genome-wide CNV plots. Supports CNV calling, segmentation, and visualization for cancer genomics and rare disease analysis.
Detect copy number variations (CNVs) from whole genome sequencing (WGS) data and generate genome-wide visualization plots for cancer genomics, rare disease analysis, and population genetics studies. Provides CNV calling, segmentation analysis, and publication-ready visualization.
Key Capabilities:
✅ Use this skill when:
❌ Do NOT use when:
- structural-variant-caller
- variant-caller for small variant detection

Related Skills:
- Upstream: fastqc-report-interpreter, alignment-quality-checker, variant-caller
- Downstream: circos-plot-generator, go-kegg-enrichment, heatmap-beautifier

Upstream Skills:
- fastqc-report-interpreter: Assess sequencing quality before CNV calling; low-quality data may produce unreliable CNVs
- alignment-quality-checker: Verify BAM file quality and coverage uniformity; uneven coverage causes CNV artifacts
- variant-caller: Generate SNV/indel calls for combined CNV-SNV analysis in cancer samples

Downstream Skills:
- circos-plot-generator: Create circular genome plots integrating CNVs with other genomic features
- go-kegg-enrichment: Perform pathway enrichment on genes within CNV regions
- heatmap-beautifier: Visualize CNV profiles across multiple samples

Complete Workflow:
Raw WGS Data → fastqc-report-interpreter → alignment-quality-checker → cnv-caller-plotter → circos-plot-generator → Publication Figures
Identify genomic regions with copy number gains (amplifications) or losses (deletions) from WGS data by analyzing read depth patterns.
from scripts.main import CNVCaller
# Initialize CNV caller with bin size
caller = CNVCaller(bin_size=1000)
# Call CNVs from BAM file
cnv_calls = caller.call_cnvs(
    input_file="sample.bam",
    reference="hg38.fa"
)
# Review detected CNVs
for cnv in cnv_calls:
    print(f"{cnv['chrom']}:{cnv['start']}-{cnv['end']}")
    print(f" Copy Number: {cnv['cn']}")
    if cnv['cn'] > 2:
        print(" Type: Amplification (gain)")
    elif cnv['cn'] < 2:
        print(" Type: Deletion (loss)")
Parameters:
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| input_file | str | Yes | Path to input BAM or VCF file | None |
| reference | str | Yes | Path to reference genome FASTA | None |
| bin_size | int | No | Size of genomic bins for segmentation (bp) | 1000 |
CNV Calling Strategy:
| Approach | Best For | Sensitivity | Specificity |
|---|---|---|---|
| Read Depth Analysis | Large CNVs (>10kb) | High | Medium |
| Paired-end Mapping | Medium CNVs (1-10kb) | Medium | High |
| Split-read Analysis | Small CNVs (<1kb) | Medium | High |
| Combined Approach | Comprehensive detection | High | High |
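
As a rough illustration of the read-depth approach in the table above, the sketch below converts per-bin depths into integer copy numbers and merges adjacent bins with the same value into segments. It is not the implementation in scripts/main.py; the function name `call_cnvs_from_depth`, the median baseline, and the toy depth values are assumptions made for this example.

```python
from statistics import median

def call_cnvs_from_depth(chrom, depths, bin_size=1000, ploidy=2):
    """Turn per-bin read depths into merged copy-number segments (illustrative)."""
    baseline = median(depths)  # assumption: the median bin is copy-neutral
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    # Scale each depth by the baseline and round to the nearest integer copy number
    copy_numbers = [round(ploidy * d / baseline) for d in depths]

    segments = []
    for i, cn in enumerate(copy_numbers):
        start, end = i * bin_size, (i + 1) * bin_size
        if segments and segments[-1]["cn"] == cn:
            segments[-1]["end"] = end  # extend the current run of equal bins
        else:
            segments.append({"chrom": chrom, "start": start, "end": end, "cn": cn})
    # Report only segments that differ from the expected ploidy
    return [s for s in segments if s["cn"] != ploidy]

# Toy depths: a three-bin gain followed by a two-bin loss
depths = [30, 31, 29, 45, 46, 44, 30, 15, 14, 30]
calls = call_cnvs_from_depth("chr1", depths)
for call in calls:
    print(call)
```

Real callers normally add GC and mappability correction before this step and use a statistical segmentation method rather than simple run merging.
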
Best Practices:
Common Issues and Solutions:
Issue: False positive CNVs in repetitive regions
Issue: Low sensitivity for small CNVs
Divide the genome into windows/bins for copy number estimation, enabling systematic analysis of the entire genome.
from scripts.main import CNVCaller
# Different bin sizes for different applications
bin_configs = {
    "high_resolution": 100,    # For small CNV detection
    "standard": 1000,          # Default for WGS
    "low_resolution": 10000    # For large-scale alterations
}
for config_name, bin_size in bin_configs.items():
    caller = CNVCaller(bin_size=bin_size)
    print(f"\n{config_name} (bin_size={bin_size}bp):")
    # Calculate approximate number of bins for human genome
    genome_size = 3_000_000_000  # 3 Gb
    num_bins = genome_size // bin_size
    print(f" Estimated bins: ~{num_bins:,}")
    print(f" Resolution: {bin_size}bp")
Bin Size Selection Guide:
| Bin Size | Resolution | Use Case | Coverage Required |
|---|---|---|---|
| 100 bp | High | Small CNVs (<5kb) | >30x |
| 1000 bp | Standard | General WGS analysis | >15x |
| 10000 bp | Low | Large chromosomal alterations | >5x |
| Variable | Adaptive | Mixed resolution | >20x |
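
The coverage column in the guide above follows from how many reads land in each bin. The sketch below estimates that count and the relative noise it implies. The 150 bp read length and the Poisson noise model are assumptions for illustration, and the helper names are hypothetical.

```python
import math

def reads_per_bin(coverage, bin_size, read_length=150):
    """Expected number of reads starting in one bin (assumes uniform coverage)."""
    return coverage * bin_size / read_length

def poisson_cv(expected_reads):
    """Relative noise (coefficient of variation) of a Poisson-distributed count."""
    return 1 / math.sqrt(expected_reads)

# Coverage and bin-size pairs taken from the selection guide
for coverage, bin_size in [(30, 100), (15, 1000), (5, 10000)]:
    n = reads_per_bin(coverage, bin_size)
    print(f"{coverage}x, {bin_size} bp bins: ~{n:.0f} reads/bin, CV ~{poisson_cv(n):.2f}")
```

Under these assumptions a 100 bp bin at 30x holds only about 20 reads, so its depth fluctuates by roughly 22% from sampling noise alone, which is why small bins need deeper coverage.
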
Best Practices:
Common Issues and Solutions:
Issue: Noisy segmentation due to small bins
Issue: Missing large CNVs with large bins
Generate publication-quality plots showing copy number profiles across all chromosomes for visual interpretation and presentation.
from scripts.main import CNVCaller
caller = CNVCaller(bin_size=1000)
# Example CNV calls for plotting
cnv_calls = [
    {"chrom": "chr1", "start": 1000000, "end": 2000000, "cn": 3},    # Gain
    {"chrom": "chr7", "start": 50000000, "end": 55000000, "cn": 1},  # Loss
    {"chrom": "chr17", "start": 35000000, "end": 36000000, "cn": 4}  # High-level amplification
]
# Generate plots in different formats
output_dir = "./cnv_results"
for fmt in ["png", "pdf", "svg"]:
    plot_file = caller.plot_genome_wide(
        cnv_calls=cnv_calls,
        output_path=output_dir,
        fmt=fmt
    )
    print(f"Generated: {plot_file}")
# Plot features:
# - Genome-wide view with all chromosomes
# - Copy number on Y-axis (0-6 typical range)
# - Chromosomal position on X-axis
# - Color coding: red=loss, blue=gain, black=neutral
Output Formats:
| Format | Extension | Best For | File Size |
|---|---|---|---|
| PNG | .png | Web, presentations, quick viewing | Medium |
| PDF | .pdf | Publications, high-quality printing | Large |
| SVG | .svg | Vector editing, scalable graphics | Small |
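
A genome-wide view places every chromosome on one shared X-axis, which requires a cumulative offset per chromosome. The sketch below shows that coordinate mapping together with the red/blue/black color rule described in the plot features. The helper names and the toy chromosome sizes are assumptions; how `plot_genome_wide` does this internally is not documented here.

```python
def genome_offsets(chrom_sizes):
    """Map each chromosome to the cumulative position where its span begins."""
    offsets, total = {}, 0
    for chrom, size in chrom_sizes.items():  # relies on insertion order (Python 3.7+)
        offsets[chrom] = total
        total += size
    return offsets

def to_genome_x(cnv, offsets):
    """Convert a CNV's chromosome-local coordinates to genome-wide X positions."""
    base = offsets[cnv["chrom"]]
    return base + cnv["start"], base + cnv["end"]

def cn_color(cn, neutral=2):
    """Color rule from the plot description: red=loss, blue=gain, black=neutral."""
    if cn < neutral:
        return "red"
    if cn > neutral:
        return "blue"
    return "black"

# Toy sizes for illustration only, not real hg38 chromosome lengths
chrom_sizes = {"chr1": 1000, "chr2": 800, "chr3": 600}
offsets = genome_offsets(chrom_sizes)
span = to_genome_x({"chrom": "chr2", "start": 100, "end": 300, "cn": 3}, offsets)
print(span, cn_color(3))
```
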
Best Practices:
Common Issues and Solutions:
Issue: Plot too crowded with many CNVs
Issue: ChrY not displayed for female samples
Export CNV calls in standard BED format for compatibility with genome browsers and downstream analysis tools.
from scripts.main import CNVCaller
caller = CNVCaller()
# Example CNV calls
cnv_calls = [
    {"chrom": "chr1", "start": 1000000, "end": 2000000, "cn": 3},
    {"chrom": "chr7", "start": 50000000, "end": 55000000, "cn": 1},
]
# Export to BED format
bed_file = caller.save_bed(cnv_calls, "./output")
# BED format structure:
# chrom start end name score strand
# chr1 1000000 2000000 CN=3 . .
# chr7 50000000 55000000 CN=1 . .
print(f"BED file saved: {bed_file}")
# Read and display BED content
with open(bed_file, 'r') as f:
    print("\nBED file content:")
    for line in f:
        print(line.strip())
BED Format Specification:
| Column | Field | Description | Example |
|---|---|---|---|
| 1 | chrom | Chromosome name | chr1, chrX |
| 2 | start | Start position (0-based) | 1000000 |
| 3 | end | End position (0-based, exclusive; numerically equal to the 1-based inclusive end) | 2000000 |
| 4 | name | CNV annotation | CN=3 |
| 5 | score | Optional quality score | . |
| 6 | strand | Strand info (usually .) | . |
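
Because BED intervals are 0-based and half-open, coordinates taken from 1-based sources (VCF, most reports) need their start shifted by one before export. The sketch below shows that conversion. `to_bed_line` is a hypothetical helper, and whether `save_bed` expects 0-based or 1-based input is not stated in this document.

```python
def to_bed_line(chrom, start_1based, end_1based, cn):
    """Format a 1-based inclusive interval as a 0-based half-open BED line."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    bed_start = start_1based - 1  # shift the start to 0-based
    bed_end = end_1based          # inclusive 1-based end equals exclusive 0-based end
    fields = [chrom, str(bed_start), str(bed_end), f"CN={cn}", ".", "."]
    return "\t".join(fields)

line = to_bed_line("chr1", 1000001, 2000000, 3)
print(line)
```

The interval length is then simply `end - start`, with no off-by-one correction.
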
Best Practices:
bedtools or genome browser before distribution

Common Issues and Solutions:
Issue: BED file rejected by genome browser
Issue: Coordinate system confusion
Compare CNV profiles between tumor and matched normal samples to identify somatic copy number alterations (SCNAs).
from scripts.main import CNVCaller
caller = CNVCaller(bin_size=1000)
# Call CNVs in tumor and normal samples
tumor_cnvs = caller.call_cnvs("tumor.bam", "hg38.fa")
normal_cnvs = caller.call_cnvs("normal.bam", "hg38.fa")
# Identify somatic CNVs (present in tumor, not in normal)
def find_somatic_cnvs(tumor_calls, normal_calls):
    """Identify CNVs present in tumor but not normal."""
    somatic_cnvs = []
    for t_cnv in tumor_calls:
        is_somatic = True
        # Check if similar CNV exists in normal
        for n_cnv in normal_calls:
            if (t_cnv['chrom'] == n_cnv['chrom'] and
                    abs(t_cnv['start'] - n_cnv['start']) < 10000 and
                    abs(t_cnv['end'] - n_cnv['end']) < 10000 and
                    t_cnv['cn'] == n_cnv['cn']):
                is_somatic = False
                break
        if is_somatic:
            somatic_cnvs.append(t_cnv)
    return somatic_cnvs
somatic_cnvs = find_somatic_cnvs(tumor_cnvs, normal_cnvs)
print(f"Total tumor CNVs: {len(tumor_cnvs)}")
print(f"Somatic CNVs: {len(somatic_cnvs)}")
# Categorize somatic alterations
amplifications = [c for c in somatic_cnvs if c['cn'] > 2]
deletions = [c for c in somatic_cnvs if c['cn'] < 2]
print(f" Amplifications: {len(amplifications)}")
print(f" Deletions: {len(deletions)}")
Somatic vs Germline Classification:
| Category | Tumor CN | Normal CN | Interpretation |
|---|---|---|---|
| Somatic Amplification | >2 | 2 | Tumor-specific gain |
| Somatic Deletion | <2 | 2 | Tumor-specific loss |
| Germline CNV | ≠2 | ≠2 | Inherited CNV |
| LOH | 1 | 2 | Loss of heterozygosity |
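
Matching tumor and normal calls by breakpoint distance can miss germline CNVs whose boundaries shift between samples. A commonly used alternative is reciprocal overlap, sketched below. This is a swapped-in technique rather than what the tool does, and the 50% threshold and function names are assumptions for illustration.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their shared region."""
    if a["chrom"] != b["chrom"]:
        return 0.0
    overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a["end"] - a["start"]), overlap / (b["end"] - b["start"]))

def is_germline(tumor_cnv, normal_calls, min_overlap=0.5):
    """True if a non-diploid normal call reciprocally overlaps the tumor CNV."""
    return any(
        n["cn"] != 2 and reciprocal_overlap(tumor_cnv, n) >= min_overlap
        for n in normal_calls
    )

tumor = [
    {"chrom": "chr1", "start": 1000000, "end": 2000000, "cn": 3},
    {"chrom": "chr7", "start": 50000000, "end": 55000000, "cn": 1},
]
normal = [{"chrom": "chr1", "start": 1200000, "end": 2100000, "cn": 3}]

somatic = [t for t in tumor if not is_germline(t, normal)]
print(somatic)
```

Here the chr1 gain is treated as germline even though its start differs by 200 kb in the normal sample, a case the 10 kb breakpoint rule would label somatic.
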
Best Practices:
Common Issues and Solutions:
Issue: Normal sample contamination in tumor
Issue: Germline CNVs misclassified as somatic
Apply quality filters to remove artifactual CNV calls and improve result reliability.
from scripts.main import CNVCaller
caller = CNVCaller()
# Example raw CNV calls with QC metrics
cnv_calls = [
    {
        "chrom": "chr1", "start": 1000000, "end": 2000000, "cn": 3,
        "quality_score": 50, "supporting_reads": 150
    },
    {
        "chrom": "chr7", "start": 50000000, "end": 50001000, "cn": 0,
        "quality_score": 10, "supporting_reads": 5  # Likely artifact
    },
]
# Apply quality filters
def filter_cnvs(cnv_list, min_quality=20, min_size=1000, min_support=20):
    """Filter CNVs based on quality metrics."""
    filtered = []
    for cnv in cnv_list:
        size = cnv['end'] - cnv['start']
        quality = cnv.get('quality_score', 0)
        support = cnv.get('supporting_reads', 0)
        # Apply filters
        if quality < min_quality:
            continue
        if size < min_size:
            continue
        if support < min_support:
            continue
        filtered.append(cnv)
    return filtered
# Filter with different stringencies
for min_q in [10, 20, 30]:
    filtered = filter_cnvs(cnv_calls, min_quality=min_q)
    print(f"Quality >= {min_q}: {len(filtered)} CNVs retained")
# Additional filters to consider:
# - Exclude segmental duplications
# - Exclude centromeres and telomeres
# - Minimum number of supporting bins
# - Concordance with paired-end or split-read signals
Quality Metrics:
| Metric | Threshold | Purpose |
|---|---|---|
| Quality Score | >20 | Overall confidence in CNV call |
| Size | >1kb | Remove small artifactual calls |
| Supporting Reads | >20 | Sufficient evidence depth |
| Log2 Ratio | >0.3 (absolute value) | Signal deviates from the copy-neutral baseline |
| Mappability | >0.8 | Reliable unique mapping |
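
For reference, the log2 ratio in the table compares a segment's copy number with the diploid baseline: log2(CN / 2). The sketch below works through the values for common copy-number states and applies the 0.3 threshold to the absolute value, which is an assumption about how the threshold is meant to be used.

```python
import math

def log2_ratio(cn, ploidy=2):
    """log2 of copy number relative to the expected ploidy."""
    if cn == 0:
        return float("-inf")  # homozygous deletion: no signal at all
    return math.log2(cn / ploidy)

def passes_log2_filter(cn, threshold=0.3):
    """Keep a call only if it deviates enough from the copy-neutral baseline."""
    return abs(log2_ratio(cn)) > threshold

for cn in [0, 1, 2, 3, 4]:
    print(f"CN={cn}: log2 ratio = {log2_ratio(cn):.3f}, kept = {passes_log2_filter(cn)}")
```

A single-copy gain gives about +0.585 and a single-copy loss gives exactly -1.0, so both clear the threshold comfortably in a pure diploid sample.
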
Best Practices:
Common Issues and Solutions:
Issue: Too many low-quality CNV calls
Issue: True CNVs filtered out
From WGS data to CNV visualization:
# Step 1: Call CNVs from tumor sample
python scripts/main.py \
  --input tumor_sample.bam \
  --reference hg38.fa \
  --output tumor_cnv/ \
  --bin-size 1000 \
  --plot-format pdf
# Step 2: Call CNVs from matched normal
python scripts/main.py \
  --input normal_sample.bam \
  --reference hg38.fa \
  --output normal_cnv/ \
  --bin-size 1000
# Step 3: Compare and identify somatic CNVs
# (Use Python API for comparison logic)
# Step 4: Generate final plots
python scripts/main.py \
  --input tumor_sample.bam \
  --reference hg38.fa \
  --output final_results/ \
  --plot-format pdf
Python API Usage:
from pathlib import Path
from scripts.main import CNVCaller

def find_somatic_cnvs(tumor_calls, normal_calls, tolerance=10000):
    """Return tumor CNVs that have no matching call in the normal sample."""
    def matches(t_cnv, n_cnv):
        # Same rule as the tumor-normal comparison example:
        # same chromosome, breakpoints within `tolerance` bp, same copy number
        return (t_cnv['chrom'] == n_cnv['chrom'] and
                abs(t_cnv['start'] - n_cnv['start']) < tolerance and
                abs(t_cnv['end'] - n_cnv['end']) < tolerance and
                t_cnv['cn'] == n_cnv['cn'])
    return [t for t in tumor_calls
            if not any(matches(t, n) for n in normal_calls)]

def analyze_cancer_genome(
    tumor_bam: str,
    normal_bam: str,
    reference: str,
    output_dir: str
) -> dict:
    """
    Complete cancer genome CNV analysis workflow.
    """
    caller = CNVCaller(bin_size=1000)
    # Create output directory
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # Call CNVs in both samples
    print("Calling CNVs in tumor sample...")
    tumor_cnvs = caller.call_cnvs(tumor_bam, reference)
    print("Calling CNVs in normal sample...")
    normal_cnvs = caller.call_cnvs(normal_bam, reference)
    # Identify somatic alterations
    somatic_cnvs = find_somatic_cnvs(tumor_cnvs, normal_cnvs)
    # Generate outputs
    tumor_bed = caller.save_bed(tumor_cnvs, output_dir)
    somatic_bed = caller.save_bed(somatic_cnvs, f"{output_dir}/somatic")
    plot_file = caller.plot_genome_wide(tumor_cnvs, output_dir, "pdf")
    # Calculate statistics
    stats = {
        "total_tumor_cnvs": len(tumor_cnvs),
        "somatic_cnvs": len(somatic_cnvs),
        "amplifications": len([c for c in somatic_cnvs if c['cn'] > 2]),
        "deletions": len([c for c in somatic_cnvs if c['cn'] < 2]),
        "output_files": {
            "tumor_bed": tumor_bed,
            "somatic_bed": somatic_bed,
            "genome_plot": plot_file
        }
    }
    return stats

# Execute workflow
results = analyze_cancer_genome(
    tumor_bam="tumor.bam",
    normal_bam="normal.bam",
    reference="hg38.fa",
    output_dir="./cnv_analysis"
)
print("\nAnalysis complete!")
print(f"Total tumor CNVs: {results['total_tumor_cnvs']}")
print(f"Somatic CNVs: {results['somatic_cnvs']}")
print(f" Amplifications: {results['amplifications']}")
print(f" Deletions: {results['deletions']}")
Expected Output Files:
cnv_analysis/
├── cnv_calls.bed # All CNV calls in BED format
├── somatic/
│ └── cnv_calls.bed # Somatic CNVs only
├── cnv_plot.pdf # Genome-wide visualization
└── analysis_summary.json # Statistics and metadata
Scenario: Identify somatic copy number alterations in a cancer sample compared to matched normal tissue.
{
  "analysis_type": "cancer_genome",
  "samples": {
    "tumor": "tumor_wgs.bam",
    "normal": "blood_normal.bam"
  },
  "reference": "hg38.fa",
  "parameters": {
    "bin_size": 1000,
    "min_cnv_size": 10000,
    "plot_format": "pdf"
  },
  "expected_outputs": [
    "Somatic CNV calls (BED format)",
    "Genome-wide CNV profile plot",
    "CNV statistics and summary"
  ]
}
Workflow:
Output Example:
Somatic CNV Summary:
Total alterations: 47
Amplifications: 12 (including MYC, EGFR)
Deletions: 35 (including TP53, PTEN)
High-impact alterations:
chr8:128000000-129000000 CN=8 (MYC amplification)
chr17:7000000-8000000 CN=0 (TP53 deletion)
Scenario: Detect pathogenic CNVs in a patient with suspected genomic disorder.
{
  "analysis_type": "rare_disease",
  "sample": "patient.bam",
  "reference": "hg38.fa",
  "parameters": {
    "bin_size": 500,
    "min_cnv_size": 1000,
    "max_frequency": 0.01
  },
  "annotation": [
    "OMIM genes",
    "ClinVar pathogenic variants",
    "Decipher syndromes"
  ]
}
Workflow:
Output Example:
Rare CNV Findings:
chr22:19000000-21000000 CN=1 (22q11.2 deletion syndrome)
Size: 2.0 Mb
Genes: TBX1, COMT, etc.
Frequency: <0.1% in population
Phenotype match: Cardiac, thymic, facial anomalies
Classification: Pathogenic
Scenario: Compare CNV profiles across multiple samples to identify recurrent alterations.
{
  "analysis_type": "population",
  "samples": [
    "sample1.bam", "sample2.bam", "sample3.bam",
    ...
  ],
  "cohorts": {
    "cases": 50,
    "controls": 50
  },
  "parameters": {
    "bin_size": 1000,
    "plot_format": "png"
  },
  "analysis": [
    "Recurrent CNV detection",
    "Burden analysis",
    "Association testing"
  ]
}
Workflow:
Output Example:
Population CNV Analysis:
Samples analyzed: 100
Total CNVs detected: 2,847
Recurrent alterations:
chr1:1000000-2000000: 23% frequency
chr16:15000000-16000000: 18% frequency
Case vs Control association:
Significant enrichment: 3 CNV regions
Most significant: chr8:128000000-129000000 (p=0.001)
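
The recurrence percentages in a summary like this one come down to counting how many samples carry a CNV overlapping a given region. A minimal sketch of that count follows. The function name, the any-overlap rule, and the toy cohort are assumptions for illustration and do not reproduce the figures above.

```python
def recurrence_frequency(region, sample_calls):
    """Fraction of samples with at least one CNV overlapping the region."""
    def overlaps(cnv):
        return (cnv["chrom"] == region["chrom"]
                and cnv["start"] < region["end"]
                and cnv["end"] > region["start"])

    if not sample_calls:
        raise ValueError("no samples provided")
    hits = sum(1 for calls in sample_calls.values() if any(overlaps(c) for c in calls))
    return hits / len(sample_calls)

# Toy cohort of four samples (hypothetical calls)
sample_calls = {
    "sample1": [{"chrom": "chr1", "start": 1000000, "end": 2000000, "cn": 3}],
    "sample2": [{"chrom": "chr1", "start": 1500000, "end": 2500000, "cn": 3}],
    "sample3": [{"chrom": "chr16", "start": 15000000, "end": 16000000, "cn": 1}],
    "sample4": [],
}
region = {"chrom": "chr1", "start": 1000000, "end": 2000000}
freq = recurrence_frequency(region, sample_calls)
print(f"{region['chrom']}:{region['start']}-{region['end']}: {freq:.0%} frequency")
```
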
Scenario: Characterize CNV profile of a cancer cell line for research or quality control.
{
  "analysis_type": "cell_line",
  "sample": "mcf7_cell_line.bam",
  "reference": "hg38.fa",
  "parameters": {
    "bin_size": 1000,
    "plot_format": "pdf"
  },
  "comparison": {
    "reference_profile": "mcf7_ccle_cnvs.bed",
    "expected_alterations": ["chr8_MYC_amp", "chr20_ZNF217_amp"]
  }
}
Workflow:
Output Example:
Cell Line: MCF-7
Identity confirmed: Yes (99.2% match to reference)
Expected alterations detected:
chr8:128000000-129000000: CN=8 (MYC) ✓
chr20:50000000-52000000: CN=6 (ZNF217) ✓
Additional alterations:
chr17:35000000-37000000: CN=3 (ERBB2) ✓
Ploidy: 2.8 (aneuploid)
Genome instability score: High
Pre-analysis Checks:
During Analysis:
Post-analysis Verification:
Before Clinical or Publication Use:
Input Data Issues:
❌ Using low coverage data → Noisy CNV calls with many false positives
❌ Mismatched reference genomes → CNVs called in wrong coordinates
❌ Not using matched normal for tumors → Cannot distinguish somatic vs germline
❌ Poor coverage uniformity → GC bias causes false CNVs
Analysis Parameter Issues:
❌ Bin size too large → Miss small CNVs (<10kb)
❌ Bin size too small → Excessive noise in low coverage regions
❌ Inadequate quality filtering → Too many false positive CNVs
❌ Not filtering common CNVs → Report common polymorphisms as pathogenic
Interpretation Issues:
❌ Ignoring tumor purity → Misinterpret subclonal CNVs
❌ Not validating key findings → Report false positive driver alterations
❌ Over-interpreting small CNVs → Single-exon deletions are often artifacts
❌ Ignoring parental data → Cannot determine inheritance in rare disease
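
To see why ignoring tumor purity distorts interpretation, consider how normal-cell admixture dilutes the observed copy number. The sketch below uses a simple two-population mixture model. It assumes diploid normal cells and a single tumor clone, and the function names are hypothetical.

```python
import math

def expected_cn(tumor_cn, purity, normal_cn=2):
    """Average copy number observed in a tumor/normal cell mixture."""
    return purity * tumor_cn + (1 - purity) * normal_cn

def expected_log2(tumor_cn, purity, normal_cn=2):
    """Observed log2 ratio for a tumor copy number at a given purity."""
    mixed = expected_cn(tumor_cn, purity, normal_cn)
    return math.log2(mixed / normal_cn) if mixed > 0 else float("-inf")

def purity_corrected_cn(observed_cn, purity, normal_cn=2):
    """Invert the mixture model to recover the tumor-cell copy number."""
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    return (observed_cn - (1 - purity) * normal_cn) / purity

# A single-copy loss at 30% purity
observed = expected_cn(tumor_cn=1, purity=0.3)
print(f"Observed CN: {observed:.2f}, log2 ratio: {expected_log2(1, 0.3):.3f}")
print(f"Corrected CN: {purity_corrected_cn(observed, 0.3):.2f}")
```

At 30% purity a clonal single-copy loss appears as CN 1.7, a log2 ratio of about -0.23, which a 0.3 log2 cutoff would discard unless the values are purity-corrected first.
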
Output and Reporting Issues:
❌ Unclear coordinate system → Confusion between 0-based and 1-based
❌ Missing quality metrics → Cannot assess confidence in CNV calls
❌ Not archiving raw data → Results cannot be reproduced
❌ Inadequate documentation → Others cannot interpret results
Problem: No CNVs detected
Problem: Too many CNV calls (hundreds or thousands)
Problem: False positives in repetitive regions
Problem: CNV signals too weak in tumor samples
Problem: Sex chromosomes have unexpected copy numbers
Problem: Batch effects in multi-sample analysis
Problem: Cannot install or run tool
pip install pysam numpy matplotlib pandas
samtools faidx reference.fa
sample.bam.bai

Available in references/ directory:
External Resources:
Located in scripts/ directory:
main.py - Main CNV calling and plotting engine

| Method | Input | Sensitivity | Resolution | Best For |
|---|---|---|---|---|
| Read Depth (this tool) | BAM | Medium | 1-10 kb | Large CNVs, WGS |
| Paired-end Mapping | BAM | Medium | 100bp-10kb | Deletions, insertions |
| Split-read Analysis | BAM | High | 1bp-1kb | Breakpoint detection |
| SNP Array | CEL/IDAT | High | 5-25kb | Cost-effective screening |
| Optical Mapping | Bionano | High | 500bp+ | Very large SVs |
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| --input, -i | string | - | Yes | Input BAM/VCF file |
| --reference, -r | string | - | Yes | Reference genome FASTA |
| --output, -o | string | ./cnv_output | No | Output directory |
| --bin-size | int | 1000 | No | Bin size for analysis |
| --plot-format | string | png | No | Plot format (png, pdf, svg) |
# Call CNVs from BAM file
python scripts/main.py --input sample.bam --reference hg38.fa
# Custom output directory and bin size
python scripts/main.py --input sample.bam --reference hg38.fa --output ./results --bin-size 500
# Generate PDF plots
python scripts/main.py --input sample.bam --reference hg38.fa --plot-format pdf
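
For anyone wrapping the tool in their own scripts, the parameter table can be mirrored with `argparse` as sketched below. This reconstructs the interface from the table only. How scripts/main.py actually parses its arguments is not shown in this document, so treat `build_parser` as a hypothetical stand-in.

```python
import argparse

def build_parser():
    """Argument parser mirroring the documented flags and defaults (sketch)."""
    parser = argparse.ArgumentParser(description="CNV calling and plotting (sketch)")
    parser.add_argument("--input", "-i", required=True, help="Input BAM/VCF file")
    parser.add_argument("--reference", "-r", required=True, help="Reference genome FASTA")
    parser.add_argument("--output", "-o", default="./cnv_output", help="Output directory")
    parser.add_argument("--bin-size", type=int, default=1000, help="Bin size for analysis (bp)")
    parser.add_argument("--plot-format", choices=["png", "pdf", "svg"], default="png",
                        help="Plot format")
    return parser

# Parse an explicit argument list instead of sys.argv so the example is reproducible
args = build_parser().parse_args(
    ["--input", "sample.bam", "--reference", "hg38.fa", "--bin-size", "500"]
)
print(args.input, args.reference, args.output, args.bin_size, args.plot_format)
```
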
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python script executed locally | Low |
| Network Access | No external API calls | Low |
| File System Access | Read BAM/VCF, write results | Low |
| Data Exposure | Processes genomic data | Medium |
| PHI Risk | May process patient genetic data | High |
# Python 3.7+
# No additional packages required (uses standard library)
Last Updated: 2026-02-09
Skill ID: 162
Version: 2.0 (K-Dense Standard)