Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.
Reference examples tested with: bcftools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- pip show <package> then help(module.function) to check signatures
- <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Left-align indels and split multiallelic sites using bcftools norm.
The same variant can be represented multiple ways:
# Same deletion, different representations
chr1 100 ATCG A (right-aligned)
chr1 100 ATC A (left-aligned, normalized)
chr1 101 TCG T (different position)
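Normalization ensures consistent representation for comparing calls from different callers, annotating against variant databases, and preparing VCFs for downstream analysis.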
Goal: Left-align indels and check reference allele consistency.
Approach: Use bcftools norm with a reference FASTA to shift indels to the leftmost position and optionally fix/exclude REF mismatches.
"Normalize my VCF before comparing callers" → Left-align indel representations and split multiallelic sites for consistent variant comparison.
bcftools norm -f reference.fa input.vcf.gz -Oz -o normalized.vcf.gz
Requires reference FASTA to determine left-most representation.
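To make the role of the reference concrete, here is a minimal Python sketch of left-alignment for a single biallelic indel. It is a simplified illustration of the idea, not bcftools' implementation; the helper name `left_align` and the toy reference string are assumptions made for this example.

```python
def left_align(pos, ref, alt, seq):
    """Left-align and trim one biallelic variant.

    pos is 1-based; seq is the reference sequence of the contig.
    Hypothetical helper, for illustration only.
    """
    changed = True
    while changed:
        changed = False
        # Drop a base shared at the right end of both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in the reference base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the contig start")
            pos -= 1
            base = seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference (positions 1-9): T G C A C A C A T
reference = "TGCACACAT"

# Two ways of writing the same 2-bp "CA" deletion in the repeat
print(left_align(6, "ACA", "A", reference))    # (2, 'GCA', 'G')
print(left_align(4, "ACAC", "AC", reference))  # (2, 'GCA', 'G')
```

Both inputs collapse to the same record because the reference bases to the left are what determine how far the deletion can shift, which is why the FASTA is needed.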
bcftools norm -f reference.fa -c w input.vcf.gz > /dev/null
# Warns about REF allele mismatches on stderr without changing any records
Check modes (-c):
- w - Warn on mismatch
- e - Error on mismatch (default)
- x - Exclude mismatches
- s - Set correct REF from reference

Goal: Convert multiallelic sites to biallelic records or vice versa.
Approach: Use bcftools norm -m flags to split (decompose) or join (merge) multiallelic records.
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz
Before:
chr1 100 . A G,T 30 PASS . GT 1/2
After:
chr1 100 . A G 30 PASS . GT 1/0
chr1 100 . A T 30 PASS . GT 0/1
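The genotype recoding in the Before/After example can be sketched in a few lines of Python. This is a simplified model that only handles the GT field and assumes that alleles other than the record's own ALT are written as 0, as in the example above; the function name `split_genotype` is hypothetical.

```python
def split_genotype(gt, n_alt):
    """Return one GT string per ALT allele of a multiallelic site."""
    sep = "|" if "|" in gt else "/"   # preserve phasing separator
    alleles = gt.split(sep)
    out = []
    for alt_index in range(1, n_alt + 1):
        recoded = []
        for allele in alleles:
            if allele == ".":
                recoded.append(".")   # keep missing calls missing
            elif int(allele) == alt_index:
                recoded.append("1")   # the ALT carried by this record
            else:
                recoded.append("0")   # REF, or an ALT moved to another record
        out.append(sep.join(recoded))
    return out


print(split_genotype("1/2", 2))  # ['1/0', '0/1']
```

Other per-allele fields (for example AD or PL) need their own handling, which this sketch does not attempt.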
bcftools norm -m-snps input.vcf.gz -Oz -o split_snps.vcf.gz
bcftools norm -m-indels input.vcf.gz -Oz -o split_indels.vcf.gz
bcftools norm -m+any input.vcf.gz -Oz -o merged.vcf.gz
| Option | Description |
|---|---|
| -m-any | Split all multiallelic sites |
| -m-snps | Split multiallelic SNPs only |
| -m-indels | Split multiallelic indels only |
| -m-both | Split SNPs and indels separately |
| -m+any | Join biallelic sites into multiallelic |
| -m+snps | Join biallelic SNPs |
| -m+indels | Join biallelic indels |
| -m+both | Join SNPs and indels separately |
Goal: Left-align indels and split multiallelic sites in a single pass.
Approach: Combine -f (reference) and -m-any (split) flags in one bcftools norm invocation.
bcftools norm -f reference.fa -m-any input.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
This left-aligns indels, splits multiallelic sites into biallelic records, and indexes the output.
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz -Oz -o normalized.vcf.gz
Duplicate removal options (-d):
- exact - Remove exact duplicates
- snps - Remove duplicate SNPs
- indels - Remove duplicate indels
- both - Remove duplicate SNPs and indels
- all - Remove all duplicates
- none - Keep duplicates (default)

Goal: Correct or remove variants whose REF allele does not match the reference genome.
Approach: Use bcftools norm -c with mode s (set correct REF) or x (exclude mismatches).
bcftools norm -f reference.fa -c s input.vcf.gz -Oz -o fixed.vcf.gz
This sets REF alleles to match the reference genome.
bcftools norm -f reference.fa -c x input.vcf.gz -Oz -o clean.vcf.gz
Removes variants where REF doesn't match reference.
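The behaviour of the check modes can be illustrated with a small Python model. It is a sketch under simplifying assumptions: records are plain `(pos, ref, alt)` tuples, the reference is a single string, and mode `s` simply overwrites REF, whereas the real tool does additional bookkeeping. The function name `check_ref` is hypothetical.

```python
import sys


def check_ref(records, seq, mode):
    """Apply an e/w/x/s style REF check to (pos, ref, alt) tuples."""
    kept = []
    for pos, ref, alt in records:
        expected = seq[pos - 1 : pos - 1 + len(ref)]
        if expected.upper() == ref.upper():
            kept.append((pos, ref, alt))
        elif mode == "e":
            raise ValueError(f"REF mismatch at {pos}: {ref} vs {expected}")
        elif mode == "w":
            print(f"REF mismatch at {pos}: {ref} vs {expected}", file=sys.stderr)
            kept.append((pos, ref, alt))       # warn, keep record unchanged
        elif mode == "x":
            continue                           # drop the record
        elif mode == "s":
            kept.append((pos, expected, alt))  # replace REF with reference bases
        else:
            raise ValueError(f"unknown mode: {mode}")
    return kept


reference = "TGCACACAT"
records = [(2, "G", "A"), (3, "T", "G")]   # second record has a wrong REF

print(check_ref(records, reference, "x"))  # [(2, 'G', 'A')]
print(check_ref(records, reference, "s"))  # [(2, 'G', 'A'), (3, 'C', 'G')]
```

A mismatch often signals that the VCF was called against a different reference build, so excluding or rewriting records is worth doing only after confirming the reference is the right one.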
Goal: Decompose multi-nucleotide polymorphisms (MNPs) into individual SNP records.
Approach: Use bcftools norm --atomize to break complex substitutions into atomic single-base changes.
bcftools norm --atomize input.vcf.gz -Oz -o atomized.vcf.gz
Before:
chr1 100 . ATG GCA 30 PASS
After:
chr1 100 . A G 30 PASS
chr1 101 . T C 30 PASS
chr1 102 . G A 30 PASS
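The decomposition shown above can be sketched in Python for the equal-length case. This is an illustration only, with a hypothetical helper name `atomize_mnp`; it skips positions where REF and ALT agree and does not cover complex events that mix substitutions with indels.

```python
def atomize_mnp(pos, ref, alt):
    """Split an equal-length substitution into single-base changes."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length substitutions")
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # matching bases produce no record
    ]


print(atomize_mnp(100, "ATG", "GCA"))
# [(100, 'A', 'G'), (101, 'T', 'C'), (102, 'G', 'A')]
```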
bcftools norm -f reference.fa --atomize input.vcf.gz -Oz -o atomized.vcf.gz
bcftools norm --old-rec-tag OLD input.vcf.gz -Oz -o updated.vcf.gz
Records the original variant in an INFO tag (here OLD) on each record that normalization modifies, so changes can be traced back.
Goal: Apply normalization as a preprocessing step for downstream analyses.
Approach: Normalize both VCFs identically before comparison, annotation, or GWAS preparation.
# Normalize both VCFs the same way
for vcf in caller1.vcf.gz caller2.vcf.gz; do
    base=$(basename "$vcf" .vcf.gz)
    bcftools norm -f reference.fa -m-any "$vcf" -Oz -o "${base}.norm.vcf.gz"
    bcftools index "${base}.norm.vcf.gz"
done
# Now compare
bcftools isec -p comparison caller1.norm.vcf.gz caller2.norm.vcf.gz
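Conceptually, an exact-allele comparison treats each biallelic record as a `(CHROM, POS, REF, ALT)` key, which is why both inputs must be normalized the same way first. The Python sketch below shows that idea with made-up records; it is not how `bcftools isec` is implemented, and the helper name `variant_keys` is hypothetical.

```python
def variant_keys(records):
    """Build a set of (chrom, pos, ref, alt) keys from biallelic records."""
    return {(chrom, pos, ref, alt) for chrom, pos, ref, alt in records}


# Made-up, already-normalized call sets for illustration
caller1 = [("chr1", 2, "GCA", "G"), ("chr1", 7, "C", "T")]
caller2 = [("chr1", 2, "GCA", "G"), ("chr1", 9, "T", "A")]

keys1, keys2 = variant_keys(caller1), variant_keys(caller2)
shared = keys1 & keys2   # called by both
only1 = keys1 - keys2    # private to caller1
only2 = keys2 - keys1    # private to caller2

print(len(shared), len(only1), len(only2))  # 1 1 1
```

If one caller had reported the deletion as `chr1 6 ACA A` instead, the keys would differ and the same event would be counted as private to each caller.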
bcftools norm -f reference.fa -m-any variants.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
# Now annotate against dbSNP, ClinVar, etc.
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz | \
bcftools view -v snps -Oz -o gwas_ready.vcf.gz
bcftools index gwas_ready.vcf.gz
Goal: Assess how many variants require normalization before running bcftools norm.
Approach: Iterate with cyvcf2 and count multiallelic sites and complex (MNP) variants.
from cyvcf2 import VCF

def needs_normalization(variant):
    # Multiallelic sites should be split
    if len(variant.ALT) > 1:
        return True
    # Records with no ALT allele have nothing to normalize
    if not variant.ALT:
        return False
    # Equal-length, multi-base REF/ALT pairs are potential MNPs
    ref, alt = variant.REF, variant.ALT[0]
    if len(ref) > 1 and len(ref) == len(alt):
        return True
    return False

count = 0
for variant in VCF('input.vcf.gz'):
    if needs_normalization(variant):
        count += 1
print(f'Variants needing normalization: {count}')
from cyvcf2 import VCF

multiallelic = 0
total = 0
for variant in VCF('input.vcf.gz'):
    total += 1
    if len(variant.ALT) > 1:
        multiallelic += 1

print(f'Total variants: {total}')
print(f'Multiallelic sites: {multiallelic}')
# Skip the percentage on an empty VCF to avoid dividing by zero
if total:
    print(f'Percentage: {multiallelic/total*100:.1f}%')
| Task | Command |
|---|---|
| Left-align indels | bcftools norm -f ref.fa in.vcf.gz |
| Split multiallelic | bcftools norm -m-any in.vcf.gz |
| Join to multiallelic | bcftools norm -m+any in.vcf.gz |
| Full normalization | bcftools norm -f ref.fa -m-any in.vcf.gz |
| Fix REF alleles | bcftools norm -f ref.fa -c s in.vcf.gz |
| Remove duplicates | bcftools norm -d exact in.vcf.gz |
| Atomize MNPs | bcftools norm --atomize in.vcf.gz |
| Error | Cause | Solution |
|---|---|---|
| REF does not match | Wrong reference | Use same reference as caller |
| not sorted | Unsorted input | Run bcftools sort first |
| duplicate records | Same position twice | Use -d to remove |