Merge, concatenate, sort, intersect, and subset VCF files using bcftools. Use when combining variant files, comparing call sets, or restructuring VCF data.
Reference examples tested with: GATK 4.5+, bcftools 1.19+
Before using code patterns, verify that the installed versions match. If versions differ:
- pip show <package> then help(module.function) to check signatures
- <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
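The version check can be scripted. The following is a minimal sketch, assuming that the first line of `bcftools --version` has the form `bcftools 1.19` (optionally followed by a suffix); `parse_bcftools_version` and `meets_minimum` are hypothetical helper names, not part of any library.

```python
import re
import shutil
import subprocess


def parse_bcftools_version(text):
    """Return (major, minor) parsed from the first line of `bcftools --version`."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"bcftools\s+(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"unrecognised version line: {first_line!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=(1, 19)):
    """Compare a (major, minor) tuple against the version the examples were tested with."""
    return version >= minimum


if __name__ == "__main__":
    if shutil.which("bcftools") is None:
        print("bcftools not found on PATH")
    else:
        result = subprocess.run(
            ["bcftools", "--version"], capture_output=True, text=True, check=True
        )
        version = parse_bcftools_version(result.stdout)
        status = "OK" if meets_minimum(version) else "older than the tested 1.19"
        print(version, status)
```

Comparing tuples rather than strings avoids ranking 1.9 above 1.19.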
Merge, concat, sort, and compare VCF files using bcftools.
| Operation | Command | Use Case |
|---|---|---|
| Merge | bcftools merge | Combine samples from multiple VCFs |
| Concat | bcftools concat | Combine regions from multiple VCFs |
| Sort | bcftools sort | Sort unsorted VCF |
| Intersect | bcftools isec | Compare/intersect call sets |
| Subset | bcftools view | Extract samples or regions |
Goal: Combine VCF files from different samples into a single multi-sample VCF.
Approach: Use bcftools merge to join files with different sample columns at shared genomic positions.
"Merge my per-sample VCFs into one file" → Combine variant records from multiple samples into a single multi-sample VCF.
Combine multiple VCF files with different samples at the same positions.
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
bcftools merge *.vcf.gz -Oz -o all_samples.vcf.gz
# files.txt: one VCF path per line
bcftools merge -l files.txt -Oz -o merged.vcf.gz
# Output missing genotypes as ./. (default)
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
# Output missing genotypes as homozygous reference (0/0); only appropriate when an absent record means no variant was present
bcftools merge --missing-to-ref sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
When sample names conflict:
bcftools merge --force-samples sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
bcftools merge -r chr1:1000000-2000000 sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
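To make the merge semantics concrete, here is a toy Python model of what the commands above do to genotype columns. It is a conceptual sketch only, not how bcftools is implemented: it keys records on an exact (chrom, pos, ref, alt) tuple and ignores allele reconciliation, INFO, and FORMAT handling. The function name `merge_samples` and the input layout are assumptions made for illustration.

```python
def merge_samples(inputs, missing_to_ref=False):
    """Toy merge of single-sample call sets.

    inputs: list of (sample_name, {site: genotype}) pairs,
            where site is a (chrom, pos, ref, alt) tuple.
    Returns (sample_names, {site: [genotype per sample]}).
    """
    names = [name for name, _ in inputs]
    if len(set(names)) != len(names):
        # bcftools merge stops here unless --force-samples is given
        raise ValueError("duplicate sample names")
    # ./. by default, 0/0 when mimicking --missing-to-ref
    fill = "0/0" if missing_to_ref else "./."
    sites = sorted({site for _, calls in inputs for site in calls})
    rows = {site: [calls.get(site, fill) for _, calls in inputs] for site in sites}
    return names, rows


if __name__ == "__main__":
    s1 = ("sample1", {("chr1", 100, "A", "G"): "0/1"})
    s2 = ("sample2", {("chr1", 100, "A", "G"): "1/1", ("chr1", 200, "C", "T"): "0/1"})
    names, rows = merge_samples([s1, s2])
    print(names)
    for site, genotypes in rows.items():
        print(site, genotypes)
```

The output has one column per sample and one row per site seen in any input, which is the distinction from concat: merge widens the file, concat lengthens it.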
Goal: Concatenate VCF files that cover different genomic regions for the same samples.
Approach: Use bcftools concat to join region-split files (e.g., per-chromosome VCFs) in order.
Combine VCF files with same samples from different regions.
bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz -Oz -o genome.vcf.gz
bcftools concat chr*.vcf.gz -Oz -o genome.vcf.gz
# files.txt: one VCF path per line (in order)
bcftools concat -f files.txt -Oz -o concatenated.vcf.gz
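Order matters when building the file list: a shell glob such as chr*.vcf.gz expands lexicographically, so chr10 comes before chr2. The sketch below shows one way to produce a naturally ordered files.txt in Python. `natural_key`, `ordered_for_concat`, and `write_file_list` are hypothetical helpers, and the ordering is a heuristic; the order that ultimately matters is the contig order in the VCF header.

```python
import re


def natural_key(path):
    """Split a name into text and integer chunks so 'chr2' sorts before 'chr10'."""
    chunks = re.split(r"(\d+)", path)
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(chunk) if index % 2 else chunk for index, chunk in enumerate(chunks)]


def ordered_for_concat(paths):
    """Return the paths in natural (numeric-aware) order."""
    return sorted(paths, key=natural_key)


def write_file_list(paths, out_path="files.txt"):
    """Write one path per line, in natural order, for bcftools concat -f."""
    with open(out_path, "w") as handle:
        for path in ordered_for_concat(paths):
            handle.write(path + "\n")


if __name__ == "__main__":
    print(ordered_for_concat(["chr10.vcf.gz", "chr2.vcf.gz", "chrX.vcf.gz", "chr1.vcf.gz"]))
```

Names without digits (chrX, chrY, chrM) sort alphabetically after the numbered chromosomes, so check their position against the header before concatenating.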
bcftools concat -a chr1_part1.vcf.gz chr1_part2.vcf.gz -Oz -o chr1.vcf.gz
bcftools concat -a -d all file1.vcf.gz file2.vcf.gz -Oz -o merged.vcf.gz
Options for -d:
- snps - Remove duplicate SNPs
- indels - Remove duplicate indels
- both - Remove duplicate SNPs and indels
- all - Remove all duplicates
- exact - Remove exact duplicates only

Goal: Sort a VCF file by chromosome and position.
Approach: Use bcftools sort with optional temp directory and memory limits for large files.
Sort VCF by chromosome and position.
bcftools sort input.vcf -Oz -o sorted.vcf.gz
For large files:
bcftools sort -T /tmp input.vcf.gz -Oz -o sorted.vcf.gz
bcftools sort -m 4G input.vcf.gz -Oz -o sorted.vcf.gz
Goal: Identify shared and private variants between two or more VCF files.
Approach: Use bcftools isec to partition variants into private-to-each-file and shared subsets.
"Find variants called by both GATK and bcftools" → Intersect two call sets to identify concordant and discordant variants.
Intersect and compare VCF files.
bcftools isec -p output_dir sample1.vcf.gz sample2.vcf.gz
Creates:
- 0000.vcf - Private to sample1
- 0001.vcf - Private to sample2
- 0002.vcf - Shared (sample1 records)
- 0003.vcf - Shared (sample2 records)

bcftools isec -p output_dir -Oz sample1.vcf.gz sample2.vcf.gz
bcftools isec -p output_dir -n=2 sample1.vcf.gz sample2.vcf.gz
# Only outputs variants present in exactly 2 files
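The four output files correspond to simple set operations. As an illustration only, the toy Python function below partitions two collections of (chrom, pos, ref, alt) keys the same way; `isec_partition` is a hypothetical name, and real isec output keeps full records, so 0002 and 0003 can differ in INFO and FORMAT content even though their keys match.

```python
def isec_partition(first, second):
    """Partition two collections of (chrom, pos, ref, alt) keys like the isec output files."""
    set_a, set_b = set(first), set(second)
    shared = sorted(set_a & set_b)
    return {
        "0000": sorted(set_a - set_b),  # private to the first file
        "0001": sorted(set_b - set_a),  # private to the second file
        "0002": shared,                 # shared, records as they appear in the first file
        "0003": list(shared),           # shared, records as they appear in the second file
    }


if __name__ == "__main__":
    calls_a = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]
    calls_b = [("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")]
    for name, keys in isec_partition(calls_a, calls_b).items():
        print(name, keys)
```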
| Flag | Description |
|---|---|
| -n=2 | Present in exactly 2 files |
| -n+2 | Present in 2 or more files |
| -n-2 | Present in 2 or fewer files |
| -n~11 | Boolean: file1 AND file2 |
| -n~10 | Boolean: file1 AND NOT file2 |
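The -n selectors can be read as predicates over a per-site presence pattern. The sketch below models that reading in Python, following the bcftools manual's wording that `=` means this many files, `+` this many or more, `-` this many or fewer, and `~` an exact presence bitmap. `nfiles_match` is a hypothetical helper for reasoning about the flags, not a bcftools interface.

```python
def nfiles_match(spec, presence):
    """Evaluate an isec -n selector against a presence pattern.

    spec: the text after -n, e.g. '=2', '+2', '-2', '~10'.
    presence: string of '0'/'1' flags, one per input file, in command-line order.
    """
    operator, value = spec[0], spec[1:]
    count = presence.count("1")
    if operator == "=":
        return count == int(value)
    if operator == "+":
        return count >= int(value)
    if operator == "-":
        return count <= int(value)
    if operator == "~":
        return presence == value
    raise ValueError(f"unknown selector: {spec!r}")


if __name__ == "__main__":
    for spec in ("=2", "+2", "-2", "~10"):
        print(spec, [p for p in ("10", "01", "11") if nfiles_match(spec, p)])
```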
# Variants in both files
bcftools isec -n=2 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o shared.vcf.gz
# Variants only in sample1
bcftools isec -n~10 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o only_sample1.vcf.gz
# Variants in file1 not in file2
bcftools isec -C -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o unique.vcf.gz
Goal: Extract a subset of samples or regions from a multi-sample VCF.
Approach: Use bcftools view with -s (samples) or -r/-R (regions) flags to create targeted subsets.
bcftools view -s sample1,sample2 input.vcf.gz -Oz -o subset.vcf.gz
bcftools view -s ^sample3 input.vcf.gz -Oz -o without_sample3.vcf.gz
# samples.txt: one sample name per line
bcftools view -S samples.txt input.vcf.gz -Oz -o subset.vcf.gz
bcftools view -r chr1:1000000-2000000 input.vcf.gz -Oz -o region.vcf.gz
bcftools view -R regions.bed input.vcf.gz -Oz -o targets.vcf.gz
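The two region flags use different coordinate conventions: a region given to -r is 1-based and inclusive, while a file with a .bed extension passed to -R is read as 0-based, half-open. The following Python sketch converts one to the other; `region_to_bed` is a hypothetical helper name.

```python
def region_to_bed(region):
    """Convert 'chr1:1000000-2000000' (1-based, inclusive) to a tab-separated BED line."""
    # rpartition keeps contig names that themselves contain ':' intact
    chrom, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not chrom or start < 1 or end < start:
        raise ValueError(f"invalid region: {region!r}")
    # BED start is 0-based; BED end is exclusive, so it equals the 1-based inclusive end
    return f"{chrom}\t{start - 1}\t{end}"


if __name__ == "__main__":
    print(region_to_bed("chr1:1000000-2000000"))
```

Forgetting the shift drops or adds one base at the start of each interval, which is easy to miss on large targets.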
Goal: Rename sample columns in a VCF header.
Approach: Use bcftools reheader with a mapping file of old-to-new sample names.
echo "old_name new_name" > rename.txt
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
# rename.txt format: old_name new_name
cat > rename.txt << EOF
sample1 patient_001
sample2 patient_002
sample3 patient_003
EOF
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
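For larger cohorts the mapping file is easier to generate than to type. This is a minimal Python sketch that builds the old_name new_name lines from a dictionary and rejects names that would break the whitespace-separated format; `rename_lines` is a hypothetical helper, and the sample names are the ones used above.

```python
def rename_lines(mapping):
    """Return the text of a reheader mapping file: one 'old_name new_name' pair per line."""
    new_names = list(mapping.values())
    if len(set(new_names)) != len(new_names):
        raise ValueError("new sample names must be unique")
    lines = []
    for old, new in mapping.items():
        for name in (old, new):
            # whitespace separates the two columns, so it cannot appear unescaped in a name
            if not name or any(ch.isspace() for ch in name):
                raise ValueError(f"invalid sample name: {name!r}")
        lines.append(f"{old} {new}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    mapping = {"sample1": "patient_001", "sample2": "patient_002", "sample3": "patient_003"}
    with open("rename.txt", "w") as handle:
        handle.write(rename_lines(mapping))
```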
Goal: Split a multi-sample or multi-chromosome VCF into separate files.
Approach: Iterate over samples or chromosomes and extract each with bcftools view.
for sample in $(bcftools query -l input.vcf.gz); do
bcftools view -s "$sample" input.vcf.gz -Oz -o "${sample}.vcf.gz"
done
for chr in $(bcftools view -h input.vcf.gz | grep "^##contig" | sed 's/.*ID=\([^,]*\).*/\1/'); do
bcftools view -r "$chr" input.vcf.gz -Oz -o "${chr}.vcf.gz"
done
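The sed expression in the chromosome loop assumes each ##contig line has a comma after the ID; a line such as ##contig=<ID=chrM> would yield chrM> instead. As a sketch of a more tolerant extraction, the Python function below parses contig IDs from header lines; `contig_ids` is a hypothetical helper that would be fed the output of bcftools view -h.

```python
import re

# Match the ID field whether it is first, last, or the only attribute
CONTIG_ID = re.compile(r"^##contig=<(?:.*,)?ID=([^,>]+)")


def contig_ids(header_lines):
    """Return contig IDs, in header order, from an iterable of VCF header lines."""
    ids = []
    for line in header_lines:
        match = CONTIG_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids


if __name__ == "__main__":
    header = [
        "##fileformat=VCFv4.2",
        "##contig=<ID=chr1,length=248956422>",
        "##contig=<ID=chrM>",
        "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Depth\">",
    ]
    print(contig_ids(header))
```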
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz
Goal: Execute typical multi-step VCF manipulation tasks.
Approach: Chain merge, concat, isec, and view operations for cohort assembly, caller comparison, and filtering.
# Create file list
ls *.vcf.gz > files.txt
# Merge all samples
bcftools merge -l files.txt -Oz -o cohort.vcf.gz
bcftools index cohort.vcf.gz
# After parallel variant calling by chromosome
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz -Oz -o genome.vcf.gz
bcftools index genome.vcf.gz
# Find variants called by both GATK and bcftools
bcftools isec -p comparison gatk.vcf.gz bcftools.vcf.gz
# Count records in each output file, excluding header lines
grep -vc '^#' comparison/*.vcf
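Because every VCF starts with header lines beginning with #, a plain line count overstates the number of variants. The Python sketch below counts only data records; `count_records` and `count_records_in_file` are hypothetical helpers, and the gzip branch relies on bgzip output being readable by Python's gzip module.

```python
import gzip


def count_records(lines):
    """Count VCF data lines, skipping header lines and blank lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_records_in_file(path):
    """Count records in a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_records(handle)


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
        "chr1\t200\t.\tC\tT\t60\tPASS\t.\n",
    ]
    print(count_records(example))
```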
bcftools view -f PASS input.vcf.gz -Oz -o pass_only.vcf.gz
bcftools index pass_only.vcf.gz
Goal: Perform VCF set operations programmatically in Python.
Approach: Use cyvcf2 for position-based comparisons and record concatenation; use bcftools merge for true multi-sample merging.
Note: True VCF merging (combining samples at matching positions) is complex.
Use bcftools merge for production work. cyvcf2 is better for filtering/querying.
from cyvcf2 import VCF, Writer
# WARNING: This concatenates records, not a true merge
# For actual merging of samples, use bcftools merge
vcf1 = VCF('file1.vcf.gz')
writer = Writer('combined.vcf', vcf1)
for variant in vcf1:
    writer.write_record(variant)
writer.close()
vcf1.close()
from cyvcf2 import VCF
# Load positions from first VCF
vcf1_positions = set()
for variant in VCF('sample1.vcf.gz'):
    vcf1_positions.add((variant.CHROM, variant.POS))
# Check second VCF
shared = 0
unique = 0
for variant in VCF('sample2.vcf.gz'):
    if (variant.CHROM, variant.POS) in vcf1_positions:
        shared += 1
    else:
        unique += 1
print(f'Shared: {shared}')
print(f'Unique to sample2: {unique}')
| Task | Command |
|---|---|
| Merge samples | bcftools merge s1.vcf.gz s2.vcf.gz -Oz -o merged.vcf.gz |
| Concat regions | bcftools concat chr1.vcf.gz chr2.vcf.gz -Oz -o all.vcf.gz |
| Sort VCF | bcftools sort input.vcf -Oz -o sorted.vcf.gz |
| Intersect | bcftools isec -p dir a.vcf.gz b.vcf.gz |
| Extract samples | bcftools view -s sample1 input.vcf.gz |
| Rename samples | bcftools reheader -s names.txt input.vcf.gz |
| Error | Cause | Solution |
|---|---|---|
| different samples | merge vs concat confusion | Use merge for samples, concat for regions |
| not sorted | Unsorted input to concat | Sort first or use -a flag |
| sample name conflict | Duplicate sample names | Use --force-samples |
| index required | Missing index for merge/isec | Run bcftools index first |
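The missing-index error is cheap to catch before launching a long merge or isec run. The following Python sketch lists inputs that have neither a .csi nor a .tbi file next to them; `missing_indexes` is a hypothetical helper, and it checks only that an index file exists, not that it is newer than the VCF.

```python
from pathlib import Path


def missing_indexes(vcf_paths):
    """Return the paths that have no adjacent .csi or .tbi index file."""
    missing = []
    for path in vcf_paths:
        text = str(path)
        # bcftools index writes .csi by default and .tbi with -t
        if not any(Path(text + suffix).exists() for suffix in (".csi", ".tbi")):
            missing.append(text)
    return missing


if __name__ == "__main__":
    for path in missing_indexes(["sample1.vcf.gz", "sample2.vcf.gz"]):
        print(f"no index found, run: bcftools index {path}")
```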