Mark and remove PCR/optical duplicates using samtools fixmate and markdup. Use when preparing alignments for variant calling or when duplicate reads would bias analysis.
Reference examples tested with: picard 3.1+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Remove PCR duplicates from my BAM file" → Mark or remove duplicate reads using the fixmate-sort-markdup pipeline to prevent duplicate bias in variant calling.
samtools fixmate, samtools markdup (samtools)pysam.fixmate(), pysam.markdup() (pysam)Mark and remove PCR/optical duplicates using samtools.
PCR duplicates are identical copies of the same original molecule, created during library preparation. They:
Optical duplicates are clusters read multiple times due to their proximity on the flowcell.
Goal: Mark PCR/optical duplicates so they can be excluded from downstream variant calling and coverage analysis.
Approach: Name-sort, add mate tags with fixmate, coordinate-sort, then run markdup. The pipeline version avoids intermediate files.
Reference (samtools 1.19+):
# 1. Sort by name (required for fixmate)
samtools sort -n -o namesort.bam input.bam
# 2. Add mate information with fixmate
samtools fixmate -m namesort.bam fixmate.bam
# 3. Sort by coordinate (required for markdup)
samtools sort -o coordsort.bam fixmate.bam
# 4. Mark duplicates
samtools markdup coordsort.bam marked.bam
# 5. Index result
samtools index marked.bam
samtools sort -n input.bam | \
samtools fixmate -m - - | \
samtools sort - | \
samtools markdup - marked.bam
samtools index marked.bam
Adds mate information required by markdup. Must be run on name-sorted BAM.
samtools fixmate namesorted.bam fixmate.bam
# Required for markdup to work correctly
samtools fixmate -m namesorted.bam fixmate.bam
samtools fixmate -m -@ 4 namesorted.bam fixmate.bam
samtools fixmate -r -m namesorted.bam fixmate.bam
Marks or removes duplicate alignments. Requires coordinate-sorted BAM with mate tags from fixmate.
samtools markdup input.bam marked.bam
samtools markdup -r input.bam deduped.bam
samtools markdup -s input.bam marked.bam 2> markdup_stats.txt
# Set pixel distance for optical duplicate detection (default: 100)
samtools markdup -d 2500 input.bam marked.bam
samtools markdup -@ 4 input.bam marked.bam
samtools markdup -f stats.txt input.bam marked.bam
samtools flagstat marked.bam
# Look for "duplicates" line
# Count reads with duplicate flag
samtools view -c -f 1024 marked.bam
total=$(samtools view -c marked.bam)
dups=$(samtools view -c -f 1024 marked.bam)
echo "scale=2; $dups * 100 / $total" | bc
import pysam
# Sort by name
pysam.sort('-n', '-o', 'namesort.bam', 'input.bam')
# Fixmate
pysam.fixmate('-m', 'namesort.bam', 'fixmate.bam')
# Sort by coordinate
pysam.sort('-o', 'coordsort.bam', 'fixmate.bam')
# Mark duplicates
pysam.markdup('coordsort.bam', 'marked.bam')
# Index
pysam.index('marked.bam')
import pysam
with pysam.AlignmentFile('marked.bam', 'rb') as bam:
total = 0
duplicates = 0
for read in bam:
total += 1
if read.is_duplicate:
duplicates += 1
print(f'Total: {total}')
print(f'Duplicates: {duplicates}')
print(f'Rate: {duplicates/total*100:.2f}%')
import pysam
with pysam.AlignmentFile('marked.bam', 'rb') as infile:
with pysam.AlignmentFile('nodup.bam', 'wb', header=infile.header) as outfile:
for read in infile:
if not read.is_duplicate:
outfile.write(read)
import pysam
from collections import defaultdict
def simple_markdup(input_bam, output_bam):
seen = defaultdict(set)
with pysam.AlignmentFile(input_bam, 'rb') as infile:
with pysam.AlignmentFile(output_bam, 'wb', header=infile.header) as outfile:
for read in infile:
if read.is_unmapped:
outfile.write(read)
continue
key = (read.reference_id, read.reference_start, read.is_reverse,
read.next_reference_id, read.next_reference_start)
if key in seen:
read.is_duplicate = True
else:
seen[key].add(read.query_name)
outfile.write(read)
simple_markdup('sorted.bam', 'marked.bam')
Some aligners can mark duplicates directly:
bwa-mem2 mem ref.fa R1.fq R2.fq | \
samblaster | \
samtools sort -o marked.bam
java -jar picard.jar MarkDuplicates \
I=input.bam \
O=marked.bam \
M=metrics.txt
| Task | Command |
|---|---|
| Full workflow | sort -n | fixmate -m | sort | markdup |
| Mark duplicates | samtools markdup in.bam out.bam |
| Remove duplicates | samtools markdup -r in.bam out.bam |
| Count duplicates | samtools view -c -f 1024 marked.bam |
| View non-duplicates | samtools view -F 1024 marked.bam |
| Get stats | samtools markdup -s in.bam out.bam |
| Flag | Value | Meaning |
|---|---|---|
| 0x400 | 1024 | PCR or optical duplicate |
# View only duplicates
samtools view -f 1024 marked.bam
# View non-duplicates only
samtools view -F 1024 marked.bam
# Count non-duplicates
samtools view -c -F 1024 marked.bam
| Error | Cause | Solution |
|---|---|---|
mate not found | Input not name-sorted | Run samtools sort -n first |
no MC tag | fixmate not run with -m | Re-run fixmate with -m flag |
not coordinate sorted | Input to markdup not sorted | Run samtools sort after fixmate |