Name: Bio Read Qc Umi Processing
Author: mdbabumiamssm

UMI Processing

UMIs (Unique Molecular Identifiers) are short random sequences added during library preparation to tag individual molecules before PCR amplification. This enables accurate PCR duplicate removal and molecule counting.
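To make the idea concrete, the sketch below counts reads and distinct molecules in a toy table of aligned reads, treating reads that share both position and UMI as PCR copies of one molecule. The three-column input format and the function name `count_molecules` are invented for illustration, and the comparison is exact-match only; real tools such as umi_tools also allow for sequencing errors in the UMI.

```shell
# Toy illustration: reads sharing alignment position AND UMI are treated as
# PCR copies of one molecule (exact match only, no UMI error correction).
# Input columns (hypothetical format): chrom, position, UMI
count_molecules() {
    awk '{ reads++; if (!seen[$1 FS $2 FS $3]++) molecules++ }
         END { printf "reads=%d molecules=%d\n", reads, molecules }'
}

printf '%s\n' \
    "chr1 100 AACCGGTT" \
    "chr1 100 AACCGGTT" \
    "chr1 100 TTGGCCAA" \
    "chr1 250 AACCGGTT" | count_molecules
# prints: reads=4 molecules=3
```

Without the UMI column, the three reads at chr1:100 would collapse to a single molecule, which is why position-only duplicate removal undercounts highly expressed targets.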

UMI Workflow Overview

Raw FASTQ with UMIs
    |
    v
[umi_tools extract] --> Move UMI to read header
    |
    v
[Alignment] --> bwa/STAR/bowtie2
    |
    v
[umi_tools dedup] --> Remove PCR duplicates based on UMI + position
    |
    v
Deduplicated BAM
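The flow above could be wired together roughly as follows. This is a sketch under assumptions, not a tested production pipeline: it assumes paired-end reads with an 8bp UMI at the start of R1, a reference already indexed for bwa, and samtools on the PATH. The function name `umi_pipeline` and every file name are placeholders.

```shell
# Sketch of the extract -> align -> dedup flow for paired-end reads with an
# 8bp UMI at the start of R1. Arguments and output names are placeholders.
umi_pipeline() {
    ref=$1 r1=$2 r2=$3 out=$4

    # 1. Move the UMI from the read sequence into the read name
    umi_tools extract \
        --stdin="$r1" --read2-in="$r2" \
        --stdout="$out.R1.fastq.gz" --read2-out="$out.R2.fastq.gz" \
        --bc-pattern=NNNNNNNN || return 1

    # 2. Align and coordinate-sort (bwa shown; STAR or bowtie2 fit here too)
    bwa mem "$ref" "$out.R1.fastq.gz" "$out.R2.fastq.gz" |
        samtools sort -o "$out.sorted.bam" - || return 1

    # umi_tools dedup needs a coordinate-sorted, indexed BAM
    samtools index "$out.sorted.bam" || return 1

    # 3. Collapse reads that share a mapping position and UMI
    umi_tools dedup --paired \
        --stdin="$out.sorted.bam" --stdout="$out.dedup.bam"
}
```

A call would look like `umi_pipeline genome.fa R1.fastq.gz R2.fastq.gz sample1`, producing `sample1.dedup.bam`. Defining the steps as a function keeps the example inert until it is invoked with real files.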

Extract UMIs from Reads

UMI in Read Sequence

# UMI at start of R1 (8bp UMI)
umi_tools extract \
    --stdin=R1.fastq.gz \
    --read2-in=R2.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --bc-pattern=NNNNNNNN

# UMI at start of R2
umi_tools extract \
    --stdin=R1.fastq.gz \
    --read2-in=R2.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --bc-pattern2=NNNNNNNN

# UMI in both reads
umi_tools extract \
    --stdin=R1.fastq.gz \
    --read2-in=R2.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --bc-pattern=NNNNNNNN \
    --bc-pattern2=NNNNNNNN
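To show what extraction does to each record, the sketch below emulates `--bc-pattern=NNNNNNNN` on a single-end FASTQ stream with awk: the first 8 bases are appended to the read ID after an underscore, and the matching quality characters are trimmed. It is an illustration of the transformation only, assuming plain four-line FASTQ records; the function name `move_umi_to_header` is hypothetical and it is not a substitute for umi_tools.

```shell
# Emulate what --bc-pattern=NNNNNNNN does to a single-end FASTQ stream:
# the first 8 bases move into the read name and the matching quality
# characters are dropped. Illustration only.
move_umi_to_header() {
    awk -v n=8 '
        NR % 4 == 1 { header = $0; next }               # hold the header line
        NR % 4 == 2 {                                    # sequence line
            umi = substr($0, 1, n)
            split(header, parts, " ")                    # read ID ends at first space
            rest = substr(header, length(parts[1]) + 1)  # keep any comment field
            print parts[1] "_" umi rest
            print substr($0, n + 1)
            next
        }
        NR % 4 == 0 { print substr($0, n + 1); next }    # trim qualities to match
        { print }                                        # the "+" separator line
    '
}

printf '%s\n' "@read1 1:N:0" "ACGTACGTTTGGCCAA" "+" "IIIIIIIIFFFFFFFF" |
    move_umi_to_header
# prints:
# @read1_ACGTACGT 1:N:0
# TTGGCCAA
# +
# FFFFFFFF
```

Because the UMI now travels in the read name, it survives alignment and is available to `umi_tools dedup` in the BAM file.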

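UMI Pattern Syntax

Pattern	Meaning
`N`	UMI base (extracted to the read name)
`C`	Cell barcode base (extracted, kept separate from the UMI)
`X`	Base left in place (reattached to the read, not extracted)
`NNNNNNNN`	8bp UMI
`CCCCCCCCNNNNNNNN`	8bp cell barcode + 8bp UMI
`NNNXXXNNN`	3bp UMI, 3bp left in the read, 3bp UMI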
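As a rough illustration of how a string pattern partitions a read, the sketch below walks a pattern and a sequence in parallel and sorts each base into cell barcode, UMI, or remaining read. It follows the umi_tools documentation in treating `X` positions as bases that stay with the read. The function name `apply_pattern` and the output format are invented for this example, and the sketch ignores quality strings and regex-style patterns.

```shell
# Split a read according to a string-style barcode pattern.
# N -> UMI, C -> cell barcode, X -> stays with the read.
apply_pattern() {
    awk -v pat="$1" -v seq="$2" 'BEGIN {
        for (i = 1; i <= length(pat); i++) {
            p = substr(pat, i, 1); b = substr(seq, i, 1)
            if (p == "N") umi = umi b
            else if (p == "C") cell = cell b
            else rest = rest b                      # X: stays with the read
        }
        rest = rest substr(seq, length(pat) + 1)    # bases past the pattern
        printf "cell=%s umi=%s read=%s\n", cell, umi, rest
    }'
}

apply_pattern CCCCCCCCNNNNNNNN AAAACCCCGGGGTTTTACGTACGT
# prints: cell=AAAACCCC umi=GGGGTTTT read=ACGTACGT
apply_pattern NNNXXXNNN AAACCCGGGTTTT
# prints: cell= umi=AAAGGG read=CCCTTTT
```

In the second call the two 3bp UMI halves are concatenated into one 6bp UMI, while the three `X` bases remain part of the read.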

Method Selection Guide

Method	Use Case	Speed
`directional`	Standard RNA-seq, most cases	Fast
`unique`	Very high diversity, PCR-free	Fastest
`cluster`	Low diversity, high errors	Slow
`adjacency`	Balance of accuracy/speed	Medium
`percentile`	Extremely high duplication	Fast
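The method is chosen with the `--method` option of `umi_tools dedup`. The wrapper below is a small sketch that rejects method names not listed in the table before calling the tool; the function name `dedup_with_method` and the BAM file names are placeholders, and the input is assumed to be coordinate-sorted and indexed.

```shell
# Run umi_tools dedup with one of the methods from the table above.
# Rejects unknown method names before calling the tool.
dedup_with_method() {
    method=$1 in_bam=$2 out_bam=$3
    case $method in
        directional|unique|cluster|adjacency|percentile) ;;
        *) echo "unknown method: $method" >&2; return 2 ;;
    esac
    umi_tools dedup --method="$method" --stdin="$in_bam" --stdout="$out_bam"
}
```

For example, `dedup_with_method directional aligned.sorted.bam dedup.bam` runs the method the table recommends for most cases, and swapping in `unique` gives a quick comparison with no UMI error correction.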
