Ultra-fast RNA-seq transcript and gene-level quantification using quasi-mapping (no BAM required). Builds a k-mer index from a transcriptome FASTA, then quantifies reads in minutes. Outputs transcript-level TPM/count tables (quant.sf) with optional GC-bias and sequence-bias correction. Integrates directly with tximeta/tximport for DESeq2 or edgeR. Use STAR instead when a genome-aligned BAM is required for variant calling or visualization.
Salmon quantifies transcript abundance from RNA-seq reads using quasi-mapping — matching reads to a k-mer index of the transcriptome without full genome alignment. This makes Salmon 20–50× faster than alignment-based tools while producing accurate TPM and estimated count values. Salmon corrects for sequence-specific bias (--seqBias), GC-content bias (--gcBias), and fragment length distribution automatically. Output quant.sf files integrate directly with tximeta (R) or pydeseq2 (Python) for differential expression analysis. For improved accuracy, decoy-aware indexing uses the full genome to identify spurious quasi-mappings.
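The TPM values Salmon reports follow the standard definition: estimated reads per base of effective length, rescaled so all transcripts sum to one million. A minimal sketch of that arithmetic (illustrative only; Salmon's actual estimates also resolve multi-mapping reads and apply bias models):

```python
def tpm_from_counts(num_reads, effective_lengths):
    """Compute TPM from estimated counts and effective transcript lengths."""
    # Rate = reads per base of effective length
    rates = [n / l for n, l in zip(num_reads, effective_lengths)]
    total = sum(rates)
    # Rescale so TPM values sum to 1e6 across all transcripts
    return [r / total * 1e6 for r in rates]

tpm = tpm_from_counts([100, 300, 100], [1000.0, 1500.0, 500.0])
print([round(t) for t in tpm])  # → [200000, 400000, 400000]
```

Because of the per-length normalization, a short transcript with the same read count as a long one receives a higher TPM.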
Key flags: `--gcBias`, `--seqBias`, `--numBootstraps`. Python dependencies: pandas for parsing output; pydeseq2 for differential expression.

Check before installing: the tool may already be available in the current environment (e.g., inside a pixi/conda env). Run `command -v salmon` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run salmon` rather than bare `salmon`.
# Install with conda (recommended)
conda install -c bioconda salmon
# Verify
salmon --version
# salmon 1.10.3
# Or download pre-compiled binary
wget https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz
tar xzvf salmon-1.10.0_linux_x86_64.tar.gz
export PATH="$PWD/salmon-latest_linux_x86_64/bin:$PATH"
# 1. Build transcriptome index (~5 min)
salmon index -t transcriptome.fa -i salmon_index/ -p 8
# 2. Quantify paired-end reads (~2-5 min per sample)
salmon quant \
-i salmon_index/ \
-l A \
-1 sample_R1.fastq.gz \
-2 sample_R2.fastq.gz \
-p 8 \
--gcBias --validateMappings \
-o results/sample1/
# Output: results/sample1/quant.sf
head results/sample1/quant.sf
Fetch a transcript FASTA from GENCODE or Ensembl (cDNA sequences only — not genome).
# Human transcriptome from GENCODE (recommended)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.transcripts.fa.gz
gunzip gencode.v47.transcripts.fa.gz
# Count transcripts
grep -c "^>" gencode.v47.transcripts.fa
# ~252,000 transcripts
echo "Reference ready."
ls -lh gencode.v47.transcripts.fa
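GENCODE FASTA headers pack transcript ID, gene ID, HAVANA IDs, names, length, and biotype into one pipe-delimited string; the gene-level aggregation later in this guide relies on that layout. A minimal parsing sketch:

```python
def parse_gencode_header(header):
    """Split a GENCODE transcript FASTA header into its first two pipe-delimited fields."""
    fields = header.lstrip(">").split("|")
    return {"transcript_id": fields[0], "gene_id": fields[1]}

h = (">ENST00000456328.2|ENSG00000223972.6|OTTHUMG00000000961.2|"
     "OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|")
rec = parse_gencode_header(h)
print(rec["gene_id"])  # → ENSG00000223972.6
```

Ensembl cDNA FASTA headers use a different, space-delimited layout, so the parsing would need adjusting for that source.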
Index the transcriptome for quasi-mapping. Add genome decoys for improved accuracy.
# Standard index (fast, sufficient for most analyses)
salmon index \
-t gencode.v47.transcripts.fa \
-i salmon_index/ \
-p 8
echo "Standard index complete."
# Decoy-aware index (recommended for accuracy — uses full genome as decoy)
# Step 1: create decoy list from genome chromosome names
grep "^>" GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | sed 's/>//' > decoys.txt
# Step 2: concatenate transcriptome + genome
cat gencode.v47.transcripts.fa GRCh38.primary_assembly.genome.fa > gentrome.fa
# Step 3: build decoy-aware index
salmon index \
-t gentrome.fa \
-d decoys.txt \
-i salmon_decoy_index/ \
-p 8
echo "Decoy-aware index complete."
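Salmon requires the decoy sequences to come after all transcript sequences in gentrome.fa, which the `cat` order above guarantees. The invariant can be expressed as a small check over record names (shown here with toy names rather than real files):

```python
def decoys_are_at_tail(record_names, decoy_names):
    """Return True if every decoy record appears after every non-decoy record."""
    decoys = set(decoy_names)
    seen_decoy = False
    for name in record_names:
        if name in decoys:
            seen_decoy = True
        elif seen_decoy:
            return False  # a transcript follows a decoy: ordering is broken
    return True

# Transcripts first, then genome chromosomes as decoys — valid
print(decoys_are_at_tail(["ENST1", "ENST2", "chr1", "chr2"], ["chr1", "chr2"]))  # → True
# A decoy interleaved among transcripts — invalid
print(decoys_are_at_tail(["ENST1", "chr1", "ENST2"], ["chr1"]))  # → False
```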
Run Salmon on single-end FASTQ files.
# Single-end quantification
salmon quant \
-i salmon_index/ \
-l A \
-r sample1.fastq.gz \
-p 8 \
--seqBias \
--validateMappings \
-o results/sample1/
echo "Mapping rate: $(grep 'Mapping rate' results/sample1/logs/salmon_quant.log | tail -1)"
echo "Output: results/sample1/quant.sf"
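Besides the log file, Salmon records run statistics in `aux_info/meta_info.json`, including a `percent_mapped` field. A sketch reading it programmatically, demonstrated against a mock output directory rather than a real run:

```python
import json
from pathlib import Path

def mapping_rate(quant_dir):
    """Read the mapping rate Salmon records in aux_info/meta_info.json."""
    meta = json.loads((Path(quant_dir) / "aux_info" / "meta_info.json").read_text())
    return meta["percent_mapped"]

# Build a mock quant directory so the function can be exercised without a real run
demo = Path("demo_quant/aux_info")
demo.mkdir(parents=True, exist_ok=True)
(demo / "meta_info.json").write_text(
    json.dumps({"percent_mapped": 93.4, "num_processed": 20_000_000})
)
print(f"Mapping rate: {mapping_rate('demo_quant'):.1f}%")  # → Mapping rate: 93.4%
```

Mapping rates well below ~70% for a matched transcriptome usually indicate contamination, adapter read-through, or the wrong reference.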
Run Salmon on paired-end FASTQ files with recommended bias correction flags.
# Paired-end with GC bias + sequence bias correction
salmon quant \
-i salmon_decoy_index/ \
-l A \
-1 sample1_R1.fastq.gz \
-2 sample1_R2.fastq.gz \
-p 8 \
--gcBias \
--seqBias \
--validateMappings \
--numBootstraps 100 \
-o results/sample1/
# quant.sf columns: Name, Length, EffectiveLength, TPM, NumReads
head results/sample1/quant.sf
Parse quant.sf to build a gene-level count matrix for differential expression.
import pandas as pd
from pathlib import Path
# Load single-sample output
quant = pd.read_csv("results/sample1/quant.sf", sep="\t")
print(f"Transcripts quantified: {len(quant)}")
print(f"Total estimated reads: {quant['NumReads'].sum():.0f}")
print(f"Transcripts with TPM > 1: {(quant['TPM'] > 1).sum()}")
print(quant.sort_values("TPM", ascending=False).head())
# Build a multi-sample TPM matrix
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
tpm_matrix = pd.DataFrame({
s: pd.read_csv(f"results/{s}/quant.sf", sep="\t").set_index("Name")["TPM"]
for s in samples
})
print(f"\nTPM matrix: {tpm_matrix.shape}")
tpm_matrix.to_csv("tpm_matrix.tsv", sep="\t")
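With the TPM matrix saved, a typical next step is dropping unexpressed transcripts and log-transforming before PCA or clustering. A sketch using a small synthetic matrix in place of the real one:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the transcripts-by-samples TPM matrix built above
tpm = pd.DataFrame(
    {"ctrl_1": [0.0, 5.2, 120.0], "treat_1": [0.1, 4.8, 300.0]},
    index=["txA", "txB", "txC"],
)

# Keep transcripts expressed (TPM > 1) in at least one sample
expressed = tpm[(tpm > 1).any(axis=1)]

# log2(TPM + 1) compresses the dynamic range for exploratory plots
log_tpm = np.log2(expressed + 1)
print(expressed.shape)  # → (2, 2)
```

Note this filtering is for exploration only; DESeq2 should receive the unfiltered count matrix and perform its own independent filtering.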
Summarize transcript-level estimates to gene level and perform differential expression.
import pandas as pd
import re
from pathlib import Path
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
# Aggregate transcript counts to gene level using Ensembl gene IDs
# quant.sf Name format: "ENST00000456328.2|ENSG00000223972.6|..."
def extract_gene_id(transcript_id):
parts = transcript_id.split("|")
return parts[1].split(".")[0] if len(parts) > 1 else transcript_id
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
count_frames = []
for s in samples:
df = pd.read_csv(f"results/{s}/quant.sf", sep="\t")
df["gene_id"] = df["Name"].apply(extract_gene_id)
gene_counts = df.groupby("gene_id")["NumReads"].sum().round().astype(int)
count_frames.append(gene_counts.rename(s))
count_matrix = pd.DataFrame(count_frames).fillna(0).astype(int)
metadata = pd.DataFrame({
"condition": ["control", "control", "treated", "treated"]
}, index=samples)
# Run DESeq2
dds = DeseqDataSet(counts=count_matrix, metadata=metadata,
design_factors="condition",
inference=DefaultInference(n_cpus=4))
dds.deseq2()
stat_res = DeseqStats(dds, contrast=["condition", "treated", "control"],
inference=DefaultInference())
stat_res.summary()
results = stat_res.results_df
print(f"DE genes (padj < 0.05): {(results['padj'] < 0.05).sum()}")
print(results[results['padj'] < 0.05].sort_values('log2FoldChange').head())
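Once `results_df` is in hand, a common follow-up is filtering on both significance and effect size and saving the hits. A sketch against a mock results frame with the same columns pydeseq2 produces (`log2FoldChange`, `padj`):

```python
import pandas as pd

# Mock stand-in for stat_res.results_df
results = pd.DataFrame({
    "log2FoldChange": [2.1, -0.3, -1.8],
    "padj": [0.001, 0.6, 0.02],
}, index=["ENSG1", "ENSG2", "ENSG3"])

# Significant and biologically meaningful: padj < 0.05 and |LFC| > 1
sig = results[(results["padj"] < 0.05) & (results["log2FoldChange"].abs() > 1)]
sig.sort_values("padj").to_csv("de_genes.tsv", sep="\t")
print(len(sig))  # → 2
```

Filtering on fold change as well as padj avoids reporting statistically significant but tiny expression shifts.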
| Parameter | Default | Range/Options | Effect |
|---|---|---|---|
| -l / --libType | required | A (auto), U, SF, SR, IU, ISF, ISR | Library strandedness; A auto-detects from the first reads |
| -p / --threads | 1 | 1–64 | CPU threads; 8–16 is typical |
| --gcBias | off | flag | Correct for GC-content bias in fragment selection; recommended for most samples |
| --seqBias | off | flag | Correct for sequence-specific bias at read starts; recommended |
| --validateMappings | on since v1.0 | flag | Selective alignment for improved accuracy; the default since Salmon 1.0, so the flag is kept only for compatibility |
| --numBootstraps | 0 | 0–200 | Bootstrap replicates for uncertainty estimation; enables Sleuth/Swish |
| --dumpEq | off | flag | Write equivalence class counts alongside quant.sf |
| -d / --decoys | — | file | Decoy sequence list for decoy-aware indexing (salmon index) |
| --rangeFactorizationBins | 4 | 1–8 | Bins for the range-factorization model; improves accuracy at a small speed cost |
| --skipQuant | off | flag | Map reads but skip the quantification step itself |
#!/bin/bash
# Quantify all paired-end samples with recommended settings
INDEX="salmon_decoy_index"
DATA="data"
OUT="results"
THREADS=12
SAMPLES=(ctrl_1 ctrl_2 treat_1 treat_2)
mkdir -p "$OUT"
for sample in "${SAMPLES[@]}"; do
echo "Quantifying: $sample"
salmon quant \
-i "$INDEX" \
-l A \
-1 "$DATA/${sample}_R1.fastq.gz" \
-2 "$DATA/${sample}_R2.fastq.gz" \
-p "$THREADS" \
--gcBias --seqBias --validateMappings \
-o "$OUT/$sample/"
echo "Done: $sample — mapping $(grep 'Mapping rate' "$OUT/$sample/logs/salmon_quant.log" | tail -1)"
done
echo "All samples quantified."
# Snakefile — Salmon quantification rule