Guided metagenomics profiling pipeline on Galaxy -- taxonomic classification, functional profiling, and diversity analysis
Guided metagenomics analysis on Galaxy -- from raw reads to community composition profiles and functional annotations.
Amplicon (e.g., 16S): targeted sequencing of marker genes. Cheaper, good for "who's there?" questions.
Shotgun: sequences all DNA in the sample. More expensive, but answers "who's there?" AND "what are they doing?"
Ask the user which type of data they have before proceeding.
Raw FASTQ reads
│
▼
[1] FastQC ──── QC checkpoint
│
▼
[2] fastp ──── Trim and filter
│
▼
[3] Host depletion (optional) ──── Remove human reads if needed
│
├───────────────────┐
▼                   ▼
[4a] Kraken2        [4b] MetaPhlAn
(fast, k-mer)       (marker gene)
│                   │
▼                   ▼
[5a] Bracken        Taxonomic profile
(abundance re-est.)
│
▼
[6] HUMAnN ──── Functional profiling (pathways, gene families)
│
▼
Community composition + functional capacity
Steps 1 and 2 (FastQC, then fastp) run the same way as in other pipelines. Quality matters here because metagenomic classification is sensitive to sequencing errors.
When to do this: Human microbiome samples (gut, skin, oral) contain significant human DNA. Remove it before classification.
Tool: BWA-MEM2 against the human reference, then extract unmapped reads with samtools.
Why: Human reads left in the sample end up unclassified or misclassified. Removing them speeds up classification and improves accuracy.
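As a rough sketch of what the host depletion step does outside Galaxy, the snippet below only builds the two command lines (alignment, then extraction of unmapped pairs) and prints them; it does not run anything. The file names and the `host_depletion_commands` helper are hypothetical, and the reference is assumed to have been indexed with `bwa-mem2 index` beforehand.

```python
import shlex


def host_depletion_commands(reference, reads_1, reads_2, out_prefix, threads=4):
    """Return (align_cmd, extract_cmd) as argv lists; nothing is executed."""
    # Align paired reads against the (pre-indexed) human reference.
    align = ["bwa-mem2", "mem", "-t", str(threads), reference, reads_1, reads_2]
    # -f 12 keeps pairs where both the read and its mate are unmapped;
    # -F 256 drops secondary alignments. "-" reads the alignments from stdin.
    extract = [
        "samtools", "fastq", "-f", "12", "-F", "256",
        "-1", f"{out_prefix}_1.fastq.gz",
        "-2", f"{out_prefix}_2.fastq.gz",
        "-0", "/dev/null", "-s", "/dev/null",
        "-",
    ]
    return align, extract


# Hypothetical file names, for illustration only.
align, extract = host_depletion_commands(
    "human_ref.fa", "sample_1.fastq.gz", "sample_2.fastq.gz", "sample.nonhuman"
)
print(shlex.join(align), "|", shlex.join(extract))
```

Requiring both mates to be unmapped (`-f 12`) is the stricter choice: a pair is kept only if neither read aligns to the human reference.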
Tool: toolshed.g2.bx.psu.edu/repos/iuc/kraken2/kraken2
Key parameters:
- Database:
  - Standard -- bacteria, archaea, viruses, human (most common)
  - PlusPF -- standard + protozoa + fungi
  - PlusPFP -- PlusPF + plants
- Confidence: 0.0 (default). Increase to 0.1 for fewer false positives.
- Minimum hit groups: 2 (default). Increase for more specificity.

Why Kraken2? One of the fastest classifiers (millions of reads/minute). Good for initial screening. Not the most accurate at species level -- use MetaPhlAn for that.
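One way to check whether the database choice and thresholds above suit a sample is to look at the unclassified fraction in the Kraken2 report. The sketch below assumes the standard six-column, tab-separated report (percentage, clade reads, direct reads, rank code, taxid, name); the `unclassified_fraction` helper and the sample numbers are made up for illustration.

```python
def unclassified_fraction(report_text):
    """Fraction of reads Kraken2 left unclassified, from a standard report."""
    classified = 0
    unclassified = 0
    for line in report_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        clade_reads = int(fields[1])
        rank_code = fields[3]
        if rank_code == "U":      # the "unclassified" row
            unclassified = clade_reads
        elif rank_code == "R":    # the root row covers every classified read
            classified = clade_reads
    total = classified + unclassified
    return unclassified / total if total else 0.0


# Made-up report, for illustration only.
SAMPLE_REPORT = "\n".join([
    " 20.00\t200\t200\tU\t0\tunclassified",
    " 80.00\t800\t5\tR\t1\troot",
    " 70.00\t700\t700\tS\t562\t      Escherichia coli",
])
print(f"{unclassified_fraction(SAMPLE_REPORT):.2f}")
```

A high value here is a prompt to try a broader database or to check for host reads before tuning the confidence threshold.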
Tool: toolshed.g2.bx.psu.edu/repos/iuc/metaphlan/metaphlan
MetaPhlAn uses clade-specific marker genes. Slower than Kraken2 but more accurate at species level. Better for quantitative abundance estimates.
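To illustrate how species-level abundances can be pulled from a MetaPhlAn profile, here is a minimal sketch. It assumes the tab-separated layout used by recent MetaPhlAn versions (clade name, NCBI taxonomy IDs, relative abundance, then further columns) with `#` comment lines; the `species_abundances` helper and the abbreviated sample lineages are hypothetical.

```python
def species_abundances(profile_text):
    """Map species name -> relative abundance (%) from a MetaPhlAn profile."""
    species = {}
    for line in profile_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        # The clade name is a '|'-joined lineage; its last element is the
        # rank of this row (k__, p__, ..., s__, and t__ in newer versions).
        leaf = fields[0].split("|")[-1]
        if leaf.startswith("s__"):
            species[leaf[3:]] = float(fields[2])
    return species


# Made-up, shortened profile for illustration only.
SAMPLE_PROFILE = "\n".join([
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria\t2\t100.0\t",
    "k__Bacteria|g__Escherichia|s__Escherichia_coli\t2|561|562\t60.0\t",
    "k__Bacteria|g__Bacteroides|s__Bacteroides_fragilis\t2|816|817\t40.0\t",
])
print(species_abundances(SAMPLE_PROFILE))
```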
When to choose MetaPhlAn over Kraken2:
Tool: toolshed.g2.bx.psu.edu/repos/iuc/bracken/bracken
Bracken refines Kraken2's output by redistributing reads classified at higher taxonomic levels. Use it after Kraken2 for better species-level abundance estimates.
Key parameter: --level S (species level, default and recommended)
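The idea of pushing genus-level reads down to species can be illustrated with a toy calculation. This is a deliberately simplified proportional split, not Bracken's actual model (Bracken estimates the redistribution from the k-mer content of the database genomes); the `redistribute` helper and the counts are hypothetical.

```python
def redistribute(genus_reads, species_reads):
    """Toy example: split genus-level reads across species in proportion to
    the reads already assigned to each species. NOT Bracken's real model."""
    total = sum(species_reads.values())
    if total == 0:
        return dict(species_reads)  # no species-level signal to apportion by
    return {
        name: count + genus_reads * count / total
        for name, count in species_reads.items()
    }


# 100 reads stuck at genus level, 300 and 100 reads already at species level.
print(redistribute(100, {"species_A": 300, "species_B": 100}))
```

After the split every read sits at species level, which is why re-estimated abundances are easier to compare across samples than raw Kraken2 counts.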
Tool: toolshed.g2.bx.psu.edu/repos/iuc/humann/humann
HUMAnN profiles metabolic pathways and gene families. Answers "what is the community doing?"
Outputs:
For amplicon data, the pipeline is different:
Paired FASTQ (16S amplicon)
│
▼
[1] DADA2 / QIIME2 dada2 ──── Denoise, error-correct, make ASVs
│
▼
[2] Taxonomic classification ──── SILVA/Greengenes/UNITE database
│
▼
[3] Diversity analysis ──── Alpha/beta diversity, ordination
│
▼
Community composition + diversity metrics
Galaxy has QIIME2 tools for the full amplicon workflow. Use the galaxy-tools skill to search for "qiime2" or "dada2".
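For readers unfamiliar with the metrics in step 3, the sketch below computes one alpha diversity measure (the Shannon index) and one beta diversity measure (Bray-Curtis dissimilarity) from plain count tables. It is a minimal illustration in pure Python, not what the QIIME2 tools run internally; the sample counts are made up.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {taxon: count} samples."""
    denom = sum(a.values()) + sum(b.values())
    if denom == 0:
        return 0.0
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b))
    return 1 - 2 * shared / denom


# Made-up samples, for illustration only.
sample_1 = {"taxon_A": 10, "taxon_B": 10, "taxon_C": 10, "taxon_D": 10}
sample_2 = {"taxon_A": 40}
print(round(shannon(sample_1.values()), 3))   # even community, higher H
print(round(shannon(sample_2.values()), 3))   # single taxon, H = 0
print(bray_curtis(sample_1, sample_2))
```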
| Problem | Likely Cause | Fix |
|---|---|---|
| Most reads unclassified | Wrong database or host contamination | Try a more comprehensive database; check for host reads |
| Unrealistic species | Database contamination or low confidence | Increase confidence threshold; filter low-abundance taxa |
| Very low diversity | Over-aggressive quality filtering | Relax fastp parameters; check for primer contamination |
| Batch effects in diversity | DNA extraction method differences | Include extraction batch as covariate |
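The "filter low-abundance taxa" fix from the table can be sketched as a simple relative-abundance cutoff. The `filter_low_abundance` helper and the default threshold of 0.01% are assumptions for illustration; a sensible cutoff depends on sequencing depth and the study question.

```python
def filter_low_abundance(counts, min_fraction=0.0001):
    """Drop taxa whose share of all reads is below min_fraction."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        taxon: count
        for taxon, count in counts.items()
        if count / total >= min_fraction
    }


# Made-up counts: the rare taxon falls below a 1% cutoff and is removed.
print(filter_low_abundance({"taxon_A": 9990, "taxon_B": 10}, min_fraction=0.01))
```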
| Criterion | Kraken2 | MetaPhlAn |
|---|---|---|
| Speed | Very fast | Moderate |
| Accuracy (species) | Good | Better |
| Quantification | Relative (use Bracken) | Directly quantitative |
| Database | General genomic | Marker genes only |
| False positives | More | Fewer |
| Recommendation | Initial screening, large datasets | Publication-quality abundance |