Reproducible antimicrobial resistance gene detection and functional annotation from bacterial whole-genome sequencing data. Downloads public NCBI reads, performs quality control, genome assembly, annotation, and AMR profiling with validated open-source tools.
This skill executes a complete, end-to-end bioinformatics pipeline for detecting antimicrobial resistance (AMR) genes from bacterial whole-genome sequencing (WGS) data. It downloads publicly available Illumina paired-end reads from NCBI SRA, performs quality control, de novo genome assembly, gene annotation, and comprehensive AMR gene detection using multiple complementary databases.
Input: NCBI SRA accession number (default: SRR10971381, an ESBL-producing Escherichia coli isolate) Output: AMR gene report, annotated genome, assembly quality metrics, and resistance class summary
Install all required tools via conda/mamba. This ensures version-pinned, reproducible dependencies.
# Create isolated conda environment with all dependencies
mamba create -n resistome_profiler -y -c bioconda -c conda-forge \
sra-tools=3.1.1 \
fastp=0.23.4 \
spades=4.0.0 \
quast=5.2.0 \
prokka=1.14.6 \
ncbi-amrfinderplus=4.0.3 \
abricate=1.0.1 \
mlst=2.23.0 \
seqkit=2.8.2 \
csvtk=0.30.0
conda activate resistome_profiler
# Update AMRFinderPlus database to latest version
amrfinder --update
Expected output: Environment created successfully, AMRFinderPlus database updated.
Validation: Run conda list -n resistome_profiler | grep -E "fastp|spades|prokka|amrfinderplus" — all four tools should appear with correct versions.
Download paired-end Illumina reads for the target isolate. Default accession is SRR10971381 (ESBL-producing E. coli clinical isolate, ~200x coverage).
# Set the SRA accession (change this to analyze a different isolate)
ACCESSION="SRR10971381"
WORKDIR="resistome_profiler_output"
mkdir -p ${WORKDIR}/raw_reads
cd ${WORKDIR}
# Download paired-end FASTQ files
fasterq-dump --split-files --threads 4 --outdir raw_reads/ ${ACCESSION}
# Verify download integrity
echo "=== Read counts ==="
seqkit stats raw_reads/${ACCESSION}_1.fastq raw_reads/${ACCESSION}_2.fastq
Expected output: Two FASTQ files (_1.fastq and _2.fastq) with millions of paired-end reads. seqkit stats should report read counts and average read lengths (typically 150 bp for Illumina).
Validation: Both files should exist, have equal read counts, and total file size > 100 MB.
Remove adapters, low-quality bases, and short reads using fastp. This step is critical for accurate downstream assembly.
mkdir -p qc_reads
fastp \
--in1 raw_reads/${ACCESSION}_1.fastq \
--in2 raw_reads/${ACCESSION}_2.fastq \
--out1 qc_reads/${ACCESSION}_trimmed_1.fastq.gz \
--out2 qc_reads/${ACCESSION}_trimmed_2.fastq.gz \
--json qc_reads/fastp_report.json \
--html qc_reads/fastp_report.html \
--thread 4 \
--qualified_quality_phred 20 \
--length_required 50 \
--detect_adapter_for_pe \
--correction \
--cut_front \
--cut_tail \
--cut_window_size 4 \
--cut_mean_quality 20
# Summarize QC results
echo "=== Post-QC Read Statistics ==="
seqkit stats qc_reads/${ACCESSION}_trimmed_1.fastq.gz qc_reads/${ACCESSION}_trimmed_2.fastq.gz
Expected output: Trimmed FASTQ files with adapter sequences removed. fastp HTML report shows >90% reads passing filters. Average quality score >Q30.
Validation: Check fastp_report.json — filtering_result.passed_filter_reads should be >80% of total input reads.
Assemble the bacterial genome using SPAdes in isolate mode (optimized for single-isolate bacterial WGS).
mkdir -p assembly
spades.py \
--isolate \
-1 qc_reads/${ACCESSION}_trimmed_1.fastq.gz \
-2 qc_reads/${ACCESSION}_trimmed_2.fastq.gz \
-o assembly/${ACCESSION} \
--threads 4 \
--memory 8
# Copy final scaffolds
cp assembly/${ACCESSION}/scaffolds.fasta assembly/${ACCESSION}_scaffolds.fasta
Expected output: Assembled genome in scaffolds.fasta. For E. coli, expect ~4.5-5.5 Mbp total assembly size.
Evaluate assembly quality using QUAST. This provides N50, total length, number of contigs, and other key metrics.
mkdir -p quality
quast \
assembly/${ACCESSION}_scaffolds.fasta \
-o quality/${ACCESSION}_quast \
--min-contig 500 \
--threads 4
# Display key metrics
echo "=== Assembly Quality Metrics ==="
cat quality/${ACCESSION}_quast/report.txt
Expected output: QUAST report showing:
Validation: Total assembly length should be within 20% of expected genome size for the species. N50 > 20,000 bp indicates acceptable assembly quality.
Annotate the assembled genome using Prokka to identify coding sequences, rRNAs, tRNAs, and assign functional annotations.
mkdir -p annotation
prokka \
--outdir annotation/${ACCESSION} \
--prefix ${ACCESSION} \
--kingdom Bacteria \
--genus Escherichia \
--species coli \
--cpus 4 \
--force \
assembly/${ACCESSION}_scaffolds.fasta
echo "=== Annotation Summary ==="
cat annotation/${ACCESSION}/${ACCESSION}.txt
Expected output: Prokka output directory containing .gff, .gbk, .faa, .ffn files. Summary should show ~4,500-5,500 CDS for E. coli.
Validation: Number of CDS should be proportional to genome size (~1 gene per 1 kbp). At least 3 rRNA operons expected for E. coli.
Run NCBI AMRFinderPlus for gold-standard AMR gene detection using both nucleotide and protein sequences.
mkdir -p amr_results
# Run AMRFinderPlus with both nucleotide and protein input for maximum sensitivity
amrfinder \
--nucleotide assembly/${ACCESSION}_scaffolds.fasta \
--protein annotation/${ACCESSION}/${ACCESSION}.faa \
--gff annotation/${ACCESSION}/${ACCESSION}.gff \
--organism Escherichia \
--plus \
--threads 4 \
--output amr_results/${ACCESSION}_amrfinder.tsv \
--name ${ACCESSION}
echo "=== AMRFinderPlus Results ==="
echo "Total AMR/stress/virulence genes detected:"
wc -l < amr_results/${ACCESSION}_amrfinder.tsv
echo ""
echo "AMR genes by class:"
tail -n +2 amr_results/${ACCESSION}_amrfinder.tsv | \
csvtk -t cut -f "Class" | sort | uniq -c | sort -rn
echo ""
echo "Top hits:"
head -20 amr_results/${ACCESSION}_amrfinder.tsv | csvtk -t pretty
Expected output: TSV file listing detected AMR genes with gene symbol, sequence name, element type, element subtype, drug class, and percent identity. For an ESBL-producing E. coli, expect detection of beta-lactamase genes (e.g., blaCTX-M, blaTEM), and potentially aminoglycoside, quinolone, and tetracycline resistance genes.
Run ABRicate against multiple curated databases to cross-validate AMRFinderPlus results and detect additional resistance genes.
# Screen against multiple AMR databases for comprehensive coverage
for DB in ncbi card resfinder argannot megares vfdb; do
abricate \
--db ${DB} \
--minid 80 \
--mincov 60 \
--threads 4 \
assembly/${ACCESSION}_scaffolds.fasta \
> amr_results/${ACCESSION}_abricate_${DB}.tsv
echo "=== ABRicate ${DB} database: $(tail -n +2 amr_results/${ACCESSION}_abricate_${DB}.tsv | wc -l) genes detected ==="
done
# Generate summary across all databases
abricate --summary amr_results/${ACCESSION}_abricate_*.tsv > amr_results/${ACCESSION}_abricate_summary.tsv
echo "=== Cross-Database Summary ==="
cat amr_results/${ACCESSION}_abricate_summary.tsv | csvtk -t pretty
Expected output: Gene hits from each database. The ncbi and resfinder databases typically show the highest concordance with AMRFinderPlus. The vfdb database additionally reports virulence factors.
Validation: Core AMR genes (e.g., beta-lactamases) should be detected by at least 3 of the 5 AMR databases, confirming high-confidence calls.
Determine the sequence type (ST) of the isolate for epidemiological context.
mlst assembly/${ACCESSION}_scaffolds.fasta > amr_results/${ACCESSION}_mlst.tsv
echo "=== MLST Result ==="
cat amr_results/${ACCESSION}_mlst.tsv
Expected output: Sequence type assignment (e.g., ST131, ST410) with allele profiles. ST131 is a globally disseminated ESBL-producing E. coli lineage.
Compile all results into a single, human- and machine-readable summary report.
mkdir -p report
cat > report/${ACCESSION}_resistome_report.md << 'REPORT_HEADER'
# ResistomeProfiler Report
REPORT_HEADER
cat >> report/${ACCESSION}_resistome_report.md << EOF
**Accession:** ${ACCESSION}
**Date:** $(date -u +"%Y-%m-%d %H:%M:%S UTC")
## 1. Assembly Quality
\`\`\`
$(cat quality/${ACCESSION}_quast/report.txt)
\`\`\`
## 2. Annotation Summary
\`\`\`
$(cat annotation/${ACCESSION}/${ACCESSION}.txt)
\`\`\`
## 3. MLST
\`\`\`
$(cat amr_results/${ACCESSION}_mlst.tsv)
\`\`\`
## 4. AMR Genes (AMRFinderPlus)
Total genes/elements detected: $(tail -n +2 amr_results/${ACCESSION}_amrfinder.tsv | wc -l)
### By Drug Class:
\`\`\`
$(tail -n +2 amr_results/${ACCESSION}_amrfinder.tsv | csvtk -t cut -f "Class" | sort | uniq -c | sort -rn)
\`\`\`
### Full AMRFinderPlus Results:
\`\`\`
$(head -30 amr_results/${ACCESSION}_amrfinder.tsv | csvtk -t pretty)
\`\`\`
## 5. Cross-Database Validation (ABRicate)
\`\`\`
$(cat amr_results/${ACCESSION}_abricate_summary.tsv | csvtk -t pretty)
\`\`\`
## 6. Reproducibility
- **Environment:** conda (resistome_profiler)
- **Tools:** fastp 0.23.4, SPAdes 4.0.0, QUAST 5.2.0, Prokka 1.14.6, AMRFinderPlus 4.0.3, ABRicate 1.0.1
- **Input data:** NCBI SRA ${ACCESSION}
- **All tools are open-source and version-pinned for full reproducibility.**
EOF
echo "=== Report generated at report/${ACCESSION}_resistome_report.md ==="
cat report/${ACCESSION}_resistome_report.md
Expected output: A complete markdown report summarizing all analysis results, ready for inspection or publication.
To analyze a different bacterial isolate:
ACCESSION="SRR_YOUR_ACCESSION" in Step 2--genus and --species in Step 6 (Prokka) and --organism in Step 7 (AMRFinderPlus)--isolate with --meta in Step 4 (SPAdes)This skill generalizes to any bacterial species with WGS data in NCBI SRA, including Klebsiella pneumoniae, Staphylococcus aureus, Pseudomonas aeruginosa, Acinetobacter baumannii, and others.