Annotate bacterial and archaeal genomes with Bakta for comprehensive structural and functional annotation, or Prokka for lightweight annotation. Generates GFF3, GenBank, and FASTA outputs with NCBI-compatible locus tags. Use when annotating a newly assembled prokaryotic genome or preparing annotations for NCBI submission.
Reference examples tested with: BUSCO 5.5+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- pip show <package> then help(module.function) to check signatures
- <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
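The version checks above can also be scripted. The following is a minimal sketch, not part of any tool's API: the helper names `cli_version` and `package_version` are hypothetical, and it assumes each CLI tool accepts a `--version` flag.

```python
import shutil
import subprocess
from importlib import metadata


def cli_version(tool, flag="--version"):
    """Return the first line a CLI tool prints for its version flag.

    Returns None when the tool is not on PATH.
    """
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, flag], capture_output=True, text=True, check=False)
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def package_version(package):
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for tool in ("bakta", "prokka", "busco"):
        print(tool, cli_version(tool))
    print("gffutils", package_version("gffutils"))
```

A result of None means the tool or package is not installed, so the examples that depend on it should be skipped rather than adapted.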
"Annotate my bacterial genome" → Predict and functionally annotate coding sequences, rRNAs, tRNAs, and other features in a prokaryotic genome assembly.
bakta --db db/ assembly.fa (preferred), prokka --outdir annot assembly.fa (legacy)

Annotate prokaryotic genomes with Bakta (preferred) or Prokka (legacy). Bakta provides more comprehensive functional annotation through up-to-date databases and NCBI-compatible output formatting.
# Download the full database (~30 GB, recommended for comprehensive annotation)
bakta_db download --output /path/to/bakta_db --type full
# Lightweight database (~1.5 GB, faster but less comprehensive)
bakta_db download --output /path/to/bakta_db --type light
# Update existing database
bakta_db update --db /path/to/bakta_db
bakta \
--db /path/to/bakta_db \
--output bakta_out \
--prefix my_genome \
--locus-tag MYORG \
--threads 8 \
assembly.fasta
| Option | Description |
|---|---|
| --db | Path to Bakta database |
| --output | Output directory |
| --prefix | Output file prefix |
| --locus-tag | NCBI-compatible locus tag prefix |
| --genus / --species | Organism taxonomy |
| --strain | Strain designation |
| --complete | Flag for complete genomes (enables oriC/oriV detection) |
| --gram | Gram type (+ or -) for signal peptide prediction |
| --threads | CPU threads |
| --min-contig-length | Minimum contig length to annotate (default: 1) |
| --translation-table | Genetic code (default: 11 for bacteria) |
bakta \
--db /path/to/bakta_db \
--output bakta_out \
--prefix ecoli_k12 \
--locus-tag ECK12 \
--genus Escherichia --species coli --strain K-12 \
--gram - \
--complete \
--threads 16 \
assembly.fasta
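When annotating many assemblies, it can help to build the Bakta command in code rather than by hand. This is a minimal sketch using only the flags shown above; the function name `build_bakta_command` and its parameters are hypothetical, chosen for illustration.

```python
import shlex


def build_bakta_command(assembly, db, output, prefix, locus_tag,
                        genus=None, species=None, strain=None,
                        gram=None, complete=False, threads=8):
    """Assemble a Bakta argument list from the options documented above."""
    cmd = ["bakta", "--db", db, "--output", output,
           "--prefix", prefix, "--locus-tag", locus_tag,
           "--threads", str(threads)]
    # Optional taxonomy and Gram-type flags are added only when provided.
    for flag, value in (("--genus", genus), ("--species", species),
                        ("--strain", strain), ("--gram", gram)):
        if value is not None:
            cmd += [flag, value]
    if complete:
        cmd.append("--complete")
    cmd.append(assembly)  # the assembly FASTA is the positional argument
    return cmd


if __name__ == "__main__":
    cmd = build_bakta_command(
        "assembly.fasta", "/path/to/bakta_db", "bakta_out", "ecoli_k12", "ECK12",
        genus="Escherichia", species="coli", strain="K-12",
        gram="-", complete=True, threads=16,
    )
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with values such as strain names.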
bakta_out/
├── my_genome.gff3 # GFF3 annotation (primary output)
├── my_genome.gbff # GenBank format
├── my_genome.ffn # Feature nucleotide sequences (CDSs and other annotated features)
├── my_genome.faa # Protein sequences
├── my_genome.fna # Annotated genome sequence
├── my_genome.embl # EMBL format
├── my_genome.tsv # Tab-separated feature table
├── my_genome.json # Machine-readable JSON
└── my_genome.txt # Summary statistics
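A quick check that a run produced every file in the listing above can catch truncated or failed jobs early. This is a sketch under the assumption that the run used the same extensions as the listing; the helper name `missing_outputs` is hypothetical, and Bakta may write additional files that are not checked here.

```python
from pathlib import Path

# Extensions taken from the listing above; Bakta may write additional files.
EXPECTED_EXTENSIONS = ("gff3", "gbff", "ffn", "faa", "fna", "embl", "tsv", "json", "txt")


def missing_outputs(output_dir, prefix, extensions=EXPECTED_EXTENSIONS):
    """Return the names of expected output files that are absent or empty."""
    missing = []
    for ext in extensions:
        path = Path(output_dir) / f"{prefix}.{ext}"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    problems = missing_outputs("bakta_out", "my_genome")
    print("All outputs present" if not problems else f"Missing or empty: {problems}")
```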
Prokka is lighter weight and faster but uses older databases. Prefer Bakta for new projects.
prokka \
--outdir prokka_out \
--prefix my_genome \
--locustag MYORG \
--genus Escherichia --species coli \
--cpus 8 \
--rfam \
assembly.fasta
| Feature | Bakta | Prokka |
|---|---|---|
| Database updates | Active (2024+) | Unmaintained since 2021 |
| Functional annotation | Comprehensive (UniProt, COG, Pfam) | Basic (UniProt) |
| ncRNA detection | Infernal + Rfam 14.x | Infernal + Rfam 12.x |
| NCBI compatibility | INSDC-compliant output via --compliant | .sqn output via tbl2asn |
| Speed | Moderate | Fast |
Goal: Load Bakta/Prokka GFF3 output into a queryable database to extract CDS features and compute annotation quality metrics like coding density.
Approach: Create a gffutils in-memory database from the GFF3 file, iterate CDS features to extract locus tags and product names, and calculate coding density as total CDS bp divided by genome length.
import gffutils


def load_annotation(gff_file):
    '''Load GFF3 into a queryable database.'''
    db = gffutils.create_db(gff_file, ':memory:', merge_strategy='merge')
    return db


def extract_cds_features(db):
    '''Extract all CDS features with product annotations.'''
    features = []
    for cds in db.features_of_type('CDS'):
        features.append({
            'id': cds.id,
            'seqid': cds.seqid,
            'start': cds.start,
            'end': cds.end,
            'strand': cds.strand,
            'product': cds.attributes.get('product', ['unknown'])[0],
            'locus_tag': cds.attributes.get('locus_tag', [''])[0]
        })
    return features


def compute_coding_density(db, genome_length):
    '''Compute fraction of genome encoding proteins.

    Typical prokaryotic coding density: 85-95%.
    Values below 80% may indicate pseudogenes or annotation gaps.
    Values above 95% may indicate overlapping annotations.
    '''
    coding_bp = sum(cds.end - cds.start + 1 for cds in db.features_of_type('CDS'))
    return coding_bp / genome_length


db = load_annotation('bakta_out/my_genome.gff3')
cds_features = extract_cds_features(db)
print(f'Total CDSs: {len(cds_features)}')
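The density calculation needs a genome length, and summing raw CDS lengths counts overlapping genes twice. The sketch below is one way to handle both: it reads the length from a FASTA file and merges overlapping CDS intervals per contig before summing. The helper names (`fasta_length`, `merged_length`, `coding_density`) are hypothetical, and merging ignores strand, so overlapping genes on opposite strands are counted once.

```python
def fasta_length(fasta_path):
    """Sum sequence lengths across all records in a FASTA file."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def merged_length(intervals):
    """Total bp covered by 1-based inclusive (start, end) intervals.

    Overlapping intervals are merged so shared bases are counted once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered


def coding_density(cds_features, genome_length):
    """Coding density from dicts with 'seqid', 'start', and 'end' keys."""
    by_seq = {}
    for feature in cds_features:
        by_seq.setdefault(feature["seqid"], []).append((feature["start"], feature["end"]))
    coding_bp = sum(merged_length(iv) for iv in by_seq.values())
    return coding_bp / genome_length
```

With the objects from the example above, usage would look like `coding_density(cds_features, fasta_length('bakta_out/my_genome.fna'))`.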
| Genome Size | Expected Genes | Coding Density |
|---|---|---|
| 1-2 Mb | 900-2,000 | 85-92% |
| 2-5 Mb | 1,800-5,000 | 85-90% |
| 5-10 Mb | 4,500-9,000 | 82-88% |
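The ranges in the table above can be turned into an automated sanity check. This is a minimal sketch: the function name `check_annotation` is hypothetical, the thresholds are copied directly from the table, and a genome size that falls exactly on a boundary (for example 2 Mb) is matched against the first row that contains it.

```python
# (min_mb, max_mb, (min_genes, max_genes), (min_density, max_density)),
# copied from the table above.
EXPECTED_RANGES = [
    (1, 2, (900, 2000), (0.85, 0.92)),
    (2, 5, (1800, 5000), (0.85, 0.90)),
    (5, 10, (4500, 9000), (0.82, 0.88)),
]


def check_annotation(genome_bp, n_genes, density):
    """Return a list of warnings for values outside the expected ranges."""
    size_mb = genome_bp / 1_000_000
    for low_mb, high_mb, gene_range, density_range in EXPECTED_RANGES:
        if low_mb <= size_mb <= high_mb:
            break
    else:
        return [f"no expected range listed for a {size_mb:.2f} Mb genome"]

    warnings = []
    if not gene_range[0] <= n_genes <= gene_range[1]:
        warnings.append(
            f"gene count {n_genes} outside {gene_range[0]}-{gene_range[1]}")
    if not density_range[0] <= density <= density_range[1]:
        warnings.append(
            f"coding density {density:.1%} outside "
            f"{density_range[0]:.0%}-{density_range[1]:.0%}")
    return warnings


if __name__ == "__main__":
    # Illustrative values only, not results from a real annotation run.
    print(check_annotation(4_600_000, 4300, 0.88))
```

A warning is a prompt to inspect the assembly and annotation, not proof of an error; unusual genomes can legitimately fall outside these ranges.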
# Count annotated features
grep -c $'\tCDS\t' bakta_out/my_genome.gff3
grep -c $'\ttRNA\t' bakta_out/my_genome.gff3
grep -c $'\trRNA\t' bakta_out/my_genome.gff3
# Check for hypothetical proteins (ideally <40% of total CDSs)
grep -c 'hypothetical protein' bakta_out/my_genome.tsv
busco -i bakta_out/my_genome.faa -m proteins -l bacteria_odb10 -o busco_proteins
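The hypothetical-protein check above yields a raw count, while the 40% guideline is a fraction of all CDSs. The sketch below computes that fraction directly from GFF3 lines. It assumes hypothetical proteins are labelled with the attribute `product=hypothetical protein` in column 9; the function name `hypothetical_fraction` is hypothetical.

```python
def hypothetical_fraction(gff_lines):
    """Fraction of CDS features whose product is 'hypothetical protein'."""
    total = hypothetical = 0
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        total += 1
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        if attrs.get("product", "").lower() == "hypothetical protein":
            hypothetical += 1
    return hypothetical / total if total else 0.0


if __name__ == "__main__":
    with open("bakta_out/my_genome.gff3") as handle:
        fraction = hypothetical_fraction(handle)
    print(f"Hypothetical proteins: {fraction:.1%} of CDSs")
```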
For NCBI submission, run Bakta with the --compliant flag to produce NCBI-ready output.