End-to-end genome annotation pipeline from assembled contigs to functional annotation, covering repeat masking, gene prediction, and functional assignment for both prokaryotic and eukaryotic genomes. Use when annotating a newly assembled genome from scratch.
Reference examples tested with: BRAKER3 3.0+, BUSCO 5.5+, Bakta 1.9+, Infernal 1.1+, InterProScan 5.66+, Prokka 1.14+, RepeatMasker 4.1+, RepeatModeler 2.0+, eggNOG-mapper 2.1+, pandas 2.2+, tRNAscan-SE 2.0+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Annotate my genome assembly" → Orchestrate prokaryotic (Bakta) or eukaryotic (BRAKER3) gene prediction, repeat masking (RepeatMasker), functional annotation (eggNOG-mapper, InterProScan), and ncRNA annotation (Infernal).
Complete workflow from assembled contigs to functional annotation for prokaryotic or eukaryotic genomes.
Assembled contigs
|
v
[0. Assembly QC] ----------> QUAST, BUSCO (confirm assembly quality)
|
+----- Prokaryotic? -----> Path A: Bakta (one-step annotation)
| |
| v
| Annotated genome (GFF3, GenBank, FASTA)
|
+----- Eukaryotic? ------> Path B: Multi-step pipeline
|
v
[1. Repeat Masking] ----> RepeatModeler + RepeatMasker
|
v
[2. Gene Prediction] ---> BRAKER3 (RNA-seq + protein evidence)
|
v
[3. Functional Annotation] -> eggNOG-mapper + InterProScan
|
v
[4. ncRNA Annotation] ---> Infernal + tRNAscan-SE
|
v
Annotated genome (GFF3, proteins, functional tables)
Bakta provides comprehensive one-step annotation for bacteria and archaea. Preferred over Prokka for new projects.
bakta_db download --output /path/to/bakta_db --type full
bakta \
--db /path/to/bakta_db \
--output bakta_out \
--prefix my_genome \
--locus-tag MYORG \
--genus Escherichia --species "coli" \
--strain K12 \
--gram - \
--complete \
--threads 8 \
assembly.fasta
import subprocess
import json
def validate_prokaryotic_annotation(bakta_dir, prefix, expected_cds_range=(500, 8000)):
'''
QC gates for prokaryotic annotation.
- CDS count in expected range for genome size
- tRNA count >= 20 (typical minimum for free-living bacteria)
- rRNA operons detected
'''
gff_file = f'{bakta_dir}/{prefix}.gff3'
feature_counts = {'CDS': 0, 'tRNA': 0, 'rRNA': 0, 'tmRNA': 0, 'ncRNA': 0}
with open(gff_file) as f:
for line in f:
if line.startswith('#'):
continue
fields = line.strip().split('\t')
if len(fields) >= 3 and fields[2] in feature_counts:
feature_counts[fields[2]] += 1
qc_pass = True
if not (expected_cds_range[0] <= feature_counts['CDS'] <= expected_cds_range[1]):
print(f'WARNING: CDS count {feature_counts["CDS"]} outside expected range {expected_cds_range}')
qc_pass = False
if feature_counts['tRNA'] < 20:
print(f'WARNING: Only {feature_counts["tRNA"]} tRNAs detected (expect >= 20)')
qc_pass = False
print(f'Feature summary: {feature_counts}')
return qc_pass, feature_counts
# Build species-specific repeat library
RepeatModeler -database mygenome -pa 8 -LTRStruct
# Combine with known repeats
cat mygenome-families.fa /path/to/RepeatMasker/Libraries/RepeatMaskerLib.h5 > combined_lib.fa
# Mask the genome
RepeatMasker \
-lib combined_lib.fa \
-pa 8 \
-xsmall \
-gff \
-dir repeat_out \
assembly.fasta
def check_repeat_content(repeatmasker_tbl, taxon='vertebrate'):
'''
Verify repeat content is within expected range for taxon.
Typical ranges:
- Vertebrate: 30-60%
- Insect: 15-45%
- Plant: 20-85%
- Fungus: 3-20%
'''
expected_ranges = {
'vertebrate': (30, 60), 'insect': (15, 45),
'plant': (20, 85), 'fungus': (3, 20)
}
low, high = expected_ranges.get(taxon, (5, 80))
with open(repeatmasker_tbl) as f:
for line in f:
if 'total interspersed' in line.lower():
pct = float(line.strip().split()[-1].replace('%', ''))
break
qc_pass = low <= pct <= high
if not qc_pass:
print(f'WARNING: Repeat content {pct:.1f}% outside expected range ({low}-{high}%) for {taxon}')
return qc_pass, pct
# BRAKER3 combines GeneMark-ETP, AUGUSTUS, and TSEBRA
# Uses both RNA-seq and protein evidence for best results
braker.pl \
--genome=assembly.fasta.masked \
--bam=rnaseq_sorted.bam \
--prot_seq=proteins.fa \
--softmasking \
--threads 8 \
--species=my_species \
--gff3 \
--workingdir=braker_out
# If only RNA-seq evidence available
braker.pl \
--genome=assembly.fasta.masked \
--bam=rnaseq_sorted.bam \
--softmasking \
--threads 8 \
--species=my_species \
--gff3
# If only protein evidence available (use OrthoDB proteins)
braker.pl \
--genome=assembly.fasta.masked \
--prot_seq=orthodb_proteins.fa \
--softmasking \
--threads 8 \
--species=my_species \
--gff3
# BUSCO completeness check on predicted proteins
busco \
-i braker_out/braker.aa \
-l eukaryota_odb10 \
-o busco_annotation \
-m protein \
--cpu 8
def check_gene_prediction(braker_gff, busco_summary, expected_genes_range=(15000, 35000)):
'''
QC gates after gene prediction.
- Gene count within expected range for genome
- BUSCO completeness > 90%
- Mean exons per gene > 1 (spliced genes expected in eukaryotes)
'''
gene_count = 0
exon_count = 0
with open(braker_gff) as f:
for line in f:
if line.startswith('#'):
continue
feature = line.strip().split('\t')[2] if len(line.strip().split('\t')) >= 3 else ''
if feature == 'gene':
gene_count += 1
elif feature == 'exon':
exon_count += 1
mean_exons = exon_count / gene_count if gene_count > 0 else 0
with open(busco_summary) as f:
for line in f:
if line.strip().startswith('C:'):
completeness = float(line.strip().split('C:')[1].split('%')[0])
break
issues = []
if not (expected_genes_range[0] <= gene_count <= expected_genes_range[1]):
issues.append(f'Gene count {gene_count} outside expected range {expected_genes_range}')
if completeness < 90:
issues.append(f'BUSCO completeness {completeness:.1f}% < 90%')
if mean_exons < 2:
issues.append(f'Mean exons/gene {mean_exons:.1f} is low for eukaryote')
print(f'Genes: {gene_count}, Mean exons/gene: {mean_exons:.1f}, BUSCO: {completeness:.1f}%')
return len(issues) == 0, issues
# eggNOG-mapper for comprehensive functional annotation
emapper.py \
-i braker_out/braker.aa \
--output eggnog_results \
--cpu 8 \
-m diamond \
--tax_scope auto \
--go_evidence non-electronic \
--target_orthologs all \
--seed_ortholog_evalue 1e-5 \
--override
# InterProScan for domain annotation (complementary to eggNOG)
interproscan.sh \
-i braker_out/braker.aa \
-o interpro_results.tsv \
-f tsv,gff3 \
-goterms \
-pa \
-cpu 8
import pandas as pd
def check_functional_annotation(eggnog_annotations, total_genes):
'''
QC gate: > 60% of genes should have functional assignment.
Below 50% suggests database issues or highly divergent organism.
'''
df = pd.read_csv(eggnog_annotations, sep='\t', comment='#')
annotated = len(df[df['Description'] != '-'])
pct_annotated = annotated / total_genes * 100
has_go = len(df[df['GOs'] != '-'])
has_kegg = len(df[df['KEGG_ko'] != '-'])
print(f'Annotated: {annotated}/{total_genes} ({pct_annotated:.1f}%)')
print(f'With GO terms: {has_go}, With KEGG: {has_kegg}')
if pct_annotated < 60:
print('WARNING: <60% annotated. Check database version or use broader taxonomy scope.')
return pct_annotated >= 60
# tRNAscan-SE for tRNA genes
tRNAscan-SE \
-E \
--thread 8 \
-o trna_results.txt \
--gff trna.gff \
assembly.fasta
# Infernal for Rfam-based ncRNA annotation
# Download Rfam covariance models first
cmscan \
--cpu 8 \
--tblout rfam_results.tbl \
--fmt 2 \
--clanin Rfam.clanin \
Rfam.cm \
assembly.fasta
def merge_annotations(braker_gff, trna_gff, rfam_tbl, eggnog_tsv, output_gff):
'''Merge gene predictions, ncRNAs, and functional annotations into final GFF3.'''
import subprocess
# Use AGAT for GFF merging and validation
subprocess.run([
'agat_sp_merge_annotations.pl',
'--gff', braker_gff,
'--gff', trna_gff,
'-o', output_gff
], check=True)
# Validate final GFF3
subprocess.run([
'agat_sp_statistics.pl',
'--gff', output_gff,
'-o', output_gff.replace('.gff3', '_stats.txt')
], check=True)
print(f'Merged annotations written to {output_gff}')
| Issue | Likely Cause | Solution |
|---|---|---|
| Low gene count | Repeat masking too aggressive | Use -xsmall (soft-masking) not -x (hard-masking) |
| BUSCO < 80% | Poor assembly or missing evidence | Add RNA-seq data; check assembly contiguity |
| Many partial genes | Fragmented assembly | Scaffold first; use --min_contig in BRAKER |
| < 60% annotated | Divergent organism | Use broader --tax_scope; try InterProScan |
| Too many genes | Gene prediction artifacts | Increase --min_intron_len; filter short ORFs |
| Missing ncRNAs | Wrong Rfam models | Verify Rfam version matches genome build |
#!/bin/bash
set -e
GENOME="assembly.fasta"
RNASEQ_BAM="rnaseq_sorted.bam"
PROTEINS="orthodb_proteins.fa"
BAKTA_DB="/path/to/bakta_db"
THREADS=8
# Determine organism type
ORGANISM_TYPE="${1:-eukaryotic}" # prokaryotic or eukaryotic
if [ "$ORGANISM_TYPE" == "prokaryotic" ]; then
echo "Running prokaryotic annotation with Bakta"
bakta --db $BAKTA_DB --output bakta_out --prefix genome \
--locus-tag MYORG --threads $THREADS $GENOME
echo "Done. Results in bakta_out/"
else
echo "Running eukaryotic annotation pipeline"
echo "Step 1: Repeat masking"
RepeatModeler -database mygenome -pa $THREADS -LTRStruct
RepeatMasker -lib mygenome-families.fa -pa $THREADS -xsmall -gff -dir repeat_out $GENOME
echo "Step 2: Gene prediction with BRAKER3"
braker.pl --genome=repeat_out/$(basename $GENOME).masked \
--bam=$RNASEQ_BAM --prot_seq=$PROTEINS \
--softmasking --threads $THREADS --gff3 --workingdir=braker_out
echo "Step 3: BUSCO QC"
busco -i braker_out/braker.aa -l eukaryota_odb10 -o busco_check -m protein --cpu $THREADS
echo "Step 4: Functional annotation"
emapper.py -i braker_out/braker.aa --output eggnog_out --cpu $THREADS -m diamond
echo "Step 5: ncRNA annotation"
tRNAscan-SE -E --thread $THREADS -o trna_out.txt --gff trna.gff $GENOME
cmscan --cpu $THREADS --tblout rfam.tbl --fmt 2 Rfam.cm $GENOME
echo "Done. Check braker_out/, eggnog_out*, trna.gff, rfam.tbl"
fi