스킬 파일

Version Compatibility

Name: Version Compatibility
Author: John-Wang-0809

End-to-end genome annotation pipeline from assembled contigs to functional annotation, covering repeat masking, gene prediction, and functional assignment for both prokaryotic and eukaryotic genomes. Use when annotating a newly assembled genome from scratch.

John-Wang-08090 스타2026. 3. 16.

직업
카테고리: 생물정보학

스킬 내용

Version Compatibility

Reference examples tested with: BRAKER3 3.0+, BUSCO 5.5+, Bakta 1.9+, Infernal 1.1+, InterProScan 5.66+, Prokka 1.14+, RepeatMasker 4.1+, RepeatModeler 2.0+, eggNOG-mapper 2.1+, pandas 2.2+, tRNAscan-SE 2.0+

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Genome Annotation Pipeline

"Annotate my genome assembly" → Orchestrate prokaryotic (Bakta) or eukaryotic (BRAKER3) gene prediction, repeat masking (RepeatMasker), functional annotation (eggNOG-mapper, InterProScan), and ncRNA annotation (Infernal).

Complete workflow from assembled contigs to functional annotation for prokaryotic or eukaryotic genomes.

Pipeline Overview

관련 스킬

Version Compatibility | Skills Pool

Assembled contigs
    |
    v
[0. Assembly QC] ----------> QUAST, BUSCO (confirm assembly quality)
    |
    +----- Prokaryotic? -----> Path A: Bakta (one-step annotation)
    |                                |
    |                                v
    |                          Annotated genome (GFF3, GenBank, FASTA)
    |
    +----- Eukaryotic? ------> Path B: Multi-step pipeline
                                    |
                                    v
                              [1. Repeat Masking] ----> RepeatModeler + RepeatMasker
                                    |
                                    v
                              [2. Gene Prediction] ---> BRAKER3 (RNA-seq + protein evidence)
                                    |
                                    v
                              [3. Functional Annotation] -> eggNOG-mapper + InterProScan
                                    |
                                    v
                              [4. ncRNA Annotation] ---> Infernal + tRNAscan-SE
                                    |
                                    v
                              Annotated genome (GFF3, proteins, functional tables)

bakta_db download --output /path/to/bakta_db --type full

bakta \
    --db /path/to/bakta_db \
    --output bakta_out \
    --prefix my_genome \
    --locus-tag MYORG \
    --genus Escherichia --species "coli" \
    --strain K12 \
    --gram - \
    --complete \
    --threads 8 \
    assembly.fasta

import subprocess
import json

def validate_prokaryotic_annotation(bakta_dir, prefix, expected_cds_range=(500, 8000)):
    '''
    QC gates for prokaryotic annotation.
    - CDS count in expected range for genome size
    - tRNA count >= 20 (typical minimum for free-living bacteria)
    - rRNA operons detected
    '''
    gff_file = f'{bakta_dir}/{prefix}.gff3'

    feature_counts = {'CDS': 0, 'tRNA': 0, 'rRNA': 0, 'tmRNA': 0, 'ncRNA': 0}
    with open(gff_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            if len(fields) >= 3 and fields[2] in feature_counts:
                feature_counts[fields[2]] += 1

    qc_pass = True
    if not (expected_cds_range[0] <= feature_counts['CDS'] <= expected_cds_range[1]):
        print(f'WARNING: CDS count {feature_counts["CDS"]} outside expected range {expected_cds_range}')
        qc_pass = False
    if feature_counts['tRNA'] < 20:
        print(f'WARNING: Only {feature_counts["tRNA"]} tRNAs detected (expect >= 20)')
        qc_pass = False

    print(f'Feature summary: {feature_counts}')
    return qc_pass, feature_counts

# Build species-specific repeat library
RepeatModeler -database mygenome -pa 8 -LTRStruct

# Combine with known repeats
cat mygenome-families.fa /path/to/RepeatMasker/Libraries/RepeatMaskerLib.h5 > combined_lib.fa

# Mask the genome
RepeatMasker \
    -lib combined_lib.fa \
    -pa 8 \
    -xsmall \
    -gff \
    -dir repeat_out \
    assembly.fasta

def check_repeat_content(repeatmasker_tbl, taxon='vertebrate'):
    '''
    Verify repeat content is within expected range for taxon.
    Typical ranges:
    - Vertebrate: 30-60%
    - Insect: 15-45%
    - Plant: 20-85%
    - Fungus: 3-20%
    '''
    expected_ranges = {
        'vertebrate': (30, 60), 'insect': (15, 45),
        'plant': (20, 85), 'fungus': (3, 20)
    }
    low, high = expected_ranges.get(taxon, (5, 80))

    with open(repeatmasker_tbl) as f:
        for line in f:
            if 'total interspersed' in line.lower():
                pct = float(line.strip().split()[-1].replace('%', ''))
                break

    qc_pass = low <= pct <= high
    if not qc_pass:
        print(f'WARNING: Repeat content {pct:.1f}% outside expected range ({low}-{high}%) for {taxon}')
    return qc_pass, pct

# BRAKER3 combines GeneMark-ETP, AUGUSTUS, and TSEBRA
# Uses both RNA-seq and protein evidence for best results
braker.pl \
    --genome=assembly.fasta.masked \
    --bam=rnaseq_sorted.bam \
    --prot_seq=proteins.fa \
    --softmasking \
    --threads 8 \
    --species=my_species \
    --gff3 \
    --workingdir=braker_out

# If only RNA-seq evidence available
braker.pl \
    --genome=assembly.fasta.masked \
    --bam=rnaseq_sorted.bam \
    --softmasking \
    --threads 8 \
    --species=my_species \
    --gff3

# If only protein evidence available (use OrthoDB proteins)
braker.pl \
    --genome=assembly.fasta.masked \
    --prot_seq=orthodb_proteins.fa \
    --softmasking \
    --threads 8 \
    --species=my_species \
    --gff3

# BUSCO completeness check on predicted proteins
busco \
    -i braker_out/braker.aa \
    -l eukaryota_odb10 \
    -o busco_annotation \
    -m protein \
    --cpu 8

def check_gene_prediction(braker_gff, busco_summary, expected_genes_range=(15000, 35000)):
    '''
    QC gates after gene prediction.
    - Gene count within expected range for genome
    - BUSCO completeness > 90%
    - Mean exons per gene > 1 (spliced genes expected in eukaryotes)
    '''
    gene_count = 0
    exon_count = 0
    with open(braker_gff) as f:
        for line in f:
            if line.startswith('#'):
                continue
            feature = line.strip().split('\t')[2] if len(line.strip().split('\t')) >= 3 else ''
            if feature == 'gene':
                gene_count += 1
            elif feature == 'exon':
                exon_count += 1

    mean_exons = exon_count / gene_count if gene_count > 0 else 0

    with open(busco_summary) as f:
        for line in f:
            if line.strip().startswith('C:'):
                completeness = float(line.strip().split('C:')[1].split('%')[0])
                break

    issues = []
    if not (expected_genes_range[0] <= gene_count <= expected_genes_range[1]):
        issues.append(f'Gene count {gene_count} outside expected range {expected_genes_range}')
    if completeness < 90:
        issues.append(f'BUSCO completeness {completeness:.1f}% < 90%')
    if mean_exons < 2:
        issues.append(f'Mean exons/gene {mean_exons:.1f} is low for eukaryote')

    print(f'Genes: {gene_count}, Mean exons/gene: {mean_exons:.1f}, BUSCO: {completeness:.1f}%')
    return len(issues) == 0, issues

# eggNOG-mapper for comprehensive functional annotation
emapper.py \
    -i braker_out/braker.aa \
    --output eggnog_results \
    --cpu 8 \
    -m diamond \
    --tax_scope auto \
    --go_evidence non-electronic \
    --target_orthologs all \
    --seed_ortholog_evalue 1e-5 \
    --override

# InterProScan for domain annotation (complementary to eggNOG)
interproscan.sh \
    -i braker_out/braker.aa \
    -o interpro_results.tsv \
    -f tsv,gff3 \
    -goterms \
    -pa \
    -cpu 8

import pandas as pd

def check_functional_annotation(eggnog_annotations, total_genes):
    '''
    QC gate: > 60% of genes should have functional assignment.
    Below 50% suggests database issues or highly divergent organism.
    '''
    df = pd.read_csv(eggnog_annotations, sep='\t', comment='#')
    annotated = len(df[df['Description'] != '-'])
    pct_annotated = annotated / total_genes * 100

    has_go = len(df[df['GOs'] != '-'])
    has_kegg = len(df[df['KEGG_ko'] != '-'])

    print(f'Annotated: {annotated}/{total_genes} ({pct_annotated:.1f}%)')
    print(f'With GO terms: {has_go}, With KEGG: {has_kegg}')

    if pct_annotated < 60:
        print('WARNING: <60% annotated. Check database version or use broader taxonomy scope.')
    return pct_annotated >= 60

# tRNAscan-SE for tRNA genes
tRNAscan-SE \
    -E \
    --thread 8 \
    -o trna_results.txt \
    --gff trna.gff \
    assembly.fasta

# Infernal for Rfam-based ncRNA annotation
# Download Rfam covariance models first
cmscan \
    --cpu 8 \
    --tblout rfam_results.tbl \
    --fmt 2 \
    --clanin Rfam.clanin \
    Rfam.cm \
    assembly.fasta

def merge_annotations(braker_gff, trna_gff, rfam_tbl, eggnog_tsv, output_gff):
    '''Merge gene predictions, ncRNAs, and functional annotations into final GFF3.'''
    import subprocess

    # Use AGAT for GFF merging and validation
    subprocess.run([
        'agat_sp_merge_annotations.pl',
        '--gff', braker_gff,
        '--gff', trna_gff,
        '-o', output_gff
    ], check=True)

    # Validate final GFF3
    subprocess.run([
        'agat_sp_statistics.pl',
        '--gff', output_gff,
        '-o', output_gff.replace('.gff3', '_stats.txt')
    ], check=True)

    print(f'Merged annotations written to {output_gff}')

Issue	Likely Cause	Solution
Low gene count	Repeat masking too aggressive	Use `-xsmall` (soft-masking) not `-x` (hard-masking)
BUSCO < 80%	Poor assembly or missing evidence	Add RNA-seq data; check assembly contiguity
Many partial genes	Fragmented assembly	Scaffold first; use `--min_contig` in BRAKER
< 60% annotated	Divergent organism	Use broader `--tax_scope`; try InterProScan
Too many genes	Gene prediction artifacts	Increase `--min_intron_len`; filter short ORFs
Missing ncRNAs	Wrong Rfam models	Verify Rfam version matches genome build

#!/bin/bash
set -e

GENOME="assembly.fasta"
RNASEQ_BAM="rnaseq_sorted.bam"
PROTEINS="orthodb_proteins.fa"
BAKTA_DB="/path/to/bakta_db"
THREADS=8

# Determine organism type
ORGANISM_TYPE="${1:-eukaryotic}"  # prokaryotic or eukaryotic

if [ "$ORGANISM_TYPE" == "prokaryotic" ]; then
    echo "Running prokaryotic annotation with Bakta"
    bakta --db $BAKTA_DB --output bakta_out --prefix genome \
          --locus-tag MYORG --threads $THREADS $GENOME
    echo "Done. Results in bakta_out/"

else
    echo "Running eukaryotic annotation pipeline"

    echo "Step 1: Repeat masking"
    RepeatModeler -database mygenome -pa $THREADS -LTRStruct
    RepeatMasker -lib mygenome-families.fa -pa $THREADS -xsmall -gff -dir repeat_out $GENOME

    echo "Step 2: Gene prediction with BRAKER3"
    braker.pl --genome=repeat_out/$(basename $GENOME).masked \
              --bam=$RNASEQ_BAM --prot_seq=$PROTEINS \
              --softmasking --threads $THREADS --gff3 --workingdir=braker_out

    echo "Step 3: BUSCO QC"
    busco -i braker_out/braker.aa -l eukaryota_odb10 -o busco_check -m protein --cpu $THREADS

    echo "Step 4: Functional annotation"
    emapper.py -i braker_out/braker.aa --output eggnog_out --cpu $THREADS -m diamond

    echo "Step 5: ncRNA annotation"
    tRNAscan-SE -E --thread $THREADS -o trna_out.txt --gff trna.gff $GENOME
    cmscan --cpu $THREADS --tblout rfam.tbl --fmt 2 Rfam.cm $GENOME

    echo "Done. Check braker_out/, eggnog_out*, trna.gff, rfam.tbl"
fi

Version Compatibility

Version Compatibility

Genome Annotation Pipeline

Pipeline Overview

Version Compatibility

Version Compatibility

Genome Annotation Pipeline

Pipeline Overview

Path A: Prokaryotic Annotation (Bakta)

Database Setup

Run Bakta

Prokaryotic QC Checkpoint

Path B: Eukaryotic Annotation

Step 1: Repeat Masking

Repeat Masking QC Checkpoint

Step 2: Gene Prediction with BRAKER3

Gene Prediction QC Checkpoint

Step 3: Functional Annotation

Functional Annotation QC Checkpoint

Step 4: ncRNA Annotation

Merging Annotations

Troubleshooting

Complete Pipeline Script

Nanoclaw Repl

Bioinformatics

Smart Explore

Vector Database Engineer

Skin Health Analyzer

Scanpy

Version Compatibility

Version Compatibility

Genome Annotation Pipeline

Pipeline Overview

Version Compatibility

Version Compatibility

Genome Annotation Pipeline

Pipeline Overview

Path A: Prokaryotic Annotation (Bakta)

Database Setup

Run Bakta

Prokaryotic QC Checkpoint

Path B: Eukaryotic Annotation

Step 1: Repeat Masking

Repeat Masking QC Checkpoint

Step 2: Gene Prediction with BRAKER3

Gene Prediction QC Checkpoint

Step 3: Functional Annotation

Functional Annotation QC Checkpoint

Step 4: ncRNA Annotation

Merging Annotations

Troubleshooting

Complete Pipeline Script

Related Skills

Nanoclaw Repl

Bioinformatics

Smart Explore

Vector Database Engineer

Skin Health Analyzer

Scanpy