Predict protein subcellular localization from amino acid sequence using BioT5. Use this skill when: (1) you have a protein sequence and want to know where it localizes in the cell, (2) you need to identify the cellular compartment (nucleus, cytoplasm, membrane, etc.), or (3) you want a quick localization prediction without experimental data.
Predict subcellular localization for proteins from their amino acid sequences using the BioT5 model.
from open_biomed.data import Protein, Text
from open_biomed.core.pipeline import InferencePipeline
# Create protein from FASTA sequence
protein = Protein.from_fasta("YOUR_AMINO_ACID_SEQUENCE")
# Create the question for subcellular localization
question = Text.from_str(
    "Please provide information about the subcellular localization of this protein."
)
# Load the BioT5 model for protein question answering
pipeline = InferencePipeline(
    task="protein_question_answering",
    model="biot5",
    model_ckpt="./checkpoints/server/protein_question_answering_biot5.ckpt",
    device="cuda:0"
)
# Run inference to get localization prediction
outputs = pipeline.run(protein=protein, text=question)
localization = outputs[0][0].str
print(localization)
See examples/basic_example.py for a complete runnable script.
The model returns subcellular localization information:
| Output | Description |
|---|---|
| Cytoplasm | Cytoplasmic proteins, soluble enzymes |
| Nucleus | Nuclear proteins, transcription factors |
| Membrane | Membrane-bound proteins, receptors |
| Secreted | Extracellular proteins, secreted factors |
| Mitochondria | Mitochondrial proteins |
| Peroxisome | Peroxisomal enzymes |
| Endoplasmic reticulum | ER-resident proteins |
| Golgi apparatus | Golgi-localized proteins |
Example output: Cytoplasm
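If the prediction needs to feed into downstream code, the returned text can be mapped onto the categories in the table above. The following is a minimal sketch, assuming the model output is free text that mentions the compartment by name. `match_compartments` and the keyword lists are hypothetical and illustrative, not part of open_biomed or BioT5.

```python
from typing import List

# Categories taken from the output table; the keyword lists are illustrative
# assumptions, not an exhaustive vocabulary of what the model can emit.
COMPARTMENT_KEYWORDS = {
    "Cytoplasm": ("cytoplasm", "cytosol"),
    "Nucleus": ("nucleus", "nuclear"),
    "Membrane": ("membrane",),
    "Secreted": ("secreted", "extracellular"),
    "Mitochondria": ("mitochondri",),
    "Peroxisome": ("peroxisom",),
    "Endoplasmic reticulum": ("endoplasmic reticulum",),
    "Golgi apparatus": ("golgi",),
}


def match_compartments(prediction: str) -> List[str]:
    """Return every table category whose keywords appear in the prediction text."""
    text = prediction.lower()
    return [
        name
        for name, keywords in COMPARTMENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


if __name__ == "__main__":
    print(match_compartments("Cytoplasm"))
    print(match_compartments("The protein is located in the nucleus and cytoplasm."))
```

Plain substring matching is deliberately loose: a phrase such as "mitochondrial membrane" matches both Mitochondria and Membrane, so treat multiple hits as a prompt to read the raw text rather than as a final answer.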
The skill accepts a protein sequence as an amino acid string, passed to Protein.from_fasta:
# From raw sequence string
protein = Protein.from_fasta("MRVGVIRFPGSNCDRDVHHVLELAGAEPEYVWW...")
# From UniProt (get sequence first)
from open_biomed.tools.tool_registry import TOOLS
tool = TOOLS["protein_uniprot_request"]
protein, _ = tool.run(accession="P00533") # Example: EGFR
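Sequences copied from a FASTA file usually carry a header line and line breaks. The sketch below shows one possible cleanup step, assuming `Protein.from_fasta` expects the bare residue string as in the calls above. `clean_sequence` is a hypothetical helper, not part of open_biomed, and restricting input to the 20 standard amino acid codes is an assumption about what the model handles.

```python
# Hypothetical helper, not part of open_biomed: normalize raw input into a
# plain amino acid string before building a Protein object.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Drop FASTA header lines and whitespace, uppercase, and validate residues."""
    parts = []
    for line in raw.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            continue  # FASTA header line, not sequence data
        parts.append("".join(line.split()))  # remove any internal whitespace
    sequence = "".join(parts).upper()
    if not sequence:
        raise ValueError("No sequence found in input")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"Non-standard residue codes: {', '.join(invalid)}")
    return sequence


if __name__ == "__main__":
    record = ">example\nMRVGVIRFPG\nSNCDRDVHHV\n"
    print(clean_sequence(record))
```

The cleaned string can then be passed to `Protein.from_fasta` in place of the literal sequence.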
| Error | Cause | Solution |
|---|---|---|
| FileNotFoundError | Model checkpoint not found | Download checkpoint to ./checkpoints/server/ |
| CUDA out of memory | GPU memory insufficient | Use smaller batch or CPU device |
| Sequence too long | Exceeds 512 amino acid limit | Truncate sequence or use sliding window |
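The sliding-window option in the last row can be sketched as follows. This is an illustration under assumptions, not the skill's own implementation: `sliding_windows` is a hypothetical helper, the 512-residue window comes from the table above, and the 256-residue stride is an arbitrary choice.

```python
from typing import Iterator, Tuple

MAX_LENGTH = 512  # limit stated in the error table


def sliding_windows(
    sequence: str, window: int = MAX_LENGTH, stride: int = 256
) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, fragment) pairs that together cover the sequence."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride must not exceed window, or residues would be skipped")
    start = 0
    while True:
        yield start, sequence[start:start + window]
        if start + window >= len(sequence):
            break  # this fragment reached the end of the sequence
        start += stride


if __name__ == "__main__":
    long_sequence = "M" + "A" * 999  # placeholder 1000-residue sequence
    for offset, fragment in sliding_windows(long_sequence):
        print(offset, len(fragment))
```

Each fragment can be run through the pipeline separately. Because targeting signals often sit near one end of a protein, per-window predictions may disagree, and how to combine them is left to the caller.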
- protein-function-annotation: For function prediction
- protein-mutation-analysis: For mutation effect prediction
- uniprot-query: For retrieving protein metadata from UniProt