Name: Similar Protein Retrieval
Author: PharMolix

Retrieve proteins that share a similar structure, a similar sequence, or a family with a query protein, using FoldSeek (structure search) or MSA (sequence search).

When to Use

User provides a protein and wants to find similar proteins
User asks for homologs or orthologs of a protein
User wants proteins with similar 3D structure
User wants to search by sequence similarity
User provides a UniProt ID, a PDB ID, a FASTA sequence or file, or a PDB file as input

Workflow

Step 1: Parse Input and Load Protein

Detect input type and load the protein appropriately.

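import os
import re
import requests
from open_biomed.data import Protein
from open_biomed.tools.tool_registry import TOOLS

# UniProt accession format (6 or 10 characters), e.g., P0DTC2
UNIPROT_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")
# PDB ID: a digit followed by three alphanumeric characters, e.g., 6LZG
PDB_RE = re.compile(r"[0-9][A-Za-z0-9]{3}")

def read_fasta_sequence(text):
    """Drop header lines and whitespace; assumes a single-record FASTA."""
    return ''.join(l.strip() for l in text.splitlines() if not l.startswith('>'))

def parse_input(user_input):
    """Parse input and return (Protein, has_structure, input_type)."""
    user_input = user_input.strip()

    # Check if it's a file path
    if os.path.isfile(user_input):
        lower = user_input.lower()
        if lower.endswith('.pdb'):
            return Protein.from_pdb_file(user_input), True, "pdb_file"
        if lower.endswith(('.fasta', '.fa')):
            with open(user_input) as f:
                seq = read_fasta_sequence(f.read())
            return Protein.from_fasta(seq), False, "fasta_file"
        # An existing file with another extension is not a valid input
        raise ValueError(f"Unsupported file type: {user_input}")

    # Check if it's a UniProt ID (e.g., P0DTC2); query_uniprot is provided by the UniProt query step
    if UNIPROT_RE.fullmatch(user_input.upper()):
        return query_uniprot(user_input.upper())

    # Check if it's a PDB ID (4 characters, e.g., 6LZG); query_pdb is provided by the PDB query step
    if PDB_RE.fullmatch(user_input):
        return query_pdb(user_input)

    # Otherwise assume it's a FASTA sequence, with or without a header line
    return Protein.from_fasta(read_fasta_sequence(user_input)), False, "fasta_string"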
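
Identifier detection can be checked in isolation, without OpenBioMed installed. The sketch below is one possible approach, not the skill's own implementation: it uses the accession pattern published by UniProt and the classic 4-character PDB ID shape. The names `classify_identifier`, `UNIPROT_RE`, `PDB_RE`, and `AA_RE`, and the set of accepted residue letters, are assumptions made for illustration.

```python
import re

# Accession pattern as published by UniProt (6 or 10 characters).
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Classic PDB ID: a digit 1-9 followed by three alphanumeric characters.
PDB_RE = re.compile(r"[1-9][A-Za-z0-9]{3}")
# One-letter residue codes, including ambiguity codes (assumed alphabet).
AA_RE = re.compile(r"[ACDEFGHIKLMNPQRSTVWYBXZUO]+", re.IGNORECASE)


def classify_identifier(text):
    """Return 'uniprot_id', 'pdb_id', 'fasta_string', or 'unknown'."""
    token = text.strip()
    if UNIPROT_RE.fullmatch(token.upper()):
        return "uniprot_id"
    if PDB_RE.fullmatch(token):
        return "pdb_id"
    if AA_RE.fullmatch(token):
        return "fasta_string"
    return "unknown"


if __name__ == "__main__":
    for example in ["P0DTC2", "A0A024R161", "6LZG", "MKTAYIAKQR", "not a protein!"]:
        print(example, "->", classify_identifier(example))
```

Matching on the full accession pattern rather than on length alone keeps a short peptide such as `MKTAYIAKQR` (10 letters, starting with a letter) from being sent to UniProt as if it were an ID.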

Expected Outputs

Step	Output	Description
Input Parse	Protein object	Loaded protein with sequence
UniProt Query	Protein + PDB refs	Sequence and cross-references
MSA	.a3m file	Multiple sequence alignment results
FoldSeek	.m8 file	Similar structures with scores
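
The two result files in the table are plain text and can be read without extra dependencies. The following is a minimal sketch under two assumptions: the `.m8` file uses FoldSeek's default 12-column tab-separated layout (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits), and the `.a3m` file follows the usual convention where lowercase letters are insertions and `-` marks a gap. If the search was run with a custom output format, the column indices will differ. `Hit`, `parse_m8`, `parse_a3m`, and `ungap` are hypothetical helper names, not part of OpenBioMed.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    identity: float
    evalue: float
    bits: float


def parse_m8(text):
    """Parse default-format .m8 text into hits sorted by bit score (best first)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # skip lines that do not match the assumed 12-column layout
        hits.append(Hit(cols[0], cols[1], float(cols[2]),
                        float(cols[10]), float(cols[11])))
    return sorted(hits, key=lambda h: h.bits, reverse=True)


def parse_a3m(text):
    """Parse .a3m text into a list of (header, aligned_sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # some tools write a leading '#' metadata line
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def ungap(aligned):
    """Recover the raw residues of an aligned row: drop gaps, uppercase insertions."""
    return aligned.replace("-", "").replace(".", "").upper()
```

For display, the first few entries of `parse_m8(...)` give the closest structural matches, and `ungap` turns each `.a3m` row back into a plain sequence that can be passed on to other tools.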
