BLAST: Sequence Similarity Searching

Complicated moments explained

Choosing the right BLAST program: The five programs differ by query type and database type. blastn: nucleotide vs nucleotide. blastp: protein vs protein. blastx: nucleotide query (translated) vs protein database — use when you have a novel nucleotide sequence and want to find protein homologs. tblastn: protein query vs nucleotide database (translated) — use when you want to find unannotated protein-coding genes. tblastx: nucleotide query (translated) vs nucleotide database (translated) — slowest, rarely needed.
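E-value depends on database size: The E-value scales linearly with the size of the search space, so the same alignment gets a much lower (better) E-value against a small database than against nr. A hit with E=1e-5 in SwissProt therefore reflects a lower-scoring alignment than a hit with E=1e-5 in nr, and the two are not directly comparable. Always report which database you searched.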
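The dependence on database size can be made concrete with the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The sketch below is a simplification: it uses raw lengths rather than the effective lengths real BLAST computes, and the two database sizes are hypothetical round numbers, not the actual sizes of SwissProt or nr.

```python
import math


def evalue_from_bit_score(bit_score, query_len, db_len):
    """Approximate E-value: E = m * n * 2**(-S').

    Simplification: uses raw lengths, not BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def bit_score_for_evalue(evalue, query_len, db_len):
    """Bit score needed to reach a given E-value (inverse of the above)."""
    return math.log2(query_len * db_len / evalue)


# Hypothetical search spaces (total residues); NOT real database sizes.
SMALL_DB_LEN = 2e8
LARGE_DB_LEN = 2e11

query_len = 300
bit_score = 60.0

e_small = evalue_from_bit_score(bit_score, query_len, SMALL_DB_LEN)
e_large = evalue_from_bit_score(bit_score, query_len, LARGE_DB_LEN)
print(f"Same alignment, small database: E = {e_small:.2e}")
print(f"Same alignment, large database: E = {e_large:.2e}")

# Bit score required for E = 1e-5 in each search space.
s_small = bit_score_for_evalue(1e-5, query_len, SMALL_DB_LEN)
s_large = bit_score_for_evalue(1e-5, query_len, LARGE_DB_LEN)
print(f"Bits needed for E=1e-5: small = {s_small:.1f}, large = {s_large:.1f}")
```

Under these assumptions a 1000-fold larger database raises the E-value of the same alignment 1000-fold, which corresponds to roughly 10 extra bits of score to reach the same E-value.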
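E-value is not a p-value: The E-value is the expected number of hits with at least that score arising by chance in a search of this size. E=0.01 means you expect 0.01 such random hits per search. It is a count, not a probability, and it can exceed 1 (the default reporting cutoff is 10). The probability of at least one chance hit is P = 1 - e^(-E), which is close to E only when E is small. For database searches, E < 1e-5 is a typical significance threshold, but always consider alignment length and percent identity as well.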
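The two quantities are related but distinct, and the relationship can be written down directly. Assuming chance hits follow a Poisson distribution, as in the standard BLAST statistics, the probability of at least one hit at a given E-value is P = 1 - e^(-E). The short sketch below (the function name is illustrative) shows that P stays just below E for small values and saturates near 1 for large ones.

```python
import math


def evalue_to_pvalue(evalue):
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed chance hits."""
    # expm1 keeps precision for very small E, where 1 - exp(-E) would round badly.
    return -math.expm1(-evalue)


for e in (1e-5, 0.01, 1.0, 10.0):
    print(f"E = {e:<8g} P = {evalue_to_pvalue(e):.6f}")
```

An E-value of 10 is a perfectly valid expected count, while the corresponding probability can never exceed 1.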

    # Demonstration: word matching and neighborhood concept
    from Bio.Align import substitution_matrices

    blosum62 = substitution_matrices.load("BLOSUM62")


    def word_neighborhood(query_word, matrix, threshold=11):
        """
        Find all 3-letter amino acid words whose BLOSUM62 score against
        query_word is at least the threshold T.
        This is what blastp does during the seeding phase.
        """
        amino_acids = "ACDEFGHIKLMNPQRSTVWY"
        neighbors = []
        for a1 in amino_acids:
            for a2 in amino_acids:
                for a3 in amino_acids:
                    word = a1 + a2 + a3
                    score = (matrix[query_word[0], word[0]] +
                             matrix[query_word[1], word[1]] +
                             matrix[query_word[2], word[2]])
                    if score >= threshold:
                        neighbors.append((word, score))
        # Highest-scoring neighbors first
        return sorted(neighbors, key=lambda x: -x[1])


    def xdrop_extend(seq1, seq2, match=2, mismatch=-1, x_drop=5):
        """
        Ungapped X-drop extension, rightward from position 0 of both sequences.
        Returns the score of the highest-scoring ungapped alignment found.
        In real BLAST, this starts from a seed position and extends
        in both directions.
        """
        best_score = 0
        score = 0
        for i in range(min(len(seq1), len(seq2))):
            score += match if seq1[i] == seq2[i] else mismatch
            if score > best_score:
                best_score = score
            if best_score - score > x_drop:
                break  # fell more than X below the best score seen
        return best_score


    # Example 1: show word neighborhoods for LEW
    print("=== BLAST Word Neighborhood (threshold T=11) ===")
    query_word = "LEW"
    neighbors = word_neighborhood(query_word, blosum62, threshold=11)
    print(f"Query word: {query_word}")
    print(f"Neighborhood size: {len(neighbors)} words")
    print("Top 10 neighbors:")
    for word, score in neighbors[:10]:
        print(f"  {word}: score = {score}")

    # Example 2: compare neighborhood sizes for different thresholds
    print()
    print("=== Sensitivity vs Speed: Word Neighborhood Size vs Threshold ===")
    print(f"{'Threshold T':>15} {'Neighborhood size':>20} {'Relative sensitivity'}")
    print("-" * 55)
    baseline = len(word_neighborhood(query_word, blosum62, threshold=7))
    for T in [7, 9, 11, 13, 15]:
        n = len(word_neighborhood(query_word, blosum62, threshold=T))
        rel = n / baseline * 100  # neighborhood size relative to T=7
        print(f"{T:>15} {n:>20} {rel:>18.1f}%")
    print()
    print("Higher T = smaller neighborhood = faster but less sensitive BLAST.")

    # Example 3: X-drop extension demonstration
    print()
    print("=== X-drop Extension Demonstration ===")
    seq_ref = "MVLSPADKTNVKAAWGKVGAHAG"
    seq_hit = "MVLSGEDKSNIKAAWGKIGGHGAE"
    score = xdrop_extend(seq_ref, seq_hit, match=2, mismatch=-1, x_drop=5)
    print(f"Query:    {seq_ref}")
    print(f"Database: {seq_hit}")
    print(f"Ungapped X-drop score (X=5): {score}")
    print()
    print("The X-drop extension terminates when the running score drops more")
    print("than X below the best score seen, avoiding wasted computation")
    print("on sequences that diverge after a good seed.")
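The xdrop_extend demonstration extends in one direction only, starting at position 0. As its docstring notes, real BLAST extends from a seed in both directions. The following is a minimal sketch of that idea, reusing the same toy match/mismatch scoring rather than BLOSUM62. The helper names extend_one_direction and extend_seed are illustrative, and the seed is simply located with str.find instead of a word lookup table.

```python
def extend_one_direction(query, subject, q_start, s_start, step,
                         match=2, mismatch=-1, x_drop=5):
    """Ungapped X-drop extension starting at (q_start, s_start).

    step is +1 to extend rightward or -1 to extend leftward.
    Returns (best_score, best_length) of the best-scoring extension.
    """
    best_score = score = 0
    best_len = length = 0
    i, j = q_start, s_start
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        elif best_score - score > x_drop:
            break  # fell more than X below the best score: stop
        i += step
        j += step
    return best_score, best_len


def extend_seed(query, subject, q_pos, s_pos, word_size=3,
                match=2, mismatch=-1, x_drop=5):
    """Extend a seed of length word_size in both directions.

    Returns (total_score, q_begin, q_end); the aligned query region
    is the half-open interval query[q_begin:q_end].
    """
    seed_score = sum(
        match if query[q_pos + k] == subject[s_pos + k] else mismatch
        for k in range(word_size)
    )
    right_score, right_len = extend_one_direction(
        query, subject, q_pos + word_size, s_pos + word_size, +1,
        match, mismatch, x_drop)
    left_score, left_len = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1,
        match, mismatch, x_drop)
    total = seed_score + left_score + right_score
    return total, q_pos - left_len, q_pos + word_size + right_len


seq_ref = "MVLSPADKTNVKAAWGKVGAHAG"
seq_hit = "MVLSGEDKSNIKAAWGKIGGHGAE"

# Use the shared word "AAW" as the seed (located naively for this demo).
score, q_begin, q_end = extend_seed(
    seq_ref, seq_hit, seq_ref.find("AAW"), seq_hit.find("AAW"))
print(f"Seed-and-extend score: {score}")
print(f"Aligned query region:  {seq_ref[q_begin:q_end]}")
```

Each direction keeps its own best score, so a poor stretch on one side of the seed does not discard a good extension on the other.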

Summary of Key Parameters

Parameter	Default	Effect
Word size ($W$)	3 (protein), 11 (DNA)	Larger → faster, less sensitive
Word threshold ($T$)	11 (blastp)	Higher → faster, less sensitive
Ungapped X-drop ($X_u$)	20	Higher → extends seeds further
Gapped X-drop ($X_g$)	50	Higher → final gapped extension explores further
E-value cutoff	10	Higher → more hits reported
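For readers using the NCBI BLAST+ command-line tools, the parameters in this table correspond to command-line options. The sketch below only builds an argument list and runs nothing. The file and database names are placeholders and the helper name build_blastp_command is hypothetical. The option names are those of the BLAST+ blastp application, but the defaults of an installed version may differ from the table, so it is worth checking `blastp -help`.

```python
def build_blastp_command(query, db, out, word_size=3, threshold=11,
                         evalue=10.0, xdrop_ungap=None, xdrop_gap_final=None):
    """Build a blastp argument list (hypothetical helper; nothing is executed)."""
    cmd = [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-word_size", str(word_size),   # W: seed word length
        "-threshold", str(threshold),   # T: neighborhood word score threshold
        "-evalue", str(evalue),         # E-value reporting cutoff
    ]
    # X-drop values are left at the program defaults unless given explicitly.
    if xdrop_ungap is not None:
        cmd += ["-xdrop_ungap", str(xdrop_ungap)]
    if xdrop_gap_final is not None:
        cmd += ["-xdrop_gap_final", str(xdrop_gap_final)]
    return cmd


# Placeholder file and database names, for illustration only.
cmd = build_blastp_command("query.fasta", "my_protein_db", "hits.txt",
                           evalue=1e-5)
print(" ".join(cmd))
```

With BLAST+ installed, the list could be passed to `subprocess.run(cmd, check=True)`.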

BLAST Program Variants

Program	Query	Database	Translation	Use Case
blastn	Nucleotide	Nucleotide	None	Gene finding, species ID, primer design
blastp	Protein	Protein	None	Protein homology, function prediction
blastx	Nucleotide	Protein	Query in 6 frames	EST annotation, finding coding regions
tblastn	Protein	Nucleotide	DB in 6 frames	Finding unannotated genes in genomes
tblastx	Nucleotide	Nucleotide	Both in 6 frames	Comparing unannotated genomes
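The table above can be read as a lookup from query type and database type to a program name. The following sketch encodes exactly that mapping. The function name choose_blast_program and the compare_as_protein flag are illustrative, not part of any BLAST interface.

```python
def choose_blast_program(query_type, db_type, compare_as_protein=False):
    """Pick a BLAST program from the query and database sequence types.

    query_type, db_type: "nucleotide" or "protein".
    compare_as_protein: only relevant for nucleotide vs nucleotide;
    if True, both sides are translated (tblastx) instead of using blastn.
    """
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",    # query translated in 6 frames
        ("protein", "nucleotide"): "tblastn",   # database translated in 6 frames
    }
    key = (query_type, db_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if compare_as_protein else "blastn"
    if key not in table:
        raise ValueError(f"Unknown sequence types: {key}")
    return table[key]


print(choose_blast_program("nucleotide", "protein"))   # novel transcript vs proteins
print(choose_blast_program("protein", "nucleotide"))   # known protein vs genome
```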

Nucleotide BLAST Sub-variants

Variant	Word size	Best for
megablast	28	Highly similar sequences (>95% identity), same species
blastn	11	Somewhat similar sequences, cross-species
discontiguous megablast	11-12	More dissimilar sequences (~80% identity)
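Word size matters because a contiguous seed needs an uninterrupted run of at least W identical bases. The toy sketch below measures the longest exact run between a reference and variants with evenly spaced substitutions, then checks it against the megablast and blastn word sizes from the table. Evenly spaced mismatches are a deliberately artificial worst case, and the helper names are illustrative. Discontiguous megablast is left out because it uses spaced templates rather than contiguous words.

```python
WORD_SIZES = {"megablast": 28, "blastn": 11}  # contiguous word sizes from the table


def longest_exact_run(a, b):
    """Length of the longest run of identical positions in an ungapped pairing."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def mutate_every(seq, period):
    """Substitute every period-th base (toy model: evenly spaced mismatches)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(
        swap[base] if (i + 1) % period == 0 else base
        for i, base in enumerate(seq)
    )


reference = "ACGT" * 30  # 120 nt toy sequence

for period in (30, 12, 8):
    variant = mutate_every(reference, period)
    identity = sum(x == y for x, y in zip(reference, variant)) / len(reference)
    run = longest_exact_run(reference, variant)
    seeded = [name for name, w in WORD_SIZES.items() if run >= w]
    print(f"identity {identity:.1%}, longest exact run {run}, "
          f"contiguous seed possible for: {seeded or 'none'}")
```

In real sequences substitutions tend to cluster, so long exact runs survive at lower overall identity than this worst case suggests.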
