Name: Biopython: Computational Molecular Biology Toolkit
Author: jaechang-hits

Overview

Biopython is the standard open-source Python library for computational molecular biology, providing modular APIs for sequence handling, biological file parsing, NCBI database access, BLAST searches, protein structure analysis, and phylogenetics. It supports Python 3 and requires NumPy.

When to Use

Parse and convert biological file formats (FASTA, GenBank, FASTQ, PDB, mmCIF, PHYLIP)
Fetch sequences or publications from NCBI databases (GenBank, PubMed, Protein) programmatically
Run and parse BLAST searches (remote NCBI or local BLAST+)
Perform pairwise or multiple sequence alignments with custom scoring
Analyze 3D protein structures — distances, angles, DSSP, superimposition
Build and visualize phylogenetic trees from sequence alignments
Calculate sequence statistics (GC content, molecular weight, melting temperature)
Batch-process thousands of sequences with custom filtering logic
Use pysam instead for reading SAM/BAM/CRAM alignment files and working with mapped reads; use scikit-bio instead for advanced ecological diversity metrics
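
As a minimal sketch of the parsing, statistics, and filtering tasks listed above, the example below reads FASTA records from an in-memory string and keeps those at or above a GC threshold. The toy sequences, the `gc_rich_records` helper, and the `min_gc` value are illustrative assumptions, and `gc_fraction` assumes Biopython 1.80 or later.

```python
from io import StringIO

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

# Hypothetical two-record FASTA, held in memory so the sketch runs without files.
FASTA_TEXT = """>seq1 toy record
ATGCGCGCTTAA
>seq2 toy record
ATATATATATAT
"""


def gc_rich_records(handle, min_gc=0.4):
    """Yield (id, length, GC fraction) for records at or above min_gc."""
    for record in SeqIO.parse(handle, "fasta"):
        gc = gc_fraction(record.seq)
        if gc >= min_gc:
            yield record.id, len(record.seq), gc


kept = list(gc_rich_records(StringIO(FASTA_TEXT)))
for record_id, length, gc in kept:
    print(f"{record_id}\t{length}\t{gc:.2f}")
```

Because `SeqIO.parse()` yields one record at a time, the same generator pattern should scale to large files: pass an open file handle or a filename in place of the `StringIO` object.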

Key Parameters

Parameter	Module	Default	Range / Options	Effect
`Seq.translate(table=)`	Bio.Seq	`1` (Standard)	NCBI table IDs `1`-`33` (not every ID in this range is assigned) or a table name	NCBI genetic code table for translation
`Seq.translate(to_stop=)`	Bio.Seq	`False`	`True`, `False`	Stop at first stop codon vs translate entire sequence
`SeqIO.parse(format=)`	Bio.SeqIO	(required)	`"fasta"`, `"genbank"`, `"fastq"`, `"phylip"`, among others	File format for parsing
`Entrez.email`	Bio.Entrez	(required)	Valid email	NCBI requires email for tracking; set before any Entrez call
`Entrez.api_key`	Bio.Entrez	`None`	NCBI API key string	Increases rate limit from 3 to 10 requests/second
`PairwiseAligner.mode`	Bio.Align	`"global"`	`"global"`, `"local"`	Needleman-Wunsch vs Smith-Waterman algorithm
`PairwiseAligner.open_gap_score`	Bio.Align	`-1`	Any number; typically `-20` to `0`	Penalty for opening a gap; more negative = fewer gaps
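`PairwiseAligner.substitution_matrix`	Bio.Align	`None`	Matrix object from `substitution_matrices.load("BLOSUM62")` (also `"BLOSUM45"`, `"PAM250"`)	Scoring matrix for protein alignment; assign the loaded matrix, not the name string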
`PDBParser(QUIET=)`	Bio.PDB	`False`	`True`, `False`	Suppress parser warnings for non-standard PDB files
`DistanceCalculator(model=)`	Bio.Phylo.TreeConstruction	`"identity"`	`"identity"`, `"blosum62"`, among others	Distance model for tree construction
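
To make the translation and alignment rows above concrete, here is a hedged sketch of how `table`, `to_stop`, `mode`, `open_gap_score`, and `substitution_matrix` might be set together. The DNA and protein sequences and the specific gap scores are illustrative assumptions, not recommended values.

```python
from Bio.Align import PairwiseAligner, substitution_matrices
from Bio.Seq import Seq

# Hypothetical 15-nt coding sequence: ATG GCC TGA AAA TAA
coding = Seq("ATGGCCTGAAAATAA")

# Standard code (table=1): TGA and TAA are both stop codons.
standard_full = coding.translate()
standard_to_stop = coding.translate(to_stop=True)

# Vertebrate mitochondrial code (table=2): TGA encodes tryptophan.
mito_to_stop = coding.translate(table=2, to_stop=True)

print(standard_full)     # MA*K*
print(standard_to_stop)  # MA
print(mito_to_stop)      # MAWK

# Local (Smith-Waterman) protein alignment with a loaded BLOSUM62 matrix.
aligner = PairwiseAligner()
aligner.mode = "local"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10      # assumed value for illustration
aligner.extend_gap_score = -0.5   # assumed value for illustration

alignments = aligner.align("HEAGAWGHEE", "PAWHEAE")
best = alignments[0]
print(best.score)
print(best)
```

The matrix is loaded through `substitution_matrices.load()` and the resulting object is assigned to the aligner, which is why the table lists matrix names only as arguments to that loader.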

Troubleshooting

Problem	Cause	Solution
`HTTPError 400` from Entrez	Invalid accession/ID or malformed query	Validate accessions; check query syntax with NCBI web interface first
`HTTPError 429` (Too Many Requests)	Exceeding NCBI rate limit (3 req/s)	Set `Entrez.api_key` for 10 req/s; add `time.sleep(0.4)` between calls
`ValueError: No records found in handle` from `SeqIO.read()`	Empty file or wrong format string	Use `SeqIO.parse()` to check whether the file has records; verify that the format string matches the actual content
`Bio.PDB.PDBExceptions.PDBConstructionWarning`	Non-standard atoms or occupancy issues	Use `PDBParser(QUIET=True)` or fix PDB with `pdb-tools`; check for alternate conformations
BLAST search times out	Large query or busy NCBI servers	Use local BLAST+ for large-scale searches; set `NCBIWWW.qblast(hitlist_size=N)` to limit results
`ValueError: Sequences must all be the same length`	Unaligned sequences passed to AlignIO	Align sequences first with MUSCLE/Clustal before loading as alignment
`SeqIO.index()` raises `ValueError`	Duplicate IDs in FASTA file	Rename or remove duplicate records before indexing (for example with `awk`); `SeqIO.to_dict()` raises the same error on duplicate keys unless `key_function` produces unique ones
`ModuleNotFoundError: No module named 'Bio'`	Biopython not installed in active environment	`pip install biopython`; verify with `python -c "import Bio; print(Bio.__version__)"`
`translate()` gives unexpected `*`	Stop codons in middle of sequence	Check reading frame; use `Seq.translate(table=N)` with correct genetic code
GenomeDiagram blank output	No features matched filter criteria	Check `feature.type` values in your GenBank file; print types to debug
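
The duplicate-ID row above can be illustrated with a short sketch. It assumes an in-memory toy FASTA and a hypothetical `make_unique_key` helper that appends a counter to repeated IDs; `SeqIO.to_dict()` calls its `key_function` once per record, so the helper can keep state between calls.

```python
from collections import Counter
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA in which the ID "dup" appears twice.
FASTA_TEXT = """>dup
ACGT
>dup
GGCC
>unique
TTAA
"""


def make_unique_key():
    """Return a key function that suffixes repeated IDs with _2, _3, ..."""
    seen = Counter()

    def key(record):
        seen[record.id] += 1
        count = seen[record.id]
        return record.id if count == 1 else f"{record.id}_{count}"

    return key


# With the default key (record.id), the duplicate raises ValueError.
try:
    SeqIO.to_dict(SeqIO.parse(StringIO(FASTA_TEXT), "fasta"))
    default_error = None
except ValueError as err:
    default_error = str(err)
print("default key:", default_error)

# With a key function that guarantees uniqueness, every record is kept.
records = SeqIO.to_dict(
    SeqIO.parse(StringIO(FASTA_TEXT), "fasta"),
    key_function=make_unique_key(),
)
print(sorted(records))
```

Suffixing keeps every record, which suits exploratory work; if the duplicates are true redundancies, dropping them before indexing is likely the cleaner choice.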

Biopython: Computational Molecular Biology Toolkit

Overview

When to Use

Prerequisites

Quick Start

Core API

Module 1: Sequence Objects (Bio.Seq)

Module 2: Sequence I/O (Bio.SeqIO)

Module 3: NCBI Database Access (Bio.Entrez)

Module 4: BLAST Operations (Bio.Blast)

Module 5: Pairwise Alignment (Bio.Align)

Module 6: Protein Structure Analysis (Bio.PDB)

Module 7: Phylogenetics (Bio.Phylo)

Module 8: Sequence Utilities (Bio.SeqUtils)

Common Workflows

Workflow 1: Gene Sequence Retrieval and Analysis Pipeline

Workflow 2: Comparative Sequence Analysis with BLAST and Phylogeny

Workflow 3: Batch Sequence Processing and Quality Filtering

Key Parameters

Common Recipes

Recipe: Restriction Enzyme Analysis

Recipe: Motif Discovery and Position Weight Matrix

Recipe: GenomeDiagram — Visualize Genomic Features

Recipe: Batch PubMed Literature Search

Troubleshooting

References
