Query and retrieve protein/nucleic acid structures from RCSB PDB. Use when you need to search the PDB database for structures or metadata. Supports text, sequence, and structure-based searches, coordinate downloads, and metadata retrieval for structural biology workflows.
The Protein Data Bank hosts over 200,000 experimentally determined macromolecular structures plus computed models from AlphaFold and ModelArchive. This skill provides programmatic access to search, download, and analyze structural data.
| Task | Examples |
|---|---|
| Structure Retrieval | Download coordinates for a known PDB ID |
| Similarity Search | Find structures similar by sequence or 3D fold |
| Metadata Access | Get resolution, method, organism, ligands |
| Dataset Building | Compile structures for ML training or analysis |
| Drug Discovery | Identify ligand-bound structures for a target |
| Quality Filtering | Select high-resolution, well-refined structures |
pip install rcsb-api requests
The rcsb-api package (v1.5.0+) provides:
rcsbapi.search - Query construction and executionrcsbapi.data - DataQuery for batch retrievalfrom rcsbapi.search import TextQuery, AttributeQuery, SeqSimilarityQuery, StructSimilarityQuery
# Text search
results = list(TextQuery("kinase inhibitor")())
# Filter by organism (use string attribute paths)
human = AttributeQuery(
attribute="rcsb_entity_source_organism.scientific_name",
operator="exact_match",
value="Homo sapiens"
)
# Filter by resolution
high_res = AttributeQuery(
attribute="rcsb_entry_info.resolution_combined",
operator="less",
value=2.0
)
# Filter by experimental method
xray = AttributeQuery(
attribute="exptl.method",
operator="exact_match",
value="X-RAY DIFFRACTION"
)
# Combine queries: & (AND), | (OR), ~ (NOT)
results = list((TextQuery("kinase") & human & high_res)())
# Sequence similarity (MMseqs2) - minimum 25 residues required
seq_query = SeqSimilarityQuery(
value="VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
evalue_cutoff=1e-5,
identity_cutoff=0.7
)
# Structure similarity (3D fold)
struct_query = StructSimilarityQuery(
structure_search_type="entry_id",
entry_id="4HHB"
)
The rcsb-api package's data module has limited functionality. Use the REST API directly for metadata:
import requests
def fetch_entry(pdb_id: str) -> dict:
"""Fetch entry metadata from RCSB REST API."""
resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}")
resp.raise_for_status()
return resp.json()
# Example usage
data = fetch_entry("4HHB")
print(data["struct"]["title"])
print(data["rcsb_entry_info"]["resolution_combined"])
# Polymer entity (chain info + sequence)
def fetch_polymer_entity(pdb_id: str, entity_id: int = 1) -> dict:
resp = requests.get(f"https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}")
resp.raise_for_status()
return resp.json()
entity = fetch_polymer_entity("4HHB", 1)
sequence = entity["entity_poly"]["pdbx_seq_one_letter_code"]
| Format | URL Pattern |
|---|---|
| PDB | https://files.rcsb.org/download/{ID}.pdb |
| mmCIF | https://files.rcsb.org/download/{ID}.cif |
| Assembly | https://files.rcsb.org/download/{ID}.pdb1 |
| FASTA | https://www.rcsb.org/fasta/entry/{ID} |
import requests
from pathlib import Path
def download_structure(pdb_id: str, fmt: str = "cif", outdir: str = ".") -> Path:
url = f"https://files.rcsb.org/download/{pdb_id}.{fmt}"
resp = requests.get(url)
resp.raise_for_status()
outpath = Path(outdir) / f"{pdb_id}.{fmt}"
outpath.write_text(resp.text)
return outpath
from rcsbapi.search import TextQuery, AttributeQuery
query = (
TextQuery("receptor") &
AttributeQuery(
attribute="rcsb_entity_source_organism.scientific_name",
operator="exact_match",
value="Homo sapiens"
) &
AttributeQuery(
attribute="rcsb_entry_info.resolution_combined",
operator="less",
value=2.5
) &
AttributeQuery(
attribute="exptl.method",
operator="exact_match",
value="X-RAY DIFFRACTION"
)
)
results = list(query())
import requests
import time
def fetch_batch(pdb_ids: list, delay: float = 0.3) -> dict:
"""Fetch metadata with rate limiting."""
results = {}
for pdb_id in pdb_ids:
try:
resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}")
resp.raise_for_status()
data = resp.json()
results[pdb_id] = {
"title": data["struct"]["title"],
"resolution": data.get("rcsb_entry_info", {}).get("resolution_combined"),
"method": data.get("exptl", [{}])[0].get("method"),
}
except Exception as e:
results[pdb_id] = {"error": str(e)}
time.sleep(delay)
return results
from rcsbapi.search import AttributeQuery
# Find structures containing imatinib (ligand ID: STI)
query = AttributeQuery(
attribute="rcsb_nonpolymer_entity_instance_container_identifiers.comp_id",
operator="exact_match",
value="STI"
)
drug_complexes = list(query())
import requests
query = """
{
entry(entry_id: "4HHB") {
struct { title }
rcsb_entry_info {
resolution_combined
deposited_atom_count
}
polymer_entities {
rcsb_polymer_entity { pdbx_description }
entity_poly { pdbx_seq_one_letter_code }
}
}
}
"""
response = requests.post(
"https://data.rcsb.org/graphql",
json={"query": query}
)
result = response.json()["data"]["entry"]
| Term | Definition |
|---|---|
| PDB ID | 4-character alphanumeric code (e.g., "4HHB"). AlphaFold uses "AF_" prefix |
| Entity | Distinct molecular species. A homodimer has one entity appearing twice |
| Resolution | Quality metric in angstroms. Lower is better; <2.0 Å is high quality |
| Biological Assembly | Functional oligomeric state (may differ from asymmetric unit) |
| mmCIF | Modern format replacing legacy PDB; required for large structures |
Use these string paths with AttributeQuery:
| Attribute | Description |
|---|---|
rcsb_entity_source_organism.scientific_name | Source organism (e.g., "Homo sapiens") |
rcsb_entry_info.resolution_combined | Resolution in angstroms |
exptl.method | Experimental method (X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR) |
rcsb_nonpolymer_entity_instance_container_identifiers.comp_id | Ligand/small molecule ID |
struct.title | Structure title |
rcsb_accession_info.deposit_date | Deposition date |
| Practice | Rationale |
|---|---|
| Use mmCIF format | PDB format has atom count limits |
| Filter by resolution | <2.5 Å for most analyses; <2.0 Å for detailed work |
| Check experimental method | X-ray vs cryo-EM vs NMR have different quality metrics |
| Rate limit requests | 2-3 req/s to avoid 429 errors |
| Cache downloads | Structures rarely change after release |
| Prefer GraphQL | Reduces requests for complex data needs |
| Issue | Resolution |
|---|---|
| 404 on entry fetch | Entry may be obsoleted; check RCSB website for superseding ID |
| 429 Too Many Requests | Implement exponential backoff; reduce request rate |
| Empty search results | Check query syntax; use query.to_dict() to debug |
| Large structure fails | Use mmCIF format instead of PDB |
| Missing sequence data | Query polymer entity endpoint, not entry |
See references/api-reference.md for: