Query the PubChem PUG-REST API for chemical compound data including molecular properties, structures, synonyms, safety/hazard data, bioassay results, and cross-references. Trigger for any mention of PubChem, CID, chemical compound lookup, molecular formula, molecular weight, SMILES, InChI, InChIKey, chemical structure, compound properties, chemical safety, GHS hazards, bioassay, compound similarity, substructure search, CAS number lookup, XLogP, TPSA, hydrogen bond donors/acceptors, drug-likeness, Lipinski rules, or toxicology data. Also trigger when the user needs to identify a compound by name or structure, compare molecular properties, find structurally similar compounds, look up safety data for a chemical, or cross-reference compound identifiers across databases. This skill serves toxicologists, chemists, reviewers, and anyone evaluating compound safety profiles or chemical data at FDA.
The PubChem PUG-REST API (https://pubchem.ncbi.nlm.nih.gov/rest/pug/) provides free, no-auth access to PubChem's database of 100M+ chemical compounds, 300M+ substance records, and 1M+ bioassay records. PubChem is the world's largest open-access chemical database, maintained by NCBI at the National Library of Medicine.
Base URL (PUG-REST): https://pubchem.ncbi.nlm.nih.gov/rest/pug/
Base URL (PUG-View): https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/
API Docs: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
Web UI: https://pubchem.ncbi.nlm.nih.gov
No API key required. Rate limit: 5 requests/second, 400 requests/minute. Add time.sleep(0.25) between calls.
What this data is: Molecular structures, computed and deposited properties (molecular weight, formula, LogP, TPSA, hydrogen bonding), synonyms (including CAS numbers), safety/hazard data (GHS classification), bioassay results, and cross-references to other databases. Computed properties use standardized algorithms applied to every compound.
What this data is NOT: Not clinical trial data (use ClinicalTrials.gov). Not drug labeling (use DailyMed). Not literature citations (use PubMed). PubChem is chemical structure and property data, not clinical or regulatory content.
```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def pubchem_get(url):
    """GET a PubChem URL; return parsed JSON or an {"error", "message"} dict."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=20) as resp:
            data = json.loads(resp.read().decode())
    except urllib.error.HTTPError as e:
        body = e.read().decode(errors="replace")
        try:
            # PubChem error bodies carry a Fault object with Code and Message.
            fault = json.loads(body).get("Fault", {})
            return {"error": fault.get("Code", ""), "message": fault.get("Message", "")}
        except (ValueError, AttributeError):
            # Body was not a JSON object; fall back to the HTTP status code.
            return {"error": str(e.code), "message": body[:200]}
    # Some errors arrive with HTTP 200 and a Fault object in the body.
    if isinstance(data, dict) and "Fault" in data:
        fault = data["Fault"]
        return {"error": fault.get("Code", ""), "message": fault.get("Message", "")}
    return data


def compound_properties(identifier, id_type="name", properties=None):
    """Get computed properties for a compound.

    id_type: 'name', 'cid', 'smiles', 'inchikey', 'formula'
    properties: comma-separated property names (see Properties Reference)
    """
    if properties is None:
        properties = ("MolecularFormula,MolecularWeight,IUPACName,CanonicalSMILES,"
                      "InChIKey,XLogP,TPSA,HBondDonorCount,HBondAcceptorCount,"
                      "RotatableBondCount,HeavyAtomCount,Complexity,Charge,ExactMass")
    # Formula lookups go through the fastformula namespace.
    namespace = "fastformula" if id_type == "formula" else id_type
    encoded = urllib.parse.quote(str(identifier))
    url = f"{PUG}/compound/{namespace}/{encoded}/property/{properties}/JSON"
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("PropertyTable", {}).get("Properties", [])
```
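PUG-REST URLs follow a consistent pattern of three parts: input, operation, and output. The input is itself made up of domain, namespace, and identifiers: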
/rest/pug/{domain}/{namespace}/{identifiers}/{operation}/{output}
| Part | Values | Example |
|---|---|---|
| domain | compound, substance, assay | compound |
| namespace | cid, name, smiles, inchikey, fastformula, fastsimilarity_2d, fastsubstructure | name |
| identifiers | The actual value(s) | aspirin or 2244 |
| operation | property, synonyms, sids, xrefs, description, classification, assaysummary, PNG, JSON | property/MolecularWeight |
| output | JSON, XML, CSV, TXT, SDF, PNG, SVG | JSON |
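As a rough illustration of the pattern above, the sketch below assembles a request URL from its parts. `build_pug_url` is a hypothetical helper written for this example, not part of any PubChem client; it only joins the segments and percent-encodes the identifiers.

```python
from urllib.parse import quote

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_url(domain, namespace, identifiers, operation, output="JSON"):
    """Join the URL parts from the table above into one request URL.

    Hypothetical helper for illustration. `identifiers` may be a single
    value or a list; list items are joined with commas.
    """
    if isinstance(identifiers, (list, tuple)):
        ids = ",".join(quote(str(i)) for i in identifiers)
    else:
        ids = quote(str(identifiers))
    return f"{PUG}/{domain}/{namespace}/{ids}/{operation}/{output}"


print(build_pug_url("compound", "name", "aspirin", "property/MolecularWeight"))
print(build_pug_url("compound", "cid", [2244, 3672], "synonyms"))
```

The operation segment carries its own arguments (for example `property/MolecularWeight`), which is why it is passed through unencoded here.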
Pattern: GET /compound/{namespace}/{id}/property/{properties}/JSON
Retrieve computed molecular properties. Multiple properties are comma-separated.
Namespaces for compound lookup:
| Namespace | Example | Description |
|---|---|---|
| name | /compound/name/aspirin/... | Search by drug/chemical name |
| cid | /compound/cid/2244/... | Direct CID lookup |
| smiles | /compound/smiles/{encoded}/... | Search by SMILES string (URL-encode) |
| inchikey | /compound/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/... | Search by InChIKey |
| fastformula | /compound/fastformula/C9H8O4/... | Search by molecular formula |
Available computed properties:
| Property | Description | Example |
|---|---|---|
| MolecularFormula | Molecular formula | C9H8O4 |
| MolecularWeight | Molecular weight (g/mol) | 180.16 |
| IUPACName | IUPAC systematic name | 2-acetyloxybenzoic acid |
| CanonicalSMILES | Canonical SMILES string (response key: ConnectivitySMILES) | CC(=O)OC1=CC=CC=C1C(=O)O |
| IsomericSMILES | SMILES with stereochemistry (response key: SMILES) | |
| InChI | IUPAC International Chemical Identifier | |
| InChIKey | Hashed InChI (27-char) | BSYNRYMUTXBXSQ-UHFFFAOYSA-N |
| XLogP | Predicted octanol-water partition coefficient | 1.2 |
| TPSA | Topological polar surface area (Å²) | 63.6 |
| ExactMass | Monoisotopic exact mass | 180.04226 |
| HBondDonorCount | Hydrogen bond donors | 1 |
| HBondAcceptorCount | Hydrogen bond acceptors | 4 |
| RotatableBondCount | Rotatable bonds | 3 |
| HeavyAtomCount | Non-hydrogen atoms | 13 |
| Complexity | Bertz complexity index | 212 |
| Charge | Formal charge | 0 |
| Volume3D | 3D molecular volume | 136 |
Multiple CIDs: Separate with commas: /compound/cid/2244,3672,5988/property/.../JSON
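For longer CID lists, one possible approach is to split them into batches and issue one comma-separated request per batch. The sketch below only builds the URLs; `batched_property_urls` is a hypothetical helper, and the default `batch_size=100` is an arbitrary illustrative value, not a documented PubChem limit.

```python
PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def batched_property_urls(cids, properties, batch_size=100):
    """Yield one property URL per batch of CIDs (comma-separated in the path).

    batch_size is an assumption chosen for illustration.
    """
    cids = [str(int(c)) for c in cids]  # int() rejects non-numeric input early
    for start in range(0, len(cids), batch_size):
        chunk = ",".join(cids[start:start + batch_size])
        yield f"{PUG}/compound/cid/{chunk}/property/{properties}/JSON"


urls = list(batched_property_urls([2244, 3672, 5988], "MolecularWeight,XLogP", batch_size=2))
for u in urls:
    print(u)
```

Each yielded URL still counts as one request against the rate limit, so a pause between batches remains advisable.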
Pattern: GET /compound/{namespace}/{id}/synonyms/JSON
Returns all known names for a compound, including CAS numbers, brand names, IUPAC names, and registry IDs.
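```python
def compound_synonyms(identifier, id_type="name"):
    """Return the CID and synonym list for a compound."""
    url = f"{PUG}/compound/{id_type}/{urllib.parse.quote(str(identifier))}/synonyms/JSON"
    data = pubchem_get(url)
    if "error" in data:
        return data
    # Guard against an empty Information list before taking the first record.
    infos = data.get("InformationList", {}).get("Information", [])
    info = infos[0] if infos else {}
    return {"cid": info.get("CID"), "synonyms": info.get("Synonym", [])}
```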
Pattern: GET /compound/cid/{cid}/description/JSON
Returns text descriptions from various sources (ChEBI, OEHHA, EPA, etc.).
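```python
def compound_description(cid):
    """Return the non-empty text descriptions for a CID, with their sources."""
    url = f"{PUG}/compound/cid/{cid}/description/JSON"
    data = pubchem_get(url)
    if "error" in data:
        return data
    infos = data.get("InformationList", {}).get("Information", [])
    # Some records carry only a title; keep those with description text.
    return [{"source": d.get("DescriptionSourceName", ""),
             "description": d.get("Description", "")}
            for d in infos if d.get("Description")]
```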
Pattern: GET /compound/fastsimilarity_2d/{namespace}/{id}/property/.../JSON?Threshold={pct}&MaxRecords={n}
Finds structurally similar compounds by 2D Tanimoto fingerprint similarity.
```python
def similarity_search(identifier, id_type="cid", threshold=90, max_records=10, properties=None):
    """Return properties of compounds at or above `threshold` percent 2D similarity."""
    if properties is None:
        properties = "IUPACName,MolecularWeight,MolecularFormula"
    url = (f"{PUG}/compound/fastsimilarity_2d/{id_type}/{urllib.parse.quote(str(identifier))}"
           f"/property/{properties}/JSON?Threshold={threshold}&MaxRecords={max_records}")
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("PropertyTable", {}).get("Properties", [])
```
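For intuition about what the Threshold percentage measures, the sketch below computes a Tanimoto coefficient on two small sets of "on" fingerprint bits. It is a generic illustration of the formula (shared bits divided by bits set in either fingerprint), not PubChem's implementation; the bit positions are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of set-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Made-up bit positions: 3 shared bits out of 5 distinct bits.
score = tanimoto({1, 2, 3, 4}, {2, 3, 4, 5})
print(f"{score:.2f} -> passes Threshold=90? {score * 100 >= 90}")
```

Under this reading, Threshold=90 keeps only compounds whose coefficient against the query is at least 0.90.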
Pattern: GET /compound/fastsubstructure/{namespace}/{id}/cids/JSON?MaxRecords={n}
Finds compounds containing a given substructure.
```python
def substructure_search(smiles, max_records=10):
    """Return CIDs of compounds that contain the given SMILES substructure."""
    url = (f"{PUG}/compound/fastsubstructure/smiles/{urllib.parse.quote(smiles)}"
           f"/cids/JSON?MaxRecords={max_records}")
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("IdentifierList", {}).get("CID", [])
```
Pattern: GET /compound/fastformula/{formula}/property/.../JSON?MaxRecords={n}
Finds all compounds with a given molecular formula.
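```python
def formula_search(formula, max_records=10, properties=None):
    """Return properties for up to `max_records` compounds with this formula."""
    if properties is None:
        properties = "IUPACName,MolecularWeight"
    # Encode the formula like every other identifier placed in the URL path.
    url = (f"{PUG}/compound/fastformula/{urllib.parse.quote(str(formula))}"
           f"/property/{properties}/JSON?MaxRecords={max_records}")
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("PropertyTable", {}).get("Properties", [])
```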
Pattern: GET /compound/cid/{cid}/assaysummary/JSON
Returns all bioassay results for a compound (active/inactive calls across assays).
```python
def assay_summary(cid):
    """Return column names, total row count, and the first five rows."""
    url = f"{PUG}/compound/cid/{cid}/assaysummary/JSON"
    data = pubchem_get(url)
    if "error" in data:
        return data
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    rows = table.get("Row", [])
    return {"columns": columns, "row_count": len(rows),
            "sample_rows": [r.get("Cell", []) for r in rows[:5]]}
```
Pattern: GET /compound/cid/{cid}/xrefs/{xref_type}/JSON
| xref_type | Returns |
|---|---|
| RegistryID | External registry IDs (CAS numbers, etc.) |
| MMDBID | MMDB structure IDs |
| ProteinGI | Protein GI numbers |
| GeneID | NCBI Gene IDs |
| PatentID | Patent IDs |
Pattern: GET /compound/{namespace}/{id}/PNG or /SVG
Returns 2D structure diagram. Use directly as an image URL.
```python
def structure_image_url(cid, fmt="PNG", size=300):
    """Build the URL of a square 2D structure image; no request is made."""
    return f"{PUG}/compound/cid/{cid}/{fmt}?image_size={size}x{size}"
```
Pattern: GET /pug_view/data/compound/{cid}/JSON?heading={heading}
Returns curated data organized by section headings. Useful for safety, pharmacology, and toxicology data that is NOT in computed properties.
```python
def pug_view_section(cid, heading):
    """Return the top-level sections PUG-View reports under `heading`."""
    url = f"{PUG_VIEW}/data/compound/{cid}/JSON?heading={urllib.parse.quote(heading)}"
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("Record", {}).get("Section", [])
```
Key headings:
| Heading | Content |
|---|---|
| Safety and Hazards | GHS classification, hazard statements, precautions |
| Pharmacology and Biochemistry | Pharmacodynamics, MeSH classification, mechanism |
| Toxicity | LD50, LC50, toxicological data |
| Drug and Medication Information | Therapeutic uses, drug classes |
| Chemical and Physical Properties | Computed and experimental properties |
```python
def compound_profile(name):
    """Get a complete profile for a compound by name."""
    props = compound_properties(
        name, id_type="name",
        properties=("MolecularFormula,MolecularWeight,IUPACName,CanonicalSMILES,"
                    "InChIKey,XLogP,TPSA,HBondDonorCount,HBondAcceptorCount,"
                    "RotatableBondCount,Complexity,Charge,ExactMass"))
    if isinstance(props, dict) and "error" in props:
        return props
    if not props:
        return {"error": "No results"}
    p = props[0]
    cid = p.get("CID")
    time.sleep(0.25)  # stay under the rate limit between the two calls
    syns = compound_synonyms(str(cid), id_type="cid")
    # On failure compound_synonyms returns a dict with an "error" key.
    synonym_list = [] if "error" in syns else syns.get("synonyms", [])[:10]
    return {
        "cid": cid,
        "name": p.get("IUPACName"),
        "formula": p.get("MolecularFormula"),
        "molecular_weight": p.get("MolecularWeight"),
        # CanonicalSMILES is returned under the key ConnectivitySMILES.
        "smiles": p.get("ConnectivitySMILES") or p.get("CanonicalSMILES"),
        "inchikey": p.get("InChIKey"),
        "xlogp": p.get("XLogP"),
        "tpsa": p.get("TPSA"),
        "hbd": p.get("HBondDonorCount"),
        "hba": p.get("HBondAcceptorCount"),
        "rotatable_bonds": p.get("RotatableBondCount"),
        "complexity": p.get("Complexity"),
        "synonyms": synonym_list,
        "image_url": structure_image_url(cid) if cid else None,
    }
```
```python
def lipinski_check(name):
    """Check Lipinski's Rule of Five for drug-likeness."""
    props = compound_properties(
        name, id_type="name",
        properties="MolecularWeight,XLogP,HBondDonorCount,HBondAcceptorCount")
    if isinstance(props, dict) and "error" in props:
        return props
    if not props:
        return {"error": "No results"}
    p = props[0]
    # Missing values default to 0, so an absent property counts as passing.
    checks = {
        "MW <= 500": float(p.get("MolecularWeight", 0)) <= 500,
        "XLogP <= 5": (p.get("XLogP") or 0) <= 5,
        "HBD <= 5": (p.get("HBondDonorCount") or 0) <= 5,
        "HBA <= 10": (p.get("HBondAcceptorCount") or 0) <= 10,
    }
    violations = sum(1 for passed in checks.values() if not passed)
    # At most one violation is treated as drug-like.
    return {"compound": p.get("CID"), "checks": checks,
            "violations": violations, "drug_like": violations <= 1}
```
```python
def cas_to_compound(cas_number):
    """Look up a compound by CAS registry number."""
    # CAS numbers resolve through the name namespace; no special endpoint exists.
    return compound_properties(
        cas_number, id_type="name",
        properties="IUPACName,MolecularFormula,MolecularWeight,CanonicalSMILES,InChIKey")
```
```python
def compare_compounds(names):
    """Compare properties across multiple compounds."""
    all_props = []
    for name in names:
        props = compound_properties(
            name, id_type="name",
            properties=("MolecularFormula,MolecularWeight,XLogP,TPSA,"
                        "HBondDonorCount,HBondAcceptorCount,RotatableBondCount"))
        # Names that fail to resolve are skipped rather than raising.
        if isinstance(props, list) and props:
            all_props.append({"query": name, **props[0]})
        time.sleep(0.25)  # pause between lookups to respect the rate limit
    return all_props
```
SMILES contain characters like =, (, ), # that break URLs. Always URL-encode: urllib.parse.quote(smiles). The = in CC(=O) must become %3D; quote() also encodes the parentheses, so the full result is CC%28%3DO%29.
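A quick way to see the effect of encoding is to run a SMILES string through `urllib.parse.quote` and decode it again. The `encode_smiles` wrapper below is a hypothetical convenience name; it only calls the standard-library function with its defaults.

```python
from urllib.parse import quote, unquote


def encode_smiles(smiles):
    """Percent-encode a SMILES string for use in a URL path segment."""
    # Note: quote() leaves "/" unencoded by default (safe="/").
    return quote(smiles)


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
encoded = encode_smiles(aspirin)
print(encoded)           # CC%28%3DO%29OC1%3DCC%3DCC%3DC1C%28%3DO%29O
print(encode_smiles("C#N"))  # C%23N
assert unquote(encoded) == aspirin  # encoding is reversible
```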
PubChem returns HTTP 200 for some errors with a Fault JSON object containing Code and Message. Common codes: PUGREST.NotFound (compound not found), PUGREST.BadRequest (malformed query), PUGREST.ServerBusy (rate limited).
PubChem's rate limits are stricter than those of most NCBI APIs. Use time.sleep(0.25) between calls. Exceeding limits returns PUGREST.ServerBusy with a "Please throttle your requests" message.
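Instead of a fixed sleep after every call, one option is a small throttle that sleeps only for whatever part of the 0.25 s interval has not already elapsed. The `Throttle` class below is a hypothetical sketch; the `clock` and `sleep` parameters exist only so it can be exercised without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (0.25 s by default)."""

    def __init__(self, min_interval=0.25, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle()
for cid in (2244, 3672, 5988):
    throttle.wait()  # call right before each request
    # a request such as pubchem_get(...) would go here
```

A 0.25 s spacing works out to at most 4 requests per second and 240 per minute, which sits under both stated limits.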
PubChem resolves CAS numbers like names: /compound/name/50-78-2/... correctly maps to aspirin (CID 2244). No special endpoint needed.
Ambiguous names can match multiple compounds. The property endpoint returns the first match. Use /compound/name/{name}/cids/JSON to see all matches, then query by specific CID.
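The two-step approach described above could look roughly like this: build the /cids/JSON URL for the name, then inspect the returned CID list before choosing one. Both helpers are hypothetical, and the sample response is hand-written with placeholder CIDs purely to show the shape.

```python
from urllib.parse import quote

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name):
    """URL that lists every CID matching a name."""
    return f"{PUG}/compound/name/{quote(name)}/cids/JSON"


def extract_cids(response):
    """Pull the CID list out of a parsed /cids/JSON response (empty if absent)."""
    return response.get("IdentifierList", {}).get("CID", [])


print(name_to_cids_url("bisphenol A"))

# Hand-written sample with placeholder CIDs, not real PubChem output.
sample = {"IdentifierList": {"CID": [1, 2, 3]}}
cids = extract_cids(sample)
if len(cids) > 1:
    print(f"{len(cids)} matches; query each CID explicitly: {cids}")
```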
PubChem's name resolution is surprisingly picky. Many names that seem obvious fail with PUGREST.NotFound.
Best practice: When a name lookup fails, try in this order: (1) CAS number, (2) base compound without salt/counter-ion, (3) simplified name, (4) SMILES or InChIKey if available. CAS numbers are the most reliable name-type identifier for PubChem lookups.
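The fallback order above can be sketched as a loop over candidate identifiers. `resolve_with_fallbacks` is a hypothetical helper; it assumes a lookup callable shaped like compound_properties (a list on success, a dict with an "error" key on failure). The stub lookup and the placeholder identifiers exist only so the example runs offline.

```python
def resolve_with_fallbacks(candidates, lookup):
    """Try (id_type, identifier) pairs in order; return the first hit.

    `lookup` is assumed to behave like compound_properties.
    """
    tried = []
    for id_type, identifier in candidates:
        if not identifier:
            continue  # skip identifiers the caller does not have
        result = lookup(identifier, id_type=id_type)
        if isinstance(result, list) and result:
            return {"matched_on": (id_type, identifier), "result": result}
        tried.append((id_type, identifier))
    return {"error": "PUGREST.NotFound", "tried": tried}


def fake_lookup(identifier, id_type="name"):
    """Offline stub: only one InChIKey resolves."""
    if id_type == "inchikey" and identifier == "BSYNRYMUTXBXSQ-UHFFFAOYSA-N":
        return [{"CID": 2244}]
    return {"error": "PUGREST.NotFound", "message": "stub: no match"}


outcome = resolve_with_fallbacks(
    [("name", "00-00-0"),                 # (1) CAS number (placeholder)
     ("name", "example base compound"),   # (2) name without salt/counter-ion
     ("name", "example"),                 # (3) simplified name
     ("smiles", None),                    # (4) SMILES, not available here
     ("inchikey", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N")],
    fake_lookup)
print(outcome["matched_on"])
```

With a real lookup, each attempt is a separate request, so a pause between attempts would still be needed.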
Searching by formula (e.g., C9H8O4) returns all compounds with that formula, not just the one you expect. Use MaxRecords to limit results, then filter by name or other properties.
Safety/hazard data, pharmacology summaries, and toxicity data are in PUG-View (/rest/pug_view/), not PUG-REST (/rest/pug/). PUG-REST has computed properties; PUG-View has curated content organized by headings.
Properties like Volume3D, FeatureCount3D, and steric quadrupoles require computed 3D conformers. Not all compounds have them. These return null/missing if unavailable.
The /assaysummary endpoint for well-studied compounds (like aspirin with 4,900+ assay results) returns large payloads. Process in batches or use column filtering.
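One way to reduce a large assay table is to tally a single column instead of keeping every row. The sketch below works on the `columns` and `Row`/`Cell` structures that assay_summary reads. The column name "Activity Outcome" and the sample rows are assumptions for illustration; check the actual column list in a real response.

```python
from collections import Counter


def count_by_column(columns, rows, column_name="Activity Outcome"):
    """Count the values of one column across assay summary rows.

    `column_name` is an assumed default; raises ValueError if it is absent.
    """
    idx = columns.index(column_name)
    cells = (r.get("Cell", []) for r in rows)
    return Counter(c[idx] for c in cells if len(c) > idx)


# Hand-written sample rows, not real PubChem output.
columns = ["AID", "Activity Outcome"]
rows = [{"Cell": ["1", "Active"]},
        {"Cell": ["2", "Inactive"]},
        {"Cell": ["3", "Inactive"]}]
print(count_by_column(columns, rows))
```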
Some property names in the URL do not match the keys in the JSON response. Most notably: requesting CanonicalSMILES returns a key called ConnectivitySMILES, and requesting IsomericSMILES returns a key called SMILES. Always check the actual keys in the response rather than assuming they match the request parameter.
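A small lookup that tolerates both spellings can hide this mismatch from calling code. `get_property` and `KEY_ALIASES` below are hypothetical names; the alias pairs are the two described above, and the sample record is hand-written.

```python
# Request-parameter name -> key observed in the JSON response (per the note above).
KEY_ALIASES = {
    "CanonicalSMILES": "ConnectivitySMILES",
    "IsomericSMILES": "SMILES",
}


def get_property(record, requested_name):
    """Look up a property by its request name, falling back to the response alias."""
    if requested_name in record:
        return record[requested_name]
    return record.get(KEY_ALIASES.get(requested_name, requested_name))


# Hand-written record shaped like one entry of PropertyTable.Properties.
record = {"CID": 2244, "ConnectivitySMILES": "CC(=O)OC1=CC=CC=C1C(=O)O"}
print(get_property(record, "CanonicalSMILES"))
```

Checking the request name first keeps the helper working if a response ever does use the requested spelling.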
The /description endpoint returns text from external databases (ChEBI, OEHHA, EPA, etc.). Well-known drugs may have only 1 or 2 sources. Do not assume a minimum number of descriptions will be available.
Compound names are in the URL path, not the query string. In URL paths, + is a literal plus sign, not a space. bisphenol+A fails with "No CID found"; bisphenol%20A succeeds. Always use urllib.parse.quote(name) which produces %20 for spaces.
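The difference comes down to which standard-library function does the encoding. The short comparison below shows `quote` next to `quote_plus`, which is meant for query strings and is the usual source of the stray +; `name_path_segment` is a hypothetical wrapper name.

```python
from urllib.parse import quote, quote_plus


def name_path_segment(name):
    """Encode a compound name for the URL path (spaces become %20)."""
    return quote(name)


print(name_path_segment("bisphenol A"))  # bisphenol%20A
print(quote_plus("bisphenol A"))         # bisphenol+A (query-string style; avoid in paths)
```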
Safety/GHS data is typically 3-4 levels deep: "Safety and Hazards" > "Hazards Identification" > "GHS Classification" > specific items. Use recursive traversal when searching for specific headings, not fixed-depth indexing.
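The recursive traversal suggested above can be sketched as a depth-first search over nested sections. This assumes each section dict names itself under the key "TOCHeading" and nests children under "Section"; `find_sections` is a hypothetical helper, and the sample tree is hand-written with placeholder content.

```python
def find_sections(sections, heading):
    """Depth-first search for every section whose TOCHeading equals `heading`."""
    found = []
    for sec in sections or []:
        if sec.get("TOCHeading") == heading:
            found.append(sec)
        # Recurse regardless of match so deeper duplicates are also collected.
        found.extend(find_sections(sec.get("Section"), heading))
    return found


# Hand-written tree mirroring the heading path described above.
sample = [{
    "TOCHeading": "Safety and Hazards",
    "Section": [{
        "TOCHeading": "Hazards Identification",
        "Section": [{
            "TOCHeading": "GHS Classification",
            "Information": [{"Name": "placeholder"}],
        }],
    }],
}]

hits = find_sections(sample, "GHS Classification")
print(len(hits), hits[0]["Information"])
```

The list returned by pug_view_section could be passed in as `sections` directly.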
| Error | Cause | Fix |
|---|---|---|
| PUGREST.NotFound | Compound name not recognized | Try CAS number first; then base compound name without salt form; then SMILES. Many common names (ascorbic acid, vitamin C, lead acetate) don't resolve but their CAS numbers do |
| PUGREST.BadRequest | Malformed query or invalid identifier | Check URL encoding, especially for SMILES. Verify CID is a positive integer |
| PUGREST.ServerBusy | Rate limit exceeded | Add time.sleep(0.25) between calls. Reduce batch sizes |
| Empty PropertyTable | Formula/similarity search returned no matches | Broaden search (lower similarity threshold, check formula) |
| SMILES lookup fails silently | Characters not URL-encoded | Always use urllib.parse.quote() on SMILES strings |
| Missing 3D properties | No 3D conformer computed | Not all compounds have 3D data. Use 2D properties as fallback |
| PUG-View returns nested sections | Normal; data is hierarchically organized | Navigate Section > Section > Information structure |
| CanonicalSMILES key not in response | Response uses different key name | Request CanonicalSMILES returns as ConnectivitySMILES; IsomericSMILES returns as SMILES |
| "No CID found" for compound with space in name | + in URL path is literal, not a space | Use urllib.parse.quote(name) to get %20 encoding, not + |