Skill Profile

PubChem PUG-REST API Skill

Name: PubChem PUG-REST API Skill
Author: 1102tools

Query the PubChem PUG-REST API for chemical compound data including molecular properties, structures, synonyms, safety/hazard data, bioassay results, and cross-references. Trigger for any mention of PubChem, CID, chemical compound lookup, molecular formula, molecular weight, SMILES, InChI, InChIKey, chemical structure, compound properties, chemical safety, GHS hazards, bioassay, compound similarity, substructure search, CAS number lookup, XLogP, TPSA, hydrogen bond donors/acceptors, drug-likeness, Lipinski rules, or toxicology data. Also trigger when the user needs to identify a compound by name or structure, compare molecular properties, find structurally similar compounds, look up safety data for a chemical, or cross-reference compound identifiers across databases. This skill serves toxicologists, chemists, reviewers, and anyone evaluating compound safety profiles or chemical data at FDA.

1102tools · 0 stars · April 10, 2026

Occupation
Category: Computational Chemistry

Skill Content

Overview

The PubChem PUG-REST API (https://pubchem.ncbi.nlm.nih.gov/rest/pug/) provides free, no-auth access to PubChem's database of 100M+ chemical compounds, 300M+ substance records, and 1M+ bioassay records. PubChem is the world's largest open-access chemical database, maintained by NCBI at the National Library of Medicine.

Base URL (PUG-REST): https://pubchem.ncbi.nlm.nih.gov/rest/pug/
Base URL (PUG-View): https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/
API Docs: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
Web UI: https://pubchem.ncbi.nlm.nih.gov

No API key required. Rate limit: 5 requests/second, 400 requests/minute. Add time.sleep(0.25) between calls.
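
A fixed 0.25-second pause keeps a sequential client at about 4 requests per second, under both stated limits. As a minimal sketch, the pause can be wrapped in a small throttle that sleeps only for the part of the interval that has not already elapsed. The `Throttle` class and its `clock` and `sleep` parameters are illustrative names, not part of PubChem or of this skill's helpers.

```python
import time

class Throttle:
    """Keep consecutive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=0.25, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request gives the same spacing as an unconditional sleep, without adding delay when the previous call was already slow.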

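import json
import time
import urllib.error
import urllib.parse
import urllib.request

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"

def pubchem_get(url):
    """GET a PubChem URL; return parsed JSON, or {"error", "message"} on failure."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=20) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as e:
        # PubChem describes failures in a JSON "Fault" object in the response body.
        body = e.read().decode(errors="replace")
        try:
            fault = json.loads(body).get("Fault", {})
            return {"error": fault.get("Code", ""), "message": fault.get("Message", "")}
        except (ValueError, AttributeError):
            # Body was not a JSON object; fall back to the HTTP status code.
            return {"error": str(e.code), "message": body[:200]}
    except urllib.error.URLError as e:
        # Network-level failure (DNS, refused connection).
        return {"error": "URLError", "message": str(e.reason)}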

def compound_properties(identifier, id_type="name", properties=None):
    """Get computed properties for a compound.
    
    id_type: 'name', 'cid', 'smiles', 'inchikey', 'formula'
    properties: comma-separated property names (see Properties Reference)
    """
    if properties is None:
        properties = "MolecularFormula,MolecularWeight,IUPACName,CanonicalSMILES,InChIKey,XLogP,TPSA,HBondDonorCount,HBondAcceptorCount,RotatableBondCount,HeavyAtomCount,Complexity,Charge,ExactMass"
    
    # Formula uses fastformula endpoint
    if id_type == "formula":
        url = f"{PUG}/compound/fastformula/{urllib.parse.quote(str(identifier))}/property/{properties}/JSON"
    else:
        url = f"{PUG}/compound/{id_type}/{urllib.parse.quote(str(identifier))}/property/{properties}/JSON"
    
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("PropertyTable", {}).get("Properties", [])

/rest/pug/{domain}/{namespace}/{identifiers}/{operation}/{output}

Part	Values	Example
domain	`compound`, `substance`, `assay`	`compound`
namespace	`cid`, `name`, `smiles`, `inchikey`, `fastformula`, `fastsimilarity_2d`, `fastsubstructure`	`name`
identifiers	The actual value(s)	`aspirin` or `2244`
operation	`property`, `synonyms`, `sids`, `xrefs`, `description`, `classification`, `assaysummary`, `PNG`, `JSON`	`property/MolecularWeight`
output	`JSON`, `XML`, `CSV`, `TXT`, `SDF`, `PNG`, `SVG`	`JSON`
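
The URL pattern above can be assembled by a small builder, sketched here under the assumption that only the identifier segment needs percent-encoding. `build_pug_url` is a hypothetical helper name; it uses `urllib.parse.quote`, which encodes a space as `%20` (unlike `quote_plus`, which emits `+`).

```python
from urllib.parse import quote

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def build_pug_url(domain, namespace, identifier, operation, output="JSON", **params):
    """Assemble /rest/pug/{domain}/{namespace}/{identifiers}/{operation}/{output}."""
    # Percent-encode the identifier so SMILES characters and spaces survive in the path.
    ident = quote(str(identifier))
    url = f"{PUG}/{domain}/{namespace}/{ident}/{operation}/{output}"
    if params:
        # Optional query arguments such as MaxRecords or Threshold.
        url += "?" + "&".join(f"{key}={quote(str(value))}" for key, value in params.items())
    return url

print(build_pug_url("compound", "name", "acetic acid", "cids"))
print(build_pug_url("compound", "smiles", "CC(=O)O", "cids", MaxRecords=5))
```

The builder only constructs strings, so it can be checked offline before any request is sent.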

Namespace	Example	Description
`name`	`/compound/name/aspirin/...`	Search by drug/chemical name
`cid`	`/compound/cid/2244/...`	Direct CID lookup
`smiles`	`/compound/smiles/{encoded}/...`	Search by SMILES string (URL-encode)
`inchikey`	`/compound/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/...`	Search by InChIKey
`fastformula`	`/compound/fastformula/C9H8O4/...`	Search by molecular formula
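
When the kind of identifier is not known in advance, a namespace can be guessed from its shape. The sketch below is a heuristic under stated assumptions: `guess_namespace` and `looks_like_cas` are hypothetical helpers, the regular expressions describe the usual InChIKey and CAS layouts, and SMILES and formulas are not detected (pass those namespaces explicitly). CAS numbers are routed to `name`, since they resolve through the name namespace.

```python
import re

_CID = re.compile(r"[1-9]\d*")                        # positive integer
_INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")  # 27-character hashed InChI
_CAS = re.compile(r"\d{2,7}-\d{2}-\d")                # e.g. 50-78-2

def looks_like_cas(identifier):
    """True if the text has the digit-group layout of a CAS registry number."""
    return bool(_CAS.fullmatch(str(identifier).strip()))

def guess_namespace(identifier):
    """Return 'cid', 'inchikey', or 'name' based on the identifier's shape."""
    text = str(identifier).strip()
    if _CID.fullmatch(text):
        return "cid"
    if _INCHIKEY.fullmatch(text):
        return "inchikey"
    # CAS numbers and free-text names both go through the name namespace.
    return "name"
```

A purely numeric string is always treated as a CID here, which is a design choice rather than a PubChem rule.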

Property	Description	Example
`MolecularFormula`	Molecular formula	C9H8O4
`MolecularWeight`	Molecular weight (g/mol)	180.16
`IUPACName`	IUPAC systematic name	2-acetyloxybenzoic acid
`CanonicalSMILES`	Canonical SMILES string (response key: `ConnectivitySMILES`)	CC(=O)OC1=CC=CC=C1C(=O)O
`IsomericSMILES`	SMILES with stereochemistry (response key: `SMILES`)
`InChI`	IUPAC International Chemical Identifier
`InChIKey`	Hashed InChI (27-char)	BSYNRYMUTXBXSQ-UHFFFAOYSA-N
`XLogP`	Predicted octanol-water partition coefficient	1.2
`TPSA`	Topological polar surface area (Å²)	63.6
`ExactMass`	Monoisotopic exact mass	180.04226
`HBondDonorCount`	Hydrogen bond donors	1
`HBondAcceptorCount`	Hydrogen bond acceptors	4
`RotatableBondCount`	Rotatable bonds	3
`HeavyAtomCount`	Non-hydrogen atoms	13
`Complexity`	Bertz complexity index	212
`Charge`	Formal charge	0
`Volume3D`	3D molecular volume	136
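
Because the SMILES properties come back under different keys than the names used in the request, downstream code that reads `CanonicalSMILES` will find nothing. One way to handle this, sketched with a hypothetical `normalize_property_keys` helper, is to copy each response key back to its request name.

```python
# Response key -> request name, as listed in the table above.
RESPONSE_KEY_ALIASES = {
    "ConnectivitySMILES": "CanonicalSMILES",
    "SMILES": "IsomericSMILES",
}

def normalize_property_keys(record):
    """Return a copy of a property record that also carries the request names."""
    out = dict(record)  # do not mutate the caller's dict
    for response_key, request_name in RESPONSE_KEY_ALIASES.items():
        if response_key in out and request_name not in out:
            out[request_name] = out[response_key]
    return out

sample = {"CID": 2244, "ConnectivitySMILES": "CC(=O)OC1=CC=CC=C1C(=O)O"}
print(normalize_property_keys(sample)["CanonicalSMILES"])
```

Keeping the original keys alongside the aliases means code written against either spelling keeps working.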

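def compound_synonyms(identifier, id_type="name"):
    """Return the CID and synonym list for the first matching compound."""
    url = f"{PUG}/compound/{id_type}/{urllib.parse.quote(str(identifier))}/synonyms/JSON"
    data = pubchem_get(url)
    if "error" in data:
        return data
    # Guard against an empty Information list before indexing.
    records = data.get("InformationList", {}).get("Information", [])
    info = records[0] if records else {}
    return {"cid": info.get("CID"), "synonyms": info.get("Synonym", [])}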

def compound_description(cid):
    """Return text descriptions for a CID, each tagged with its source."""
    url = f"{PUG}/compound/cid/{cid}/description/JSON"
    data = pubchem_get(url)
    if "error" in data:
        return data
    # Entries without a Description field are skipped.
    return [{"source": d.get("DescriptionSourceName",""),
             "description": d.get("Description","")}
            for d in data.get("InformationList",{}).get("Information",[]) if d.get("Description")]

def similarity_search(identifier, id_type="cid", threshold=90, max_records=10, properties=None):
    """Find compounds at or above a 2D Tanimoto similarity threshold (percent)."""
    if properties is None:
        properties = "IUPACName,MolecularWeight,MolecularFormula"
    url = (f"{PUG}/compound/fastsimilarity_2d/{id_type}/{urllib.parse.quote(str(identifier))}"
           f"/property/{properties}/JSON?Threshold={threshold}&MaxRecords={max_records}")
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("PropertyTable", {}).get("Properties", [])

def substructure_search(smiles, max_records=10):
    """Return CIDs of compounds that contain the given SMILES substructure."""
    url = (f"{PUG}/compound/fastsubstructure/smiles/{urllib.parse.quote(smiles)}"
           f"/cids/JSON?MaxRecords={max_records}")
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("IdentifierList", {}).get("CID", [])

def formula_search(formula, max_records=10, properties=None):
    """Return properties for compounds matching a molecular formula (all isomers)."""
    if properties is None:
        properties = "IUPACName,MolecularWeight"
    url = (f"{PUG}/compound/fastformula/{urllib.parse.quote(str(formula))}"
           f"/property/{properties}/JSON?MaxRecords={max_records}")
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("PropertyTable", {}).get("Properties", [])

def assay_summary(cid):
    """Summarize bioassay results for a CID: column names, row count, first 5 rows."""
    url = f"{PUG}/compound/cid/{cid}/assaysummary/JSON"
    data = pubchem_get(url)
    if "error" in data:
        return data
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    rows = table.get("Row", [])
    # The full table can be very large, so only a small sample of rows is returned.
    return {"columns": columns, "row_count": len(rows),
            "sample_rows": [r.get("Cell",[]) for r in rows[:5]]}

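The assay summary arrives as a column list plus rows of bare cells, which is awkward to filter. A minimal sketch of turning that `Table` object into one dict per row follows; `assay_rows_to_dicts` and `count_by_column` are hypothetical helpers, and the column names in the sample are illustrative rather than a guaranteed schema.

```python
from collections import Counter

def assay_rows_to_dicts(table):
    """Pair each row's cells with the column names from Table > Columns > Column."""
    columns = table.get("Columns", {}).get("Column", [])
    return [dict(zip(columns, row.get("Cell", []))) for row in table.get("Row", [])]

def count_by_column(rows, column):
    """Count how often each value appears in one column."""
    return dict(Counter(row.get(column) for row in rows))

# Illustrative sample shaped like the Table object read by assay_summary().
sample_table = {
    "Columns": {"Column": ["AID", "Activity Outcome"]},
    "Row": [{"Cell": [1, "Active"]}, {"Cell": [2, "Inactive"]}, {"Cell": [3, "Active"]}],
}
rows = assay_rows_to_dicts(sample_table)
print(count_by_column(rows, "Activity Outcome"))
```

Counting outcomes first is a cheap way to size up a large summary before looking at individual assays.
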
def structure_image_url(cid, fmt="PNG", size=300):
    """Build the URL of a 2D structure image; no request is made."""
    return f"{PUG}/compound/cid/{cid}/{fmt}?image_size={size}x{size}"

def pug_view_section(cid, heading):
    """Fetch one top-level PUG-View section (e.g. 'Safety and Hazards') for a CID."""
    url = f"{PUG_VIEW}/data/compound/{cid}/JSON?heading={urllib.parse.quote(heading)}"
    data = pubchem_get(url)
    if "error" in data:
        return data
    return data.get("Record", {}).get("Section", [])

Heading	Content
`Safety and Hazards`	GHS classification, hazard statements, precautions
`Pharmacology and Biochemistry`	Pharmacodynamics, MeSH classification, mechanism
`Toxicity`	LD50, LC50, toxicological data
`Drug and Medication Information`	Therapeutic uses, drug classes
`Chemical and Physical Properties`	Computed and experimental properties

def compound_profile(name):
    """Get a complete profile for a compound by name."""
    props = compound_properties(name, id_type="name",
        properties="MolecularFormula,MolecularWeight,IUPACName,CanonicalSMILES,InChIKey,XLogP,TPSA,HBondDonorCount,HBondAcceptorCount,RotatableBondCount,Complexity,Charge,ExactMass")
    if isinstance(props, dict) and "error" in props:
        return props
    if not props:
        return {"error": "No results"}
    
    p = props[0]
    cid = p.get("CID")
    
    time.sleep(0.25)
    syns = compound_synonyms(str(cid), id_type="cid")
    # compound_synonyms always returns a dict; an "error" key marks a failed lookup.
    synonym_list = syns.get("synonyms", [])[:10] if "error" not in syns else []
    
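    return {
        "cid": cid,
        "name": p.get("IUPACName"),
        "formula": p.get("MolecularFormula"),
        "molecular_weight": p.get("MolecularWeight"),
        # A CanonicalSMILES request comes back under the key ConnectivitySMILES.
        "smiles": p.get("ConnectivitySMILES") or p.get("CanonicalSMILES"),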
        "inchikey": p.get("InChIKey"),
        "xlogp": p.get("XLogP"),
        "tpsa": p.get("TPSA"),
        "hbd": p.get("HBondDonorCount"),
        "hba": p.get("HBondAcceptorCount"),
        "rotatable_bonds": p.get("RotatableBondCount"),
        "complexity": p.get("Complexity"),
        "synonyms": synonym_list,
        "image_url": structure_image_url(cid) if cid else None
    }

def lipinski_check(name):
    """Check Lipinski's Rule of Five for drug-likeness."""
    props = compound_properties(name, id_type="name",
        properties="MolecularWeight,XLogP,HBondDonorCount,HBondAcceptorCount")
    if isinstance(props, dict) and "error" in props:
        return props
    if not props:
        return {"error": "No results"}
    p = props[0]
    # Note: a missing value is treated as 0 and therefore passes its check.
    checks = {
        "MW <= 500": float(p.get("MolecularWeight") or 0) <= 500,
        "XLogP <= 5": (p.get("XLogP") or 0) <= 5,
        "HBD <= 5": (p.get("HBondDonorCount") or 0) <= 5,
        "HBA <= 10": (p.get("HBondAcceptorCount") or 0) <= 10,
    }
    violations = sum(1 for v in checks.values() if not v)
    return {"compound": p.get("CID"), "checks": checks,
            "violations": violations, "drug_like": violations <= 1}

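The check above counts a missing property as a pass. Where that matters (for example, a compound with no computed XLogP), a stricter variant can report missing values separately. This is a sketch of one possible design, not a replacement for `lipinski_check`; `lipinski_from_properties` is a hypothetical helper that works on an already-fetched property record.

```python
LIPINSKI_LIMITS = {
    "MolecularWeight": 500,
    "XLogP": 5,
    "HBondDonorCount": 5,
    "HBondAcceptorCount": 10,
}

def lipinski_from_properties(props):
    """Evaluate the Rule of Five, keeping unreported values apart from passes."""
    checks = {}
    for key, limit in LIPINSKI_LIMITS.items():
        raw = props.get(key)
        # None marks "not reported"; float() also accepts numeric strings.
        checks[key] = None if raw is None else float(raw) <= limit
    violations = sum(1 for ok in checks.values() if ok is False)
    unknown = [key for key, ok in checks.items() if ok is None]
    if violations > 1:
        drug_like = False
    elif unknown:
        drug_like = None   # cannot decide without the missing values
    else:
        drug_like = True
    return {"checks": checks, "violations": violations,
            "unknown": unknown, "drug_like": drug_like}

# Aspirin values taken from the property table earlier in this document.
aspirin = {"MolecularWeight": "180.16", "XLogP": 1.2,
           "HBondDonorCount": 1, "HBondAcceptorCount": 4}
print(lipinski_from_properties(aspirin)["drug_like"])
```

Returning `None` for the undecided case forces the caller to handle missing data explicitly instead of reading it as a pass.
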
def cas_to_compound(cas_number):
    """Look up a compound by CAS registry number."""
    return compound_properties(cas_number, id_type="name",
        properties="IUPACName,MolecularFormula,MolecularWeight,CanonicalSMILES,InChIKey")

def compare_compounds(names):
    """Compare properties across multiple compounds."""
    all_props = []
    for name in names:
        props = compound_properties(name, id_type="name",
            properties="MolecularFormula,MolecularWeight,XLogP,TPSA,HBondDonorCount,HBondAcceptorCount,RotatableBondCount")
        if isinstance(props, list) and props:
            all_props.append({"query": name, **props[0]})
        time.sleep(0.25)
    return all_props

Error	Cause	Fix
PUGREST.NotFound	Compound name not recognized	Try CAS number first; then base compound name without salt form; then SMILES. Many common names (ascorbic acid, vitamin C, lead acetate) don't resolve but their CAS numbers do
PUGREST.BadRequest	Malformed query or invalid identifier	Check URL encoding, especially for SMILES. Verify CID is a positive integer
PUGREST.ServerBusy	Rate limit exceeded	Add time.sleep(0.25) between calls. Reduce batch sizes
Empty PropertyTable	Formula/similarity search returned no matches	Broaden search (lower similarity threshold, check formula)
SMILES lookup fails silently	Characters not URL-encoded	Always use urllib.parse.quote() on SMILES strings
Missing 3D properties	No 3D conformer computed	Not all compounds have 3D data. Use 2D properties as fallback
PUG-View returns nested sections	Normal; data is hierarchically organized	Navigate Section > Section > Information structure
CanonicalSMILES key not in response	Response uses different key name	A request for `CanonicalSMILES` comes back under the key `ConnectivitySMILES`; `IsomericSMILES` comes back as `SMILES`
"No CID found" for compound with space in name	`+` in URL path is literal, not a space	Use `urllib.parse.quote(name)` to get `%20` encoding, not `+`

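For the `PUGREST.ServerBusy` row, a retry with increasing delay is a common complement to the fixed sleep. The sketch below assumes a fetch function that returns the `{"error": ..., "message": ...}` dict produced by `pubchem_get`; `get_with_retry` and its parameters are hypothetical names, and the delay values are arbitrary starting points.

```python
import time

def get_with_retry(fetch, url, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url); on PUGREST.ServerBusy, wait and retry with doubling delay."""
    result = fetch(url)
    for attempt in range(retries):
        busy = isinstance(result, dict) and result.get("error") == "PUGREST.ServerBusy"
        if not busy:
            break
        sleep(base_delay * 2 ** attempt)   # 0.5 s, 1 s, 2 s with the defaults
        result = fetch(url)
    return result
```

Usage would look like `get_with_retry(pubchem_get, url)`. Other errors, such as `PUGREST.NotFound`, are returned immediately because retrying them cannot help.
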
xref_type	Returns
`RegistryID`	External registry IDs (CAS numbers, etc.)
`MMDBID`	MMDB structure IDs
`ProteinGI`	Protein GI numbers
`GeneID`	NCBI Gene IDs
`PatentID`	Patent IDs

PubChem PUG-REST API Skill

Overview

Core Helper Functions

URL Pattern

Endpoints

1. Compound Properties (PRIMARY WORKHORSE)

2. Synonyms

3. Description

4. Similarity Search (2D Fingerprint)

5. Substructure Search

6. Formula Search

7. Bioassay Summary for a Compound

8. Cross-References

9. Structure Images

10. PUG-View: Curated Summary Sections

Common Workflows

Quick Compound Profile

Lipinski Rule of Five Check

CAS Number Lookup

Compare Multiple Compounds

Gotchas and Best Practices

1. SMILES Strings Must Be URL-Encoded (MOST COMMON MISTAKE)

2. Error Responses Use "Fault" Object, Not HTTP Status

3. Rate Limit: 5 req/sec, 400 req/min

4. CAS Numbers Work Through the Name Namespace

5. Name Lookups May Return Multiple CIDs

6. Many Common Chemical Names Do NOT Resolve (CRITICAL)

7. Formula Search Returns All Isomers

8. PUG-View is Separate from PUG-REST

9. 3D Properties Require 3D Conformers

10. Bioassay Summary Can Be Very Large

11. Response Property Keys Differ from Request Names

12. Description Sources Vary Widely by Compound

13. Spaces in Compound Names Must Use %20, Not +

14. PUG-View Sections Are Deeply Nested

Troubleshooting
