Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics.
PubChem is the world's largest freely available chemical database with 110M+ compounds and 270M+ bioactivities. Query chemical structures by name, CID, or SMILES, retrieve molecular properties, perform similarity and substructure searches, access bioactivity data using PUG-REST API and PubChemPy.
This skill should be used when:
Search for compounds using multiple identifier types:
By Chemical Name:
import pubchempy as pcp
compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
By CID (Compound ID):
compound = pcp.Compound.from_cid(2244) # Aspirin
By SMILES:
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
By InChI:
compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
By Molecular Formula:
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
Retrieve molecular properties for compounds using either high-level or low-level approaches:
Using PubChemPy (Recommended):
import pubchempy as pcp
# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]
# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp # Partition coefficient
tpsa = compound.tpsa # Topological polar surface area
Get Specific Properties:
# Request only specific properties
properties = pcp.get_properties(
['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
'aspirin',
'name'
)
# Returns list of dictionaries
Batch Property Retrieval:
import pandas as pd
compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []
for name in compound_names:
props = pcp.get_properties(
['MolecularFormula', 'MolecularWeight', 'XLogP'],
name,
'name'
)
all_properties.extend(props)
df = pd.DataFrame(all_properties)
Available Properties: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see references/api_reference.md for complete list).
Find structurally similar compounds using Tanimoto similarity:
import pubchempy as pcp
# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles
# Perform similarity search
similar_compounds = pcp.get_compounds(
query_smiles,
'smiles',
searchtype='similarity',
Threshold=85, # Similarity threshold (0-100)
MaxRecords=50
)
# Process results
for compound in similar_compounds[:10]:
print(f"CID {compound.cid}: {compound.iupac_name}")
print(f" MW: {compound.molecular_weight}")
Note: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
Find compounds containing a specific structural motif:
import pubchempy as pcp
# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'
matches = pcp.get_compounds(
pyridine_smiles,
'smiles',
searchtype='substructure',
MaxRecords=100
)
print(f"Found {len(matches)} compounds containing pyridine")
Common Substructures:
c1ccccc1c1ccncc1c1ccc(O)cc1C(=O)OConvert between different chemical structure formats:
import pubchempy as pcp
compound = pcp.get_compounds('aspirin', 'name')[0]
# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid
# Download structure files
pcp.download('SDF', 'aspirin', 'name', 'aspirin.sdf', overwrite=True)
pcp.download('JSON', '2244', 'cid', 'aspirin.json', overwrite=True)
Generate 2D structure images:
import pubchempy as pcp
# Download compound structure as PNG
pcp.download('PNG', 'caffeine', 'name', 'caffeine.png', overwrite=True)
# Using direct URL (via requests)
import requests
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
with open('structure.png', 'wb') as f:
f.write(response.content)
Get all known names and synonyms for a compound:
import pubchempy as pcp
synonyms_data = pcp.get_synonyms('aspirin', 'name')
if synonyms_data:
cid = synonyms_data[0]['CID']
synonyms = synonyms_data[0]['Synonym']
print(f"CID {cid} has {len(synonyms)} synonyms:")
for syn in synonyms[:10]: # First 10
print(f" - {syn}")
Retrieve biological activity data from assays:
import requests
import json
# Get bioassay summary for a compound
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)
if response.status_code == 200:
data = response.json()
# Process bioassay information
table = data.get('Table', {})
rows = table.get('Row', [])
print(f"Found {len(rows)} bioassay records")
For more complex bioactivity queries, use the scripts/bioactivity_query.py helper script which provides:
Access detailed compound information through PUG-View:
import requests
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
response = requests.get(url)
if response.status_code == 200:
annotations = response.json()
# Contains extensive data including:
# - Chemical and Physical Properties
# - Drug and Medication Information
# - Pharmacology and Biochemistry
# - Safety and Hazards
# - Toxicity
# - Literature references
# - Patents
Get Specific Section:
# Get only drug information
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug and Medication Information"
Install PubChemPy for Python-based access:
uv pip install pubchempy
For direct API access and bioactivity queries:
uv pip install requests
Optional for data analysis:
uv pip install pandas
This skill includes Python scripts for common PubChem tasks:
Provides utility functions for searching and retrieving compound information:
Key Functions:
search_by_name(name, max_results=10): Search compounds by namesearch_by_smiles(smiles): Search by SMILES stringget_compound_by_cid(cid): Retrieve compound by CIDget_compound_properties(identifier, namespace, properties): Get specific propertiessimilarity_search(smiles, threshold, max_records): Perform similarity searchsubstructure_search(smiles, max_records): Perform substructure searchget_synonyms(identifier, namespace): Get all synonymsbatch_search(identifiers, namespace, properties): Batch search multiple compoundsdownload_structure(identifier, namespace, format, filename): Download structuresprint_compound_info(compound): Print formatted compound informationUsage:
from scripts.compound_search import search_by_name, get_compound_properties
# Search for a compound
compounds = search_by_name('ibuprofen')
# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
Provides functions for retrieving biological activity data:
Key Functions:
get_bioassay_summary(cid): Get bioassay summary for compoundget_compound_bioactivities(cid, activity_outcome): Get filtered bioactivitiesget_assay_description(aid): Get detailed assay informationget_assay_targets(aid): Get biological targets for assaysearch_assays_by_target(target_name, max_results): Find assays by targetget_active_compounds_in_assay(aid, max_results): Get active compoundsget_compound_annotations(cid, section): Get PUG-View annotationssummarize_bioactivities(cid): Generate bioactivity summary statisticsfind_compounds_by_bioactivity(target, threshold, max_compounds): Find compounds by targetUsage:
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities
# Get bioactivity summary
summary = summarize_bioactivities(2244) # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
Rate Limits:
Best Practices:
Error Handling:
from pubchempy import BadRequestError, NotFoundError, TimeoutError