Use when the task involves PubChem, chemical compound lookup, SMILES, InChI, molecular properties, similarity search, substructure search, bioactivity data, chemical format conversion, or structure image generation. Also use when the user wants to search for compounds by name or identifier, retrieve PubChem data programmatically, or analyze chemistry information from the PubChem database.
PubChem is the world's largest freely available chemical database, maintained by the National Center for Biotechnology Information (NCBI). It contains over 110 million unique chemical structures and over 270 million bioactivities from more than 770 data sources. This skill provides guidance for programmatically accessing PubChem data using the PUG-REST API and the PubChemPy Python library.
Search for compounds using multiple identifier types:
By Chemical Name:
import pubchempy as pcp
compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
By CID (Compound ID):
compound = pcp.Compound.from_cid(2244) # Aspirin
By SMILES:
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
By InChI:
compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
By Molecular Formula:
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
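Each lookup above corresponds to a PUG-REST URL of the form /compound/<namespace>/<identifier>/<operation>/<output>. The following is a minimal sketch of that mapping using only the standard library. `build_lookup_url` and `PATH_SAFE_NAMESPACES` are hypothetical names, not part of PubChemPy, and the sketch deliberately covers only namespaces that are safe to place in a URL path; SMILES and InChI strings often contain characters such as `/` and are usually better sent as POST data.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Namespaces this sketch accepts in the URL path (an illustrative subset).
PATH_SAFE_NAMESPACES = {"name", "cid", "inchikey"}


def build_lookup_url(identifier, namespace="name", operation="cids", output="JSON"):
    """Build a PUG-REST compound lookup URL (hypothetical helper)."""
    if namespace not in PATH_SAFE_NAMESPACES:
        raise ValueError(f"namespace {namespace!r} is not handled by this sketch")
    # Percent-encode the identifier so spaces and punctuation survive the path.
    encoded = quote(str(identifier), safe="")
    return f"{PUG_REST_BASE}/compound/{namespace}/{encoded}/{operation}/{output}"


print(build_lookup_url("aspirin"))
print(build_lookup_url(2244, namespace="cid", operation="synonyms"))
print(build_lookup_url("acetylsalicylic acid"))
```

Building URLs in one place keeps the encoding rules consistent when a task mixes PubChemPy calls with direct requests.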
Retrieve molecular properties for compounds using either high-level or low-level approaches:
Using PubChemPy (Recommended):
import pubchempy as pcp
# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]
# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp # Partition coefficient
tpsa = compound.tpsa # Topological polar surface area
Get Specific Properties:
# Request only specific properties
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
Batch Property Retrieval:
import pandas as pd
import pubchempy as pcp
compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []
for name in compound_names:
    # One request per name; each call returns a list of dictionaries
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    all_properties.extend(props)
df = pd.DataFrame(all_properties)
Available Properties: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see references/api_reference.md for complete list).
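When the CIDs are already known, PUG-REST accepts a comma-separated CID list, so several compounds can share one property request instead of one request per name. The sketch below builds such URLs in chunks and unpacks the `PropertyTable` response shape. The helper names and the chunk size of 100 are assumptions chosen for illustration, and the sample payload is hand-written rather than fetched.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_property_urls(cids, properties, chunk_size=100):
    """Return one property URL per chunk of CIDs (chunk_size is an arbitrary choice)."""
    prop_part = ",".join(quote(p, safe="") for p in properties)
    urls = []
    for chunk in chunked(list(cids), chunk_size):
        cid_part = ",".join(str(int(cid)) for cid in chunk)
        urls.append(f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/JSON")
    return urls


def rows_from_property_table(payload):
    """Extract the list of per-compound dicts from a property response."""
    return payload.get("PropertyTable", {}).get("Properties", [])


# CIDs for aspirin, ibuprofen, and paracetamol.
urls = build_property_urls([2244, 3672, 1983], ["MolecularFormula", "MolecularWeight"])
print(urls[0])

# Hand-written payload in the shape of a property response (not live data).
sample = {"PropertyTable": {"Properties": [
    {"CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16"},
]}}
print(rows_from_property_table(sample))
```

The rows returned by `rows_from_property_table` are plain dictionaries, so they can be passed straight to `pd.DataFrame` as in the batch example.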
Find structurally similar compounds using Tanimoto similarity:
import pubchempy as pcp
# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles
# Perform similarity search
similar_compounds = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85, # Similarity threshold (0-100)
    MaxRecords=50
)
# Process results
for compound in similar_compounds[:10]:
    print(f"CID {compound.cid}: {compound.iupac_name}")
    print(f" MW: {compound.molecular_weight}")
Note: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
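To make the threshold concrete, the Tanimoto coefficient of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on toy bit sets; the bit positions are made up for illustration and are not real PubChem fingerprints, and `tanimoto` and `passes_threshold` are hypothetical helper names.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: treated as identical by convention in this sketch.
        return 1.0
    return len(a & b) / len(union)


def passes_threshold(bits_a, bits_b, threshold_percent):
    """Mirror the 0-100 scale of the Threshold parameter used in the search."""
    return tanimoto(bits_a, bits_b) * 100 >= threshold_percent


# Toy fingerprints (made-up bit positions).
query = {1, 4, 7, 9, 12, 15}
candidate = {1, 4, 7, 9, 12, 20}

score = tanimoto(query, candidate)  # 5 shared bits out of 7 distinct bits
print(f"Tanimoto: {score:.3f}")
print("Passes 85% threshold:", passes_threshold(query, candidate, 85))
```

Here a single differing bit on each side already drops the score to about 0.71, which shows how selective a threshold of 85 is.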
Find compounds containing a specific structural motif:
import pubchempy as pcp
# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'
matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)
print(f"Found {len(matches)} compounds containing pyridine")
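PubChemPy hides the asynchronous flow behind `get_compounds`, but when calling PUG-REST directly a structure search first returns a `ListKey` that has to be polled until results are ready. The sketch below shows one way to write that loop. It assumes the `Waiting` / `IdentifierList` response shapes, takes the HTTP call as an injected `fetch_json` callable (a hypothetical parameter, for example a thin wrapper around `requests.get(url).json()`), and uses canned responses with placeholder CIDs so it runs offline.

```python
import time

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def poll_listkey(fetch_json, first_response, interval=2.0, max_polls=30, sleep=time.sleep):
    """Poll a PUG-REST ListKey until the CID list is ready.

    fetch_json: callable taking a URL and returning the decoded JSON dict.
    first_response: the JSON dict returned by the initial search request.
    """
    response = first_response
    polls = 0
    while "Waiting" in response:
        if polls >= max_polls:
            raise TimeoutError(f"search did not finish after {max_polls} polls")
        list_key = response["Waiting"]["ListKey"]
        sleep(interval)
        response = fetch_json(f"{PUG_REST_BASE}/compound/listkey/{list_key}/cids/JSON")
        polls += 1
    # Finished: the response carries the matching CIDs.
    return response.get("IdentifierList", {}).get("CID", [])


# Offline demonstration: canned responses with placeholder CIDs, no real HTTP calls.
_canned = iter([
    {"Waiting": {"ListKey": "12345", "Message": "Your request is running"}},
    {"IdentifierList": {"CID": [111, 222]}},
])
cids = poll_listkey(
    lambda url: next(_canned),
    {"Waiting": {"ListKey": "12345"}},
    sleep=lambda seconds: None,
)
print(cids)
```

Injecting `fetch_json` and `sleep` keeps the polling logic testable without network access; in real use the defaults would apply.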
Common Substructures:
Benzene ring: c1ccccc1
Pyridine ring: c1ccncc1
Phenol: c1ccc(O)cc1
Carboxylic acid: C(=O)O
Convert between different chemical structure formats:
import pubchempy as pcp
compound = pcp.get_compounds('aspirin', 'name')[0]
# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid
# Download structure files
pcp.download('SDF', 'aspirin', 'name', 'aspirin.sdf', overwrite=True)
pcp.download('JSON', '2244', 'cid', 'aspirin.json', overwrite=True)
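Before using a converted identifier as a lookup key elsewhere, a quick format check can catch copy-paste mistakes. The sketch below checks only the layout of a standard InChIKey (14 letters, 10 letters, 1 letter, separated by hyphens); it does not verify that the key matches a real compound. `is_valid_inchikey` is a hypothetical helper name.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_valid_inchikey(text):
    """Return True if `text` has the layout of a standard InChIKey."""
    return bool(INCHIKEY_PATTERN.fullmatch(text.strip()))


aspirin_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(is_valid_inchikey(aspirin_key))
print(is_valid_inchikey("CC(=O)OC1=CC=CC=C1C(=O)O"))  # A SMILES string, not an InChIKey
```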
Generate 2D structure images:
import pubchempy as pcp
# Download compound structure as PNG
pcp.download('PNG', 'caffeine', 'name', 'caffeine.png', overwrite=True)
# Using direct URL (via requests)
import requests
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url, timeout=30)
response.raise_for_status() # Avoid writing an error page to disk
with open('structure.png', 'wb') as f:
    f.write(response.content)
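When images are needed for many compounds, it can help to build the URL in one helper rather than repeating the f-string. This is a minimal sketch; `build_image_url` is a hypothetical name, and apart from `large`, which appears in the example above, the accepted `image_size` values (for instance explicit pixel dimensions) are an assumption to check against the PUG-REST documentation.

```python
from urllib.parse import urlencode

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_image_url(cid, image_size="large"):
    """Return the PNG image URL for a compound CID (hypothetical helper)."""
    cid = int(cid)
    if cid <= 0:
        raise ValueError("CID must be a positive integer")
    # urlencode escapes the option value so the query string stays valid.
    query = urlencode({"image_size": image_size})
    return f"{PUG_REST_BASE}/compound/cid/{cid}/PNG?{query}"


print(build_image_url(2244))
```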
Get all known names and synonyms for a compound:
import pubchempy as pcp
synonyms_data = pcp.get_synonyms('aspirin', 'name')
if synonyms_data:
    cid = synonyms_data[0]['CID']
    synonyms = synonyms_data[0]['Synonym']
    print(f"CID {cid} has {len(synonyms)} synonyms:")
    for syn in synonyms[:10]: # First 10
        print(f" - {syn}")
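Synonym lists mix trade names, systematic names, and registry numbers, so a common follow-up is to pick out the CAS Registry Numbers. The sketch below filters a synonym list by CAS layout and check digit. The sample list is hand-picked for illustration rather than a live API response (the last entry is deliberately invalid), and the helper names are hypothetical.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_cas_number(text):
    """Return True if `text` looks like a CAS number with a valid check digit."""
    match = CAS_PATTERN.fullmatch(text.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


def extract_cas_numbers(synonyms):
    """Keep only the synonyms that pass the CAS check."""
    return [s for s in synonyms if is_cas_number(s)]


# Hand-picked sample; "50-78-3" has a wrong check digit on purpose.
sample_synonyms = ["aspirin", "ACETYLSALICYLIC ACID", "50-78-2", "2-Acetoxybenzoic acid", "50-78-3"]
print(extract_cas_numbers(sample_synonyms))
```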
Retrieve biological activity data from assays:
import requests
# Get bioassay summary for a compound
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)
if response.status_code == 200:
    data = response.json()
    # Process bioassay information
    table = data.get('Table', {})
    rows = table.get('Row', [])
    print(f"Found {len(rows)} bioassay records")
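The rows in the assay summary are positional, so they are easier to work with once paired with the column names. The sketch below assumes the response lists column names under `Table.Columns.Column` and row values under each row's `Cell`, and that an outcome column named `Activity Outcome` is present; both are assumptions to confirm against a real response. The sample payload is hand-written with placeholder AIDs.

```python
from collections import Counter


def table_to_records(data):
    """Turn an assay summary table into a list of {column: value} dicts."""
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    return [dict(zip(columns, row.get("Cell", []))) for row in table.get("Row", [])]


def count_outcomes(records, column="Activity Outcome"):
    """Count records per outcome value (column name is an assumption)."""
    return Counter(record.get(column, "Unspecified") for record in records)


# Hand-written payload with placeholder AIDs (not live data).
sample = {"Table": {
    "Columns": {"Column": ["AID", "Activity Outcome"]},
    "Row": [
        {"Cell": ["1", "Active"]},
        {"Cell": ["2", "Inactive"]},
        {"Cell": ["3", "Inactive"]},
    ],
}}

records = table_to_records(sample)
counts = count_outcomes(records)
print(records[0])
print(dict(counts))
```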
For more complex bioactivity queries, use the scripts/bioactivity_query.py helper script.
Access detailed compound information through PUG-View:
import requests
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
response = requests.get(url)
if response.status_code == 200:
    annotations = response.json()
    # Contains extensive data including:
    # - Chemical and Physical Properties
    # - Drug and Medication Information
    # - Pharmacology and Biochemistry
    # - Safety and Hazards
    # - Toxicity
    # - Literature references
    # - Patents
Get Specific Section:
# Get only drug information
# Spaces in the heading must be URL-encoded ('+' or '%20')
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug+and+Medication+Information"
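PUG-View responses are deeply nested: a `Record` holds a list of `Section` entries, each with a `TOCHeading` and optional child sections. The sketch below walks that tree to list headings or locate one section by name. It assumes this nesting and uses a hand-written, heavily trimmed sample record rather than live data; the sub-heading names in the sample are illustrative, and the helper names are hypothetical.

```python
def find_section(sections, heading):
    """Depth-first search for the first section whose TOCHeading matches."""
    for section in sections:
        if section.get("TOCHeading") == heading:
            return section
        found = find_section(section.get("Section", []), heading)
        if found is not None:
            return found
    return None


def list_headings(sections, depth=0):
    """Return all headings as indented lines, one per section."""
    lines = []
    for section in sections:
        lines.append("  " * depth + section.get("TOCHeading", "?"))
        lines.extend(list_headings(section.get("Section", []), depth + 1))
    return lines


# Hand-written, trimmed sample in the assumed PUG-View shape.
sample = {"Record": {"RecordType": "CID", "RecordNumber": 2244, "Section": [
    {"TOCHeading": "Chemical and Physical Properties",
     "Section": [{"TOCHeading": "Computed Properties"}]},
    {"TOCHeading": "Safety and Hazards",
     "Section": [{"TOCHeading": "Hazards Identification"}]},
]}}

top_level = sample["Record"]["Section"]
print("\n".join(list_headings(top_level)))
print(find_section(top_level, "Hazards Identification"))
```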
Install PubChemPy for Python-based access:
pip install pubchempy
For direct API access and bioactivity queries:
pip install requests
Optional for data analysis:
pip install pandas
This skill includes Python scripts for common PubChem tasks:
scripts/compound_search.py provides utility functions for searching and retrieving compound information:
Key Functions:
search_by_name(name, max_results=10): Search compounds by name
search_by_smiles(smiles): Search by SMILES string
get_compound_by_cid(cid): Retrieve compound by CID
get_compound_properties(identifier, namespace, properties): Get specific properties
similarity_search(smiles, threshold, max_records): Perform similarity search
substructure_search(smiles, max_records): Perform substructure search
get_synonyms(identifier, namespace): Get all synonyms
batch_search(identifiers, namespace, properties): Batch search multiple compounds
download_structure(identifier, namespace, format, filename): Download structures
print_compound_info(compound): Print formatted compound information
Usage:
from scripts.compound_search import search_by_name, get_compound_properties
# Search for a compound
compounds = search_by_name('ibuprofen')
# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
scripts/bioactivity_query.py provides functions for retrieving biological activity data:
Key Functions:
get_bioassay_summary(cid): Get bioassay summary for compound
get_compound_bioactivities(cid, activity_outcome): Get filtered bioactivities
get_assay_description(aid): Get detailed assay information
get_assay_targets(aid): Get biological targets for assay
search_assays_by_target(target_name, max_results): Find assays by target
get_active_compounds_in_assay(aid, max_results): Get active compounds
get_compound_annotations(cid, section): Get PUG-View annotations
summarize_bioactivities(cid): Generate bioactivity summary statistics
find_compounds_by_bioactivity(target, threshold, max_compounds): Find compounds by target
Usage:
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities
# Get bioactivity summary
summary = summarize_bioactivities(2244) # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
Rate Limits:
Best Practices:
Error Handling:
from pubchempy import BadRequestError, NotFoundError, TimeoutError