Computational chemistry workflow guide for DeepChem, PySCF, RDKit, assay-table normalization, PDBbind-style structure datasets, QSAR and structure benchmarks, DrugBank lookup, ligand-only and structure-aware affinity prediction, ADMET triage, bioactivity prediction, virtual screening, and docking follow-up.
Use this skill when the user asks to:
Do not assume the chemistry stack is available. Check first.
which python3 || true
python3 - <<'PY'
mods = ["deepchem", "pyscf", "rdkit", "numpy", "pandas", "sklearn"]
for name in mods:
    try:
        __import__(name)
        print(f"{name}: ok")
    except Exception as exc:
        print(f"{name}: missing ({exc})")
PY
If key modules are missing, say so immediately and recommend the unified drug-sandbox image documented in docs/operations/science-runtime.md.
templates/deepchem_featurize.py
templates/pyscf_single_point.py
templates/rdkit_descriptors.py
templates/admet_screen.py
templates/assay_data_prepare.py
templates/pdbbind_prepare.py
templates/binding_affinity_predict.py
templates/bioactivity_predict.py
templates/drugbank_lookup.py
templates/protein_ligand_affinity.py
templates/protein_ligand_benchmark.py
templates/qsar_benchmark.py
templates/virtual_screen.py
Use these templates instead of rewriting the same chemistry scripts from scratch.
.npy, .csv, .joblib, or .json.
Use templates/deepchem_featurize.py for:
Quick start:
python3 templates/deepchem_featurize.py \
--smiles "CCO" "c1ccccc1" \
--featurizer circular \
--output-prefix chem/deepchem/demo
CSV input example:
python3 templates/deepchem_featurize.py \
--input ligands.csv \
--smiles-column smiles \
--id-column ligand_id \
--featurizer maccs \
--output-prefix chem/deepchem/ligands
Deliverables:
.npy feature matrix
.summary.csv with per-molecule stats
.json metadata with featurizer and shape
If the user asks for actual DeepChem neural models, verify the required backend first. Do not assume TensorFlow or PyTorch models are available just because deepchem imports.
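The backend check can be sketched with the standard library alone. This is a minimal sketch, not part of the bundled templates: the helper name `available_backends` and the candidate list are assumptions, and which backend a given DeepChem model class needs still has to be confirmed per model.

```python
from importlib.util import find_spec

# Candidate backend modules (assumed list; adjust to the model you plan to use).
BACKENDS = ("torch", "tensorflow", "jax")


def available_backends(candidates=BACKENDS):
    """Return the candidate modules that can be located, without importing them."""
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_backends()
    if find_spec("deepchem") is None:
        print("deepchem: missing")
    elif not found:
        print("deepchem imports, but no neural backend was found; featurizers only")
    else:
        print("neural backends found:", ", ".join(found))
```

Using `find_spec` avoids the cost and side effects of importing a large framework just to learn whether it is installed.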
Use templates/rdkit_descriptors.py for:
Quick start:
python3 templates/rdkit_descriptors.py \
--smiles "CCO" "c1ccccc1O" \
--output chem/rdkit/descriptors.csv \
--summary chem/rdkit/summary.json
CSV input example:
python3 templates/rdkit_descriptors.py \
--input ligands.csv \
--smiles-column smiles \
--id-column ligand_id \
--output chem/rdkit/ligands.csv \
--summary chem/rdkit/ligands.json
Deliverables:
Use templates/admet_screen.py for:
Quick start:
python3 templates/admet_screen.py \
--smiles "CCO" "CC(=O)Oc1ccccc1C(=O)O" \
--output chem/admet/screen.csv \
--summary chem/admet/summary.json
CSV input example:
python3 templates/admet_screen.py \
--input ligands.csv \
--smiles-column smiles \
--id-column ligand_id \
--output chem/admet/ligands.csv \
--summary chem/admet/ligands.json
Deliverables:
Treat this as heuristic triage. It is not a clinically validated ADMET predictor.
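To illustrate what heuristic triage means here, the sketch below applies Lipinski's rule of five to precomputed descriptors. It is an assumption that the bundled template uses anything like this rule set; the descriptor keys (`mol_wt`, `logp`, `hbd`, `hba`), the function names, and the example values are hypothetical.

```python
# Lipinski rule-of-five limits: a common heuristic, not the template's confirmed rules.
RO5_LIMITS = {"mol_wt": 500.0, "logp": 5.0, "hbd": 5, "hba": 10}


def ro5_violations(desc):
    """List the descriptor names whose value exceeds its rule-of-five limit."""
    return [name for name, limit in RO5_LIMITS.items() if desc[name] > limit]


def triage(desc, max_violations=1):
    """Return a coarse flag plus the violated rules so a report can show why."""
    violated = ro5_violations(desc)
    flag = "pass" if len(violated) <= max_violations else "review"
    return {"flag": flag, "violations": violated}


# Approximate, illustrative descriptor values for a small aspirin-like molecule.
small_molecule = {"mol_wt": 180.16, "logp": 1.2, "hbd": 1, "hba": 4}
print(triage(small_molecule))
```

Returning the violated rules alongside the flag keeps the output auditable, which matters when the result is only a triage signal.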
Use templates/assay_data_prepare.py for:
id, smiles, target style table
Example:
python3 templates/assay_data_prepare.py \
--input chembl_export.csv \
--source chembl \
--task regression \
--convert-nm-to-pactivity \
--output chem/data/chembl_normalized.csv \
--summary chem/data/chembl_normalized.json
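The `--convert-nm-to-pactivity` flag presumably applies the standard negative-log transform. Assuming that is the case, the arithmetic is pActivity = -log10(molar concentration) = 9 - log10(value in nM). The helper name below is hypothetical.

```python
import math


def nm_to_pactivity(value_nm):
    """Convert a concentration in nM to -log10(molar), i.e. 9 - log10(nM)."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9.0 - math.log10(value_nm)


# 1000 nM is 1 micromolar, which corresponds to a pActivity of 6.
print(nm_to_pactivity(1000.0))
```

Higher pActivity means a more potent compound, so the direction of any downstream threshold flips relative to raw nM values.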
BindingDB classification example:
python3 templates/assay_data_prepare.py \
--input bindingdb_hits.tsv \
--source bindingdb \
--task classification \
--activity-threshold 1000 \
--threshold-direction "<=" \
--label-positive binder \
--label-negative non_binder \
--output chem/data/bindingdb_binary.csv \
--summary chem/data/bindingdb_binary.json
Deliverables:
Do not silently mix incompatible assays or units. If the export combines unrelated targets or endpoints, split it first.
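One way to split a mixed export before normalization is to bucket rows by target, endpoint, and unit. This is a sketch under assumptions: the column names `target`, `endpoint`, and `unit` and the sample rows are hypothetical, and real ChEMBL or BindingDB exports use different headers.

```python
from collections import defaultdict


def split_by_assay(rows, keys=("target", "endpoint", "unit")):
    """Group rows so each bucket holds a single target/endpoint/unit combination."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[k] for k in keys)].append(row)
    return dict(buckets)


# Illustrative rows only; T1 and T2 are placeholder target names.
rows = [
    {"id": "a", "target": "T1", "endpoint": "IC50", "unit": "nM", "value": 12.0},
    {"id": "b", "target": "T1", "endpoint": "IC50", "unit": "nM", "value": 850.0},
    {"id": "c", "target": "T1", "endpoint": "Ki", "unit": "nM", "value": 3.0},
    {"id": "d", "target": "T2", "endpoint": "IC50", "unit": "uM", "value": 1.5},
]
buckets = split_by_assay(rows)
for key, members in buckets.items():
    print(key, [r["id"] for r in members])
```

If more than one bucket comes back, each bucket can be written to its own file and normalized separately.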
Use templates/pdbbind_prepare.py for:
complex_path or receptor_path + ligand_path tables
protein_ligand_affinity.py or protein_ligand_benchmark.py
Example:
python3 templates/pdbbind_prepare.py \
--root pdbbind/refined-set \
--index pdbbind/index/INDEX_refined_data.2020 \
--metadata pdbbind/pocket_groups.csv \
--output chem/data/pdbbind_normalized.csv \
--summary chem/data/pdbbind_normalized.json
Deliverables:
Use templates/qsar_benchmark.py for:
Example:
python3 templates/qsar_benchmark.py \
--input chem/data/chembl_normalized.csv \
--target-column target \
--task regression \
--split scaffold \
--feature-backend rdkit-morgan \
--algorithm rf \
--include-descriptors \
--metrics-output chem/benchmarks/affinity_metrics.json \
--predictions-output chem/benchmarks/affinity_predictions.csv \
--folds-output chem/benchmarks/affinity_folds.csv \
--model-output chem/models/affinity_from_benchmark.joblib
Deliverables:
Use scaffold split by default when chemical series leakage is a real risk.
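The idea behind a scaffold split is that all molecules sharing a scaffold land on the same side of the split. The sketch below shows one common greedy scheme and assumes scaffold keys were computed elsewhere (for example Murcko scaffold SMILES from RDKit). It is not claimed to match what `qsar_benchmark.py` does internally; `scaffold_split` is a hypothetical name.

```python
from collections import defaultdict


def scaffold_split(scaffolds, train_fraction=0.8):
    """Split row indices so that no scaffold is shared between train and test.

    Groups are visited largest first; a group joins train if it still fits
    within the train budget, otherwise it goes to test.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    budget = train_fraction * len(scaffolds)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= budget:
            train.extend(members)
        else:
            test.extend(members)
    return sorted(train), sorted(test)


# One precomputed scaffold key per molecule (illustrative values).
keys = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccncc1", "C1CCCCC1"]
train_idx, test_idx = scaffold_split(keys, train_fraction=0.7)
print(train_idx, test_idx)
```

Because whole groups move together, the realized train fraction only approximates the requested one.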
Use templates/binding_affinity_predict.py for:
Training example:
python3 templates/binding_affinity_predict.py \
--train affinity_train.csv \
--smiles-column smiles \
--id-column ligand_id \
--target-column affinity \
--feature-backend deepchem-circular \
--algorithm et \
--include-descriptors \
--model-output chem/models/affinity.joblib \
--metrics-output chem/models/affinity_metrics.json
Inference example:
python3 templates/binding_affinity_predict.py \
--model-input chem/models/affinity.joblib \
--predict screening_library.csv \
--smiles-column smiles \
--id-column ligand_id \
--predictions-output chem/predictions/affinity.csv
Deliverables:
.joblib model bundle
Assumptions:
Use templates/protein_ligand_affinity.py for:
complex_path or receptor_path + ligand_path
Training example:
python3 templates/protein_ligand_affinity.py \
--train structure_affinity_train.csv \
--id-column id \
--complex-path-column complex_path \
--smiles-column smiles \
--target-column affinity \
--algorithm rf \
--metrics-output chem/benchmarks/protein_affinity_metrics.json \
--features-output chem/benchmarks/protein_affinity_features.csv \
--model-output chem/models/protein_affinity.joblib
Prediction example on docking outputs:
python3 templates/protein_ligand_affinity.py \
--model-input chem/models/protein_affinity.joblib \
--predict docking/results/analysis/docking_summary.csv \
--id-column ligand_slug \
--complex-path-column complex_path \
--predictions-output chem/predictions/protein_affinity.csv
Deliverables:
Treat this as a structure-aware baseline. It is still limited by complex quality and docking pose quality.
Use templates/protein_ligand_benchmark.py for:
Example:
python3 templates/protein_ligand_benchmark.py \
--input chem/data/pdbbind_normalized.csv \
--split group \
--group-column target_group \
--algorithm rf \
--metrics-output chem/benchmarks/protein_affinity_metrics.json \
--predictions-output chem/benchmarks/protein_affinity_predictions.csv \
--folds-output chem/benchmarks/protein_affinity_folds.csv \
--model-output chem/models/protein_affinity_benchmark.joblib
Deliverables:
Prefer group split when the benchmark should punish target-family leakage instead of only ligand-series leakage.
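A group split keeps every member of a group, such as a target family, inside one fold. The following is a minimal sketch of a greedy fold assignment, assuming the group labels come from a column like `target_group`. The function name and labels are hypothetical, and the template's own fold logic may differ.

```python
from collections import Counter


def group_folds(groups, n_folds=3):
    """Assign each row a fold index so that every group stays in a single fold.

    Groups are placed largest first into whichever fold is currently smallest.
    """
    sizes = Counter(groups)
    fold_sizes = [0] * n_folds
    fold_of_group = {}
    for group, count in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = target
        fold_sizes[target] += count
    return [fold_of_group[g] for g in groups]


# Illustrative group labels, one per complex.
target_group = ["kinase", "kinase", "kinase", "protease", "protease",
                "gpcr", "nuclear_receptor"]
folds = group_folds(target_group, n_folds=3)
print(folds)
```

With few large groups the folds can be quite unbalanced, so it is worth reporting fold sizes next to the metrics.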
Use templates/drugbank_lookup.py for:
Example:
python3 templates/drugbank_lookup.py \
--catalog drugbank_export.csv \
--query imatinib \
--output chem/drugbank/imatinib_hits.csv \
--summary chem/drugbank/imatinib_summary.json \
--top-hit-json chem/drugbank/imatinib.json \
--sdf-output chem/drugbank/imatinib.sdf
Deliverables:
Treat this as licensed local-catalog search. Do not imply that DrugBank can be scraped anonymously at runtime.
Online example:
DRUGBANK_API_KEY=... \
python3 templates/drugbank_lookup.py \
--mode online \
--query imatinib \
--summary chem/drugbank/imatinib_online_summary.json \
--top-hit-json chem/drugbank/imatinib_online.json
Use --api-token or DRUGBANK_API_TOKEN when you need the token-based browser-compatible endpoint instead of the default API-key flow.
Use templates/bioactivity_predict.py for:
Classification example:
python3 templates/bioactivity_predict.py \
--train bioactivity_train.csv \
--smiles-column smiles \
--id-column ligand_id \
--target-column active \
--task classification \
--feature-backend rdkit-morgan \
--algorithm rf \
--include-descriptors \
--model-output chem/models/bioactivity.joblib \
--metrics-output chem/models/bioactivity_metrics.json
Prediction example:
python3 templates/bioactivity_predict.py \
--model-input chem/models/bioactivity.joblib \
--predict screening_library.csv \
--smiles-column smiles \
--id-column ligand_id \
--predictions-output chem/predictions/bioactivity.csv
Deliverables:
State the training label definition in the report, for example active, binder, pIC50, or IC50_nM.
Use templates/virtual_screen.py for:
protein_ligand_affinity.py
Example:
python3 templates/virtual_screen.py \
--input screening_library.csv \
--smiles-column smiles \
--id-column ligand_id \
--admet-csv chem/admet/ligands.csv \
--affinity-csv chem/predictions/protein_affinity.csv \
--affinity-model chem/models/affinity.joblib \
--bioactivity-model chem/models/bioactivity.joblib \
--docking-csv docking/results/summary.csv \
--docking-id-column ligand_id \
--docking-score-column best_score \
--output chem/screening/ranked.csv \
--summary chem/screening/summary.json
Deliverables:
Report the weights used for affinity, activity, ADMET, and docking. If only one signal is available, say so instead of presenting the rank as a multi-factor screen.
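The reporting rule above can be made concrete with a weighted mean that renormalizes over whichever signals are present and returns the weights actually used. This is a sketch under assumptions: the weights, the [0, 1] scaling of each signal, and the name `combine_signals` are hypothetical and not taken from `virtual_screen.py`.

```python
def combine_signals(signals, weights):
    """Weighted mean over the signals that are actually present.

    signals: name -> score already scaled to [0, 1], or None when unavailable.
    weights: name -> non-negative weight.
    Returns (score, used_weights) so the report can state what fed the rank.
    """
    present = {k: v for k, v in signals.items()
               if v is not None and weights.get(k, 0) > 0}
    if not present:
        raise ValueError("no usable signal")
    total = sum(weights[k] for k in present)
    used = {k: weights[k] / total for k in present}
    score = sum(used[k] * present[k] for k in present)
    return score, used


# Hypothetical weights; activity and docking are missing for this ligand.
weights = {"affinity": 0.4, "activity": 0.3, "admet": 0.2, "docking": 0.1}
score, used = combine_signals(
    {"affinity": 0.8, "activity": None, "admet": 0.5, "docking": None}, weights
)
print(round(score, 3), used)
if len(used) == 1:
    print("single-signal ranking; do not describe it as multi-factor")
```

Returning `used` alongside the score makes it straightforward to print the effective weights in the summary.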
Use templates/pyscf_single_point.py for:
Quick start:
python3 templates/pyscf_single_point.py \
--atom "O 0 0 0; H 0 0 0.96; H 0.92 0 -0.24" \
--basis sto-3g \
--method rhf \
--output chem/pyscf/water_rhf.json
XYZ input example:
python3 templates/pyscf_single_point.py \
--xyz ligand.xyz \
--basis 6-31g* \
--method rks \
--xc b3lyp \
--output chem/pyscf/ligand_b3lyp.json
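The two input styles carry the same information: an XYZ file holds an atom count, a comment line, and then one `symbol x y z` row per atom, while `--atom` takes the rows joined with semicolons. The converter below is a hypothetical helper for moving between them, not part of the bundled template.

```python
def xyz_to_atom_string(xyz_text):
    """Turn XYZ file contents into 'El x y z; El x y z' (coordinates unchanged)."""
    lines = xyz_text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    atom_lines = lines[2:2 + n_atoms]  # the second line is a free-form comment
    if len(atom_lines) != n_atoms:
        raise ValueError("atom count does not match header")
    parts = []
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        float(x), float(y), float(z)  # validate only; keep the original precision
        parts.append(f"{symbol} {x} {y} {z}")
    return "; ".join(parts)


water_xyz = """3
water
O 0 0 0
H 0 0 0.96
H 0.92 0 -0.24
"""
print(xyz_to_atom_string(water_xyz))
```

The coordinate tokens are passed through as written so that no precision is lost in the conversion.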
Report at minimum:
./chem/.
deepchem missing: cannot featurize with the bundled template
pyscf missing: cannot run QM calculations
bio-tools.
pharma-db-tools.
pharma-ml-tools.
docking-tools.