Map media ingredient names to ontology terms (CHEBI/FOODON/ENVO) using exact, normalized, and fuzzy matching strategies
Purpose: Map microbial growth media ingredient names to authoritative ontology terms for semantic integration and knowledge graph construction.
Why: Ensures consistency across datasets, enables semantic queries, links to external knowledge bases (KG-Microbe, CHEBI, FOODON, ENVO), and supports cross-study analysis.
Scope: Chemical compounds (salts, organics), biological materials (extracts, peptones), environmental samples (soil, seawater).
Use this skill when:
| Ingredient Type | Primary Ontology | Examples | When to Use |
|---|---|---|---|
| Simple chemicals | CHEBI | NaCl, glucose, MgSO4•7H2O, K2HPO4 | Pure chemical compounds, salts, ions |
| Biological materials | FOODON | yeast extract, peptone, tryptone, beef extract | Biological preparations, food-derived |
| Environmental samples | ENVO | soil extract, seawater, sediment | Natural environmental materials |
| Complex mixtures | May be unmappable | "Vitamin solution A", "Trace metals" | Often too generic for specific mapping |
Ontology priority order: CHEBI → FOODON → ENVO
Try the ontologies in this order and stop at the first match. Fall back to the next ontology only when the current one returns no match.
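The priority order above can be sketched as a simple cascade. This is a minimal illustration, not the project's implementation: `first_match`, `toy_search`, and the toy index are hypothetical names, and `ENVO:PLACEHOLDER` is not a real term ID.

```python
from typing import Callable, Optional, Tuple

# Priority order as stated above.
ONTOLOGY_PRIORITY = ["CHEBI", "FOODON", "ENVO"]

def first_match(
    name: str,
    search: Callable[[str, str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Return (ontology, term_id) for the first ontology that yields a hit."""
    for ontology in ONTOLOGY_PRIORITY:
        term_id = search(name, ontology)
        if term_id is not None:
            return ontology, term_id
    return None  # Unmappable by this strategy

# Toy index standing in for a real ontology search (assumption).
# "ENVO:PLACEHOLDER" is not a real term ID.
_TOY_INDEX = {
    ("sodium chloride", "CHEBI"): "CHEBI:26710",
    ("seawater", "ENVO"): "ENVO:PLACEHOLDER",
}

def toy_search(name: str, ontology: str) -> Optional[str]:
    return _TOY_INDEX.get((name.lower(), ontology))

print(first_match("Sodium chloride", toy_search))
print(first_match("seawater", toy_search))
print(first_match("Vitamin solution A", toy_search))
```

Passing the search function as an argument keeps the cascade independent of whichever backend (OAK, OLS, or a local table) performs the lookup.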
Many ingredient names require normalization before ontology matching. The chemical_normalizer.py utility handles common patterns:
Pattern: Remove hydration notation (•, ·, .) and water count
MgSO4•7H2O → MgSO4 → search as "magnesium sulfate"
CaCl2·2H2O → CaCl2 → search as "calcium chloride"
FeSO4.7H2O → FeSO4 → search as "ferrous sulfate"
Na2HPO4 dihydrate → Na2HPO4 → search as "disodium phosphate"
Synonym preservation: Original hydrate form is saved as synonym with type HYDRATE_FORM
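The hydrate pattern can be approximated with two regular expressions. The sketch below is an assumption about how such a step might look, not the code in chemical_normalizer.py; `strip_hydrate` and the synonym dict layout are hypothetical.

```python
import re

# Separator (•, ·, or .) followed by an optional water count and H2O at the end.
_HYDRATE_FORMULA = re.compile(r"\s*[•·.]\s*\d*\s*H2O\s*$")
# Trailing words such as "hydrate", "dihydrate", or "heptahydrate".
_HYDRATE_WORD = re.compile(
    r"\s+(?:mono|di|tri|tetra|penta|hexa|hepta|octa|nona|deca)?hydrate\s*$",
    re.IGNORECASE,
)

def strip_hydrate(name):
    """Return (anhydrous_name, synonyms), keeping the original as HYDRATE_FORM."""
    original = name.strip()
    stripped = _HYDRATE_FORMULA.sub("", original)
    stripped = _HYDRATE_WORD.sub("", stripped).strip()
    synonyms = []
    if stripped != original:
        # Preserve the hydrate spelling so it is not lost after normalization.
        synonyms.append({"text": original, "type": "HYDRATE_FORM"})
    return stripped, synonyms

for example in ["MgSO4•7H2O", "CaCl2·2H2O", "FeSO4.7H2O", "Na2HPO4 dihydrate"]:
    print(example, "->", strip_hydrate(example)[0])
```

Both patterns are anchored at the end of the string, so a formula that merely contains a dot elsewhere is left alone.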
Pattern: Add missing atoms to common incomplete formulas
K2HPO → K2HPO4 → "dipotassium phosphate"
Na2HPO → Na2HPO4 → "disodium phosphate"
NaHPO → NaH2PO4 → "sodium dihydrogen phosphate"
Synonym preservation: Incomplete form saved as INCOMPLETE_FORMULA synonym
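One way to picture the formula repair is an exact-key lookup table. The table below holds only the three examples listed above, and `repair_formula` is a hypothetical helper rather than the utility's actual function.

```python
# Hypothetical repair table with only the three examples from the text.
INCOMPLETE_FORMULA_FIXES = {
    "K2HPO": "K2HPO4",
    "Na2HPO": "Na2HPO4",
    "NaHPO": "NaH2PO4",
}

def repair_formula(formula):
    """Return (repaired_formula, synonyms); unknown formulas pass through unchanged."""
    key = formula.strip()
    fixed = INCOMPLETE_FORMULA_FIXES.get(key)
    if fixed is None:
        return key, []
    # Keep the incomplete spelling as a typed synonym.
    return fixed, [{"text": key, "type": "INCOMPLETE_FORMULA"}]

print(repair_formula("K2HPO"))
print(repair_formula("K2HPO4"))
```

Matching on the whole key, rather than on a prefix, means an already complete formula such as K2HPO4 is never rewritten.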
Pattern: Strip supplier/catalog information
NaCl (Fisher S271-500) → NaCl → "sodium chloride"
Glucose (Sigma G7021) → Glucose → "glucose"
Agar (BD 214010) → Agar → "agar"
Synonym preservation: Catalog variant saved as CATALOG_VARIANT synonym
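Catalog stripping can be sketched with a heuristic: treat a trailing parenthetical that contains at least one digit as supplier information. That heuristic is an assumption made for this example (it would also strip a trailing concentration such as "(1 mg/L)"), and `strip_catalog` is a hypothetical name.

```python
import re

# Trailing "( ... )" whose content includes a digit, e.g. "(Fisher S271-500)".
_CATALOG_SUFFIX = re.compile(r"\s*\((?=[^()]*\d)[^()]*\)\s*$")

def strip_catalog(name):
    """Return (clean_name, synonyms), keeping the original as CATALOG_VARIANT."""
    original = name.strip()
    stripped = _CATALOG_SUFFIX.sub("", original).strip()
    synonyms = []
    if stripped != original:
        synonyms.append({"text": original, "type": "CATALOG_VARIANT"})
    return stripped, synonyms

for example in ["NaCl (Fisher S271-500)", "Glucose (Sigma G7021)", "Iron(III) chloride"]:
    print(example, "->", strip_catalog(example)[0])
```

Requiring a digit and an end-of-string position leaves oxidation states such as "Iron(III) chloride" untouched.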
Pattern: Expand common abbreviations to full names
dH2O → distilled water → "water"
NaOAc → sodium acetate
KOAc → potassium acetate
Built-in mappings for common chemical formulas:
FORMULA_TO_NAME = {
    'NaCl': 'sodium chloride',
    'MgSO4': 'magnesium sulfate',
    'CaCl2': 'calcium chloride',
    'KCl': 'potassium chloride',
    'K2HPO4': 'dipotassium phosphate',
    # ... 40+ common chemicals
}
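A lookup over tables like these might be sketched as follows. The entries are copied from the examples above; `ABBREVIATIONS` and `to_search_name` are hypothetical names, and the real tables are larger.

```python
# Entries copied from the examples in the text; the real tables are larger.
ABBREVIATIONS = {
    "dH2O": "distilled water",
    "NaOAc": "sodium acetate",
    "KOAc": "potassium acetate",
}
FORMULA_TO_NAME = {
    "NaCl": "sodium chloride",
    "MgSO4": "magnesium sulfate",
    "CaCl2": "calcium chloride",
    "KCl": "potassium chloride",
    "K2HPO4": "dipotassium phosphate",
}

def to_search_name(token):
    """Expand an abbreviation or formula to a name; return the input if unknown."""
    key = token.strip()
    if key in ABBREVIATIONS:
        return ABBREVIATIONS[key]
    return FORMULA_TO_NAME.get(key, key)

print(to_search_name("NaOAc"))
print(to_search_name("MgSO4"))
print(to_search_name("yeast extract"))
```

Keys are matched case-sensitively in this sketch because case carries meaning in formulas (Co is cobalt, CO is carbon monoxide).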
Direct string match with ontology labels (case-insensitive).
Using OAK (Ontology Access Kit):
from oaklib import get_adapter
# Search CHEBI
adapter = get_adapter("sqlite:obo:chebi")
results = adapter.basic_search("sodium chloride")
for entity_id in results:
    label = adapter.label(entity_id)
    print(f"{entity_id}: {label}")
# Output: CHEBI:26710: sodium chloride
Using OntologyClient:
from mediaingredientmech.utils.ontology_client import OntologyClient
client = OntologyClient()
results = client.search("glucose", sources=["CHEBI"])
for candidate in results:
    print(f"{candidate.ontology_id}: {candidate.label} (score: {candidate.score})")
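To make the case-insensitive label match concrete without an ontology download, the sketch below matches against a small in-memory table. `build_label_index` and `exact_match` are hypothetical helpers; the two CHEBI IDs are the ones quoted elsewhere in this document.

```python
def _normalize(text):
    # Case-fold and collapse whitespace so "Sodium  Chloride" equals "sodium chloride".
    return " ".join(text.casefold().split())

def build_label_index(terms):
    """Map normalized label -> term ID for an iterable of (term_id, label) pairs."""
    index = {}
    for term_id, label in terms:
        index.setdefault(_normalize(label), term_id)  # First label wins on collision
    return index

def exact_match(name, index):
    """Return the term ID whose label equals the name, ignoring case, or None."""
    return index.get(_normalize(name))

index = build_label_index([
    ("CHEBI:26710", "sodium chloride"),
    ("CHEBI:32599", "magnesium sulfate"),
])
print(exact_match("Sodium Chloride", index))
print(exact_match("sodium chlorid", index))
```

A near miss such as "sodium chlorid" returns None here, which is the case the normalized and fuzzy strategies are meant to cover.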
Apply chemical normalization, then search with all variants.
Using chemical_normalizer:
from mediaingredientmech.utils.chemical_normalizer import (
normalize_chemical_name,
generate_search_variants
)
# Normalize
result = normalize_chemical_name("MgSO4•7H2O")
print(f"Original: {result.original}")
print(f"Normalized: {result.normalized}")
print(f"Variants: {result.variants}")
# Variants: ['magnesium sulfate', 'MgSO4', 'magnesium sulphate']
# Search with all variants
client = OntologyClient()
matches = client.search_with_variants(result.variants, sources=["CHEBI"])
Multi-variant search (deduplicates results, keeps best scores):
# Generate comprehensive search variants
variants = generate_search_variants("MgSO4•7H2O")
# Returns: ['MgSO4•7H2O', 'MgSO4', 'magnesium sulfate', 'magnesium sulphate']
# Search with all variants, deduplicate results
results = client.search_with_variants(variants, sources=["CHEBI"], max_results=10)
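The deduplication step can be illustrated independently of the client. The sketch below is an assumption about the behaviour described (one entry per ontology ID, highest score kept), not the body of `search_with_variants`; the `Candidate` class only mirrors the three fields printed earlier, and the scores and `CHEBI:PLACEHOLDER` are made-up values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    ontology_id: str
    label: str
    score: float

def merge_candidates(result_lists, max_results=10):
    """Keep the best-scoring candidate per ontology ID, sorted by score descending."""
    best = {}
    for results in result_lists:
        for cand in results:
            current = best.get(cand.ontology_id)
            if current is None or cand.score > current.score:
                best[cand.ontology_id] = cand
    ranked = sorted(best.values(), key=lambda c: c.score, reverse=True)
    return ranked[:max_results]

# One result list per search variant (toy scores, placeholder second ID).
per_variant = [
    [Candidate("CHEBI:32599", "magnesium sulfate", 0.80)],
    [Candidate("CHEBI:32599", "magnesium sulfate", 0.95),
     Candidate("CHEBI:PLACEHOLDER", "some other term", 0.50)],
]
for cand in merge_candidates(per_variant):
    print(cand.ontology_id, cand.score)
```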
Use fuzzy/lexical search for synonym matching and spelling variations.
Using OAK fuzzy search:
# Lexical search (prefix: l~)
uv run runoak -i sqlite:obo:chebi info "l~magnesium sulphate"
# Search with synonyms
uv run runoak -i sqlite:obo:chebi search "magnesium sulfate"
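For a quick offline check of spelling variants, Python's standard-library `difflib` can stand in for a fuzzy matcher. This is a swapped-in technique for illustration, not the algorithm OAK uses; `fuzzy_lookup`, the label table, and the 0.85 cutoff are assumptions.

```python
import difflib

# Small label table; the IDs are the ones quoted in this document.
LABELS = {
    "magnesium sulfate": "CHEBI:32599",
    "sodium chloride": "CHEBI:26710",
}

def fuzzy_lookup(query, labels, cutoff=0.85):
    """Return (term_id, label, similarity) for labels close to the query."""
    needle = query.lower()
    matches = difflib.get_close_matches(needle, list(labels), n=3, cutoff=cutoff)
    hits = []
    for label in matches:
        ratio = difflib.SequenceMatcher(None, needle, label).ratio()
        hits.append((labels[label], label, round(ratio, 3)))
    return hits

print(fuzzy_lookup("magnesium sulphate", LABELS))
print(fuzzy_lookup("yeast extract", LABELS))
```

A high cutoff keeps British and American spellings together while rejecting unrelated names; the right value depends on the label set and should be tuned.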
Using EBI OLS API (web-based, requires internet):
# Basic search
curl "https://www.ebi.ac.uk/ols4/api/search?q=magnesium%20sulfate&ontology=chebi"
# Python wrapper
import requests

def ols_search(query, ontology="chebi"):
    """Query the EBI OLS4 search endpoint and return the parsed JSON response."""
    url = "https://www.ebi.ac.uk/ols4/api/search"
    params = {"q": query, "ontology": ontology}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors
    return response.json()

results = ols_search("magnesium sulfate", "chebi")
for doc in results['response']['docs']:
    print(f"{doc['obo_id']}: {doc['label']}")
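When consuming the search response, a defensive parser avoids a KeyError if a document lacks one of the expected fields. The sketch below works on a hand-written sample shaped like the response used above; `extract_hits` is a hypothetical helper, and whether a given OLS result omits `obo_id` is an assumption.

```python
def extract_hits(payload):
    """Return (obo_id, label) pairs from an OLS-style search response dict."""
    docs = payload.get("response", {}).get("docs", [])
    hits = []
    for doc in docs:
        obo_id = doc.get("obo_id")
        if obo_id is None:
            continue  # Skip entries without a CURIE-style identifier
        hits.append((obo_id, doc.get("label", "")))
    return hits

# Hand-written sample, not a captured API response.
sample = {"response": {"docs": [
    {"obo_id": "CHEBI:32599", "label": "magnesium sulfate"},
    {"label": "entry without an identifier"},
]}}
print(extract_hits(sample))
print(extract_hits({}))
```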
When to use OLS vs OAK:
When automated matching fails, manual expert curation is required.
Use manual curation when:
MediaIngredientMech integrates with CultureMech (primary integration point) for KG-Microbe knowledge graph construction.
CultureMech → MediaIngredientMech (curate) → Export back to CultureMech
# Import ingredients from CultureMech
python scripts/import_from_culturemech.py \
--culturemech-dir /path/to/CultureMech/data \
--output data/ingredients
# Export curated mappings back
python scripts/export_to_culturemech.py \
--input data/ingredients/mapped \
--output /path/to/CultureMech/data/ingredients
Before creating new mappings, check whether the ingredient already exists in KG-Microbe:
# Search for existing ingredient entities
# Pattern: mediadive.ingredient:*
# Check media_dive_ingredients in CultureMech data
# If found, use existing ID instead of creating duplicate
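The duplicate check can be sketched as a name lookup that only accepts IDs with the `mediadive.ingredient:` prefix. The data layout (a name-to-ID dict) and the IDs ending in `EXAMPLE` are assumptions for illustration; the actual CultureMech files may be structured differently.

```python
INGREDIENT_PREFIX = "mediadive.ingredient:"

def find_existing_id(name, existing):
    """Return an existing ingredient ID for the name, or None if there is none.

    `existing` maps ingredient names to IDs (hypothetical layout).
    """
    wanted = name.strip().casefold()
    for known_name, ingredient_id in existing.items():
        if known_name.strip().casefold() != wanted:
            continue
        if ingredient_id.startswith(INGREDIENT_PREFIX):
            return ingredient_id
    return None

# Placeholder IDs, not real KG-Microbe entries.
existing = {
    "Yeast extract": "mediadive.ingredient:EXAMPLE1",
    "NaCl": "mediadive.ingredient:EXAMPLE2",
}
print(find_existing_id("yeast extract", existing))
print(find_existing_id("peptone", existing))
```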
Ingredients in KG-Microbe are linked via:
For one-off ingredient mapping with manual review:
# Interactive curation with GUI
python scripts/curate_unmapped.py
# Shows:
# - Original name and normalized variants
# - Top candidates from each ontology
# - Context (how many media use this ingredient)
# - Accept, skip, or mark for expert review
Process:
For auto-curable simple chemicals with high confidence:
# Analyze unmapped ingredients first
python scripts/analyze_unmapped.py
# Batch curate simple chemicals
python scripts/batch_curate_unmapped.py \
--category SIMPLE_CHEMICAL \
--auto-normalize \
--min-confidence 0.9 \
--dry-run
# Review dry-run results, then apply
python scripts/batch_curate_unmapped.py \
--category SIMPLE_CHEMICAL \
--auto-normalize \
--min-confidence 0.9
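The effect of the confidence threshold and the dry-run flag can be pictured with a small filter. This is not the implementation of batch_curate_unmapped.py; `plan_batch`, the candidate dict layout, and the scores are assumptions.

```python
def plan_batch(candidates, min_confidence=0.9, dry_run=True):
    """Split candidates by confidence; apply nothing when dry_run is set."""
    accepted = [c for c in candidates if c["score"] >= min_confidence]
    deferred = [c for c in candidates if c["score"] < min_confidence]
    applied = [] if dry_run else list(accepted)
    return {"accepted": accepted, "deferred": deferred, "applied": applied}

# Toy candidates with made-up scores.
candidates = [
    {"name": "NaCl", "ontology_id": "CHEBI:26710", "score": 0.98},
    {"name": "Trace metals", "ontology_id": None, "score": 0.40},
]

preview = plan_batch(candidates, dry_run=True)
print([c["name"] for c in preview["accepted"]], preview["applied"])

final = plan_batch(candidates, dry_run=False)
print([c["name"] for c in final["applied"]])
```

Running the same plan twice, first with `dry_run=True`, mirrors the review-then-apply sequence of the two commands above.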
Best for:
Features:
For complex ingredients requiring LLM reasoning (zero-cost, no API):
Step 1: Prepare batch
# Prepare ingredients for Claude Code review
python scripts/prepare_for_claude_curation.py \
--category UNKNOWN \
--limit 20 \
--output notes/batch_001.md
This creates a markdown file with:
Step 2: Ask Claude Code
Open notes/batch_001.md and ask Claude Code:
"Please analyze these ingredients and suggest ontology mappings"
Claude Code will:
Step 3: Apply suggestions
# Apply Claude Code suggestions
python scripts/apply_claude_suggestions.py \
--suggestions notes/batch_001_suggestions.yaml \
--validate
# Validation checks:
# - Term exists in specified ontology
# - Confidence score is reasonable
# - Quality level is appropriate
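The three validation checks might look like the sketch below. It is an illustration only: the suggestion fields `confidence` and `quality`, the set of known terms, and the allowed quality levels (including `CLOSE_MATCH`) are hypothetical, and apply_claude_suggestions.py may validate differently.

```python
def validate_suggestion(suggestion, known_terms, allowed_quality):
    """Return a list of problems; an empty list means the suggestion passes."""
    errors = []
    term_id = suggestion["ontology_id"]
    source = suggestion["ontology_source"]
    # Check 1: the term exists in the specified ontology.
    if term_id.split(":", 1)[0] != source:
        errors.append("ID prefix does not match ontology_source")
    elif term_id not in known_terms:
        errors.append("term not found in ontology")
    # Check 2: the confidence score is a sensible probability.
    if not 0.0 <= suggestion["confidence"] <= 1.0:
        errors.append("confidence outside [0, 1]")
    # Check 3: the quality level is one of the allowed values.
    if suggestion["quality"] not in allowed_quality:
        errors.append("unknown quality level")
    return errors

known_terms = {"CHEBI:32599", "CHEBI:26710"}
allowed_quality = {"EXACT_MATCH", "CLOSE_MATCH"}  # Hypothetical set

good = {"ontology_id": "CHEBI:32599", "ontology_source": "CHEBI",
        "confidence": 0.95, "quality": "EXACT_MATCH"}
bad = {"ontology_id": "CHEBI:32599", "ontology_source": "FOODON",
       "confidence": 1.5, "quality": "GUESS"}
print(validate_suggestion(good, known_terms, allowed_quality))
print(validate_suggestion(bad, known_terms, allowed_quality))
```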
Benefits:
See docs/CLAUDE_CODE_CURATION.md for a detailed guide.
Input: MgSO4•7H2O
Process:
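1. Strip hydrate notation: MgSO4
2. Look up formula: MgSO4 → magnesium sulfate
3. Generate variants: ['MgSO4', 'magnesium sulfate', 'magnesium sulphate']
4. Search CHEBI: CHEBI:32599 (magnesium sulfate)
5. Record match type: EXACT_MATCH
6. Preserve MgSO4•7H2O as synonym type HYDRATE_FORM

Result: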
ontology_id: CHEBI:32599
ontology_label: magnesium sulfate
ontology_source: CHEBI