Build the Metaforge lexicon database from raw linguistic sources. This is a rare operation — use it when creating the database from scratch, reproducing the build for a licensing audit, or verifying data provenance. Use this skill when the user mentions creating the lexicon database, a missing PRE_ENRICH.sql, searching for PRE_ENRICH.sql, building the base database, or importing from scratch, or when they need to understand where the raw data comes from.
Builds lexicon_v2.db from raw linguistic data sources. This is a one-time
operation — once built, updates and enrichments are managed via the
metaforge-pipeline-management skill.
Optionally outputs a snapshot as PRE_ENRICH.sql. Use this full-text dump of lexicon_v2.db to restore the database to its post-import state, before any Metaforge-specific modifications.
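Restoring from the dump can be sketched as follows (a minimal sketch, assuming PRE_ENRICH.sql is a plain SQL text dump; the helper name is illustrative):

```python
import sqlite3
from pathlib import Path

def restore_from_dump(dump_path: str, db_path: str) -> None:
    """Recreate a database from a full-text SQL dump (e.g. PRE_ENRICH.sql)."""
    Path(db_path).unlink(missing_ok=True)  # start from an empty database file
    con = sqlite3.connect(db_path)
    con.executescript(Path(dump_path).read_text())  # replay the entire dump
    con.commit()
    con.close()
```

The same effect is available from the shell with `sqlite3 new.db < dump.sql`; the Python form is convenient inside pipeline scripts.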
See also: data-pipeline/CLAUDE.md for architecture overview and key concepts.
| Source | Description | Expected Path | Download URL | Licence |
|---|---|---|---|---|
| OEWN (via sqlunet) | Synsets, lemmas, relations | data-pipeline/raw/sqlunet_master.db | TODO | TODO |
| Brysbaert GPT Familiarity | Word familiarity ratings | data-pipeline/input/multilex-en/*.xlsx | TODO | TODO |
| SUBTLEX-UK | Subtitle word frequencies | data-pipeline/input/subtlex-uk/*.xlsx | TODO | TODO |
| SyntagNet | Collocation pairs | (bundled in sqlunet) | TODO | TODO |
| VerbNet | Verb classes, roles, examples | (bundled in sqlunet) | TODO | TODO |
| FastText (wiki-news-300d) | Word embeddings (300d) | ~/.local/share/metaforge/wiki-news-300d-1M.vec | TODO | TODO |
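Before running the build, the presence of these inputs can be checked with a short sketch. The patterns mirror the Expected Path column above; the helper name and the idea of a pre-flight check are assumptions, not part of the pipeline.

```python
from glob import glob
from pathlib import Path

# Patterns taken from the Expected Path column of the source table.
REQUIRED = [
    "data-pipeline/raw/sqlunet_master.db",
    "data-pipeline/input/multilex-en/*.xlsx",
    "data-pipeline/input/subtlex-uk/*.xlsx",
    str(Path.home() / ".local/share/metaforge/wiki-news-300d-1M.vec"),
]

def missing_sources(patterns=REQUIRED) -> list[str]:
    """Return the glob patterns that match no file on disk."""
    return [p for p in patterns if not glob(p)]
```

Run it from the repo root; a non-empty result means the build will fail partway through an import step.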
Large files live in ~/.local/share/metaforge/, NOT in the repo. Worktrees symlink into the shared location:

- data-pipeline/raw/wiki-news-300d-1M.vec → ~/.local/share/metaforge/wiki-news-300d-1M.vec
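Creating that worktree symlink can be sketched with a small helper (the helper name is hypothetical; the filename matches the FastText vector file named in this skill):

```python
from pathlib import Path

def link_shared(repo_dir: Path, shared_dir: Path, name: str) -> Path:
    """Symlink repo_dir/name to the copy kept in the shared data directory."""
    target = shared_dir / name
    link = repo_dir / name
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        link.unlink()  # replace any stale link or file
    link.symlink_to(target)
    return link

# Example (paths as described in this skill):
# link_shared(Path("data-pipeline/raw"),
#             Path.home() / ".local/share/metaforge",
#             "wiki-news-300d-1M.vec")
```

`ln -s` from the shell is equivalent; the helper just makes the operation repeatable across worktrees.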
Set up the Python environment:

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -r data-pipeline/requirements.txt
```

The entire build is handled by data-pipeline/import_raw.sh:
```bash
source .venv/bin/activate

# Build only
./data-pipeline/import_raw.sh

# Build and dump the PRE_ENRICH.sql baseline
./data-pipeline/import_raw.sh --dump
```
The script performs these steps in order:
1. Apply the schema from data-pipeline/SCHEMA.sql
2. Import OEWN (import_oewn.py)
3. Import SyntagNet (import_syntagnet.py)
4. Import VerbNet (import_verbnet.py)
5. Import familiarity ratings (import_familiarity.py)
6. Import SUBTLEX-UK frequencies (import_subtlex.py)
7. Build the curated property vocabulary (build_vocab.py)
8. Build property antonyms (build_antonyms.py)
9. (with --dump) Export as PRE_ENRICH.sql

The script prints row counts automatically. Expected values (approximate):
| Table | Expected Count |
|---|---|
| synsets | ~120,000 |
| lemmas | ~160,000 |
| relations | ~80,000 |
| frequencies | ~60,000 |
| syntagms | ~35,000 |
| vn_classes | ~400 |
| property_vocab_curated | 35,000 |
| property_antonyms | ~576 |
| enrichment | 0 |
| property_vocabulary | 0 |
| synset_properties | 0 |
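A post-build sanity check against the counts above might look like the sketch below. The table names come from this skill; the 10% tolerance and the expectation that the three enrichment tables are empty are assumptions.

```python
import sqlite3

# Approximate expectations from the row-count table (subset shown).
EXPECTED = {
    "synsets": 120_000,
    "lemmas": 160_000,
    "relations": 80_000,
}

def check_counts(db_path: str, tolerance: float = 0.1) -> dict:
    """Return {table: True/False} for whether each count is within tolerance."""
    con = sqlite3.connect(db_path)
    results = {}
    for table, expected in EXPECTED.items():
        (n,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        results[table] = abs(n - expected) <= expected * tolerance
    con.close()
    return results
```

A table far outside tolerance usually means one import step silently read an empty or wrong-version source file.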
| File | Purpose |
|---|---|
| data-pipeline/SCHEMA.sql | Canonical DDL — all CREATE TABLE + CREATE INDEX statements |
| data-pipeline/import_raw.sh | Build orchestrator |
| data-pipeline/output/PRE_ENRICH.sql | Committed baseline dump (base data + empty enrichment schema) |
| data-pipeline/scripts/utils.py | Shared constants, including hardcoded paths for raw data files |
Input paths are hardcoded in data-pipeline/scripts/utils.py — the import scripts do not take CLI arguments for input files. Check each script's source before running in case paths have changed. When the schema changes, update data-pipeline/SCHEMA.sql, data-pipeline/CLAUDE.md, and this skill before committing.
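For orientation, the path constants in utils.py might be shaped roughly like this. These names are illustrative guesses based on the paths listed in this skill, not the actual module contents — always read the real file.

```python
from pathlib import Path

# Hypothetical sketch of utils.py path constants; names are illustrative only.
RAW_DIR = Path("data-pipeline/raw")
INPUT_DIR = Path("data-pipeline/input")

SQLUNET_DB = RAW_DIR / "sqlunet_master.db"
FAMILIARITY_GLOB = INPUT_DIR / "multilex-en"      # *.xlsx files inside
SUBTLEX_GLOB = INPUT_DIR / "subtlex-uk"           # *.xlsx files inside
FASTTEXT_VEC = Path.home() / ".local/share/metaforge/wiki-news-300d-1M.vec"
```

Because the scripts read these constants rather than CLI flags, moving a raw file means editing utils.py, not the invocation.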