Build a unified multi-level category taxonomy from the hierarchical product category paths of any e-commerce company, using embedding-based recursive clustering with intelligent category naming via weighted word frequency analysis.
Create a unified multi-level taxonomy from hierarchical category paths by clustering similar paths and automatically generating meaningful category names.
Given category paths from multiple sources (e.g., "electronics -> computers -> laptops"), create a unified taxonomy that groups similar paths across sources, generates meaningful category names, and produces a clean N-level hierarchy (typically 5 levels). The unified category taxonomy can then be used for analysis or metric tracking on products from different platforms.
DataFrame with added columns:
unified_level_1: Top-level category (e.g., "electronic | device")
unified_level_2: Second-level category (e.g., "computer | laptop")
unified_level_3 through unified_level_N: Deeper levels

Category names use | separator, max 5 words, covering 70%+ of records in each cluster.
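The naming rule above (| separator, at most 5 words, 70%+ record coverage) could be sketched roughly as follows. This is a minimal illustration, not the code in the step scripts: the function name `name_cluster`, the greedy selection strategy, and the definition of "covered" (a record counts as covered once it contains any chosen word) are all assumptions. Lemmatization (e.g., "electronics" to "electronic") is left out to keep the sketch dependency-free.

```python
from collections import Counter


def name_cluster(paths, weights=None, max_words=5, min_coverage=0.7, sep=" | "):
    """Hypothetical helper: build a cluster name from weighted word frequencies."""
    if weights is None:
        weights = [1] * len(paths)  # assumption: one record per path
    # Each path becomes a set of lowercase words; "->" is treated as whitespace.
    token_sets = [set(p.lower().replace("->", " ").split()) for p in paths]
    total = sum(weights)
    chosen = []
    covered = [False] * len(paths)

    while len(chosen) < max_words:
        # Weighted word frequency over records not yet covered by a chosen word.
        freq = Counter()
        for tokens, weight, done in zip(token_sets, weights, covered):
            if not done:
                for token in tokens:
                    freq[token] += weight
        if not freq:
            break
        # Highest weight wins; ties are broken alphabetically for determinism.
        word = min(freq, key=lambda t: (-freq[t], t))
        chosen.append(word)
        covered = [done or word in tokens
                   for tokens, done in zip(token_sets, covered)]
        covered_weight = sum(w for w, done in zip(weights, covered) if done)
        if covered_weight / total >= min_coverage:
            break

    return sep.join(chosen)


example_paths = [
    "electronics -> computers -> laptops",
    "electronics -> computers -> desktops",
    "electronics -> phones",
    "home -> kitchen",
]
print(name_cluster(example_paths))
print(name_cluster(example_paths, min_coverage=1.0))
```

With the default 70% threshold, one word already covers three of the four example paths, so the name stops there. Raising `min_coverage` forces additional words to be appended until the remaining records are covered or the 5-word cap is hit.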
pip install pandas numpy scipy sentence-transformers nltk tqdm
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
Step 1 (step1_preprocessing_and_merge.py): category_path column, depth filtering, prefix removal, then merge all sources. source_level should reflect the processed version of the source level name.
Columns: category_path, source, depth, source_level_1 through source_level_N
Step 2 (step2_weighted_embedding_generation.py)
Step 3 (step3_recursive_clustering_naming.py)
Step 4 (step4_result_assignments.py)
unified_taxonomy_full.csv - all records with unified categories
unified_taxonomy_hierarchy.csv - unique taxonomy structure

Use scripts/pipeline.py to run the complete 4-step workflow.
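To make the Step 1 output columns concrete, here is a small dependency-free sketch of the preprocessing and merge. It is not the contents of step1_preprocessing_and_merge.py: the helper names `preprocess_path` and `merge_sources`, the "->" separator, lowercasing as the "processed" form of a level name, and truncation to `n_levels` as the depth filter are all assumptions made for illustration.

```python
def preprocess_path(raw_path, source, n_levels=5, prefix=None, sep="->"):
    """Hypothetical helper: turn one raw category path into a Step 1 style record."""
    # Split into levels, normalizing whitespace and case (assumed processing).
    levels = [part.strip().lower() for part in raw_path.split(sep)]
    levels = [part for part in levels if part]
    # Prefix removal: drop a leading level such as a site-wide root category.
    if prefix and levels and levels[0] == prefix.strip().lower():
        levels = levels[1:]
    # Depth filtering (assumed here to mean keeping at most n_levels levels).
    levels = levels[:n_levels]

    record = {
        "category_path": " -> ".join(levels),
        "source": source,
        "depth": len(levels),
    }
    # source_level_i holds the processed level name, or None past the path's depth.
    for i in range(n_levels):
        record[f"source_level_{i + 1}"] = levels[i] if i < len(levels) else None
    return record


def merge_sources(paths_by_source, **kwargs):
    """Hypothetical helper: preprocess every source and merge into one record list."""
    merged = []
    for source, paths in paths_by_source.items():
        for raw_path in paths:
            record = preprocess_path(raw_path, source, **kwargs)
            if record["depth"] > 0:  # skip paths that are empty after cleaning
                merged.append(record)
    return merged


records = merge_sources(
    {
        "shop_a": ["All Products -> Electronics -> Computers -> Laptops"],
        "shop_b": ["All Products -> Home -> Kitchen", "All Products"],
    },
    prefix="All Products",
)
for record in records:
    print(record["source"], record["depth"], record["category_path"])
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(records)` to obtain a table with the category_path, source, depth, and source_level_1 through source_level_N columns.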
See scripts/pipeline.py for: