Skill for working with the ML extraction pipeline — spaCy NER model training, regex-based attribute extraction, production extractor that combines both methods, resolver logic for cross-validation between ML and regex, and model evaluation. Use this skill whenever the user mentions NER, spaCy, extraction accuracy, mileage/fuel/power/year extraction, labeling data, training or retraining the model, resolvers, ProductionExtractor, error analysis, F1 score, training reports, or anything in the ml/ directory. Also trigger for labeling/ directory work (label_data_assisted.py, training data preparation).
Extract structured vehicle attributes (mileage, year of manufacture, fuel type, power) from free-text Czech car listing descriptions on bazos.cz. The pipeline uses a dual extraction approach — spaCy NER model + regex patterns — with cross-validation resolvers that compare both outputs and pick the best result with confidence scoring.
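To make the regex half of the dual approach concrete, here is a minimal sketch of a mileage extractor for Czech listing text. The pattern, the `MILEAGE_RE` name, and the `extract_mileage_km` function are illustrative assumptions; they are not taken from `context_aware_patterns.py`, which likely handles many more cases (for example ruling out "km/h").

```python
import re
from typing import Optional

# Illustrative pattern only (an assumption, not the pipeline's real regex):
# a number, optionally grouped by spaces or dots, an optional "tis." marker
# meaning "thousand", then the unit "km".
MILEAGE_RE = re.compile(
    r"(?<!\d)(?P<num>\d{1,3}(?:[ .\u00a0]\d{3})+|\d+)\s*(?P<thousand>tis\.?)?\s*km\b",
    re.IGNORECASE,
)


def extract_mileage_km(text: str) -> Optional[int]:
    """Return the first mileage found in the text, in kilometres, or None."""
    match = MILEAGE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators (space, dot, non-breaking space) before parsing.
    value = int(re.sub(r"[ .\u00a0]", "", match.group("num")))
    if match.group("thousand"):
        value *= 1000
    return value


print(extract_mileage_km("Najeto 150 000 km, nafta"))  # 150000
print(extract_mileage_km("najeto 85tis km"))           # 85000
```

A pattern like this is cheap and predictable, but it misses unusual phrasing. That gap is what the NER model and the resolvers are there to cover.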
ml/
├── production_extractor.py # ProductionExtractor — main entry point, combines NER + regex
├── extractor.py # Core extraction logic
├── resolvers/ # Per-attribute resolution (mileage, fuel, power, year)
├── context_aware_patterns.py # Advanced regex patterns with contextual awareness
├── training/ # Model training scripts
├── error_analysis/ # Extraction quality analysis tools
└── clean_all_duplicates.py # Deduplication utility
labeling/
├── label_data_assisted.py # Semi-automated labeling (manual + auto-assisted)
├── export, filter, validate # Supporting labeling scripts
└── scrape scripts # Scraping raw data for labeling
ml_models/ # Saved model artifacts (DO NOT modify without approval)
car_ner_model/ # Active production NER model (DO NOT modify without approval)
training_reports/ # Post-training evaluation reports (F1 scores, metrics)
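For reading the F1 scores in training_reports/, the following minimal sketch shows how an entity-level score is commonly computed. It assumes entities are `(start, end, label)` tuples and that only exact matches count. The function name `entity_prf` and the labels are hypothetical, and the actual reports may be produced differently (for example by spaCy's own scorer).

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 over exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one listing: two of three predictions are correct.
gold = [(7, 17, "MILEAGE"), (19, 24, "FUEL"), (30, 34, "YEAR")]
pred = [(7, 17, "MILEAGE"), (19, 24, "FUEL"), (40, 46, "POWER")]
precision, recall, f1 = entity_prf(gold, pred)
```

Here precision, recall, and F1 all equal 2/3, because two of three predicted entities are correct and two of three gold entities were found.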
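ProductionExtractor runs both methods on every listing description: the spaCy NER model, and regex extraction using context_aware_patterns.py and per-attribute patterns.
For each attribute (mileage, year, fuel, power), a dedicated resolver compares the NER and regex outputs and picks the best result with a confidence score.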
The resolvers are the critical quality layer — they catch cases where one method extracts correctly and the other fails.
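The cross-validation idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `ml/resolvers/`: the `Resolved` type, the `resolve_numeric` function, the confidence numbers, and the rule of preferring regex on disagreement are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolved:
    value: Optional[int]
    source: str        # "both", "ner", "regex" or "none"
    confidence: float  # illustrative scores, not the pipeline's real ones


def resolve_numeric(ner_value: Optional[int], regex_value: Optional[int]) -> Resolved:
    """Cross-validate one numeric attribute extracted by two methods."""
    if ner_value is None and regex_value is None:
        return Resolved(None, "none", 0.0)
    if ner_value is not None and regex_value is not None:
        if ner_value == regex_value:
            # Agreement between two independent methods is the strongest signal.
            return Resolved(ner_value, "both", 0.95)
        # Disagreement: this sketch arbitrarily prefers regex and lowers confidence.
        return Resolved(regex_value, "regex", 0.5)
    if regex_value is not None:
        # Only regex produced a value.
        return Resolved(regex_value, "regex", 0.7)
    # Only the NER model produced a value.
    return Resolved(ner_value, "ner", 0.7)


print(resolve_numeric(150000, 150000).source)  # both
print(resolve_numeric(None, 98000).value)      # 98000
```

A real resolver would likely add attribute-specific sanity checks (a plausible year range, a plausible mileage) before trusting either source.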
Extracted fuel values are normalized to one of 6 types:
diesel, benzín, lpg, elektro, cng, hybrid
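Scrape raw data for labeling (labeling/ scrape scripts).
Use label_data_assisted.py for semi-automated labeling; it combines manual review with auto-suggestions to speed up the process.
Export, filter, and validate the labeled data with the labeling/ scripts.
Train the model with the scripts in ml/training/.
Review the post-training evaluation reports in training_reports/.
Saved model artifacts live in ml_models/; the active model is car_ner_model/ (production).
Do not modify car_ner_model/ or ml_models/ without explicit user approval. This includes retraining, overwriting, or deleting model files.
Tools and files for investigating extraction quality: the error analysis tools (ml/error_analysis/), label_data_assisted.py, and context_aware_patterns.py.
To check a single listing description, run ProductionExtractor on it in isolation.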