Name: ML Pipeline Skill
Author: Filip057

ML Pipeline Skill

Skill for working with the ML extraction pipeline — spaCy NER model training, regex-based attribute extraction, production extractor that combines both methods, resolver logic for cross-validation between ML and regex, and model evaluation. Use this skill whenever the user mentions NER, spaCy, extraction accuracy, mileage/fuel/power/year extraction, labeling data, training or retraining the model, resolvers, ProductionExtractor, error analysis, F1 score, training reports, or anything in the ml/ directory. Also trigger for labeling/ directory work (label_data_assisted.py, training data preparation).

Filip057 · 0 stars · 08.04.2026

Profession
Categories: Machine Learning

Purpose

Extract structured vehicle attributes (mileage, year of manufacture, fuel type, power) from free-text Czech car listing descriptions on bazos.cz. The pipeline uses a dual extraction approach — spaCy NER model + regex patterns — with cross-validation resolvers that compare both outputs and pick the best result with confidence scoring.
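As a rough illustration of the cross-validation idea described above, the following sketch resolves a single attribute (mileage) from two candidate values. It is not the project's resolver code: the `Resolved` class, the `resolve_mileage` function, the confidence numbers, and the choice to prefer the regex value on disagreement are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolved:
    """Hypothetical result type: chosen value, confidence, and its origin."""
    value: Optional[int]
    confidence: float
    source: str  # "both", "ner", "regex", or "none"


def resolve_mileage(ner_km: Optional[int], regex_km: Optional[int]) -> Resolved:
    """Compare the NER and regex candidates and score the chosen value.

    The confidence numbers are illustrative placeholders, not tuned values.
    """
    if ner_km is None and regex_km is None:
        return Resolved(None, 0.0, "none")
    if ner_km is not None and regex_km is not None:
        if ner_km == regex_km:
            # Both methods agree: highest confidence.
            return Resolved(ner_km, 0.95, "both")
        # Disagreement: this sketch arbitrarily prefers the regex value
        # and flags it with low confidence.
        return Resolved(regex_km, 0.5, "regex")
    if ner_km is not None:
        return Resolved(ner_km, 0.7, "ner")
    return Resolved(regex_km, 0.7, "regex")
```

A caller would run both extractors on the same listing text and pass the two candidates in; a low-confidence result could then be routed to manual review or error analysis.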

Architecture

ml/
├── production_extractor.py    # ProductionExtractor — main entry point, combines NER + regex
├── extractor.py               # Core extraction logic
├── resolvers/                 # Per-attribute resolution (mileage, fuel, power, year)
├── context_aware_patterns.py  # Advanced regex patterns with contextual awareness
├── training/                  # Model training scripts
├── error_analysis/            # Extraction quality analysis tools
└── clean_all_duplicates.py    # Deduplication utility

labeling/
├── label_data_assisted.py     # Semi-automated labeling (manual + auto-assisted)
├── export, filter, validate   # Supporting labeling scripts
└── scrape scripts             # Scraping raw data for labeling

ml_models/                     # Saved model artifacts (DO NOT modify without approval)
car_ner_model/                 # Active production NER model (DO NOT modify without approval)
training_reports/              # Post-training evaluation reports (F1 scores, metrics)
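
To make "regex patterns with contextual awareness" more concrete, here is a minimal sketch of what a context-aware mileage pattern might look like. It does not reproduce the contents of context_aware_patterns.py: the pattern, the `extract_mileage_km` name, and the handled formats ("150 000 km", "98.500 km", "150tis km") are assumptions chosen for illustration.

```python
import re
from typing import Optional

# Assumed pattern: a number with optional space, dot, or non-breaking-space
# thousands separators, an optional "tis" (thousand) suffix, then "km".
MILEAGE_RE = re.compile(
    r"(?<!\d)(?P<num>\d{1,3}(?:[ .\u00a0]\d{3})+|\d+)\s*(?P<tis>tis\.?)?\s*km\b",
    re.IGNORECASE,
)


def extract_mileage_km(text: str) -> Optional[int]:
    """Return the first plausible mileage in kilometres, or None."""
    for m in MILEAGE_RE.finditer(text):
        # Context check: skip speeds such as "130 km/h".
        if text[m.end():m.end() + 2].lower() == "/h":
            continue
        # Drop thousands separators before converting to int.
        value = int(re.sub(r"[ .\u00a0]", "", m.group("num")))
        if m.group("tis"):
            value *= 1000
        return value
    return None
```

The context check is the point of the sketch: the same "number + km" shape appears in speed figures, so the match is accepted only when what follows it rules out "km/h".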

ML Pipeline Skill

Purpose

Architecture

How Extraction Works

Dual Extraction Strategy

Resolver Cross-Validation

Fuel Type Normalization

Model Training Workflow

Preparing Training Data

Training & Evaluation

Promoting a New Model

Critical Rules

Patterns to Follow

Improving Extraction Accuracy

Adding a New Regex Pattern

Debugging an Extraction Error

Common Pitfalls
