# Run data quality validation on scraped store data
Validates scraped store data for quality issues using the project's validation utilities.
| Command | Description |
|---|---|
| `/validate-output verizon` | Validate Verizon store data |
| `/validate-output all` | Validate all retailers |
| `/validate-output target --strict` | Strict mode (treat warnings as errors) |
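Assuming the command receives its arguments as a single raw string (the `parse_args` helper below is hypothetical, not part of the project), the retailer and the `--strict` flag could be split out like this:

```python
def parse_args(raw: str) -> tuple[str, bool]:
    """Split '/validate-output target --strict' style args into (retailer, strict)."""
    parts = raw.split()
    strict = "--strict" in parts
    # First non-flag token is the retailer; default to 'all' when none is given
    retailer = next((p for p in parts if not p.startswith("--")), "all")
    return retailer, strict

print(parse_args("target --strict"))  # ('target', True)
print(parse_args("all"))              # ('all', False)
```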
Each store record is checked for these fields:

- `store_id` - Unique store identifier
- `name` - Store name
- `street_address` - Street address
- `city` - City name
- `state` - State
- `latitude` / `longitude` - Geographic coordinates
- `phone` - Contact phone number
- `url` - Store page URL

Data is read from `data/{retailer}/output/stores_latest.json` and validated with `validate_stores_batch()` from `src/shared/utils.py`.

Example report:

```
=== Validation Report: verizon ===
Total stores: 1,247
Valid stores: 1,189 (95.3%)

Field Completeness:
  store_id: 100.0%
  name: 100.0%
  street_address: 100.0%
  city: 100.0%
  state: 100.0%
  latitude: 98.2%
  longitude: 98.2%
  phone: 87.4%

Issues Found:
  - ERROR: Store 'VZW-1234' missing latitude/longitude
  - WARNING: Store 'VZW-5678' has coordinates (0.0, 0.0)
  ... (22 more issues)
```
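The coordinate issues in the report above (missing values, and the (0.0, 0.0) "null island" that usually signals a geocoding failure) can be caught with a simple range check. A minimal sketch; the `has_valid_coords` helper and its thresholds are illustrative, not taken from the project's validation utilities:

```python
def has_valid_coords(store: dict) -> bool:
    """Return True only when a store has plausible latitude/longitude values."""
    lat, lon = store.get("latitude"), store.get("longitude")
    if lat is None or lon is None:
        return False  # missing coordinates -> ERROR in the report
    if lat == 0.0 and lon == 0.0:
        return False  # (0.0, 0.0) is almost always a failed geocode -> WARNING
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

print(has_valid_coords({"latitude": 40.7, "longitude": -74.0}))  # True
print(has_valid_coords({"latitude": 0.0, "longitude": 0.0}))     # False
```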
Implementation outline:

```python
import json
from pathlib import Path

from src.shared.utils import validate_stores_batch

# Load data
data_path = Path(f"data/{retailer}/output/stores_latest.json")
with open(data_path) as f:
    stores = json.load(f)

# Validate
summary = validate_stores_batch(stores, strict=strict_mode, log_issues=True)

# Report completeness
fields = ['store_id', 'name', 'street_address', 'city', 'state', 'latitude', 'longitude', 'phone', 'url']
for field in fields:
    present = sum(1 for s in stores if s.get(field))
    pct = (present / len(stores)) * 100
    print(f"  {field}: {pct:.1f}%")
```
If `stores_latest.json` does not exist, suggest running the scraper for that retailer first.