# Adding a New Retailer Scraper

Structured guide for adding a new retailer to the scraper system, following existing project patterns.
Use Kapture to discover hidden APIs by inspecting network traffic on the retailer's store locator.
1. Navigate to the store locator: `mcp__kapture__navigate` → `{retailer}.com/stores` or `/store-locator`
2. Open the Network tab and interact with the page
3. Inspect network requests:
   - `mcp__kapture__console_logs` → check for API calls
   - `mcp__kapture__dom` → inspect page structure for embedded JSON
Look for these patterns:

- Location-platform APIs: `liveapi.yext.com` (used by Cricket and many retailers)
- Store API endpoints: `/api/stores`, `/api/locations`, `redsky.target.com`
- Sitemaps: `/sitemap.xml`, `/sitemap-stores.xml`, `/store-sitemap.xml.gz`
- Embedded data: `<script type="application/ld+json">` or `window.__INITIAL_STATE__`

Document findings:
| Provider | URL Pattern | Used By |
|---|---|---|
| Yext | liveapi.yext.com/v2/accounts/*/entities | Cricket, many retail chains |
| Uberall | uberall.com/api/storefinders/*/locations | Telus |
| Locally | locally.com/stores/conversion_data | Various |
| Google Places | Embedded maps with place IDs | Some retailers |
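The table above can drive a small triage step when sifting captured traffic. The helper below is a hypothetical sketch (the function name and pattern dictionary are not part of the project) that maps a request URL to a known location platform:

```python
from urllib.parse import urlparse

# Hypothetical helper: hostname suffixes mirror the provider table above.
PROVIDER_PATTERNS = {
    'liveapi.yext.com': 'Yext',
    'uberall.com': 'Uberall',
    'locally.com': 'Locally',
}

def identify_provider(request_url: str) -> str:
    """Return the provider name for a captured request URL, or 'Unknown'."""
    host = urlparse(request_url).netloc
    for suffix, provider in PROVIDER_PATTERNS.items():
        if host.endswith(suffix):
            return provider
    return 'Unknown'

print(identify_provider('https://liveapi.yext.com/v2/accounts/me/entities'))  # Yext
```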
Example: Cricket Wireless

1. Navigate to `cricketwireless.com/stores`
2. Search for "10001"
3. Network tab shows: `liveapi.yext.com/v2/accounts/me/search/vertical/query?...`
4. Response contains store data with all fields needed
5. API key visible in URL parameters
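Once captured, a request like this can be rebuilt outside the browser. The sketch below only constructs the query URL; the `api_key` and `v` values are placeholders, and the exact parameter set should be copied from the request you actually captured:

```python
from urllib.parse import urlencode

# Endpoint path from the captured Cricket request above.
BASE = 'https://liveapi.yext.com/v2/accounts/me/search/vertical/query'

def build_query_url(zip_code: str, api_key: str) -> str:
    params = {
        'input': zip_code,
        'api_key': api_key,   # visible in the captured URL parameters
        'v': '20230101',      # API version date; placeholder value
    }
    return f'{BASE}?{urlencode(params)}'

url = build_query_url('10001', 'YOUR_KEY')
print(url)
```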
After research, you should have: the discovery method (sitemap, API, or crawl), the data source URL or API endpoint, and any API keys or parameters the requests require.

Create `src/scrapers/{retailer}.py` implementing the standard interface:
"""
{Retailer} store scraper.
Discovery method: {sitemap|api|crawl}
Data source: {URL or API endpoint}
"""
import logging
from src.shared.utils import (
make_request_with_retry,
random_delay,
validate_store_data,
get_delay_for_mode
)
from src.shared.cache import URLCache
logger = logging.getLogger(__name__)
def run(session, retailer_config, retailer: str, **kwargs) -> dict:
"""
Main entry point for {retailer} scraper.
Args:
session: requests.Session with configured headers
retailer_config: Config from retailers.yaml
retailer: Retailer name string
**kwargs: Additional args (limit, test_mode, proxy_mode, etc.)
Returns:
dict with keys:
- stores: list of store dictionaries
- count: number of stores scraped
- checkpoints_used: bool indicating if resumed from checkpoint
"""
stores = []
checkpoints_used = False
# 1. Discover store URLs (from sitemap, API, or crawl)
store_urls = discover_store_urls(session, retailer_config, **kwargs)
# 2. Extract store data from each URL
for url in store_urls:
store_data = extract_store_data(session, url, retailer_config, **kwargs)
if store_data:
# 3. Validate before adding
validation = validate_store_data(store_data)
if validation.is_valid:
stores.append(store_data)
else:
logger.warning(f"Invalid store data: {validation.errors}")
# 4. Respect rate limits
random_delay(retailer_config, kwargs.get('proxy_mode'))
return {
'stores': stores,
'count': len(stores),
'checkpoints_used': checkpoints_used
}
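`run()` delegates URL discovery to a `discover_store_urls()` helper. For a sitemap-based retailer, the parsing half of that helper might look like the sketch below (fetching is left to the caller's session; the pattern and sitemap body here are illustrative, reusing the `STORE_URL_PATTERN` example from the config section):

```python
import re
import xml.etree.ElementTree as ET

# Example pattern, as in the config module template.
STORE_URL_PATTERN = r"https://example\.com/stores/[\w-]+"

def parse_store_urls(sitemap_xml: str, pattern: str = STORE_URL_PATTERN) -> list:
    """Extract <loc> entries from a sitemap body that match the store URL pattern."""
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    root = ET.fromstring(sitemap_xml)
    locs = [loc.text for loc in root.findall('.//sm:loc', ns)]
    return [u for u in locs if re.fullmatch(pattern, u or '')]

sitemap = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/stores/nyc-001</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""
print(parse_store_urls(sitemap))  # ['https://example.com/stores/nyc-001']
```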
Each store dictionary must follow this schema:

```python
store = {
    'store_id': str,        # Required: Unique identifier
    'name': str,            # Required: Store name
    'street_address': str,  # Required: Street address
    'city': str,            # Required: City
    'state': str,           # Required: State/province code
    'zip_code': str,        # Recommended
    'latitude': str,        # Recommended: As string
    'longitude': str,       # Recommended: As string
    'phone': str,           # Recommended
    'url': str,             # Recommended: Store page URL
    'store_type': str,      # Optional: corporate, authorized, etc.
    'hours': dict,          # Optional: Operating hours
    'scraped_at': str,      # Auto-added: ISO timestamp
}
```
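For illustration, here is one hypothetical way an extractor could assemble a record matching this schema (the raw field names and sample values are made up; required fields are populated, coordinates are kept as strings):

```python
from datetime import datetime, timezone

def build_store_record(raw: dict) -> dict:
    """Assemble a schema-conforming store dict from a raw source record."""
    return {
        'store_id': str(raw['id']),
        'name': raw['name'],
        'street_address': raw['address'],
        'city': raw['city'],
        'state': raw['state'],
        'zip_code': raw.get('zip', ''),
        'latitude': str(raw.get('lat', '')),   # kept as strings per the schema
        'longitude': str(raw.get('lng', '')),
        'scraped_at': datetime.now(timezone.utc).isoformat(),  # auto-added timestamp
    }

record = build_store_record({
    'id': 1042, 'name': 'Downtown', 'address': '1 Main St',
    'city': 'New York', 'state': 'NY', 'zip': '10001',
    'lat': 40.75, 'lng': -73.99,
})
print(record['store_id'], record['latitude'])  # 1042 40.75
```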
Create `config/{retailer}_config.py`:
"""Configuration for {retailer} scraper."""
# Discovery settings
SITEMAP_URL = "https://example.com/sitemap.xml"
# or
API_URL = "https://api.example.com/stores"
# Store page URL pattern (for validation)
STORE_URL_PATTERN = r"https://example\.com/stores/[\w-]+"
# Fields to extract (maps source field to output field)
FIELD_MAPPING = {
'storeNumber': 'store_id',
'storeName': 'name',
'address1': 'street_address',
# ...
}
# Rate limiting (overrides retailers.yaml if needed)
MIN_DELAY = 2.0
MAX_DELAY = 5.0
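One plausible way a scraper could apply `FIELD_MAPPING` is a simple rename pass over each raw record; the `map_fields()` helper below is a sketch, not an existing project function (the mapping repeats the config example):

```python
FIELD_MAPPING = {
    'storeNumber': 'store_id',
    'storeName': 'name',
    'address1': 'street_address',
}

def map_fields(raw: dict, mapping: dict = FIELD_MAPPING) -> dict:
    """Rename source fields to standard output fields, skipping absent keys."""
    return {out: raw[src] for src, out in mapping.items() if src in raw}

print(map_fields({'storeNumber': 'C-77', 'storeName': 'Midtown', 'extra': 1}))
# {'store_id': 'C-77', 'name': 'Midtown'}
```

Unmapped source fields (like `extra` above) are dropped rather than passed through, so the output stays schema-shaped.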
Update `src/scrapers/__init__.py`:
```python
# Add import
from src.scrapers.{retailer} import run as {retailer}_run

# Add to SCRAPERS dict
SCRAPERS = {
    # ... existing scrapers
    '{retailer}': {retailer}_run,
}
```
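The registry presumably lets the runner dispatch by retailer name. A minimal self-contained sketch of that lookup (with a dummy scraper standing in for a real module):

```python
# Dummy stand-in for a real scraper module's run() function.
def dummy_run(session, retailer_config, retailer, **kwargs):
    return {'stores': [], 'count': 0, 'checkpoints_used': False}

SCRAPERS = {'dummy': dummy_run}

def dispatch(retailer: str, session=None, retailer_config=None, **kwargs) -> dict:
    """Look up the retailer's run() in the registry and invoke it."""
    if retailer not in SCRAPERS:
        raise KeyError(f'No scraper registered for {retailer!r}')
    return SCRAPERS[retailer](session, retailer_config, retailer, **kwargs)

print(dispatch('dummy')['count'])  # 0
```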
Add to `config/retailers.yaml`: