# /scaffold-scraper

Generate a new retailer scraper skeleton with config and test fixtures: creates three files and registers the scraper.
## Usage

| Command | Description |
|---|---|
| `/scaffold-scraper walgreens sitemap` | Sitemap-based scraper |
| `/scaffold-scraper nordstrom api` | Paginated JSON API scraper |
| `/scaffold-scraper costco graphql` | GraphQL API scraper |
| `/scaffold-scraper gap html` | Multi-phase HTML crawl scraper |
| `/scaffold-scraper autozone locator` | Geo-radius store locator API scraper |
Valid types: `sitemap`, `api`, `graphql`, `html`, `locator`.

## Files created

### `src/scrapers/{retailer}.py`

Scraper module with a `run()` function following the project contract:

```python
def run(session, retailer_config, retailer: str, **kwargs) -> dict:
    # Returns: {'stores': [...], 'count': int, 'checkpoints_used': bool}
```
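For orientation, here is a minimal stub that satisfies the contract (illustrative only; a real scraper would populate `stores` through its discovery and extraction helpers):

```python
def run(session, retailer_config, retailer: str, **kwargs) -> dict:
    """Smallest conforming run(): returns the contract dict with no stores."""
    stores = []  # would be filled by discovery + extraction
    return {'stores': stores, 'count': len(stores), 'checkpoints_used': False}
```

Any caller can rely on exactly those three keys being present, regardless of scraper type.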
Each type includes imports from the shared utilities in `src/shared/` and `TODO` stubs pointing at a working reference scraper:

| Type | Base Pattern | Example Scraper |
|---|---|---|
| `sitemap` | XML/gzipped sitemap parsing | `att.py`, `bestbuy.py` |
| `api` | Paginated JSON API | `cricket.py` (Yext), `telus.py` (Uberall) |
| `graphql` | GraphQL queries | `homedepot.py` |
| `html` | Multi-phase HTML crawl | `verizon.py` |
| `locator` | Geo-radius store locator | `staples.py` |
### `config/{retailer}_config.py`

Config skeleton with placeholder values.
### `tests/test_scrapers/fixtures/{retailer}/`

Empty fixture directory. Drop sample responses here:

- `sitemap.xml`, `api_response.json`, or `graphql_response.json`
- `store_page.html` or `store_detail.json`
- `expected_stores.json`

## Registration

Add an entry to `SCRAPER_REGISTRY` in `src/scrapers/__init__.py`.
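As a sketch of what that entry might look like — the exact registry shape is an assumption here, so mirror the existing entries in `src/scrapers/__init__.py` (in the real module the value would be the imported scraper module's `run`, not a local stub):

```python
def walgreens_run(session, retailer_config, retailer, **kwargs):
    """Local stand-in for src.scrapers.walgreens.run, for illustration only."""
    return {'stores': [], 'count': 0, 'checkpoints_used': False}

# The key must match the retailer name used on the command line
# and in the retailers.yaml block.
SCRAPER_REGISTRY = {
    'walgreens': walgreens_run,
}
```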
Add a retailer block to `config/retailers.yaml` with:

```yaml
enabled: true
```

## What the command does

- Creates `src/scrapers/{retailer}.py` from the type template
- Creates `config/{retailer}_config.py` with placeholder values
- Creates the `tests/test_scrapers/fixtures/{retailer}/` directory
- Registers the scraper in `SCRAPER_REGISTRY` in `src/scrapers/__init__.py`
- Adds the retailer block to `config/retailers.yaml`
- Runs `pylint src/scrapers/{retailer}.py` to verify syntax

## Templates

### `sitemap` template

```python
"""
{Retailer} store scraper.

Discovery: XML sitemap
Example reference: See att.py for a working sitemap scraper.
"""
import logging

from src.shared.utils import (
    make_request_with_retry,
    random_delay,
    validate_store_data,
    get_delay_for_mode,
)

logger = logging.getLogger(__name__)


def run(session, retailer_config, retailer: str, **kwargs) -> dict:
    """Main entry point for {retailer} scraper."""
    stores = []
    checkpoints_used = False
    limit = kwargs.get('limit')

    # TODO: Implement sitemap discovery
    # See att.py:discover_store_urls() for XML sitemap parsing
    store_urls = discover_store_urls(session, retailer_config, **kwargs)

    for i, url in enumerate(store_urls):
        if limit and i >= limit:
            break
        store_data = extract_store_data(session, url, retailer_config, **kwargs)
        if store_data:
            validation = validate_store_data(store_data)
            if validation.is_valid:
                stores.append(store_data)
            else:
                logger.warning("Invalid store data from %s: %s", url, validation.errors)
        random_delay(retailer_config, kwargs.get('proxy_mode'))

    return {'stores': stores, 'count': len(stores), 'checkpoints_used': checkpoints_used}


def discover_store_urls(session, retailer_config, **kwargs):
    """Parse sitemap XML to find store page URLs.

    TODO: Implement sitemap fetching and parsing.
    See att.py for XML sitemap, target.py for gzipped sitemaps.
    """
    raise NotImplementedError("TODO: Implement sitemap discovery")


def extract_store_data(session, url, retailer_config, **kwargs):
    """Extract store data from a single store page.

    TODO: Implement store page parsing.
    See att.py:extract_store_data() for HTML parsing example.
    """
    raise NotImplementedError("TODO: Implement store extraction")
```
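The discovery half of this template is mostly XML parsing; a hedged sketch of that step with the standard library (the sitemap namespace is the standard one, but the `/stores/` path filter is an assumption — check the retailer's actual sitemap):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def parse_store_urls(sitemap_xml: str, path_hint: str = '/stores/') -> list:
    """Pull <loc> entries out of sitemap XML, keeping only store-page URLs."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall('.//sm:loc', SITEMAP_NS)]
    return [u for u in urls if path_hint in u]

doc = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/stores/nyc-1</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""
parse_store_urls(doc)  # ['https://example.com/stores/nyc-1']
```

For gzipped sitemaps (the `target.py` case), the same parsing applies after decompressing the response body with `gzip.decompress()`.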
"""
{Retailer} store scraper.
Discovery: Paginated JSON API
Example reference: See cricket.py (Yext API) or telus.py (Uberall API).
"""
# TODO: Implement paginated API discovery and extraction
# See cricket.py for geo-grid API pattern
# See telus.py for Uberall API pattern
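The core loop for this template can be sketched as below; the `offset`/`limit` parameter names and the stop-on-short-page rule are assumptions, since some APIs paginate by page number or cursor instead:

```python
def fetch_all_pages(fetch_page, page_size=50, max_pages=1000):
    """Collect results from an offset-paginated JSON API until a short page."""
    results = []
    for page in range(max_pages):
        batch = fetch_page(offset=page * page_size, limit=page_size)
        results.extend(batch)
        if len(batch) < page_size:  # short or empty page means we're done
            break
    return results

# Usage with a fake page fetcher standing in for the HTTP call:
items = list(range(120))
fetch_all_pages(lambda offset, limit: items[offset:offset + limit])  # 120 results
```

The `max_pages` bound guards against an API that keeps returning full pages forever.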
"""
{Retailer} store scraper.
Discovery: GraphQL API
Example reference: See homedepot.py for GraphQL Federation Gateway pattern.
"""
# TODO: Implement GraphQL query and response parsing
# See homedepot.py for query construction and pagination
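Cursor pagination over a GraphQL connection typically looks like the sketch below; the `edges`/`pageInfo` field names follow the common Relay connection convention, and the actual Home Depot schema may differ:

```python
def fetch_graphql_stores(execute, query, page_size=50):
    """Page through a Relay-style GraphQL connection using an 'after' cursor."""
    stores, cursor = [], None
    while True:
        data = execute(query, {'first': page_size, 'after': cursor})
        conn = data['stores']
        stores.extend(edge['node'] for edge in conn['edges'])
        if not conn['pageInfo']['hasNextPage']:
            return stores
        cursor = conn['pageInfo']['endCursor']
```

`execute` here is a stand-in for whatever function POSTs the query and variables to the endpoint and returns the decoded `data` payload.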
"""
{Retailer} store scraper.
Discovery: Multi-phase HTML crawl
Example reference: See verizon.py for 4-phase HTML crawl pattern.
"""
# TODO: Implement multi-phase crawl: index → region → city → store
# See verizon.py for the discovery phase chain
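The phase chain reduces to repeatedly mapping URLs to child URLs; a generic sketch (the function names are illustrative, not verizon.py's actual API):

```python
def crawl_phases(fetch, start_urls, phases):
    """Run link-extraction phases in order: index -> region -> city -> store."""
    urls = list(start_urls)
    for extract_links in phases:
        urls = [child for url in urls for child in extract_links(fetch(url))]
    return urls

# Usage against a toy site map instead of real HTTP + HTML parsing:
site = {
    'index': ['region-a'],
    'region-a': ['city-x'],
    'city-x': ['store-1', 'store-2'],
}
crawl_phases(site.get, ['index'], [list, list, list])  # ['store-1', 'store-2']
```

In a real scraper, `fetch` would issue the HTTP request and each phase would parse links out of the returned HTML.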
"""
{Retailer} store scraper.
Discovery: Store locator API (geo-radius queries)
Example reference: See staples.py for API + gap-fill pattern.
"""
# TODO: Implement geo-radius store locator queries
# See staples.py for StaplesConnect API + gap-fill strategy
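Geo-radius locators are usually driven by a grid of query points whose overlapping results get deduped; a sketch under those assumptions (the `store_id` dedupe key is hypothetical):

```python
def grid_points(lat_min, lat_max, lng_min, lng_max, step=1.0):
    """Yield (lat, lng) centers for geo-radius locator queries."""
    lat = lat_min
    while lat <= lat_max:
        lng = lng_min
        while lng <= lng_max:
            yield (lat, lng)
            lng += step
        lat += step

def dedupe_stores(responses, key='store_id'):
    """Merge overlapping radius results, keeping one record per store id."""
    seen = {}
    for stores in responses:
        for store in stores:
            seen.setdefault(store[key], store)
    return list(seen.values())
```

The step size trades request count against coverage; a gap-fill pass, presumably what staples.py's strategy does, would re-query with a finer grid wherever a response appears to hit the API's result cap.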
## Next steps

After scaffolding, the developer should:

1. Fill in `config/{retailer}_config.py`
2. Implement `discover_store_urls()` (fetch sitemap/API/crawl index)
3. Implement `extract_store_data()` (parse store page/response)
4. Add sample responses to `tests/test_scrapers/fixtures/{retailer}/`
5. Run `python run.py --retailer {retailer} --test`
6. Run `pytest tests/test_scrapers/test_{retailer}.py`