Build new data ingestion providers following the FF Analytics registry pattern. This skill should be used when adding new data sources (APIs, files, databases) to the data pipeline. Guides through creating provider packages, registry mappings, loader functions, storage integration, primary key tests, and sampling tools following established patterns.
Create complete data ingestion providers for the Fantasy Football Analytics project following established patterns. This skill automates the process of adding new data sources with proper structure, metadata, testing, and integration.
Use this skill proactively when:
The FF Analytics project follows these principles for data ingestion:
- `_meta.json` sidecars with lineage metadata
- `gs://` URIs supported alongside local paths

Follow this six-step process to create a complete provider:
Before coding, gather information about the provider:
Ask clarifying questions:
Research existing documentation:
Output: Clear understanding of:
Map datasets to loader functions and define metadata.
Use assets/registry_template.py as starting point.
For each dataset, define:
- `name`: Logical dataset name (lowercase, descriptive)
- `loader_function`: Function name in `loader.py`
- `primary_keys`: Tuple of columns that uniquely identify rows
- `description`: Brief description of dataset contents
- `notes`: Special considerations, dependencies, or caveats

Example registry design:
```python
REGISTRY = {
    "players": DatasetSpec(
        name="players",
        loader_function="load_players",
        primary_keys=("player_id",),
        description="Player biographical and career data",
        notes="Updates daily. Includes active and retired players.",
    ),
    "stats": DatasetSpec(
        name="stats",
        loader_function="load_stats",
        primary_keys=("player_id", "game_id", "stat_type"),
        description="Game-level player statistics",
        notes="Grain: one row per player per game per stat type",
    ),
}
```
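The registry above assumes a `DatasetSpec` container. The project's actual definition lives in the codebase; as an illustrative sketch only, a minimal version might look like:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Metadata describing one dataset exposed by a provider.

    Illustrative sketch — field names mirror the registry example above,
    not necessarily the project's real definition.
    """

    name: str
    loader_function: str
    primary_keys: tuple[str, ...]
    description: str = ""
    notes: str = ""


spec = DatasetSpec(
    name="players",
    loader_function="load_players",
    primary_keys=("player_id",),
)
print(spec.primary_keys)  # ('player_id',)
```

Freezing the dataclass keeps registry entries immutable, so tests and loaders can rely on the specs not changing at runtime.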
Quality checks:
- Loader function names follow the `load_{dataset_name}` pattern

Create the directory structure following the template.
See assets/package_structure.md for complete structure.
Create directories:
```bash
mkdir -p src/ingest/{provider}
mkdir -p tests
mkdir -p samples/{provider}
```
Create files:
- src/ingest/{provider}/__init__.py (empty or with exports)
- src/ingest/{provider}/registry.py (from Step 2)
- src/ingest/{provider}/loader.py (will implement in Step 4)
- tests/test_{provider}_samples_pk.py (will implement in Step 5)

Naming:
- Lowercase snake_case, e.g. nflverse, espn_api, my_provider

Create loader functions using the storage helper pattern.
Use assets/loader_template.py as starting point.
For each dataset in registry:
Create loader function following signature:
```python
def load_{dataset_name}(
    out_dir: str = "data/raw/{provider}",
    **kwargs
) -> dict[str, Any]:
    ...
```
Implement data fetching:
Convert to DataFrame:
Write with storage helper:
```python
import json
from datetime import UTC, datetime

from ingest.common.storage import write_parquet_any, write_text_sidecar

# Write Parquet
write_parquet_any(df, parquet_file)

# Write metadata sidecar
metadata = {
    "dataset": dataset_name,
    "asof_datetime": datetime.now(UTC).isoformat(),
    "loader_path": "src.ingest.{provider}.loader.load_{dataset}",
    "source_name": "{PROVIDER}",
    "source_version": version,
    "output_parquet": parquet_file,
    "row_count": len(df),
}
write_text_sidecar(json.dumps(metadata, indent=2), f"{partition_dir}/_meta.json")
```
Return manifest:
```python
return {
    "dataset": dataset_name,
    "partition_dir": partition_dir,
    "parquet_file": parquet_file,
    "row_count": len(df),
    "metadata": metadata,
}
```
Reference examples:
- references/example_loader.py - Complete nflverse loader
- references/example_storage.py - Storage helper implementation

Common patterns:
- datetime.now(UTC) for all timestamps
- Short run IDs via uuid.uuid4().hex[:8]
- Date-partitioned output directories: dt=YYYY-MM-DD
- Handle local paths and gs:// URIs uniformly

Validate sample data quality with automated tests.
Use assets/test_template.py as starting point.
Test structure:
```python
@pytest.mark.parametrize("dataset_name,spec", REGISTRY.items())
def test_{provider}_primary_keys(dataset_name, spec):
    # 1. Find sample files
    # 2. Read with Polars
    # 3. Check PK columns exist
    # 4. Check PK uniqueness
    # 5. Report duplicates if found
    ...
```
What to test:
Run tests:
```bash
pytest tests/test_{provider}_samples_pk.py -v
```
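The uniqueness check at the heart of such a test reduces to counting key tuples. The real tests read the Parquet samples with Polars; this stdlib sketch with a hypothetical helper shows the core logic:

```python
from collections import Counter


def find_duplicate_keys(
    rows: list[dict], primary_keys: tuple[str, ...]
) -> list[tuple]:
    """Return key tuples appearing more than once (illustrative helper)."""
    counts = Counter(tuple(row[k] for k in primary_keys) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"player_id": "p1", "game_id": "g1"},
    {"player_id": "p1", "game_id": "g2"},
    {"player_id": "p1", "game_id": "g1"},  # duplicate key
]
print(find_duplicate_keys(rows, ("player_id", "game_id")))  # [('p1', 'g1')]
```

Returning the offending key tuples, rather than a bare pass/fail, makes duplicate reports in test output immediately actionable.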
Connect the provider to existing workflows.
Update tools/make_samples.py:
Add provider-specific sampling logic:
```python
# In make_samples.py argument parser
elif args.provider == "{provider}":
    from ingest.{provider}.loader import load_{dataset}

    # Provider-specific argument parsing
    datasets = args.datasets or ["default_dataset"]
    for dataset in datasets:
        result = load_{dataset}(
            out_dir=args.out,
            **provider_kwargs
        )
        print(f"✓ Sampled {dataset}: {result['row_count']} rows")
```
Update documentation:
- src/ingest/CLAUDE.md - Add provider-specific notes
- CLAUDE.md - If architecturally significant
- README.md - If user-facing

Create sample data:
```bash
uv run python tools/make_samples.py {provider} --datasets {dataset1} {dataset2} --out ./samples
```
Validate:
```bash
# Check sample data created
ls -la samples/{provider}/

# Run PK tests
pytest tests/test_{provider}_samples_pk.py -v

# Check metadata
cat samples/{provider}/{dataset}/dt=*/_meta.json | jq .
```
Provider implementation examples from codebase:
Load these references when implementing a new provider to see proven patterns.
Templates for creating new providers:
Use these templates directly when generating provider code.
Environment variables:
```python
import os

api_key = os.environ.get("{PROVIDER}_API_KEY")
if not api_key:
    raise ValueError("Set {PROVIDER}_API_KEY environment variable")
```
OAuth flow:
```python
from requests_oauthlib import OAuth2Session

oauth = OAuth2Session(client_id, token=token)
response = oauth.get(endpoint)
```
Offset-based:
```python
all_data = []
offset = 0
limit = 100

while True:
    response = fetch(offset=offset, limit=limit)
    data = response.json()
    all_data.extend(data)
    if len(data) < limit:
        break
    offset += limit
```
Cursor-based:
```python
all_data = []
cursor = None

while True:
    response = fetch(cursor=cursor)
    data = response.json()
    all_data.extend(data["results"])
    cursor = data.get("next_cursor")
    if not cursor:
        break
```
Simple delay:
```python
import time

for dataset in datasets:
    result = load_dataset()
    time.sleep(1)  # 1 second between requests
```
Exponential backoff:
```python
import time

from requests.exceptions import HTTPError

max_retries = 3
for attempt in range(max_retries):
    try:
        response = fetch()
        response.raise_for_status()
        break
    except HTTPError as e:
        if e.response.status_code == 429 and attempt < max_retries - 1:
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            time.sleep(wait_time)
        else:
            raise  # non-rate-limit error, or retries exhausted
```
When helping user create a provider:
After Step 2 (Registry Design):
✅ Registry Designed: {provider}
Datasets defined:
- {dataset1}: {description} (PK: {pk_columns})
- {dataset2}: {description} (PK: {pk_columns})
Ready to create package structure (Step 3)?
After Step 4 (Loader Implementation):
✅ Loaders Implemented
Created loader functions:
- load_{dataset1}() - Fetches from {source}
- load_{dataset2}() - Fetches from {source}
All loaders use storage helpers and write metadata sidecars.
Ready to create tests (Step 5)?
After Step 6 (Integration Complete):
✅ Provider Integration Complete: {provider}
Created:
- Registry: src/ingest/{provider}/registry.py ({N} datasets)
- Loaders: src/ingest/{provider}/loader.py
- Tests: tests/test_{provider}_samples_pk.py
- Samples: samples/{provider}/ ({N} datasets)
Integration:
- ✓ Added to tools/make_samples.py
- ✓ Updated documentation
- ✓ Primary key tests passing ({N}/{N})
To use:
```bash
# Generate samples
uv run python tools/make_samples.py {provider} --datasets all --out ./samples
# Run tests
pytest tests/test_{provider}_samples_pk.py -v
# Use in production
from ingest.{provider}.loader import load_{dataset}
result = load_{dataset}(out_dir="gs://ff-analytics/raw/{provider}")
```
User says: "Add integration for the ESPN Fantasy API"
Response:
User says: "I have the API docs for PFF, help me integrate it"
Response:
User says: "The nflverse loader is missing a dataset"
Response:
Issue: Primary key tests failing
Issue: Storage helper fails with GCS
- Check the GOOGLE_APPLICATION_CREDENTIALS environment variable
- See references/example_storage.py for patterns

Issue: Loader returns empty data
Issue: Make_samples.py not finding provider
- Verify the provider package exists at src/ingest/{provider}/

This skill works well with: