Build new data ingestion providers following the FF Analytics registry pattern. This skill should be used when adding new data sources (APIs, files, databases) to the data pipeline. Guides through creating provider packages, registry mappings, loader functions, storage integration, primary key tests, and sampling tools following established patterns.
Create complete data ingestion providers for the Fantasy Football Analytics project following established patterns. This skill automates the process of adding new data sources with proper structure, metadata, testing, and integration.
Use this skill proactively when:
The FF Analytics project follows these principles for data ingestion:
- `_meta.json` sidecars with lineage metadata
- `gs://` URIs supported alongside local paths

Follow this six-step process to create a complete provider:
Before coding, gather information about the provider:
Ask clarifying questions:
Research existing documentation:
Output: Clear understanding of:
Map datasets to loader functions and define metadata.
Use assets/registry_template.py as starting point.
For each dataset, define:
- `name`: Logical dataset name (lowercase, descriptive)
- `loader_function`: Function name in `loader.py`
- `primary_keys`: Tuple of columns that uniquely identify rows
- `description`: Brief description of dataset contents
- `notes`: Special considerations, dependencies, or caveats

Example registry design:
```python
REGISTRY = {
    "players": DatasetSpec(
        name="players",
        loader_function="load_players",
        primary_keys=("player_id",),
        description="Player biographical and career data",
        notes="Updates daily. Includes active and retired players.",
    ),
    "stats": DatasetSpec(
        name="stats",
        loader_function="load_stats",
        primary_keys=("player_id", "game_id", "stat_type"),
        description="Game-level player statistics",
        notes="Grain: one row per player per game per stat type",
    ),
}
```
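The registry above assumes a `DatasetSpec` container. The project's actual definition lives in the codebase; as an illustrative sketch only, a minimal version might look like:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Metadata describing one dataset exposed by a provider.

    Illustrative sketch — field names mirror the registry example above,
    not necessarily the project's real definition.
    """

    name: str
    loader_function: str
    primary_keys: tuple[str, ...]
    description: str = ""
    notes: str = ""


spec = DatasetSpec(
    name="players",
    loader_function="load_players",
    primary_keys=("player_id",),
)
print(spec.primary_keys)  # ('player_id',)
```

Freezing the dataclass keeps registry entries immutable, so tests and loaders can rely on the specs not changing at runtime.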
Quality checks:
- Loader function names follow the `load_{dataset_name}` pattern

Create the directory structure following the template.
See assets/package_structure.md for complete structure.
Create directories:
```bash
mkdir -p src/ingest/{provider}
mkdir -p tests
mkdir -p samples/{provider}
```
Create files:
- src/ingest/{provider}/__init__.py (empty or with exports)
- src/ingest/{provider}/registry.py (from Step 2)
- src/ingest/{provider}/loader.py (will implement in Step 4)
- tests/test_{provider}_samples_pk.py (will implement in Step 5)

Naming:
- Lowercase snake_case, e.g. nflverse, espn_api, my_provider

Create loader functions using the storage helper pattern.
Use assets/loader_template.py as starting point.
For each dataset in registry:
Create loader function following signature:
```python
def load_{dataset_name}(
    out_dir: str = "data/raw/{provider}",
    **kwargs
) -> dict[str, Any]:
    ...
```
Implement data fetching:
Convert to DataFrame:
Write with storage helper:
```python
import json
from datetime import UTC, datetime

from ingest.common.storage import write_parquet_any, write_text_sidecar

# Write Parquet
write_parquet_any(df, parquet_file)

# Write metadata sidecar
metadata = {
    "dataset": dataset_name,
    "asof_datetime": datetime.now(UTC).isoformat(),
    "loader_path": "src.ingest.{provider}.loader.load_{dataset}",
    "source_name": "{PROVIDER}",
    "source_version": version,
    "output_parquet": parquet_file,
    "row_count": len(df),
}
write_text_sidecar(json.dumps(metadata, indent=2), f"{partition_dir}/_meta.json")
```
Return manifest:
```python
return {
    "dataset": dataset_name,
    "partition_dir": partition_dir,
    "parquet_file": parquet_file,
    "row_count": len(df),
    "metadata": metadata,
}
```
Reference examples:
- references/example_loader.py - Complete nflverse loader
- references/example_storage.py - Storage helper implementation

Common patterns:
- datetime.now(UTC) for all timestamps
- Short run IDs via uuid.uuid4().hex[:8]
- Date-partitioned output directories: dt=YYYY-MM-DD
- Handle local paths and gs:// URIs uniformly

Validate sample data quality with automated tests.
Use assets/test_template.py as starting point.
Test structure:
```python
@pytest.mark.parametrize("dataset_name,spec", REGISTRY.items())
def test_{provider}_primary_keys(dataset_name, spec):
    # 1. Find sample files
    # 2. Read with Polars
    # 3. Check PK columns exist
    # 4. Check PK uniqueness
    # 5. Report duplicates if found
    ...
```
What to test:
Run tests:
```bash
pytest tests/test_{provider}_samples_pk.py -v
```
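The uniqueness check at the heart of such a test reduces to counting key tuples. The real tests read the Parquet samples with Polars; this stdlib sketch with a hypothetical helper shows the core logic:

```python
from collections import Counter


def find_duplicate_keys(
    rows: list[dict], primary_keys: tuple[str, ...]
) -> list[tuple]:
    """Return key tuples appearing more than once (illustrative helper)."""
    counts = Counter(tuple(row[k] for k in primary_keys) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"player_id": "p1", "game_id": "g1"},
    {"player_id": "p1", "game_id": "g2"},
    {"player_id": "p1", "game_id": "g1"},  # duplicate key
]
print(find_duplicate_keys(rows, ("player_id", "game_id")))  # [('p1', 'g1')]
```

Returning the offending key tuples, rather than a bare pass/fail, makes duplicate reports in test output immediately actionable.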
Connect the provider to existing workflows.
Update tools/make_samples.py:
Add provider-specific sampling logic:
```python
# In make_samples.py argument parser
elif args.provider == "{provider}":
    from ingest.{provider}.loader import load_{dataset}

    # Provider-specific argument parsing
    datasets = args.datasets or ["default_dataset"]
    for dataset in datasets:
        result = load_{dataset}(
            out_dir=args.out,
            **provider_kwargs
        )
        print(f"✓ Sampled {dataset}: {result['row_count']} rows")
```
Update documentation:
- src/ingest/CLAUDE.md - Add provider-specific notes
- CLAUDE.md - If architecturally significant
- README.md - If user-facing

Create sample data:
```bash
uv run python tools/make_samples.py {provider} --datasets {dataset1} {dataset2} --out ./samples
```
Validate:
```bash
# Check sample data created
ls -la samples/{provider}/

# Run PK tests
pytest tests/test_{provider}_samples_pk.py -v

# Check metadata
cat samples/{provider}/{dataset}/dt=*/_meta.json | jq .
```
Provider implementation examples from codebase:
Load these references when implementing a new provider to see proven patterns.
Templates for creating new providers:
Use these templates directly when generating provider code.
Environment variables:
```python
import os

api_key = os.environ.get("{PROVIDER}_API_KEY")
if not api_key:
    raise ValueError("Set {PROVIDER}_API_KEY environment variable")
```
OAuth flow:
```python
from requests_oauthlib import OAuth2Session

oauth = OAuth2Session(client_id, token=token)
response = oauth.get(endpoint)
```
Offset-based:
```python
all_data = []
offset = 0
limit = 100

while True:
    response = fetch(offset=offset, limit=limit)
    data = response.json()
    all_data.extend(data)
    if len(data) < limit:
        break
    offset += limit
```
Cursor-based:
```python
all_data = []
cursor = None

while True:
    response = fetch(cursor=cursor)
    data = response.json()
    all_data.extend(data["results"])
    cursor = data.get("next_cursor")
    if not cursor:
        break
```
Simple delay:
```python
import time

for dataset in datasets:
    result = load_dataset()
    time.sleep(1)  # 1 second between requests
```
Exponential backoff:
```python
import time

from requests.exceptions import HTTPError

max_retries = 3
for attempt in range(max_retries):
    try:
        response = fetch()
        response.raise_for_status()
        break
    except HTTPError as e:
        if e.response.status_code == 429 and attempt < max_retries - 1:
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            time.sleep(wait_time)
        else:
            raise  # non-rate-limit error, or retries exhausted
```
When helping user create a provider:
After Step 2 (Registry Design):
✅ Registry Designed: {provider}
Datasets defined:
- {dataset1}: {description} (PK: {pk_columns})
- {dataset2}: {description} (PK: {pk_columns})
Ready to create package structure (Step 3)?
After Step 4 (Loader Implementation):
✅ Loaders Implemented
Created loader functions:
- load_{dataset1}() - Fetches from {source}
- load_{dataset2}() - Fetches from {source}
All loaders use storage helpers and write metadata sidecars.
Ready to create tests (Step 5)?
After Step 6 (Integration Complete):
✅ Provider Integration Complete: {provider}
Created:
- Registry: src/ingest/{provider}/registry.py ({N} datasets)
- Loaders: src/ingest/{provider}/loader.py
- Tests: tests/test_{provider}_samples_pk.py
- Samples: samples/{provider}/ ({N} datasets)
Integration:
- ✓ Added to tools/make_samples.py
- ✓ Updated documentation
- ✓ Primary key tests passing ({N}/{N})
To use:
```bash
# Generate samples
uv run python tools/make_samples.py {provider} --datasets all --out ./samples
# Run tests
pytest tests/test_{provider}_samples_pk.py -v
# Use in production
from ingest.{provider}.loader import load_{dataset}
result = load_{dataset}(out_dir="gs://ff-analytics/raw/{provider}")
```
User says: "Add integration for the ESPN Fantasy API"
Response:
User says: "I have the API docs for PFF, help me integrate it"
Response:
User says: "The nflverse loader is missing a dataset"
Response:
Issue: Primary key tests failing
Issue: Storage helper fails with GCS
- Check the GOOGLE_APPLICATION_CREDENTIALS environment variable
- See references/example_storage.py for patterns

Issue: Loader returns empty data
Issue: Make_samples.py not finding provider
- Verify the provider package exists at src/ingest/{provider}/

This skill works well with: