Skill file

Pipeline Data Contract Validation

Name: Pipeline Data Contract Validation
Author: rengotaku

Validate data contracts between pipeline stages against real output during the setup phase, to prevent failures caused by wrong schema assumptions.

rengotaku, 0 stars, 22.02.2026

Categories: Data Engineering

Skill content

Problem

When implementing features that consume data from earlier pipeline stages, it's easy to make false assumptions about data schemas. This leads to:

Implementation based on incorrect assumptions
All tests passing with mock data
Complete failure in production/real pipeline
Wasted implementation effort (4+ hours in this case)

Specific Example: The implementation assumed that layout.json contained an ocr_text field, but the actual yomitoku output had only bbox, type, and label. The code detection feature was implemented and passed all of its tests against mocks, then failed completely in the real pipeline.

Solution

1. Setup Phase: Real Data Schema Validation (MANDATORY)

# Task Definition Pattern
- [ ] T001 Identify upstream component (e.g., yomitoku)
- [ ] T001a **Run upstream component and capture actual output**
- [ ] T001b **Document all fields present in real output**
- [ ] T001c **Document all fields MISSING from real output**
- [ ] T001d **Validate assumptions against real schema**

# Example
python -m src.detect_layout --input samples/page.jpg --output /tmp/layout.json
jq '.regions[0]' /tmp/layout.json  # Inspect actual fields
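Tasks T001b through T001d can also be scripted instead of read off by eye. The following is a minimal sketch, assuming the captured output is a JSON object with a top-level `regions` list as in the command above. The helper names `summarize_region_fields` and `missing_fields` are hypothetical and not part of yomitoku or the original skill.

```python
import json
from collections import Counter


def summarize_region_fields(layout: dict) -> Counter:
    """Count how many regions contain each field name (T001b)."""
    counts = Counter()
    for region in layout.get("regions", []):
        counts.update(region.keys())
    return counts


def missing_fields(layout: dict, assumed: set[str]) -> set[str]:
    """Return assumed fields that are absent from at least one region (T001c/T001d)."""
    regions = layout.get("regions", [])
    counts = summarize_region_fields(layout)
    present_everywhere = {name for name, n in counts.items() if n == len(regions)}
    return assumed - present_everywhere


if __name__ == "__main__":
    # Stand-in for the captured /tmp/layout.json; the bbox numbers are placeholders.
    layout = json.loads(
        '{"regions": [{"bbox": [0, 0, 10, 10], "type": "TEXT", "label": "text"}]}'
    )
    print(dict(summarize_region_fields(layout)))
    print(missing_fields(layout, {"bbox", "ocr_text"}))
```

An empty `regions` list reports every assumed field as missing, which seems the safer default: an assumption that was never observed in real output stays unconfirmed.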

2. Document Data Flow with Field Availability

# Improved Data Flow Diagram
[Image] → [yomitoku] → [layout.json]
                          ↓
                    ✅ fields: bbox, type, label
                    ❌ missing: ocr_text ← EXPLICIT!
                          ↓
                      [OCR execution]
                          ↓
                    ✅ fields: text content
                          ↓
                   [your_feature] ← Can use text here
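The field availability in the diagram can also be kept in a machine-checkable form, so that a feature's requirements are compared against a stage before any implementation starts. This is a rough sketch only: the stage labels `layout` and `ocr`, the field name `text` for the OCR output, and both helper functions are assumptions made for illustration.

```python
# Fields each stage's output is known to contain, copied from the diagram.
# The stage keys and the OCR field name "text" are labels chosen for this sketch.
STAGE_FIELDS: dict[str, set[str]] = {
    "layout": {"bbox", "type", "label"},  # layout.json produced by yomitoku
    "ocr": {"text"},                      # text content produced by the OCR step
}


def unavailable_fields(stage: str, required: set[str]) -> set[str]:
    """Return the required fields that the given stage does not provide."""
    return required - STAGE_FIELDS[stage]


def stages_providing(field: str) -> list[str]:
    """Return the stages whose output contains the given field."""
    return sorted(s for s, fields in STAGE_FIELDS.items() if field in fields)
```

Calling `unavailable_fields("layout", {"ocr_text"})` returns `{"ocr_text"}`, which makes the missing field explicit in the same way the diagram does.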

3. Integration Tests: Real Pipeline Data Only

# ❌ BAD: Mock data that doesn't match reality
def test_code_detection():
    layout = {"regions": [{"type": "TEXT", "ocr_text": "def foo(): pass"}]}
    result = detect_code(layout)  # Passes but wrong!

# ✅ GOOD: Use actual pipeline output
import json

def test_code_detection():
    # Load actual yomitoku output captured during the setup phase
    with open("tests/fixtures/real_layout.json") as f:
        layout = json.load(f)
    result = detect_code(layout)  # Catches schema mismatch
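Where hand-written mocks are still useful for unit tests, one option is to check them against the real fixture so they cannot drift from the actual schema. The sketch below assumes the fixture path from the example above. The functions `region_keys`, `assert_mock_matches_real`, and `load_real_layout` are hypothetical helpers, not part of any library.

```python
import json
from pathlib import Path


def region_keys(layout: dict) -> set[str]:
    """Union of field names used by any region in a layout dict."""
    keys: set[str] = set()
    for region in layout.get("regions", []):
        keys.update(region.keys())
    return keys


def assert_mock_matches_real(mock: dict, real: dict) -> None:
    """Fail if the mock uses region fields that never appear in real output."""
    invented = region_keys(mock) - region_keys(real)
    if invented:
        raise AssertionError(
            f"mock uses fields absent from real output: {sorted(invented)}"
        )


def load_real_layout(path: str = "tests/fixtures/real_layout.json") -> dict:
    """Load the captured upstream output used as the schema reference."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

With this guard, the BAD mock above would be rejected because `ocr_text` does not occur in any real region.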

4. Explicit Preconditions in Task Definitions

# Task Definition with Preconditions
- [ ] T017 Implement detect_code_regions()
  - **Precondition**: layout.json contains `ocr_text` field
  - **Verification**: Confirmed in T001b
  - **If missing**: Adjust design to use alternative data source
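The same precondition can be enforced at runtime, so that a violated contract fails with a clear message rather than a bare KeyError deep inside the feature. This is a minimal sketch: `ContractError` and `require_region_fields` are hypothetical names introduced here, and the task ID is passed in only to make the message traceable.

```python
class ContractError(ValueError):
    """Raised when upstream data does not satisfy a documented precondition."""


def require_region_fields(layout: dict, required: set[str], task: str) -> None:
    """Check that every region in the layout carries all required fields."""
    for index, region in enumerate(layout.get("regions", [])):
        missing = required - set(region)
        if missing:
            raise ContractError(
                f"{task}: region {index} lacks {sorted(missing)}; "
                "adjust the design to use an alternative data source"
            )
```

A call such as `require_region_fields(layout, {"ocr_text"}, "T017")` at the top of `detect_code_regions()` would surface the mismatch on the first real input.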

Example

Before (Failure Case)

# spec.md assumption
"Use ocr_text from layout.json regions"

# T001: Check yomitoku API
✓ Document function names
✓ Document processing flow
✗ Verify actual output schema  # SKIPPED!

# Implementation
def detect_code(layout):
    for region in layout["regions"]:
        text = region["ocr_text"]  # Field doesn't exist!

After (Success Case)

# T001a: Verify real schema
$ python -m src.detect_layout --input sample.jpg --output /tmp/layout.json
$ cat /tmp/layout.json
{
  "regions": [
    {"bbox": [...], "type": "TEXT", "label": "text"}
    # Note: NO ocr_text field!
  ]
}

# T001b: Update design
"ocr_text not available in layout.json
 → Must use rover/*.txt files from OCR step
 → Adjust implementation to read from correct source"
