Validate data contracts between pipeline stages using real output during the setup phase to prevent schema assumption failures
When implementing features that consume data from earlier pipeline stages, it is easy to make false assumptions about data schemas. The result is code that passes mock-based tests but fails as soon as it runs against real upstream output.
Specific Example: The design assumed that `layout.json` contained an `ocr_text` field, but the actual yomitoku output only had `bbox`, `type`, and `label`. The code detection feature was implemented and tested successfully with mocks, but failed completely in the real pipeline.
# Task Definition Pattern
- [ ] T001 Identify upstream component (e.g., yomitoku)
- [ ] T001a **Run upstream component and capture actual output**
- [ ] T001b **Document all fields present in real output**
- [ ] T001c **Document all fields MISSING from real output**
- [ ] T001d **Validate assumptions against real schema**
Action: Run the actual pipeline step and inspect the output file:
# Example
python -m src.detect_layout --input samples/page.jpg --output /tmp/layout.json
jq '.regions[0]' /tmp/layout.json  # Inspect actual fields
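Inspecting a single region by eye can miss fields that appear only in some regions. The following is a minimal sketch of tasks T001b and T001c, assuming the captured output has a top-level `regions` list of dicts as in the example above. The helper names `collect_region_fields` and `compare_assumptions` are hypothetical, and the sample data is inline so the snippet runs without a real capture.

```python
from collections import Counter


def collect_region_fields(layout):
    """Count how often each key appears across all regions."""
    counts = Counter()
    for region in layout.get("regions", []):
        counts.update(region.keys())
    return counts


def compare_assumptions(layout, assumed_fields):
    """Split assumed fields into (present in every region, missing from at least one)."""
    regions = layout.get("regions", [])
    counts = collect_region_fields(layout)
    present = sorted(
        f for f in assumed_fields if regions and counts[f] == len(regions)
    )
    missing = sorted(f for f in assumed_fields if f not in present)
    return present, missing


# Inline sample shaped like the output described above (assumption);
# in practice, load the captured file with json.load() instead.
sample_layout = {
    "regions": [
        {"bbox": [0, 0, 10, 10], "type": "TEXT", "label": "text"},
        {"bbox": [0, 20, 10, 30], "type": "TEXT", "label": "text"},
    ]
}

present, missing = compare_assumptions(sample_layout, {"bbox", "type", "ocr_text"})
print("present:", present)
print("missing:", missing)
```

Writing the `missing` list into the task notes makes the absence of a field as visible as the presence of the others.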
# Improved Data Flow Diagram
[Image] → [yomitoku] → [layout.json]
                            ↓
                  ✅ fields: bbox, type, label
                  ❌ missing: ocr_text ← EXPLICIT!
                            ↓
                     [OCR execution]
                            ↓
                  ✅ fields: text content
                            ↓
                  [your_feature] ← Can use text here
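The same diagram can be kept next to the code as a small lookup table, so a feature can check which stage actually provides a field before depending on it. This is only a sketch: the stage names `layout` and `ocr`, the field name `text`, and both helper functions are assumptions made for illustration, not part of yomitoku or the pipeline.

```python
# Hypothetical contract table: stage name -> fields its output provides.
# Fill this in from real captured output, not from the spec.
STAGE_OUTPUTS = {
    "layout": {"bbox", "type", "label"},
    "ocr": {"text"},
}


def missing_from_stage(stage, required_fields):
    """Return the required fields that the given stage does not provide."""
    return sorted(set(required_fields) - STAGE_OUTPUTS[stage])


def stages_providing(field):
    """Return the stages whose output contains the given field."""
    return sorted(name for name, fields in STAGE_OUTPUTS.items() if field in fields)


print(missing_from_stage("layout", ["type", "ocr_text"]))
print(stages_providing("text"))
```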
# ❌ BAD: Mock data that doesn't match reality
def test_code_detection():
    layout = {"regions": [{"type": "TEXT", "ocr_text": "def foo(): pass"}]}
    result = detect_code(layout)  # Passes but wrong!
# ✅ GOOD: Use actual pipeline output
import json

def test_code_detection():
    # Load actual yomitoku output captured from a real run
    with open("tests/fixtures/real_layout.json") as f:
        layout = json.load(f)
    result = detect_code(layout)  # Catches schema mismatch
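Where mocks are still useful for speed, one option is to check that every mock only uses fields seen in a real capture. The sketch below assumes the same `regions` layout as the examples above. `region_keys` and `unknown_mock_fields` are hypothetical helpers, and `real_layout` stands in for the contents of the captured fixture.

```python
def region_keys(layout):
    """Collect every key used by any region in the layout."""
    keys = set()
    for region in layout.get("regions", []):
        keys.update(region)
    return keys


def unknown_mock_fields(mock_layout, real_layout):
    """Return fields the mock uses that never appear in the real output."""
    return sorted(region_keys(mock_layout) - region_keys(real_layout))


mock_layout = {"regions": [{"type": "TEXT", "ocr_text": "def foo(): pass"}]}
# Stand-in for the captured fixture (assumption); load the real file in practice.
real_layout = {"regions": [{"bbox": [0, 0, 10, 10], "type": "TEXT", "label": "text"}]}

invented = unknown_mock_fields(mock_layout, real_layout)
print("fields that exist only in the mock:", invented)
```

A test that asserts `invented == []` would have flagged the `ocr_text` mock before any feature code was written.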
# Task Definition with Preconditions
- [ ] T017 Implement detect_code_regions()
  - **Precondition**: layout.json contains `ocr_text` field
  - **Verification**: Confirmed in T001b
  - **If missing**: Adjust design to use alternative data source
# spec.md assumption
"Use ocr_text from layout.json regions"
# T001: Check yomitoku API
✓ Document function names
✓ Document processing flow
✗ Verify actual output schema # SKIPPED!
# Implementation
def detect_code(layout):
    for region in layout["regions"]:
        text = region["ocr_text"]  # Field doesn't exist!
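As written, the implementation above would surface the problem only as a bare `KeyError` deep inside the loop. A hedged alternative is to check the contract at the function boundary and report what is actually available. In this sketch, `SchemaMismatchError` and `require_region_fields` are hypothetical names, and the `"def " in text` test is only a placeholder for the real detection logic, which the source does not show.

```python
class SchemaMismatchError(ValueError):
    """Raised when upstream output lacks a field the feature depends on."""


def require_region_fields(layout, required):
    """Fail fast, naming the missing fields and the fields that do exist."""
    for index, region in enumerate(layout.get("regions", [])):
        missing = sorted(set(required) - set(region))
        if missing:
            raise SchemaMismatchError(
                f"region {index} is missing {missing}; "
                f"available fields: {sorted(region)}"
            )


def detect_code(layout):
    require_region_fields(layout, ["ocr_text"])
    # Placeholder heuristic (assumption), not the original detection logic.
    return [r for r in layout["regions"] if "def " in r["ocr_text"]]


try:
    detect_code({"regions": [{"bbox": [0, 0, 1, 1], "type": "TEXT", "label": "text"}]})
except SchemaMismatchError as exc:
    print(exc)
```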
# T001a: Verify real schema
$ python -m src.detect_layout --input sample.jpg --output /tmp/layout.json
$ cat /tmp/layout.json
{
  "regions": [
    {"bbox": [...], "type": "TEXT", "label": "text"}
    # Note: NO ocr_text field!
  ]
}
# T001b: Update design
"ocr_text not available in layout.json
→ Must use rover/*.txt files from OCR step
→ Adjust implementation to read from correct source"
Trigger this pattern when:
Red Flags (use immediately):
Prevention Checklist:
From Original Incident:
With This Pattern: