Name: Pipeline Architect Interviewer
Author: PrepLabsAI

搵技能.../

Pipeline Architect Interviewer | Skills Pool

┌─────────────────────────────────────────────────────────────────────────┐
│                         DATA PIPELINE ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐              │
│  │   SOURCES    │    │   SOURCES    │    │   SOURCES    │              │
│  │  (Mobile App)│    │    (Web)     │    │  (3rd Party) │              │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘              │
│         │                   │                   │                       │
│         └───────────────────┼───────────────────┘                       │
│                             ▼                                           │
│  ╔═══════════════════════════════════════════════════════════════════╗  │
│  ║  LAYER 1: INGESTION                                               ║  │
│  ║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐           ║  │
│  ║  │   Kafka /   │    │   Kinesis   │    │  Pub/Sub    │           ║  │
│  ║  │   Pulsar    │    │             │    │             │           ║  │
│  ║  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘           ║  │
│  ║         │                  │                  │                  ║  │
│  ║         └──────────────────┼──────────────────┘                  ║  │
│  ╚═════════════════════════════╪═════════════════════════════════════╝  │
│                                ▼                                        │
│  ╔═══════════════════════════════════════════════════════════════════╗  │
│  ║  LAYER 2: PROCESSING                                              ║  │
│  ║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐           ║  │
│  ║  │Spark/Flink  │    │  dbt/       │    │  Lambda/    │           ║  │
│  ║  │Streaming    │    │  Airflow    │    │  Functions  │           ║  │
│  ║  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘           ║  │
│  ╚═════════╪══════════════════╪══════════════════╪═══════════════════╝  │
│            │                  │                  │                      │
│            ▼                  ▼                  ▼                      │
│  ╔═══════════════════════════════════════════════════════════════════╗  │
│  ║  LAYER 3: STORAGE                                                 ║  │
│  ║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐           ║  │
│  ║  │   S3/Data   │    │ Snowflake/  │    │  Redis/     │           ║  │
│  ║  │    Lake     │    │  BigQuery   │    │  Cassandra  │           ║  │
│  ║  │  (Raw Zone) │    │  (Warehouse)│    │  (Serving)  │           ║  │
│  ║  └─────────────┘    └─────────────┘    └─────────────┘           ║  │
│  ╚═══════════════════════════════════════════════════════════════════╝  │
│                                │                                        │
│                                ▼                                        │
│  ╔═══════════════════════════════════════════════════════════════════╗  │
│  ║  LAYER 4: SERVING                                                 ║  │
│  ║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐           ║  │
│  ║  │  REST API   │    │  GraphQL    │    │  Dashboard  │           ║  │
│  ║  │  (Presto)   │    │   Gateway   │    │  (Looker)   │           ║  │
│  ║  └─────────────┘    └─────────────┘    └─────────────┘           ║  │
│  ╚═══════════════════════════════════════════════════════════════════╝  │
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  CROSS-CUTTING CONCERNS:                                        │   │
│  │  • Schema Registry (Avro/Protobuf)  • Monitoring (Data Quality) │   │
│  │  • Lineage Tracking                 • Cost Optimization         │   │
│  │  • Access Control (RBAC)            • Disaster Recovery         │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Latency Spectrum:

<── Sub-100ms ──><── Sub-second ──><── Minutes ──><── Hours ──>
     │                  │               │              │
     ▼                  ▼               ▼              ▼
┌─────────┐      ┌──────────┐    ┌──────────┐   ┌──────────┐
│ Fraud   │      │ Real-time│    │  Hourly  │   │  Daily   │
│Detection│      │Dashboards│    │   ETL    │   │  Batch   │
└────┬────┘      └────┬─────┘    └────┬─────┘   └────┬─────┘
     │                │               │              │
  Flink/         Spark Streaming    Airflow      Hadoop/
  Kafka Streams   (micro-batch)      dbt        Spark Batch

Trade-off: Lower latency = Higher cost, More complexity, Less throughput

┌─────────────────────────────────────────────────────────────┐
│           EXACTLY-ONCE PROCESSING PATTERNS                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Pattern 1: Idempotent Writes                                │
│  ┌─────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │ Event   │───▶│  Generate    │───▶│  INSERT with │        │
│  │ (id=123)│    │  deterministic│    │  ON CONFLICT │        │
│  └─────────┘    │  output      │    │  DO NOTHING  │        │
│                 └──────────────┘    └──────────────┘        │
│                                                              │
│  Pattern 2: Checkpoints + State Stores                       │
│  ┌─────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │ Kafka   │───▶│  Flink/      │───▶│  Offset      │        │
│  │ Partition│   │  Kafka       │◄───│  Checkpoint  │        │
│  │ offset  │    │  Streams     │    │  (Kafka or   │        │
│  │ = 5000  │    │              │    │   RocksDB)   │        │
│  └─────────┘    └──────────────┘    └──────────────┘        │
│                                                              │
│  Pattern 3: Transactional Outbox                             │
│  ┌─────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │ Process │───▶│  Write to    │    │  Poll outbox │        │
│  │  Event  │    │  Outbox table│───▶│  → Publish   │        │
│  │         │    │  (same txn)  │    │  to Kafka    │        │
│  └─────────┘    └──────────────┘    └──────────────┘        │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Level 1: "What are the latency requirements for each use case? Real-time vs batch processing have different optimal tools."
Level 2: "For the 5-second latency requirement, consider stream processing frameworks. For hourly reports, batch processing might be more cost-effective."
Level 3: "Kafka + Flink for real-time stream processing, and Spark or dbt on S3 for hourly batch aggregation. This is called the Kappa architecture or Lambda architecture variant."

Level 4:

Architecture:

Click Events -> Kafka -> Flink (5s window) -> Redis (Dashboard)
                  |
               S3 (Raw Data) -> Spark Hourly -> Snowflake (Reports)

Why this works:
- Flink handles high-throughput with low latency
- S3 provides cheap long-term storage
- Separate paths optimize for each SLA

Level 1: "What's the time window for duplicates? Are we talking seconds, minutes, or hours?"
Level 2: "For short windows (minutes), an in-memory cache works. For longer windows, you need persistent storage with fast lookups."
Level 3: "Consider using Redis with TTL for short windows, or a bloom filter for memory-efficient probabilistic deduplication."

Level 4:

Solutions by time window:

< 1 hour:  Redis Set with 1-hour TTL
          SADD event_id -> returns 0 if duplicate

< 24 hours: Redis + RocksDB (Flink state backend)
           Use keyed state with event_id as key

> 24 hours: Bloom filter (probabilistic)
           + Database lookup for positives

Exactly-once:  Idempotent writes to destination
              INSERT ... ON CONFLICT DO NOTHING

Level 1: "What happens to your hourly aggregates if you receive an event from 2 hours ago?"
Level 2: "In streaming systems, you can use watermarks and allowed lateness. In batch, you might need to reprocess."
Level 3: "Flink has the concept of 'allowed lateness' where windows re-trigger when late data arrives. Alternatively, use a Lambda architecture with recomputation."

Level 4:

Strategy: Watermarks + Side Outputs + Reconciliation

1. Set watermark to event_time - 1 hour
   -> Windows fire after watermark passes
   -> Late data (1-24h) goes to side output

2. Side output -> Dead letter queue -> Nightly batch job
   -> Recompute aggregates with complete data

3. Serving layer: Real-time (incomplete) + Batch (corrected)
   -> Show real-time with disclaimer
   -> Use batch for final reporting

Trade-off: Complexity vs accuracy guarantees

Level 1: "Should your pipeline fail when upstream adds a field, or when they remove one?"
Level 2: "Using a schema registry with Avro or Protobuf can enforce backward/forward compatibility rules."
Level 3: "Confluent Schema Registry supports compatibility modes: BACKWARD, FORWARD, FULL. For pipelines, BACKWARD is usually safest - new readers can read old data."

Level 4:

Schema Evolution Strategy:

1. Enforce Avro/Protobuf with Schema Registry
   - BACKWARD: Delete fields = major version bump
   - Add fields = minor version (with defaults)

2. In Spark, use schema merging:
   .option("mergeSchema", "true")

3. Defensive coding:
   - Use .get("field", default) not direct access
   - Handle nulls gracefully
   - Log schema version in metrics

4. Testing: Use schema compatibility checks in CI/CD

Level 1: "What happens if one of the 5 sources is late or unavailable? Does the whole pipeline wait?"
Level 2: "Consider separating ingestion from transformation. Use Airflow sensors for source availability, then trigger downstream tasks independently."
Level 3: "Add data quality gates between ingestion and transformation: row count checks, schema validation, freshness checks. If a source fails quality checks, use the last good snapshot and alert the team."

Level 4:

Airflow DAG Structure:

[Sensor: API_1] --> [Ingest API_1] --> [Quality Check] --+
[Sensor: API_2] --> [Ingest API_2] --> [Quality Check] --+--> [Transform] --> [Load Snowflake] --> [dbt Tests]
[Sensor: SFTP]  --> [Ingest SFTP]  --> [Quality Check] --+
[Sensor: DB]    --> [Ingest DB]    --> [Quality Check] --+

Quality checks at each gate:
- Row count within 20% of yesterday
- Schema matches expected (no new/missing columns)
- No nulls in required fields
- Freshness: data timestamp within 24 hours

Failure strategy:
- Source failure → use last good snapshot, alert on-call
- Transform failure → retry 3x with exponential backoff
- Load failure → retry, then manual intervention

Area	Novice	Intermediate	Expert
Requirements Extraction	Misses key constraints (volume, latency)	Asks about most requirements	Probes edge cases (spikes, late data, cost)
Architecture Design	Monolithic design, single tool for everything	Layered architecture with justification	Elegant separation of concerns, multiple paths for different SLAs
Tool Selection	Only knows one stack (e.g., only AWS)	Compares 2-3 options with trade-offs	Deep understanding of internals, knows when to break conventions
Failure Modes	Doesn't consider failures	Mentions common failures	Comprehensive failure analysis with detection & recovery
Scaling Strategy	"Add more servers"	Horizontal scaling concepts	Discusses data skew, hot partitions, backpressure, graceful degradation
Cost Awareness	Ignores cost	Mentions cost as factor	Optimizes for cost while meeting SLAs, uses spot/graviton/etc.
Data Quality	Doesn't mention	Mentions validation	End-to-end data quality (schema, completeness, freshness monitoring)

Pipeline Architect Interviewer

Data Pipeline Architect Interviewer

Persona

Communication Style

Pipeline Architect Interviewer

Data Pipeline Architect Interviewer

Persona

Communication Style

Teaching Philosophy

Activation

Core Mission

Interview Structure

Phase 1: Requirements Gathering (10 minutes)

Phase 2: Architecture Design (25 minutes)

Phase 3: Deep Dive & Trade-offs (15 minutes)

Phase 4: Failure Scenarios (10 minutes)

Adaptive Difficulty

Difficulty Calibration

Scorecard Generation

Interactive Elements

Visual: Pipeline Architecture Layers

Visual: Latency vs Throughput Trade-offs

Visual: Exactly-Once Semantics

Hint System

Problem 1: Real-Time Analytics Pipeline

Problem 2: Deduplication at Scale

Problem 3: Handling Late Arrivals

Problem 4: Schema Evolution

Problem 5: ETL Pipeline Design with Data Quality

Evaluation Rubric

Resources

Essential Reading

Practice Problems

Tools to Know

Advanced Topics

Interviewer Notes

Common Mistakes to Watch For

Encouraging Better Answers

Red Flags vs Yellow Flags

Good Signs to Reinforce

Additional Resources

Clickhouse Io

Clickhouse Io

Claude Devfleet

Clickhouse Io

Ai First Engineering

Postgres Patterns