Skill-Datei

Senior Data Engineer

Name: Senior Data Engineer
Author: KaiserWhoLearns

World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, real-time streaming, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, Flink, Kinesis, and modern data stack. Includes data modeling, pipeline orchestration, data quality, streaming quality monitoring, and DataOps. Use when designing data architectures, building batch or streaming data pipelines, optimizing data workflows, or implementing data governance.

KaiserWhoLearns0 Sterne21.01.2026

Beruf
Kategorien: Data Engineering

Skill-Inhalt

Core Capabilities

Batch Pipeline Orchestration - Design and implement production-ready ETL/ELT pipelines with Airflow, intelligent dependency resolution, retry logic, and comprehensive monitoring
Real-Time Streaming - Build event-driven streaming pipelines with Kafka, Flink, Kinesis, and Spark Streaming with exactly-once semantics and sub-second latency
Data Quality Management - Comprehensive batch and streaming data quality validation covering completeness, accuracy, consistency, timeliness, and validity
Streaming Quality Monitoring - Track consumer lag, data freshness, schema drift, throughput, and dead letter queue rates for streaming pipelines
Performance Optimization - Analyze and optimize pipeline performance with query optimization, Spark tuning, and cost analysis recommendations

Key Workflows

Workflow 1: Build ETL Pipeline

Time: 2-4 hours

Steps:

Verwandte Skills

Senior Data Engineer | Skills Pool

# Generate Airflow DAG from configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/

# Validate pipeline configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --validate

# Use incremental load template
python scripts/pipeline_orchestrator.py --template incremental --output dags/

# Validate CSV file with quality checks
python scripts/data_quality_validator.py --input data/sales.csv --output report.html

# Validate database table with custom rules
python scripts/data_quality_validator.py \
    --connection postgresql://user:pass@host/db \
    --table sales_transactions \
    --rules rules/sales_validation.yaml \
    --threshold 0.95

# Analyze pipeline performance and get recommendations
python scripts/etl_performance_optimizer.py \
    --airflow-db postgresql://host/airflow \
    --dag-id sales_etl_pipeline \
    --days 30 \
    --optimize

# Analyze Spark job performance
python scripts/etl_performance_optimizer.py \
    --spark-history-server http://spark-history:18080 \
    --app-id app-20250115-001

# Validate streaming pipeline configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate

# Generate Kafka topic and client configurations
python scripts/kafka_config_generator.py \
    --topic user-events \
    --partitions 12 \
    --replication 3 \
    --output kafka/topics/

# Generate exactly-once producer configuration
python scripts/kafka_config_generator.py \
    --producer \
    --profile exactly-once \
    --output kafka/producer.properties

# Generate Flink job scaffolding
python scripts/stream_processor.py \
    --config streaming_config.yaml \
    --mode flink \
    --generate \
    --output flink-jobs/

# Monitor streaming quality
python scripts/streaming_quality_validator.py \
    --lag --consumer-group events-processor --threshold 10000 \
    --freshness --topic processed-events --max-latency-ms 5000 \
    --output streaming-health-report.html

# Basic DAG generation
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/

# With validation
python scripts/pipeline_orchestrator.py --config config.yaml --validate

# From template
python scripts/pipeline_orchestrator.py --template incremental --output dags/

# Validate with custom rules
python scripts/data_quality_validator.py \
    --input data/sales.csv \
    --rules rules/sales_validation.yaml \
    --output report.html

# Database table validation
python scripts/data_quality_validator.py \
    --connection postgresql://host/db \
    --table sales_transactions \
    --threshold 0.95

# Analyze Airflow DAG
python scripts/etl_performance_optimizer.py \
    --airflow-db postgresql://host/airflow \
    --dag-id sales_etl_pipeline \
    --days 30 \
    --optimize

# Spark job analysis
python scripts/etl_performance_optimizer.py \
    --spark-history-server http://spark-history:18080 \
    --app-id app-20250115-001

# Validate configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate

# Generate Kafka configurations
python scripts/stream_processor.py --config streaming_config.yaml --mode kafka --generate

# Generate Flink job scaffolding
python scripts/stream_processor.py --config streaming_config.yaml --mode flink --generate --output flink-jobs/

# Generate Docker Compose for local development
python scripts/stream_processor.py --config streaming_config.yaml --mode docker --generate

# Monitor consumer lag
python scripts/streaming_quality_validator.py \
    --lag --consumer-group events-processor --threshold 10000

# Monitor data freshness
python scripts/streaming_quality_validator.py \
    --freshness --topic processed-events --max-latency-ms 5000

# Full quality validation
python scripts/streaming_quality_validator.py \
    --lag --freshness --throughput --dlq \
    --output streaming-health-report.html

# Generate topic configuration
python scripts/kafka_config_generator.py \
    --topic user-events --partitions 12 --replication 3 --retention-hours 168

# Generate exactly-once producer
python scripts/kafka_config_generator.py \
    --producer --profile exactly-once --transactional-id producer-001

# Generate Kafka Streams config
python scripts/kafka_config_generator.py \
    --streams --application-id events-processor --exactly-once

Senior Data Engineer

Core Capabilities

Key Workflows

Workflow 1: Build ETL Pipeline

Senior Data Engineer

Core Capabilities

Key Workflows

Workflow 1: Build ETL Pipeline

Workflow 2: Build Real-Time Streaming Pipeline

Overview

Quick Start

Pipeline Orchestration

Data Quality Validation

Performance Optimization

Real-Time Streaming

Core Workflows

1. Building Production Data Pipelines

2. Data Quality Management

3. Data Modeling & Transformation

4. Performance Optimization

5. Building Real-Time Streaming Pipelines

Python Tools

pipeline_orchestrator.py

data_quality_validator.py

etl_performance_optimizer.py

stream_processor.py

streaming_quality_validator.py

kafka_config_generator.py

Reference Documentation

Frameworks (frameworks.md)

Templates (templates.md)

Tools (tools.md)

Tech Stack

Integration Points

Best Practices

Performance Targets

Resources

Clickhouse Io

Clickhouse Io

Claude Devfleet

Clickhouse Io

Ai First Engineering

Postgres Patterns