Transforms raw data into valuable insights through alchemical processes
Data Alchemist transforms raw, messy datasets into structured, actionable insights through automated ETL (Extract, Transform, Load) pipelines. It performs data validation, transformation, enrichment, and analysis without requiring manual scripting for common data operations.
CSV to PostgreSQL Pipeline: Automatically ingest a messy CSV with inconsistent headers, missing values, and type mismatches, validate against a schema, clean, and load to PostgreSQL with full audit trail.
API Data Enrichment: Extract data from REST endpoints, merge with existing datasets, apply transformations (geocoding, categorization, sentiment analysis), and export to dashboard-ready Parquet files.
Data Quality Audit: Scan production databases for anomalies, check for null thresholds, validate foreign key relationships, and generate a comprehensive quality report with failed rows extracted.
Time-Series Aggregation: Take high-frequency sensor data (millions of rows), resample to business intervals, calculate rolling statistics, detect outliers using IQR, and produce trend reports.
Schema Extraction & Documentation: Infer schema from unstructured data sources, generate JSON Schema or Pydantic models, create data dictionaries, and produce ER diagrams automatically.
# Transform a file with automatic type inference and cleaning
alchemy transform --input data/raw.csv --output data/clean.parquet --clean-modes deduplicate,standardize,impute
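The three `--clean-modes` above are conceptually simple passes over the rows. A minimal pure-Python sketch of what each pass might do (the helper names and rules are illustrative, not the tool's actual internals):

```python
def deduplicate(rows):
    """Drop exact-duplicate rows, keeping the first occurrence."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def standardize(rows):
    """Normalize header case and trim whitespace in string values."""
    return [
        {k.strip().lower(): v.strip() if isinstance(v, str) else v
         for k, v in row.items()}
        for row in rows
    ]

def impute(rows, column):
    """Fill missing numeric values in `column` with the column mean."""
    values = [r[column] for r in rows if r.get(column) is not None]
    fill = sum(values) / len(values)
    return [{**r, column: r[column] if r.get(column) is not None else fill}
            for r in rows]
```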
# Validate data against a schema (JSON Schema or Pydantic)
alchemy validate --input data/input.json --schema schemas/customer.schema.json --report-quality
# Extract from API with pagination and auth
alchemy extract --endpoint https://api.example.com/v1/orders --token $API_TOKEN --pages-all --output raw/api_orders.json
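`--pages-all` style pagination typically loops until the source stops reporting a next page. A hedged sketch with a pluggable fetch function (`fetch_page` and its `items`/`next_page` payload shape are assumptions for illustration, not the real API):

```python
def extract_all_pages(fetch_page):
    """Follow pagination until the source reports no next page.

    `fetch_page(page)` is assumed to return a dict with `items`
    (the records on that page) and `next_page` (int or None).
    """
    records, page = [], 1
    while page is not None:
        payload = fetch_page(page)
        records.extend(payload["items"])
        page = payload.get("next_page")
    return records
```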
# Load to database with upsert strategy
alchemy load --input data/clean.parquet --table orders --db postgresql://user:pass@localhost/db --upsert-keys order_id,customer_id
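On PostgreSQL, an upsert keyed on `--upsert-keys` maps naturally to `INSERT ... ON CONFLICT DO UPDATE`. A sketch of how such a statement could be rendered (the helper is illustrative, not the tool's code):

```python
def build_upsert(table, columns, keys):
    """Render a PostgreSQL upsert statement for the given columns and keys."""
    cols = ", ".join(columns)
    params = ", ".join(f"%({c})s" for c in columns)
    # Key columns identify the row, so only non-key columns are updated.
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c not in keys)
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({params}) "
        f"ON CONFLICT ({', '.join(keys)}) DO UPDATE SET {updates}"
    )
```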
# Run full ETL pipeline defined in YAML
alchemy pipeline --file pipelines/daily_sales.yaml --run-id 20260301_001
# Generate data profile and insights
alchemy profile --input data/sales.csv --output reports/sales_profile.html --correlation-threshold 0.8
# Merge multiple sources with conflict resolution
alchemy merge --sources data/jan.parquet,data/feb.parquet --on date,product_id --strategy latest --output data/merged.parquet
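The `latest` conflict strategy keeps the most recent record for each merge key. In pure Python, later sources simply overwrite earlier ones (a sketch, not the actual merge engine):

```python
def merge_latest(sources, on):
    """Merge record lists; later sources overwrite earlier ones per key."""
    merged = {}
    for source in sources:          # sources ordered oldest -> newest
        for row in source:
            key = tuple(row[k] for k in on)
            merged[key] = row       # last write wins
    return list(merged.values())
```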
# Detect anomalies using statistical methods
alchemy anomaly --input data/metrics.parquet --column revenue --method iqr --threshold 1.5 --output anomalies.csv
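The `iqr` method with `--threshold 1.5` corresponds to the standard Tukey fence: flag values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch:

```python
from statistics import quantiles

def iqr_anomalies(values, threshold=1.5):
    """Return values outside the Tukey fences Q1 - t*IQR .. Q3 + t*IQR."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles
    iqr = q3 - q1
    lo, hi = q1 - threshold * iqr, q3 + threshold * iqr
    return [v for v in values if v < lo or v > hi]
```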
# Create new pipeline template
alchemy init-pipeline --name daily_sales --template postgres-to-parquet
# List available transformations
alchemy list-transforms
# Test pipeline without execution
alchemy dry-run --file pipelines/daily_sales.yaml
# Rollback last successful load
alchemy rollback --table orders --to-timestamp "2026-03-01 02:00:00"
# Generate schema from sample data
alchemy infer-schema --input data/sample.csv --format pydantic --output schemas/sample.py
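Schema inference boils down to sampling values and picking the narrowest type that fits. A toy sketch that emits a Pydantic-style model as text (the type-widening rules and helper names here are illustrative only):

```python
def infer_type(values):
    """Pick the narrowest of int -> float -> str that fits all samples."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"

def emit_pydantic(name, rows):
    """Render a Pydantic model from sampled rows (all fields required)."""
    fields = {k: infer_type([r[k] for r in rows]) for k in rows[0]}
    lines = [f"class {name}(BaseModel):"]
    lines += [f"    {field}: {ftype}" for field, ftype in fields.items()]
    return "\n".join(lines)
```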
# Create data quality tests
alchemy generate-tests --table customers --expectation-type not-null --columns email,phone
# Profile raw data to understand structure and quality
alchemy profile --input data/raw/input.csv --output assessment/profile.json --sample-rows 10000
# Infer schema from sample
alchemy infer-schema --input data/raw/input.csv --format json-schema --output schemas/inferred.json
Review generated profile:
Create explicit schema based on inference:
# Use inferred schema as starting point
cp schemas/inferred.json schemas/customer.v1.json
# Edit to add constraints, descriptions, required fields
# Validate source against schema
alchemy validate \
--input data/raw/input.csv \
--schema schemas/customer.v1.json \
--fail-on-error \
--output validation/results.json \
--extract-failures data/failures/
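Conceptually, `--extract-failures` partitions rows into passing and failing sets against the schema's constraints. A simplified sketch checking only required fields (real JSON Schema validation covers types, formats, and much more):

```python
def validate_rows(rows, required):
    """Partition rows into (valid, failures) by required-field presence."""
    valid, failures = [], []
    for row in rows:
        missing = [f for f in required if row.get(f) in (None, "")]
        if missing:
            # Annotate the failed row so it can be triaged later.
            failures.append({**row, "_errors": f"missing: {', '.join(missing)}"})
        else:
            valid.append(row)
    return valid, failures
```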
Create pipeline YAML:
# pipelines/daily_sales.yaml