Help users plan multi-step synthetic data generation workflows. Use this skill when the user describes a complex data scenario ("I need healthcare risk data with monthly scores and peer benchmarks"), asks "what data do I need", "how should I generate this", "plan my dataset", "help me formulate a prompt", or wants to understand what synthdata-generate can and cannot produce directly. Also trigger on "prompt builder", "data planning", "generation plan", "what tables do I need", or "help me set up data for [tool/product/demo]".
A planning and advisory skill. Given a domain description and downstream requirements, produce
a sequenced set of steps the user can execute to generate a complete, internally-consistent
synthetic dataset — including both raw tables (via synthdata-generate) and derived/aggregated
tables (via synthdata-compute).
Ask the user:
- What domain or scenario the data should model
- What downstream tool, product, or demo will consume the data
- Which tables they need, and which of those are raw events vs. derived summaries
Classify every table the user needs into one of three categories:
| Category | How to produce | Example |
|---|---|---|
| Raw event tables | synthdata-generate from template or custom schema | users, transactions, events, devices |
| Derived/aggregated tables | synthdata-compute after generation | monthly_risk, department_summary, percentile_ranks |
| Independent reference tables | synthdata-generate with separate schema, or LLM-driven generation | peer_benchmarks, industry_averages, configuration tables |
Key rule: if a table requires reading generated data and performing aggregation, grouping,
scoring, or cross-table joins — it's derived, and synthdata-generate cannot produce it. Use
synthdata-compute instead.
Check if any of the 12 built-in templates cover the raw tables. If so, start there. If not,
describe a custom schema for synthdata-generate's interview flow.
Format the plan as numbered steps with the specific command or prompt for each:
Step 1: [synthdata-generate] Generate raw tables from template X
Step 2: [synthdata-compute] Derive monthly rollups from raw events
Step 3: [synthdata-generate] Generate peer benchmark dataset (custom schema)
Step 4: [synthdata-compute] Compute segment summaries from peer data
Step 5: [synthdata-extract] Extract everything to JSON for the downstream tool
Offer to execute Step 1 immediately.
The 12 built-in templates and what they cover. Use --list-templates to confirm:
| Template | Raw tables generated | Common derived tables needed downstream |
|---|---|---|
| blank-slate | (user-defined) | Depends on schema |
| hr-directory | departments, employees | headcount_by_dept, tenure_distribution |
| ecommerce-orders | customers, products, orders | monthly_revenue, customer_ltv, product_rankings |
| saas-metrics | accounts, users, events, subscriptions | mrr_by_month, churn_rate, feature_usage_summary |
| healthcare-patients | providers, patients, encounters, claims | monthly_encounters, cost_by_provider, diagnosis_distribution |
| financial-transactions | customers, accounts, transactions | monthly_balances, fraud_score, customer_risk_profile |
| security-events | users, devices, alerts, incidents | alert_volume_by_day, mttr, severity_distribution |
| log-events | services, requests, errors | error_rate_by_service, latency_percentiles, daily_traffic |
| iot-sensors | devices, readings, events | hourly_averages, anomaly_flags, device_health_summary |
| crm-pipeline | companies, contacts, deals, activities | pipeline_by_stage, win_rate, forecast_by_quarter |
| survey-responses | respondents, questions, responses | nps_score, question_summary, segment_breakdown |
| healthcare-hrm-security | users, threat_events, phishing_sims, training, dlp_events, abuse_mailbox | monthly_risk (composite scores), peer_benchmarks (industry comparison), segment_summaries, department_risk |
synthdata-generate produces raw tabular data with: template-based or custom schemas, per-row formula columns, and seeded, reproducible output at a chosen effort level.
These require synthdata-compute or LLM-driven generation instead: aggregation, grouping, scoring, percentile ranks, cross-table joins, and any table computed from already-generated data.
The formula column type works only within a single row of a single table. It cannot reference other tables, aggregate child rows, or perform lookups.
Reference recipes for synthdata-compute. Include the relevant pattern in the plan description.
Monthly rollup: Group events by (entity_id, month), count/sum event columns. Used for time-series dashboards, trend analysis, scoring.
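A minimal pandas sketch of the rollup; the table and column names (`user_id`, `timestamp`, `clicked`) are illustrative, not any template's actual schema:

```python
import pandas as pd

# Hypothetical raw event table; column names are illustrative.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 1],
    "timestamp": pd.to_datetime(
        ["2024-01-03", "2024-01-20", "2024-01-05", "2024-02-02"]),
    "clicked": [1, 0, 1, 1],
})

# Group by (entity, month): count events and sum the click column.
events["month"] = events["timestamp"].dt.to_period("M")
monthly = (events.groupby(["user_id", "month"])
                 .agg(event_count=("clicked", "size"),
                      click_sum=("clicked", "sum"))
                 .reset_index())
```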
Composite scoring: Compute sub-scores per dimension (e.g., threat=35%, sim=25%, dlp=20%, training=10%, reporting=10%), then weighted sum → composite (0–100). Apply tier thresholds.
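The weighted sum can be sketched as follows; the sub-score values are made up, and only the weights come from the recipe above:

```python
import pandas as pd

# Illustrative sub-scores (0-100) per user.
scores = pd.DataFrame({
    "threat":    [80.0, 10.0],
    "sim":       [60.0, 20.0],
    "dlp":       [90.0,  0.0],
    "training":  [40.0, 95.0],
    "reporting": [50.0, 90.0],
})
weights = {"threat": 0.35, "sim": 0.25, "dlp": 0.20,
           "training": 0.10, "reporting": 0.10}

# Weighted sum of sub-scores -> composite score on the same 0-100 scale.
scores["composite"] = sum(scores[col] * w for col, w in weights.items())
```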
Tier mapping: Map numeric scores to categorical tiers via thresholds (e.g., Critical >=75, High 50-74, Medium 25-49, Low <25).
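One way to express those thresholds in pandas, using `pd.cut` with left-closed bins so each boundary lands in the higher tier:

```python
import pandas as pd

scores = pd.Series([82, 60, 30, 10])

# Bins match the example thresholds: Low <25, Medium 25-49,
# High 50-74, Critical >=75.
tiers = pd.cut(scores,
               bins=[-float("inf"), 25, 50, 75, float("inf")],
               labels=["Low", "Medium", "High", "Critical"],
               right=False)
```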
Percentile rank: `df['percentile'] = df['score'].rank(pct=True) * 100` — useful for "where does this user/org stand relative to peers."
Segment summary: Group a population by segment flags, compute mean/median/p25/p75 for each metric. Common for peer benchmarking.
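A sketch of the segment summary with invented segment names and a single metric; a real dataset would have several metrics and several segment flags:

```python
import pandas as pd

peers = pd.DataFrame({
    "segment":    ["hospital", "hospital", "clinic", "clinic"],
    "click_rate": [0.12, 0.08, 0.20, 0.30],
})

# Per-segment mean/median/p25/p75 for the metric.
summary = peers.groupby("segment")["click_rate"].agg(
    mean="mean",
    median="median",
    p25=lambda s: s.quantile(0.25),
    p75=lambda s: s.quantile(0.75),
)
```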
Department join: Join user dimension to event/score tables, group by department, compute averages and counts.
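The join-then-aggregate pattern in miniature, with hypothetical `users` and `risk` tables:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3],
                      "department": ["IT", "IT", "HR"]})
risk = pd.DataFrame({"user_id": [1, 2, 3],
                     "score": [70.0, 30.0, 50.0]})

# Join the user dimension to the score table, then roll up by department.
dept = (risk.merge(users, on="user_id")
            .groupby("department")
            .agg(avg_score=("score", "mean"),
                 user_count=("user_id", "count"))
            .reset_index())
```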
User says: "I need healthcare risk data for a security product demo — user risk scores, threat events, peer benchmarks, department summaries."
Plan:
Step 1: [synthdata-generate]
Generate raw event tables from the healthcare-hrm-security template at medium effort.
> synthdata-generate --template healthcare-hrm-security --effort medium --seed 42
Produces: users (1K), threat_events (~90K), phishing_sims (12K), training (5K),
dlp_events (~850), abuse_mailbox (~22K)
Step 2: [synthdata-compute]
Derive monthly_risk from the raw event tables:
- Group threat_events by (user_id, month) → count events, sum clicks
- Group phishing_sims by user_id → sim click rate, report rate
- Group dlp_events by user_id → count violations
- Group training by user_id → completion rate, avg score
- Group abuse_mailbox by user_id → report count, confirmed-malicious ratio
- Compute sub-scores (0-100 each), weighted composite (Threat 35%, Sim 25%,
DLP 20%, Training 10%, Reporting 10%), assign risk tiers
Step 3: [synthdata-generate or LLM-driven]
Generate 200 peer organizations with correlated metrics.
This is best done via a prompted generation (not a template) because the cross-metric
correlations (mature orgs have lower click rates, higher MFA adoption, etc.) require
reasoning that the schema engine can't express.
> "Generate a peer benchmarks dataset: 200 healthcare-adjacent organizations with
> company info, security maturity, sim/threat/DLP/training metrics, composite risk
> scores, and segment flags for healthcare/geography/size. Include realistic
> cross-metric correlations."
Step 4: [synthdata-compute]
Derive segment_summaries from the peer data:
- Group peers by each segment flag
- Compute mean/median/p25/p75 for each metric
- Add the user's organization as a comparison row
Step 5: [synthdata-extract]
Extract the final xlsx workbooks to JSON for the downstream product:
> synthdata-extract --input <company_data>.xlsx --output json/
> synthdata-extract --input <peer_data>.xlsx --output json/
User says: "I need transaction data to test a fraud detection model — transactions, user profiles, and fraud labels with known fraud rings."
Plan:
Step 1: [synthdata-generate]
Generate raw tables from the financial-transactions template at thorough effort.
> synthdata-generate --template financial-transactions --effort thorough --seed 42
Produces: customers (~5K), accounts (~8K), transactions (~100K+) with fraud_ring profiles
Step 2: [synthdata-compute]
Derive fraud_labels table:
- Flag transactions from fraud_ring profile users with amount > p95 as "suspicious"
- Flag transactions with velocity > 3 per hour from same account as "velocity_alert"
- Compute per-customer risk features (avg transaction amount, transaction frequency,
unique merchant count, max single transaction)
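The velocity rule in Step 2 could be sketched like this; `account_id` and `timestamp` are assumed column names, not guaranteed to match the template's schema:

```python
import pandas as pd

tx = pd.DataFrame({
    "account_id": [1, 1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:10",
        "2024-01-01 10:20", "2024-01-01 10:30",
        "2024-01-01 10:00"]),
})

# Count transactions per (account, hour); flag any hour with more than 3.
tx["hour"] = tx["timestamp"].dt.floor("h")
counts = tx.groupby(["account_id", "hour"])["timestamp"].transform("count")
tx["velocity_alert"] = counts > 3
```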
Step 3: [synthdata-compute]
Derive daily_summary:
- Group transactions by date → total volume, total amount, fraud count, fraud rate
- Compute 7-day rolling averages for trend analysis
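The 7-day rolling average from Step 3 is a one-liner in pandas; the daily counts here are invented:

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "fraud_count": [1, 0, 2, 1, 0, 3, 1, 2, 0, 1],
})

# 7-day rolling mean for trend analysis; min_periods=1 allows
# partial windows at the start of the series.
daily["fraud_7d_avg"] = (daily["fraud_count"]
                         .rolling(window=7, min_periods=1)
                         .mean())
```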
User says: "I need sensor data for a monitoring dashboard — device readings, alerts, and health summaries."
Plan:
Step 1: [synthdata-generate]
Generate from iot-sensors template at medium effort.
> synthdata-generate --template iot-sensors --effort medium --seed 42
Produces: devices, readings, events
Step 2: [synthdata-compute]
Derive hourly_averages:
- Group readings by (device_id, hour) → mean/min/max for each metric
- Flag anomalies where value > 3 standard deviations from device mean
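The 3-standard-deviation flag from the bullet above can be sketched per device; note that with only a handful of readings a single outlier inflates the standard deviation, so this toy series uses enough normal points for the flag to fire:

```python
import pandas as pd

readings = pd.DataFrame({
    "device_id": [1] * 20,
    "value": [10.0, 11.0, 9.0] * 6 + [10.0, 50.0],
})

# Flag readings more than 3 standard deviations from the device mean.
grp = readings.groupby("device_id")["value"]
readings["anomaly"] = (
    (readings["value"] - grp.transform("mean")).abs()
    > 3 * grp.transform("std")
)
```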
Step 3: [synthdata-compute]
Derive device_health_summary:
- Per device: uptime %, alert count, last reading timestamp, current status
If the user's needs are simple (single template, no derived tables), skip the planning and
suggest going directly to synthdata-generate.