Help users plan multi-step synthetic data generation workflows. Use this skill when the user describes a complex data scenario ("I need healthcare risk data with monthly scores and peer benchmarks"), asks "what data do I need", "how should I generate this", "plan my dataset", "help me formulate a prompt", or wants to understand what synthdata-generate can and cannot produce directly. Also trigger on "prompt builder", "data planning", "generation plan", "what tables do I need", or "help me set up data for [tool/product/demo]".
A planning and advisory skill. Given a domain description and downstream requirements, produce
a sequenced set of steps the user can execute to generate a complete, internally-consistent
synthetic dataset — including both raw tables (via synthdata-generate) and derived/aggregated
tables (via synthdata-compute).
Ask the user:
- What domain or scenario the data should model
- What downstream tool, product, or demo will consume the data
- Which tables they need, and which of those are raw events vs. derived summaries
Classify every table the user needs into one of three categories:
| Category | How to produce | Example |
|---|---|---|
| Raw event tables | synthdata-generate from template or custom schema | users, transactions, events, devices |
| Derived/aggregated tables | synthdata-compute after generation | monthly_risk, department_summary, percentile_ranks |
| Independent reference tables | synthdata-generate with separate schema, or LLM-driven generation | peer_benchmarks, industry_averages, configuration tables |
Key rule: if a table requires reading generated data and performing aggregation, grouping,
scoring, or cross-table joins — it's derived, and synthdata-generate cannot produce it. Use
synthdata-compute instead.
Check if any of the 12 built-in templates cover the raw tables. If so, start there. If not,
describe a custom schema for synthdata-generate's interview flow.
Format the plan as numbered steps with the specific command or prompt for each:
Step 1: [synthdata-generate] Generate raw tables from template X
Step 2: [synthdata-compute] Derive monthly rollups from raw events
Step 3: [synthdata-generate] Generate peer benchmark dataset (custom schema)
Step 4: [synthdata-compute] Compute segment summaries from peer data
Step 5: [synthdata-extract] Extract everything to JSON for the downstream tool
Offer to execute Step 1 immediately.
The 12 built-in templates and what they cover. Use --list-templates to confirm:
| Template | Raw tables generated | Common derived tables needed downstream |
|---|---|---|
| blank-slate | (user-defined) | Depends on schema |
| hr-directory | departments, employees | headcount_by_dept, tenure_distribution |
| ecommerce-orders | customers, products, orders | monthly_revenue, customer_ltv, product_rankings |
| saas-metrics | accounts, users, events, subscriptions | mrr_by_month, churn_rate, feature_usage_summary |
| healthcare-patients | providers, patients, encounters, claims | monthly_encounters, cost_by_provider, diagnosis_distribution |
| financial-transactions | customers, accounts, transactions | monthly_balances, fraud_score, customer_risk_profile |
| security-events | users, devices, alerts, incidents | alert_volume_by_day, mttr, severity_distribution |
| log-events | services, requests, errors | error_rate_by_service, latency_percentiles, daily_traffic |
| iot-sensors | devices, readings, events | hourly_averages, anomaly_flags, device_health_summary |
| crm-pipeline | companies, contacts, deals, activities | pipeline_by_stage, win_rate, forecast_by_quarter |
| survey-responses | respondents, questions, responses | nps_score, question_summary, segment_breakdown |
| healthcare-hrm-security | users, threat_events, phishing_sims, training, dlp_events, abuse_mailbox | monthly_risk (composite scores), peer_benchmarks (industry comparison), segment_summaries, department_risk |
synthdata-generate produces raw tabular data with: template-based or custom schemas, per-row formula columns, and seeded, reproducible output at a chosen effort level.
These require synthdata-compute or LLM-driven generation instead: aggregation, grouping, scoring, percentile ranks, cross-table joins, and any table computed from already-generated data.
The formula column type works only within a single row of a single table. It cannot reference other tables, aggregate child rows, or perform lookups.
Reference recipes for synthdata-compute. Include the relevant pattern in the plan description.
Monthly rollup: Group events by (entity_id, month), count/sum event columns. Used for time-series dashboards, trend analysis, scoring.
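A minimal pandas sketch of the rollup; the table and column names (`user_id`, `timestamp`, `clicked`) are illustrative, not any template's actual schema:

```python
import pandas as pd

# Hypothetical raw event table; column names are illustrative.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 1],
    "timestamp": pd.to_datetime(
        ["2024-01-03", "2024-01-20", "2024-01-05", "2024-02-02"]),
    "clicked": [1, 0, 1, 1],
})

# Group by (entity, month): count events and sum the click column.
events["month"] = events["timestamp"].dt.to_period("M")
monthly = (events.groupby(["user_id", "month"])
                 .agg(event_count=("clicked", "size"),
                      click_sum=("clicked", "sum"))
                 .reset_index())
```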
Composite scoring: Compute sub-scores per dimension (e.g., threat=35%, sim=25%, dlp=20%, training=10%, reporting=10%), then weighted sum → composite (0–100). Apply tier thresholds.
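The weighted sum can be sketched as follows; the sub-score values are made up, and only the weights come from the recipe above:

```python
import pandas as pd

# Illustrative sub-scores (0-100) per user.
scores = pd.DataFrame({
    "threat":    [80.0, 10.0],
    "sim":       [60.0, 20.0],
    "dlp":       [90.0,  0.0],
    "training":  [40.0, 95.0],
    "reporting": [50.0, 90.0],
})
weights = {"threat": 0.35, "sim": 0.25, "dlp": 0.20,
           "training": 0.10, "reporting": 0.10}

# Weighted sum of sub-scores -> composite score on the same 0-100 scale.
scores["composite"] = sum(scores[col] * w for col, w in weights.items())
```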
Tier mapping: Map numeric scores to categorical tiers via thresholds (e.g., Critical >=75, High 50-74, Medium 25-49, Low <25).
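One way to express those thresholds in pandas, using `pd.cut` with left-closed bins so each boundary lands in the higher tier:

```python
import pandas as pd

scores = pd.Series([82, 60, 30, 10])

# Bins match the example thresholds: Low <25, Medium 25-49,
# High 50-74, Critical >=75.
tiers = pd.cut(scores,
               bins=[-float("inf"), 25, 50, 75, float("inf")],
               labels=["Low", "Medium", "High", "Critical"],
               right=False)
```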
Percentile rank: `df['percentile'] = df['score'].rank(pct=True) * 100` — useful for "where does this user/org stand relative to peers."
Segment summary: Group a population by segment flags, compute mean/median/p25/p75 for each metric. Common for peer benchmarking.
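A sketch of the segment summary with invented segment names and a single metric; a real dataset would have several metrics and several segment flags:

```python
import pandas as pd

peers = pd.DataFrame({
    "segment":    ["hospital", "hospital", "clinic", "clinic"],
    "click_rate": [0.12, 0.08, 0.20, 0.30],
})

# Per-segment mean/median/p25/p75 for the metric.
summary = peers.groupby("segment")["click_rate"].agg(
    mean="mean",
    median="median",
    p25=lambda s: s.quantile(0.25),
    p75=lambda s: s.quantile(0.75),
)
```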
Department join: Join user dimension to event/score tables, group by department, compute averages and counts.
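The join-then-aggregate pattern in miniature, with hypothetical `users` and `risk` tables:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3],
                      "department": ["IT", "IT", "HR"]})
risk = pd.DataFrame({"user_id": [1, 2, 3],
                     "score": [70.0, 30.0, 50.0]})

# Join the user dimension to the score table, then roll up by department.
dept = (risk.merge(users, on="user_id")
            .groupby("department")
            .agg(avg_score=("score", "mean"),
                 user_count=("user_id", "count"))
            .reset_index())
```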
User says: "I need healthcare risk data for a security product demo — user risk scores, threat events, peer benchmarks, department summaries."
Plan:
Step 1: [synthdata-generate]
Generate raw event tables from the healthcare-hrm-security template at medium effort.
> synthdata-generate --template healthcare-hrm-security --effort medium --seed 42
Produces: users (1K), threat_events (~90K), phishing_sims (12K), training (5K),
dlp_events (~850), abuse_mailbox (~22K)
Step 2: [synthdata-compute]
Derive monthly_risk from the raw event tables:
- Group threat_events by (user_id, month) → count events, sum clicks
- Group phishing_sims by user_id → sim click rate, report rate
- Group dlp_events by user_id → count violations
- Group training by user_id → completion rate, avg score
- Group abuse_mailbox by user_id → report count, confirmed-malicious ratio
- Compute sub-scores (0-100 each), weighted composite (Threat 35%, Sim 25%,
DLP 20%, Training 10%, Reporting 10%), assign risk tiers
Step 3: [synthdata-generate or LLM-driven]
Generate 200 peer organizations with correlated metrics.
This is best done via a prompted generation (not a template) because the cross-metric
correlations (mature orgs have lower click rates, higher MFA adoption, etc.) require
reasoning that the schema engine can't express.
> "Generate a peer benchmarks dataset: 200 healthcare-adjacent organizations with
> company info, security maturity, sim/threat/DLP/training metrics, composite risk
> scores, and segment flags for healthcare/geography/size. Include realistic
> cross-metric correlations."
Step 4: [synthdata-compute]
Derive segment_summaries from the peer data:
- Group peers by each segment flag
- Compute mean/median/p25/p75 for each metric
- Add the user's organization as a comparison row
Step 5: [synthdata-extract]
Extract the final xlsx workbooks to JSON for the downstream product:
> synthdata-extract --input <company_data>.xlsx --output json/
> synthdata-extract --input <peer_data>.xlsx --output json/
User says: "I need transaction data to test a fraud detection model — transactions, user profiles, and fraud labels with known fraud rings."
Plan:
Step 1: [synthdata-generate]
Generate raw tables from the financial-transactions template at thorough effort.
> synthdata-generate --template financial-transactions --effort thorough --seed 42
Produces: customers (~5K), accounts (~8K), transactions (~100K+) with fraud_ring profiles
Step 2: [synthdata-compute]
Derive fraud_labels table:
- Flag transactions from fraud_ring profile users with amount > p95 as "suspicious"
- Flag transactions with velocity > 3 per hour from same account as "velocity_alert"
- Compute per-customer risk features (avg transaction amount, transaction frequency,
unique merchant count, max single transaction)
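The velocity rule in Step 2 could be sketched like this; `account_id` and `timestamp` are assumed column names, not guaranteed to match the template's schema:

```python
import pandas as pd

tx = pd.DataFrame({
    "account_id": [1, 1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:10",
        "2024-01-01 10:20", "2024-01-01 10:30",
        "2024-01-01 10:00"]),
})

# Count transactions per (account, hour); flag any hour with more than 3.
tx["hour"] = tx["timestamp"].dt.floor("h")
counts = tx.groupby(["account_id", "hour"])["timestamp"].transform("count")
tx["velocity_alert"] = counts > 3
```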
Step 3: [synthdata-compute]
Derive daily_summary:
- Group transactions by date → total volume, total amount, fraud count, fraud rate
- Compute 7-day rolling averages for trend analysis
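The 7-day rolling average from Step 3 is a one-liner in pandas; the daily counts here are invented:

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "fraud_count": [1, 0, 2, 1, 0, 3, 1, 2, 0, 1],
})

# 7-day rolling mean for trend analysis; min_periods=1 allows
# partial windows at the start of the series.
daily["fraud_7d_avg"] = (daily["fraud_count"]
                         .rolling(window=7, min_periods=1)
                         .mean())
```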
User says: "I need sensor data for a monitoring dashboard — device readings, alerts, and health summaries."
Plan:
Step 1: [synthdata-generate]
Generate from iot-sensors template at medium effort.
> synthdata-generate --template iot-sensors --effort medium --seed 42
Produces: devices, readings, events
Step 2: [synthdata-compute]
Derive hourly_averages:
- Group readings by (device_id, hour) → mean/min/max for each metric
- Flag anomalies where value > 3 standard deviations from device mean
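The 3-standard-deviation flag from the bullet above can be sketched per device; note that with only a handful of readings a single outlier inflates the standard deviation, so this toy series uses enough normal points for the flag to fire:

```python
import pandas as pd

readings = pd.DataFrame({
    "device_id": [1] * 20,
    "value": [10.0, 11.0, 9.0] * 6 + [10.0, 50.0],
})

# Flag readings more than 3 standard deviations from the device mean.
grp = readings.groupby("device_id")["value"]
readings["anomaly"] = (
    (readings["value"] - grp.transform("mean")).abs()
    > 3 * grp.transform("std")
)
```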
Step 3: [synthdata-compute]
Derive device_health_summary:
- Per device: uptime %, alert count, last reading timestamp, current status
If the user's needs are simple (single template, no derived tables), skip the planning and
suggest going directly to synthdata-generate.