Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.
Catalog and schema are always user-supplied — never default to any value. If the user hasn't provided them, ask. For any UC write, always create the schema if it doesn't exist before writing data.
Generate realistic, story-driven synthetic data for Databricks using Spark + Faker + Pandas UDFs (strongly recommended).
Synthetic data should demonstrate how Databricks helps solve real business problems.
The pattern: Something goes wrong → business impact ($) → analyze root cause → identify affected customers → fix and prevent.
Key principles:
Why no flat distributions: Uniform data has no story — no spikes, no anomalies, no cohorts, no 80/20 skew, nothing to investigate. It can't show Databricks' value for root cause analysis.
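The difference is easy to see locally. In this illustrative numpy sketch (values are assumptions, not part of the skill), the top 10% of a lognormal draw carries a large share of the total — a head worth investigating — while a uniform draw spreads it evenly:

```python
import numpy as np

rng = np.random.default_rng(7)
uniform = rng.uniform(10, 100, size=10_000)               # flat: no story
skewed = rng.lognormal(mean=4.0, sigma=1.0, size=10_000)  # long tail: 80/20-style skew

def top10_share(x):
    """Fraction of the total held by the largest 10% of values."""
    x = np.sort(x)[::-1]
    return x[: len(x) // 10].sum() / x.sum()

# top10_share(uniform) lands near ~0.17; top10_share(skewed) near ~0.39
```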
| When | Guide |
|---|---|
| User mentions ML model training or complex time patterns | references/1-data-patterns.md — ML-ready data, time multipliers, row coherence |
| Errors during generation | references/2-troubleshooting.md — Fixing common issues |
- `.cache()` or `.persist()` — Not supported on serverless. Write to Delta, read back for joins.
- `.collect()` — Use Spark parallelism. No driver-side iteration; avoid Pandas↔Spark conversions.

Before generating any code, you MUST present a plan for user approval.
You MUST explicitly ask the user which catalog to use. Do not assume or proceed without confirmation.
Example prompt to user:
"Which Unity Catalog should I use for this data?"
When presenting your plan, always show the selected catalog prominently:
📍 Output Location: catalog_name.schema_name
Volume: /Volumes/catalog_name/schema_name/raw_data/
This makes it easy for the user to spot and correct if needed.
Ask the user about:
If user doesn't specify a story: Propose one. Don't generate bland data — suggest an incident, anomaly, or trend that shows Databricks value (e.g., "I'll include a system outage that causes ticket spike and churn — this lets you demo root cause analysis").
Show a clear specification with the business story and your assumptions surfaced:
📍 Output Location: {user_catalog}.support_demo
Volume: /Volumes/{user_catalog}/support_demo/raw_data/
📖 Story: A payment system outage causes support ticket spike. Resolution times
degrade, enterprise customers churn, revenue drops $2.3M. With Databricks we
identify the root cause, affected customers, and prevent future impact.
| Table | Description | Rows | Key Assumptions |
|---|---|---|---|
| customers | Customer profiles with tier, MRR | 10,000 | Enterprise 10% but 60% of revenue |
| tickets | Support tickets with priority, resolution_time | 80,000 | Spike during outage, SLA breaches |
| incidents | System events (outages, deployments) | 50 | Payment outage mid-month |
| churn_events | Customer cancellations with reason | 500 | Spike after poor support experience |
Business metrics:
- `customers.mrr` — Revenue at risk ($)
- `tickets.resolution_hours` — SLA performance
- `churn_events.lost_mrr` — Churn impact ($)

The story this data tells:
Ask user: "Does this story work? Any adjustments?"
Do NOT proceed to code generation until user approves the plan, including the catalog.
After generating data, use get_volume_folder_details to validate the output matches requirements:
```python
from databricks.connect import DatabricksSession, DatabricksEnv
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import pandas as pd

# Setup serverless with dependencies (MUST list all libs used in UDFs)
env = DatabricksEnv().withDependencies("faker", "holidays")
spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate()

# Pandas UDF pattern - import lib INSIDE the function
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    from faker import Faker  # Import inside UDF
    fake = Faker()
    return pd.Series([fake.name() for _ in range(len(ids))])

# Generate with spark.range, apply UDFs
customers_df = spark.range(0, 10000, numPartitions=16).select(
    F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"),
    fake_name(F.col("id")).alias("name"),
)

# Write to Volume as Parquet (default for raw data)
# Path is a folder with table name: /Volumes/catalog/schema/raw_data/customers/
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
customers_df.write.mode("overwrite").parquet(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers")
```
Partitions by scale: `spark.range(N, numPartitions=P)`
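One way to pick `P` — a hypothetical heuristic, not a Databricks-prescribed rule — is to target a roughly fixed number of rows per partition with a small floor so even tiny tables keep some parallelism:

```python
def partitions_for(n_rows: int, rows_per_partition: int = 250_000, floor: int = 8) -> int:
    """Pick a numPartitions value: ~fixed-size partitions, never below a floor."""
    return max(floor, -(-n_rows // rows_per_partition))  # ceiling division

# usage: spark.range(n_rows, numPartitions=partitions_for(n_rows))
```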
Output formats:
- `df.write.parquet("/Volumes/.../raw_data/table")` — raw data for pipelines
- `df.write.saveAsTable("catalog.schema.table")` — if user wants queryable tables

Generated scripts must be highly performant. Never do these:
| Anti-Pattern | Why It's Slow | Do This Instead |
|---|---|---|
| Python loops on driver | Single-threaded, no parallelism | Use spark.range() + Spark operations |
| `.collect()` then iterate | Brings all data to driver memory | Keep data in Spark, use DataFrame ops |
| Pandas → Spark → Pandas | Serialization overhead, defeats distribution | Stay in Spark, use pandas_udf only for UDFs |
| Read/write temp files | Unnecessary I/O | Chain DataFrame transformations |
| Scalar UDFs | Row-by-row processing | Use pandas_udf for batch processing |
Good pattern: spark.range() → Spark transforms → pandas_udf for Faker → write directly
Materialize one random draw per row first, then bucket it — chained `F.rand()` calls each draw independently, which skews the intended 60/30/10 split:

```python
df = df.withColumn("r", F.rand(seed=42))
tier = F.when(F.col("r") < 0.6, "Free").when(F.col("r") < 0.9, "Pro").otherwise("Enterprise")
```
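For a quick local sanity check of a 60/30/10 split, numpy's `Generator.choice` with explicit weights gives the same shape (a local sketch, not serverless code):

```python
import numpy as np

rng = np.random.default_rng(0)
tiers = rng.choice(["Free", "Pro", "Enterprise"], size=100_000, p=[0.6, 0.3, 0.1])

free_share = (tiers == "Free").mean()       # ~0.60
ent_share = (tiers == "Enterprise").mean()  # ~0.10
```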
Use np.random.lognormal(mean, sigma) — always positive, long tail:
- `lognormal(7.5, 0.8)` → ~$1800 median
- `lognormal(5.5, 0.7)` → ~$245 median
- `lognormal(4.0, 0.6)` → ~$55 median

```python
from datetime import datetime, timedelta

END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
```
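The lognormal medians follow from its closed form — median = e^mean, regardless of sigma (e^7.5 ≈ 1808) — which a quick local numpy check confirms:

```python
import numpy as np

rng = np.random.default_rng(42)
mrr = rng.lognormal(mean=7.5, sigma=0.8, size=200_000)

sample_median = float(np.median(mrr))  # ~e^7.5 ≈ 1808
# Every draw is positive, with a long right tail — no negative revenue rows
```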
Write master table to Delta first, then read back for FK joins (no `.cache()` on serverless):

```python
# 1. Write master table
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")

# 2. Read back for FK lookup
customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers").select("customer_idx", "customer_id")

# 3. Generate child table with valid FKs via join
orders_df = spark.range(N_ORDERS).select(
    (F.abs(F.hash(F.col("id"))) % N_CUSTOMERS).alias("customer_idx")
)
orders_with_fk = orders_df.join(customer_lookup, on="customer_idx")
```
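The join-based FK pattern can be sketched locally with pandas (hypothetical table and column names; a Knuth multiplicative hash stands in for Spark's `F.hash`):

```python
import pandas as pd

N_CUSTOMERS, N_ORDERS = 5, 20

customers = pd.DataFrame({
    "customer_idx": range(N_CUSTOMERS),
    "customer_id": [f"CUST-{i:05d}" for i in range(N_CUSTOMERS)],
})
orders = pd.DataFrame({"order_id": range(N_ORDERS)})
# Deterministic pseudo-hash, then modulo into the customer index range
orders["customer_idx"] = (orders["order_id"] * 2654435761) % N_CUSTOMERS

# Inner join: every order gets a customer_id that actually exists
orders_with_fk = orders.merge(customers, on="customer_idx")
```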
Requires Python 3.12 and `databricks-connect>=16.4`. Use uv:

```shell
uv pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
```
| Issue | Solution |
|---|---|
| `ImportError: cannot import name 'DatabricksEnv'` | Upgrade: `uv pip install "databricks-connect>=16.4"` |
| Python 3.11 instead of 3.12 | Python 3.12 required. Use uv to create env with correct version |
| `ModuleNotFoundError: faker` | Add to `withDependencies()`, import inside UDF |
| Faker UDF is slow | Use pandas_udf for batch processing |
| Out of memory | Increase numPartitions in spark.range() |
| Referential integrity errors | Write master table to Delta first, read back for FK joins |
| `PERSIST TABLE is not supported` on serverless | NEVER use `.cache()` or `.persist()` with serverless - write to Delta table first, then read back |
| `F.window` vs `Window` confusion | Use `from pyspark.sql.window import Window` for `row_number()`, `rank()`, etc. `F.window` is for time-based bucketing, not row windows. |
| Broadcast variables not supported | NEVER use spark.sparkContext.broadcast() with serverless |
See references/2-troubleshooting.md for full troubleshooting guide.