Test data strategy — synthetic (factories / faker / constraints) vs masked (redact / tokenize / pseudonymize) vs subset vs frozen. GDPR rails. Lifecycle (generate / seed / reset / refresh). Determinism. Fixtures + goldens. Not legal advice.
You design where test data comes from, how it's protected, how it flows between environments, and how it stays reproducible. Getting this wrong is a common way for PII to leak into lower environments.
| Dimension | Required | Default |
|---|---|---|
| Environments (dev / CI / integration / staging / perf / UAT) | Yes | — |
| Data classes (PII, health, financial, proprietary, non-sensitive) | Yes | — |
| Compliance (GDPR / HIPAA / PCI) | Yes | — |
| Test levels in scope | No | Asked |
| Scale needs (100 rows / 10M) | No | Asked |
| Existing data tooling | No | Asked |
**Environments**: [list]
**Data classes**: [PII + financial + health + proprietary + non-sensitive]
**Compliance**: [GDPR / HIPAA / PCI / sector]
**Test levels**: [unit / component / integration / E2E / perf / UAT]
**Scale**: [baseline + perf sizing]
**Existing tooling**: [factories / seeds / snapshots / masking tools]
**Current pain**: [flaky due to data / prod data leaking / slow seeds]
Disclaimer: Not legal advice. Compliance decisions require counsel and DPO.
Ask for the render mode (per the diagram-rendering mixin) and the output path (default: /documentation/[case]/test-data-management-strategy/).
Choose per environment × test level — not one-size-fits-all.
| Env | Primary source | Secondary | Notes |
|---|---|---|---|
| Local / dev | synthetic factories | seeded snapshots | fast; offline-capable |
| CI (PR) | synthetic factories | per-test seeding | ephemeral per-run |
| Integration | synthetic + seeded fixtures | golden datasets | shared; reset nightly |
| Staging | subset masked | synthetic for new tests | closest to prod distribution |
| Performance | generated-at-scale synthetic | optional masked subset | deterministic seed |
| UAT | curated synthetic personas | + golden | stable cohorts for demos |
Production data is never copied raw into any other environment.
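The matrix above can also be enforced in code rather than only documented. The following is a minimal sketch, assuming a hypothetical `check_source` guard that a seeding or refresh job calls before loading data; the environment keys and source names (`seeded_snapshot`, `masked_subset`, `raw_prod`, and so on) are illustrative labels, not part of any existing tool.

```python
# Hypothetical encoding of the per-environment matrix; all names are illustrative.
ALLOWED_SOURCES = {
    "dev": {"synthetic", "seeded_snapshot"},
    "ci": {"synthetic"},
    "integration": {"synthetic", "seeded_fixture", "golden"},
    "staging": {"masked_subset", "synthetic"},
    "perf": {"synthetic", "masked_subset"},
    "uat": {"synthetic", "golden"},
}

# Sources that are rejected regardless of environment.
FORBIDDEN_EVERYWHERE = {"raw_prod"}


def check_source(env: str, source: str) -> None:
    """Raise if `source` may not be loaded into `env`; return None otherwise."""
    if source in FORBIDDEN_EVERYWHERE:
        raise PermissionError(f"{source!r} must never be loaded into {env!r}")
    if env not in ALLOWED_SOURCES:
        raise KeyError(f"unknown environment: {env!r}")
    if source not in ALLOWED_SOURCES[env]:
        raise ValueError(f"{source!r} is not an approved source for {env!r}")


if __name__ == "__main__":
    check_source("staging", "masked_subset")  # passes silently
    try:
        check_source("staging", "raw_prod")
    except PermissionError as exc:
        print("blocked:", exc)
```

Keeping the rule as an allow-list means a new source type is rejected until someone adds it deliberately.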
| Technique | What it does | Retains |
|---|---|---|
| Redaction | blank sensitive fields | nothing |
| Tokenization | replace value with token; vault stores mapping | referential integrity |
| Pseudonymization | deterministic hash / replacement | joinability |
| Generalization | reduce precision (age 37 → 30-39) | distribution |
| Noise injection | add bounded noise to numerics | statistical properties |
| Synthesis | generate new data matching aggregate stats | distribution, no direct mapping |
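To make the differences between these techniques concrete, here is a small sketch using only the Python standard library. It is illustrative only: `SECRET_KEY` is a placeholder (a real key would come from a secrets manager), `TokenVault` is an in-memory stand-in for a real vault, and all function names are hypothetical.

```python
import hashlib
import hmac
import random
import secrets

SECRET_KEY = b"example-only-key"  # assumption: injected from a secrets manager in practice


def redact(_value: str) -> str:
    """Redaction: blank the field, so nothing is retained."""
    return ""


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Pseudonymization: keyed deterministic hash, so equal inputs remain joinable."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def generalize_age(age: int) -> str:
    """Generalization: reduce precision to a ten-year band, e.g. 37 becomes '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def add_noise(amount: float, rng: random.Random, max_delta: float = 5.0) -> float:
    """Noise injection: add uniform noise bounded by +/- max_delta."""
    return amount + rng.uniform(-max_delta, max_delta)


class TokenVault:
    """Tokenization: random token per distinct value; the mapping lives only in the vault."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so references across tables stay consistent.
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token: str) -> str:
        return self._value_by_token[token]
```

The practical difference between the last two: a pseudonym can be recomputed by anyone holding the key, while a token is random and can only be reversed through the vault's mapping.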
Select per field per need:
| Field | Class | Technique |
|---|---|---|
| email | PII direct | pseudonymize → user_<hash>@example.test |
| phone | PII direct | pseudonymize with format preservation |
| DOB | quasi-identifier | generalize to decade |
| ZIP | quasi-identifier | truncate to first 3 |
| SSN / national id | regulated | tokenize via vault |
| card number | PCI | never leaves prod; use test numbers |
| health diagnosis | sensitive | generalize category |
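A per-field mapping like the one above can be expressed as a rule table applied to each record. The sketch below is one possible shape, not a reference implementation: the field names, `KEY`, `RULES`, and `NON_SENSITIVE` are assumptions, and `mask_phone` is a simple keyed digit substitution standing in for a vetted format-preserving scheme. SSN and card numbers are deliberately left out because the table routes them to a vault and to test numbers respectively.

```python
import hashlib
import hmac
from datetime import date

KEY = b"example-only-key"  # assumption: the real key is injected from a secrets manager


def _digest(value: str) -> str:
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    # Normalise first so case or whitespace differences map to the same pseudonym.
    return f"user_{_digest(email.strip().lower())[:12]}@example.test"


def mask_phone(phone: str) -> str:
    """Swap every digit for a keyed pseudo-digit; keep length and punctuation."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    stream = _digest(digits)  # 64 hex characters
    out, i = [], 0
    for ch in phone:
        if ch.isdigit():
            out.append(str(int(stream[i % len(stream)], 16) % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


def mask_dob(dob: date) -> str:
    """Generalize a date of birth to its decade, e.g. 1987-05-17 becomes '1980s'."""
    return f"{dob.year // 10 * 10}s"


def mask_zip(zip_code: str) -> str:
    """Keep only the first three characters of the postal code."""
    return zip_code[:3]


RULES = {"email": mask_email, "phone": mask_phone, "dob": mask_dob, "zip": mask_zip}
NON_SENSITIVE = {"id", "plan"}  # hypothetical fields classified as safe to keep


def mask_record(record: dict) -> dict:
    """Apply the rule for each field; fail closed on fields nobody has classified."""
    masked = {}
    for field, value in record.items():
        if field in RULES:
            masked[field] = RULES[field](value)
        elif field in NON_SENSITIVE:
            masked[field] = value
        else:
            raise KeyError(f"unclassified field: {field!r}")
    return masked
```

Raising on unclassified fields is a deliberate choice in this sketch: a new column added in production then breaks the masking job instead of flowing through unmasked.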
Quasi-identifiers combined can re-identify:
Mitigations:
Evaluate risk per dataset. Document in a DPIA attachment when required.
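One common way to put a number on this risk is a k-anonymity style check: count how many rows share each combination of quasi-identifier values and flag combinations that occur fewer than `k` times. The text above does not name this measure, so treat the sketch below as one option among several; the column names and the threshold are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Sequence


def group_sizes(rows: Iterable[dict], quasi_identifiers: Sequence[str]) -> Counter:
    """Count rows per distinct combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def risky_groups(rows: Iterable[dict], quasi_identifiers: Sequence[str], k: int) -> dict:
    """Return the combinations shared by fewer than k rows (candidates for re-identification)."""
    sizes = group_sizes(rows, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


if __name__ == "__main__":
    sample = [
        {"zip3": "941", "dob_decade": "1980s", "sex": "F"},
        {"zip3": "941", "dob_decade": "1980s", "sex": "F"},
        {"zip3": "941", "dob_decade": "1990s", "sex": "M"},
    ]
    # The 1990s/M row is unique on these three columns, so it is flagged for k=2.
    print(risky_groups(sample, ["zip3", "dob_decade", "sex"], k=2))
```

Flagged groups can then be generalized further or suppressed before the dataset leaves the masking pipeline.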
- `OrderFactory.build()` with sensible defaults
- `OrderFactory.build(status: 'refunded')`
- Transaction rollback (`BEGIN; ... ROLLBACK;`)

Hand off perf scenarios to non-functional-test-planning.
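The factory and rollback patterns can be sketched in Python as follows. This is an assumption-laden illustration: the `Order` fields, their defaults, and the `run_isolated` helper are hypothetical, and Python keyword arguments (`status="refunded"`) replace the notation used in the text. SQLite stands in for whatever database the suite actually uses.

```python
import itertools
import sqlite3
from dataclasses import dataclass

_order_ids = itertools.count(1)


@dataclass
class Order:
    id: int
    status: str = "pending"
    total_cents: int = 1000
    currency: str = "USD"


class OrderFactory:
    @staticmethod
    def build(**overrides) -> Order:
        """Build an Order with sensible defaults; callers override only what a test cares about."""
        return Order(**{"id": next(_order_ids), **overrides})


def run_isolated(conn: sqlite3.Connection, test_fn):
    """Run test_fn inside a transaction and roll it back so no rows outlive the test."""
    conn.execute("BEGIN")
    try:
        return test_fn(conn)
    finally:
        if conn.in_transaction:
            conn.execute("ROLLBACK")


if __name__ == "__main__":
    # isolation_level=None puts the connection in autocommit mode,
    # so the explicit BEGIN / ROLLBACK above control the transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")

    def insert_refund(c):
        order = OrderFactory.build(status="refunded")
        c.execute("INSERT INTO orders VALUES (?, ?)", (order.id, order.status))
        return c.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

    print(run_isolated(conn, insert_refund))  # row is visible inside the transaction
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # gone afterwards
```

Rollback-based reset is fast but only works when the code under test shares the test's connection; suites that span processes usually need truncation or a fresh schema instead.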
Some data isn't test-specific — reference tables (currencies, countries, SKU catalog). Policy:
| Need | Tools |
|---|---|
| Factories | factory_bot / factory-boy / fishery / Bogus |
| Masking | Delphix / IBM Optim / tonic.ai / custom ETL |
| Synthesis (AI) | Gretel.ai / MOSTLY AI (evaluate re-id risk) |
| Snapshot / restore | pg_restore / mysqldump + dataset / LiteFS |
| Seeding scripts | SQL / migrations / code-based factories |
| Deterministic fake | faker with seed |
| Vault for tokens | HashiCorp Vault / cloud secrets |
| Anti-pattern | Fix |
|---|---|
| Raw prod dump in dev | Forbidden; replace with synthetic / masked |
| Shared mutable test DB across suites | Per-suite schema or transactional isolation |
| Hidden random seeds | Pin seeds; expose in logs |
| One giant golden dataset for all tests | Split by purpose |
| Masking by eye | Use tooling; audit fields |
| Perf data copied to staging | Separate env; don't conflate |
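For the "hidden random seeds" row, a minimal sketch of pinning and logging a seed looks like this. It uses only the standard library; the environment variable name `TEST_DATA_SEED`, the default seed, and the customer fields are assumptions made for illustration. The same idea applies when a faker library is seeded instead of `random.Random`.

```python
import logging
import os
import random

log = logging.getLogger("testdata")


def make_rng(env_var: str = "TEST_DATA_SEED", default: int = 1234) -> random.Random:
    """Create a seeded generator and log the seed so a failing run can be replayed."""
    seed = int(os.environ.get(env_var, default))
    log.info("test data seed=%d (set %s to reproduce)", seed, env_var)
    return random.Random(seed)


def fake_customers(rng: random.Random, n: int) -> list[dict]:
    """Generate n synthetic customers; the same seed yields the same rows."""
    first_names = ["Ada", "Grace", "Linus", "Margaret", "Alan"]
    return [
        {
            "id": i,
            "name": rng.choice(first_names),
            "email": f"user{i}@example.test",
            "age": rng.randint(18, 90),
        }
        for i in range(1, n + 1)
    ]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    customers = fake_customers(make_rng(), 3)
    print(customers)
```

Passing the generator explicitly, rather than seeding the global `random` module, keeps one test's draws from shifting another test's data.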
flowchart LR
Prod[(Prod)] -->|never raw| X[blocked]
Prod -->|extract + mask| Masker[Mask pipeline]
Masker --> Staging[(Staging)]
Masker --> Perf[(Perf)]
Factories[Factories] --> Dev[(Dev)]
Factories --> CI[(CI)]
Factories --> Integration[(Integration)]
Golden[(Golden datasets)] --> Integration
Golden --> UAT[(UAT)]
flowchart TD
F[Field]
F --> Q{PII class?}
Q -->|direct PII| D[Pseudonymize / tokenize]
Q -->|quasi-id| G[Generalize]
Q -->|regulated: card/SSN| T[Tokenize via vault]
Q -->|not sensitive| N[Keep as-is]
Per diagram-rendering mixin.
# Test Data Management Strategy: [Product]
**Date**: [date]
**Environments**: [...]
**Data classes**: [...]
**Compliance**: [...]
> Disclaimer: Not legal advice.
## Scope
## Data Source Strategies
## Per-Environment Data Matrix
## Masking Techniques
## Re-Identification Risk
## Lifecycle (Generation / Seeding / Reset / Refresh)
## Scale Needs
## Reference / Master Data
## Third-Party Sandboxes
## Tooling
## Governance
## Anti-Patterns to Avoid
## Metrics
## Diagrams
## Hand-offs
## Assumptions & Limitations
Present for user approval. Save only after confirmation.
| Situation | Behavior |
|---|---|
| No compliance context | Interview mode (§7) |
| "Copy prod to staging" | Reject; recommend mask pipeline |
| No field-level PII classification | Require before masking design |
| Re-id risk ignored | Add analysis |
| Legal / DPIA request | Redirect to counsel + DPO |
| Perf data in staging | Recommend separation |
| mmdc failure | See diagram-rendering mixin |
- [ ] Synthetic-first per env
- [ ] Never raw prod PII in lower envs
- [ ] Masking technique per field class
- [ ] Re-identification risk addressed
- [ ] Deterministic seeds
- [ ] Reset strategy per test
- [ ] Refresh pipeline (if masked / subset)
- [ ] Third-party sandbox policy
- [ ] Tooling chosen
- [ ] Governance ownership named
- [ ] Metrics + incidents-0 target
- [ ] Disclaimer present
- [ ] Diagrams valid
- [ ] No fabricated claims
- [ ] Report follows output contract