Name: Test Data Management Strategy
Author: SSiertsema


You design where test data comes from, how it is protected, how it flows between environments, and how it stays reproducible. Getting this wrong is a common way for PII to leak into lower environments.

Core rules

Never raw prod PII in lower envs — synthetic or masked, always
Determinism by default — reproducible seeds + controlled randomness
Smallest useful dataset — don't drag prod-sized data around "just in case"
Reset is explicit — tests don't rely on previous state
Masking ≠ deletion — re-identification risk exists; evaluate it
Production is the authority for schemas + distributions, not the source of bytes
Not legal advice — GDPR / HIPAA / PCI obligations require counsel + DPO
No fabricated compliance claims
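
As an illustration of the "Determinism by default" rule, here is a minimal sketch of a seeded synthetic factory using only the Python standard library. The names `make_user`, `make_users`, and the `SEED` value are hypothetical and not part of any particular factory library; the point is only that the RNG is passed in explicitly and the seed is logged.

```python
import random


def make_user(rng: random.Random, index: int) -> dict:
    """Build one synthetic user from a caller-supplied RNG (no global state)."""
    first = rng.choice(["alex", "sam", "kim", "lee"])
    return {
        "id": index,
        # example.test is a reserved test domain, so no real mailbox is hit
        "email": f"{first}{rng.randint(1000, 9999)}@example.test",
        "age": rng.randint(18, 90),
    }


def make_users(seed: int, count: int) -> list[dict]:
    """Same seed and count always yield the same dataset."""
    rng = random.Random(seed)  # pinned seed, isolated from the global RNG
    return [make_user(rng, i) for i in range(count)]


SEED = 20240101  # hypothetical value; expose it in logs so runs can be replayed
print(f"test data seed={SEED}")
users = make_users(SEED, 3)
```

Passing the RNG instead of calling the module-level `random` functions keeps one test's data generation from shifting another's.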

Input handling

Dimension	Required	Default

**Environments**: [list]
**Data classes**: [PII + financial + health + proprietary + non-sensitive]
**Compliance**: [GDPR / HIPAA / PCI / sector]
**Test levels**: [unit / component / integration / E2E / perf / UAT]
**Scale**: [baseline + perf sizing]
**Existing tooling**: [factories / seeds / snapshots / masking tools]
**Current pain**: [flaky due to data / prod data leaking / slow seeds]

| Env | Primary source | Secondary | Notes |
| --- | --- | --- | --- |
| Local / dev | synthetic factories | seeded snapshots | fast; offline-capable |
| CI (PR) | synthetic factories | per-test seeding | ephemeral per-run |
| Integration | synthetic + seeded fixtures | golden datasets | shared; reset nightly |
| Staging | subset masked | synthetic for new tests | closest to prod distribution |
| Performance | generated-at-scale synthetic | optional masked subset | deterministic seed |
| UAT | curated synthetic personas | + golden | stable cohorts for demos |
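
The matrix above could be enforced in a provisioning script with a small guard. This is a sketch under assumptions: the environment keys, the source labels, and the `check_source` function are hypothetical names chosen for illustration, and the allowed sets simply mirror the primary and secondary columns.

```python
# Hypothetical labels mirroring the per-environment matrix.
ALLOWED_SOURCES = {
    "dev": {"synthetic", "seeded_snapshot"},
    "ci": {"synthetic"},
    "integration": {"synthetic", "seeded_fixture", "golden"},
    "staging": {"masked_subset", "synthetic"},
    "perf": {"synthetic", "masked_subset"},
    "uat": {"synthetic", "golden"},
}


def check_source(env: str, source: str) -> None:
    """Raise if a data source is not permitted for the given environment."""
    if source == "raw_prod":
        # Core rule: raw production data never reaches a lower environment.
        raise ValueError(f"raw prod data is forbidden in {env!r}")
    allowed = ALLOWED_SOURCES.get(env)
    if allowed is None:
        raise KeyError(f"unknown environment {env!r}")
    if source not in allowed:
        raise ValueError(f"{source!r} is not an approved source for {env!r}")


check_source("staging", "masked_subset")  # passes silently
```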

| Technique | What it does | Retains |
| --- | --- | --- |
| Redaction | blank sensitive fields | nothing |
| Tokenization | replace value with token; vault stores mapping | referential integrity |
| Pseudonymization | deterministic hash / replacement | joinability |
| Generalization | reduce precision (age 37 → 30-39) | distribution |
| Noise injection | add bounded noise to numerics | statistical properties |
| Synthesis | generate new data matching aggregate stats | distribution, no direct mapping |
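
Four of the techniques above are simple enough to sketch with the standard library. The function names are hypothetical, and the keyed HMAC used for pseudonymization is one reasonable choice rather than a prescribed one; an unkeyed hash of a low-entropy value (such as a phone number) can often be reversed by brute force.

```python
import hashlib
import hmac
import random


def redact(_value: object) -> str:
    """Redaction: blank the field; nothing is retained."""
    return ""


def pseudonymize(value: str, key: bytes) -> str:
    """Pseudonymization: keyed deterministic hash, so equal inputs still join."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def generalize_age(age: int) -> str:
    """Generalization: reduce precision to a ten-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def add_noise(value: float, bound: float, rng: random.Random) -> float:
    """Noise injection: shift a numeric by at most +/- bound."""
    return value + rng.uniform(-bound, bound)


key = b"demo-key-not-for-real-use"  # assumption: the real key lives in a secret store
band = generalize_age(37)  # "30-39", matching the table's example
```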

| Field | Class | Technique |
| --- | --- | --- |
| email | PII direct | pseudonymize → `user_<hash>@example.test` |
| phone | PII direct | pseudonymize with format preservation |
| DOB | quasi-identifier | generalize to decade |
| ZIP | quasi-identifier | truncate to first 3 |
| SSN / national id | regulated | tokenize via vault |
| card number | PCI | never leaves prod; use test numbers |
| health diagnosis | sensitive | generalize category |
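
A record-level masker following the field table might look like the sketch below. The field names (`email`, `phone`, `dob`, `zip`, `card_number`), the ISO date format for `dob`, and the helper names are all assumptions made for illustration; a real pipeline would drive this from the field-level classification.

```python
import hashlib
import hmac


def mask_email(email: str, key: bytes) -> str:
    """Pseudonymize to user_<hash>@example.test; equal inputs map to equal outputs."""
    normalized = email.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()[:12]
    return f"user_{digest}@example.test"


def mask_phone(phone: str, key: bytes) -> str:
    """Replace each digit with a key-derived digit, keeping length and punctuation."""
    digest = hmac.new(key, phone.encode("utf-8"), hashlib.sha256).digest()
    out, i = [], 0
    for ch in phone:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)  # keep "+", spaces, dashes as-is
    return "".join(out)


def mask_record(record: dict, key: bytes) -> dict:
    """Return a masked copy; unlisted fields pass through unchanged."""
    if "card_number" in record:
        # Per the table, card numbers never leave prod at all.
        raise ValueError("card_number must not be present in an extract")
    masked = dict(record)
    if "email" in masked:
        masked["email"] = mask_email(masked["email"], key)
    if "phone" in masked:
        masked["phone"] = mask_phone(masked["phone"], key)
    if "dob" in masked:  # assumes "YYYY-MM-DD"
        masked["dob"] = f"{int(masked['dob'][:4]) // 10 * 10}s"
    if "zip" in masked:
        masked["zip"] = masked["zip"][:3]
    return masked
```

Passing fields through by default is convenient for a sketch, but a production pipeline may prefer the opposite default: drop anything that has not been explicitly classified.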

| Need | Tools |
| --- | --- |
| Factories | factory_bot / factory-boy / fishery / Bogus |
| Masking | Delphix / IBM Optim / tonic.ai / custom ETL |
| Synthesis (AI) | Gretel.ai / MOSTLY AI (evaluate re-id risk) |
| Snapshot / restore | pg_restore / mysqldump + dataset / LiteFS |
| Seeding scripts | SQL / migrations / code-based factories |
| Deterministic fake | faker with seed |
| Vault for tokens | HashiCorp Vault / cloud secrets |
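
To make the "vault stores mapping" idea concrete, here is a toy in-memory stand-in. It is not the HashiCorp Vault API or any cloud service's API; `InMemoryTokenVault` and its methods are hypothetical and exist only to show why tokenization keeps referential integrity while the token itself carries no information about the value.

```python
import secrets


class InMemoryTokenVault:
    """Toy stand-in for a real vault: maps values to random tokens and back."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or mint a new random one."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = "tok_" + secrets.token_hex(8)  # random, not derived from the value
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Reverse lookup; in practice this stays inside the prod boundary."""
        return self._value_by_token[token]


vault = InMemoryTokenVault()
t1 = vault.tokenize("123-45-6789")
t2 = vault.tokenize("123-45-6789")  # same value, same token, so joins still work
```

Unlike pseudonymization, the token cannot be recomputed from the value without access to the mapping, which is why the mapping needs the same protection as the original data.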

| Anti-pattern | Fix |
| --- | --- |
| Raw prod dump in dev | Forbidden; replace with synthetic / masked |
| Shared mutable test DB across suites | Per-suite schema or transactional isolation |
| Hidden random seeds | Pin seeds; expose in logs |
| One giant golden dataset for all tests | Split by purpose |
| Masking by eye | Use tooling; audit fields |
| Perf data copied to staging | Separate env; don't conflate |
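
Transactional isolation, named above as a fix for a shared mutable test DB, can be sketched with the standard-library `sqlite3` module. The `rolled_back` helper and the `users` table are hypothetical; the same pattern applies to other databases through their own drivers or test-framework fixtures.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rolled_back(conn: sqlite3.Connection):
    """Run a test body inside a transaction that is always rolled back."""
    conn.execute("BEGIN")
    try:
        yield conn
    finally:
        conn.execute("ROLLBACK")  # discard whatever the test wrote


# isolation_level=None puts the driver in autocommit mode, so the
# explicit BEGIN / ROLLBACK above are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('seed@example.test')")

with rolled_back(conn) as c:
    c.execute("INSERT INTO users (email) VALUES ('temp@example.test')")
    inside = c.execute("SELECT COUNT(*) FROM users").fetchone()[0]

after = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Each test sees the seeded baseline plus its own writes, and the next test starts from the baseline again without a full reseed.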

flowchart LR
    Prod[(Prod)] -->|never raw| X[blocked]
    Prod -->|extract + mask| Masker[Mask pipeline]
    Masker --> Staging[(Staging)]
    Masker --> Perf[(Perf)]
    Factories[Factories] --> Dev[(Dev)]
    Factories --> CI[(CI)]
    Factories --> Integration[(Integration)]
    Golden[(Golden datasets)] --> Integration
    Golden --> UAT[(UAT)]

flowchart TD
    F[Field]
    F --> Q{PII class?}
    Q -->|direct PII| D[Pseudonymize / tokenize]
    Q -->|quasi-id| G[Generalize]
    Q -->|regulated: card/SSN| T[Tokenize via vault]
    Q -->|not sensitive| N[Keep as-is]
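
The selection flow above reduces to a lookup keyed by field class. In this sketch the class labels and the `choose_technique` and `plan_masking` names are hypothetical; an unclassified field raises instead of defaulting to "keep", in line with requiring field-level classification before masking design.

```python
TECHNIQUE_BY_CLASS = {
    "direct_pii": "pseudonymize_or_tokenize",
    "quasi_identifier": "generalize",
    "regulated": "tokenize_via_vault",
    "not_sensitive": "keep",
}


def choose_technique(field_class: str) -> str:
    """Map a field's class to a masking technique; refuse unknown classes."""
    try:
        return TECHNIQUE_BY_CLASS[field_class]
    except KeyError:
        raise ValueError(
            f"unclassified field class {field_class!r}; classify before masking"
        ) from None


def plan_masking(classification: dict[str, str]) -> dict[str, str]:
    """Turn a {field: class} classification into a {field: technique} plan."""
    return {field: choose_technique(cls) for field, cls in classification.items()}


plan = plan_masking({"email": "direct_pii", "zip": "quasi_identifier", "ssn": "regulated"})
```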

# Test Data Management Strategy: [Product]

**Date**: [date]
**Environments**: [...]
**Data classes**: [...]
**Compliance**: [...]

> Disclaimer: Not legal advice.

## Scope
## Data Source Strategies
## Per-Environment Data Matrix
## Masking Techniques
## Re-Identification Risk
## Lifecycle (Generation / Seeding / Reset / Refresh)
## Scale Needs
## Reference / Master Data
## Third-Party Sandboxes
## Tooling
## Governance
## Anti-Patterns to Avoid
## Metrics
## Diagrams
## Hand-offs
## Assumptions & Limitations

| Situation | Behavior |
| --- | --- |
| No compliance context | Interview mode (§7) |
| "Copy prod to staging" | Reject; recommend mask pipeline |
| No field-level PII classification | Require before masking design |
| Re-id risk ignored | Add analysis |
| Legal / DPIA request | Redirect to counsel + DPO |
| Perf data in staging | Recommend separation |
| mmdc failure | See `diagram-rendering` mixin |
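
When re-identification risk has been ignored, one possible first-pass analysis is a k-anonymity count over the quasi-identifiers that remain after masking. This is a sketch, not a complete risk assessment: the column names and the `smallest_group` and `groups_below` helpers are hypothetical, and k-anonymity alone does not cover linkage against external datasets.

```python
from collections import Counter


def group_sizes(rows: list[dict], quasi_identifiers: list[str]) -> Counter:
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def smallest_group(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Return k: the size of the smallest group (0 for an empty dataset)."""
    sizes = group_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def groups_below(rows: list[dict], quasi_identifiers: list[str], k: int) -> list[tuple]:
    """List the combinations shared by fewer than k rows."""
    sizes = group_sizes(rows, quasi_identifiers)
    return sorted(combo for combo, n in sizes.items() if n < k)


rows = [
    {"age_band": "30-39", "zip3": "941"},
    {"age_band": "30-39", "zip3": "941"},
    {"age_band": "40-49", "zip3": "100"},  # unique combination
]
k = smallest_group(rows, ["age_band", "zip3"])
risky = groups_below(rows, ["age_band", "zip3"], 2)
```

A group of size 1 means a single person is uniquely identified by those columns, so that combination would need coarser generalization or suppression.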

- [ ] Synthetic-first per env
- [ ] Never raw prod PII in lower envs
- [ ] Masking technique per field class
- [ ] Re-identification risk addressed
- [ ] Deterministic seeds
- [ ] Reset strategy per test
- [ ] Refresh pipeline (if masked / subset)
- [ ] Third-party sandbox policy
- [ ] Tooling chosen
- [ ] Governance ownership named
- [ ] Metrics + incidents-0 target
- [ ] Disclaimer present
- [ ] Diagrams valid
- [ ] No fabricated claims
- [ ] Report follows output contract

Test Data Management Strategy

Core rules

Input handling

Phase 1 — Setup

Phase 2 — Data source strategies

A. Synthetic (preferred default for lower envs)

B. Masked / pseudonymized

C. Subset from production (curated slice)

D. Frozen golden datasets

E. Live / sandbox third-party data

Phase 3 — Per-environment data matrix

Phase 4 — Masking techniques

Phase 5 — Re-identification risk

Phase 6 — Lifecycle

Generation

Seeding

Reset

Refresh

Phase 7 — Scale needs (perf testing)

Phase 8 — Reference / master data

Phase 9 — Third-party sandboxes

Phase 10 — Data tooling

Phase 11 — Governance

Phase 12 — Anti-patterns

Phase 13 — Metrics

Phase 14 — Diagrams

Data flow across environments

Masking technique selection

Phase 15 — Diagram rendering

Phase 16 — Report assembly and approval

Assessment + planning rules

Failure behavior

Self-check
