Replace real PII in a dataset with realistic synthetic equivalents while preserving row counts, column types, and statistical distributions. Detects names, emails, phones, SSNs, addresses, credit cards, and user-identifying columns via name heuristics + value patterns. Use this skill when the user wants to "anonymize this dataset", "scrub PII", "make this data safe to share", "de-identify real data", "create a synthetic copy", or needs a sharable version of production data without exposing individuals.
Turn a real dataset into a safe synthetic equivalent: original PII is replaced with Faker-generated values, while row counts, column types, numeric distributions, and categorical frequencies are preserved.
pip install openpyxl faker numpy pandas --break-system-packages
Ask the user for the input file (xlsx/csv/json) and output path. Then run the detector in scan-only mode to list candidate PII columns with confidence scores:
python scripts/anonymize.py --input data.xlsx --scan
The detector flags columns via two signals:
- Column name heuristics: column names such as name, email, phone, ssn, address, dob, ip, credit_card, etc.
- Value patterns: cell values that match known PII formats.

Each flagged column receives a suggested Faker method (e.g., email_address → faker.email, home_phone → faker.phone_number, full_name → faker.name).
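
The two detection signals can be illustrated with a minimal standard-library sketch. The hint table, the regular expressions, the `scan_column` helper, and the confidence formula are all assumptions made for illustration; they are not taken from `scripts/anonymize.py`.

```python
import re

# Hypothetical name hints mapped to suggested Faker methods (illustrative only).
NAME_HINTS = {
    "email": "email",
    "phone": "phone_number",
    "name": "name",
    "ssn": "ssn",
    "address": "address",
}

# Hypothetical value patterns; a real detector would likely validate more strictly.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone_number": re.compile(r"^\+?[\d\s().-]{7,20}$"),
}


def scan_column(column_name, values):
    """Return (suggested_faker_method, confidence), or (None, 0.0) if nothing fires."""
    lowered = column_name.lower()
    # Signal 1: the column name contains a known hint.
    name_hit = next((m for hint, m in NAME_HINTS.items() if hint in lowered), None)

    # Signal 2: the share of non-empty values matching a PII pattern.
    non_empty = [str(v) for v in values if v not in (None, "")]
    best_method, best_ratio = None, 0.0
    if non_empty:
        for method, pattern in VALUE_PATTERNS.items():
            ratio = sum(bool(pattern.match(v)) for v in non_empty) / len(non_empty)
            if ratio > best_ratio:
                best_method, best_ratio = method, ratio

    if name_hit and best_method == name_hit:
        return name_hit, round(0.5 + 0.5 * best_ratio, 2)  # both signals agree
    if name_hit:
        return name_hit, 0.5  # name signal only
    if best_ratio >= 0.8:
        return best_method, round(0.5 * best_ratio, 2)  # value signal only
    return None, 0.0
```

Combining the signals this way means a column is scored highest when its name and its values agree, which is one plausible way to produce the confidence scores shown in the scan report.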
Show the user the detection report. Ask them to confirm or override each suggested mapping and to mark any columns to keep or drop.
Columns the user marks as "keep" stay unchanged. This is the user's chance to protect join keys, timestamps, geographies, or categorical fields that must retain real values.
python scripts/anonymize.py --input data.xlsx --output data_anon.xlsx \
--map "email=email,phone=phone_number,full_name=name,ssn=ssn" \
--keep "department,hire_date,salary" \
--seed 42
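
What the map, keep, and seed options in the command above might do to each row can be sketched as follows. This is a simplified illustration, not the script's implementation: `anonymize_rows` is a hypothetical helper, and `STAND_IN_GENERATORS` replaces Faker with standard-library stand-ins so the example runs without extra packages.

```python
import random

# Stand-ins for Faker methods (assumption: the real script calls Faker instead).
STAND_IN_GENERATORS = {
    "email": lambda rng: f"user{rng.randrange(10**6):06d}@example.com",
    "phone_number": lambda rng: f"555-{rng.randrange(10**4):04d}",
    "name": lambda rng: f"Person {rng.randrange(10**6):06d}",
}


def anonymize_rows(rows, mapping, keep=(), drop=(), seed=None):
    """Return new rows with mapped columns replaced; the input rows are not modified."""
    rng = random.Random(seed)  # seeded so repeated runs give the same output
    keep, drop = set(keep), set(drop)
    result = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in drop:
                continue  # dropped columns are removed entirely
            if column in mapping and column not in keep:
                new_row[column] = STAND_IN_GENERATORS[mapping[column]](rng)
            else:
                new_row[column] = value  # kept or unmapped columns pass through
        result.append(new_row)
    return result
```

In this sketch a column listed under keep wins over a mapping for the same column, which is one reasonable way to resolve the conflict; the row count is always preserved.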
CLI flags:
| Flag | Description |
|---|---|
| `--input` | Source dataset (xlsx/csv/json) |
| `--output` | Output path (default: `<input>_anon.<ext>`) |
| `--scan` | Detect-only; print report and exit |
| `--map` | Comma list of `col=faker_method` overrides |
| `--keep` | Comma list of columns to pass through unchanged |
| `--drop` | Comma list of columns to remove entirely |
| `--preserve-joins` | Columns that must map consistently (same real value → same fake value) |
| `--locale` | Faker locale (default `en_US`) |
| `--seed` | Seed for reproducible anonymization |
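
For orientation, the flags in the table could be declared with `argparse` roughly as below. This is a sketch under assumptions: `parse_pairs`, `parse_list`, `build_parser`, and `default_output` are hypothetical names, and treating `--preserve-joins` as a comma list is a guess based on the other list flags.

```python
import argparse
from pathlib import Path


def parse_pairs(text):
    """Turn 'a=b,c=d' into {'a': 'b', 'c': 'd'}."""
    pairs = {}
    for item in (part.strip() for part in text.split(",")):
        if not item:
            continue
        column, sep, method = item.partition("=")
        if not sep or not column.strip() or not method.strip():
            raise argparse.ArgumentTypeError(f"expected col=faker_method, got {item!r}")
        pairs[column.strip()] = method.strip()
    return pairs


def parse_list(text):
    """Turn 'a, b,c' into ['a', 'b', 'c']."""
    return [part.strip() for part in text.split(",") if part.strip()]


def default_output(input_path):
    """Build the <input>_anon.<ext> default described in the table."""
    path = Path(input_path)
    return str(path.with_name(f"{path.stem}_anon{path.suffix}"))


def build_parser():
    parser = argparse.ArgumentParser(prog="anonymize.py")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output")
    parser.add_argument("--scan", action="store_true")
    parser.add_argument("--map", type=parse_pairs, default={})
    parser.add_argument("--keep", type=parse_list, default=[])
    parser.add_argument("--drop", type=parse_list, default=[])
    parser.add_argument("--preserve-joins", type=parse_list, default=[])
    parser.add_argument("--locale", default="en_US")
    parser.add_argument("--seed", type=int)
    return parser
```

Parsing the comma lists inside `type=` callables keeps validation errors (such as a `--map` entry without `=`) in argparse's normal error reporting.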
If the dataset has foreign keys (e.g., customer_id in orders referencing customers.customer_id),
pass --preserve-joins customer_id so the same real customer_id always maps to the same fake one
across all tables. This keeps relational integrity intact.
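
One way to get this consistency is to cache each real value's replacement and reuse it for every table processed in the same run. The sketch below assumes that approach; `ConsistentMapper` and the `C`-prefixed ID format are hypothetical and not taken from the script.

```python
import random


class ConsistentMapper:
    """Map each real value to one fake ID, reused for the lifetime of the mapper."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._fake_by_real = {}  # real value -> fake ID
        self._used = set()       # fake numbers already handed out

    def fake_id(self, real_value):
        if real_value not in self._fake_by_real:
            candidate = self._rng.randrange(10**8)
            while candidate in self._used:  # avoid two real IDs sharing a fake ID
                candidate = self._rng.randrange(10**8)
            self._used.add(candidate)
            self._fake_by_real[real_value] = f"C{candidate:08d}"
        return self._fake_by_real[real_value]


# Usage: share one mapper across all tables in the run.
mapper = ConsistentMapper(seed=42)
customers = [{"customer_id": 101}, {"customer_id": 102}]
orders = [{"customer_id": 102}, {"customer_id": 101}, {"customer_id": 102}]
anon_customers = [{"customer_id": mapper.fake_id(r["customer_id"])} for r in customers]
anon_orders = [{"customer_id": mapper.fake_id(r["customer_id"])} for r in orders]
```

Because the cache lives in memory, this sketch only guarantees consistency within a single run; tables anonymized in separate runs would need the same seed and the same order of first appearance to line up.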
Print a summary covering preserve-joins consistency within a run and where the output was written (--output or a suffixed path).