Normalize, classify, and manage contact databases across 3 Gmail accounts: clean CSV exports, deduplicate, tag categories, and flag touchbase/unsubscribe candidates.
| Account | Source Path | Format | Entries |
|---|---|---|---|
| ace | aceengineer-admin/admin/contacts/aceengineer_contacts.csv | Outlook CSV export | ~1,306 |
| personal | aceengineer-admin/admin/contacts/achantav_contacts.csv | Outlook CSV export | ~1,157 |
| skestates | sabithaandkrishnaestates/admin/contacts/ | Manual (from key_contacts.md) | ~20 |
Known issues in the raw CSVs:
Output format (per account): contacts_normalized.csv
email,first_name,last_name,company,category,touchbase_cadence,notes
For SKEstates (expanded schema with department/phone):
email,first_name,last_name,title,company,department,category,touchbase_cadence,phone,notes
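As a rough illustration of the standard output schema, the sketch below maps one raw export row onto the normalized columns and writes it with a header. The raw column names in `RAW_MAP` and the helper names `normalize_row` and `write_normalized` are assumptions for illustration, not taken from the actual normalizer script; check them against the real export headers.

```python
import csv
import io

# Target column order, copied from the standard (non-SKEstates) header line.
FIELDS = ["email", "first_name", "last_name", "company",
          "category", "touchbase_cadence", "notes"]

# Assumed raw column names; adjust to match the actual export.
RAW_MAP = {
    "E-mail Address": "email",
    "First Name": "first_name",
    "Last Name": "last_name",
    "Company": "company",
}

def normalize_row(raw):
    """Map one raw export row (a dict) onto the normalized schema."""
    row = {field: "" for field in FIELDS}
    for raw_key, field in RAW_MAP.items():
        row[field] = (raw.get(raw_key) or "").strip()
    row["email"] = row["email"].lower()
    row["category"] = "unknown"  # placeholder until classification runs
    return row

def write_normalized(rows, handle):
    """Write rows to an open text handle with the schema header."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Example with made-up data.
buf = io.StringIO()
write_normalized([normalize_row({"First Name": " Jane ",
                                 "E-mail Address": "Jane.Doe@Example.com"})], buf)
print(buf.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.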
Categories:
- client: current/past clients (EPCI operators, engineering firms)
- colleague: current/former colleagues, same-company contacts
- vendor: service providers, suppliers
- recruiter: staffing, job-related contacts
- industry: industry organizations, standards bodies
- alumni: university contacts (Rice, TAMU, UT McCombs)
- personal: friends, family (gmail/yahoo with known names)
- financial: banks, insurance, investment
- government: tax, legal, regulatory
- newsletter: subscriptions, marketing lists
- spam: junk, craigslist, unknown bulk
- tenant: tenant contacts (for real estate accounts)
- vendor-functional: shared functional inboxes (no touchbase)
- tenant-functional: shared tenant inboxes

Ace client domains:
ril.com, dorisgroup.com, mcdermott.com, shell.com, kbr.com,
technip.com, technipfmc.com, subsea7.com, nov.com, aker.com,
bp.com, awilcodrilling.com, eagle.org, vulcanoffshore.com,
boptechnologies.com, risersinc.com, sandsig.com,
engineeredcustomsolutions.com, mecorparada.com.ve
Ace colleague domains:
trendsetterengineering.com, spire-engineers.com,
2hoffshoreinc.com, 2hoffshore.com, prospricing.com
Ace vendor domains:
disys.com, winworldinfo.com, partneresi.com, ansys.com,
akselos.com, engys.com, dnvgl.com, tescocorp.com,
deccaconsulting.com, flooranddecor.com, pulse-monitoring.com,
quantumep.com, acematrix.com
Ace recruiter domains:
stepstoprogress.com, thejukesgroup.com, apexsystems.com,
indianeagle.com
Ace industry domains:
km.kongsberg.com, ceesol.com
Personal alumni domains:
rice.edu, mccombs.utexas.edu, neo.tamu.edu, tamu.edu, houstonisd.org
Personal financial domains:
aaa-texas.com, colehealth.com, harkandgroup.com, aol.com,
sbcglobal.net, constellation.com
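One way the domain lists could drive classification is a simple exact-match lookup on the part of the address after the `@`. The sketch below uses short excerpts of the lists above. `ACE_CLIENT_DOMAINS` and `ACE_VENDOR_DOMAINS` are the set names mentioned for the normalizer script, while `ACE_RECRUITER_DOMAINS`, `DOMAIN_CATEGORIES`, and `classify_by_domain` are hypothetical names used only for this example.

```python
# Short excerpts of the domain lists; the full sets live in the normalizer script.
ACE_CLIENT_DOMAINS = {"shell.com", "bp.com", "kbr.com"}
ACE_VENDOR_DOMAINS = {"ansys.com", "dnvgl.com"}
ACE_RECRUITER_DOMAINS = {"apexsystems.com"}  # hypothetical set name

# Checked in order; the first matching set wins.
DOMAIN_CATEGORIES = [
    ("client", ACE_CLIENT_DOMAINS),
    ("vendor", ACE_VENDOR_DOMAINS),
    ("recruiter", ACE_RECRUITER_DOMAINS),
]

def classify_by_domain(email):
    """Return a category for the email's domain, or "unknown" if none matches."""
    domain = email.strip().lower().rsplit("@", 1)[-1]
    for category, domains in DOMAIN_CATEGORIES:
        if domain in domains:  # exact match only; subdomains are listed explicitly
            return category
    return "unknown"

print(classify_by_domain("Someone@Shell.com"))
print(classify_by_domain("x@nowhere.example"))
```

Exact matching is assumed here because the lists name subdomains such as km.kongsberg.com and neo.tamu.edu explicitly.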
Spam domains (auto-remove):
sale.craigslist.org, talkmatch.com
Spam patterns (regex):
unsubscribe, no.?reply, noreply, do.?not.?reply
mailer-daemon, postmaster@, bounce@
@unsubscribe2\., \.unsubscribe\., @sailthru\., @mcsv\., @rsys5\., @customer\.io
Spam name fragments:
craigslist, mailer-daemon, unsubscribe, academia.edu, 123greetings, machinemetrics
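The spam domains, regex patterns, and name fragments above could be combined into a single check along these lines. This is a minimal sketch: the function name `is_spam` and the choice to test the patterns against the full lowercased address are assumptions, not a description of the actual script.

```python
import re

SPAM_DOMAINS = {"sale.craigslist.org", "talkmatch.com"}

# Patterns copied from the list above, joined into one case-insensitive regex.
SPAM_PATTERNS = [
    r"unsubscribe", r"no.?reply", r"noreply", r"do.?not.?reply",
    r"mailer-daemon", r"postmaster@", r"bounce@",
    r"@unsubscribe2\.", r"\.unsubscribe\.", r"@sailthru\.",
    r"@mcsv\.", r"@rsys5\.", r"@customer\.io",
]
SPAM_RE = re.compile("|".join(SPAM_PATTERNS), re.IGNORECASE)

SPAM_NAME_FRAGMENTS = ["craigslist", "mailer-daemon", "unsubscribe",
                       "academia.edu", "123greetings", "machinemetrics"]

def is_spam(email, name=""):
    """Return True if the address or display name matches any spam rule."""
    email = email.strip().lower()
    domain = email.rsplit("@", 1)[-1]
    if domain in SPAM_DOMAINS:
        return True
    if SPAM_RE.search(email):
        return True
    lowered = name.lower()
    return any(fragment in lowered for fragment in SPAM_NAME_FRAGMENTS)

print(is_spam("no-reply@example.com"))
print(is_spam("jane@shell.com", "Jane Doe"))
```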
Rules:
132 overlaps found between ace and personal files. Decision rules (applied in order):
- aceengineer.com → ace only (internal domain)
- achanta in email prefix → personal only (family)

Result: 122 moved from personal to ace, 10 kept in personal. Zero remaining overlaps.
Report template: reports/email/contact-dedup-report.md
uv run scripts/email/contact-normalizer.py
Input: raw Outlook CSV exports from both accounts.
Output: aceengineer_normalized.csv (1,281 contacts) + achantav_normalized.csv (994 contacts).
Built from two markdown files:
- key_contacts.md (3 PCA vendor contacts)
- fd_corporate_contact_maintenance.md (22 Family Dollar/Dollar Tree staff)

Output: sabithaandkrishnaestates/admin/contacts/skestates_contacts.csv (25 contacts).
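from collections import Counter

# contacts: list of normalized row dicts with "email" and "category" keys.
# Count the domains of contacts still marked "unknown" and print the 20 most common.
unknown_domains = Counter(
    r["email"].split("@")[-1] for r in contacts if r["category"] == "unknown"
)
for domain, count in unknown_domains.most_common(20):
    print(f" {domain:40s} {count}")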
Then expand the domain classification sets (ACE_CLIENT_DOMAINS, ACE_VENDOR_DOMAINS, etc.)
in the normalizer script and re-run. Second pass typically reaches 80%+ classification.
### Step 4: Cross-file deduplication
When contacts appear in multiple account files (email addresses imported into
multiple Outlook exports), resolve overlaps using these rules:
| Signal | Canonical Home | Rationale |
|---|---|---|
| aceengineer.com domain | ace | Internal domain |
| achantav* in email prefix | personal | Family email |
| Company domain (non-gmail/yahoo) | ace | Professional contact |
| Personal email with company in record | ace | Cross-referenced professional |
| Both unknown, personal email | personal | No business value |
| Ace has classification | ace | Already categorized |
After dedup: annotate the canonical file with cross-ref note, remove from duplicate file.
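The rules table can be read as an ordered decision function. The sketch below is one possible encoding, applied top to bottom in table order. `canonical_home`, `PERSONAL_PROVIDERS`, the `startswith` reading of the achantav* rule, and the final fallback are assumptions made for illustration.

```python
# Assumption: providers treated as "personal email"; extend as needed.
PERSONAL_PROVIDERS = {"gmail.com", "yahoo.com"}

def canonical_home(email, ace_record, personal_record):
    """Return "ace" or "personal" for a contact present in both files."""
    local, _, domain = email.strip().lower().rpartition("@")
    if domain == "aceengineer.com":
        return "ace"                  # internal domain
    if local.startswith("achantav"):
        return "personal"             # family email
    if domain not in PERSONAL_PROVIDERS:
        return "ace"                  # company domain: professional contact
    if ace_record.get("company") or personal_record.get("company"):
        return "ace"                  # personal email with company on record
    ace_cat = ace_record.get("category", "unknown")
    personal_cat = personal_record.get("category", "unknown")
    if ace_cat == "unknown" and personal_cat == "unknown":
        return "personal"             # both unknown: no business value
    if ace_cat != "unknown":
        return "ace"                  # ace already has a classification
    return "personal"                 # assumption: fallback not covered by the table

print(canonical_home("eng@shell.com", {}, {}))
print(canonical_home("joe@gmail.com", {}, {}))
```

After choosing the home, the remaining steps are the ones stated above: annotate the canonical record with a cross-ref note and drop the row from the other file.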
### Step 5: Write output files
## Scripts
### Contact normalizer (run with uv)
```bash
uv run scripts/email/contact-normalizer.py
```

The normalizer processes all accounts in one run: it reads both raw CSVs and writes the normalized files alongside them. No CLI arguments are needed; paths are hardcoded for the workspace-hub convention.
Edit the domain sets in scripts/email/contact-normalizer.py (ACE_CLIENT_DOMAINS, ACE_VENDOR_DOMAINS, etc.) then re-run. The script is idempotent.
| Account | Normalized CSV | Classified CSV |
|---|---|---|
| ace | aceengineer-admin/admin/contacts/aceengineer_normalized.csv | aceengineer-admin/admin/contacts/aceengineer_classified.csv |
| personal | aceengineer-admin/admin/contacts/achantav_normalized.csv | aceengineer-admin/admin/contacts/achantav_classified.csv |
| skestates | sabithaandkrishnaestates/admin/contacts/skestates_contacts.csv | (small enough to classify manually) |
- … encoding="utf-8-sig".
- … legal-deny-list.yaml scan before committing
- … _normalized.csv alongside
- … cd into them to commit, don't use git add from workspace-hub root
- … (scripts/email/) but reads/writes in repo subdirectories; it finds the root by walking up from its own path