Design a data model — entities, relationships, constraints, and access patterns.
Design a data model for $ARGUMENTS.
Before creating any tables or schemas, understand the domain:
For each entity:
| Property | Question |
|---|---|
| Identity | What uniquely identifies an instance? (UUID, natural key, composite key?) |
| Lifecycle | How is it created? Modified? Archived? Deleted? |
| Ownership | Who owns it? What parent entity does it belong to? |
| Cardinality | How many will exist? (10s, 1000s, millions, billions?) |
| Mutability | Does it change after creation? Which fields change? How often? |
| Temporal | Do we need to know its state at a point in time? |
For every pair of related entities:
| Relationship type | Schema pattern | Example |
|---|---|---|
| One-to-one | FK on either side, or embed in same table | User ↔ UserProfile |
| One-to-many | FK on the "many" side pointing to the "one" | Source → Crawls |
| Many-to-many | Junction table with FKs to both sides | User ↔ Role via user_roles |
| Hierarchical (tree) | Self-referencing FK or materialised path | Category → SubCategory |
| Temporal | Validity period columns (valid_from, valid_to) | EmployeeRole with date range |
For each relationship, answer:
For each entity, define the full schema:
CREATE TABLE sources (
-- Identity
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Core attributes
name TEXT NOT NULL,
url TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'active'
CHECK (status IN ('active', 'paused', 'archived')),
-- Relationships
owner_id UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
-- Metadata
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
created_by UUID REFERENCES users(id),
-- Constraints
CONSTRAINT uq_sources_name_per_owner UNIQUE (owner_id, name)
);
-- Indexes for access patterns
CREATE INDEX idx_sources_owner_id ON sources(owner_id);
CREATE INDEX idx_sources_status ON sources(status) WHERE status = 'active';
Schema design rules:
| Rule | Implementation |
|---|---|
| UUIDs for primary keys | UUID PRIMARY KEY DEFAULT gen_random_uuid() — no sequential integers (enumeration risk, merge conflicts) |
| Timestamps with timezone | TIMESTAMPTZ not TIMESTAMP — always store UTC |
created_at and updated_at on every table | Audit trail, debugging, optimistic concurrency |
NOT NULL by default | Explicitly nullable fields are the exception, not the rule |
CHECK constraints for enums | Database-level validation, not just application-level |
| Named constraints | CONSTRAINT uq_sources_name_per_owner — not anonymous |
| Soft delete via status | status = 'archived' not DELETE FROM. Hard delete only for GDPR/privacy |
| No premature denormalisation | Normalise first. Denormalise only when query performance demands it, with evidence |
The schema must be optimised for the actual access patterns. Document every query pattern:
### Access Patterns
| # | Pattern | Query shape | Frequency | Latency SLA |
|---|---|---|---|---|
| AP1 | List sources for a user | `WHERE owner_id = ? ORDER BY created_at DESC LIMIT 25` | 100/min | < 50ms |
| AP2 | Get source by ID | `WHERE id = ?` | 500/min | < 10ms |
| AP3 | Search sources by name | `WHERE name ILIKE '%query%'` | 20/min | < 200ms |
| AP4 | Count active sources per user | `WHERE owner_id = ? AND status = 'active'` | 50/min | < 50ms |
| AP5 | List recent crawls for a source | `WHERE source_id = ? ORDER BY started_at DESC LIMIT 10` | 200/min | < 50ms |
Index strategy based on access patterns:
| Access pattern | Index needed | Type |
|---|---|---|
| AP1 | (owner_id, created_at DESC) | Composite, covering the sort |
| AP2 | (id) — PK, already indexed | Primary key |
| AP3 | gin(name gin_trgm_ops) | Trigram index for ILIKE |
| AP4 | (owner_id) WHERE status = 'active' | Partial index |
| AP5 | (source_id, started_at DESC) | Composite |
Rules:
WHERE status = 'active')| Relationship | ON DELETE strategy | Rationale |
|---|---|---|
| User → Sources | RESTRICT | Don't delete users with sources — orphaned data |
| Source → Crawls | CASCADE | Deleting a source removes all its crawls |
| Crawl → Pages | CASCADE | Deleting a crawl removes its pages |
| User → AuditLog | SET NULL | Keep the log entry, just anonymise the actor |
-- Quantity must be positive
ALTER TABLE order_items ADD CONSTRAINT chk_quantity_positive CHECK (quantity > 0);
-- End date must be after start date
ALTER TABLE subscriptions ADD CONSTRAINT chk_date_order CHECK (end_date > start_date);
-- Email format (basic)
ALTER TABLE users ADD CONSTRAINT chk_email_format CHECK (email ~* '^.+@.+\..+$');
-- Status transitions (enforced via trigger or application layer)
-- active -> paused -> active (allowed)
-- active -> archived (allowed)
-- archived -> active (NOT allowed without admin)
| Constraint | Scope | Implementation |
|---|---|---|
| User email | Global | UNIQUE (email) |
| Source name per owner | Per-owner | UNIQUE (owner_id, name) |
| Crawl per source per day | Per-source | UNIQUE (source_id, DATE(started_at)) |
If the domain benefits from event sourcing:
-- Events are immutable — append only, never update
CREATE TABLE source_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
stream_id UUID NOT NULL, -- The aggregate ID
stream_type TEXT NOT NULL, -- 'Source', 'Crawl', etc.
event_type TEXT NOT NULL, -- 'SourceCreated', 'CrawlStarted'
data JSONB NOT NULL, -- Event payload
metadata JSONB DEFAULT '{}', -- Correlation IDs, causation, user
version INTEGER NOT NULL, -- Sequence within the stream
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT uq_stream_version UNIQUE (stream_id, version)
);
CREATE INDEX idx_events_stream ON source_events(stream_id, version);
Event sourcing rules:
Design for change. Document how the schema evolves:
| Change type | Safe? | Strategy | Downtime required? |
|---|---|---|---|
| Add nullable column | YES | ALTER TABLE ADD COLUMN | No |
| Add column with default | YES | ALTER TABLE ADD COLUMN DEFAULT | No (Postgres 11+) |
| Add NOT NULL column | NO | Add nullable → backfill → add constraint | No (multi-step) |
| Remove column | NO | Stop reading → deploy → drop column | No (multi-step) |
| Rename column | NO | Add new → dual-write → migrate reads → drop old | No (multi-step) |
| Change column type | NO | Add new column → backfill → swap | Possibly |
| Add index | YES | CREATE INDEX CONCURRENTLY | No |
| Drop index | YES | DROP INDEX CONCURRENTLY | No |
Evolution rules:
ALTER TABLE on large tables can lock — use CONCURRENTLY for indexes| Principle | Implementation |
|---|---|
| Data minimisation | Don't collect what you don't need. If you don't need date of birth, don't add the column |
| PII identification | Tag columns containing PII: name, email, phone, address, IP, payment info |
| Encryption at rest | PII columns encrypted. Key rotation plan documented |
| Retention policy | Define how long each data type is kept. Automate deletion |
| Right to erasure | Document the deletion path: which tables, which columns, cascade vs anonymise |
| Anonymisation | Replace PII with hashed/randomised values rather than hard delete (preserves analytics) |
| Access logging | Who accessed PII and when? Audit table for sensitive data access |
-- PII columns documented
COMMENT ON COLUMN users.email IS 'PII: email address. Retention: account lifetime + 30 days. Erasure: anonymise to hash.';
COMMENT ON COLUMN users.name IS 'PII: full name. Retention: account lifetime + 30 days. Erasure: set to "Deleted User".';
TIMESTAMP without timezone — always TIMESTAMPTZ. Timezone-naive timestamps cause bugs in every timezone except the server'sVARCHAR(255) everywhere — use TEXT for unbounded strings. VARCHAR(n) only when the max length is a real business rule# Data Model: [domain name]
## Entity-Relationship Diagram
[Mermaid ER diagram]
## Entities
### [Entity Name]
- **Purpose:** [one sentence]
- **Cardinality:** [expected count]
- **Mutability:** [immutable / rarely updated / frequently updated]
#### Schema
[CREATE TABLE SQL with constraints, indexes, comments]
#### Access Patterns
[Table of query patterns with frequency and latency SLA]
#### Indexes
[Table of indexes with rationale]
## Relationships
| From | To | Type | On delete | Constraint |
|---|---|---|---|---|
## Business Rules
[Constraints enforced at database level]
## Privacy
| Column | Classification | Retention | Erasure strategy |
|---|---|---|---|
## Evolution Plan
[Known upcoming changes and migration strategy]
## Open Questions
[Decisions that need product/business input before finalising]
/data-engineer:write-query — queries operate on this data model. Define the model first, then write queries against it./data-engineer:event-tracking-plan — event data feeds into the model. Align event schemas with the data model.