Guide developers through choosing the right database BEFORE writing any code. Covers SQL (PostgreSQL, MySQL), NoSQL document (MongoDB), graph (Neo4j), key-value (Redis), wide-column (Cassandra), object-oriented (IRIS), time-series (InfluxDB), and polyglot persistence. ALWAYS use this skill when: user asks "which database should I use", starts building without mentioning a DB choice, compares databases (MongoDB vs PostgreSQL), designs system architecture, mentions CAP theorem / ACID vs BASE / sharding / replication, says "I need a database for...", builds e-commerce / IoT / social / analytics / real-time systems, mentions "store data" / "persist data" / "data layer" / "polyglot persistence". Even if the user already picked a database, STOP and validate that choice before helping implement it. The wrong database is the most expensive technical debt.
STOP. Before writing any database code, schema, or query — justify the choice first. Perfect code on the wrong foundation is technical debt from day one.
Ask the user if not provided:
Choose when: stable schema, ACID mandatory, complex JOINs, referential integrity critical. CAP: CA. Signals: "users table", "orders", "bank transactions", "billing", "ERP". AVOID: schema changes often, hierarchical data, need 100s of nodes, millions writes/sec.
Choose when: variable attributes per record, nested/JSON data, schema evolves fast, need sharding. CAP: CP. Sharding: by range, hash, or geographic. Signals: "product catalog", "user profiles", "CMS", "different fields per record". AVOID: highly relational with many JOINs, simple key-value pairs.
Choose when: sub-millisecond latency, get/set by key, caching, sessions, counters, leaderboards, queues. CAP: CP (cluster). Structures: Strings, Lists, Sets, Hashes, Sorted Sets, Streams. Signals: "cache", "sessions", "real-time", "leaderboard", "rate limiter", "shopping cart". AVOID: data exceeds RAM, complex queries needed, need persistent-first storage.
Choose when: relationships ARE the value, path-finding, recommendations, fraud detection, social networks. CAP: CA. Clustering: Core servers (Raft, M=2F+1) + Read Replicas. Causal consistency. Signals: "relationships", "recommendations", "fraud", "knowledge graph", "who knows whom". AVOID: massive distributed writes, tabular data without relationships, bulk updates on millions of nodes.
Choose when: write-heavy (IoT, logs, metrics), multi-datacenter, known query patterns, linear scaling, zero downtime. CAP: AP. Consistency tuning: R+W>N = strong; W=1,R=N = fast writes; Quorum: W>N/2. Signals: "IoT", "sensor data", "event log", "metrics", "multi-region", "always available". AVOID: ad-hoc queries, JOINs, unpredictable read patterns, frequently changing data model.
Choose when: deep OOP hierarchies, no ORM needed, complex domains (healthcare, finance, manufacturing). Signals: "complex objects", "inheritance", "medical records", "financial instruments", "multi-model". AVOID: simple structures, need max horizontal scaling, open-source requirement.
Choose when: data indexed by time, monitoring/metrics/IoT, downsampling, retention policies, scientific data. Indexing: R-trees, KD-trees, LSM-trees. Compression: 90%+ reduction. Signals: "metrics", "monitoring", "Grafana", "dashboards", "sensor data over time". AVOID: data not time-indexed, need relationships, general-purpose CRUD.
Most real systems need multiple databases. For each component independently ask: what's the data shape, access pattern, consistency need, and performance requirement?
| Component | DB | Why |
|---|---|---|
| Users & Auth | PostgreSQL | ACID, strong consistency, stable schema |
| Product Catalog | MongoDB | Variable attributes, flexible schema |
| Shopping Cart | Redis | Sub-ms latency, temporary, high concurrency |
| Recommendations | Neo4j | Graph traversal: "users who bought X also bought Y" |
| Order History | Cassandra | High writes, time-partitioned, multi-region |
| System Metrics | InfluxDB | Time-indexed, downsampling, dashboards |
ALWAYS produce this for every database choice:
COMPONENT: [Name]
DATABASE: [Technology] ([Type])
DATA: shape, volume, schema stability
ACCESS: R/W ratio, query type, latency need
CONSISTENCY: ACID/eventual/tunable — CAP priority
SCALE: vertical/horizontal, distribution
REJECTED ALTERNATIVES: [why each was worse]
TRADE-OFFS ACCEPTED: [what you sacrifice and why]