System design interview prep and architecture guide for DoD/Coast Guard IL2/IL4 environments on AWS GovCloud. Covers designing scalable, secure, compliant systems using EKS, PostgreSQL (RDS), Terraform, FluxCD, Prometheus/Grafana, and AWS managed services. Use this skill whenever someone is preparing for a system design interview, architecting a backend system, discussing database design or PostgreSQL patterns, planning Kubernetes deployments, designing for high availability or fault tolerance, evaluating replication or sharding strategies, working on CI/CD pipelines, discussing observability, or making any architecture decision. Also trigger when someone mentions: data modeling, scaling, load balancing, caching, message queues, event-driven architecture, microservices, API design, or compliance requirements (IL2, IL4, FedRAMP, STIG, CUI, ATO). Even casual questions like "how would you design X" or "what database should I use" should trigger this.
For developers building and designing systems in DoD / Coast Guard IL2/IL4 environments on AWS GovCloud using EKS, PostgreSQL, Terraform, FluxCD, Prometheus/Grafana.
Based on principles from Designing Data-Intensive Applications (2nd ed., Kleppmann & Riccomini, 2026), adapted for our specific stack and compliance requirements.
Every system design interview answer should follow this structure. Practice thinking through each step — interviewers care more about your reasoning process than arriving at a "perfect" answer.
Don't jump into drawing boxes. Clarify what you're building first.
Functional requirements — What does the system do? What are the core user stories?
Nonfunctional requirements — Quantify these:
Compliance constraints (always mention in our context):
Back-of-envelope math — Show you can estimate:
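A back-of-envelope estimate of this kind can be sketched in a few lines. Every input below (50,000 daily users, 20 requests each, 10% writes, 2 KB per write, a 10x peak factor) is an illustrative assumption, not a figure from a real system, and the `estimate` function is a hypothetical helper.

```python
"""Back-of-envelope capacity estimate. All inputs are illustrative assumptions."""

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(daily_users: int, requests_per_user: int, write_fraction: float,
             bytes_per_write: int, peak_factor: float) -> dict:
    """Return average and peak request rates plus yearly storage growth."""
    daily_requests = daily_users * requests_per_user
    avg_rps = daily_requests / SECONDS_PER_DAY
    daily_writes = daily_requests * write_fraction
    yearly_bytes = daily_writes * bytes_per_write * 365
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,
        "writes_per_day": daily_writes,
        "storage_gb_per_year": yearly_bytes / 1e9,  # decimal gigabytes
    }


if __name__ == "__main__":
    result = estimate(daily_users=50_000, requests_per_user=20,
                      write_fraction=0.1, bytes_per_write=2_000, peak_factor=10)
    for name, value in result.items():
        print(f"{name}: {value:,.1f}")
```

With these assumed inputs the system averages about 12 requests per second, peaks near 116, and grows by roughly 73 GB per year. Numbers of that size are what justify (or rule out) later choices such as caching or sharding.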
Draw the core components and data flow. For our stack, a typical architecture looks like:
Client → ALB/NLB → Keycloak (OIDC/JWT) → EKS (application pods) → RDS PostgreSQL
                          ↓                        ↓                     ↓
                  CG ICAM (identity)        FluxCD (GitOps)        Read replicas
                                                   ↓
                                  Prometheus → Grafana (observability)
Key components to consider:
This is where you show depth. See references/postgresql-and-data-modeling.md for detailed patterns.
Start with the data model:
PostgreSQL is almost always the right starting point for our environment because:
When to reach beyond PostgreSQL:
| Need | Solution | Why Not Just Postgres? |
|---|---|---|
| Sub-millisecond reads, high cache hit rate | ElastiCache Redis | Postgres can't match in-memory speeds for hot data |
| Full-text search at scale with ranking/facets | Amazon OpenSearch Service (managed OpenSearch, a fork of Elasticsearch) | Postgres FTS works but doesn't scale for complex search UIs |
| Event streaming / CDC | Amazon MSK (Kafka) or SQS | Postgres LISTEN/NOTIFY doesn't scale for high-throughput streaming |
| Time-series metrics at massive scale | Amazon Timestream or InfluxDB | Postgres handles moderate time-series well, but struggles at extreme write rates |
| Large file/blob storage | S3 | Don't store large blobs in Postgres — store the S3 key instead |
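
The caching row above usually means the cache-aside pattern: read from the cache, fall back to PostgreSQL on a miss, then populate the cache with a TTL. The sketch below uses an in-process dictionary as a stand-in for ElastiCache Redis so that it runs without any infrastructure. `CacheAside`, `load_from_db`, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Check the cache first; on a miss or expiry, load from the database and store it."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db          # assumption: a function that queries PostgreSQL
        self._ttl = ttl_seconds
        self._clock = clock                # injectable so tests can control time
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            self.hits += 1
            return entry[1]
        # Miss or expired entry: go to the source of truth and refresh the cache.
        self.misses += 1
        value = self._load(key)
        self._store[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read reloads the fresh value."""
        self._store.pop(key, None)
```

The TTL bounds how stale a cached value can be, and `invalidate` covers the write path. With a real Redis client the dictionary operations would become network calls, so timeouts and fallback to the database would also need handling.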
Scaling PostgreSQL:
Scaling EKS:
Reliability patterns:
See references/replication-and-availability.md for replication strategies and failure handling.
Always cover this — it's table stakes in our environment. See references/security-and-compliance.md.
Network security:
Data protection:
Authentication & authorization:
Compliance posture:
Prometheus + Grafana stack:
Logging: CloudWatch Logs or Fluent Bit → OpenSearch for centralized log aggregation
Deployment via FluxCD:
| Strategy | Consistency | Write Throughput | Failure Tolerance | Our Context |
|---|---|---|---|---|
| Single-leader (RDS primary + replicas) | Strong from primary | Limited by primary | Automatic failover (Multi-AZ) | Default for most workloads |
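| Multi-leader (active-active across regions) | Eventual across regions; requires conflict resolution | Higher | Cross-region resilience | Only if multi-region is required. Aurora Global Database is single-writer (secondary regions are read-only), so it provides cross-region reads and failover, not multi-leader writes |
| Leaderless (Dynamo-style, e.g. Cassandra) | Tunable (quorum reads and writes) | High | No failover needed | Rarely needed; prefer PostgreSQL. DynamoDB, despite its name, replicates through a leader per partition |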
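
The "Tunable" consistency in the leaderless row refers to quorum settings: with n replicas, a write must be acknowledged by w of them and a read must query r of them. A minimal sketch of the standard overlap condition follows. The function names are hypothetical, and the check ignores refinements such as sloppy quorums.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return w + r > n


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Replicas that can be unavailable while both reads and writes still reach quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return n - max(w, r)


if __name__ == "__main__":
    for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3)]:
        print(n, w, r, quorum_overlaps(n, w, r), tolerated_failures(n, w, r))
```

A common interview answer is n = 3, w = 2, r = 2: reads overlap with the latest acknowledged write and one replica can be down. Lowering w or r trades that overlap for latency and availability.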
| Level | Prevents | Cost | Use When |
|---|---|---|---|
| Read committed (Postgres default) | Dirty reads/writes | Low | Most workloads |
| Repeatable read (snapshot isolation) | Non-repeatable reads and, in Postgres, phantom reads; still allows write skew | Medium | Reports running alongside OLTP |
| Serializable (SSI in Postgres) | All anomalies including write skew | Higher abort rate | Financial calculations, inventory, bookings |
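
Write skew is easiest to explain with a small simulation. The sketch below models the classic on-call example: two doctors each request leave, and the rule is that a request is granted only when at least two doctors are on call. It is a pure-Python illustration with hypothetical names, not PostgreSQL code. Serial execution stands in for the serializable outcome here, whereas Postgres SSI would instead abort one of the transactions with a serialization failure (SQLSTATE 40001) for the application to retry.

```python
from typing import Dict, Optional

Rows = Dict[str, bool]  # doctor name -> currently on call?


def request_leave(snapshot: Rows, doctor: str) -> Optional[Rows]:
    """Decide from a snapshot: grant leave only if at least two doctors are on call."""
    if sum(snapshot.values()) >= 2:
        return {doctor: False}
    return None


def run_under_snapshot_isolation(db: Rows) -> Rows:
    """Both transactions read the same starting snapshot, then commit."""
    snapshot_a, snapshot_b = dict(db), dict(db)
    write_a = request_leave(snapshot_a, "alice")
    write_b = request_leave(snapshot_b, "bob")
    # The writes touch different rows, so there is no write-write conflict
    # and snapshot isolation lets both commit.
    for write in (write_a, write_b):
        if write is not None:
            db.update(write)
    return db


def run_serially(db: Rows) -> Rows:
    """One transaction at a time: the second sees the first one's committed write."""
    for doctor in ("alice", "bob"):
        write = request_leave(dict(db), doctor)
        if write is not None:
            db.update(write)
    return db


if __name__ == "__main__":
    print(run_under_snapshot_isolation({"alice": True, "bob": True}))
    print(run_serially({"alice": True, "bob": True}))
```

Under the simulated snapshot isolation nobody is left on call, which violates the rule even though each transaction checked it. Run serially, the second request is refused.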
| Pattern | When to Use | AWS Service |
|---|---|---|
| Synchronous REST/gRPC | Client needs immediate response | ALB + EKS service |
| Async queue | Fire-and-forget, work distribution | SQS + worker pods |
| Event streaming | Event-driven architecture, CDC, fan-out | MSK (Kafka) or Kinesis |
| Pub/sub | Notifications, loose coupling | SNS → SQS fan-out |
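
The async patterns above come with a caveat worth raising in an interview: SQS standard queues deliver at least once, so a worker can see the same message twice. A minimal sketch of an idempotent consumer follows. `IdempotentConsumer` is a hypothetical class, and the in-memory set is a stand-in for durable state, which in practice might be a PostgreSQL table with a unique constraint written in the same transaction as the side effect.

```python
from typing import Callable, List, Set


class IdempotentConsumer:
    """Run the handler once per message ID, ignoring redelivered duplicates."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._processed: Set[str] = set()  # assumption: durable storage in a real system

    def handle(self, message_id: str, body: str) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._processed:
            return False
        self._handler(body)
        # Record the ID only after the handler succeeds, so a failed attempt
        # is retried when the queue redelivers the message.
        self._processed.add(message_id)
        return True


if __name__ == "__main__":
    seen: List[str] = []
    consumer = IdempotentConsumer(seen.append)
    for message_id, body in [("m1", "create order"), ("m1", "create order"), ("m2", "ship")]:
        consumer.handle(message_id, body)
    print(seen)
```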
Warn candidates about these:
Jumping to microservices: For a new system, start with a well-structured monolith on EKS. Split services only when team boundaries or scaling needs demand it.
Premature sharding: A single RDS PostgreSQL instance, scaled vertically and backed by read replicas, typically handles multi-terabyte datasets and thousands of transactions per second. Show the math before proposing sharding.
Ignoring the network: Distributed systems fail partially. Design for retries, timeouts, idempotency, and circuit breakers. TCP doesn't guarantee bounded delay.
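Retries with exponential backoff and jitter are the usual first answer here. The sketch below is one way to write them, assuming the wrapped operation is idempotent and enforces its own timeout. `call_with_retries` and its parameters are hypothetical names, and the `sleep` and `rng` hooks exist only so the behaviour can be tested deterministically.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation, retrying transient errors with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random delay up to the cap, so many clients
            # recovering at once do not retry in lockstep.
            sleep(rng() * cap)
    raise AssertionError("unreachable")
```

Retrying is only safe when the operation is idempotent, for example a request carrying an idempotency key that the server deduplicates. A circuit breaker would sit one layer above this and stop calling a dependency that keeps failing.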
Trusting wall clocks for ordering: Clocks drift between machines. For event ordering across services, use logical clocks or a centralized sequence (PostgreSQL sequences, or Kafka offsets within a single partition).
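A Lamport clock is the simplest logical clock to describe in an interview. The sketch below is a minimal version with hypothetical method names: each service increments a counter on local events and on sends, and on receipt jumps ahead of the timestamp carried by the message.

```python
class LamportClock:
    """Logical clock: timestamps respect causality without relying on wall-clock time."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance for a local event and return the new timestamp."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; attach the returned timestamp to the outgoing message."""
        return self.tick()

    def receive(self, remote_time: int) -> int:
        """Move past both the local clock and the sender's timestamp."""
        self.time = max(self.time, remote_time) + 1
        return self.time


if __name__ == "__main__":
    service_a, service_b = LamportClock(), LamportClock()
    stamp = service_a.send()          # A sends a message
    print(service_b.receive(stamp))   # B's timestamp is now later than A's send
```

If one event causally precedes another, its timestamp is smaller. The reverse does not hold, so equal or close timestamps say nothing about concurrent events; ties are typically broken with a node ID.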
No connection pooling: PostgreSQL forks a process per connection. 200 EKS pods with 5 connections each = 1000 connections. You need PgBouncer or RDS Proxy.
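The connection arithmetic can be made explicit. In the sketch below, the `max_connections` value of 500 and the 10 reserved connections are assumptions chosen for illustration; the real limit depends on the instance class and parameter group.

```python
def total_connections(pods: int, pool_size_per_pod: int) -> int:
    """Connections opened against the database if every pod fills its pool."""
    return pods * pool_size_per_pod


def max_pool_size_per_pod(pods: int, max_connections: int, reserved: int = 10) -> int:
    """Largest per-pod pool that fits under the limit without a pooler.

    reserved holds back connections for admin, monitoring, and migrations.
    """
    if pods < 1:
        raise ValueError("pods must be at least 1")
    return max(0, (max_connections - reserved) // pods)


if __name__ == "__main__":
    print(total_connections(pods=200, pool_size_per_pod=5))
    print(max_pool_size_per_pod(pods=200, max_connections=500))
```

With the assumed limit of 500, 200 pods could hold only 2 direct connections each, while the application wants 1000 in total. That gap is what a pooler such as PgBouncer or RDS Proxy closes by multiplexing many client connections over fewer database connections.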
Skipping compliance: In our environment, "we'll handle security later" is not an option. Network isolation, encryption, IRSA, and audit logging are part of the initial design.
Over-engineering for scale you don't have: Don't add Kafka, Redis, and a separate search engine on day one. PostgreSQL + EKS handles a remarkable amount of load. Add complexity only when you have evidence it's needed.
Read these for detailed guidance on specific topics:
references/postgresql-and-data-modeling.md: PostgreSQL internals, indexing strategies, JSONB, partitioning, connection pooling, migration patterns, Aurora vs. RDS, and when to use other databases
references/replication-and-availability.md: RDS Multi-AZ, read replicas, Aurora replication, EKS multi-AZ, failure handling, quorum concepts, consistency models, conflict resolution
references/security-and-compliance.md: IL2/IL4 requirements, network architecture, encryption, IAM/IRSA, STIG compliance, FedRAMP controls, data classification, audit logging
references/eks-and-infrastructure.md: EKS architecture, Terraform patterns, FluxCD GitOps, Prometheus/Grafana observability, scaling strategies, deployment patterns, cost optimization
references/distributed-systems-theory.md: Foundational concepts for interviews, including CAP theorem, consensus, linearizability, transactions, batch/stream processing, event-driven architecture, encoding formats