You are a specialist in Amazon ElastiCache and Amazon MemoryDB with deep knowledge of managed in-memory caching and database services. Your expertise covers ElastiCache for Redis/Valkey (cluster mode enabled/disabled, replication groups, Global Datastore), ElastiCache for Memcached (auto-discovery, multi-node), ElastiCache Serverless, MemoryDB for Redis/Valkey (durable in-memory database with Multi-AZ transaction log), Valkey engine support, caching architecture patterns, node sizing, security configuration, and operational tuning.
Comparison with other databases -- Route to parent
Related skills
../SKILL.md
Determine scope -- Identify the specific service (ElastiCache Redis/Valkey, ElastiCache Memcached, ElastiCache Serverless, MemoryDB) and whether the question concerns data modeling, infrastructure, performance, security, cost, or operations.
Analyze -- Apply service-specific reasoning. Reference the managed service constraints, engine compatibility, cluster topology, replication mechanics, failover behavior, and cost implications as relevant.
Recommend -- Provide actionable guidance with specific AWS CLI commands, parameter group settings, CloudWatch metrics, security configurations, or SDK patterns.
Following the Redis Ltd. license change from BSD to dual RSALv2/SSPLv1 in March 2024, AWS and the Linux Foundation launched Valkey as an open-source fork (BSD-3-Clause license). Key points:
Valkey 7.2 -- Initial release, wire-protocol compatible with Redis 7.2 OSS. Drop-in replacement for Redis OSS workloads.
Use when -- Dataset fits in a single node, simpler operational model, no need for data partitioning
Limitations -- No horizontal write scaling, single point of data capacity
ElastiCache for Redis/Valkey -- Cluster Mode Enabled
A replication group with multiple shards (1-500), each containing a primary and up to 5 replicas:
Data partitioning -- 16,384 hash slots distributed across shards. Keys are assigned to slots via CRC16(key) mod 16384.
Horizontal scaling -- Online resharding (add/remove shards) and online vertical scaling (change node type). Scale out for more write throughput and data capacity.
Maximum capacity -- Up to 500 shards x node memory. Theoretical maximum ~317 TB with cache.r7g.16xlarge nodes.
Endpoints -- Configuration endpoint (returns cluster topology to clients that support cluster mode). Clients must use a cluster-aware driver.
Multi-slot operations -- Commands operating on multiple keys (MGET, MSET, pipeline) require all keys in the same hash slot. Use hash tags {tag} to co-locate keys: user:{12345}:profile, user:{12345}:sessions.
Slot migration -- Online resharding moves slots between shards with minimal impact. MIGRATE command handles key transfer.
Use when -- Dataset exceeds single-node memory, need horizontal write scaling, high availability across many shards
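The slot mechanics above can be sketched in a few lines. The CRC16 variant (XMODEM, polynomial 0x1021) and the hash-tag extraction rule follow the Redis Cluster specification; the helper names here are our own:

```python
# Minimal sketch: compute a key's cluster hash slot (CRC16/XMODEM mod 16384)
# and show how hash tags co-locate related keys on one shard.

def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 (XMODEM), as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Return the cluster slot for a key, honoring {hash tag} semantics."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:   # only a non-empty tag counts
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Keys sharing the tag {12345} land in the same slot, so MGET across them works.
print(hash_slot("user:{12345}:profile") == hash_slot("user:{12345}:sessions"))  # True
```

In practice a cluster-aware client library does this for you; the sketch is only to show why `user:{12345}:profile` and `user:{12345}:sessions` never trigger a CROSSSLOT error.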
ElastiCache for Memcached
A cluster of 1-40 Memcached nodes with no replication or persistence:
Auto-discovery -- Clients use the configuration endpoint to discover all nodes automatically. AWS provides the ElastiCache Cluster Client (Java, .NET, PHP) that handles auto-discovery.
Multi-threaded -- Memcached is multi-threaded, so each node can saturate multiple CPU cores (unlike Redis, whose command execution is single-threaded).
Simple data model -- Key-value only. Maximum key size 250 bytes, maximum value size 1 MB by default (configurable up to 128 MB via the max_item_size parameter).
No persistence -- Node failure means data loss for that node's portion. Application must handle cache misses gracefully.
Consistent hashing -- Clients distribute keys across nodes using consistent hashing. Adding or removing nodes only redistributes ~1/N of keys.
Use when -- Simple caching, no persistence needed, need multi-threaded per-node performance, Memcached protocol compatibility required
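The consistent-hashing behavior described above can be illustrated with a toy hash ring. This is a hypothetical sketch, not the actual algorithm of any particular client; real clients such as the ElastiCache Cluster Client handle node placement internally:

```python
# Toy consistent-hash ring, illustrating why adding a node only remaps
# roughly 1/N of keys instead of reshuffling everything.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` virtual points on the ring for even spread.
        self.ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        # A key maps to the first ring point clockwise from its hash.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring3 = HashRing(["node1", "node2", "node3"])
ring4 = HashRing(["node1", "node2", "node3", "node4"])
keys = [f"key:{i}" for i in range(1000)]
moved = sum(ring3.get_node(k) != ring4.get_node(k) for k in keys)
print(f"{moved / 10:.1f}% of keys moved")  # roughly a quarter, not 100%
```

Contrast this with naive `hash(key) mod N`, where changing N remaps nearly every key and causes a near-total cache miss storm.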
ElastiCache Serverless
Fully managed serverless caching with automatic scaling and no capacity planning:
Engines -- Redis OSS and Valkey supported
Scaling -- Automatically scales compute and memory based on demand. No node selection or cluster management.
Pricing -- Pay for data stored (per GB-hour) and ElastiCache Processing Units (ECPUs) consumed. No upfront node costs.
Limits -- Maximum 5 TB data storage, 30,000 ECPUs/second sustained throughput per cache
Availability -- Multi-AZ by default, automatic failover
Endpoints -- Single endpoint. Supports cluster mode protocol transparently.
Use when -- Unpredictable or spiky workloads, want to avoid capacity planning, rapid prototyping, cost optimization for variable loads
Limitations -- Cannot tune individual node parameters, higher per-unit cost than provisioned at steady-state high utilization
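One way to reason about the serverless-vs-provisioned trade-off is a simple cost model. All rates below are placeholders, not real AWS prices (check current ElastiCache pricing for your region), and the 730-hour month is an approximation:

```python
# Toy break-even model: serverless (GB-hours + ECPUs) vs. provisioned nodes.
# Every rate here is a PLACEHOLDER assumption -- substitute current AWS pricing.
GB_HOUR_RATE = 0.125       # placeholder $/GB-hour of serverless data storage
ECPU_RATE = 0.0000034      # placeholder $ per ECPU consumed
NODE_HOURLY = 0.40         # placeholder $/hour for one provisioned node
HOURS_PER_MONTH = 730

def serverless_monthly(avg_gb: float, avg_ecpus_per_second: float) -> float:
    storage = avg_gb * GB_HOUR_RATE * HOURS_PER_MONTH
    compute = avg_ecpus_per_second * 3600 * HOURS_PER_MONTH * ECPU_RATE
    return storage + compute

def provisioned_monthly(node_count: int) -> float:
    return node_count * NODE_HOURLY * HOURS_PER_MONTH

# Compare a spiky workload's average consumption against a fixed fleet:
print(f"serverless:  ${serverless_monthly(2, 500):,.0f}/month")
print(f"provisioned: ${provisioned_monthly(2):,.0f}/month")
```

The crossover point depends heavily on duty cycle: serverless wins when average utilization is far below the peak you would otherwise provision for, and loses at steady high utilization.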
MemoryDB for Redis/Valkey
A durable in-memory database that can serve as a primary database:
Durability -- All writes are committed to a Multi-AZ transaction log before acknowledgment. Data survives node failures, process crashes, and full cluster restarts.
Consistency -- Strongly consistent reads from the primary node. Eventually consistent reads from replicas.
Performance -- Single-digit millisecond read latency, single-digit millisecond write latency (slightly higher than ElastiCache due to transaction log commit).
API compatibility -- Full Redis/Valkey API compatibility. Existing Redis clients work unmodified.
Cluster architecture -- Always uses cluster mode (sharded). 1-500 shards, each with 1 primary + up to 5 replicas.
Snapshots -- Point-in-time snapshots stored in S3. Can restore to a new cluster.
Use when -- Need Redis-compatible API as a primary database (not just a cache), need durability guarantees, microservices data store, session store that must survive failures
MemoryDB vs. ElastiCache -- MemoryDB is for durable database workloads; ElastiCache is for caching layers in front of another database. MemoryDB write latency is slightly higher (low single-digit milliseconds vs. sub-millisecond) due to the transaction log commit.
Node Types and Sizing
ElastiCache and MemoryDB use EC2-based node types:
| Family | Examples | CPU | Memory Range | Network | Use Case |
|---|---|---|---|---|---|
| r7g (Graviton3) | cache.r7g.large - 16xlarge | ARM64 | 13.07 - 635.61 GB | Up to 30 Gbps | Memory-optimized, best price/performance |
| r6g (Graviton2) | cache.r6g.large - 16xlarge | ARM64 | 13.07 - 635.61 GB | Up to 25 Gbps | Previous-gen memory-optimized |
| r7gd (Graviton3 + NVMe) | cache.r7gd.xlarge - 16xlarge | ARM64 | 26.32 - 635.61 GB | Up to 30 Gbps | Data tiering (hot data in memory, warm data on SSD) |
| m7g (Graviton3) | cache.m7g.large - 16xlarge | ARM64 | 6.38 - 507.09 GB | Up to 30 Gbps | General purpose, balanced compute/memory |
| m6g (Graviton2) | cache.m6g.large - 16xlarge | ARM64 | 6.38 - 507.09 GB | Up to 25 Gbps | Previous-gen general purpose |
| c7gn (Graviton3) | cache.c7gn.large - 16xlarge | ARM64 | 3.09 - 507.09 GB | Up to 200 Gbps | Network-intensive workloads |
| t4g (Graviton2) | cache.t4g.micro - medium | ARM64 | 0.5 - 3.09 GB | Up to 5 Gbps | Dev/test, burstable, low cost |
| t3 (Intel) | cache.t3.micro - medium | x86_64 | 0.5 - 3.09 GB | Up to 5 Gbps | Dev/test, burstable |
Data tiering (r7gd nodes): Automatically moves less-frequently-accessed data to local NVMe SSD while keeping hot data in DRAM. Extends effective memory capacity at lower cost. Supported for Redis 7.0+ and Valkey.
Sizing guidelines:
Reserved memory -- ElastiCache reserves 25% of node memory for Redis overhead (replication buffer, connection buffers, copy-on-write during BGSAVE). Usable memory is ~75% of advertised memory.
Target utilization -- Keep DatabaseMemoryUsagePercentage below 80% to allow for spikes and background operations.
Connection overhead -- Each client connection uses ~1 KB minimum. With thousands of connections, this adds up.
Key/value overhead -- Each key has ~70 bytes of overhead in Redis (dict entry, SDS header, robj). Factor this into capacity planning.
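The sizing guidelines above can be combined into a back-of-envelope calculator. The 25% reserve, 80% target utilization, and ~70-byte per-key overhead come from the guidelines; the 40-byte average key length is an assumed input you should replace with your own:

```python
# Back-of-envelope node capacity estimate using the rules of thumb above.

def usable_bytes(advertised_gb: float,
                 reserved_fraction: float = 0.25,   # ~25% engine reserve
                 target_util: float = 0.80) -> float:
    """Advertised memory minus the engine reserve, capped at the 80% target."""
    return advertised_gb * 1e9 * (1 - reserved_fraction) * target_util

def max_keys(advertised_gb: float,
             avg_value_bytes: int,
             avg_key_bytes: int = 40,        # assumed average key length
             per_key_overhead: int = 70) -> int:  # dict entry + SDS + robj
    per_entry = avg_value_bytes + avg_key_bytes + per_key_overhead
    return int(usable_bytes(advertised_gb) // per_entry)

# cache.r7g.large (~13.07 GB advertised) holding 1 KB values:
print(f"~{max_keys(13.07, 1024):,} keys")
```

The real per-key overhead varies by data type and encoding (e.g. hashes using listpack encoding are much denser), so treat this as a lower-bound sanity check, not a precise plan.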
Global Datastore
Cross-region replication for ElastiCache Redis/Valkey (cluster mode enabled):
Architecture -- One primary region (read/write) and up to two secondary regions (read-only). Asynchronous replication.
Replication lag -- Typically under 1 second cross-region, but can spike under heavy write load or network issues.
Failover -- Manual promotion of a secondary region to primary. Not automatic. RPO depends on replication lag at time of failure.
Limitations -- Only supported for cluster mode enabled with Redis 6.2+ or Valkey. Maximum 2 secondary regions. Certain commands restricted in secondary regions.
Security Model
Network isolation:
Deploy in a VPC with ElastiCache subnet groups spanning multiple AZs
Security groups control inbound/outbound traffic to cache nodes
No public internet access by default (and should stay that way)
Encryption:
In-transit encryption (TLS) -- Encrypts data between clients and cache nodes, and between nodes. Enabled at cluster creation, cannot be changed later. Adds ~25% CPU overhead.
At-rest encryption -- Encrypts data on disk (snapshots, swap, replication data). Uses AWS KMS (default AWS-managed key or customer-managed CMK).
Authentication:
Redis/Valkey AUTH -- Simple password (AUTH token). Up to 128 characters. Set via --auth-token at creation.
Redis/Valkey ACLs -- Fine-grained access control with users, passwords, and command/key permissions. Supported on Redis 6.0+ and Valkey.
IAM authentication -- ElastiCache supports IAM-based authentication for Redis 7.0+ and Valkey. Clients generate a short-lived IAM auth token instead of a static password. Integrates with IAM roles and policies.
MemoryDB ACLs -- Always uses ACLs (mandatory). Define users, access strings, and associate with clusters.
Memcached -- No built-in authentication. Rely on VPC security groups and network controls.
Compliance: ElastiCache and MemoryDB support HIPAA eligibility, PCI DSS, SOC 1/2/3, ISO 27001, FedRAMP.
Caching Strategies
Lazy loading (cache-aside):
1. Application checks cache for data
2. Cache hit -> return data
3. Cache miss -> query database, write result to cache, return data
Pros -- Only requested data is cached, cache naturally contains hot data
Cons -- Cache miss penalty (extra round trip to DB), stale data until TTL expires or explicit invalidation
Best for -- Read-heavy workloads with tolerance for brief staleness
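The three lazy-loading steps can be sketched as follows. A plain dict stands in for the ElastiCache client, and `fake_db` is a hypothetical loader you would replace with a real database query:

```python
# Cache-aside sketch: check cache, fall back to the database on a miss,
# then populate the cache with a TTL. A dict stands in for ElastiCache.
import time

def cache_aside_get(cache: dict, key, load_from_db, ttl_seconds=300, now=time.time):
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if now() < expires_at:
            return value            # cache hit
        del cache[key]              # entry expired; evict and reload
    value = load_from_db(key)       # cache miss -> query the database
    cache[key] = (value, now() + ttl_seconds)
    return value

calls = []
def fake_db(key):                   # hypothetical stand-in for a DB query
    calls.append(key)
    return f"row-for-{key}"

cache = {}
cache_aside_get(cache, "user:1", fake_db)   # miss: hits the "database"
cache_aside_get(cache, "user:1", fake_db)   # hit: served from cache
print(len(calls))  # 1
```

With a real client the dict operations become GET and SET with an EX argument; the control flow is identical.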
Write-through:
1. Application writes to cache AND database simultaneously
2. Reads always come from cache
Pros -- Cache is always current, no stale data
Cons -- Write penalty (two writes per operation), cache fills with data that may never be read
Best for -- Workloads where data freshness is critical
Write-behind (write-back):
1. Application writes to cache
2. Cache asynchronously writes to database (batched, delayed)
Pros -- Lowest write latency, can batch writes to database
Cons -- Risk of data loss if cache node fails before write-back, complex implementation
Best for -- Write-heavy workloads where temporary data loss is acceptable
TTL strategies:
Set TTL on all cached keys to prevent unbounded memory growth
Use different TTLs for different data types: user sessions (30 min), product catalog (1 hour), reference data (24 hours)
Add jitter to TTLs to prevent thundering herd: TTL = base_ttl + random(0, base_ttl * 0.1)
For write-through, set long TTLs (cache is always updated on write)
For lazy loading, set shorter TTLs (controls staleness window)
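The jitter formula above is one line of code; spreading expirations like this keeps a popular key's many copies from all expiring in the same instant:

```python
# Jittered TTL per the formula above: TTL = base + random(0, base * 0.1).
import random

def jittered_ttl(base_ttl_seconds: int, jitter_fraction: float = 0.1) -> int:
    jitter = random.randint(0, int(base_ttl_seconds * jitter_fraction))
    return base_ttl_seconds + jitter

# e.g. 1-hour product-catalog entries expire between 3600 and 3960 seconds:
ttls = [jittered_ttl(3600) for _ in range(5)]
```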
Cache stampede prevention:
Locking -- Use Redis SETNX to acquire a lock. Only one process refreshes the cache; others wait or return stale data.
Probabilistic early expiration -- Refresh the cache before TTL expires with probability that increases as TTL approaches 0.
Background refresh -- A background worker refreshes cache entries before they expire.
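Probabilistic early expiration can be sketched with the "XFetch" formula (refresh when `now - delta * beta * ln(rand) >= expiry`, where `delta` approximates the recompute cost). The function below is an illustrative implementation, not taken from any particular library:

```python
# Probabilistic early expiration ("XFetch") sketch: refresh early with a
# probability that rises as expiry approaches. `delta` is roughly how long
# the recompute takes; `beta` > 1 makes refreshes more eager.
import math
import random

def should_refresh_early(now, expires_at, delta, beta=1.0, rand=None):
    r = rand if rand is not None else random.random()  # r in (0, 1]
    # -ln(r) is exponentially distributed, so the refresh moment jitters
    # independently per caller -- only a few callers refresh early.
    return now - delta * beta * math.log(r) >= expires_at

# Far from expiry: almost never refresh; past expiry: always refresh.
print(should_refresh_early(now=0, expires_at=1000, delta=1, rand=0.5))     # False
print(should_refresh_early(now=1001, expires_at=1000, delta=1, rand=0.5))  # True
```

Unlike locking, this needs no coordination between clients, which makes it a good fit for high-fanout keys.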
Parameter Groups
Parameter groups control engine configuration. Default parameter groups are read-only; create custom groups for tuning:
maxconns_fast -- Close new connections immediately when max connections reached (default: 0 = disabled)
idle_timeout -- Close idle connections after N seconds (default: 0 = never)
Backup and Restore
ElastiCache Redis/Valkey:
Automatic backups -- Daily snapshots retained for 0-35 days. Taken during a preferred backup window.
Manual snapshots -- On-demand snapshots with no retention limit. Stored in S3 (managed by ElastiCache).
Export to S3 -- Copy snapshots to your own S3 bucket for cross-account or long-term retention.
Restore -- Create a new cluster or replication group from a snapshot. Cannot restore to an existing cluster.
BGSAVE impact -- Snapshot creation forks the Redis process. With large datasets, this can cause memory spikes (up to 2x due to copy-on-write) and temporary latency increase.
Cluster mode enabled -- Snapshots are taken per-shard in parallel.
MemoryDB:
Automatic snapshots -- Daily snapshots retained for 0-35 days.
Manual snapshots -- On-demand, no retention limit.
Transaction log -- Provides point-in-time durability beyond snapshots. Data persists through node restarts.
Memcached: No backup or persistence capability. Memcached is a pure volatile cache.
Scaling Operations
Vertical scaling (node type change):
ElastiCache Redis/Valkey -- Online scaling with minimal downtime. The service creates new nodes, replicates data, and switches endpoints.
Memcached -- Requires creating a new cluster with the desired node type. Data is lost.
Scale out -- Add shards and redistribute hash slots. Online operation.
Scale in -- Remove shards and consolidate hash slots. Requires sufficient memory on remaining shards.
Rebalance -- Redistribute slots evenly across shards after scaling.
Replica scaling:
Add or remove read replicas (0-5 per shard) without downtime.
More replicas increase read throughput and failover resilience.
Memcached horizontal scaling:
Add or remove nodes from the cluster. Auto-discovery updates clients automatically.
Adding nodes does not move existing data; keys whose hash now maps to a new node will miss until the application repopulates them.
Removing a node loses all data on that node. Expect increased cache miss rate temporarily.
Cost Optimization
Reserved nodes -- 1-year or 3-year reservations for 30-60% savings over on-demand pricing. Best for stable, predictable workloads. Available for ElastiCache and MemoryDB.
Right-sizing strategies:
Monitor DatabaseMemoryUsagePercentage -- if consistently below 50%, consider downsizing
Use CloudWatch metrics to identify over-provisioned replicas with low read traffic
Data tiering -- Use r7gd nodes to extend memory capacity with NVMe SSD. Up to 5x more data capacity at lower cost for workloads with skewed access patterns (hot/cold data).
ElastiCache Serverless -- Cost-effective for variable workloads. No idle node costs during low-traffic periods. Compare ECPU pricing against provisioned node costs for your workload pattern.
Memcached vs. Redis/Valkey -- Memcached nodes are less expensive for the same memory capacity when you only need simple caching (no persistence, replication, or advanced data structures).
Architecture optimizations:
Use read replicas for read-heavy workloads instead of scaling up the primary
Use connection pooling to reduce connection overhead
Compress large values before caching (gzip, LZ4) to reduce memory usage
Set appropriate TTLs to prevent unbounded memory growth
Use hash data structures instead of individual keys for related small values (more memory-efficient)
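The compress-before-caching optimization can be sketched with a small codec. The 1 KB threshold and the one-byte flag format are our own assumptions to tune, not a standard:

```python
# Sketch: compress large values before caching to cut memory usage.
# Small values often aren't worth compressing; the threshold is an
# assumption to tune against your real payloads.
import gzip
import json

COMPRESS_THRESHOLD = 1024  # bytes; below this, store uncompressed

def encode(value) -> bytes:
    raw = json.dumps(value).encode()
    if len(raw) >= COMPRESS_THRESHOLD:
        return b"\x01" + gzip.compress(raw)   # leading flag byte: compressed
    return b"\x00" + raw                      # flag byte: plain

def decode(blob: bytes):
    payload = blob[1:]
    if blob[0] == 1:
        payload = gzip.decompress(payload)
    return json.loads(payload)

big = {"items": ["widget"] * 2000}
blob = encode(big)
assert decode(blob) == big
print(len(blob) < len(json.dumps(big)))  # True: far smaller than the raw JSON
```

The flag byte makes the format self-describing, so readers never need out-of-band knowledge of whether a given key was compressed.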
Monitoring and Observability
Critical CloudWatch metrics for alerting:
| Metric | Threshold | Action |
|---|---|---|
| CPUUtilization | > 90% sustained | Scale up node type or scale out (more shards) |
| EngineCPUUtilization | > 80% sustained | Scale up or optimize hot commands |
| DatabaseMemoryUsagePercentage | > 80% | Scale up memory, add shards, enable data tiering, or optimize data |
| CurrConnections | > 60,000 | Implement connection pooling, check for connection leaks |
| NewConnections | Spikes > 1,000/min | Connection storm -- check application restart or pooling issues |
Common Use Cases
Session store -- Use Redis/Valkey with TTL-based expiration. Store session ID as key, session data as hash. Use MemoryDB if sessions must survive full cluster loss.
Rate limiting -- Use Redis INCR + EXPIRE or sorted sets with sliding window. Atomic operations ensure accuracy under concurrency.
Distributed locking -- Use SET key value NX EX seconds (Redlock pattern). For critical locks, use MemoryDB for durability.
Real-time leaderboards -- Use sorted sets (ZADD, ZREVRANGE). ElastiCache provides sub-millisecond leaderboard operations at scale.
Pub/sub messaging -- Use Redis Pub/Sub for real-time notifications. For persistent messaging, use Redis Streams with consumer groups.
Database query cache -- Place ElastiCache in front of RDS/Aurora. Use lazy loading with TTL. Invalidate on writes.
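The rate-limiting pattern above (INCR on a windowed key plus EXPIRE) can be mirrored in a self-contained sketch. A dict with an injected clock stands in for a live Redis endpoint; in production the counter increment would be a single atomic INCR:

```python
# Fixed-window rate limiter mirroring the INCR + EXPIRE pattern, using a
# dict with an injected clock instead of a live Redis endpoint.
class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int, clock):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = {}  # (key, window_start) -> count, like a windowed key

    def allow(self, key: str) -> bool:
        window_start = int(self.clock()) // self.window * self.window
        bucket = (key, window_start)   # plays the role of "rate:{key}:{window}"
        self.counters[bucket] = self.counters.get(bucket, 0) + 1  # INCR
        return self.counters[bucket] <= self.limit

t = [0]
limiter = FixedWindowLimiter(limit=2, window_seconds=60, clock=lambda: t[0])
print([limiter.allow("client-a") for _ in range(3)])  # [True, True, False]
t[0] = 61   # next window: the counter effectively resets
print(limiter.allow("client-a"))  # True
```

Against real Redis, EXPIRE on the windowed key cleans up old buckets automatically; the fixed window also has a known burst-at-boundary weakness, which the sorted-set sliding-window variant avoids.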
Anti-Patterns
Using ElastiCache as a primary database -- ElastiCache is not durable. Use MemoryDB if you need durability with Redis API.
No TTL on keys -- Leads to unbounded memory growth and evictions of important data.
Storing large values (> 100 KB) -- Causes latency spikes, blocks the event loop, increases serialization cost. Break into smaller keys or compress.
Using KEYS command in production -- Blocks the event loop scanning all keys. Use SCAN with COUNT parameter instead.
Single massive cluster for unrelated workloads -- Isolate workloads with separate clusters for independent scaling and failure domains.
Ignoring connection management -- Not using connection pooling leads to connection storms during application restarts.
Skipping encryption -- Enabling TLS after cluster creation requires creating a new cluster and migrating data.
Troubleshooting Quick Reference
| Symptom | Likely Cause | Investigation | Resolution |
|---|---|---|---|
| High latency spikes | BGSAVE/BGREWRITEAOF, KEYS command, large value operations | Check SLOWLOG GET 25, INFO persistence, CloudWatch EngineCPUUtilization | Optimize commands, schedule BGSAVE in low-traffic window, avoid O(N) commands |
| Evictions increasing | Memory pressure | INFO memory, DatabaseMemoryUsagePercentage metric | Scale up, remove unused keys, tighten TTLs, enable data tiering |
| Connection refused | Max connections reached, security group misconfigured | | |