AWS service troubleshooting patterns. Use for EC2, ECS, Lambda, CloudWatch, RDS issues.
Never dump all logs. Use aggregation queries first:
# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
# Top error messages
filter @message like /Exception/
| stats count(*) by @message
| sort count desc
| limit 10
# Latency percentiles
stats pct(@duration, 50) as p50, pct(@duration, 99) as p99 by bin(5m)
# Unique error types
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type
| Symptom | First Check | Typical Cause |
|---|---|---|
| Unreachable | describe_ec2_instance | Security group, stopped, status check failed |
| Performance | get_cloudwatch_metrics (CPUUtilization) | CPU exhaustion, network saturation |
| Disk full | get_cloudwatch_metrics (DiskSpaceUtilization) | Logs, temp files |
Key CloudWatch metrics for EC2:
| Symptom | First Check | Typical Cause |
|---|---|---|
| Timeout | CloudWatch logs | External call slow, cold start, insufficient memory |
| Permission denied | CloudWatch logs | IAM role missing permissions |
| Memory error | CloudWatch metrics | Memory allocation too low |
| Cold starts | CloudWatch logs + metrics | Provisioned concurrency needed |
Key CloudWatch metrics for Lambda:
CloudWatch Insights for Lambda:
# Cold start analysis
filter @type = "REPORT"
| stats avg(@initDuration) as avg_cold_start,
count(@initDuration) as cold_starts,
count(*) as total_invocations
by bin(5m)
# Timeout analysis
filter @message like /Task timed out/
| stats count(*) by bin(5m)
| Symptom | First Check | Typical Cause |
|---|---|---|
| Task failed | list_ecs_tasks | Container crash, resource limits, image pull |
| Service unhealthy | list_ecs_tasks | Health check failing, target group issues |
| Slow scaling | CloudWatch metrics | Insufficient capacity, service limits |
Investigation flow:
list_ecs_tasks - See task status and health| Symptom | First Check | Typical Cause |
|---|---|---|
| Connection refused | get_rds_instance_status | Security group, stopped, maintenance |
| Slow queries | CloudWatch metrics | CPU, IOPS, connections |
| Storage full | CloudWatch metrics | Data growth, logs, snapshots |
Key CloudWatch metrics for RDS:
AccessDeniedException
UnauthorizedAccess
→ Check IAM role/policy attached to the service
Throttling
Rate exceeded
TooManyRequestsException
→ Implement exponential backoff, request limit increase
ResourceNotFoundException
NoSuchEntity
→ Verify resource name, region, account