Skill file

Aws Troubleshoot

Name: Aws Troubleshoot
Author: incidentfox

AWS service troubleshooting patterns. Use for EC2, ECS, Lambda, CloudWatch, RDS issues.

incidentfox, 559 stars, 2026-01-23

Occupation: Network and Computer Systems Administrators
Category: Cloud

Skill content

AWS Troubleshooting Expertise

Investigation Methodology

1. Identify the AWS resource/service involved
2. Check resource status using describe functions
3. Review CloudWatch logs for errors
4. Check CloudWatch metrics for anomalies
5. Analyze configuration for misconfigurations
6. Synthesize the findings and recommend a fix
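
The ordering above can be sketched as a small pipeline that runs each check in sequence and labels what it finds. This is only an illustration: `investigate`, `synthesize`, and the `fake_*` checks are hypothetical names, and the findings they return are made up to show the shape of the output, not real tool results.

```python
from typing import Callable, List, Tuple

# A check takes a resource identifier and returns zero or more findings.
Check = Callable[[str], List[str]]


def investigate(resource: str, steps: List[Tuple[str, Check]]) -> List[str]:
    """Run each check in the given order and label findings with the step name."""
    findings: List[str] = []
    for step_name, check in steps:
        for finding in check(resource):
            findings.append(f"[{step_name}] {finding}")
    return findings


def synthesize(resource: str, findings: List[str]) -> str:
    """Turn the collected findings into a short report."""
    if not findings:
        return f"{resource}: no anomalies found"
    return f"{resource}: {len(findings)} finding(s)\n" + "\n".join(findings)


# Hypothetical stand-ins for the status, log, metric, and config checks.
def fake_status(resource: str) -> List[str]:
    return ["instance state is 'stopped'"]


def fake_logs(resource: str) -> List[str]:
    return []


def fake_metrics(resource: str) -> List[str]:
    return ["CPUUtilization above 95% before the stop"]


def fake_config(resource: str) -> List[str]:
    return []


# Status first, then logs, metrics, and configuration, as in the list above.
STEPS = [
    ("status", fake_status),
    ("logs", fake_logs),
    ("metrics", fake_metrics),
    ("config", fake_config),
]

if __name__ == "__main__":
    print(synthesize("i-0123456789abcdef0", investigate("i-0123456789abcdef0", STEPS)))
```

Keeping the steps as an ordered list makes it hard to skip the cheap status check and jump straight to log dumps.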

CloudWatch Logs Strategy

Partition First (CRITICAL)

Never dump all logs. Run aggregation queries (CloudWatch Logs Insights syntax) first to narrow down the time window and the error type:

# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)

# Top error messages (alias the count so it can be sorted on)
filter @message like /Exception/
| stats count(*) as error_count by @message
| sort error_count desc
| limit 10

# Latency percentiles (@duration is a discovered field in Lambda log groups)
stats pct(@duration, 50) as p50, pct(@duration, 99) as p99 by bin(5m)

# Unique error types
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type
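
Queries like these can also be run from a script. The sketch below assumes a CloudWatch Logs client such as the one returned by `boto3.client("logs")`, and uses its `start_query` and `get_query_results` calls. The helper name `run_insights_query`, the polling defaults, and the log group in the usage comment are illustrative assumptions, not part of the skill.

```python
import time
from typing import Any, Dict, List

# Statuses after which a Logs Insights query will not produce results.
TERMINAL_FAILURES = {"Failed", "Cancelled", "Timeout"}

ERROR_RATE_QUERY = """filter @message like /ERROR/
| stats count(*) as errors by bin(5m)"""


def run_insights_query(client: Any, log_group: str, query: str,
                       start_epoch: int, end_epoch: int,
                       poll_seconds: float = 1.0,
                       max_polls: int = 60) -> List[Dict[str, str]]:
    """Start a Logs Insights query, poll until it completes, return rows as dicts."""
    started = client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,   # seconds since the epoch
        endTime=end_epoch,
        queryString=query,
    )
    query_id = started["queryId"]

    for _ in range(max_polls):
        response = client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   now = int(time.time())
#   rows = run_insights_query(boto3.client("logs"), "/aws/lambda/my-function",
#                             ERROR_RATE_QUERY, now - 3600, now)
```

Passing the client in as an argument keeps the helper testable with a stub and avoids binding it to one region or profile.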

Service-Specific Patterns

EC2 Issues

Symptom	First Check	Typical Cause
Unreachable	`describe_ec2_instance`	Security group, stopped, status check failed
Performance	`get_cloudwatch_metrics` (CPUUtilization)	CPU exhaustion, network saturation
Disk full	`get_cloudwatch_metrics` (DiskSpaceUtilization, requires an in-instance agent)	Logs, temp files

Lambda Issues

Symptom	First Check	Typical Cause
Timeout	CloudWatch logs	External call slow, cold start, insufficient memory
Permission denied	CloudWatch logs	IAM role missing permissions
Memory error	CloudWatch metrics	Memory allocation too low
Cold starts	CloudWatch logs + metrics	Provisioned concurrency needed

# Cold start analysis
filter @type = "REPORT"
| stats avg(@initDuration) as avg_cold_start,
        count(@initDuration) as cold_starts,
        count(*) as total_invocations
        by bin(5m)

# Timeout analysis
filter @message like /Task timed out/
| stats count(*) by bin(5m)
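
When REPORT lines have already been exported (for example from a log stream), the same cold-start numbers can be computed locally. The sketch below assumes the standard Lambda REPORT line layout, where `Init Duration` appears only on cold starts. `parse_report` and `cold_start_summary` are hypothetical helper names, and the memory ratio is an extra hint toward the "insufficient memory" cause rather than something the queries above compute.

```python
import re
from typing import Dict, Iterable, Optional

# Fields of a Lambda REPORT line; Init Duration is present only on cold starts.
REPORT_RE = re.compile(
    r"REPORT RequestId:\s*(?P<request_id>\S+)"
    r".*?Duration:\s*(?P<duration_ms>[\d.]+) ms"
    r".*?Memory Size:\s*(?P<memory_mb>\d+) MB"
    r".*?Max Memory Used:\s*(?P<max_used_mb>\d+) MB"
    r"(?:.*?Init Duration:\s*(?P<init_ms>[\d.]+) ms)?"
)


def parse_report(line: str) -> Optional[Dict[str, Optional[float]]]:
    """Return the numeric fields of a REPORT line, or None for other lines."""
    match = REPORT_RE.search(line)
    if match is None:
        return None
    init = match.group("init_ms")
    return {
        "duration_ms": float(match.group("duration_ms")),
        "memory_mb": float(match.group("memory_mb")),
        "max_used_mb": float(match.group("max_used_mb")),
        "init_ms": float(init) if init is not None else None,
    }


def cold_start_summary(lines: Iterable[str]) -> Dict[str, Optional[float]]:
    """Mirror the cold start query: count invocations and average init time."""
    reports = [r for r in map(parse_report, lines) if r is not None]
    cold = [r["init_ms"] for r in reports if r["init_ms"] is not None]
    return {
        "total_invocations": len(reports),
        "cold_starts": len(cold),
        "avg_cold_start_ms": sum(cold) / len(cold) if cold else None,
        # A ratio close to 1.0 suggests the memory allocation is too low.
        "peak_memory_ratio": max(
            (r["max_used_mb"] / r["memory_mb"] for r in reports), default=None
        ),
    }
```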

ECS/Fargate Issues

Symptom	First Check	Typical Cause
Task failed	`list_ecs_tasks`	Container crash, resource limits, image pull
Service unhealthy	`list_ecs_tasks`	Health check failing, target group issues
Slow scaling	CloudWatch metrics	Insufficient capacity, service limits

RDS Issues

Symptom	First Check	Typical Cause
Connection refused	`get_rds_instance_status`	Security group, stopped, maintenance
Slow queries	CloudWatch metrics	CPU, IOPS, connections
Storage full	CloudWatch metrics	Data growth, logs, snapshots

Common AWS Errors

Permission Errors

AccessDeniedException
UnauthorizedAccess

Throttling

Throttling
Rate exceeded
TooManyRequestsException

Resource Not Found

ResourceNotFoundException
NoSuchEntity
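
One way to act on these error strings is to bucket them first and retry only the throttling bucket. The sketch below is a minimal illustration under assumptions: `AwsError` is a hypothetical stand-in for whatever exception the SDK in use raises, and `classify_aws_error`, `call_with_backoff`, and the retry defaults are illustrative names and values, not part of the skill.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

PERMISSION_CODES = {"AccessDeniedException", "UnauthorizedAccess"}
THROTTLING_CODES = {"Throttling", "TooManyRequestsException"}
NOT_FOUND_CODES = {"ResourceNotFoundException", "NoSuchEntity"}


def classify_aws_error(code: str, message: str = "") -> str:
    """Map an error code (and optionally its message) to one of the buckets."""
    if code in PERMISSION_CODES:
        return "permission"
    # "Rate exceeded" shows up in the message text rather than as a code.
    if code in THROTTLING_CODES or "Rate exceeded" in message:
        return "throttling"
    if code in NOT_FOUND_CODES:
        return "not_found"
    return "other"


class AwsError(Exception):
    """Hypothetical stand-in for an SDK error carrying a code and a message."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn on throttling errors with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except AwsError as err:
            throttled = classify_aws_error(err.code, err.message) == "throttling"
            if not throttled or attempt == max_attempts - 1:
                raise  # permission and not-found errors will not fix themselves
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Permission and not-found errors are raised immediately because retrying cannot change the outcome; they point to an IAM policy or to a wrong name or region instead.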
