Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact |
| 2 hours |
| Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
# Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
curl -I https://api.company.com/payments/health| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
# Step 1: Check pod status
kubectl get pods -n payments
# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100
# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments
# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10
# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
curl localhost:8080/metrics | grep db_pool
# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND duration > interval '5 seconds'
ORDER BY duration DESC;"
# Step 3: Kill long-running queries if needed
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(pid);"
# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
grep -i error | sort | uniq -c | sort -rn | head -20
# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments
# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
-d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
# Step 1: Check current request rate
kubectl top pods -n payments
# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20
# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
RATE_LIMIT_ENABLED=true \
RATE_LIMIT_RPS=1000 -n payments
# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1