Investigate platform incidents, perform RCA, create incident documentation, and follow alert runbooks in the Kagenti platform
This skill helps you investigate incidents, perform root cause analysis (RCA), and create comprehensive incident documentation.
```bash
# List all available runbooks
ls -1 docs/runbooks/alerts/*.md

# For a specific alert, check its runbook_url annotation
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    print('Runbook URL:', rule['annotations'].get('runbook_url', 'N/A'))
"
```
Runbooks are located at: docs/runbooks/alerts/<alert-uid>.md
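Because runbooks follow this naming convention, mapping an alert UID to its runbook path can be scripted. A minimal sketch (the UID value is illustrative):

```shell
# Resolve an alert UID to its runbook path; warn if the file is missing
ALERT_UID="prometheus-down"   # illustrative UID
RUNBOOK="docs/runbooks/alerts/${ALERT_UID}.md"
if [ -f "$RUNBOOK" ]; then
  echo "Runbook: $RUNBOOK"
else
  echo "No runbook found for ${ALERT_UID}"
fi
```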
Standard runbook sections:
Example: Follow Prometheus Down runbook:
```bash
# From docs/runbooks/alerts/prometheus-down.md

# 1. Check pod status
kubectl get pods -n observability -l app=prometheus

# 2. Check pod logs
kubectl logs -n observability deployment/prometheus --tail=100

# 3. Check events
kubectl get events -n observability --field-selector involvedObject.name=prometheus --sort-by='.lastTimestamp'

# 4. Test Prometheus endpoint
kubectl exec -n observability deployment/grafana -- \
  curl -s http://prometheus.observability.svc:9090/-/ready
```
Pod Status & Events:
```bash
# Get pod status in namespace
kubectl get pods -n <namespace>

# Detailed pod description
kubectl describe pod <pod-name> -n <namespace>

# Recent events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# All failing pods across the platform
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull|Pending"
```
Logs:
```bash
# Current container logs
kubectl logs -n <namespace> <pod-name> --tail=100

# Previous container (if crashed)
kubectl logs -n <namespace> <pod-name> --previous

# Specific container in pod
kubectl logs -n <namespace> <pod-name> -c <container-name>

# All containers in pod
kubectl logs -n <namespace> <pod-name> --all-containers=true

# Query Loki for errors (last 5 minutes)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name="<namespace>"} |= "error"' \
  --data-urlencode 'limit=100' \
  --data-urlencode "start=$(date -u -v-5M +%s)000000000" \
  --data-urlencode "end=$(date -u +%s)000000000" | python3 -m json.tool
```
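Note that `date -v-5M` is BSD/macOS syntax; GNU date uses `date -d '5 minutes ago'` instead. A portable way to build the nanosecond timestamps Loki expects is to compute them with python3:

```shell
# Build Loki's start/end window (Unix nanoseconds) portably;
# avoids the BSD (-v-5M) vs GNU (-d '5 minutes ago') date split
START=$(python3 -c 'import time; print(int((time.time() - 300) * 1e9))')
END=$(python3 -c 'import time; print(int(time.time() * 1e9))')
echo "start=$START end=$END"
```

Substitute `--data-urlencode "start=$START"` and `--data-urlencode "end=$END"` into the query above.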
Metrics:
```bash
# Check if service is up
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up{job="<job-name>"}' | python3 -m json.tool

# Check replica availability
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{deployment="<name>"}' \
  | python3 -m json.tool

# Check pod restarts
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_container_status_restarts_total{pod=~"<pod-pattern>"}' \
  | python3 -m json.tool
```
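Each query above returns JSON with samples under `data.result`; a short python3 filter can pull out just the labels and values instead of pretty-printing the whole response. A sketch against a canned response (the `RESP` string stands in for real curl output):

```shell
# Extract metric labels and sample values from a Prometheus
# instant-query response; RESP stands in for real curl output
RESP='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"job":"prometheus"},"value":[1700000000,"1"]}]}}'
OUT=$(echo "$RESP" | python3 -c '
import sys, json
resp = json.load(sys.stdin)
for r in resp["data"]["result"]:
    print(r["metric"].get("job", "?"), r["value"][1])
')
echo "$OUT"   # prints: prometheus 1
```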
ArgoCD Application Status:
```bash
# Check application health
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web

# Check sync status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -E "Degraded|OutOfSync"

# View recent sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web
```
Common Root Causes:
**Configuration Error**:
```bash
git log --oneline -10
argocd app diff <app-name> --port-forward ...
kustomize build components/...
```

**Image Issues**:
```bash
docker exec kagenti-demo-control-plane crictl images | grep <image>
kubectl describe pod <pod-name> | grep -A10 "Events"
```

**Resource Constraints**:
```bash
kubectl top pods -n <namespace>
kubectl top nodes
kubectl get events -A | grep OOM
```

**Dependency Failure**:
```bash
kubectl get endpoints -n <namespace>
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- curl http://service-name
```

**mTLS/Network Issues**:
```bash
kubectl get pods -n <namespace>   # should show 2/2 with sidecar injection
kubectl get peerauthentication -A
kubectl logs -n <namespace> <pod-name> -c istio-proxy
```

**Certificate Issues**:
```bash
kubectl get certificate -A
kubectl logs -n cert-manager deployment/cert-manager
```

Create an entry in TODO_INCIDENTS.md:
## Incident #X: [Alert Name] - [Brief Description]
**Status**: 🔴 Active / 🟡 Investigating / 🟢 Resolved
**Detected**: 2025-11-17 08:31:40 UTC
**Severity**: Critical / Warning / Info
**Components Affected**:
- Component 1
- Component 2
### Summary
Brief description of what happened.
### Investigation
**Timeline**:
- 08:31 - Alert fired
- 08:35 - Checked pod status, found CrashLoopBackOff
- 08:40 - Reviewed logs, identified error message
- 08:45 - Identified root cause
**Evidence Collected**:
1. **Pod Status**:
   ```
   NAME            READY   STATUS             RESTARTS
   component-xxx   0/2     CrashLoopBackOff   5
   ```
2. **Error Logs**:
   ```
   ERROR: Failed to connect to database: connection refused
   ```
3. **Events**:
   ```
   Back-off restarting failed container
   ```
### Root Cause Analysis
**Root Cause**: [Specific technical reason]
**Why it happened**:
- Contributing factor 1
- Contributing factor 2
**Why alert fired**:
- PromQL query: `query_here`
- Query returned: `value`
- Threshold: `> threshold`
### Resolution
**Fix Applied**:
```bash
# Commands to fix the issue
kubectl apply -f ...
```

**Verification**:
```bash
# Commands to verify fix
kubectl get pods -n <namespace>
# Output showing healthy state
```

**Time to Resolution**: XX minutes
### 6. Validate Fix
**After applying fix**:
```bash
# 1. Verify pods are healthy
kubectl get pods -n <namespace>
# 2. Check alert stopped firing
kubectl exec -n observability deployment/grafana -- \
curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \
-u admin:admin123 | python3 -c "
import sys, json
alerts = json.load(sys.stdin)
firing = [a for a in alerts if a.get('status', {}).get('state') == 'active']
for a in firing:
    print(f\"{a['labels']['alertname']}: {a['labels']['severity']}\")
"
# 3. Run integration tests
pytest tests/integration/test_<component>.py -v
# 4. Check platform status
./scripts/platform-status.sh
# 5. Verify in Grafana UI
open https://grafana.localtest.me:9443/alerting/list
```
Pod Not Starting:
```bash
# Investigation flow
kubectl get pods -n <namespace>                  # Get pod status
kubectl describe pod <pod-name> -n <namespace>   # Check events
kubectl logs <pod-name> -n <namespace>           # Check logs (if available)

# Common causes:
# - ImagePullBackOff: Image not found or not loaded
# - CrashLoopBackOff: Application exits on startup
# - Pending: Resource constraints or scheduling issues
# - Init:Error: Init container failed
```
Service Unreachable:
```bash
# Investigation flow
kubectl get pods -n <namespace> -l app=<service>   # Check pods
kubectl get svc -n <namespace> <service-name>      # Check service
kubectl get endpoints -n <namespace> <service-name> # Check endpoints
kubectl get httproute -n <namespace>               # Check routes (if using Gateway API)

# Test connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
  -- curl -v http://<service-name>.<namespace>.svc:PORT
```
High Resource Usage:
```bash
# Investigation flow
kubectl top pods -n <namespace> --sort-by=memory   # Memory usage
kubectl top pods -n <namespace> --sort-by=cpu      # CPU usage

# Check limits
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

# Check for OOM kills
kubectl get events -n <namespace> | grep OOM
```
Frequent Pod Restarts:
```bash
# Investigation flow
kubectl get pods -n <namespace>                     # Check RESTARTS column
kubectl describe pod <pod-name> -n <namespace>      # Check restart reason
kubectl logs <pod-name> -n <namespace> --previous   # Logs before crash

# Check restart metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=increase(kube_pod_container_status_restarts_total{pod="<pod-name>"}[1h])'
```
ArgoCD Sync Issues:
```bash
# Investigation flow
argocd app get <app-name> --port-forward ...    # Get status
argocd app diff <app-name> --port-forward ...   # See differences

# Check last sync
argocd app history <app-name> --port-forward ...

# Force sync if needed
argocd app sync <app-name> --force --prune --port-forward ...
```
Use the check-alerts skill to find which alerts are firing:
```
# This automatically invokes the check-alerts skill
"What alerts are currently firing?"
```

Use the check-logs skill to query specific error patterns:
```
# This automatically invokes the check-logs skill
"Show me error logs from keycloak namespace in the last 10 minutes"
```

Use the check-metrics skill to verify service health:
```
# This automatically invokes the check-metrics skill
"What's the CPU usage of pods in observability namespace?"
```
Critical (P0):
High (P1):
Medium (P2):
Low (P3):