Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.
Runbooks and diagnostic workflows for common Kubernetes incidents.
Use this skill when responding to outages, pod failures, node issues, network problems, or other emergencies.
| Priority | Rule | Impact | Tools |
|---|---|---|---|
| 1 | Check control plane first | CRITICAL | get_pods(namespace="kube-system") |
| 2 | Assess node health | CRITICAL | get_nodes |
| 3 | Gather events before changes | HIGH | get_events |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | rollback_deployment |
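Rule 4 relies on manual notes. As one possible aid, here is a minimal sketch of a timestamped timeline recorder. `IncidentTimeline` and its methods are hypothetical names, not part of this skill's tools, and the notes shown are made-up examples.

```python
from datetime import datetime, timezone


class IncidentTimeline:
    """Hypothetical helper: keeps timestamped notes in memory during an incident."""

    def __init__(self):
        self.entries = []

    def note(self, message):
        # Record the note with a UTC timestamp so entries from different
        # responders can be merged in order later.
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append((stamp, message))

    def render(self):
        # One line per entry, oldest first.
        return "\n".join(f"{stamp}  {message}" for stamp, message in self.entries)


timeline = IncidentTimeline()
timeline.note("kube-system pods checked: all Running")  # illustrative note
timeline.note("node worker-2 reported NotReady")        # illustrative note
print(timeline.render())
```

Keeping the notes as plain tuples makes it easy to paste the rendered timeline into a postmortem afterwards.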
| Incident | First Tool | Next Steps |
|---|---|---|
| Pod failure | get_pod_logs(previous=True) | describe_pod, get_events |
| Node down | describe_node | Check kubelet logs |
| Service unreachable | get_endpoints | get_network_policies |
| Control plane | get_pods(namespace="kube-system") | Check API server logs |
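The quick-reference table above can also be held as a lookup, which may help when the first step needs to be chosen programmatically. This is only a sketch: the dictionary keys and the `first_response` function are hypothetical, and the tool names are copied from the table as plain strings rather than called.

```python
# Hypothetical encoding of the table above. Keys are illustrative labels;
# tool names are stored as strings and are not invoked here.
FIRST_RESPONSE = {
    "pod_failure": ("get_pod_logs(previous=True)", ["describe_pod", "get_events"]),
    "node_down": ("describe_node", ["Check kubelet logs"]),
    "service_unreachable": ("get_endpoints", ["get_network_policies"]),
    "control_plane": ('get_pods(namespace="kube-system")', ["Check API server logs"]),
}


def first_response(incident):
    """Return (first_tool, next_steps) for a known incident label."""
    try:
        return FIRST_RESPONSE[incident]
    except KeyError:
        known = ", ".join(sorted(FIRST_RESPONSE))
        raise ValueError(
            f"unknown incident {incident!r}; expected one of: {known}"
        ) from None


first_tool, next_steps = first_response("service_unreachable")
print(first_tool, next_steps)
```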
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
| Indicator | Severity | Action |
|---|---|---|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |
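The severity table above could be applied mechanically along these lines. This is a sketch under stated assumptions: the `triage` function and its parameters are hypothetical, "multiple" is read as more than one node, "single" as exactly one pod, and any situation the table does not cover produces no finding.

```python
def triage(not_ready_nodes=0, kube_system_failing=0, crashloop_pods=0, high_latency=False):
    """Map observed indicators to (severity, action) pairs from the table.

    Inputs are counts and flags the responder has already gathered;
    this function does not query the cluster itself.
    """
    findings = []
    if not_ready_nodes > 1:  # assumption: "multiple" means more than one
        findings.append(("Critical", "Escalate immediately"))
    if kube_system_failing > 0:
        findings.append(("Critical", "Control plane issue"))
    if crashloop_pods == 1:  # assumption: "single" means exactly one
        findings.append(("Medium", "Debug pod"))
    if high_latency:
        findings.append(("Medium", "Check resources"))
    return findings


for severity, action in triage(not_ready_nodes=3, high_latency=True):
    print(f"{severity}: {action}")
```

Returning every matching row, rather than only the first, keeps lower-severity symptoms visible alongside the critical ones.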
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
Common Causes:
describe_pod(name, namespace)
get_secrets(namespace)
describe_pod(name, namespace)
get_nodes()
get_events(namespace)
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)
istio_analyze_tool(namespace)
istio_proxy_status_tool()
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")
delete_pod(name, namespace, grace_period=0, force=True)
rollback_deployment(name, namespace, revision=0)
rollback_helm_release(name, namespace, revision=1)
For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.
Check all clusters:
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
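If one cluster is unreachable, the loop above may stop before the remaining clusters are checked. A possible variant collects results per context and records failures instead of aborting. This is a sketch, not the skill's own implementation: `sweep_clusters` is a hypothetical name, the tools are passed in as callables so the example runs with stubs, and it assumes the tools accept a `context` keyword (as in the loop above) and raise an exception on failure.

```python
def sweep_clusters(contexts, get_nodes, get_pods, get_events):
    """Run the three control-plane checks per context, isolating failures."""
    report = {}
    for context in contexts:
        try:
            report[context] = {
                "nodes": get_nodes(context=context),
                "kube_system_pods": get_pods(namespace="kube-system", context=context),
                "kube_system_events": get_events(namespace="kube-system", context=context),
            }
        except Exception as exc:
            # Assumption: a failed tool call raises; record it and keep going.
            report[context] = {"error": str(exc)}
    return report


# Stub tools for demonstration only; the real tools would query each cluster.
def fake_get_nodes(context):
    if context == "staging":
        raise ConnectionError("cluster unreachable")
    return [f"{context}-node-1"]


def fake_get_pods(namespace, context):
    return []


def fake_get_events(namespace, context):
    return []


report = sweep_clusters(
    ["prod-1", "prod-2", "staging"], fake_get_nodes, fake_get_pods, fake_get_events
)
for context, result in report.items():
    print(context, result)
```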