Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures.
ALWAYS start by discovering clusters via the gateway. Do NOT use kubectl directly — this sandbox has no direct k8s API access. All k8s queries go through the k8s-gateway.
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py
python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py --cluster-id <CLUSTER_ID>
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id <CLUSTER_ID>
NEVER run kubectl directly. NEVER run scripts without --cluster-id. If list_clusters.py returns no clusters, tell the user they need to install the k8s-agent on their cluster first.
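The two rules above (go through the gateway scripts, always pass --cluster-id) can be enforced with a small wrapper. This is a minimal sketch, not part of the skill: `build_command` is a hypothetical helper name, and treating list_clusters.py as the only script exempt from --cluster-id is an assumption based on its role in discovery.

```python
from pathlib import PurePosixPath

# Script directory as given in this document.
SCRIPTS_DIR = PurePosixPath(".claude/skills/infrastructure-kubernetes/scripts")


def build_command(script, *args, cluster_id=None):
    """Build the argv list for a gateway script (hypothetical helper).

    list_clusters.py is assumed to be the only script allowed without
    --cluster-id, because it is how cluster IDs are discovered.
    """
    if script == "kubectl":
        raise ValueError("kubectl is not available; use the gateway scripts")
    argv = ["python", str(SCRIPTS_DIR / script), *args]
    if script != "list_clusters.py":
        if not cluster_id:
            raise ValueError(f"{script} requires --cluster-id")
        argv += ["--cluster-id", cluster_id]
    return argv


if __name__ == "__main__":
    # "abc123" is the placeholder cluster ID used in this document's examples.
    print(" ".join(build_command("list_pods.py", "-n", "production", cluster_id="abc123")))
```

Returning an argv list rather than a shell string keeps the result safe to pass to `subprocess.run` without quoting concerns.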
Gateway-capable scripts: list_pods, get_events, get_logs, describe_pod, describe_deployment, list_namespaces. Direct-only scripts (not available in SaaS): describe_node, get_resources.
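The split above can be checked before choosing a script. The following is a sketch only: `is_available` is a hypothetical name, and the assumption that gateway-capable scripts also work with direct access is inferred from their optional --cluster-id flag, not stated by the skill.

```python
# Script groups exactly as listed in this document.
GATEWAY_SCRIPTS = frozenset({
    "list_pods", "get_events", "get_logs",
    "describe_pod", "describe_deployment", "list_namespaces",
})
DIRECT_ONLY_SCRIPTS = frozenset({"describe_node", "get_resources"})


def is_available(script, saas=True):
    """Return True if `script` can be used in the current mode.

    saas=True models the gateway-only sandbox; saas=False models a setup
    with direct cluster access (an assumption for illustration).
    Scripts that appear in neither list raise KeyError.
    """
    name = script.removesuffix(".py")  # requires Python 3.9+
    if name in GATEWAY_SCRIPTS:
        return True
    if name in DIRECT_ONLY_SCRIPTS:
        return not saas
    raise KeyError(f"unknown script: {script}")
```

For example, `is_available("describe_node")` is False in the gateway-only sandbox, so node-level checks have to be inferred from pod events instead.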
ALWAYS check pod events BEFORE logs. Events explain 80% of issues faster than logs do.
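A minimal sketch of the events-first rule, using the script order this document gives for inspecting a single pod (list_pods, get_events, describe_pod, then get_logs only if events don't explain). The function name `investigation_steps` is hypothetical.

```python
def investigation_steps(events_explain_issue):
    """Return the script order for investigating one pod.

    Pod status, events, and pod state come first; logs are fetched
    only when the events do not already explain the failure.
    """
    steps = ["list_pods.py", "get_events.py", "describe_pod.py"]
    if not events_explain_issue:
        steps.append("get_logs.py")
    return steps
```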
All scripts are in .claude/skills/infrastructure-kubernetes/scripts/
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py --json
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>] [--cluster-id <id>]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id abc123
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace> [--cluster-id <id>]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n production --cluster-id abc123
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME] [--cluster-id <id>]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment
python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace> [--cluster-id <id>]
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace> [--cluster-id <id>]
# Example:
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py payment -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py [--cluster-id <id>]
python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py <node-name>
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py ip-10-0-1-42.ec2.internal
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all --json
Pod not starting:

1. list_pods.py - Check pod status
2. get_events.py - Look for scheduling/pull/crash events
3. describe_pod.py - Check conditions and container states
4. get_logs.py - Only if events don't explain

Pod crashing or restarting:

1. get_events.py - Check for OOMKilled or error events
2. get_resources.py - Compare usage vs limits
3. get_logs.py - Check for errors before crash
4. describe_pod.py - Check restart count and state

Deployment not progressing:

1. describe_deployment.py - Check replica counts and rollout history
2. list_pods.py - Find stuck pods
3. get_events.py - Check events on stuck pods

Node or scheduling problems:

1. describe_node.py --all - Check all nodes for conditions and resource usage
2. describe_node.py <node> - Deep dive into specific node
3. list_pods.py - Check if pods are Pending/FailedScheduling
4. get_events.py - Look for FailedScheduling with resource reasons

| Event Reason | Meaning | Action |
|---|---|---|
| OOMKilled | Container exceeded memory limit | Increase limits or fix memory leak |
| ImagePullBackOff | Can't pull image | Check image name, registry auth |
| CrashLoopBackOff | Container keeps crashing | Check logs for startup errors |
| FailedScheduling | No node can run pod | Check node resources, taints |
| Unhealthy | Liveness, readiness, or startup probe failed | Check probe config, app health |
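The table above can be used as a lookup when scanning the reasons returned for a pod. This is a sketch under assumptions: `triage` is a hypothetical helper, and the input is assumed to be a plain list of reason strings already extracted from the get_events.py or describe_pod.py output (whose exact format is not specified here).

```python
# Reason -> recommended action, copied from the table above.
ACTIONS = {
    "OOMKilled": "Increase limits or fix memory leak",
    "ImagePullBackOff": "Check image name, registry auth",
    "CrashLoopBackOff": "Check logs for startup errors",
    "FailedScheduling": "Check node resources, taints",
    "Unhealthy": "Check probe config, app health",
}


def triage(reasons):
    """Return (reason, action) pairs for known reasons.

    Duplicates are collapsed and first-seen order is preserved;
    reasons not in the table are skipped.
    """
    unique = dict.fromkeys(reasons)  # dicts keep insertion order
    return [(reason, ACTIONS[reason]) for reason in unique if reason in ACTIONS]
```

Unknown reasons are skipped rather than raised on, since event streams routinely include informational reasons that need no action.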
When reporting findings, use this structure:
## Kubernetes Analysis
**Pod**: <name>
**Namespace**: <namespace>
**Status**: <phase> (Restarts: N)
### Events
- [timestamp] <reason>: <message>
### Issues Found
1. [Issue description with evidence]
### Root Cause Hypothesis
[Based on events and logs]
### Recommended Action
[Specific remediation step]
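A report in this structure can be assembled programmatically. The sketch below is illustrative only: `render_report` is a hypothetical name, and the input shapes (events as `(timestamp, reason, message)` tuples, issues as strings) are assumptions rather than anything the scripts are known to return.

```python
def render_report(pod, namespace, phase, restarts, events, issues, hypothesis, action):
    """Render findings in the report structure shown above.

    events: list of (timestamp, reason, message) tuples (assumed shape)
    issues: list of strings, numbered in the order given
    """
    lines = [
        "## Kubernetes Analysis",
        f"**Pod**: {pod}",
        f"**Namespace**: {namespace}",
        f"**Status**: {phase} (Restarts: {restarts})",
        "### Events",
    ]
    lines += [f"- [{ts}] {reason}: {message}" for ts, reason, message in events]
    lines.append("### Issues Found")
    lines += [f"{i}. {issue}" for i, issue in enumerate(issues, start=1)]
    lines += [
        "### Root Cause Hypothesis",
        hypothesis,
        "### Recommended Action",
        action,
    ]
    return "\n".join(lines)
```

Keeping the renderer free of any cluster access means the same function works whether the findings came from gateway scripts or were typed in by hand.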