Name: Kubernetes Debug
Author: incidentfox

python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py --json

python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>] [--cluster-id <id>]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id abc123

python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace> [--cluster-id <id>]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n production --cluster-id abc123

python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME] [--cluster-id <id>]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment

python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace> [--cluster-id <id>]

python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace> [--cluster-id <id>]

# Example:
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py payment -n otel-demo

python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py [--cluster-id <id>]

python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>

python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py <node-name>
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py ip-10-0-1-42.ec2.internal
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all --json

| Event Reason | Meaning | Action |
|---|---|---|
| OOMKilled | Container exceeded its memory limit | Increase limits or fix the memory leak |
| ImagePullBackOff | Image cannot be pulled | Check image name and registry auth |
| CrashLoopBackOff | Container keeps crashing | Check logs for startup errors |
| FailedScheduling | No node can run the pod | Check node resources and taints |
| Unhealthy | Liveness or readiness probe failed | Check probe config and app health |
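The table above can be encoded as a small lookup so that reasons seen in a pod's events map straight to a suggested action. This is only a sketch: `COMMON_ISSUES` and `suggest_actions` are hypothetical names, not part of the skill's scripts, and the input is assumed to be a plain list of reason strings already extracted from the events.

```python
# Hypothetical lookup built from the table above: event reason -> suggested action.
COMMON_ISSUES = {
    "OOMKilled": "Increase limits or fix memory leak",
    "ImagePullBackOff": "Check image name, registry auth",
    "CrashLoopBackOff": "Check logs for startup errors",
    "FailedScheduling": "Check node resources, taints",
    "Unhealthy": "Check probe config, app health",
}


def suggest_actions(reasons):
    """Return (reason, action) pairs for known reasons.

    Keeps first-seen order, skips duplicates, and ignores reasons
    that are not in the table (for example routine "Pulled" events).
    """
    seen = set()
    suggestions = []
    for reason in reasons:
        if reason in COMMON_ISSUES and reason not in seen:
            seen.add(reason)
            suggestions.append((reason, COMMON_ISSUES[reason]))
    return suggestions
```

Reasons that are missing from the table are dropped silently here; a real triage step would probably want to surface them for manual review instead.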

## Kubernetes Analysis

**Pod**: <name>
**Namespace**: <namespace>
**Status**: <phase> (Restarts: N)

### Events
- [timestamp] <reason>: <message>

### Issues Found
1. [Issue description with evidence]

### Root Cause Hypothesis
[Based on events and logs]

### Recommended Action
[Specific remediation step]
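As an illustration, the report template above could be filled in programmatically. The sketch below assumes events arrive as `(timestamp, reason, message)` tuples; `render_report` and its parameters are hypothetical names chosen for this example, not something the skill provides.

```python
def render_report(pod, namespace, phase, restarts, events, issues, hypothesis, action):
    """Fill in the Kubernetes Analysis template and return it as one string.

    events: iterable of (timestamp, reason, message) tuples (assumed shape).
    issues: iterable of issue descriptions, numbered from 1 in the output.
    """
    lines = [
        "## Kubernetes Analysis",
        "",
        f"**Pod**: {pod}",
        f"**Namespace**: {namespace}",
        f"**Status**: {phase} (Restarts: {restarts})",
        "",
        "### Events",
    ]
    lines += [f"- [{ts}] {reason}: {message}" for ts, reason, message in events]
    lines += ["", "### Issues Found"]
    lines += [f"{i}. {issue}" for i, issue in enumerate(issues, start=1)]
    lines += [
        "",
        "### Root Cause Hypothesis",
        hypothesis,
        "",
        "### Recommended Action",
        action,
    ]
    return "\n".join(lines)
```

Keeping the section headings fixed in one function makes it harder for a report to skip the evidence sections and jump straight to a recommendation.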

Kubernetes Debug

Kubernetes Debugging

Core Principle: Gateway First, Events Before Logs

Step 1: Discover clusters (MANDATORY first step)

Step 2: Use --cluster-id on all scripts that accept it
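The ordering described by the core principle (discover clusters, then events, then logs) can be sketched as a pair of helpers that build the command lines without running them. The function names and the `SCRIPTS` constant are hypothetical; only the script paths and flags come from the usage lines in this document. The output format of `list_clusters.py --json` is not shown here, so the sketch does not try to parse it and expects the caller to supply the cluster ID.

```python
SCRIPTS = ".claude/skills/infrastructure-kubernetes/scripts"


def discover_command():
    """Step 1: the cluster discovery command, always issued first."""
    return ["python", f"{SCRIPTS}/list_clusters.py", "--json"]


def investigate_commands(pod, namespace, cluster_id=None, tail=100):
    """Step 2: events before logs, each with --cluster-id when one is given."""

    def with_cluster(argv):
        # Append the flag only when the caller picked a remote cluster.
        if cluster_id:
            return argv + ["--cluster-id", cluster_id]
        return argv

    return [
        with_cluster(["python", f"{SCRIPTS}/get_events.py", pod, "-n", namespace]),
        with_cluster(
            ["python", f"{SCRIPTS}/get_logs.py", pod, "-n", namespace, "--tail", str(tail)]
        ),
    ]
```

Each returned list can be passed to `subprocess.run(argv, check=True)` in order; returning argument lists rather than shell strings avoids quoting problems with pod names and label selectors.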

Available Scripts

list_clusters.py - Discover available remote clusters

list_pods.py - List pods with status

get_events.py - Get pod events (check these before logs)

get_logs.py - Get pod logs

describe_pod.py - Detailed pod info

describe_deployment.py - Deployment status and rollout history

list_namespaces.py - List all namespaces

get_resources.py - Resource usage vs limits (direct-only)

describe_node.py - Node status, conditions, and resource usage (direct-only)

Debugging Workflows

Pod Not Starting (Pending/CrashLoopBackOff)

Pod Restarting (OOMKilled/Crashes)

Deployment Not Progressing

Node Resource Issues (High CPU/Memory, FailedScheduling)

Common Issues & Solutions

Output Format
