Kubernetes operations, troubleshooting, and platform engineering. Trigger for: kubectl, pods, deployments, services, ingress, Helm charts, K8s manifests, RBAC, pod security, network policies, CrashLoopBackOff, OOMKilled, ImagePullBackOff, node scheduling, HPA, cluster health, container debugging, rollbacks. Implicit queries: "which apps are failing", "compare namespaces", "what is broken", "is everything healthy", "check cluster status", "why is my service down", "pod keeps restarting", "container out of memory", "set resource limits", "create namespace with quotas", "audit permissions", "canary deployment", "rolling restart", "scale up", "scale down". Tools: kustomize, kubectl, kubelet, kubeconfig, helm, helmfile, prometheus, servicemonitor, grafana, ArgoCD, flux, gitops. Trigger even without the word Kubernetes.
This skill covers day-to-day Kubernetes operations (troubleshooting, debugging, scaling) and platform engineering (manifests, Helm, RBAC, autoscaling, monitoring).
Scripts: Always run scripts with --help first. Do not read script source unless debugging the script itself.
References: Load reference files on demand based on the task at hand. Do not pre-load all references.
Slash commands: Users can also invoke these directly:
- /k8s-skills:k8s-debug [pod] [namespace] — Diagnose a pod or deployment
- /k8s-skills:k8s-deploy [action] [name] [namespace] — Deploy, rollback, or restart
- /k8s-skills:k8s-health [namespace] — Cluster or namespace health check

| Category | Command | Purpose |
|---|---|---|
| Inspect | kubectl get pods -n <ns> | List pods in namespace |
| Inspect | kubectl get pods -A --field-selector status.phase!=Running | Find non-running pods across cluster |
| Inspect | kubectl get all -n <ns> | All resources in namespace |
| Inspect | kubectl get nodes -o wide | Node status with IPs and versions |
| Inspect | kubectl top pods -n <ns> | Pod CPU/memory usage |
| Inspect | kubectl top nodes | Node CPU/memory usage |
| Debug | kubectl describe pod <pod> -n <ns> | Full pod details + events |
| Debug | kubectl logs <pod> -n <ns> --tail=100 | Recent logs |
| Debug | kubectl logs <pod> -n <ns> --previous | Logs from crashed container |
| Debug | kubectl logs <pod> -n <ns> -c <container> | Specific container logs |
| Debug | kubectl exec -it <pod> -n <ns> -- /bin/sh | Shell into container |
| Debug | kubectl debug <pod> -it --image=busybox -n <ns> | Ephemeral debug container |
| Debug | kubectl port-forward svc/<svc> <local>:<remote> -n <ns> | Test service connectivity |
| Debug | kubectl get events -n <ns> --sort-by=.lastTimestamp | Recent events sorted |
| Deploy | kubectl apply -f <file> | Apply manifest declaratively |
| Deploy | kubectl rollout status deployment/<name> -n <ns> | Watch rollout progress |
| Deploy | kubectl rollout restart deployment/<name> -n <ns> | Graceful rolling restart |
| Deploy | kubectl rollout undo deployment/<name> -n <ns> | Rollback to previous revision |
| Deploy | kubectl rollout history deployment/<name> -n <ns> | View revision history |
| Scale | kubectl scale deployment/<name> --replicas=<N> -n <ns> | Manual horizontal scale |
| Config | kubectl config current-context | Show active context |
| Config | kubectl config use-context <ctx> | Switch cluster/context |
| Config | kubectl config get-contexts | List all contexts |
Follow the diagnostic path: get → describe → logs → exec
```
Pod not healthy?
│
├─ Status: CrashLoopBackOff
│  ├─ Check: kubectl logs <pod> --previous
│  ├─ Check: kubectl describe pod <pod> → Events section
│  ├─ Exit code 137 (OOMKilled)?
│  │  └─ Increase memory limits, check for memory leaks
│  ├─ Exit code 1 (app error)?
│  │  └─ Read logs, fix application startup
│  └─ Probe failure?
│     └─ Adjust initialDelaySeconds, check endpoint health
│
├─ Status: ImagePullBackOff
│  ├─ Check: kubectl describe pod <pod> → image name/tag
│  ├─ Registry auth? → Verify imagePullSecrets
│  ├─ Image exists? → Check registry directly
│  └─ Network? → Can node reach registry?
│
├─ Status: Pending
│  ├─ Check: kubectl describe pod <pod> → Events
│  ├─ Insufficient resources? → kubectl describe nodes, check allocatable
│  ├─ NodeSelector/affinity mismatch? → Verify node labels
│  ├─ Taint not tolerated? → Add tolerations or remove taint
│  └─ PVC not bound? → kubectl get pvc -n <ns>
│
├─ Status: Init:Error / Init:CrashLoopBackOff
│  ├─ Check: kubectl logs <pod> -c <init-container>
│  └─ Common: DB migration failed, config dependency not ready
│
├─ Status: Evicted
│  ├─ Check: kubectl describe node <node> → Conditions
│  ├─ DiskPressure? → Clean up node disk
│  └─ MemoryPressure? → Check resource quotas, set proper requests
│
└─ Status: Terminating (stuck)
   ├─ Check: kubectl describe pod <pod> → finalizers
   ├─ Finalizer stuck? → Patch to remove finalizer
   └─ Last resort: kubectl delete pod <pod> --grace-period=0 --force
```
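For the stuck-Terminating branch, a sketch of the finalizer fix (pod name and namespace are placeholders; only remove finalizers after confirming the controller that owns them is gone, since they normally guard cleanup logic):

```shell
# Inspect which finalizers are blocking deletion
kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.finalizers}'

# Clear the finalizers so deletion can proceed
kubectl patch pod <pod> -n <ns> --type=merge -p '{"metadata":{"finalizers":null}}'

# Last resort: force delete without waiting for kubelet confirmation
kubectl delete pod <pod> -n <ns> --grace-period=0 --force
```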
For detailed step-by-step debugging workflows covering every error state above, read the Troubleshooting Guide reference.
Standard deployment workflow:

```
# 1. Preview and validate the change before applying
kubectl diff -f manifest.yaml

# 2. Apply the change
kubectl apply -f manifest.yaml

# 3. Monitor rollout
kubectl rollout status deployment/<name> -n <ns> --timeout=300s

# 4. Verify pods are healthy
kubectl get pods -n <ns> -l app=<name>
kubectl logs -l app=<name> -n <ns> --tail=20

# 5. Verify service endpoints
kubectl get endpoints <service> -n <ns>
```
Rollback workflow:

```
# Check revision history
kubectl rollout history deployment/<name> -n <ns>

# Rollback to previous revision
kubectl rollout undo deployment/<name> -n <ns>

# Rollback to specific revision
kubectl rollout undo deployment/<name> -n <ns> --to-revision=<N>

# Verify rollback completed
kubectl rollout status deployment/<name> -n <ns>
When writing Deployment manifests, configure the update strategy:
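A minimal sketch of a RollingUpdate strategy; the app name, image, and numbers are illustrative placeholders to adjust per workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1              # at most 1 extra pod during a rollout
      maxUnavailable: 0        # never drop below the desired replica count
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.0   # placeholder image
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              memory: "256Mi"
```

With maxUnavailable: 0 and maxSurge: 1, the rollout brings up each new pod and waits for it to become Ready before terminating an old one, so capacity never dips during the update.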