Kubernetes operations, troubleshooting, and platform engineering. Trigger for: kubectl, pods, deployments, services, ingress, Helm charts, K8s manifests, RBAC, pod security, network policies, CrashLoopBackOff, OOMKilled, ImagePullBackOff, node scheduling, HPA, cluster health, container debugging, rollbacks. Implicit queries: "which apps are failing", "compare namespaces", "what is broken", "is everything healthy", "check cluster status", "why is my service down", "pod keeps restarting", "container out of memory", "set resource limits", "create namespace with quotas", "audit permissions", "canary deployment", "rolling restart", "scale up", "scale down". Tools: kustomize, kubectl, kubelet, kubeconfig, helm, helmfile, prometheus, servicemonitor, grafana, ArgoCD, flux, gitops. Trigger even without the word Kubernetes.
This skill covers day-to-day Kubernetes operations (troubleshooting, debugging, scaling) and platform engineering (manifests, Helm, RBAC, autoscaling, monitoring).
Scripts: Always run scripts with --help first. Do not read script source unless debugging the script itself.
References: Load reference files on demand based on the task at hand. Do not pre-load all references.
Slash commands: Users can also invoke these directly:
- /k8s-skills:k8s-debug [pod] [namespace] — Diagnose a pod or deployment
- /k8s-skills:k8s-deploy [action] [name] [namespace] — Deploy, rollback, or restart
- /k8s-skills:k8s-health [namespace] — Cluster or namespace health check

| Category | Command | Purpose |
|---|---|---|
| Inspect | kubectl get pods -n <ns> | List pods in namespace |
| Inspect | kubectl get pods -A --field-selector status.phase!=Running | Find non-running pods across cluster |
| Inspect | kubectl get all -n <ns> | All resources in namespace |
| Inspect | kubectl get nodes -o wide | Node status with IPs and versions |
| Inspect | kubectl top pods -n <ns> | Pod CPU/memory usage |
| Inspect | kubectl top nodes | Node CPU/memory usage |
| Debug | kubectl describe pod <pod> -n <ns> | Full pod details + events |
| Debug | kubectl logs <pod> -n <ns> --tail=100 | Recent logs |
| Debug | kubectl logs <pod> -n <ns> --previous | Logs from crashed container |
| Debug | kubectl logs <pod> -n <ns> -c <container> | Specific container logs |
| Debug | kubectl exec -it <pod> -n <ns> -- /bin/sh | Shell into container |
| Debug | kubectl debug <pod> -it --image=busybox -n <ns> | Ephemeral debug container |
| Debug | kubectl port-forward svc/<svc> <local>:<remote> -n <ns> | Test service connectivity |
| Debug | kubectl get events -n <ns> --sort-by=.lastTimestamp | Recent events sorted |
| Deploy | kubectl apply -f <file> | Apply manifest declaratively |
| Deploy | kubectl rollout status deployment/<name> -n <ns> | Watch rollout progress |
| Deploy | kubectl rollout restart deployment/<name> -n <ns> | Graceful rolling restart |
| Deploy | kubectl rollout undo deployment/<name> -n <ns> | Rollback to previous revision |
| Deploy | kubectl rollout history deployment/<name> -n <ns> | View revision history |
| Scale | kubectl scale deployment/<name> --replicas=<N> -n <ns> | Manual horizontal scale |
| Config | kubectl config current-context | Show active context |
| Config | kubectl config use-context <ctx> | Switch cluster/context |
| Config | kubectl config get-contexts | List all contexts |
Follow the diagnostic path: get → describe → logs → exec
```
Pod not healthy?
│
├─ Status: CrashLoopBackOff
│  ├─ Check: kubectl logs <pod> --previous
│  ├─ Check: kubectl describe pod <pod> → Events section
│  ├─ Exit code 137 (OOMKilled)?
│  │  └─ Increase memory limits, check for memory leaks
│  ├─ Exit code 1 (app error)?
│  │  └─ Read logs, fix application startup
│  └─ Probe failure?
│     └─ Adjust initialDelaySeconds, check endpoint health
│
├─ Status: ImagePullBackOff
│  ├─ Check: kubectl describe pod <pod> → image name/tag
│  ├─ Registry auth? → Verify imagePullSecrets
│  ├─ Image exists? → Check registry directly
│  └─ Network? → Can node reach registry?
│
├─ Status: Pending
│  ├─ Check: kubectl describe pod <pod> → Events
│  ├─ Insufficient resources? → kubectl describe nodes, check allocatable
│  ├─ NodeSelector/affinity mismatch? → Verify node labels
│  ├─ Taint not tolerated? → Add tolerations or remove taint
│  └─ PVC not bound? → kubectl get pvc -n <ns>
│
├─ Status: Init:Error / Init:CrashLoopBackOff
│  ├─ Check: kubectl logs <pod> -c <init-container>
│  └─ Common: DB migration failed, config dependency not ready
│
├─ Status: Evicted
│  ├─ Check: kubectl describe node <node> → Conditions
│  ├─ DiskPressure? → Clean up node disk
│  └─ MemoryPressure? → Check resource quotas, set proper requests
│
└─ Status: Terminating (stuck)
   ├─ Check: kubectl describe pod <pod> → finalizers
   ├─ Finalizer stuck? → Patch to remove finalizer
   └─ Last resort: kubectl delete pod <pod> --grace-period=0 --force
```
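For the stuck-Terminating branch, a sketch of the finalizer fix (pod name and namespace are placeholders; only remove finalizers after confirming the controller that owns them is gone, since they normally guard cleanup logic):

```shell
# Inspect which finalizers are blocking deletion
kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.finalizers}'

# Clear the finalizers so deletion can proceed
kubectl patch pod <pod> -n <ns> --type=merge -p '{"metadata":{"finalizers":null}}'

# Last resort: force delete without waiting for kubelet confirmation
kubectl delete pod <pod> -n <ns> --grace-period=0 --force
```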
For detailed step-by-step debugging workflows covering every error state above, read the Troubleshooting Guide reference.
Standard deployment workflow:

```
# 1. Preview and validate the change before applying
kubectl diff -f manifest.yaml

# 2. Apply the change
kubectl apply -f manifest.yaml

# 3. Monitor rollout
kubectl rollout status deployment/<name> -n <ns> --timeout=300s

# 4. Verify pods are healthy
kubectl get pods -n <ns> -l app=<name>
kubectl logs -l app=<name> -n <ns> --tail=20

# 5. Verify service endpoints
kubectl get endpoints <service> -n <ns>
```
Rollback workflow:

```
# Check revision history
kubectl rollout history deployment/<name> -n <ns>

# Rollback to previous revision
kubectl rollout undo deployment/<name> -n <ns>

# Rollback to specific revision
kubectl rollout undo deployment/<name> -n <ns> --to-revision=<N>

# Verify rollback completed
kubectl rollout status deployment/<name> -n <ns>
When writing Deployment manifests, configure the update strategy:
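A minimal sketch of a RollingUpdate strategy; the app name, image, and numbers are illustrative placeholders to adjust per workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1              # at most 1 extra pod during a rollout
      maxUnavailable: 0        # never drop below the desired replica count
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.0   # placeholder image
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              memory: "256Mi"
```

With maxUnavailable: 0 and maxSurge: 1, the rollout brings up each new pod and waits for it to become Ready before terminating an old one, so capacity never dips during the update.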