A comprehensive diagnostic skill that analyzes the health, resource usage, and scheduling issues within a specified Kubernetes namespace. It identifies quota constraints, pending pods, workload status anomalies, inefficient resource allocation, and critical events to provide actionable remediation guidance.
These are instructions for an AI agent to diagnose and troubleshoot issues in a Kubernetes namespace.
Please help me check the resource quotas (ResourceQuota) and current resource usage in the Kubernetes namespace knowledge-graph (kubectl describe resourcequota -n knowledge-graph), and summarize which quotas are near or at their limits.
Identify pending pods:
kubectl get pods -n knowledge-graph --field-selector=status.phase=Pending
For each pod, run:
kubectl describe pod <pod-name> -n knowledge-graph
Parse the Events section to determine the cause (e.g., FailedScheduling due to insufficient CPU/memory, an unsatisfied nodeSelector/affinity, an untolerated taint, or an exceeded ResourceQuota), then recommend fixes accordingly.
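The describe-and-parse loop above can be sketched as a small classifier. The message substrings below are typical kube-scheduler wording and may differ across Kubernetes versions, so treat the mapping as a starting point, not an exhaustive rule set:

```shell
# Map typical kube-scheduler FailedScheduling messages to likely fixes.
# Substrings are version-dependent -- adjust for your cluster's scheduler output.
classify_pending_reason() {
  case "$1" in
    *"Insufficient cpu"*)     echo "Reduce CPU requests, raise quota, or add nodes" ;;
    *"Insufficient memory"*)  echo "Reduce memory requests, raise quota, or add nodes" ;;
    *"exceeded quota"*)       echo "Raise the namespace ResourceQuota or remove workloads" ;;
    *"node selector"*|*"node affinity"*) echo "Fix nodeSelector/affinity or label the nodes" ;;
    *"untolerated taint"*)    echo "Add a matching toleration or remove the taint" ;;
    *)                        echo "Inspect 'kubectl describe pod' Events manually" ;;
  esac
}

# Usage against a live cluster:
# for p in $(kubectl get pods -n knowledge-graph --field-selector=status.phase=Pending -o name); do
#   msg=$(kubectl describe "$p" -n knowledge-graph | grep -m1 FailedScheduling)
#   echo "$p -> $(classify_pending_reason "$msg")"
# done
```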
Identify CrashLoopBackOff pods: kubectl get pods -n knowledge-graph | grep -E "CrashLoopBackOff|Error"
For each CrashLoopBackOff pod, perform deep analysis:
A. Check container termination reason: kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
Common termination reasons:
- OOMKilled (exitCode: 137) → memory limit exceeded
- Error (exitCode: 1) → application error
- Error (exitCode: 137) → SIGKILL (usually OOM or external kill)
- Error (exitCode: 139) → SIGSEGV (segmentation fault)
- Error (exitCode: 143) → SIGTERM (graceful shutdown failed)

B. Check current and previous container logs:
kubectl logs <pod-name> -n knowledge-graph --tail=100
kubectl logs <pod-name> -n knowledge-graph --previous --tail=100
C. Check resource configuration: kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.spec.containers[*].resources}' | jq .
D. Check for common issues:
| Termination Reason | Root Cause | Remediation |
|---|---|---|
| OOMKilled | Memory usage exceeds limit | Increase memory limits in deployment/pod spec |
| Error (exitCode: 1) | Application startup failure | Check logs for application errors, config issues |
| Error (exitCode: 137) | SIGKILL (external kill or OOM) | Check OOM events, increase memory or fix memory leak |
| Error (exitCode: 139) | Segmentation fault | Debug application, check for incompatible libraries |
| Error (exitCode: 143) | SIGTERM handling issue | Increase terminationGracePeriodSeconds, fix signal handling |
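The table above maps directly onto a small helper. As a general rule, exit codes above 128 mean the container was killed by signal (code − 128); this sketch encodes the table's specific cases and falls back to that rule:

```shell
# Translate a container exit code into the likely cause from the table above.
explain_exit_code() {
  case "$1" in
    1)   echo "Application error: check container logs" ;;
    137) echo "SIGKILL: usually OOMKilled, check memory limits" ;;
    139) echo "SIGSEGV: segmentation fault in the application" ;;
    143) echo "SIGTERM: graceful shutdown did not complete in time" ;;
    *)   if [ "$1" -gt 128 ]; then
           echo "Killed by signal $(( $1 - 128 ))"
         else
           echo "Exit code $1: check application logs"
         fi ;;
  esac
}
```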
E. For OOMKilled specifically:
Check current resource requests vs limits ratio
Recommend increasing memory limit (typically 2x current limit)
Example patch command:
kubectl patch deployment <deployment-name> -n knowledge-graph --type='json' -p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "<new-limit>"},
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "<new-request>"}
]'
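A small helper can compute the doubled value for the patch above. This is a sketch that handles only whole-number Mi/Gi quantities; other suffixes and fractional values are out of scope:

```shell
# Double a Kubernetes memory quantity such as 512Mi or 2Gi.
# Only whole-number Mi/Gi values are handled in this sketch.
double_memory() {
  value="${1%[MG]i}"        # numeric part, e.g. 512
  unit="${1#"$value"}"      # suffix, e.g. Mi
  echo "$(( value * 2 ))${unit}"
}
```

For example, `double_memory 512Mi` prints `1024Mi`; substitute the result for `<new-limit>` in the patch command.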
F. Check pod restart count and frequency: kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.status.containerStatuses[*].restartCount}'
High restart counts indicate a persistent failure loop rather than a transient error: correlate the restart count with lastState.terminated and the log output above to find the recurring cause.
Identify ImagePullBackOff pods: kubectl get pods -n knowledge-graph | grep -E "ImagePullBackOff|ErrImagePull"
For each ImagePullBackOff pod, check:
A. Get image details: kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.spec.containers[*].image}'
B. Check events for specific error: kubectl describe pod <pod-name> -n knowledge-graph | grep -A5 "Events:"
C. Common causes and remediation:
| Error Message | Root Cause | Remediation |
|---|---|---|
| "repository does not exist" | Wrong image name/tag | Verify image name and tag exist in registry |
| "unauthorized" | Missing/invalid credentials | Create/update imagePullSecrets |
| "manifest unknown" | Tag doesn't exist | Verify tag exists, check for typos |
| "connection refused" | Registry unreachable | Check network, firewall, registry status |
| "x509: certificate" | TLS/SSL issues | Add CA cert or configure insecure registry |
D. For private registry issues:
kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.spec.imagePullSecrets}'
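The error table in step C can also be scripted as a classifier over the event message. This is a sketch; exact error strings vary by registry and container runtime:

```shell
# Map common image-pull error text to a remediation, mirroring the table above.
classify_pull_error() {
  case "$1" in
    *"repository does not exist"*) echo "Verify the image name and tag exist in the registry" ;;
    *unauthorized*)                echo "Create or update imagePullSecrets for the registry" ;;
    *"manifest unknown"*)          echo "Verify the tag exists; check for typos" ;;
    *"connection refused"*)        echo "Check network, firewall, and registry availability" ;;
    *"x509"*)                      echo "Add the CA certificate or configure the registry as insecure" ;;
    *)                             echo "Inspect the full event message" ;;
  esac
}
```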
Please analyze the resource allocation in the Kubernetes namespace knowledge-graph, comparing each pod's requests and limits against its actual usage.
Commands:
kubectl get pods -n knowledge-graph -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
kubectl top pods -n knowledge-graph
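To compare the requests/limits columns with `kubectl top` output numerically, CPU quantities can be normalized to millicores. A minimal sketch, assuming only millicore (`250m`) or whole-core (`2`) values; fractional cores like `0.5` are not handled, and unset fields map to 0:

```shell
# Normalize a Kubernetes CPU quantity to millicores for comparison.
cpu_to_millicores() {
  case "$1" in
    '')  echo 0 ;;                  # request/limit not set
    *m)  echo "${1%m}" ;;           # already millicores, e.g. 250m
    *)   echo "$(( $1 * 1000 ))" ;; # whole cores, e.g. 2 -> 2000
  esac
}
```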
Analyze for common issues: missing requests or limits, requests far above actual usage (over-provisioning), usage close to limits (throttling or OOM risk), and limit-to-request ratios that allow heavy overcommit.
Please check the events in the Kubernetes namespace knowledge-graph and summarize recent warnings, their frequency, and the objects they affect:
Command:
kubectl get events -n knowledge-graph --sort-by=.lastTimestamp
Filter for Warning events: kubectl get events -n knowledge-graph --field-selector type=Warning --sort-by=.lastTimestamp
Group by Reason and correlate with affected objects
Map common reasons to solutions:
| Event Reason | Description | Remediation |
|---|---|---|
| ImagePullBackOff | Cannot pull container image | See Step 2.2 for detailed diagnosis |
| CrashLoopBackOff | Container keeps crashing | See Step 2.1 for detailed diagnosis |
| BackOff | Container restart backoff | Check logs, may be OOMKilled or app error |
| OOMKilled | Out of memory | Increase memory limits |
| FailedScheduling | Cannot schedule pod | Check resources, node selectors, taints |
| FailedAttachVolume | Volume attach failed | Check PVC, StorageClass, node issues |
| FailedMount | Volume mount failed | Check PVC bound, node access to storage |
| Unhealthy | Probe failed | Check liveness/readiness probe config |
| NodeNotReady | Node is not ready | Check node status, kubelet logs |
| Evicted | Pod evicted | Check node resources, pod priority |
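The group-by-Reason step can be done with a short pipeline; the jsonpath form below yields a stable single column of reasons regardless of how `kubectl get events` formats its table:

```shell
# Count events by reason, most frequent first.
group_event_reasons() {
  sort | uniq -c | sort -rn
}

# Usage (live cluster):
# kubectl get events -n knowledge-graph --field-selector type=Warning \
#   -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | group_event_reasons
```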
Please generate a comprehensive report for the Kubernetes namespace knowledge-graph that includes: ResourceQuota status and remaining headroom, pending pods and their scheduling blockers, CrashLoopBackOff and ImagePullBackOff diagnoses with root causes, request/limit versus actual-usage findings, notable Warning events, and a prioritized list of remediation actions.
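The diagnosis steps above can be stitched into one report skeleton. This is a sketch assuming kubectl access to the cluster; each section simply reruns the corresponding command from earlier steps:

```shell
# Print the report sections in order; namespace defaults to knowledge-graph.
NS="${1:-knowledge-graph}"

report() {
  echo "=== Namespace report: $NS ==="
  echo "--- ResourceQuota ---";       kubectl get resourcequota -n "$NS" || true
  echo "--- Pending pods ---";        kubectl get pods -n "$NS" --field-selector=status.phase=Pending || true
  echo "--- Crashing pods ---";       kubectl get pods -n "$NS" | grep -E 'CrashLoopBackOff|Error' || true
  echo "--- Image pull failures ---"; kubectl get pods -n "$NS" | grep -E 'ImagePullBackOff|ErrImagePull' || true
  echo "--- Warning events ---";      kubectl get events -n "$NS" --field-selector type=Warning --sort-by=.lastTimestamp || true
}

# Run with: report
```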