A comprehensive diagnostic skill that analyzes the health, resource usage, and scheduling issues within a specified Kubernetes namespace. It identifies quota constraints, pending pods, workload status anomalies, inefficient resource allocation, and critical events to provide actionable remediation guidance.
These are instructions for an AI agent to diagnose and troubleshoot issues in a Kubernetes namespace.
Please help me check the resource quotas (ResourceQuota) and current resource usage in the Kubernetes namespace knowledge-graph (kubectl describe resourcequota -n knowledge-graph), and summarize which quotas are near or at their limits.
Identify pending pods:
kubectl get pods -n knowledge-graph --field-selector=status.phase=Pending
For each pod, run:
kubectl describe pod <pod-name> -n knowledge-graph
Parse the Events section to determine the cause (e.g., FailedScheduling due to insufficient CPU/memory, an unsatisfied nodeSelector/affinity, an untolerated taint, or an exceeded ResourceQuota), then recommend fixes accordingly.
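The describe-and-parse loop above can be sketched as a small classifier. The message substrings below are typical kube-scheduler wording and may differ across Kubernetes versions, so treat the mapping as a starting point, not an exhaustive rule set:

```shell
# Map typical kube-scheduler FailedScheduling messages to likely fixes.
# Substrings are version-dependent -- adjust for your cluster's scheduler output.
classify_pending_reason() {
  case "$1" in
    *"Insufficient cpu"*)     echo "Reduce CPU requests, raise quota, or add nodes" ;;
    *"Insufficient memory"*)  echo "Reduce memory requests, raise quota, or add nodes" ;;
    *"exceeded quota"*)       echo "Raise the namespace ResourceQuota or remove workloads" ;;
    *"node selector"*|*"node affinity"*) echo "Fix nodeSelector/affinity or label the nodes" ;;
    *"untolerated taint"*)    echo "Add a matching toleration or remove the taint" ;;
    *)                        echo "Inspect 'kubectl describe pod' Events manually" ;;
  esac
}

# Usage against a live cluster:
# for p in $(kubectl get pods -n knowledge-graph --field-selector=status.phase=Pending -o name); do
#   msg=$(kubectl describe "$p" -n knowledge-graph | grep -m1 FailedScheduling)
#   echo "$p -> $(classify_pending_reason "$msg")"
# done
```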
Identify CrashLoopBackOff pods: kubectl get pods -n knowledge-graph | grep -E "CrashLoopBackOff|Error"
For each CrashLoopBackOff pod, perform deep analysis:
A. Check container termination reason: kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
Common termination reasons:
- OOMKilled (exitCode: 137) → memory limit exceeded
- Error (exitCode: 1) → application error
- Error (exitCode: 137) → SIGKILL (usually OOM or external kill)
- Error (exitCode: 139) → SIGSEGV (segmentation fault)
- Error (exitCode: 143) → SIGTERM (graceful shutdown failed)

B. Check current and previous container logs:
kubectl logs <pod-name> -n knowledge-graph --tail=100
kubectl logs <pod-name> -n knowledge-graph --previous --tail=100
C. Check resource configuration: kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.spec.containers[*].resources}' | jq .
D. Check for common issues:
| Termination Reason | Root Cause | Remediation |
|---|---|---|
| OOMKilled | Memory usage exceeds limit | Increase memory limits in deployment/pod spec |
| Error (exitCode: 1) | Application startup failure | Check logs for application errors, config issues |
| Error (exitCode: 137) | SIGKILL (external kill or OOM) | Check OOM events, increase memory or fix memory leak |
| Error (exitCode: 139) | Segmentation fault | Debug application, check for incompatible libraries |
| Error (exitCode: 143) | SIGTERM handling issue | Increase terminationGracePeriodSeconds, fix signal handling |
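The table above maps directly onto a small helper. As a general rule, exit codes above 128 mean the container was killed by signal (code − 128); this sketch encodes the table's specific cases and falls back to that rule:

```shell
# Translate a container exit code into the likely cause from the table above.
explain_exit_code() {
  case "$1" in
    1)   echo "Application error: check container logs" ;;
    137) echo "SIGKILL: usually OOMKilled, check memory limits" ;;
    139) echo "SIGSEGV: segmentation fault in the application" ;;
    143) echo "SIGTERM: graceful shutdown did not complete in time" ;;
    *)   if [ "$1" -gt 128 ]; then
           echo "Killed by signal $(( $1 - 128 ))"
         else
           echo "Exit code $1: check application logs"
         fi ;;
  esac
}
```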
E. For OOMKilled specifically:
Check current resource requests vs limits ratio
Recommend increasing memory limit (typically 2x current limit)
Example patch command:
kubectl patch deployment <deployment-name> -n knowledge-graph --type='json' -p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "<new-limit>"},
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "<new-request>"}
]'
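A small helper can compute the doubled value for the patch above. This is a sketch that handles only whole-number Mi/Gi quantities; other suffixes and fractional values are out of scope:

```shell
# Double a Kubernetes memory quantity such as 512Mi or 2Gi.
# Only whole-number Mi/Gi values are handled in this sketch.
double_memory() {
  value="${1%[MG]i}"        # numeric part, e.g. 512
  unit="${1#"$value"}"      # suffix, e.g. Mi
  echo "$(( value * 2 ))${unit}"
}
```

For example, `double_memory 512Mi` prints `1024Mi`; substitute the result for `<new-limit>` in the patch command.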
F. Check pod restart count and frequency: kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.status.containerStatuses[*].restartCount}'
High restart counts indicate a persistent failure loop rather than a transient error: correlate the restart count with lastState.terminated and the log output above to find the recurring cause.
Identify ImagePullBackOff pods: kubectl get pods -n knowledge-graph | grep -E "ImagePullBackOff|ErrImagePull"
For each ImagePullBackOff pod, check:
A. Get image details: kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.spec.containers[*].image}'
B. Check events for specific error: kubectl describe pod <pod-name> -n knowledge-graph | grep -A5 "Events:"
C. Common causes and remediation:
| Error Message | Root Cause | Remediation |
|---|---|---|
| "repository does not exist" | Wrong image name/tag | Verify image name and tag exist in registry |
| "unauthorized" | Missing/invalid credentials | Create/update imagePullSecrets |
| "manifest unknown" | Tag doesn't exist | Verify tag exists, check for typos |
| "connection refused" | Registry unreachable | Check network, firewall, registry status |
| "x509: certificate" | TLS/SSL issues | Add CA cert or configure insecure registry |
D. For private registry issues:
kubectl get pod <pod-name> -n knowledge-graph -o jsonpath='{.spec.imagePullSecrets}'
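The error table in step C can also be scripted as a classifier over the event message. This is a sketch; exact error strings vary by registry and container runtime:

```shell
# Map common image-pull error text to a remediation, mirroring the table above.
classify_pull_error() {
  case "$1" in
    *"repository does not exist"*) echo "Verify the image name and tag exist in the registry" ;;
    *unauthorized*)                echo "Create or update imagePullSecrets for the registry" ;;
    *"manifest unknown"*)          echo "Verify the tag exists; check for typos" ;;
    *"connection refused"*)        echo "Check network, firewall, and registry availability" ;;
    *"x509"*)                      echo "Add the CA certificate or configure the registry as insecure" ;;
    *)                             echo "Inspect the full event message" ;;
  esac
}
```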
Please analyze the resource allocation in the Kubernetes namespace knowledge-graph, comparing each pod's requests and limits against its actual usage.
Commands:
kubectl get pods -n knowledge-graph -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
kubectl top pods -n knowledge-graph
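To compare the requests/limits columns with `kubectl top` output numerically, CPU quantities can be normalized to millicores. A minimal sketch, assuming only millicore (`250m`) or whole-core (`2`) values; fractional cores like `0.5` are not handled, and unset fields map to 0:

```shell
# Normalize a Kubernetes CPU quantity to millicores for comparison.
cpu_to_millicores() {
  case "$1" in
    '')  echo 0 ;;                  # request/limit not set
    *m)  echo "${1%m}" ;;           # already millicores, e.g. 250m
    *)   echo "$(( $1 * 1000 ))" ;; # whole cores, e.g. 2 -> 2000
  esac
}
```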
Analyze for common issues: missing requests or limits, requests far above actual usage (over-provisioning), usage close to limits (throttling or OOM risk), and limit-to-request ratios that allow heavy overcommit.
Please check the events in the Kubernetes namespace knowledge-graph and summarize recent warnings, their frequency, and the objects they affect:
Command:
kubectl get events -n knowledge-graph --sort-by=.lastTimestamp
Filter for Warning events: kubectl get events -n knowledge-graph --field-selector type=Warning --sort-by=.lastTimestamp
Group by Reason and correlate with affected objects
Map common reasons to solutions:
| Event Reason | Description | Remediation |
|---|---|---|
| ImagePullBackOff | Cannot pull container image | See Step 2.2 for detailed diagnosis |
| CrashLoopBackOff | Container keeps crashing | See Step 2.1 for detailed diagnosis |
| BackOff | Container restart backoff | Check logs, may be OOMKilled or app error |
| OOMKilled | Out of memory | Increase memory limits |
| FailedScheduling | Cannot schedule pod | Check resources, node selectors, taints |
| FailedAttachVolume | Volume attach failed | Check PVC, StorageClass, node issues |
| FailedMount | Volume mount failed | Check PVC bound, node access to storage |
| Unhealthy | Probe failed | Check liveness/readiness probe config |
| NodeNotReady | Node is not ready | Check node status, kubelet logs |
| Evicted | Pod evicted | Check node resources, pod priority |
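The group-by-Reason step can be done with a short pipeline; the jsonpath form below yields a stable single column of reasons regardless of how `kubectl get events` formats its table:

```shell
# Count events by reason, most frequent first.
group_event_reasons() {
  sort | uniq -c | sort -rn
}

# Usage (live cluster):
# kubectl get events -n knowledge-graph --field-selector type=Warning \
#   -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | group_event_reasons
```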
Please generate a comprehensive report for the Kubernetes namespace knowledge-graph that includes: ResourceQuota status and remaining headroom, pending pods and their scheduling blockers, CrashLoopBackOff and ImagePullBackOff diagnoses with root causes, request/limit versus actual-usage findings, notable Warning events, and a prioritized list of remediation actions.
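The diagnosis steps above can be stitched into one report skeleton. This is a sketch assuming kubectl access to the cluster; each section simply reruns the corresponding command from earlier steps:

```shell
# Print the report sections in order; namespace defaults to knowledge-graph.
NS="${1:-knowledge-graph}"

report() {
  echo "=== Namespace report: $NS ==="
  echo "--- ResourceQuota ---";       kubectl get resourcequota -n "$NS" || true
  echo "--- Pending pods ---";        kubectl get pods -n "$NS" --field-selector=status.phase=Pending || true
  echo "--- Crashing pods ---";       kubectl get pods -n "$NS" | grep -E 'CrashLoopBackOff|Error' || true
  echo "--- Image pull failures ---"; kubectl get pods -n "$NS" | grep -E 'ImagePullBackOff|ErrImagePull' || true
  echo "--- Warning events ---";      kubectl get events -n "$NS" --field-selector type=Warning --sort-by=.lastTimestamp || true
}

# Run with: report
```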