A comprehensive diagnostic skill for troubleshooting GPU pod scheduling and allocation issues in Kubernetes clusters using HAMi (Heterogeneous AI Computing Virtualization Middleware). It identifies GPU resource constraints, webhook configuration problems, device plugin issues, and scheduler policy misconfigurations to provide actionable remediation guidance.
These are instructions for an AI agent to diagnose and troubleshoot GPU-related issues in a Kubernetes namespace with HAMi deployed.
HAMi may be installed in any namespace (not necessarily kube-system). First, discover where HAMi components are deployed:
A. Find all HAMi pods across all namespaces:
kubectl get pods -A | grep -i hami
B. Identify the HAMi namespace and store it for subsequent commands:
# Find the namespace where HAMi scheduler is running
HAMI_NS=$(kubectl get pods -A -l app=hami-scheduler -o jsonpath='{.items[0].metadata.namespace}' 2>/dev/null)
echo "HAMi is installed in namespace: $HAMI_NS"
C. If HAMi pods are not found by label, search by name:
# Alternative: find by pod name pattern
kubectl get pods -A --no-headers | grep -E "hami-scheduler|hami-device-plugin|hami-vgpu" | awk '{print "Namespace: "$1, "Pod: "$2}'
| Discovery Issue | Root Cause | Remediation |
|---|---|---|
| No HAMi pods found | HAMi not installed or different naming | Check Helm releases: `helm list -A \| grep hami` |
| Multiple namespaces | Multiple HAMi installations | Verify which installation is active via webhook |
| Pods not labeled | Custom Helm values used | Search by pod name pattern instead of labels |
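The label lookup and the name-pattern fallback above can be combined into one discovery helper. A minimal sketch (the pod-name patterns are the ones used in step C; adjust them if your Helm release uses a custom name prefix):

```shell
#!/bin/sh
# Discover the HAMi namespace: try the app=hami-scheduler label first,
# then fall back to matching pod names when labels were customized.
discover_hami_ns() {
  ns=$(kubectl get pods -A -l app=hami-scheduler \
       -o jsonpath='{.items[0].metadata.namespace}' 2>/dev/null)
  if [ -z "$ns" ]; then
    # Column layout of `kubectl get pods -A --no-headers`: NAMESPACE NAME ...
    ns=$(kubectl get pods -A --no-headers 2>/dev/null |
         awk '$2 ~ /hami-scheduler|hami-device-plugin|hami-vgpu/ {print $1; exit}')
  fi
  printf '%s\n' "$ns"
}

HAMI_NS=$(discover_hami_ns)
if [ -n "$HAMI_NS" ]; then
  echo "HAMi is installed in namespace: $HAMI_NS"
else
  echo "HAMi pods not found - check helm releases"
fi
```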
Before diagnosing GPU pod issues, verify that all HAMi components are running properly:
Note: Replace `$HAMI_NS` with the namespace discovered in Step 0, or use the `-A` flag to search all namespaces.
A. Check HAMi Scheduler:
# Using discovered namespace
kubectl get pods -n $HAMI_NS -l app=hami-scheduler
kubectl logs -n $HAMI_NS -l app=hami-scheduler --tail=50
# Alternative: search all namespaces for the pods
kubectl get pods -A -l app=hami-scheduler
# Note: kubectl logs does not accept -A; supply the namespace found above
kubectl logs -n <namespace> -l app=hami-scheduler --tail=50
B. Check HAMi Device Plugin (DaemonSet):
# Using discovered namespace
kubectl get pods -n $HAMI_NS -l app=hami-device-plugin
kubectl logs -n $HAMI_NS -l app=hami-device-plugin --tail=50
# Alternative: search all namespaces
kubectl get pods -A -l app=hami-device-plugin
C. Check HAMi vGPU Monitor:
kubectl get pods -A -l app=hami-vgpu-monitor
D. Verify MutatingWebhookConfiguration:
kubectl get mutatingwebhookconfiguration | grep -i hami
kubectl describe mutatingwebhookconfiguration hami-webhook
| Component Status | Root Cause | Remediation |
|---|---|---|
| Scheduler not running | Deployment issue, image pull failure | Check deployment events, verify image availability |
| Device plugin CrashLoop | NVIDIA driver mismatch, permission issues | Check node NVIDIA driver, verify privileged mode |
| Webhook not registered | Helm install incomplete, cert issues | Re-run helm install, check cert-manager/patch job |
| Monitor not running | Optional component, may not affect core functionality | Check if monitoring is enabled in values.yaml |
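Checks A through C can be run in one pass. A sketch, assuming `$HAMI_NS` was set in Step 0 (the `custom-columns` paths are standard pod status fields):

```shell
#!/bin/sh
# One-pass HAMi component summary; assumes HAMI_NS was discovered in Step 0.
for label in hami-scheduler hami-device-plugin hami-vgpu-monitor; do
  echo "== app=$label =="
  kubectl get pods -n "$HAMI_NS" -l "app=$label" \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount' \
    2>/dev/null
done
kubectl get mutatingwebhookconfiguration 2>/dev/null | grep -i hami \
  || echo "(no HAMi webhook found or kubectl unavailable)"
```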
Check the available GPU resources across the cluster and identify resource constraints:
A. Check node GPU capacity and allocatable resources:
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,GPU_MEM:.status.capacity.nvidia\.com/gpumem,GPU_CORES:.status.capacity.nvidia\.com/gpucores'
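Note that with HAMi the `nvidia.com/gpu` capacity a node reports is typically the virtual slot count (physical GPUs × `deviceSplitCount`), so the raw number can look inflated. A quick sanity check, with illustrative values:

```shell
#!/bin/sh
# Sanity check: expected virtual GPU slots = physical GPUs x deviceSplitCount.
# Values below are illustrative; substitute your node's real numbers.
PHYSICAL_GPUS=8          # e.g. from `nvidia-smi -L | wc -l` on the node
DEVICE_SPLIT_COUNT=10    # nvidia.deviceSplitCount from the HAMi ConfigMap
EXPECTED_SLOTS=$((PHYSICAL_GPUS * DEVICE_SPLIT_COUNT))
echo "expected nvidia.com/gpu capacity: $EXPECTED_SLOTS"   # 80
```

If the reported capacity differs from this product, the device plugin may be running an older ConfigMap; restart the device-plugin DaemonSet after config changes.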
B. Check actual GPU allocation:
kubectl describe nodes | grep -A 20 "Allocated resources"
C. Check HAMi device ConfigMap for scaling settings:
# Using discovered namespace from Step 0
kubectl get configmap -n $HAMI_NS -l app=hami -o yaml
# Alternative: search all namespaces for HAMi configmaps
kubectl get configmap -A | grep -i hami
kubectl get configmap hami-scheduler-device -n $HAMI_NS -o yaml
Key configuration parameters to verify:
- nvidia.deviceMemoryScaling: Memory overcommit ratio (default: 1)
- nvidia.deviceSplitCount: Max tasks per GPU (default: 10)
- nvidia.defaultMem: Default memory allocation in MB (0 = 100%)
- nvidia.defaultCores: Default GPU core percentage (0 = no limit)

| Resource Issue | Root Cause | Remediation |
|---|---|---|
| GPU count shows 0 | Device plugin not running, NVIDIA driver missing | Deploy device plugin, install NVIDIA drivers |
| Low gpumem capacity | deviceMemoryScaling not configured | Set nvidia.deviceMemoryScaling > 1 for overcommit |
| Insufficient GPU slots | deviceSplitCount too low | Increase nvidia.deviceSplitCount in ConfigMap |
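For orientation, the parameters above might appear in the scheduler device ConfigMap roughly as follows. This is a sketch using the parameter names from this document; the exact key names and nesting vary by HAMi chart version, so compare against your actual ConfigMap:

```yaml
# Sketch of the nvidia section of the HAMi device ConfigMap (keys illustrative).
nvidia:
  deviceMemoryScaling: 1.5   # overcommit device memory by 50%
  deviceSplitCount: 10       # up to 10 tasks per physical GPU
  defaultMem: 0              # 0 = allocate 100% of device memory by default
  defaultCores: 0            # 0 = no core limit by default
```

After editing the ConfigMap, restart the device plugin pods so the new values are advertised to the kubelet.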
Identify and diagnose pending GPU pods:
A. Find pending pods requesting GPU resources:
kubectl get pods -n <namespace> --field-selector=status.phase=Pending -o wide
B. For each pending GPU pod, check detailed status:
kubectl describe pod <pod-name> -n <namespace>
C. Check pod GPU resource requests:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}' | jq .
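For reference, a HAMi vGPU request in a container spec typically combines the three resource names checked above. The values here are illustrative; per the defaults listed earlier, `gpumem` is in MB and `gpucores` is a percentage:

```yaml
# Illustrative vGPU request: 1 GPU slot, 4 GiB of device memory, half the cores.
resources:
  limits:
    nvidia.com/gpu: 1        # number of vGPU slots
    nvidia.com/gpumem: 4096  # device memory in MB
    nvidia.com/gpucores: 50  # percentage of GPU cores
```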
Common GPU-related pending reasons:
| Event Message | Root Cause | Remediation |
|---|---|---|
| `Insufficient nvidia.com/gpu` | No GPU slots available | Reduce GPU requests, increase deviceSplitCount, or add GPU nodes |
| `Insufficient nvidia.com/gpumem` | Insufficient GPU memory | Reduce gpumem request, enable deviceMemoryScaling, or use larger GPU |
| `Insufficient nvidia.com/gpucores` | GPU core allocation exhausted | Reduce gpucores request or wait for GPU release |
| `0/N nodes are available: N node(s) didn't match Pod's node affinity/selector` | Node selector mismatch | Check node labels, verify GPU node selectors |
| `0/N nodes are available: N Insufficient nvidia.com/gpu` | All GPUs fully allocated | Scale down other workloads or add GPU capacity |
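The event patterns above can be matched mechanically. A sketch that maps a scheduler event message to a likely root cause (patterns taken from the table; note the more specific `gpumem`/`gpucores` patterns must be tested before the bare `gpu` pattern):

```shell
#!/bin/sh
# Map a scheduler event message to a likely root cause (patterns from the table).
classify_pending_reason() {
  case "$1" in
    *"Insufficient nvidia.com/gpumem"*)   echo "insufficient GPU memory" ;;
    *"Insufficient nvidia.com/gpucores"*) echo "GPU core allocation exhausted" ;;
    *"Insufficient nvidia.com/gpu"*)      echo "no GPU slots available" ;;
    *"didn't match Pod's node affinity/selector"*) echo "node selector mismatch" ;;
    *) echo "unclassified - inspect events manually" ;;
  esac
}

classify_pending_reason "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"
# -> no GPU slots available
```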
D. Check HAMi scheduler logs for allocation failures:
# Using discovered namespace from Step 0
kubectl logs -n $HAMI_NS -l app=hami-scheduler --tail=100 | grep -i "error\|fail\|insufficient"
# Alternative: locate the scheduler pods first (kubectl logs does not accept -A)
kubectl get pods -A -l app=hami-scheduler
kubectl logs -n <namespace> -l app=hami-scheduler --tail=100 | grep -i "error\|fail\|insufficient"
Verify that the HAMi webhook is properly mutating GPU pods:
A. Check if pod was mutated by HAMi webhook:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 "hami.io\|nvidia.com"
B. Verify webhook is targeting the namespace:
kubectl get mutatingwebhookconfiguration hami-webhook -o yaml | grep -A 20 "namespaceSelector\|objectSelector"
C. Check if namespace has required labels (if namespaceSelector is configured):
kubectl get namespace <namespace> --show-labels
D. Verify pod has required labels (if objectSelector is configured):
kubectl get pod <pod-name> -n <namespace> --show-labels
| Webhook Issue | Root Cause | Remediation |
|---|---|---|
| Pod not mutated (no hami annotations) | Namespace not selected by webhook | Add required labels to namespace or update namespaceSelector |
| Pod not mutated (scheduler not set) | objectSelector filtering out pod | Add required labels to pod or update objectSelector |
| Webhook timeout/failure | Scheduler service unavailable | Check hami-scheduler service and endpoints |
| Pod rejected by webhook | failurePolicy=Fail and webhook error | Set failurePolicy=Ignore or fix webhook |
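When mutation succeeded, the pod typically carries HAMi's scheduler name and device-assignment annotations. A rough sketch of what check A should surface; the annotation keys shown are assumptions and their exact names depend on the HAMi version, so treat any `hami.io/*` annotation as a positive signal:

```yaml
# Sketch of a successfully mutated pod (annotation keys illustrative).
apiVersion: v1
kind: Pod
metadata:
  annotations:
    hami.io/vgpu-node: gpu-node-1            # node chosen by the HAMi scheduler
    hami.io/vgpu-devices-allocated: "..."    # device assignment recorded by HAMi
spec:
  schedulerName: hami-scheduler              # set by the mutating webhook
```

A pod with `schedulerName: default-scheduler` and no `hami.io` annotations was not mutated, which points to the namespaceSelector/objectSelector issues in the table above.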
E. Check webhook service availability:
# Using discovered namespace from Step 0
kubectl get svc -n $HAMI_NS | grep -i hami
kubectl get endpoints -n $HAMI_NS | grep -i hami
# Alternative: search all namespaces
kubectl get svc -A | grep -i hami
kubectl get endpoints -A | grep -i hami
F. Test webhook by creating a test pod:
apiVersion: v1