A comprehensive diagnostic skill for troubleshooting GPU pod scheduling and allocation issues in Kubernetes clusters using HAMi (Heterogeneous AI Computing Virtualization Middleware). It identifies GPU resource constraints, webhook configuration problems, device plugin issues, and scheduler policy misconfigurations to provide actionable remediation guidance.
These are instructions for an AI agent to diagnose and troubleshoot GPU-related issues in a Kubernetes namespace with HAMi deployed.
HAMi may be installed in any namespace (not necessarily kube-system). First, discover where HAMi components are deployed:
A. Find all HAMi pods across all namespaces:
kubectl get pods -A | grep -i hami
B. Identify the HAMi namespace and store it for subsequent commands:
# Find the namespace where HAMi scheduler is running
HAMI_NS=$(kubectl get pods -A -l app=hami-scheduler -o jsonpath='{.items[0].metadata.namespace}' 2>/dev/null)
echo "HAMi is installed in namespace: $HAMI_NS"
C. If HAMi pods are not found by label, search by name:
# Alternative: find by pod name pattern
kubectl get pods -A --no-headers | grep -E "hami-scheduler|hami-device-plugin|hami-vgpu" | awk '{print "Namespace: "$1, "Pod: "$2}'
| Discovery Issue | Root Cause | Remediation |
|---|---|---|
| No HAMi pods found | HAMi not installed or different naming | Check Helm releases: `helm list -A \| grep hami` |
| Multiple namespaces | Multiple HAMi installations | Verify which installation is active via webhook |
| Pods not labeled | Custom Helm values used | Search by pod name pattern instead of labels |
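The label lookup and the name-pattern fallback above can be combined into one discovery helper. A minimal sketch (the pod-name patterns are the ones used in step C; adjust them if your Helm release uses a custom name prefix):

```shell
#!/bin/sh
# Discover the HAMi namespace: try the app=hami-scheduler label first,
# then fall back to matching pod names when labels were customized.
discover_hami_ns() {
  ns=$(kubectl get pods -A -l app=hami-scheduler \
       -o jsonpath='{.items[0].metadata.namespace}' 2>/dev/null)
  if [ -z "$ns" ]; then
    # Column layout of `kubectl get pods -A --no-headers`: NAMESPACE NAME ...
    ns=$(kubectl get pods -A --no-headers 2>/dev/null |
         awk '$2 ~ /hami-scheduler|hami-device-plugin|hami-vgpu/ {print $1; exit}')
  fi
  printf '%s\n' "$ns"
}

HAMI_NS=$(discover_hami_ns)
if [ -n "$HAMI_NS" ]; then
  echo "HAMi is installed in namespace: $HAMI_NS"
else
  echo "HAMi pods not found - check helm releases"
fi
```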
Before diagnosing GPU pod issues, verify that all HAMi components are running properly:
Note: Replace `$HAMI_NS` with the namespace discovered in Step 0, or use the `-A` flag to search all namespaces.
A. Check HAMi Scheduler:
# Using discovered namespace
kubectl get pods -n $HAMI_NS -l app=hami-scheduler
kubectl logs -n $HAMI_NS -l app=hami-scheduler --tail=50
# Alternative: search all namespaces for the pods
kubectl get pods -A -l app=hami-scheduler
# Note: kubectl logs does not accept -A; supply the namespace found above
kubectl logs -n <namespace> -l app=hami-scheduler --tail=50
B. Check HAMi Device Plugin (DaemonSet):
# Using discovered namespace
kubectl get pods -n $HAMI_NS -l app=hami-device-plugin
kubectl logs -n $HAMI_NS -l app=hami-device-plugin --tail=50
# Alternative: search all namespaces
kubectl get pods -A -l app=hami-device-plugin
C. Check HAMi vGPU Monitor:
kubectl get pods -A -l app=hami-vgpu-monitor
D. Verify MutatingWebhookConfiguration:
kubectl get mutatingwebhookconfiguration | grep -i hami
kubectl describe mutatingwebhookconfiguration hami-webhook
| Component Status | Root Cause | Remediation |
|---|---|---|
| Scheduler not running | Deployment issue, image pull failure | Check deployment events, verify image availability |
| Device plugin CrashLoop | NVIDIA driver mismatch, permission issues | Check node NVIDIA driver, verify privileged mode |
| Webhook not registered | Helm install incomplete, cert issues | Re-run helm install, check cert-manager/patch job |
| Monitor not running | Optional component, may not affect core functionality | Check if monitoring is enabled in values.yaml |
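Checks A through C can be run in one pass. A sketch, assuming `$HAMI_NS` was set in Step 0 (the `custom-columns` paths are standard pod status fields):

```shell
#!/bin/sh
# One-pass HAMi component summary; assumes HAMI_NS was discovered in Step 0.
for label in hami-scheduler hami-device-plugin hami-vgpu-monitor; do
  echo "== app=$label =="
  kubectl get pods -n "$HAMI_NS" -l "app=$label" \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount' \
    2>/dev/null
done
kubectl get mutatingwebhookconfiguration 2>/dev/null | grep -i hami \
  || echo "(no HAMi webhook found or kubectl unavailable)"
```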
Check the available GPU resources across the cluster and identify resource constraints:
A. Check node GPU capacity and allocatable resources:
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,GPU_MEM:.status.capacity.nvidia\.com/gpumem,GPU_CORES:.status.capacity.nvidia\.com/gpucores'
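Note that with HAMi the `nvidia.com/gpu` capacity a node reports is typically the virtual slot count (physical GPUs × `deviceSplitCount`), so the raw number can look inflated. A quick sanity check, with illustrative values:

```shell
#!/bin/sh
# Sanity check: expected virtual GPU slots = physical GPUs x deviceSplitCount.
# Values below are illustrative; substitute your node's real numbers.
PHYSICAL_GPUS=8          # e.g. from `nvidia-smi -L | wc -l` on the node
DEVICE_SPLIT_COUNT=10    # nvidia.deviceSplitCount from the HAMi ConfigMap
EXPECTED_SLOTS=$((PHYSICAL_GPUS * DEVICE_SPLIT_COUNT))
echo "expected nvidia.com/gpu capacity: $EXPECTED_SLOTS"   # 80
```

If the reported capacity differs from this product, the device plugin may be running an older ConfigMap; restart the device-plugin DaemonSet after config changes.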
B. Check actual GPU allocation:
kubectl describe nodes | grep -A 20 "Allocated resources"
C. Check HAMi device ConfigMap for scaling settings:
# Using discovered namespace from Step 0
kubectl get configmap -n $HAMI_NS -l app=hami -o yaml
# Alternative: search all namespaces for HAMi configmaps
kubectl get configmap -A | grep -i hami
kubectl get configmap hami-scheduler-device -n $HAMI_NS -o yaml
Key configuration parameters to verify:
- nvidia.deviceMemoryScaling: Memory overcommit ratio (default: 1)
- nvidia.deviceSplitCount: Max tasks per GPU (default: 10)
- nvidia.defaultMem: Default memory allocation in MB (0 = 100%)
- nvidia.defaultCores: Default GPU core percentage (0 = no limit)

| Resource Issue | Root Cause | Remediation |
|---|---|---|
| GPU count shows 0 | Device plugin not running, NVIDIA driver missing | Deploy device plugin, install NVIDIA drivers |
| Low gpumem capacity | deviceMemoryScaling not configured | Set nvidia.deviceMemoryScaling > 1 for overcommit |
| Insufficient GPU slots | deviceSplitCount too low | Increase nvidia.deviceSplitCount in ConfigMap |
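For orientation, the parameters above might appear in the scheduler device ConfigMap roughly as follows. This is a sketch using the parameter names from this document; the exact key names and nesting vary by HAMi chart version, so compare against your actual ConfigMap:

```yaml
# Sketch of the nvidia section of the HAMi device ConfigMap (keys illustrative).
nvidia:
  deviceMemoryScaling: 1.5   # overcommit device memory by 50%
  deviceSplitCount: 10       # up to 10 tasks per physical GPU
  defaultMem: 0              # 0 = allocate 100% of device memory by default
  defaultCores: 0            # 0 = no core limit by default
```

After editing the ConfigMap, restart the device plugin pods so the new values are advertised to the kubelet.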
Identify and diagnose pending GPU pods:
A. Find pending pods requesting GPU resources:
kubectl get pods -n <namespace> --field-selector=status.phase=Pending -o wide
B. For each pending GPU pod, check detailed status:
kubectl describe pod <pod-name> -n <namespace>
C. Check pod GPU resource requests:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}' | jq .
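For reference, a HAMi vGPU request in a container spec typically combines the three resource names checked above. The values here are illustrative; per the defaults listed earlier, `gpumem` is in MB and `gpucores` is a percentage:

```yaml
# Illustrative vGPU request: 1 GPU slot, 4 GiB of device memory, half the cores.
resources:
  limits:
    nvidia.com/gpu: 1        # number of vGPU slots
    nvidia.com/gpumem: 4096  # device memory in MB
    nvidia.com/gpucores: 50  # percentage of GPU cores
```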
Common GPU-related pending reasons:
| Event Message | Root Cause | Remediation |
|---|---|---|
| `Insufficient nvidia.com/gpu` | No GPU slots available | Reduce GPU requests, increase deviceSplitCount, or add GPU nodes |
| `Insufficient nvidia.com/gpumem` | Insufficient GPU memory | Reduce gpumem request, enable deviceMemoryScaling, or use larger GPU |
| `Insufficient nvidia.com/gpucores` | GPU core allocation exhausted | Reduce gpucores request or wait for GPU release |
| `0/N nodes are available: N node(s) didn't match Pod's node affinity/selector` | Node selector mismatch | Check node labels, verify GPU node selectors |
| `0/N nodes are available: N Insufficient nvidia.com/gpu` | All GPUs fully allocated | Scale down other workloads or add GPU capacity |
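The event patterns above can be matched mechanically. A sketch that maps a scheduler event message to a likely root cause (patterns taken from the table; note the more specific `gpumem`/`gpucores` patterns must be tested before the bare `gpu` pattern):

```shell
#!/bin/sh
# Map a scheduler event message to a likely root cause (patterns from the table).
classify_pending_reason() {
  case "$1" in
    *"Insufficient nvidia.com/gpumem"*)   echo "insufficient GPU memory" ;;
    *"Insufficient nvidia.com/gpucores"*) echo "GPU core allocation exhausted" ;;
    *"Insufficient nvidia.com/gpu"*)      echo "no GPU slots available" ;;
    *"didn't match Pod's node affinity/selector"*) echo "node selector mismatch" ;;
    *) echo "unclassified - inspect events manually" ;;
  esac
}

classify_pending_reason "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"
# -> no GPU slots available
```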
D. Check HAMi scheduler logs for allocation failures:
# Using discovered namespace from Step 0
kubectl logs -n $HAMI_NS -l app=hami-scheduler --tail=100 | grep -i "error\|fail\|insufficient"
# Alternative: locate the scheduler pods first (kubectl logs does not accept -A)
kubectl get pods -A -l app=hami-scheduler
kubectl logs -n <namespace> -l app=hami-scheduler --tail=100 | grep -i "error\|fail\|insufficient"
Verify that the HAMi webhook is properly mutating GPU pods:
A. Check if pod was mutated by HAMi webhook:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 "hami.io\|nvidia.com"
B. Verify webhook is targeting the namespace:
kubectl get mutatingwebhookconfiguration hami-webhook -o yaml | grep -A 20 "namespaceSelector\|objectSelector"
C. Check if namespace has required labels (if namespaceSelector is configured):
kubectl get namespace <namespace> --show-labels
D. Verify pod has required labels (if objectSelector is configured):
kubectl get pod <pod-name> -n <namespace> --show-labels
| Webhook Issue | Root Cause | Remediation |
|---|---|---|
| Pod not mutated (no hami annotations) | Namespace not selected by webhook | Add required labels to namespace or update namespaceSelector |
| Pod not mutated (scheduler not set) | objectSelector filtering out pod | Add required labels to pod or update objectSelector |
| Webhook timeout/failure | Scheduler service unavailable | Check hami-scheduler service and endpoints |
| Pod rejected by webhook | failurePolicy=Fail and webhook error | Set failurePolicy=Ignore or fix webhook |
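When mutation succeeded, the pod typically carries HAMi's scheduler name and device-assignment annotations. A rough sketch of what check A should surface; the annotation keys shown are assumptions and their exact names depend on the HAMi version, so treat any `hami.io/*` annotation as a positive signal:

```yaml
# Sketch of a successfully mutated pod (annotation keys illustrative).
apiVersion: v1
kind: Pod
metadata:
  annotations:
    hami.io/vgpu-node: gpu-node-1            # node chosen by the HAMi scheduler
    hami.io/vgpu-devices-allocated: "..."    # device assignment recorded by HAMi
spec:
  schedulerName: hami-scheduler              # set by the mutating webhook
```

A pod with `schedulerName: default-scheduler` and no `hami.io` annotations was not mutated, which points to the namespaceSelector/objectSelector issues in the table above.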
E. Check webhook service availability:
# Using discovered namespace from Step 0
kubectl get svc -n $HAMI_NS | grep -i hami
kubectl get endpoints -n $HAMI_NS | grep -i hami
# Alternative: search all namespaces
kubectl get svc -A | grep -i hami
kubectl get endpoints -A | grep -i hami
F. Test webhook by creating a test pod:
apiVersion: v1