Deploy vLLM to Kubernetes (K8s) with GPU support, health probes, and OpenAI-compatible API endpoint. Use this skill whenever the user wants to deploy, run, or serve vLLM on a Kubernetes cluster, including creating deployments, services, checking existing deployments, or managing vLLM on K8s.
A Claude skill for deploying vLLM to Kubernetes using YAML templates. Deploys a vLLM OpenAI-compatible server as a Kubernetes Deployment with a ClusterIP Service, GPU resources, and health probes.
Prerequisites:

- vllm/vllm-openai:latest image by default (the user can specify a different version)
- kubectl configured with access to a Kubernetes cluster

Before deploying, check whether the hf-token Kubernetes secret exists in the target namespace:
kubectl get secret hf-token -n <namespace>

If it does not exist, create it from the user-provided Hugging Face token:

kubectl create secret generic hf-token --from-literal=HF_TOKEN="<user-provided-token>" -n <namespace>
This is required for gated models (e.g., meta-llama/Meta-Llama-3.1-8B). For public models, the secret is optional but recommended to avoid rate limits.
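For reference, the deployment template presumably injects this secret into the container as an environment variable. A minimal sketch, assuming a standard secretKeyRef layout (the actual template files are authoritative):

```yaml
# Sketch: how templates/vllm-deployment.yaml might consume the hf-token secret
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: HF_TOKEN
        optional: true   # lets public models deploy without the secret
```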
Before applying, check if a vLLM deployment already exists:
kubectl get deployment vllm -n <namespace>
Apply the template YAML files to deploy vLLM:
kubectl apply -f templates/vllm-service.yaml -n <namespace>
kubectl apply -f templates/vllm-deployment.yaml -n <namespace>
Wait for the deployment to roll out:
kubectl rollout status deployment/vllm -n <namespace> --timeout=600s
Verify the pod is running and ready:
kubectl get pods -n <namespace> -l app=vllm
Confirm the pod shows READY 1/1 and STATUS Running. If the pod is not ready yet, wait and check again. If it's in CrashLoopBackOff or Error, check the logs with kubectl logs -n <namespace> -l app=vllm.
Once the pod is ready, print a summary message to the user in this format (replace placeholders with actual values):
🎉 **vLLM Deployment Successful!**
| Resource | Name | Status |
|----------|------|--------|
| Deployment | <deployment-name> | <ready>/<total> Ready |
| Service | <service-name> | ClusterIP:<port> |
| Pod | <pod-name> | Running |
| Image | <image> | |
| Model | <model> | |
**To test the API, run these two commands in your terminal:**
**1. Open a port-forward** (this connects your local port <port> to the vLLM service inside the cluster):
kubectl port-forward svc/vllm-svc <port>:<port> -n <namespace>
**2. In a separate terminal**, send a test request to the OpenAI-compatible API:
curl -s http://localhost:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"<model>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}' | python3 -m json.tool
If everything is working, you'll get a JSON response with the model's reply.
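The same test can be scripted instead of typed as curl. A minimal Python sketch using only the standard library; `build_chat_request` is a hypothetical helper (not part of the skill's templates), and the request assumes the port-forward from step 1 is running on port 8000:

```python
import json
import urllib.request

# Hypothetical helper: build the same chat-completions body as the curl example.
def build_chat_request(model: str, prompt: str, max_tokens: int = 50) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Qwen/Qwen2.5-1.5B-Instruct", "Hello!")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumes the port-forward is active
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the port-forward is running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the official openai client also works if its base URL is pointed at http://localhost:8000/v1.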
The templates use the following defaults:
| Parameter | Default Value |
|---|---|
| Image | vllm/vllm-openai:latest |
| Model | Qwen/Qwen2.5-1.5B-Instruct |
| Port | 8000 |
| Replicas | 1 |
| GPU count | 1 |
| GPU memory utilization | 0.85 |
| Tensor parallel size | 1 |
| CPU request / limit | 12 / 128 |
| Memory request / limit | 100Gi / 400Gi |
| Shared memory (dshm) | 80Gi |
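The resource defaults above would map to Deployment fields roughly as follows. This is a sketch assuming the usual container spec layout; the actual template files are authoritative:

```yaml
# Sketch: resource defaults as they might appear in templates/vllm-deployment.yaml
resources:
  requests:
    cpu: "12"
    memory: 100Gi
    nvidia.com/gpu: "1"
  limits:
    cpu: "128"
    memory: 400Gi
    nvidia.com/gpu: "1"
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 80Gi
```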
When the user requests changes, modify the template YAML files before applying. The following can be customized:
- **Image version**: image: vllm/vllm-openai:<version> in templates/vllm-deployment.yaml (default: latest). Use a specific version tag like v0.17.1 if the user requests it.
- **Model**: the model name passed to the vllm serve command inside the Deployment args.
- **vLLM flags**: extra flags on the vllm serve command in the Deployment args (e.g., --max-model-len 4096, --kv-cache-dtype fp8, --enforce-eager, --generation-config vllm).
- **Replicas**: replicas: in the Deployment spec.
- **GPU count**: nvidia.com/gpu in both requests and limits under resources. Also update the --tensor-parallel-size flag to match the GPU count.
- **CPU / memory**: the cpu and memory values under requests and limits.
- **Port**: containerPort in the Deployment, port/targetPort in the Service, the port in all health probes (liveness, readiness, startup), AND add --port <port> to the vllm serve command in args. All four must match.
- **Namespace**: pass -n <namespace> to every kubectl command.
- **Shared memory**: the sizeLimit of the dshm emptyDir volume.

Edit the template files using the Edit tool, then apply the modified templates.
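As an illustration of the port rule above, here is a hedged sketch of where the four port references might live; the exact field layout depends on the actual template files:

```yaml
# templates/vllm-deployment.yaml (sketch; args layout assumed)
containers:
  - name: vllm
    args: ["--model", "Qwen/Qwen2.5-1.5B-Instruct", "--port", "8000"]  # 1. vllm serve port
    ports:
      - containerPort: 8000        # 2. container port
    readinessProbe:
      httpGet:
        path: /health
        port: 8000                 # 3. probe port (repeat for liveness and startup probes)
---
# templates/vllm-service.yaml (sketch)
spec:
  ports:
    - port: 8000                   # 4. Service port
      targetPort: 8000
```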
To check the status of an existing vLLM deployment:

kubectl get deployment,svc,pods -n <namespace> -l app=vllm
When the user asks to clean up or delete the vLLM deployment, run the following steps:
kubectl delete -f templates/vllm-deployment.yaml -n <namespace>
kubectl delete -f templates/vllm-service.yaml -n <namespace>
If the user also wants the Hugging Face token secret removed (ask first, since other deployments may reuse it):

kubectl delete secret hf-token -n <namespace>
Verify all resources are removed, then report the result to the user:

kubectl get deployment,svc,pods -n <namespace> -l app=vllm
vLLM deployment has been cleaned up from namespace <namespace>.
Deleted: Deployment/vllm, Service/vllm-svc
HF token secret: <kept/deleted>
Troubleshooting:

- **Pod stuck in Pending**: run kubectl describe pod <pod-name> for scheduling errors. Ensure the NVIDIA GPU Operator or device plugin is installed.
- **OOMKilled**: increase the memory limits in the Deployment, or use a smaller model.
- **CrashLoopBackOff or startup probe failures**: check kubectl logs <pod-name>. Ensure the hf-token secret exists for gated models. Increase failureThreshold on the startup probe if needed.
- **Model download errors (401/403)**: verify the secret with kubectl get secret hf-token -n <namespace>. Check the token is valid.
- **GPU not detected**: verify the nvidia.com/gpu resource is requested and the NVIDIA device plugin is running on the node.