Deploy vLLM to Kubernetes (K8s) with GPU support, health probes, and OpenAI-compatible API endpoint. Use this skill whenever the user wants to deploy, run, or serve vLLM on a Kubernetes cluster, including creating deployments, services, checking existing deployments, or managing vLLM on K8s.
A Claude skill for deploying vLLM to Kubernetes using YAML templates. Deploys a vLLM OpenAI-compatible server as a Kubernetes Deployment with a ClusterIP Service, GPU resources, and health probes.
Prerequisites:

- vllm/vllm-openai:latest image by default (the user can specify a different version)
- kubectl configured with access to a Kubernetes cluster

Before deploying, check whether the hf-token Kubernetes secret exists in the target namespace:
kubectl get secret hf-token -n <namespace>

If it does not exist, create it from the user-provided Hugging Face token:

kubectl create secret generic hf-token --from-literal=HF_TOKEN="<user-provided-token>" -n <namespace>
This is required for gated models (e.g., meta-llama/Meta-Llama-3.1-8B). For public models, the secret is optional but recommended to avoid rate limits.
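For reference, the deployment template presumably injects this secret into the container as an environment variable. A minimal sketch, assuming a standard secretKeyRef layout (the actual template files are authoritative):

```yaml
# Sketch: how templates/vllm-deployment.yaml might consume the hf-token secret
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: HF_TOKEN
        optional: true   # lets public models deploy without the secret
```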
Before applying, check if a vLLM deployment already exists:
kubectl get deployment vllm -n <namespace>
Apply the template YAML files to deploy vLLM:
kubectl apply -f templates/vllm-service.yaml -n <namespace>
kubectl apply -f templates/vllm-deployment.yaml -n <namespace>
Wait for the deployment to roll out:
kubectl rollout status deployment/vllm -n <namespace> --timeout=600s
Verify the pod is running and ready:
kubectl get pods -n <namespace> -l app=vllm
Confirm the pod shows READY 1/1 and STATUS Running. If the pod is not ready yet, wait and check again. If it's in CrashLoopBackOff or Error, check the logs with kubectl logs -n <namespace> -l app=vllm.
Once the pod is ready, print a summary message to the user in this format (replace placeholders with actual values):
🎉 **vLLM Deployment Successful!**
| Resource | Name | Status |
|----------|------|--------|
| Deployment | <deployment-name> | <ready>/<total> Ready |
| Service | <service-name> | ClusterIP:<port> |
| Pod | <pod-name> | Running |
| Image | <image> | |
| Model | <model> | |
**To test the API, run these two commands in your terminal:**
**1. Open a port-forward** (this connects your local port <port> to the vLLM service inside the cluster):
kubectl port-forward svc/vllm-svc <port>:<port> -n <namespace>
**2. In a separate terminal**, send a test request to the OpenAI-compatible API:
curl -s http://localhost:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"<model>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}' | python3 -m json.tool
If everything is working, you'll get a JSON response with the model's reply.
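The same test can be scripted instead of typed as curl. A minimal Python sketch using only the standard library; `build_chat_request` is a hypothetical helper (not part of the skill's templates), and the request assumes the port-forward from step 1 is running on port 8000:

```python
import json
import urllib.request

# Hypothetical helper: build the same chat-completions body as the curl example.
def build_chat_request(model: str, prompt: str, max_tokens: int = 50) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Qwen/Qwen2.5-1.5B-Instruct", "Hello!")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumes the port-forward is active
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the port-forward is running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the official openai client also works if its base URL is pointed at http://localhost:8000/v1.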
The templates use the following defaults:
| Parameter | Default Value |
|---|---|
| Image | vllm/vllm-openai:latest |
| Model | Qwen/Qwen2.5-1.5B-Instruct |
| Port | 8000 |
| Replicas | 1 |
| GPU count | 1 |
| GPU memory utilization | 0.85 |
| Tensor parallel size | 1 |
| CPU request / limit | 12 / 128 |
| Memory request / limit | 100Gi / 400Gi |
| Shared memory (dshm) | 80Gi |
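The resource defaults above would map to Deployment fields roughly as follows. This is a sketch assuming the usual container spec layout; the actual template files are authoritative:

```yaml
# Sketch: resource defaults as they might appear in templates/vllm-deployment.yaml
resources:
  requests:
    cpu: "12"
    memory: 100Gi
    nvidia.com/gpu: "1"
  limits:
    cpu: "128"
    memory: 400Gi
    nvidia.com/gpu: "1"
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 80Gi
```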
When the user requests changes, modify the template YAML files before applying. The following can be customized:
- **Image version**: image: vllm/vllm-openai:<version> in templates/vllm-deployment.yaml (default: latest). Use a specific version tag like v0.17.1 if the user requests it.
- **Model**: the model name passed to the vllm serve command inside the Deployment args.
- **vLLM flags**: extra flags on the vllm serve command in the Deployment args (e.g., --max-model-len 4096, --kv-cache-dtype fp8, --enforce-eager, --generation-config vllm).
- **Replicas**: replicas: in the Deployment spec.
- **GPU count**: nvidia.com/gpu in both requests and limits under resources. Also update the --tensor-parallel-size flag to match the GPU count.
- **CPU / memory**: the cpu and memory values under requests and limits.
- **Port**: containerPort in the Deployment, port/targetPort in the Service, the port in all health probes (liveness, readiness, startup), AND add --port <port> to the vllm serve command in args. All four must match.
- **Namespace**: pass -n <namespace> to every kubectl command.
- **Shared memory**: the sizeLimit of the dshm emptyDir volume.

Edit the template files using the Edit tool, then apply the modified templates.
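As an illustration of the port rule above, here is a hedged sketch of where the four port references might live; the exact field layout depends on the actual template files:

```yaml
# templates/vllm-deployment.yaml (sketch; args layout assumed)
containers:
  - name: vllm
    args: ["--model", "Qwen/Qwen2.5-1.5B-Instruct", "--port", "8000"]  # 1. vllm serve port
    ports:
      - containerPort: 8000        # 2. container port
    readinessProbe:
      httpGet:
        path: /health
        port: 8000                 # 3. probe port (repeat for liveness and startup probes)
---
# templates/vllm-service.yaml (sketch)
spec:
  ports:
    - port: 8000                   # 4. Service port
      targetPort: 8000
```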
To check the status of an existing vLLM deployment:

kubectl get deployment,svc,pods -n <namespace> -l app=vllm
When the user asks to clean up or delete the vLLM deployment, run the following steps:
kubectl delete -f templates/vllm-deployment.yaml -n <namespace>
kubectl delete -f templates/vllm-service.yaml -n <namespace>
If the user also wants the Hugging Face token secret removed (ask first, since other deployments may reuse it):

kubectl delete secret hf-token -n <namespace>
Verify all resources are removed, then report the result to the user:

kubectl get deployment,svc,pods -n <namespace> -l app=vllm
vLLM deployment has been cleaned up from namespace <namespace>.
Deleted: Deployment/vllm, Service/vllm-svc
HF token secret: <kept/deleted>
Troubleshooting:

- **Pod stuck in Pending**: run kubectl describe pod <pod-name> for scheduling errors. Ensure the NVIDIA GPU Operator or device plugin is installed.
- **OOMKilled**: increase the memory limits in the Deployment, or use a smaller model.
- **CrashLoopBackOff or startup probe failures**: check kubectl logs <pod-name>. Ensure the hf-token secret exists for gated models. Increase failureThreshold on the startup probe if needed.
- **Model download errors (401/403)**: verify the secret with kubectl get secret hf-token -n <namespace>. Check the token is valid.
- **GPU not detected**: verify the nvidia.com/gpu resource is requested and the NVIDIA device plugin is running on the node.