Help users deploy LLMs to Kubernetes using the KAITO kubectl plugin. Use this skill whenever the user mentions deploying an LLM, AI model, or language model to Kubernetes, or asks about KAITO, kaito workspaces, GPU inference on k8s, or wants to run models like Llama, Phi, Mistral, DeepSeek, Falcon, Qwen, or Gemma on a Kubernetes cluster. Also trigger when the user mentions "kubectl kaito", model serving on k8s, or wants to set up an inference endpoint in Kubernetes — even if they don't say "KAITO" explicitly.
Help users deploy LLMs to Kubernetes using the kubectl kaito plugin. The goal is to ask the right questions, recommend a model and configuration, and produce a ready-to-run kubectl kaito deploy command.
KAITO (Kubernetes AI Toolchain Operator) automates AI model inference on Kubernetes. The kubectl kaito plugin simplifies this by turning a few flags into a complete GPU-provisioned deployment. Users don't need to write YAML — the plugin generates the Workspace custom resource automatically.
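For context, the generated resource is roughly of this shape — a sketch only; the API version, field names, and the workspace/model names here are illustrative and vary by KAITO release:

```yaml
# Rough shape of the Workspace the plugin generates (not hand-written).
# apiVersion, fields, and names are illustrative -- check your KAITO release.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4
resource:
  instanceType: Standard_NC24ads_A100_v4
inference:
  preset:
    name: phi-4
```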
When a user wants to deploy a model, walk through these steps:
Before anything else, verify the plugin is available:
kubectl kaito --help
If the command fails (not found), stop and ask the user to install it first:
# Via Krew (recommended)
kubectl krew install kaito
# Or download from GitHub releases
# https://github.com/kaito-project/kaito-kubectl-plugin/releases
Do not proceed with deploy commands until the plugin is confirmed installed.
Ask (or infer from context) these key things:
- Use case — if the user only describes a task, suggest a fitting model (e.g., `qwen2.5-coder-7b` or `phi-4-mini`).
- Model — KAITO supports preset models (e.g., `phi-4`, `llama-3.1-8b-instruct`) and any HuggingFace model via its model card ID (e.g., `Qwen/Qwen3-4B-Instruct-2507`).

If the user already specified a model and enough context, skip the questions and generate the command directly.
Use the kubectl kaito models command to list and inspect available preset models:
# List all supported preset models
kubectl kaito models list
# List with detailed info (instance types, GPU memory, node counts)
kubectl kaito models list --detailed
# Get full details for a specific model
kubectl kaito models describe <model-name>
# JSON output for parsing
kubectl kaito models list --output json
Use this to:
If the user asks for a model that doesn't exist in the KAITO preset list, search for it on HuggingFace Hub using the API:
curl -s "https://huggingface.co/api/models?search=<query>&filter=text-generation&sort=downloads&direction=-1&limit=5"
This returns JSON with matching models. Key fields to use:
- `id` — the full model ID to pass to `--model` (e.g., `Qwen/Qwen3-4B-Instruct-2507`)
- `downloads` — popularity indicator
- `tags` — check for relevant tags like `text-generation`, `conversational`
- `pipeline_tag` — the model's task type
- `siblings` — file list (look for `config.json` to estimate size)

Workflow when model is not a preset:
- Confirm the exact model ID with the user (`org/model-name`)
- Pass it to `--model`, adding `--model-access-secret` if the model is gated/private

To check model details (size, config, gating):
curl -s "https://huggingface.co/api/models/<org>/<model-name>"
Look at:
- `gated` field — if truthy, the model requires a HuggingFace token (needs `--model-access-secret`)
- `safetensors.total` — total parameter count, useful for GPU sizing
- `cardData.license` — license info to mention to the user

GPU sizing from parameter count:
- ≤3B params → `Standard_NC6s_v3` (1× V100 16GB)
- ~7B params → `Standard_NC6s_v3` (16GB) or `Standard_NC24ads_A100_v4` (80GB) for faster inference
- ~13B params → `Standard_NC24ads_A100_v4` (1× A100 80GB)
- 30B–40B params → `Standard_NC48ads_A100_v4` (2× A100 160GB)
- 70B+ params → `Standard_NC96ads_A100_v4` (4× A100 320GB)

Build the kubectl kaito deploy command with the appropriate flags.
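The parameter-count heuristic above can be sketched as a small shell helper — hypothetical, not part of the plugin; the thresholds mirror the sizing guidance and ignore the "faster inference" alternative for ~7B models:

```shell
# Hypothetical helper: map a parameter count (in billions) to the
# Azure GPU SKU suggested by the sizing guidance above.
suggest_sku() {
  if [ "$1" -le 7 ]; then
    echo "Standard_NC6s_v3"
  elif [ "$1" -le 14 ]; then
    echo "Standard_NC24ads_A100_v4"
  elif [ "$1" -le 40 ]; then
    echo "Standard_NC48ads_A100_v4"
  else
    echo "Standard_NC96ads_A100_v4"
  fi
}

suggest_sku 7    # Standard_NC6s_v3
suggest_sku 70   # Standard_NC96ads_A100_v4
```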
Minimal command (preset model):
kubectl kaito deploy \
--workspace-name <name> \
--model <model-name> \
--instance-type <gpu-sku>
HuggingFace model (needs access secret):
kubectl kaito deploy \
--workspace-name <name> \
--model <org/model-name> \
--instance-type <gpu-sku> \
--model-access-secret <secret-name>
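The secret referenced by `--model-access-secret` must exist before deploying. A sketch of creating it — the secret name `hf-token` and the key name `HF_TOKEN` are assumptions; confirm the expected format in the KAITO docs:

```shell
# Hypothetical secret holding a HuggingFace token; the key name HF_TOKEN
# is an assumption -- confirm the expected key in the KAITO docs.
kubectl create secret generic hf-token \
  --from-literal=HF_TOKEN=<your-huggingface-token>
```

Then pass `--model-access-secret hf-token` to the deploy command.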
Full flag reference:
| Flag | Purpose | When to use |
|---|---|---|
| `--workspace-name` | Name for the Workspace resource | Always (required) |
| `--model` | Model name or HuggingFace ID | Always (required) |
| `--instance-type` | GPU VM SKU (e.g., `Standard_NC24ads_A100_v4`) | When auto-provisioning is on |
| `--count` | Number of GPU nodes (default: 1) | Large models needing multi-node |
| `--model-access-secret` | K8s secret with HuggingFace token | Private/gated HuggingFace models |
| `--inference-config` | ConfigMap name or path to YAML config | Custom vLLM/runtime params |
| `--adapters` | LoRA adapters to load | When using fine-tuned adapters |
| `--enable-load-balancer` | Create external LoadBalancer | When external access is needed |
| `--dry-run` | Show config without deploying | When user wants to preview |
| `--namespace` | Target namespace | When not using default namespace |
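Putting several of these flags together, a preview invocation might look like this — the workspace name, model, and SKU are examples:

```shell
# Preview the generated Workspace without applying it (example values)
kubectl kaito deploy \
  --workspace-name llama-demo \
  --model llama-3.1-8b-instruct \
  --instance-type Standard_NC24ads_A100_v4 \
  --enable-load-balancer \
  --dry-run
```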
After giving the command, briefly tell the user:
- The command creates a `Workspace` custom resource
- Watch provisioning: `kubectl kaito status --workspace-name <name> --watch`
- Get the inference endpoint: `kubectl kaito get-endpoint --workspace-name <name>`
- Try it out: `kubectl kaito chat --workspace-name <name>`

These are typical Azure GPU SKUs — adjust if the user is on a different cloud:
| SKU | GPUs | GPU Memory | Good for |
|---|---|---|---|
| `Standard_NC6s_v3` | 1× V100 | 16 GB | Small models (≤7B params) |
| `Standard_NC24ads_A100_v4` | 1× A100 | 80 GB | Medium models (7B–14B) |
| `Standard_NC48ads_A100_v4` | 2× A100 | 160 GB | Large models (30B–40B) |
| `Standard_NC96ads_A100_v4` | 4× A100 | 320 GB | Very large models (70B+) |
If the user needs custom inference parameters (e.g., max sequence length, GPU memory utilization), help them create a config YAML:
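A minimal sketch of such a config, assuming KAITO's vLLM runtime format where keys follow vLLM's CLI flag names — the ConfigMap name and values here are illustrative, so verify against the KAITO documentation:

```yaml
# Illustrative inference config for --inference-config.
# Key names follow vLLM CLI flags; verify against the KAITO docs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-inference-params   # hypothetical name
data:
  inference_config.yaml: |
    vllm:
      max-model-len: 8192            # cap the context window
      gpu-memory-utilization: 0.95   # fraction of GPU memory vLLM may use
```

Pass it with `--inference-config my-inference-params` on the deploy command.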