AI Agent stack: Hermes Agent (self-improving AI assistant) routed through in-cluster LiteLLM proxy with OpenRouter fallback chains (free→free2→cheap). Built for ARM64 (Raspberry Pi CM4) using in-cluster kaniko build. Includes Docker registry:2 for storing custom ARM64 images.
| Component | Image | Version | Namespace | Notes |
|---|---|---|---|---|
| Docker registry | registry:2 | 2 | registry | ARM64-compatible image storage (5Gi PVC) |
| LiteLLM proxy | ghcr.io/berriai/litellm | main-latest | ai | In-cluster OpenRouter router with fallbacks |
| Hermes Agent | registry.registry:5000/ai/hermes-agent | 0.7.0 | ai | Gateway mode + Telegram polling + MCP sidecar |
| kubernetes-mcp-server | registry.registry:5000/ai/kubernetes-mcp-server | v0.0.60 | ai (sidecar) | K8s read-only MCP server sidecar in Hermes pod |
| HolmesGPT | robusta/holmes (Helm) | 0.24.0 | ai | SRE assistant — K8s + logs + Prometheus toolsets |
| Holmes UI | nginx:alpine | — | ai | Chat UI at holmes-ui.cluster.home (no kaniko, ConfigMap) |
| Kaniko | gcr.io/kaniko-project/executor | latest | kaniko | In-cluster ARM64 image builder |
```
┌──────────────────────────────────────────────────────────────┐
│ Namespace: ai                                                │
│                                                              │
│  ┌──────────────────┐       ┌────────────────────────────┐   │
│  │ Hermes Agent     │──────▶│ LiteLLM Proxy              │   │
│  │ model=free       │       │ port 4000                  │   │
│  │ OPENAI_API_BASE  │       │ fallback: free→free2→cheap │   │
│  └──────────────────┘       └─────────────┬──────────────┘   │
│                                           │ HTTPS:443        │
│                                           ▼                  │
│                                    OpenRouter API            │
│                                    (external)                │
│                                                              │
│  PVC: hermes-data (/opt/data)                                │
│  Secret: litellm-secrets (OPENROUTER_API_KEY)                │
│  Secret: hermes-secrets (OPENROUTER_API_KEY + bot tokens)    │
└──────────────────────────────────────────────────────────────┘
```
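The Secrets named in the diagram are ordinary Opaque Secrets. A minimal sketch of `litellm-secrets` (the key name comes from the diagram; the value is a placeholder, never committed):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: ai
type: Opaque
stringData:
  OPENROUTER_API_KEY: "sk-or-v1-..."   # placeholder, real key kept out of git
```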
```
Namespace: registry
  registry:2 pod  ←  kaniko pushes here (Kaniko job, namespace: kaniko)
  registries.yaml on K3s nodes → mirror registry.registry:5000 → ClusterIP
```
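The node-side mirror entry might look like this (a sketch; whether the endpoint is the ClusterIP or a name resolvable from the node depends on the cluster setup):

```yaml
# /etc/rancher/k3s/registries.yaml (on each K3s node)
mirrors:
  "registry.registry:5000":
    endpoint:
      - "http://registry.registry:5000"   # or http://<registry ClusterIP>:5000
```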
Hermes Agent official Docker image (nousresearch/hermes-agent) is amd64-only.
For ARM64 clusters (Raspberry Pi CM4), we build in-cluster using kaniko.
Build process:
- Source: https://github.com/NousResearch/hermes-agent
- Built by kaniko with `--snapshot-mode=redo` (for low memory)
- Pushed to `registry.registry:5000/ai/hermes-agent:0.7.0`
- Build time: ~60 min on Raspberry Pi CM4 (heavy: debian + nodejs + pip deps + ffmpeg)
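Put together, the build Job might look roughly like this (a sketch: the git context flag, `--insecure`, and the overall spec are assumptions, while the job name, destination tag, `backoffLimit`, and snapshot mode come from this setup):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: build-hermes-arm64
  namespace: kaniko
spec:
  backoffLimit: 3              # earlier builds OOMed on the agent node
  activeDeadlineSeconds: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kaniko
          image: gcr.io/kaniko-project/executor:latest
          args:
            - --context=git://github.com/NousResearch/hermes-agent
            - --destination=registry.registry:5000/ai/hermes-agent:0.7.0
            - --snapshot-mode=redo   # mtime-based snapshots, far less memory
            - --insecure             # plain-HTTP in-cluster registry (assumed)
```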
Kaniko gotchas:
- `--snapshot-mode=redo` — uses mtime for change detection (much less memory than the default)
- `backoffLimit: 3` — OOM on the agent node caused earlier failures
- Job timeout: 3600s

Hermes does NOT call OpenRouter directly. It calls the in-cluster LiteLLM proxy:
```
OPENAI_API_BASE=http://litellm-proxy.ai.svc.cluster.local:4000
OPENAI_API_KEY=sk-hermes-internal   # LiteLLM master key
HERMES_MODEL=free
```
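With those variables set, Hermes issues standard OpenAI-style chat requests to the proxy. A minimal, offline illustration of the request shape (the endpoint, key, and model are the documented values; `build_chat_request` is a hypothetical helper, not Hermes code):

```python
# Values as documented above (in-cluster endpoint and LiteLLM master key).
OPENAI_API_BASE = "http://litellm-proxy.ai.svc.cluster.local:4000"
OPENAI_API_KEY = "sk-hermes-internal"
HERMES_MODEL = "free"

def build_chat_request(prompt: str):
    """Build URL, headers, and JSON body for an OpenAI-compatible
    /v1/chat/completions call against the LiteLLM proxy."""
    url = f"{OPENAI_API_BASE}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
    }
    body = {
        "model": HERMES_MODEL,  # "free" -> LiteLLM applies the fallback chain
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, body

url, headers, body = build_chat_request("hello")
```

Because the proxy speaks the OpenAI API, any OpenAI-compatible client works unchanged once `OPENAI_API_BASE` points at it.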
LiteLLM config (roles/install-litellm-proxy/tasks/main.yml):
| Virtual model | Real model | Provider |
|---|---|---|
| free | openrouter/qwen/qwen3-coder:free | coding-first free tier |
| free2 | openrouter/google/gemini-2.0-flash-exp:free | Google free fallback |
| cheap | openrouter/qwen/qwen-turbo | reliable paid fallback |
| strong | openrouter/deepseek/deepseek-chat-v3-0324 | best balance for hard tasks |
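The corresponding LiteLLM proxy config might look roughly like this (a sketch of the `model_list`/`fallbacks` shape LiteLLM uses; the actual templated config lives in the role):

```yaml
model_list:
  - model_name: free
    litellm_params:
      model: openrouter/qwen/qwen3-coder:free
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: free2
    litellm_params:
      model: openrouter/google/gemini-2.0-flash-exp:free
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: cheap
    litellm_params:
      model: openrouter/qwen/qwen-turbo
      api_key: os.environ/OPENROUTER_API_KEY

litellm_settings:
  success_callback: ["prometheus"]
  fallbacks:
    - {"free": ["free2", "cheap"]}
```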
MCP and gateway gotchas:
- `url: http://127.0.0.1:8080/mcp` — `type: sse` alone was not enough in practice; the HTTP endpoint had to be explicit.
- `kubectl`/`oc` inside the Hermes container were not required for MCP to work once the client connected correctly.
- The working combination: kubernetes-mcp-server sidecar + `/opt/data` + `serviceAccountName` + `mcp_servers.kubernetes.url` ending in `/mcp`.
- `pods_top` / `nodes_top` need metrics-server or a custom bridge; Prometheus alone is not enough.
- Both `TELEGRAM_ALLOWED_USERS` and the gateway platform `allowed_users` list gate access. Keep them aligned to a single user ID when you want a private bot.

Fallback chain: free → free2 → cheap (automatic, transparent to Hermes).
Use cheap or strong directly when you want to skip free tiers.
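The fallback behaviour can be sketched as follows (an illustrative simulation, not LiteLLM's actual code; `call_with_fallbacks` and `fake_call` are hypothetical names):

```python
def call_with_fallbacks(call, chain=("free", "free2", "cheap")):
    """Try each virtual model in order; return (model, response) for the
    first one that succeeds, mirroring the free -> free2 -> cheap chain."""
    last_err = None
    for model in chain:
        try:
            return model, call(model)
        except RuntimeError as err:  # e.g. a rate-limited free tier
            last_err = err
    raise last_err

# Simulated provider: the free tier is rate-limited, free2 answers.
def fake_call(model):
    if model == "free":
        raise RuntimeError("429 rate limited")
    return f"response from {model}"

model, resp = call_with_fallbacks(fake_call)
# model is "free2": the failing free tier was skipped transparently
```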
LiteLLM exposes AI traffic and billing metrics to Prometheus when `success_callback: ["prometheus"]` is set.
The install-litellm-proxy role automates observability:
- A ServiceMonitor selecting the label `app: litellm-proxy` (the LiteLLM Service must carry this label in its metadata, or Prometheus will silently ignore it).
- A Grafana dashboard ConfigMap labeled `grafana_dashboard: "1"`, deployed into the monitoring namespace; the Grafana sidecar provisions it automatically.
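A minimal sketch of that ServiceMonitor (the endpoint port name and metrics path are assumptions; the label selector is the part that must match the Service):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: litellm-proxy
  namespace: ai
spec:
  selector:
    matchLabels:
      app: litellm-proxy   # must match the Service's metadata labels
  endpoints:
    - port: http           # assumed port name
      path: /metrics
```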
Key metrics tracked:
- `litellm_requests_metric_total`: total successful and failed API requests.
- `litellm_request_total_latency_seconds_bucket`: request latency histograms.
- `litellm_tokens_metric_total`: prompt and completion tokens processed.
- `litellm_spend_metric_total`: estimated USD cost of inference based on model pricing.
- `litellm_deployment_successful_fallbacks_total`: counter for successful fallback-chain triggers.

Build the images:

```
make ai-registry
make ai-hermes-build            # hermes-agent (~60 min on CM4)
make ai-kubernetes-mcp-build    # kubernetes-mcp-server sidecar (~1 min)

# Monitor with:
kubectl get jobs -n kaniko
kubectl logs -n kaniko job/build-hermes-arm64 -f | grep -v "npm WARN"
kubectl logs -n kaniko job/build-kubernetes-mcp-server-arm64 -f
```
Both images are required before Hermes deploy — hermes-agent-mcp pod has 2 containers.
```
make ai-hermes-deploy
make ai    # registry + hermes-build + kubernetes-mcp-build + hermes-deploy (~70 min total)
```
Create roles/install-litellm-proxy/defaults/secrets.yml (gitignored):
```yaml
hermes_openrouter_api_key: "sk-or-v1-..."
hermes_telegram_token: ""   # optional
hermes_discord_token: ""    # optional
```
LiteLLM proxy loads this same file automatically.
Edit roles/install-hermes-agent/defaults/main.yml:
```yaml
hermes_model: "free"    # default — uses the LiteLLM fallback chain
hermes_model: "cheap"   # skip free tiers entirely
```
| Component | CPU req | CPU limit | Mem req | Mem limit |
|---|---|---|---|---|
| LiteLLM proxy | 100m | 500m | 128Mi | 512Mi |
| Hermes Agent | 100m | 500m | 128Mi | 512Mi |
# Hermes web UI (requires ingress stack)