Monitoring, logging, and alerting for Carcará — Prometheus, Grafana, Loki, Promtail, DCGM Exporter, and Node Exporter configuration and dashboard management
This skill governs the full observability stack: metrics collection, log aggregation, dashboards, and alerting — while strictly maintaining the privacy guarantee that no conversation content ever appears in any metric, log, or alert.
| Component | Port | Purpose |
|---|---|---|
| Prometheus | :9090 | Metrics collection and storage |
| Grafana | :3000 | Dashboards and alerting |
| Loki | :3100 | Log aggregation |
| Promtail | — | Log shipping agent (runs on GPU nodes) |
| DCGM Exporter | :9400 | NVIDIA GPU metrics |
| Node Exporter | :9100 | System metrics (CPU, RAM, disk, network) |
Defined in configs/base/prometheus.yaml:
```yaml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['gpu-node1:8000', 'gpu-node2:8000']  # auto-updated by deploy-fleet.sh
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm:4000']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['gpu-node1:9100', 'gpu-node2:9100']
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['gpu-node1:9400', 'gpu-node2:9400']
```
Rule: When adding new GPU nodes via scripts/add-node.sh, Prometheus targets are auto-updated.
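The auto-update can be sketched as pure data manipulation. This is a hypothetical illustration of what `scripts/add-node.sh` accomplishes — the function name `add_target` and the dict shapes are assumptions; the real script rewrites `configs/base/prometheus.yaml` on disk:

```python
# Hypothetical sketch of the target-update logic behind add-node.sh.
# The real script edits prometheus.yaml in place; here the scrape
# config is modeled as the parsed YAML structure (a list of dicts).

def add_target(scrape_configs, job_name, target):
    """Append `target` to the job's first static_configs entry, skipping duplicates."""
    for job in scrape_configs:
        if job["job_name"] == job_name:
            targets = job["static_configs"][0]["targets"]
            if target not in targets:
                targets.append(target)
            return scrape_configs
    # Job not present yet: create it with the new target.
    scrape_configs.append(
        {"job_name": job_name, "static_configs": [{"targets": [target]}]}
    )
    return scrape_configs

configs = [
    {"job_name": "vllm", "static_configs": [{"targets": ["gpu-node1:8000"]}]},
]
add_target(configs, "vllm", "gpu-node2:8000")       # extend an existing job
add_target(configs, "dcgm-exporter", "gpu-node2:9400")  # create a new job
```

Calling `add_target` twice with the same target is a no-op, which mirrors the idempotent behavior an automated node-registration script needs.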
Three core dashboards in configs/base/grafana/dashboards/:
- vllm.json
- litellm.json
- system.json

Alert rules and their responses:

| Alert | Condition | Severity | Action |
|---|---|---|---|
| vLLM Down | up{job="vllm"} == 0 for 2m | Critical | Check Slurm job, resubmit |
| GPU Memory Critical | DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95 | Warning | Reduce max_model_len or add GPUs |
| GPU Temperature High | DCGM_FI_DEV_GPU_TEMP > 80 | Warning | Check cooling infrastructure |
| Error Rate High | rate(litellm_errors[5m]) > 0.05 | Warning | Check model health, review logs |
| Queue Saturation | vllm:num_requests_waiting > 10 for 5m | Warning | Add replicas or scale nodes |
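As a sketch, the first table row maps onto a Prometheus alerting rule roughly like this — the group name, labels, and annotation wording are assumptions; only the expression, duration, and severity come from the table:

```yaml
groups:
  - name: carcara-alerts   # illustrative group name
    rules:
      - alert: VLLMDown
        expr: up{job="vllm"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "vLLM target {{ $labels.instance }} is down"
          description: "Check the Slurm job and resubmit if needed."
```

The remaining rows follow the same pattern: the Condition column becomes `expr` (plus `for` where a duration is given), and the Action column belongs in the rule's annotations so it surfaces in the notification.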
Promtail ships logs from GPU nodes to Loki on the management node:
# configs/base/promtail-config.yaml