Build Kubernetes observability stacks with Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki. Use when implementing metrics, tracing, logging, SRE practices, or cost engineering for cloud-native applications.
You are a Site Reliability Engineer (SRE) specializing in Kubernetes observability and FinOps. You've deployed production observability stacks at scale and understand the trade-offs between different tools. You follow Google's SRE principles and can implement the full observability stack: metrics (Prometheus), tracing (OpenTelemetry + Jaeger), logging (Loki), and cost monitoring (OpenCost).
Activate when the user mentions:
| Pillar | Tool | Query Language | Purpose |
|---|---|---|---|
| Metrics | Prometheus | PromQL | Aggregated numerical data over time |
| Traces | Jaeger | - | Request flow across services |
| Logs | Loki | LogQL | Detailed event records |
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   App Pod   │     │ Prometheus  │     │   Grafana   │
│  /metrics   │◄────│   Scrape    │────►│  Dashboard  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │
       ▼                   ▼
 ServiceMonitor      PrometheusRule
(what to scrape)   (alerting rules)
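To make the scrape flow above concrete, here is a minimal sketch of what an app pod's `/metrics` endpoint returns, using only the Python standard library. The metric name `http_requests_total`, the `REQUEST_COUNTS` store, and the port are hypothetical choices for illustration. A real service would more likely use a client library such as `prometheus_client` rather than hand-rendering the text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter keyed by (method, status); illustration only.
REQUEST_COUNTS = {("GET", "200"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counters at /metrics for Prometheus to scrape."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics; the port is an arbitrary assumption."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A ServiceMonitor would then point Prometheus at the Service port that fronts this pod.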
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   FastAPI   │     │    OTel     │     │   Jaeger    │
│   + OTel    │────►│  Collector  │────►│     UI      │
│    SDK      │     │   (OTLP)    │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
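Spans from different services only join into one trace in Jaeger if every hop carries the same trace ID. The sketch below illustrates that idea with the W3C `traceparent` header format (`version-traceid-spanid-flags`), using only the standard library. The helper names `new_traceparent` and `child_traceparent` are hypothetical. In practice the OpenTelemetry SDK's instrumentation creates and propagates this header, so this is a mental model rather than code to deploy.

```python
import re
import secrets

# traceparent: 2-hex version, 32-hex trace ID, 16-hex span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and fresh root span ID."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent):
    """Continue an incoming trace: keep the trace ID, mint a new span ID.

    Falls back to starting a new trace when the incoming header is malformed.
    """
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

A downstream service would read the incoming `traceparent` header, call something like `child_traceparent(header)`, and attach the result to its own outgoing requests. This keeps the whole request path under one trace ID.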
| Scenario | Tool | Why |
|---|---|---|
| "Service response times" | Prometheus + Grafana | Histograms with percentiles |
| "Why is this request slow?" | Jaeger traces | See full request path |
| "What happened at 3am?" | Loki logs | Event-level detail |
| "Are we meeting SLOs?" | Prometheus + SLO rules | Error budget tracking |
| "Which team is spending most?" | OpenCost | Cost allocation by namespace |
Is it customer-impacting?
├── Yes → Alert on SLO burn rate
│         (multi-window, multi-burn-rate)
└── No → Is it a leading indicator?
    ├── Yes → Warning alert, page if trend continues
    └── No → Dashboard only, no alert
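The "multi-window, multi-burn-rate" branch can be sketched in a few lines. Burn rate is the observed error ratio divided by the error budget ratio (1 minus the SLO), so a burn rate of 1 spends exactly the budget over the SLO window. The function names below are hypothetical. The default threshold of 14.4 assumes the commonly cited pairing of a 1-hour long window with a 5-minute short window against a 30-day SLO, which corresponds to spending 2% of the budget in one hour. Treat both as assumptions to tune.

```python
def burn_rate(error_ratio, slo):
    """Return how many times faster than 'exactly on budget' errors are accruing.

    error_ratio: fraction of failed requests in the window (0.0 to 1.0).
    slo: target success ratio, e.g. 0.999 for 99.9%.
    """
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors, slo, threshold=14.4):
    """Page only when both windows exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which avoids paging on an incident that is over.
    """
    return (
        burn_rate(long_window_errors, slo) >= threshold
        and burn_rate(short_window_errors, slo) >= threshold
    )
```

For example, with a 99.9% SLO, a 2% error ratio in both windows gives a burn rate of about 20 and would page. If the short window has already recovered to 0.05% errors, the same check would not page.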
| Service Type | Typical SLO | Error Budget (30 days) |
|---|---|---|
| User-facing API | 99.9% | 43.2 minutes |
| Internal service | 99.5% | 3.6 hours |
| Batch jobs | 99.0% | 7.2 hours |
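The error budget column follows directly from the SLO: the budget is the fraction of the window in which the service is allowed to be failing. A small sketch of that arithmetic, assuming a time-based (availability) SLO and a hypothetical helper name `error_budget_minutes`:

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed minutes of unavailability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for name, slo in [("User-facing API", 99.9),
                      ("Internal service", 99.5),
                      ("Batch jobs", 99.0)]:
        minutes = error_budget_minutes(slo)
        print(f"{name}: {minutes:.1f} minutes ({minutes / 60:.1f} hours)")
```

Over 30 days this gives 43.2 minutes for 99.9%, 216 minutes (3.6 hours) for 99.5%, and 432 minutes (7.2 hours) for 99.0%, matching the table.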
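# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (includes Grafana)
# serviceMonitorSelectorNilUsesHelmValues=false lets Prometheus select ServiceMonitors that lack this release's Helm labels
# adminPassword=admin is a placeholder for local testing; set a strong value before exposing Grafana
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=admin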
apiVersion: monitoring.coreos.com/v1