Use when user configures Prometheus, writes PromQL queries, creates alerting rules, sets up recording rules, configures Alertmanager routing, or integrates Prometheus with Grafana or service discovery. Do NOT use for Datadog, CloudWatch, or other proprietary monitoring. Do NOT use for OpenTelemetry collector setup (use opentelemetry-instrumentation skill).
Use the correct type for each measurement:
rate() or increase() before graphing.histogram_quantile() aggregation.Follow these rules for all custom metrics:
http_requests_total, .process_cpu_seconds_total_seconds, _bytes, _total, _info, _ratio._total.myapp_http_requests_total.method, status_code, endpoint, instance, job.metric_relabel_configs:metric_relabel_configs:
- source_labels: [pod_uid]
action: labeldrop
- source_labels: [__name__]
regex: "expensive_metric_.+"
action: drop
http_requests_total{method="GET", status="200"} # exact match
http_requests_total{status=~"5.."} # regex match
http_requests_total{method!="OPTIONS"} # negative match
http_requests_total{job="api"}[5m] # range vector
sum by (method) (rate(http_requests_total[5m]))
avg without (instance) (node_memory_MemAvailable_bytes)
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
count by (job) (up)
rate(http_requests_total[5m]) # per-second rate (counters)
increase(http_requests_total[1h]) # total increase (counters)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) # predict 4h ahead
absent(up{job="payment-service"}) # detect missing series
clamp_min(free_disk_bytes, 0)
rate(http_requests_total[5m])[30m:1m] # subquery
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) # CPU
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) # memory
rate(node_disk_io_time_seconds_total[5m]) # disk I/O
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
Pre-compute expensive queries. Use the naming convention level:metric:operations.