Application metrics with RED/USE methods, Prometheus types, dashboards, and SLO alerting. Use when instrumenting an application with metrics, designing dashboards, setting up alerting, choosing between metric types, defining SLIs/SLOs, or applying RED, USE, or the four golden signals.
Metrics turn a running system into numbers you can alert and chart on. Use the four golden signals (latency, traffic, errors, saturation), RED for request paths, USE for resources, and Prometheus histograms for latency. Drive alerting from SLO burn rate, not raw thresholds.
These four cover the health of any request-serving system:

- **Latency** — how long requests take, tracked separately for successes and errors
- **Traffic** — demand on the system, e.g. requests per second
- **Errors** — the rate of requests that fail
- **Saturation** — how full the system is: queue depth, utilization, headroom
For APIs, web services, and microservices, track at every boundary:

- **Rate** — requests per second
- **Errors** — failed requests per second
- **Duration** — the distribution of request latency (use a histogram)
For CPU, memory, disk, network, and queues, track for every resource:

- **Utilization** — the fraction of time the resource is busy
- **Saturation** — the amount of queued work the resource cannot yet service
- **Errors** — the count of error events
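The RED signals above can be captured with very little machinery. A minimal sketch without any client library — the `instrumented` decorator and the `"checkout"` handler are hypothetical names for illustration, not part of this document's references:

```python
import time
from collections import defaultdict

# Minimal in-process RED metrics: a counter per (handler, outcome)
# and a list of observed durations per handler.
counters = defaultdict(int)       # Rate and Errors come from counters
durations = defaultdict(list)     # Duration comes from observed latencies

def instrumented(name):
    """Record rate, errors, and duration for each call to a handler."""
    def wrap(handler):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = handler(*args, **kwargs)
                counters[(name, "ok")] += 1
                return result
            except Exception:
                counters[(name, "error")] += 1
                raise
            finally:
                durations[name].append(time.perf_counter() - start)
        return inner
    return wrap

@instrumented("checkout")
def handle_checkout(ok=True):
    if not ok:
        raise ValueError("payment failed")
    return "done"

handle_checkout()
try:
    handle_checkout(ok=False)
except ValueError:
    pass
# Rate = counter deltas over a time window; Errors = the error counter;
# Duration = the distribution of durations[name].
```

A real service would use a Prometheus client library instead of dicts, but the shape is the same: a monotonic counter per outcome plus an observed latency distribution.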
| Type | Use For | Example |
|---|---|---|
| Counter | Monotonically increasing values | http_requests_total |
| Gauge | Values that go up and down | temperature_celsius |
| Histogram | Distribution of values in buckets | http_request_duration_seconds |
| Summary | Pre-calculated quantiles on the client | rpc_duration_seconds |
Use counters for anything you'll rate() over: requests, errors,
bytes. Use gauges for current state. Prefer histograms over summaries
in almost all cases — histograms aggregate across instances, summaries
do not. Suffix counters with _total and use unit suffixes
(_seconds, _bytes). See references/metric-types.md for code
examples and label cardinality guidance.
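The claim that histograms aggregate across instances while summaries do not comes down to arithmetic on bucket counts. A sketch with made-up counts from two hypothetical replicas (the quantile helper mimics the bucket-selection idea behind PromQL's histogram_quantile, minus its within-bucket interpolation):

```python
# Cumulative bucket counts (Prometheus-style "le" buckets) from two replicas.
bounds = [0.1, 0.25, 0.5, 1.0, float("inf")]
replica_a = [90, 96, 99, 100, 100]   # fast replica: observations <= each bound
replica_b = [10, 40, 80, 100, 100]   # slow replica

# Histograms aggregate: summing per-bucket counts across replicas is exact.
merged = [a + b for a, b in zip(replica_a, replica_b)]

def quantile(q, bounds, cumulative):
    """Return the first bucket bound whose cumulative count reaches
    q * total (simplified: PromQL also interpolates inside the bucket)."""
    target = q * cumulative[-1]
    for bound, count in zip(bounds, cumulative):
        if count >= target:
            return bound
    return bounds[-1]

p99 = quantile(0.99, bounds, merged)            # fleet-wide p99 bucket: 1.0s

# Summaries do NOT aggregate: averaging per-replica p99s gives a number
# (0.75s here) that is not the p99 of anything.
wrong_p99 = (quantile(0.99, bounds, replica_a) +
             quantile(0.99, bounds, replica_b)) / 2
```

The merged histogram correctly reports that the fleet's p99 sits in the 1.0s bucket; averaging the two replicas' pre-computed p99s yields 0.75s, a value no user experienced.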
Start every service dashboard with the golden signals. Use p50/p95/p99 for latency, never just averages — p99 is often 10x worse than p50. Layer dashboards: overview (all services) -> service detail -> instance detail. Include deployment markers as vertical annotations. For each downstream dependency, show a RED row so failures attribute correctly.
Common failure modes to pause and self-check for, regardless of provider:

- **High-cardinality status labels:** use `status_code_class="2xx"`, not `status_code="200"`.
- **Averaging pre-computed quantiles:** `sum(rate(summary_quantile[5m]))` produces meaningless results when you have multiple replicas. Histograms aggregate correctly: `histogram_quantile(0.99, sum by (le) (rate(histogram_bucket[5m])))`. Prefer histograms for all latency metrics.
- **Graphing raw counters:** `http_requests_total` as an absolute number shows a meaningless cumulative value that grows without bound. Always apply `rate()` or `irate()` to counters to show the per-second rate of change, which is the operationally meaningful signal.

Pick 3-5 SLIs reflecting user experience (availability, latency, correctness). Set targets from user expectations, not system capability. Track error budgets — if the SLO is 99.9%, the budget is 0.1% (43 min/month). When the budget is exhausted, freeze feature work and focus on reliability.
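The error-budget arithmetic can be checked directly. A quick sketch — the 30-day window and the 14.4x fast-burn threshold are common conventions, not requirements stated in this document:

```python
# Error budget for a 99.9% availability SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes in 30 days
budget_minutes = (1 - slo) * window_minutes    # ~43.2 minutes of allowed failure

# Burn rate = observed error ratio / budgeted error ratio.
# Burn rate 1.0 spends the budget exactly over the full window; a common
# fast-burn page fires around 14.4x (budget gone in roughly two days).
observed_error_ratio = 0.0144                  # e.g. 1.44% of requests failing
burn_rate = observed_error_ratio / (1 - slo)   # ~14.4
hours_to_exhaustion = (window_minutes / 60) / burn_rate   # ~50 hours
```

This is why burn-rate alerts beat raw thresholds: the same 1.44% error ratio is an emergency against a 99.9% SLO but barely notable against a 99% one.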
For any new service, instrument at minimum:

- a request counter (`http_requests_total`) labeled by method and status class
- a latency histogram (`http_request_duration_seconds`)
- an error counter, or an error outcome label on the request counter
- a saturation gauge (in-flight requests or queue depth)
Keep labels low-cardinality (< 10 values each in practice). Never use
user IDs, email addresses, or full request paths as label values. Use
bounded categories like status="2xx" rather than per-status-code.
A metric with 3 labels of 10 values each = 1,000 series; a 4th label
of 100 values = 100,000 series and Prometheus memory pain. See
references/metric-types.md for PromQL patterns covering rate,
error %, histogram quantiles, saturation, and burn-rate.
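The series-count arithmetic above is just a product over per-label value counts; a one-line sketch:

```python
from math import prod

def series_count(label_value_counts):
    """Worst-case time series for one metric name: the product of the
    number of possible values of each label."""
    return prod(label_value_counts)

three_labels = series_count([10, 10, 10])          # 1,000 series
with_wide_label = series_count([10, 10, 10, 100])  # 100,000 series
```

Every new label multiplies, never adds — which is why one unbounded label (a user ID, a raw URL path) can dominate a Prometheus server's memory on its own.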