Infrastructure-level monitoring configuration for metrics, dashboards, alerting, logging backends, and SLO/SLI policy. Use when asked to set up monitoring, create a Grafana dashboard, write Prometheus alerting rules, define SLOs, configure Alertmanager routing, set up centralized logging with Loki or Elasticsearch, configure tracing backends such as Jaeger or Tempo, or write an on-call runbook.
User language explicitly matches trigger phrases such as "set up monitoring", "Grafana dashboard", or "Prometheus alert".
The requested work fits this skill's lane: monitoring setup, dashboard design, alert rules, SLO definition, logging backends, tracing backends, and runbooks.
The task stays inside this skill's boundary. Application instrumentation code and in-process metrics or tracing changes are out of scope (use observability-specialist).
Grafana dashboards: include a uid field, template variables for environment and service, and thresholds on panels
SLO definitions: include the SLI query, target percentage, error budget calculation, and burn rate alert thresholds
Runbooks: include trigger condition, diagnostic steps, and resolution steps
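As an illustration of the SLO items listed above, the following is a minimal sketch of a Prometheus rule file for a hypothetical 99.9% availability SLO over 30 days. The metric name http_requests_total, the code and job labels, the rule names, and the runbook URL are all assumptions, not taken from any real system. The error budget is 1 - 0.999 = 0.001, and the burn rate threshold of 14.4 is the commonly used fast-burn value: sustained for 1 hour it consumes about 2% of a 30-day budget (14.4 x 1h / 720h).

```yaml
# Sketch only: metric names, labels, and URLs are placeholders.
groups:
  - name: api-slo
    rules:
      # SLI: fraction of requests that failed with a 5xx, over a long window.
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{job="api", code=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total{job="api"}[1h]))
      # Same SLI over a short window, used to confirm the burn is still ongoing.
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{job="api", code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total{job="api"}[5m]))
      # Burn rate alert: error ratio exceeds 14.4x the error budget (0.001)
      # in both the 1h and 5m windows.
      - alert: APIErrorBudgetFastBurn
        expr: |
          job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API is burning its 30-day error budget at more than 14.4x"
          runbook_url: "https://runbooks.example.com/api-error-budget-burn"
```

Requiring both windows to exceed the threshold is one way to keep the alert from firing long after a short spike has ended; slower burn rates with longer windows can be added as separate warning-level rules.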
Constraints
NEVER create an alert without a runbook link or, at minimum, a diagnostic note in its annotations
NEVER alert on CPU or memory alone; tie resource alerts to user-facing symptoms
Scope boundary: adding metrics, spans, correlation IDs, or structured logging inside application source code belongs to observability-specialist
Examples
Example 1: Set up Prometheus alerting
User says: "Set up alerting for my API - I need to know when error rate is high or latency is bad"
Actions:
Glob for existing Prometheus config
Write alert rules: APIHighErrorRate (5xx ratio > 5% for 5 min), APIHighLatency (p99 > 500 ms for 5 min)
Write Alertmanager routing to send critical to PagerDuty, warning to Slack
Result: alerts.yml + alertmanager.yml with runbook annotations
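A rough sketch of what the alerts.yml from this example might contain is shown here. The metric names (http_requests_total, http_request_duration_seconds_bucket), the job="api" selector, the severity assigned to each alert, and the runbook URLs are assumptions for illustration; they should be replaced with whatever the target service actually exposes.

```yaml
# Sketch only: metric names, label values, and URLs are placeholders.
groups:
  - name: api-alerts
    rules:
      # Fires when more than 5% of requests return a 5xx for 5 minutes.
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx error rate has been above 5% for 5 minutes"
          runbook_url: "https://runbooks.example.com/api-high-error-rate"
      # Fires when p99 latency stays above 500 ms (0.5 s) for 5 minutes.
      # Assumes a Prometheus histogram measured in seconds.
      - alert: APIHighLatency
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API p99 latency has been above 500 ms for 5 minutes"
          runbook_url: "https://runbooks.example.com/api-high-latency"
```

Each rule carries a runbook_url annotation so that it satisfies the constraint that no alert ships without a runbook or diagnostic note.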
Troubleshooting
Alerts firing but Alertmanager not sending notifications
Cause: Alertmanager routing misconfiguration or receiver auth failure
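Fix: Print the routing tree with amtool config routes show, confirm which receiver a label set reaches with amtool config routes test (for example, severity=critical), then fire a test alert with amtool alert add and check the Alertmanager logs for receiver auth errors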
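When checking a routing tree against intent, it can help to compare it with a minimal known-good shape. The sketch below mirrors the routing described in Example 1 (critical to PagerDuty, warning to Slack). The receiver names, the channel, and the key and webhook values are placeholders, and the choice of Slack as the fallback receiver is an assumption.

```yaml
# Sketch only: receiver names, channel, key, and webhook URL are placeholders.
route:
  # Fallback receiver for alerts that match no child route.
  receiver: slack-warnings
  group_by: [alertname, job]
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-critical
    - matchers:
        - 'severity="warning"'
      receiver: slack-warnings

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      # Events API v2 integration key; a wrong key is a common auth failure.
      - routing_key: "<pagerduty-integration-key>"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>"
        channel: "#alerts"
```

If alerts carry a severity value that none of the matchers cover, they fall through to the top-level receiver, which is a frequent reason notifications appear in an unexpected place or not at all.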
Grafana dashboard showing "No data"
Cause: Datasource misconfigured, wrong label selectors, or metric doesn't exist
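Fix: Run the query directly in the Prometheus UI; list the series and label names that actually exist with the selector {__name__=~"metric_name.*"} (Prometheus regex matchers are fully anchored, so the trailing .* is needed to match a prefix)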