Implement and validate open source observability stacks using Prometheus, Alertmanager, Grafana, Loki, Tempo, and the OpenTelemetry Collector. Use when the story needs backend observability infrastructure, dashboards, alerting, or log/trace storage built from open source tools.
Follow docs/guidelines/shared-operating-policy.md#extension-pack-activation-rule — this skill belongs to the observability extension pack.
# Validate Prometheus config and recording/alert rules
promtool check config observability/prometheus/prometheus.yml
promtool check rules observability/prometheus/*rules.yml
# Validate Alertmanager config
amtool check-config observability/alertmanager/alertmanager.yml
# Validate Grafana dashboard JSON syntax (loops in place so later relative paths still resolve)
for f in observability/grafana/dashboards/*.json; do jq . "$f" > /dev/null && echo "✓ $f" || echo "✗ $f"; done
# Validate Loki configuration YAML syntax
yamllint observability/loki/loki-config.yaml
# Validate Tempo configuration YAML syntax
yamllint observability/tempo/tempo-config.yaml
# Validate OpenTelemetry Collector configuration
otelcol validate --config observability/otel-collector.yaml
# or, if using the otelcol-contrib distribution:
otelcol-contrib validate --config observability/otel-collector.yaml
# Flag Grafana-only variables (e.g. $__interval, $__rate_interval) in Prometheus rule files; Prometheus does not expand them
rg '\$__(\w+)' observability/prometheus/recording-rules.yml observability/prometheus/alert-rules.yml
# Check for any obvious high-cardinality patterns (user IDs, request IDs in labels)
rg '(user_id|request_id|pod_name|trace_id)' observability/prometheus/ observability/loki/
# Test Prometheus scrape targets (requires running Prometheus)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'
# Query test: sample a metric (requires running Prometheus)
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result[0]'
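The commands above each print their own result, so a failure in the middle is easy to miss. A minimal sketch of a wrapper that runs them all and returns one exit status, suitable for a CI step, could look like the following. The function names, the `static` and `targets` subcommands, and the `PROM_URL` variable are illustrative assumptions and not part of the repository; the file paths are the ones used above, and `amtool` is assumed to be installed alongside Alertmanager.

```shell
#!/usr/bin/env bash
# Sketch only: function names, subcommands, and PROM_URL are illustrative.

failures=0

# run_check LABEL COMMAND [ARGS...]
# Runs COMMAND quietly, prints a one-line verdict, and counts failures.
run_check() {
  local label=$1
  shift
  if "$@" > /dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    failures=$((failures + 1))
  fi
}

# Static checks: no running services required.
run_static_checks() {
  run_check "prometheus config" promtool check config observability/prometheus/prometheus.yml
  # Unquoted glob on purpose; it matches *.rules.yml and *-rules.yml names alike.
  run_check "prometheus rules" promtool check rules observability/prometheus/*rules.yml
  # amtool is the CLI that ships with Alertmanager.
  run_check "alertmanager config" amtool check-config observability/alertmanager/alertmanager.yml
  for f in observability/grafana/dashboards/*.json; do
    run_check "dashboard $f" jq . "$f"
  done
  run_check "loki yaml" yamllint observability/loki/loki-config.yaml
  run_check "tempo yaml" yamllint observability/tempo/tempo-config.yaml
  run_check "otel collector" otelcol validate --config observability/otel-collector.yaml
  echo "$failures check(s) failed"
  [ "$failures" -eq 0 ]
}

# Reads a /api/v1/targets response on stdin; succeeds only if no active
# target reports a health other than "up".
all_targets_up() {
  jq -e '[.data.activeTargets[] | select(.health != "up")] | length == 0' > /dev/null
}

# With no argument the file only defines the helpers, so it can be sourced.
case "${1:-}" in
  static)  run_static_checks ;;
  targets) curl -sf "${PROM_URL:-http://localhost:9090}/api/v1/targets" | all_targets_up ;;
esac
```

Saved under a hypothetical name such as `check-observability.sh`, this would be invoked as `bash check-observability.sh static` before deployment and `bash check-observability.sh targets` against a running Prometheus. Counting failures instead of stopping at the first one means a single run reports every broken config.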
Follow the universal checklist at docs/guidelines/shared-operating-policy.md#completion-checklist and confirm all items pass before marking work complete.