Use this skill when implementing logging, metrics, distributed tracing, alerting, or defining SLOs. Triggers on structured logging, Prometheus, Grafana, OpenTelemetry, Datadog, distributed tracing, error tracking, dashboards, alert fatigue, SLIs, SLOs, error budgets, and any task requiring system observability or monitoring setup.
When this skill is activated, always start your first response with the 🧢 emoji.
Observability is the ability to understand what a system is doing from the outside by examining its outputs - without needing to modify the system or guess at internals. The three pillars are logs (what happened), metrics (how the system is performing), and traces (where time is spent across service boundaries). These pillars are only useful when correlated - a spike in your p99 metric should link to traces, and those traces should link to logs. Invest in correlation from day one, not as a retrofit.
Trigger this skill when the user asks to implement logging, metrics, distributed tracing, alerting, dashboards, or SLOs, or mentions any of the trigger terms above.
Do NOT trigger this skill for tasks with no observability or monitoring component.
Structured logging always - Every log line should be machine-parseable JSON with consistent fields. Plain-text logs cannot be queried, filtered, or aggregated at scale. Correlation IDs are non-negotiable.
USE for resources, RED for services - Resources (CPU, memory, connections) are measured with Utilization/Saturation/Errors. Services (APIs, queues) are measured with Rate/Errors/Duration. Knowing which method applies tells you which metrics to instrument before you write a single line of code.
Instrument at boundaries - Service ingress/egress, database calls, external HTTP calls, and message queue produce/consume operations are the minimum instrumentation surface. Everything else is optional until proven necessary.
Alert on symptoms, not causes - Alert when users are impacted (high error rate, high latency). Do not page on CPU at 80% or a memory warning - those are causes to investigate, not symptoms to wake someone up for.
SLOs drive decisions - Every reliability trade-off should reference an error budget. If budget is healthy, ship features. If budget is burning, stop and fix reliability. SLOs without error budgets are just numbers on a slide.
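The RED split above can be sketched without a metrics library. This is an illustrative assumption, not a real Prometheus client - in production you would use `prom-client` counters and histograms - but it shows exactly which three signals RED asks you to track per service:

```typescript
// Sketch only: a hand-rolled RED tracker. Rate = requests, Errors = failures,
// Duration = latency samples. The p99 here is a naive nearest-rank percentile.
class RedMetrics {
  requests = 0;                 // Rate: total requests observed
  errors = 0;                   // Errors: failed requests
  durationsMs: number[] = [];   // Duration: per-request latency samples

  observe(durationMs: number, failed: boolean): void {
    this.requests++;
    if (failed) this.errors++;
    this.durationsMs.push(durationMs);
  }

  p99(): number {
    const sorted = [...this.durationsMs].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99))];
  }
}

const red = new RedMetrics();
red.observe(12, false);
red.observe(340, true);
console.log(red.requests, red.errors, red.p99()); // 2 1 340
```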
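The error-budget arithmetic behind the SLO principle, as a sketch (the traffic numbers are made up for illustration):

```typescript
// Error budget = 1 - SLO. For 99.9%, that is 0.1% of a 30-day month.
const slo = 0.999;
const errorBudget = 1 - slo;                          // 0.1% of requests may fail
const minutesPerMonth = 30 * 24 * 60;                 // 43,200 minutes
const budgetMinutes = minutesPerMonth * errorBudget;  // about 43.2 minutes

// Burn rate: observed error rate divided by the budgeted rate.
// A burn rate above 1 means the budget runs out before the window ends.
function burnRate(failed: number, total: number, budget: number): number {
  return failed / total / budget;
}

console.log(budgetMinutes.toFixed(1));                // "43.2"
console.log(burnRate(20, 10_000, errorBudget));       // roughly 2: burning 2x too fast
```

A burn rate of 2 over a sustained window is a common paging threshold: it means the whole month's budget is gone in two weeks.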
| Pillar | Question answered | What it gives you |
|---|---|---|
| Logs | What happened? | Detailed event records, debug context, audit trails |
| Metrics | How is the system performing? | Aggregated numbers over time, dashboards, alerting |
| Traces | Where did time go? | Request flow across services, latency attribution |
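A minimal sketch of what a machine-parseable log line correlating the pillars looks like. The field names (`ts`, `service`, `traceId`) and the `checkout` service name are illustrative assumptions, not a required schema:

```typescript
// Sketch: every log line is one JSON object with a consistent base
// (timestamp, service) plus a traceId that links it back to a trace.
interface LogFields {
  level: 'info' | 'warn' | 'error';
  msg: string;
  traceId: string;
  [key: string]: unknown;
}

function logLine(fields: LogFields): string {
  return JSON.stringify({ ts: new Date().toISOString(), service: 'checkout', ...fields });
}

const line = logLine({ level: 'info', msg: 'order created', traceId: 'abc123', orderId: 42 });
console.log(line);
```

Because the output is JSON, a query like "all error lines for traceId abc123" is a filter, not a regex hunt.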
Every unique combination of label values in a metric creates a new time series in your
metrics backend. user_id as a metric label will create millions of time series and
kill Prometheus. Keep metric label cardinality under ~100 unique values per label.
Use logs or traces for high-cardinality data (user IDs, request IDs, emails).
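The explosion is multiplicative, which is why one bad label ruins an otherwise safe metric. A quick sketch of the arithmetic (the example cardinalities are assumptions):

```typescript
// Total series for a metric = product of unique values across its labels.
function seriesCount(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((acc, n) => acc * n, 1);
}

// Safe: method (5) x route (50) x status (6) = 1,500 series
console.log(seriesCount([5, 50, 6]));             // 1500

// Dangerous: adding user_id for 1,000,000 users multiplies every existing
// combination, producing 1.5 billion series from the same metric.
console.log(seriesCount([5, 50, 6, 1_000_000]));  // 1500000000
```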
Exemplars are trace IDs embedded in metric data points. When you see a p99 spike on a histogram, an exemplar lets you jump directly to a trace that caused it. OpenTelemetry and Prometheus support exemplars natively. Enable them - they are the bridge between metrics and traces.
Context propagation is the mechanism by which a trace ID flows through service boundaries.
The W3C traceparent header is the standard format. Every service must: extract the
header on ingress, attach it to async context, and inject it into all outbound calls.
Failing to propagate breaks trace continuity silently.
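The W3C traceparent format is `version-traceid-spanid-flags`, e.g. `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`. A sketch of extraction and injection by hand - in a real service, use an OpenTelemetry propagator rather than string parsing:

```typescript
// Sketch: pull the 32-hex-char trace ID out of an incoming traceparent header.
function extractTraceId(traceparent: string | undefined): string | null {
  if (!traceparent) return null;
  const parts = traceparent.split('-');
  if (parts.length !== 4 || parts[1].length !== 32) return null;
  return parts[1];
}

// Sketch: build the header for an outbound call ("00" version, "01" = sampled).
function injectTraceparent(traceId: string, spanId: string): Record<string, string> {
  return { traceparent: `00-${traceId}-${spanId}-01` };
}

const id = extractTraceId('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(id); // 4bf92f3577b34da6a3ce929d0e0e4736
```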
The availability SLI is successful_requests / total_requests. The error budget is 1 - SLO: for a 99.9% SLO, the budget is 0.1% - about 43 minutes of downtime per month. Burn rate measures how fast you consume it.

Use pino for Node.js (fastest), winston for flexibility. Always include a correlation ID middleware that attaches traceId to every log automatically.
```typescript
// logger.ts - pino with correlation ID support
import pino from 'pino';
import crypto from 'node:crypto';
import type { Request, Response, NextFunction } from 'express';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: process.env.SERVICE_NAME ?? 'unknown',
    version: process.env.SERVICE_VERSION ?? '0.0.0',
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: ['req.headers.authorization', 'body.password', 'body.token'],
});

// Augment Express's Request with the per-request logger
// (pino-http does this for you if you use it instead).
declare module 'express-serve-static-core' {
  interface Request {
    log: pino.Logger;
  }
}

// Express middleware - binds traceId to every child logger in the request scope.
// traceparent is "version-traceid-spanid-flags"; take the trace ID field only.
export function loggerMiddleware(req: Request, res: Response, next: NextFunction) {
  const traceparent = req.headers['traceparent'] as string | undefined;
  const traceId = traceparent?.split('-')[1]
    ?? (req.headers['x-request-id'] as string | undefined)
    ?? crypto.randomUUID();
  req.log = logger.child({ traceId, method: req.method, path: req.path });
  res.setHeader('x-request-id', traceId);
  next();
}
```
```typescript
// Usage in a route handler
app.post('/orders', async (req, res) => {
  const body = req.body;
  const start = Date.now();
  req.log.info({ orderId: body.id }, 'Processing order');
  try {
    const result = await orderService.create(body);
    req.log.info({ orderId: result.id, durationMs: Date.now() - start }, 'Order created');
    res.json(result);
  } catch (err) {
    req.log.error({ err, orderId: body.id }, 'Order creation failed');
    res.status(500).json({ error: 'internal_error' });
  }
});
```
Use the Node.js SDK with auto-instrumentation for HTTP, Express, and common DB clients. Add manual spans only for business-critical operations.
```typescript
// instrumentation.ts - must be loaded before any other module (Node --require flag)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  serviceName: process.env.SERVICE_NAME ?? 'my-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 15_000,
  }),
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // 10% head-based sampling
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
```
```typescript
// Manual span for a business operation (`stripe` is an assumed client instance)
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', async (span) => {
    span.setAttributes({ 'order.id': orderId, 'payment.amount': amount });
    try {
      const result = await stripe.charges.create({ amount, currency: 'usd' });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
Load `instrumentation.ts` before your app with `node --require ./dist/instrumentation.js server.js`. See `references/opentelemetry-setup.md` for exporters, processors, and Python setup.
Define SLIs from the user's perspective first, then map to metrics you can measure.
# slos.yaml - document alongside your service