Add structured logging, metrics, and distributed tracing to any service or monolith
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars are logs, metrics, and traces. A system without observability is a black box — you cannot debug production issues, validate deployments, or understand user-facing behavior.
Works in monoliths (single service, shared log/metrics output) and distributed systems (requires correlation IDs and distributed tracing).
Use this skill when:
Rules:
- `ERROR`: unexpected failure requiring investigation
- `WARN`: unexpected but handled, or degraded behavior
- `INFO`: significant business events (order placed, user logged in)
- `DEBUG`: developer diagnostics (do not enable in production by default)
- Prefer structured fields over free-text messages. Bad: `"payment failed"`. Good: `{"event": "payment_failed", "order_id": "abc", "reason": "gateway_timeout", "attempt": 2}`.

Minimum fields on every log entry:
{
  "timestamp": "2025-03-15T21:00:00Z",
  "level": "ERROR",
  "service": "checkout-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Payment gateway timeout",
  "order_id": "ord_789",
  "duration_ms": 5023
}
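
One way to produce entries of this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is illustrative only: `JsonFormatter` and `SERVICE_NAME` are hypothetical names, and it assumes callers pass `trace_id`, `span_id`, and any business fields through the `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical service name; a real service would read this from configuration.
SERVICE_NAME = "checkout-service"

# Attributes present on every LogRecord; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with the minimum fields."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": SERVICE_NAME,
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        }
        # Copy caller-supplied context fields (order_id, duration_ms, ...).
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS and key not in entry:
                entry[key] = value
        return json.dumps(entry, default=str)


handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment gateway timeout",
    extra={"trace_id": "abc123", "span_id": "def456", "order_id": "ord_789", "duration_ms": 5023},
)
```

Keeping the context in `extra=` rather than interpolating it into the message string is what makes the fields queryable downstream.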
Key metric types:
| Type | Use for | Example |
|---|---|---|
| Counter | Things that accumulate | http_requests_total |
| Gauge | Current values | active_connections, queue_depth |
| Histogram | Latency and size distributions | http_request_duration_seconds |
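
To make the difference between the three types concrete, here is a minimal in-memory sketch. It is not a real metrics library: the `Counter`, `Gauge`, and `Histogram` classes and the bucket bounds are assumptions for illustration, and a real service would use an established client (for example an OpenTelemetry or Prometheus client) instead.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Accumulating total; it only ever goes up."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Current value; it can rise or fall."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket, where each bucket has an upper bound."""
    bounds: tuple = (0.1, 0.5, 1.0, 5.0)  # hypothetical bounds, in seconds
    counts: list = field(default_factory=list)
    total: float = 0.0

    def __post_init__(self) -> None:
        # One slot per bound plus a final overflow slot for larger values.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


http_requests_total = Counter()
active_connections = Gauge()
http_request_duration_seconds = Histogram()

http_requests_total.inc()                    # one more request served
active_connections.set(12)                   # 12 connections open right now
http_request_duration_seconds.observe(0.3)   # lands in the 0.5 s bucket
```

Note that this sketch stores per-bucket counts; some real systems expose cumulative bucket counts instead.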
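The Four Golden Signals (Google SRE). Instrument these for every service:
- Latency: how long requests take to serve, tracked separately for successful and failed requests
- Traffic: how much demand the service receives (for example, requests per second)
- Errors: the rate of requests that fail
- Saturation: how full the service is, measured on its most constrained resource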
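
As a rough illustration, the four signals (latency, traffic, errors, saturation) can be derived from a window of request samples. The `Request` record, the `golden_signals` function, and the choice of p99 and of in-flight requests over capacity as the saturation measure are all assumptions made for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request
    status: int        # HTTP status code


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one observation window of `window_s` seconds."""
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 99th percentile: smallest value with >= 99% of samples at or below it.
    rank = max(1, math.ceil(len(durations) * 99 / 100))
    p99 = durations[rank - 1] if durations else 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_s": p99,
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": in_flight / capacity,
    }
```

In a real deployment these values would come from the metrics backend rather than from raw request lists; the function only shows what each signal measures.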
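USE Method (for infrastructure). For every resource, check:
- Utilization: the fraction of time the resource is busy
- Saturation: the amount of queued work the resource cannot yet service
- Errors: the count of error events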
For monoliths: Add trace IDs to log lines to correlate events within a single request lifecycle.
For distributed systems:
- Generate a `trace_id` at the entry point (API gateway or first service to receive the request).
- Propagate `trace_id` and `span_id` in HTTP headers (`traceparent` in W3C Trace Context format).
- Have each service record its own `span_id` and the parent's ID.

OpenTelemetry is the vendor-neutral standard for instrumentation; use it over vendor-specific SDKs where possible.
When applying this skill, the agent should: