Use this skill when implementing logging, metrics, distributed tracing, alerting, or defining SLOs. Triggers on structured logging, Prometheus, Grafana, OpenTelemetry, Datadog, distributed tracing, error tracking, dashboards, alert fatigue, SLIs, SLOs, error budgets, and any task requiring system observability or monitoring setup.
When this skill is activated, always start your first response with the 🧢 emoji.
Observability is the ability to understand what a system is doing from the outside by examining its outputs - without needing to modify the system or guess at internals. The three pillars are logs (what happened), metrics (how the system is performing), and traces (where time is spent across service boundaries). These pillars are only useful when correlated - a spike in your p99 metric should link to traces, and those traces should link to logs. Invest in correlation from day one, not as a retrofit.
Trigger this skill when the user asks to implement logging, metrics, distributed tracing, alerting, dashboards, or SLOs, or mentions any of the trigger terms above.
Do NOT trigger this skill for tasks with no observability or monitoring component.
Structured logging always - Every log line should be machine-parseable JSON with consistent fields. Plain-text logs cannot be queried, filtered, or aggregated at scale. Correlation IDs are non-negotiable.
USE for resources, RED for services - Resources (CPU, memory, connections) are measured with Utilization/Saturation/Errors. Services (APIs, queues) are measured with Rate/Errors/Duration. Knowing which method applies tells you which metrics to instrument before you write a single line of code.
Instrument at boundaries - Service ingress/egress, database calls, external HTTP calls, and message queue produce/consume operations are the minimum instrumentation surface. Everything else is optional until proven necessary.
Alert on symptoms, not causes - Alert when users are impacted (high error rate, high latency). Do not page on CPU at 80% or a memory warning - those are causes to investigate, not symptoms to wake someone up for.
SLOs drive decisions - Every reliability trade-off should reference an error budget. If budget is healthy, ship features. If budget is burning, stop and fix reliability. SLOs without error budgets are just numbers on a slide.
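The RED split above can be sketched without a metrics library. This is an illustrative assumption, not a real Prometheus client - in production you would use `prom-client` counters and histograms - but it shows exactly which three signals RED asks you to track per service:

```typescript
// Sketch only: a hand-rolled RED tracker. Rate = requests, Errors = failures,
// Duration = latency samples. The p99 here is a naive nearest-rank percentile.
class RedMetrics {
  requests = 0;                 // Rate: total requests observed
  errors = 0;                   // Errors: failed requests
  durationsMs: number[] = [];   // Duration: per-request latency samples

  observe(durationMs: number, failed: boolean): void {
    this.requests++;
    if (failed) this.errors++;
    this.durationsMs.push(durationMs);
  }

  p99(): number {
    const sorted = [...this.durationsMs].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99))];
  }
}

const red = new RedMetrics();
red.observe(12, false);
red.observe(340, true);
console.log(red.requests, red.errors, red.p99()); // 2 1 340
```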
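The error-budget arithmetic behind the SLO principle, as a sketch (the traffic numbers are made up for illustration):

```typescript
// Error budget = 1 - SLO. For 99.9%, that is 0.1% of a 30-day month.
const slo = 0.999;
const errorBudget = 1 - slo;                          // 0.1% of requests may fail
const minutesPerMonth = 30 * 24 * 60;                 // 43,200 minutes
const budgetMinutes = minutesPerMonth * errorBudget;  // about 43.2 minutes

// Burn rate: observed error rate divided by the budgeted rate.
// A burn rate above 1 means the budget runs out before the window ends.
function burnRate(failed: number, total: number, budget: number): number {
  return failed / total / budget;
}

console.log(budgetMinutes.toFixed(1));                // "43.2"
console.log(burnRate(20, 10_000, errorBudget));       // roughly 2: burning 2x too fast
```

A burn rate of 2 over a sustained window is a common paging threshold: it means the whole month's budget is gone in two weeks.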
| Pillar | Question answered | What it gives you |
|---|---|---|
| Logs | What happened? | Detailed event records, debug context, audit trails |
| Metrics | How is the system performing? | Aggregated numbers over time, dashboards, alerting |
| Traces | Where did time go? | Request flow across services, latency attribution |
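A minimal sketch of what a machine-parseable log line correlating the pillars looks like. The field names (`ts`, `service`, `traceId`) and the `checkout` service name are illustrative assumptions, not a required schema:

```typescript
// Sketch: every log line is one JSON object with a consistent base
// (timestamp, service) plus a traceId that links it back to a trace.
interface LogFields {
  level: 'info' | 'warn' | 'error';
  msg: string;
  traceId: string;
  [key: string]: unknown;
}

function logLine(fields: LogFields): string {
  return JSON.stringify({ ts: new Date().toISOString(), service: 'checkout', ...fields });
}

const line = logLine({ level: 'info', msg: 'order created', traceId: 'abc123', orderId: 42 });
console.log(line);
```

Because the output is JSON, a query like "all error lines for traceId abc123" is a filter, not a regex hunt.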
Every unique combination of label values in a metric creates a new time series in your
metrics backend. user_id as a metric label will create millions of time series and
kill Prometheus. Keep metric label cardinality under ~100 unique values per label.
Use logs or traces for high-cardinality data (user IDs, request IDs, emails).
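The explosion is multiplicative, which is why one bad label ruins an otherwise safe metric. A quick sketch of the arithmetic (the example cardinalities are assumptions):

```typescript
// Total series for a metric = product of unique values across its labels.
function seriesCount(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((acc, n) => acc * n, 1);
}

// Safe: method (5) x route (50) x status (6) = 1,500 series
console.log(seriesCount([5, 50, 6]));             // 1500

// Dangerous: adding user_id for 1,000,000 users multiplies every existing
// combination, producing 1.5 billion series from the same metric.
console.log(seriesCount([5, 50, 6, 1_000_000]));  // 1500000000
```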
Exemplars are trace IDs embedded in metric data points. When you see a p99 spike on a histogram, an exemplar lets you jump directly to a trace that caused it. OpenTelemetry and Prometheus support exemplars natively. Enable them - they are the bridge between metrics and traces.
Context propagation is the mechanism by which a trace ID flows through service boundaries.
The W3C traceparent header is the standard format. Every service must: extract the
header on ingress, attach it to async context, and inject it into all outbound calls.
Failing to propagate breaks trace continuity silently.
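The W3C traceparent format is `version-traceid-spanid-flags`, e.g. `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`. A sketch of extraction and injection by hand - in a real service, use an OpenTelemetry propagator rather than string parsing:

```typescript
// Sketch: pull the 32-hex-char trace ID out of an incoming traceparent header.
function extractTraceId(traceparent: string | undefined): string | null {
  if (!traceparent) return null;
  const parts = traceparent.split('-');
  if (parts.length !== 4 || parts[1].length !== 32) return null;
  return parts[1];
}

// Sketch: build the header for an outbound call ("00" version, "01" = sampled).
function injectTraceparent(traceId: string, spanId: string): Record<string, string> {
  return { traceparent: `00-${traceId}-${spanId}-01` };
}

const id = extractTraceId('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(id); // 4bf92f3577b34da6a3ce929d0e0e4736
```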
The availability SLI is successful_requests / total_requests. The error budget is 1 - SLO: for a 99.9% SLO, the budget is 0.1% - about 43 minutes of downtime per month. Burn rate measures how fast you consume it.

Use pino for Node.js (fastest), winston for flexibility. Always include a correlation ID middleware that attaches traceId to every log automatically.
```typescript
// logger.ts - pino with correlation ID support
import pino from 'pino';
import crypto from 'node:crypto';
import type { Request, Response, NextFunction } from 'express';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: process.env.SERVICE_NAME ?? 'unknown',
    version: process.env.SERVICE_VERSION ?? '0.0.0',
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: ['req.headers.authorization', 'body.password', 'body.token'],
});

// Augment Express's Request with the per-request logger
// (pino-http does this for you if you use it instead).
declare module 'express-serve-static-core' {
  interface Request {
    log: pino.Logger;
  }
}

// Express middleware - binds traceId to every child logger in the request scope.
// traceparent is "version-traceid-spanid-flags"; take the trace ID field only.
export function loggerMiddleware(req: Request, res: Response, next: NextFunction) {
  const traceparent = req.headers['traceparent'] as string | undefined;
  const traceId = traceparent?.split('-')[1]
    ?? (req.headers['x-request-id'] as string | undefined)
    ?? crypto.randomUUID();
  req.log = logger.child({ traceId, method: req.method, path: req.path });
  res.setHeader('x-request-id', traceId);
  next();
}
```
```typescript
// Usage in a route handler
app.post('/orders', async (req, res) => {
  const body = req.body;
  const start = Date.now();
  req.log.info({ orderId: body.id }, 'Processing order');
  try {
    const result = await orderService.create(body);
    req.log.info({ orderId: result.id, durationMs: Date.now() - start }, 'Order created');
    res.json(result);
  } catch (err) {
    req.log.error({ err, orderId: body.id }, 'Order creation failed');
    res.status(500).json({ error: 'internal_error' });
  }
});
```
Use the Node.js SDK with auto-instrumentation for HTTP, Express, and common DB clients. Add manual spans only for business-critical operations.
```typescript
// instrumentation.ts - must be loaded before any other module (Node --require flag)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  serviceName: process.env.SERVICE_NAME ?? 'my-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 15_000,
  }),
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // 10% head-based sampling
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
```
```typescript
// Manual span for a business operation (`stripe` is an assumed client instance)
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', async (span) => {
    span.setAttributes({ 'order.id': orderId, 'payment.amount': amount });
    try {
      const result = await stripe.charges.create({ amount, currency: 'usd' });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
Load `instrumentation.ts` before your app with `node --require ./dist/instrumentation.js server.js`. See `references/opentelemetry-setup.md` for exporters, processors, and Python setup.
Define SLIs from the user's perspective first, then map to metrics you can measure.
# slos.yaml - document alongside your service