Name: Observability And Monitoring
Author: nickgallick

Observability & Monitoring

The Three Pillars

1. Logs (Structured JSON with pino)

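import { randomUUID } from 'node:crypto'
import pino from 'pino'

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    // Emit the level as a string label ("info") instead of pino's numeric default (30)
    level: (label) => ({ level: label }),
  },
  // Values at these paths are replaced with "[Redacted]" before the line is written
  redact: ['req.headers.authorization', 'password', 'token', '*.secret'],
})

// Middleware: generate requestId, thread it through the request lifecycle
function withRequestId(handler) {
  return async (req, res) => {
    const requestId = randomUUID()
    // Every line logged through req.log carries requestId and userId
    req.log = logger.child({ requestId, userId: req.user?.id })
    req.log.info({ method: req.method, path: req.url }, 'request_start')
    const start = performance.now()
    try {
      // await so async handlers are timed to completion, not to promise creation
      return await handler(req, res)
    } finally {
      // Runs on success and on thrown errors, so every request_start has a request_end
      req.log.info({ durationMs: Math.round(performance.now() - start) }, 'request_end')
    }
  }
}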

Required fields: timestamp, level, message, and the request context attached by the child logger (requestId, userId)

Never log: tokens, passwords, PII, full request bodies, credit card numbers
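
A redact list only covers the paths someone remembered to add. One way to keep tokens, passwords, PII, full request bodies, and card numbers out of logs is to copy an explicit allow-list of fields into each log line. The sketch below is illustrative only: `LOGGABLE_REQUEST_FIELDS`, `toLogContext`, and the `bodyBytes` field are hypothetical names, not part of pino or of the original setup.

```javascript
// Hypothetical allow-list: only these request properties are ever copied into a log line.
const LOGGABLE_REQUEST_FIELDS = ['method', 'url']

// Build a log-safe summary of a request. Headers and bodies are never copied,
// so secrets inside them cannot leak even when a redact path is missing.
function toLogContext(req) {
  const context = {}
  for (const field of LOGGABLE_REQUEST_FIELDS) {
    if (req[field] !== undefined) context[field] = req[field]
  }
  // Record the size of the body rather than the body itself.
  if (typeof req.body === 'string') {
    context.bodyBytes = Buffer.byteLength(req.body)
  }
  return context
}

const example = toLogContext({
  method: 'POST',
  url: '/login',
  headers: { authorization: 'Bearer abc' },
  body: '{"password":"hunter2"}',
})
console.log(example) // contains method, url, and bodyBytes only
```

The result could be passed as the first argument to `req.log.info(...)` in place of a raw request object.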

2. Metrics

Category	Metric	Alert Threshold
Request rate	req/s per endpoint	Sudden drop >50% = incident
Error rate	5xx / total requests	>1% = urgent, >5% = page
Latency	p50, p95, p99 per endpoint	p99 >5s = urgent
Saturation	DB connections, memory, CPU	>80% = urgent
Business	challenges/hour, entries/day, coins/day	Anomaly detection
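
To make the request rate, error rate, and latency rows concrete, here is a minimal sketch of how they could be derived from raw request samples for one endpoint. The sample shape (`status`, `durationMs`), the function names, and the nearest-rank percentile method are assumptions made for illustration; a production system would normally get these figures from its metrics backend rather than computing them by hand.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
// Expects an array already sorted in ascending order.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN
  const rank = Math.ceil((p / 100) * sortedValues.length)
  return sortedValues[Math.max(rank, 1) - 1]
}

// Summarize one endpoint's requests over a window of `windowSeconds`.
// Each sample is a hypothetical record: { status: number, durationMs: number }.
function summarizeWindow(samples, windowSeconds) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b)
  const serverErrors = samples.filter((s) => s.status >= 500).length
  return {
    requestRate: samples.length / windowSeconds, // req/s
    errorRate: samples.length === 0 ? 0 : serverErrors / samples.length, // 5xx / total
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  }
}

const samples = [
  { status: 200, durationMs: 120 },
  { status: 200, durationMs: 80 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 150 },
]
const summary = summarizeWindow(samples, 60)
console.log(summary)
```

With only four samples the p95 and p99 both land on the slowest request, which is why tail percentiles are only meaningful over windows with enough traffic.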

Alerting Without Fatigue

Severity	Criteria	Action	Example
Page	Error rate >5% for 5min, health checks failing, payments down	Wake on-call	Judge API returning 500s
Urgent	Error rate >1%, p99 >5s, DB connections >80%	Fix today	Slow leaderboard queries
Warning	p95 >2s, error rate >0.5%, new error type	Fix this week	New error in transcript parser
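
The severity table can be read as an ordered set of rules, checked from most to least severe. The sketch below encodes it that way. The input shape is hypothetical (rates as fractions, latencies in milliseconds, `errorRateSustainedMinutes` for how long the current error rate has held), and the `'ok'` result for "nothing fired" is an addition that the table does not name.

```javascript
// Classify a metrics snapshot using the thresholds from the severity table.
// All field names are illustrative assumptions, not an existing API.
function classifySeverity(m) {
  // Page: error rate >5% for 5 min, health checks failing, or payments down
  if (m.healthChecksFailing || m.paymentsDown) return 'page'
  if (m.errorRate > 0.05 && m.errorRateSustainedMinutes >= 5) return 'page'
  // Urgent: error rate >1%, p99 >5s, or DB connections >80%
  if (m.errorRate > 0.01 || m.p99Ms > 5000 || m.dbConnectionUsage > 0.8) return 'urgent'
  // Warning: p95 >2s, error rate >0.5%, or a new error type
  if (m.p95Ms > 2000 || m.errorRate > 0.005 || m.newErrorType) return 'warning'
  return 'ok'
}

console.log(classifySeverity({ errorRate: 0.08, errorRateSustainedMinutes: 6 }))
console.log(classifySeverity({ errorRate: 0.02, p99Ms: 1200 }))
console.log(classifySeverity({ errorRate: 0.001, p95Ms: 2500 }))
```

Checking the most severe rule first means a snapshot that satisfies several rows produces a single alert at the highest matching level, which helps keep alert volume down.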
