Name: Observability
Author: edercnj

Observability | Skills Pool

SLI Types

Type	Description	Use Case
Request-based	Ratio of good events to total events in a request/response model	HTTP services, gRPC endpoints, API calls
Window-based	Fraction of time windows where the metric meets a threshold	Batch jobs, background workers, scheduled tasks
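
The two SLI types differ only in what gets counted: individual events or whole time windows. The sketch below is a minimal illustration. The function names, the "lower is better" threshold direction, and the choice to treat an empty sample as meeting the objective are all assumptions made for the example.

```python
from typing import Sequence


def request_based_sli(good: int, total: int) -> float:
    """Ratio of good events to total events, as a fraction in [0, 1]."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good / total


def window_based_sli(window_values: Sequence[float], threshold: float) -> float:
    """Fraction of time windows whose metric is at or below the threshold."""
    if not window_values:
        return 1.0  # assumption: no windows counts as meeting the objective
    good_windows = sum(1 for value in window_values if value <= threshold)
    return good_windows / len(window_values)


# Request-based: 9,990 of 10,000 HTTP requests were good.
http_sli = request_based_sli(9_990, 10_000)

# Window-based: four one-minute windows of batch lag (seconds), 60 s threshold.
batch_sli = window_based_sli([12.0, 45.0, 75.0, 30.0], threshold=60.0)
```

In this example three of the four windows meet the threshold, so the window-based SLI is 0.75 even though nothing is said about how many individual jobs ran in each window.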

SLI Measurement Methods

Method	Description	Pros	Cons
Server-side	Measured at the application or load balancer	Easy to implement, low overhead	Misses client-perceived failures (DNS, network)
Client-side	Measured at the client or edge proxy	Reflects user experience accurately	Higher implementation complexity, data collection cost
Synthetic	Probes that simulate user requests at regular intervals	Detects issues before users report them	Does not reflect real traffic patterns

Window Types

Window	Description	Best For
Rolling	Sliding window (e.g., last 30 days from now)	Continuous monitoring, real-time dashboards
Calendar	Fixed calendar period (e.g., January 1-31)	Business reporting, compliance audits
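
The difference between the two window types comes down to how the start and end of the evaluation period are computed. The following sketch assumes a 30-day rolling window and a calendar-month window in UTC. The function names and the example date are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

Window = Tuple[datetime, datetime]


def rolling_window(now: datetime, days: int = 30) -> Window:
    """Sliding window: the last `days` days ending at `now`."""
    return now - timedelta(days=days), now


def calendar_window(now: datetime) -> Window:
    """Fixed window: the calendar month containing `now` (end is exclusive)."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if start.month == 12:
        end = start.replace(year=start.year + 1, month=1)
    else:
        end = start.replace(month=start.month + 1)
    return start, end


# Hypothetical evaluation time, chosen only for illustration.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
rolling = rolling_window(now)    # moves forward with every evaluation
calendar = calendar_window(now)  # stays fixed until the month rolls over
```

A rolling window gives a different answer every time it is evaluated, which suits dashboards. A calendar window resets the budget at a known boundary, which is easier to report against.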

SLI Specification Pattern

SLI = (good events / valid events) x 100%

Where:
- good events  = requests that meet the quality threshold
- valid events = all requests minus excluded categories
                 (health checks, internal probes, test traffic)
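
The formula above can be sketched in a few lines. The `Request` record, the category labels, and the quality threshold (non-5xx status and latency at or below 300 ms) are assumptions made for illustration. The source only says that good events meet "the quality threshold" without defining one.

```python
from dataclasses import dataclass
from typing import Iterable

# Categories excluded from valid events, mirroring the list above.
# The exact label strings are hypothetical.
EXCLUDED_CATEGORIES = {"health_check", "internal_probe", "test_traffic"}


@dataclass(frozen=True)
class Request:
    category: str
    status: int
    latency_ms: float


def is_good(request: Request, latency_threshold_ms: float = 300.0) -> bool:
    """Assumed quality threshold: no server error and fast enough."""
    return request.status < 500 and request.latency_ms <= latency_threshold_ms


def sli_percent(requests: Iterable[Request]) -> float:
    """SLI = (good events / valid events) x 100%."""
    valid = [r for r in requests if r.category not in EXCLUDED_CATEGORIES]
    if not valid:
        return 100.0  # assumption: no valid traffic counts as meeting the SLI
    good = sum(1 for r in valid if is_good(r))
    return good / len(valid) * 100


sample = [
    Request("user", 200, 120.0),
    Request("user", 200, 450.0),        # too slow: valid but not good
    Request("user", 503, 80.0),         # server error: valid but not good
    Request("user", 200, 90.0),
    Request("health_check", 200, 5.0),  # excluded from valid events
]
sli = sli_percent(sample)
```

Here the health check is dropped before counting, leaving four valid events of which two are good.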

Error Budget Calculation

Error Budget = (1 - SLO Target) x Time Window

Examples:
  99.9% SLO over 30 days = 0.1% x 43,200 min = 43.2 min downtime allowed
  99.95% SLO over 30 days = 0.05% x 43,200 min = 21.6 min downtime allowed
  99.99% SLO over 30 days = 0.01% x 43,200 min = 4.32 min downtime allowed
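
The arithmetic in these examples can be checked with a small helper. This is a sketch. The function name and the percent-based argument are choices made for the example.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int = 30) -> float:
    """Error Budget = (1 - SLO Target) x Time Window, expressed in minutes."""
    if not 0 < slo_target_percent <= 100:
        raise ValueError("SLO target must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1 - slo_target_percent / 100) * window_minutes


for target in (99.9, 99.95, 99.99):
    budget = error_budget_minutes(target)
    print(f"{target}% SLO over 30 days: {budget:.2f} min of downtime allowed")
```

Rounded to two decimals, the loop reproduces the 43.2, 21.6, and 4.32 minute figures listed above.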

Burn Rate Alerts

Alert Type	Burn Rate	Window	Severity	Action
Fast Burn	14.4x	1 hour	P1 — Page	Immediate response, on-call paged
Slow Burn	6x	6 hours	P2 — Ticket	Investigation within 30 min (business hours)
Steady Burn	3x	3 days	P3 — Notification	Review in next working session
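
The table does not define burn rate. The sketch below uses the common definition, observed error rate divided by the error rate the SLO allows, and evaluates the three thresholds from the table. The 30-day budget period is an assumption, as are the function names and the sample error rates.

```python
from typing import Dict, List

# (name, burn-rate threshold, window in hours, severity), from the table above.
BURN_RATE_ALERTS = [
    ("Fast Burn", 14.4, 1, "P1"),
    ("Slow Burn", 6.0, 6, "P2"),
    ("Steady Burn", 3.0, 72, "P3"),
]


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_rate / (1.0 - slo_target)


def budget_fraction_consumed(rate: float, window_hours: float,
                             period_hours: float = 30 * 24) -> float:
    """Share of the whole budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / period_hours


def firing_alerts(error_rate_by_window: Dict[int, float],
                  slo_target: float) -> List[str]:
    """Return the alerts whose window's burn rate meets its threshold."""
    firing = []
    for name, threshold, window_hours, severity in BURN_RATE_ALERTS:
        observed = error_rate_by_window.get(window_hours)
        if observed is not None and burn_rate(observed, slo_target) >= threshold:
            firing.append(f"{severity}: {name}")
    return firing


# Hypothetical error rates measured over the 1 h, 6 h, and 72 h windows.
observed = {1: 0.02, 6: 0.004, 72: 0.001}
alerts = firing_alerts(observed, slo_target=0.999)
```

With a 99.9% SLO, a 2% error rate over the last hour is a burn rate of about 20x, which crosses the fast-burn threshold. Sustained for the full hour, a 14.4x burn spends roughly 2% of a 30-day budget, which is why that threshold pages.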

Exhaustion Policy (Escalation Ladder)

Budget Consumed	Action	Owner
< 50%	Normal operations	Engineering Team
50%	Review recent deployments and changes	Tech Lead
75%	Freeze non-critical deployments	Engineering Manager
90%	Prioritize reliability work exclusively	SRE + Engineering Manager
100%	Freeze all deployments; incident response	VP Engineering
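
One way to apply this ladder programmatically is a threshold lookup. The sketch assumes each percentage in the table is a lower bound that stays in force until the next rung is reached. The table does not state this explicitly, so treat it as an interpretation.

```python
from typing import Tuple

# (lower bound in percent, action, owner), highest rung first.
ESCALATION_LADDER = [
    (100.0, "Freeze all deployments; incident response", "VP Engineering"),
    (90.0, "Prioritize reliability work exclusively", "SRE + Engineering Manager"),
    (75.0, "Freeze non-critical deployments", "Engineering Manager"),
    (50.0, "Review recent deployments and changes", "Tech Lead"),
    (0.0, "Normal operations", "Engineering Team"),
]


def policy_for(budget_consumed_percent: float) -> Tuple[str, str]:
    """Return (action, owner) for the highest rung that has been reached."""
    if budget_consumed_percent < 0:
        raise ValueError("budget consumed cannot be negative")
    for lower_bound, action, owner in ESCALATION_LADDER:
        if budget_consumed_percent >= lower_bound:
            return action, owner
    raise AssertionError("unreachable: the 0% rung matches every valid input")


action, owner = policy_for(80.0)
```

At 80% consumed, the lookup lands on the 75% rung, so non-critical deployments are frozen and the Engineering Manager owns the decision.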

Alert Routing by Severity

Severity	Route	Response Time	Escalation
P1 — Critical	PagerDuty on-call	< 5 min ACK	Auto-escalate after 15 min
P2 — High	Slack #sre-alerts + Jira ticket	< 30 min (business hours)	Escalate after 2 hours
P3 — Medium	Slack #sre-notifications	Next business day	Review in weekly SRE meeting
P4 — Low	Dashboard only	Best effort	Batch review monthly
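
The routing table can be encoded as plain data with a small escalation check. This is a sketch only: it does not call PagerDuty, Slack, or Jira, and the `Route` type and `should_escalate` helper are hypothetical. Only P1 and P2 have a timed escalation in the table, so P3 and P4 are modelled with no timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    channel: str
    escalate_after_min: Optional[int]  # None means no timed escalation


ROUTES = {
    "P1": Route("PagerDuty on-call", 15),
    "P2": Route("Slack #sre-alerts + Jira ticket", 120),
    "P3": Route("Slack #sre-notifications", None),
    "P4": Route("Dashboard only", None),
}


def should_escalate(severity: str, minutes_unacknowledged: float) -> bool:
    """True once an unacknowledged alert has passed its escalation timer."""
    route = ROUTES[severity]  # raises KeyError for an unknown severity
    if route.escalate_after_min is None:
        return False
    return minutes_unacknowledged >= route.escalate_after_min


p1_escalates = should_escalate("P1", 20)  # past the 15-minute auto-escalation
p2_escalates = should_escalate("P2", 45)  # still inside the 2-hour window
```

Keeping the table as data makes it easy to review alongside the policy and to change a timer without touching the logic.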

Alert Fatigue Prevention

Strategy	Description
Deduplication	Group identical alerts within a 5-minute window
Grouping	Aggregate related alerts (same service, same SLI) into a single notification
Silencing	Suppress known-noisy alerts during maintenance or known degradation
Flap Detection	Suppress alerts that rapidly toggle between firing and resolved
Actionability Review	Monthly review: every alert must have a runbook; remove alerts nobody acts on
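
The deduplication strategy can be illustrated with a short sketch. It assumes alerts arrive sorted by time as `(timestamp_seconds, fingerprint)` pairs and that "identical" means an equal fingerprint string. Both are assumptions for the example, and real alert managers typically offer this behaviour as configuration rather than code.

```python
from typing import Dict, Iterable, List, Tuple

DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute window from the table above

Alert = Tuple[float, str]  # (timestamp in seconds, fingerprint)


def deduplicate(alerts: Iterable[Alert]) -> List[Alert]:
    """Keep the first alert per fingerprint in each 5-minute window."""
    last_sent: Dict[str, float] = {}
    kept: List[Alert] = []
    for timestamp, fingerprint in alerts:
        previous = last_sent.get(fingerprint)
        if previous is None or timestamp - previous >= DEDUP_WINDOW_SECONDS:
            kept.append((timestamp, fingerprint))
            last_sent[fingerprint] = timestamp
        # otherwise: an identical alert was sent recently, so suppress it
    return kept


incoming = [
    (0, "api:latency"),
    (60, "api:latency"),   # suppressed: 60 s after the first
    (120, "db:errors"),
    (299, "api:latency"),  # suppressed: still inside the window
    (300, "api:latency"),  # kept: a full window has passed
    (400, "db:errors"),    # suppressed: 280 s after the first db alert
]
notifications = deduplicate(incoming)
```

Six incoming alerts collapse to three notifications. The window is measured from the last notification actually sent, so a continuously firing alert still produces one reminder every five minutes.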

Observability

Knowledge Pack: Observability

Purpose

Quick Reference (always in context)

Detailed References

SLO/SLI Framework

SLI Types

SLI Measurement Methods

Window Types

SLI Specification Pattern

Error Budget Management

Error Budget Calculation

Burn Rate Alerts

Exhaustion Policy (Escalation Ladder)

Budget Allocation Between Teams

Alerting Strategy

Alert Routing by Severity

PagerDuty/OpsGenie Integration Patterns

On-Call Rotation Alerting

Alert Fatigue Prevention
