Name: Site Reliability Engineer
Author: MDLDev-site

Site Reliability Engineer | Skills Pool

API Availability SLO: 99.9% of requests succeed over 30 days
    = 43 minutes of downtime per month allowed

# SLO: API Availability

**SLI:** % of HTTP requests returning 2xx/3xx status
**Target:** 99.9% over 30 days
**Measurement window:** Rolling 30 days
**Error budget:** 0.1% = 43 minutes downtime/month

**What happens if we breach:**
- <99.9%: Feature freeze, focus on reliability
- >99.9%: Spend error budget on velocity (ship faster)

Month: January
SLO: 99.9% availability
Error budget: 0.1% = 43 minutes

Week 1: 10 min downtime (23% consumed)
Week 2: 5 min downtime (12% consumed)
Week 3: 20 min downtime (47% consumed)
Week 4: 2 min downtime (5% consumed)

Total: 37 min downtime (86% consumed)
Remaining: 6 min (14% left)

Action: Slow down deployments, focus on stability fixes

{
  "timestamp": "2025-01-27T03:15:42Z",
  "level": "ERROR",
  "service": "api-gateway",
  "trace_id": "abc-123-def",
  "message": "Database connection timeout",
  "error": "psycopg2.OperationalError: timeout",
  "user_id": 456,
  "endpoint": "/api/checkout"
}

# Prometheus metrics (Golden Signals)
http_requests_total.labels(status='200').inc()  # Traffic
http_request_duration_seconds.observe(0.234)     # Latency
http_requests_failed_total.labels(status='500').inc()  # Errors
db_connections_active.set(42)  # Saturation

Request: POST /api/checkout (500ms total)
├─ [API Gateway] 10ms
├─ [Auth Service] 15ms
├─ [Payment Service] 450ms ◄── Bottleneck!
│  ├─ [Database Query] 420ms ◄── Root cause
│  └─ [Stripe API] 20ms
└─ [Notifications] 5ms

# Latency p99
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
)

# Error rate (5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Saturation (CPU)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Option 1: Rollback (safest, fastest)

# Rollback to previous version
kubectl rollout undo deployment/api-gateway

Option 2: Scale (if capacity issue)

# Scale up replicas
kubectl scale deployment/api-gateway --replicas=10

Option 3: Hotfix (if rollback not possible)

# Deploy emergency fix
kubectl set image deployment/api-gateway api=v1.2.3-hotfix

## Incident Update (15 min)

**Status:** INVESTIGATING
**Impact:** API latency elevated (p99: 2s, SLO: 500ms)
**Affected:** 20% of users (checkout flow)
**Action:** Rolled back deployment v1.2.3 → v1.2.2
**ETA:** Monitoring for 15 min, expect recovery by 3:45 AM

# Postmortem: API Latency Incident (2025-01-27)

## Summary
On 2025-01-27 at 03:15 UTC, API latency increased from 200ms p99 → 2s p99, affecting 20% of users for 30 minutes. Root cause: Database connection pool exhausted due to slow query introduced in v1.2.3.

## Impact
- **Duration:** 30 minutes (03:15 - 03:45 UTC)
- **Users affected:** 20% (checkout flow)
- **Error budget consumed:** 5% of monthly budget

## Timeline
- **03:15:** Alert fired: "API latency p99 > 1s"
- **03:18:** On-call engineer (Alice) paged
- **03:20:** Incident channel created, investigation started
- **03:25:** Identified recent deploy (v1.2.3) as suspect
- **03:30:** Rolled back to v1.2.2
- **03:35:** Latency recovered to 200ms p99
- **03:45:** Incident closed

## Root Cause
Deployment v1.2.3 introduced a new database query without proper indexing:
```sql
-- Slow query (missing index on created_at)
SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 10;


⸻

## 5. On-Call Best Practices

### 5.1 On-Call Rotation

**Typical schedule:**
- **Primary on-call:** First responder (gets paged)
- **Secondary on-call:** Backup (escalation if primary unavailable)
- **Rotation:** 1 week shifts (Mon 9am - Mon 9am)

**Handoff checklist:**
```markdown
## On-Call Handoff (2025-01-27)

### Ongoing Incidents
- None (all clear)

### Recent Changes
- Deployed v1.2.3 to production (2025-01-26 14:00)
- Scaled database read replicas 2 → 4 (2025-01-25 10:00)

### Known Issues
- Redis cache occasionally slow (non-critical, debugging)

### Upcoming
- Planned maintenance: Database upgrade (2025-01-30 02:00 UTC)

### Runbooks to Review
- /runbooks/database-connection-pool-exhausted.md
- /runbooks/redis-failover.md

# Kubernetes liveness probe (auto-restart if unhealthy)
apiVersion: v1

Site Reliability Engineer

The Site Reliability Engineer (SRE)

0. Core Principles (The SRE Way)

Site Reliability Engineer

The Site Reliability Engineer (SRE)

0. Core Principles (The SRE Way)

1. Personality & Tone

Before vs After

1.1 SRE Voice

2. SLIs, SLOs, and Error Budgets

2.1 SLI (Service Level Indicator)

2.2 SLO (Service Level Objective)

2.3 Error Budget

3. Observability (Logs, Metrics, Traces)

3.1 The Three Pillars

3.2 Golden Signals (Google SRE Book)

4. Incident Response

4.1 Incident Roles

4.2 Incident Response Workflow

4.3 Blameless Postmortem Template

What Went Well

What Went Wrong

Action Items

Lessons Learned

5.2 Reducing Toil

Defi Amm Security

Nodejs Keccak256

Syncable Entity Builder And Validation

Nft Standards

Solidity Security

Defi Protocol Templates