Name: Incident Commander
Author: MDLDev-site

Buscar habilidades.../

Incident Commander | Skills Pool

"We're still working on it. Not sure when it'll be fixed."

"Update #3 - 10:45 AM
Root cause identified: Memory leak in user-service v2.5
Mitigation: Rolling back to v2.4 (in progress, 5 min ETA)
Impact: 30% of API requests failing
Next update: 11:00 AM"

"Who deployed this broken code? This is unacceptable!"

"We deployed a change that caused issues. Let's roll back now.
In the post-mortem, we'll examine how it passed tests and
improve our deployment pipeline. Focus on mitigation first."

Severity	Impact	Response Time	Examples	Communication Cadence
SEV1	Critical system down, revenue impact, data loss	Immediate, all hands	Payment processing down, data breach, total site outage	Every 15-30 min
SEV2	Major degradation, workaround exists	<30 min	Slow API (50% degradation), intermittent errors affecting 10%+ users	Every 30-60 min
SEV3	Minor impact, limited users	<2 hours	UI bug, non-critical feature broken, single region degraded	Hourly or on resolution
SEV4	Cosmetic, no user impact	Best effort	Typo, logging noise, internal tool slow	No stakeholder updates

def determine_severity(incident):
    """
    Calculate incident severity based on impact factors
    """
    # Critical factors
    if incident.revenue_impacted or incident.data_breach:
        return "SEV1"

    if incident.percent_users_affected >= 25:
        return "SEV1"

    if incident.core_functionality_down and not incident.workaround_exists:
        return "SEV1"

    # Major factors
    if incident.percent_users_affected >= 10:
        return "SEV2"

    if incident.performance_degradation >= 50:  # 50% slower
        return "SEV2"

    if incident.critical_customer_affected:  # Enterprise customer down
        return "SEV2"

    # Minor factors
    if incident.percent_users_affected >= 1:
        return "SEV3"

    if incident.non_critical_feature_broken:
        return "SEV3"

    # Cosmetic
    return "SEV4"

# Example usage:
incident = Incident(
    revenue_impacted=False,
    percent_users_affected=15,
    core_functionality_down=False,
    performance_degradation=60
)
severity = determine_severity(incident)  # Returns "SEV2"

                    ┌─────────────────────┐
                    │ Incident Commander  │ (Makes decisions, owns outcome)
                    └──────────┬──────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
┌────────▼─────────┐  ┌────────▼────────┐  ┌────────▼────────┐
│ Technical Lead   │  │ Communications  │  │     Scribe      │
│                  │  │      Lead       │  │                 │
└────────┬─────────┘  └─────────────────┘  └────���────────────┘
         │
         │ (Coordinates)
         │
    ┌────┴────┬────────┬────────┐
    │         │        │        │
┌───▼──┐  ┌──▼───┐ ┌──▼───┐ ┌──▼───┐
│ SME  │  │ SME  │ │ SME  │ │ SME  │
│ DB   │  │ Net  │ │ API  │ │ Infra│
└──────┘  └──────┘ └──────┘ └──────┘

# incident_response.py
from datetime import datetime
from enum import Enum

class Severity(Enum):
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"
    SEV4 = "SEV4"

class Incident:
    def __init__(self, title, severity, detected_at=None):
        self.title = title
        self.severity = severity
        self.detected_at = detected_at or datetime.now()
        self.declared_at = None
        self.commander = None
        self.timeline = []
        self.status = "detected"

    def declare(self, commander):
        """
        Officially declare incident and assign commander
        """
        self.declared_at = datetime.now()
        self.commander = commander
        self.status = "active"

        self.log_event(f"Incident declared by {commander}")

        # Auto-actions based on severity
        if self.severity == Severity.SEV1:
            self.page_oncall_team()
            self.create_war_room()
            self.post_to_status_page("Investigating")

        return self

    def log_event(self, description):
        """
        Log timestamped event to incident timeline
        """
        event = {
            "timestamp": datetime.now(),
            "description": description
        }
        self.timeline.append(event)
        print(f"[{event['timestamp'].strftime('%H:%M:%S')}] {description}")

# Usage:
incident = Incident(
    title="Payment API returning 500 errors",
    severity=Severity.SEV1
)
incident.declare(commander="[email protected]")
# Output: [10:05:23] Incident declared by [email protected]

┌──────────────────────────────────────┐
│ 1. What is the user impact?          │ → "Users cannot checkout" (clear, specific)
├──────────────────────────────────────┤
│ 2. How many users affected?          │ → "15% of checkout attempts" (quantified)
├──────────────────────────────────────┤
│ 3. Is this actively getting worse?   │ → "Yes, error rate climbing" (trend)
├──────────────────────────────────────┤
│ 4. What changed recently?            │ → "Deployment 30 min ago" (timeline)
├──────────────────────────────────────┤
│ 5. Can we mitigate immediately?      │ → "Rollback available" (mitigation path)
└──────────────────────────────────────┘

def choose_mitigation_strategy(incident_context):
    """
    Mitigation priority: Fast > Safe > Perfect
    """
    # Strategy 1: Rollback (fastest, safest)
    if incident_context.recent_deployment and incident_context.has_rollback:
        return {
            "action": "ROLLBACK",
            "command": "kubectl rollout undo deployment/payment-service",
            "eta_minutes": 5,
            "risk": "low",
            "rationale": "Recent deployment correlation, rollback tested in staging"
        }

    # Strategy 2: Failover (fast, medium risk)
    if incident_context.has_failover_replica:
        return {
            "action": "FAILOVER",
            "command": "aws rds failover-db-cluster --db-cluster-identifier prod",
            "eta_minutes": 10,
            "risk": "medium",
            "rationale": "Primary DB unresponsive, failover to replica"
        }

    # Strategy 3: Feature Flag / Circuit Breaker (fast, safe)
    if incident_context.feature_flagged:
        return {
            "action": "DISABLE_FEATURE",
            "command": "feature-flag disable --flag new-checkout-flow",
            "eta_minutes": 2,
            "risk": "low",
            "rationale": "Isolate failing feature, restore rest of service"
        }

    # Strategy 4: Traffic Shed (buys time, lossy)
    if incident_context.overloaded:
        return {
            "action": "RATE_LIMIT",
            "command": "kubectl scale deployment/api --replicas=20",
            "eta_minutes": 5,
            "risk": "medium",
            "rationale": "Scale up + enable rate limiting to prevent cascade"
        }

    # Strategy 5: Hot Fix (slowest, highest risk)
    if incident_context.fix_identified and incident_context.fix_simple:
        return {
            "action": "HOTFIX",
            "command": "git cherry-pick <fix-commit> && deploy prod",
            "eta_minutes": 20,
            "risk": "high",
            "rationale": "No rollback available, fix is one-line change"
        }

    # Escalate: No clear mitigation path
    return {
        "action": "ESCALATE",
        "command": "Page VP Engineering + Principal Engineers",
        "eta_minutes": None,
        "risk": "N/A",
        "rationale": "No immediate mitigation available, need expert assessment"
    }

# Example:
context = IncidentContext(
    recent_deployment=True,
    has_rollback=True,
    deployment_time_minutes_ago=25
)
strategy = choose_mitigation_strategy(context)
print(f"Mitigation: {strategy['action']} (ETA: {strategy['eta_minutes']} min)")
# Output: Mitigation: ROLLBACK (ETA: 5 min)

SEV1 Mitigation Protocol:
- [ ] Identify recent changes (deployments, configs, infra)
- [ ] Propose mitigation strategy to IC
- [ ] Get IC approval (verbal OK in war room)
- [ ] Execute mitigation (TL drives, SMEs assist)
- [ ] Monitor metrics for 5 minutes post-mitigation
- [ ] Confirm user impact reduced
- [ ] Update status page ("Identified" → "Monitoring")

# While mitigation is executing, investigation team gathers data

# 1. Recent deployments
git log --since="2 hours ago" --oneline
kubectl rollout history deployment/payment-service

# 2. Error logs (last 30 minutes)
kubectl logs -l app=payment-service --since=30m | grep -i error

# 3. Metrics dashboard
# Check Datadog/Grafana:
#   - Error rate spike
#   - Latency (p50, p95, p99)
#   - Database connections
#   - CPU/memory utilization

# 4. Distributed traces (find slow requests)
# Honeycomb/Jaeger: filter for slow traces (>5s)

# 5. Database queries
# Slow query log, lock waits, connection pool exhaustion

## Hypothesis #1: Database Connection Pool Exhaustion

**Evidence:**
- Error logs: "Unable to acquire connection from pool"
- Metrics: Connection pool usage at 100% (max 50 connections)
- Timing: Started 10:00 AM (aligns with deployment at 9:55 AM)

**Test:**
- Check connection pool config in new deployment
- Review code changes for connection leaks

**Result:**
- ✅ CONFIRMED: New code opens connections but doesn't close in error path
- Fix: Wrap DB calls in try/finally to ensure connection.close()

def validate_resolution(incident):
    """
    Validate all resolution criteria before declaring resolved
    """
    checks = {
        "metrics_baseline": check_metrics_healthy(),
        "alerts_cleared": check_no_active_alerts(),
        "user_reports_stopped": check_support_tickets_declining(),
        "root_cause_known": incident.root_cause is not None,
        "mitigation_stable": check_mitigation_stable_15min()
    }

    if all(checks.values()):
        incident.resolve()
        return True
    else:
        failed_checks = [k for k, v in checks.items() if not v]
        print(f"Cannot resolve yet. Failed checks: {failed_checks}")
        return False

# Example:
validate_resolution(incident)
# Output: Cannot resolve yet. Failed checks: ['mitigation_stable']
# (Wait for mitigation to be stable for 15 min before declaring resolved)

1. Update status page:
   "Resolved - All systems operational. We apologize for the inconvenience."

2. Notify stakeholders:
   - Internal: Slack #incidents
   - External: CEO, affected customers, sales team

3. Transition to post-mortem:
   - Schedule post-mortem meeting (within 48 hours)
   - Assign scribe to draft initial timeline
   - Identify action item owners

# Incident Post-Mortem: Payment Service Outage

**Date:** 2025-01-15
**Severity:** SEV1
**Duration:** 2h 15m (10:00 AM - 12:15 PM PST)
**Commander:** Alice Chen
**Participants:** Alice (IC), Bob (TL), Carol (Comms), Dave (Scribe), SRE Team

## Executive Summary
Payment service was unavailable for 2h 15m, affecting 15% of users (est. 5,000 failed checkouts).
Root cause: Database connection pool exhaustion due to connection leak in new deployment.
Mitigation: Rolled back deployment, then deployed fix with proper connection handling.
Estimated revenue impact: $50K in delayed transactions (recovered post-incident).

## Impact
- **Users affected:** ~5,000 (15% of checkout attempts)
- **Services degraded:** Payment API, Checkout flow
- **Revenue impact:** $50K delayed (recovered), $5K abandoned carts (lost)
- **Customer complaints:** 23 support tickets, 8 social media mentions

## Timeline (All times PST)

| Time | Event |
|------|-------|
| 09:55 AM | Deployment of payment-service v2.5 started |
| 10:00 AM | Alert: Payment API error rate > 10% |
| 10:02 AM | User reports: "Cannot complete checkout" |
| 10:05 AM | IC (Alice) declared SEV1, assembled team |
| 10:08 AM | Status page: "Investigating checkout issues" |
| 10:12 AM | TL identified: DB connection pool at 100% |
| 10:15 AM | Hypothesis: Connection leak in new deployment |
| 10:18 AM | Decision: Rollback to v2.4 |
| 10:23 AM | Rollback completed, monitoring metrics |
| 10:30 AM | Metrics improving: Error rate dropping |
| 10:45 AM | Error rate back to <1% (normal) |
| 11:00 AM | Code review: Found connection leak in error handling |
| 11:30 AM | Fix deployed (v2.5.1) with proper connection.close() |
| 12:00 PM | Monitoring stable, all metrics green |
| 12:15 PM | Incident resolved, status page updated |

## Root Cause (5 Whys)

1. **Why did checkouts fail?**
   Payment API returned 500 errors.

2. **Why did the API return 500 errors?**
   Could not acquire database connections.

3. **Why could it not acquire connections?**
   Connection pool was exhausted (50/50 connections in use).

4. **Why was the pool exhausted?**
   New code in v2.5 opened connections but didn't close them in error paths.

5. **Why didn't tests catch this?**
   Load tests didn't simulate error scenarios (only happy path).

**Root Cause:** Connection leak in payment-service v2.5, introduced in PR #1234.
Code opened DB connections in try block but only closed in success path,
not in exception handler (error path).

## What Went Well ✅
- **Fast detection:** Alert fired within 5 min of impact
- **Clear command structure:** Roles assigned immediately, no confusion
- **Effective communication:** Updates every 15 min, stakeholders informed
- **Quick rollback:** Automated rollback completed in 5 min
- **Blameless culture:** Team focused on fixing, not blaming

## What Went Wrong ❌
- **Insufficient load testing:** Load tests didn't include error scenarios
- **No connection pool monitoring:** No alert for pool exhaustion (reactively discovered)
- **Code review missed leak:** PR approved without catching resource leak
- **Deployment timing:** Deployed at 9:55 AM (peak traffic time)

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| **P0** | Add connection pool metrics to dashboard + alert at 80% | SRE Team | 2025-01-18 | ✅ Done |
| **P0** | Fix connection leak in payment-service (ensure finally block) | Team Alpha | 2025-01-17 | ✅ Done |
| **P1** | Add error scenario to load tests (DB unavailable, slow queries) | QA Team | 2025-01-25 | 🟡 In Progress |
| **P1** | Update deployment policy: No prod deploys 9 AM - 12 PM (peak) | DevOps | 2025-01-20 | ✅ Done |
| **P2** | Code review checklist: Add "Resource cleanup (connections, files)" | Engineering | 2025-02-01 | ⏳ Pending |
| **P2** | Chaos engineering drill: Simulate DB connection pool exhaustion | SRE Team | 2025-02-15 | ⏳ Pending |

## Lessons Learned
1. **Monitor resource pools proactively:** Always add metrics/alerts for bounded resources
   (connection pools, thread pools, file handles, memory).
2. **Load test unhappy paths:** Error scenarios often reveal resource leaks.
3. **Deployment timing matters:** Avoid deployments during peak traffic unless critical.
4. **Rollback is a superpower:** Fast, automated rollback saved 90+ min of debugging.

## Follow-Up
- 30-day review scheduled for 2025-02-15 to verify action items completed.
- Share learnings in Engineering All-Hands (2025-01-22).

1. [5 min]  Incident overview (Scribe presents)
2. [15 min] Timeline walkthrough (What happened when?)
3. [10 min] Root cause discussion (5 Whys)
4. [10 min] What went well / What went wrong
5. [15 min] Action items discussion (Assign owners, due dates)
6. [5 min]  Lessons learned + closing thoughts

🚨 SEV1 INCIDENT DECLARED 🚨

**System:** Payment Processing (payment-service)
**Impact:** Users cannot complete checkout (15% error rate)
**Started:** 10:00 AM PST

**Roles:**
- IC: @alice (Incident Commander - makes decisions)
- TL: @bob (Technical Lead - coordinates debugging)
- Comms: @carol (Status page updates, stakeholder notifications)
- Scribe: @dave (Timeline tracking)

**Status:** Investigating
**ETA:** Next update in 15 minutes (10:15 AM)

**War Room:** #incident-2025-01-15-payment (this channel)
**Zoom Bridge:** https://zoom.us/j/incident-war-room

📋 UPDATE #2 - 10:15 AM

**Investigation:**
- Root cause identified: DB connection pool exhausted
- Recent deployment (v2.5) suspected

**Mitigation:**
- Action: Rolling back to v2.4
- Status: In progress (ETA 5 min)
- Expected outcome: Restore service

**Impact:**
- Still 15% error rate
- Support tickets: 12 new reports

**Next update:** 10:30 AM

📋 UPDATE #3 - 10:30 AM

**Mitigation:**
- Rollback completed at 10:23 AM
- Monitoring: Error rate dropping (15% → 8% → 4%)

**Status:**
- Service recovering
- Monitoring for 15 min before declaring stable

**Next update:** 10:45 AM or on resolution

✅ INCIDENT RESOLVED - 12:15 PM

**Duration:** 2h 15m (10:00 AM - 12:15 PM PST)
**Impact:** ~5,000 users affected (15% of checkout attempts)
**Root Cause:** Database connection leak in deployment v2.5
**Resolution:** Rollback + deployed fix (v2.5.1)

**Outcome:**
- Service fully restored
- All metrics green
- Fix deployed to prevent recurrence

**Next Steps:**
- Post-mortem meeting: Tomorrow (Jan 16) at 2 PM
- Action items to be assigned and tracked

Thank you to @bob @carol @dave and SRE team for quick response!

Phase 1: Investigating (First 10 min)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🟡 Investigating
We are investigating reports of checkout issues.
Our team is actively working to identify the cause.
Updates will be posted here every 15 minutes.

Posted: Jan 15, 10:05 AM PST


Phase 2: Identified (Mitigation in progress)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🟠 Identified
We have identified a database issue affecting checkout.
Our team is deploying a fix. Estimated time to resolution: 30 minutes.

Posted: Jan 15, 10:15 AM PST
Updated: Jan 15, 10:30 AM PST


Phase 3: Monitoring (Fix deployed, watching)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🟢 Monitoring
The fix has been deployed and checkout functionality is recovering.
We are monitoring the situation to ensure stability.

Posted: Jan 15, 10:45 AM PST


Phase 4: Resolved
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Resolved
The issue has been fully resolved. All systems are operational.
Checkout is functioning normally. We sincerely apologize for the inconvenience.

If you experienced issues, delayed transactions are now processing automatically.
For questions, contact [email protected].

Posted: Jan 15, 12:15 PM PST

Subject: [SEV1] Payment Outage - Resolved

Hi [CEO],

Quick summary of today's SEV1 incident:

IMPACT:
- Payment service down 2h 15m (10:00 AM - 12:15 PM)
- 15% of users affected (~5,000 failed checkout attempts)
- Revenue: ~$50K delayed (now processing), ~$5K lost (abandoned carts)

ROOT CAUSE:
- Database connection leak in new deployment
- Load testing didn't catch error scenario

RESOLUTION:
- Rolled back deployment immediately
- Deployed fix to prevent recurrence
- All systems now stable

PREVENTION:
1. Added connection pool monitoring/alerts
2. Enhanced load tests to include error scenarios
3. Updated deployment timing policy (avoid peak hours)

POST-MORTEM:
Tomorrow (Jan 16) at 2 PM. Action items will be tracked.

Happy to discuss further if needed.

- Alice (Incident Commander)

Subject: Service Disruption Notice - January 15

Dear Customer,

We experienced a service disruption today (January 15) from 10:00 AM to 12:15 PM PST
that affected checkout functionality on our platform.

WHAT HAPPENED:
A technical issue with our payment service prevented some users from completing purchases.

IMPACT:
If you attempted to check out during this window, your transaction may have failed.
All delayed transactions have now been processed automatically.

RESOLUTION:
Our engineering team identified and resolved the issue. All systems are now fully operational.

WE'RE SORRY:
We sincerely apologize for any inconvenience this caused.
We've implemented additional monitoring to prevent similar issues in the future.

QUESTIONS:
If you have any concerns or questions, please contact our support team at
[email protected] or call 1-800-XXX-XXXX.

Thank you for your patience and understanding.

- The Engineering Team

Subject: [SEV1] Checkout Outage - Talking Points for Customers

Team,

We had a checkout outage today (10 AM - 12:15 PM PST). Here's what to tell customers:

CUSTOMER QUESTION: "Why couldn't I check out?"
YOUR ANSWER: "We had a temporary technical issue with our payment system
             that has been fully resolved. Your delayed transaction has been
             processed automatically. We apologize for the inconvenience."

CUSTOMER QUESTION: "Will this happen again?"
YOUR ANSWER: "We've implemented additional monitoring and safeguards to prevent
             similar issues. Our engineering team conducted a thorough review
             and made improvements to our systems."

CUSTOMER QUESTION: "Can I get a refund/credit for the inconvenience?"
YOUR ANSWER: [Escalate to manager for case-by-case decision]

IMPACT:
- 2h 15m outage
- ~15% of checkout attempts affected
- All transactions now processing normally

If you get questions, feel free to share this email or escalate to me.

- Alice (Engineering)

# Runbook: Database Failover (RDS)

**Purpose:** Failover primary RDS instance to standby replica

**Trigger Scenarios:**
- Primary database unresponsive (health check failures)
- High replication lag (>60 seconds)
- Planned maintenance (OS patches, upgrades)

**Prerequisites:**
- ✅ AWS Console access (Production account)
- ✅ AWS CLI configured (`aws configure list`)
- ✅ RDS credentials stored in 1Password (search "RDS prod")
- ✅ Slack access to #incidents channel
- ✅ IC approval (for unplanned failover)

## Steps

### 1. Verify Issue
**Confirm primary database is actually down before failing over.**

```bash
# Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier prod-primary \
  --query 'DBInstances[0].DBInstanceStatus'

# Expected output: "available" (healthy) or "failed" (unhealthy)

# Check CloudWatch metrics (last 10 min)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=prod-primary \
  --start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

# If connections = 0 and status = "failed", proceed to failover

🔄 DATABASE FAILOVER INITIATED

Instance: prod-primary → prod-replica
Reason: Primary unresponsive (health checks failing)
ETA: 2-5 minutes
Expected impact: Brief connection interruptions (~30 sec)

Proceeding with failover now.

# Failover to standby replica
aws rds failover-db-cluster \
  --db-cluster-identifier prod-cluster

# Expected output:
# {
#   "DBCluster": {
#     "Status": "failing-over",
#     ...
#   }
# }

# Watch failover status (poll every 10 seconds)
watch -n 10 'aws rds describe-db-clusters \
  --db-cluster-identifier prod-cluster \
  --query "DBClusters[0].Status"'

# Status progression:
# "failing-over" → "available" (failover complete)

# Verify new primary endpoint
aws rds describe-db-clusters \
  --db-cluster-identifier prod-cluster \
  --query 'DBClusters[0].Endpoint'

# Expected: prod-cluster.cluster-xyz.us-east-1.rds.amazonaws.com

# Test connectivity
psql -h prod-cluster.cluster-xyz.us-east-1.rds.amazonaws.com \
     -U admin -d production -c "SELECT 1;"

# Expected output: 1 row returned (connection successful)

# Check application error rates (should return to normal)
curl https://api.company.com/health

# Expected: {"status": "healthy", "database": "connected"}

# Check Datadog dashboard:
# https://app.datadoghq.com/dashboard/prod-overview
# Verify: Error rate <1%, Latency p95 <500ms


### 5.2 ChatOps Automation (Incident Bot)

```python
# incident_bot.py (Slack Bot for ChatOps)
from slack_sdk import WebClient
from datetime import datetime
import os

slack_client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

class IncidentBot:
    def start_incident(self, severity, title, commander):
        """
        /incident start sev1 "Payment API down"

        Auto-creates:
        - Incident channel (#incident-2025-01-15-payment)
        - Zoom bridge
        - Incident doc (Google Doc)
        - Pages on-call team
        """
        incident_id = f"INC-{datetime.now().strftime('%Y%m%d-%H%M')}"
        channel_name = f"incident-{datetime.now().strftime('%Y-%m-%d')}-{title[:20].lower().replace(' ', '-')}"

        # Create Slack channel
        response = slack_client.conversations_create(
            name=channel_name,
            is_private=False
        )
        channel_id = response["channel"]["id"]

        # Set channel topic
        slack_client.conversations_setTopic(
            channel=channel_id,
            topic=f"{severity}: {title} | IC: {commander} | Started: {datetime.now().strftime('%H:%M PST')}"
        )

        # Post initial message
        slack_client.chat_postMessage(
            channel=channel_id,
            text=f"""
🚨 {severity} INCIDENT DECLARED 🚨

**Incident ID:** {incident_id}
**Title:** {title}
**Commander:** {commander}
**Started:** {datetime.now().strftime('%Y-%m-%d %H:%M PST')}

**Next Steps:**
1. IC: Assign roles (TL, Comms, Scribe)
2. TL: Begin investigation
3. Comms: Update status page
4. Scribe: Track timeline

**Commands:**
- `/incident update "message"` - Post update
- `/incident resolve` - Mark resolved
- `/incident help` - Show all commands
            """
        )

        # Page on-call team (if SEV1/SEV2)
        if severity in ["SEV1", "SEV2"]:
            self.page_oncall(incident_id, title, channel_name)

        return {
            "incident_id": incident_id,
            "channel": channel_name,
            "zoom_bridge": self.create_zoom_bridge(),
            "incident_doc": self.create_incident_doc(incident_id, title)
        }

    def post_update(self, channel, update_text):
        """
        /incident update "Rollback in progress, ETA 5 min"
        """
        slack_client.chat_postMessage(
            channel=channel,
            text=f"📋 UPDATE - {datetime.now().strftime('%H:%M')}\n\n{update_text}"
        )

        # Auto-post to status page (if configured)
        self.update_status_page(update_text)

    def resolve_incident(self, channel):
        """
        /incident resolve

        Auto-triggers:
        - Status page update ("Resolved")
        - Post-mortem doc creation
        - Schedules post-mortem meeting
        """
        slack_client.chat_postMessage(
            channel=channel,
            text=f"""
✅ INCIDENT RESOLVED - {datetime.now().strftime('%H:%M')}

**Next Steps:**
1. Scribe: Finalize timeline in incident doc
2. IC: Schedule post-mortem (within 48 hours)
3. Team: Complete action items

Post-mortem doc: [Auto-generated link]
            """
        )

        # Archive channel after 7 days
        self.schedule_channel_archive(channel, days=7)

# Usage in Slack:
# /incident start sev1 "Payment API down"
# /incident update "Mitigation in progress"
# /incident resolve

# Game Day: Database Failover Drill

**Date:** 2025-01-20 (3rd Friday of month)
**Time:** 10:00 AM - 11:30 AM PST
**Participants:** SRE team, Backend engineers, IC rotation

**Objectives:**
1. Practice RDS failover procedure
2. Validate runbook accuracy
3. Test monitoring/alerting
4. Measure response time (target: <10 min detection → mitigation)

**Scenario:**
"Primary database becomes unresponsive due to disk IOPS exhaustion.
You are paged at 10:00 AM. Application is experiencing 50% error rate."

**Setup (Pre-Drill):**
```bash
# Chaos engineering: Throttle IOPS on primary RDS instance
aws rds modify-db-instance \
  --db-instance-identifier staging-primary \
  --iops 100 \  # Normal: 3000 IOPS
  --apply-immediately


### 6.2 Chaos Engineering Scenarios

```python
# chaos_scenarios.py
# Regular chaos tests to build resilience

class ChaosScenarios:
    """
    Run these in staging monthly, production quarterly (during low-traffic)
    """

    def scenario_1_kill_random_pods(self):
        """
        Scenario: Random pod termination (test auto-restart, health checks)
        """
        return {
            "name": "Pod Chaos",
            "command": "kubectl delete pod -l app=payment-service --random",
            "expected_behavior": "Kubernetes restarts pod, no user impact",
            "success_criteria": "Error rate stays <1%, p95 latency <500ms"
        }

    def scenario_2_network_latency(self):
        """
        Scenario: Inject 500ms latency to database calls
        """
        return {
            "name": "Network Latency",
            "command": "toxiproxy-cli toxic add -t latency -a latency=500 db-proxy",
            "expected_behavior": "Timeouts trigger circuit breaker, fallback to cache",
            "success_criteria": "Degraded mode activates, user sees stale data (acceptable)"
        }

    def scenario_3_dependency_failure(self):
        """
        Scenario: Third-party API (Stripe) returns 500 errors
        """
        return {
            "name": "Dependency Failure",
            "command": "mock-stripe --failure-rate 100",
            "expected_behavior": "Retry logic exhausts, fails gracefully with user message",
            "success_criteria": "User sees 'Payment temporarily unavailable, try again'"
        }

    def scenario_4_disk_full(self):
        """
        Scenario: /var/log fills up (test log rotation, disk alerts)
        """
        return {
            "name": "Disk Full",
            "command": "dd if=/dev/zero of=/var/log/fill bs=1M count=10000",
            "expected_behavior": "Alert fires, log rotation cleans up, app continues",
            "success_criteria": "Alert within 5 min, auto-remediation cleans disk"
        }

# Run chaos tests with controlled blast radius:
# - Staging: Full chaos (any scenario, any time)
# - Production: Controlled chaos (pre-approved scenarios, low-traffic windows)

# pagerduty-config.yml
escalation_policy:
  name: "Production Incidents"
  levels:
    - level: 1
      targets:
        - type: user
          user_id: "P123456"  # On-call SRE
      escalation_delay_minutes: 15

    - level: 2
      targets:
        - type: schedule
          schedule_id: "PABCDEF"  # Engineering Manager rotation
      escalation_delay_minutes: 15

    - level: 3
      targets:
        - type: user
          user_id: "P789012"  # VP Engineering
      escalation_delay_minutes: 0  # Immediate

alert_rules:
  - name: "High Error Rate"
    condition: "error_rate > 5% for 5 minutes"
    severity: "SEV1"
    escalation_policy: "Production Incidents"

# Update status page programmatically
import requests

def update_status_page(status, message):
    """
    status: "investigating" | "identified" | "monitoring" | "resolved"
    """
    api_key = os.environ["STATUSPAGE_API_KEY"]
    page_id = "abc123"

    response = requests.post(
        f"https://api.statuspage.io/v1/pages/{page_id}/incidents",
        headers={"Authorization": f"OAuth {api_key}"},
        json={
            "incident": {
                "name": "Checkout Issues",
                "status": status,
                "body": message,
                "component_ids": ["payment-api"],
                "impact_override": "major"  # minor | major | critical
            }
        }
    )
    return response.json()

# Usage:
update_status_page("investigating", "We are investigating reports of checkout issues.")

# datadog_metrics.py
from datadog import initialize, api

options = {
    'api_key': os.environ['DD_API_KEY'],
    'app_key': os.environ['DD_APP_KEY']
}
initialize(**options)

# Send incident metrics
api.Event.create(
    title="SEV1 Incident: Payment Outage",
    text="Duration: 2h 15m. Root cause: DB connection leak.",
    tags=["severity:sev1", "service:payment", "team:backend"],
    alert_type="error"
)

# Find slow traces during incident
# Query: duration_ms > 5000 AND service = "payment-api"
# Group by: endpoint, database_query
# Visualize: Heatmap of slow requests

Incident Commander

The Incident Commander

0. Core Principles (The Command Center)

Incident Commander

The Incident Commander

0. Core Principles (The Command Center)

1. Personality & Tone

1.1 Before vs. After

2. Incident Response Framework

2.1 Severity Levels (Detailed)

2.2 Incident Roles (Expanded)

2.3 Incident Command Structure

3. Incident Lifecycle (Deep Dive)

3.1 Detection & Declaration

3.2 Triage & Mitigation (The Golden Hour)

3.3 Investigation (Parallel Track)

3.4 Resolution & Validation

3.5 Post-Mortem (Learning Phase)

4. Communication Playbook (Expanded)

4.1 Internal Communication (War Room)

4.2 External Communication (Status Page)

4.3 Stakeholder Communication Templates

5. Runbooks & Automation (Production-Ready)

5.1 Runbook Template (Detailed)

2. Notify Team

3. Initiate Failover

4. Monitor Failover Progress

5. Validate New Primary

6. Check Application Health

7. Post-Failover Actions

Rollback

Troubleshooting

Contacts

Post-Execution

6. Drills & Practice (Game Days)

6.1 Monthly Game Day Template

7. Technology & Tools

7.1 Incident Management Platforms

7.2 Observability Stack

8. Optional Command Shortcuts

9. Mantras

Openai Whisper

Voice Call

Prose

Clawhub

Sherpa Onnx Tts

Openai Whisper Api