Skill File

SLO Definition Workshop

Name: Slo Definition Workshop
Author: cloudthinker-ai

SLO definition workflow covering SLI selection, target setting, error budget policy, alerting strategy, and stakeholder alignment. Use when establishing SLOs for a new service, revising existing targets, or implementing an SRE practice.

cloudthinker-ai | 0 stars | Mar 23, 2026

Occupation
Categories: Sales & Marketing

Skill Content

SLO Definition Workshop Skill

Define SLOs for {{ service_name }} ({{ service_tier }}).

Workflow

Step 1 — Service Context

Gather context about the service:

SERVICE PROFILE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Service: {{ service_name }}
Tier: {{ service_tier }}
Stakeholders: {{ stakeholders | "TBD" }}

User journeys this service supports:
1. [journey] — [criticality: critical/important/nice-to-have]
2. [journey] — [criticality]
3. [journey] — [criticality]

Current state:
  - Existing SLOs: YES / NO
  - Historical availability (last 30 days): ___%
  - Historical P95 latency (last 30 days): ___ms
  - Incident frequency (last 90 days): ___

Step 2 — SLI Selection

Choose the right SLIs for each user journey:

SLI CANDIDATES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
AVAILABILITY SLI
  Definition: Proportion of valid requests served successfully
  Formula: (successful requests) / (total valid requests) * 100
  What counts as "successful": HTTP 2xx and 3xx responses
  What counts as "valid": All requests excluding health checks and 4xx client errors (both are left out of the denominator)
  [ ] SELECTED — YES / NO

LATENCY SLI
  Definition: Proportion of requests faster than threshold
  Formula: (requests < threshold) / (total requests) * 100
  Threshold: ___ms (P50) / ___ms (P95) / ___ms (P99)
  Measurement point: [server-side / client-side / edge]
  [ ] SELECTED — YES / NO

CORRECTNESS SLI (if applicable)
  Definition: Proportion of requests returning correct results
  Formula: (correct responses) / (total responses) * 100
  How "correct" is defined: [validation method]
  [ ] SELECTED — YES / NO

FRESHNESS SLI (for data pipelines)
  Definition: Proportion of data updated within threshold
  Formula: (records updated within X min) / (total records) * 100
  Freshness threshold: ___ minutes
  [ ] SELECTED — YES / NO

THROUGHPUT SLI (if applicable)
  Definition: Proportion of time throughput is above minimum
  Formula: (minutes above threshold) / (total minutes) * 100
  Minimum throughput: ___ rps
  [ ] SELECTED — YES / NO
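
The availability and latency formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Request` record and the function names are hypothetical, and it assumes request data is already available as in-memory records rather than coming from a metrics backend. Excluding health checks from the latency denominator is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int                 # HTTP status code
    latency_ms: float           # latency at the chosen measurement point
    is_health_check: bool = False


def availability_sli(requests: List[Request]) -> Optional[float]:
    """(successful requests) / (total valid requests) * 100."""
    # Valid = not a health check and not a 4xx client error.
    valid = [r for r in requests
             if not r.is_health_check and not (400 <= r.status < 500)]
    if not valid:
        return None  # SLI is undefined without valid traffic
    good = sum(1 for r in valid if 200 <= r.status < 400)
    return good / len(valid) * 100


def latency_sli(requests: List[Request], threshold_ms: float) -> Optional[float]:
    """(requests < threshold) / (total requests) * 100."""
    counted = [r for r in requests if not r.is_health_check]
    if not counted:
        return None
    fast = sum(1 for r in counted if r.latency_ms < threshold_ms)
    return fast / len(counted) * 100


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 900.0),
    Request(404, 30.0),                        # 4xx: not counted for availability
    Request(200, 5.0, is_health_check=True),   # health check: ignored everywhere
]
print(round(availability_sli(sample), 2))  # 66.67
print(latency_sli(sample, 500.0))          # 75.0
```

In the sample, two of three valid requests succeed, and three of four non-health-check requests finish under 500 ms.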

Step 3 — SLO Target Setting

SLO TARGETS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
| SLI | Target | Window | Rationale |
|-----|--------|--------|-----------|
| Availability | ___% | 30 days rolling | [why this target] |
| Latency (P95) | ___% < ___ms | 30 days rolling | [why this target] |
| Latency (P99) | ___% < ___ms | 30 days rolling | [why this target] |
| [Other SLI] | ___% | 30 days rolling | [why this target] |

GUIDANCE BY TIER:
  Tier 1 (critical): 99.9% availability (43 min/month budget)
  Tier 2 (important): 99.5% availability (3.6 hrs/month budget)
  Tier 3 (internal): 99.0% availability (7.2 hrs/month budget)

IMPORTANT: An SLO should be aspirational but achievable.
  - Too aggressive: constant alert fatigue, team burnout
  - Too lenient: users experience poor reliability before the SLO is violated
  - Rule of thumb: set slightly above historical performance

Step 4 — Error Budget Policy

ERROR BUDGET POLICY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
30-day error budget for {{ service_name }}:

| SLO Target | Budget (minutes/month) | Budget (requests/month*) |
|-----------|----------------------|------------------------|
| 99.99% | 4.3 min | 0.01% of requests |
| 99.9% | 43.2 min | 0.1% of requests |
| 99.5% | 216 min (3.6 hrs) | 0.5% of requests |
| 99.0% | 432 min (7.2 hrs) | 1.0% of requests |
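
The budget figures in the table follow directly from the SLO target and the 30-day window. The sketch below shows the arithmetic; the function names and the example request volume are illustrative assumptions, not part of the workshop template.

```python
WINDOW_DAYS = 30  # matches the 30-day window used in the SLO targets


def budget_minutes(slo_target_pct: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of full outage the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60   # 43,200 minutes for 30 days
    return (100 - slo_target_pct) / 100 * window_minutes


def budget_requests(slo_target_pct: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates for a given request volume."""
    return (100 - slo_target_pct) / 100 * total_requests


for target in (99.99, 99.9, 99.5, 99.0):
    minutes = round(budget_minutes(target), 1)
    # 10 million requests per month is an assumed volume for illustration.
    failures = round(budget_requests(target, 10_000_000))
    print(f"{target}% -> {minutes} min, {failures} failed requests")
```

Rounded to one decimal place, the minute values match the table rows (4.3, 43.2, 216.0, and 432.0 minutes).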

BUDGET STATES AND ACTIONS:
  Budget > 50% remaining:
    [ ] Normal development velocity
    [ ] Feature work prioritized
    [ ] Experimentation encouraged

  Budget 20-50% remaining:
    [ ] Increased caution with risky changes
    [ ] Reliability improvements prioritized alongside features
    [ ] Review recent incidents for patterns

  Budget < 20% remaining:
    [ ] Feature freeze for this service
    [ ] All engineering effort on reliability
    [ ] Incident review required for every budget-consuming event

  Budget exhausted (0%):
    [ ] Full feature freeze
    [ ] Mandatory reliability sprint
    [ ] Executive review of service health
    [ ] Postmortem required before resuming feature work
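
The budget states above can be expressed as a small classifier, which may help when wiring the policy into a dashboard or release gate. This is a sketch under assumptions: the state labels (`healthy`, `caution`, `critical`, `exhausted`) are hypothetical names for the four bands, and both boundary values (20% and 50%) are assigned to the middle band.

```python
def budget_remaining_pct(budget_total: float, budget_consumed: float) -> float:
    """Percentage of the error budget still available, floored at 0."""
    remaining = (budget_total - budget_consumed) / budget_total * 100
    return max(remaining, 0.0)


def budget_state(remaining_pct: float) -> str:
    """Map remaining budget to one of the four policy bands."""
    if remaining_pct <= 0:
        return "exhausted"   # full feature freeze, reliability sprint
    if remaining_pct < 20:
        return "critical"    # feature freeze for this service
    if remaining_pct <= 50:
        return "caution"     # increased caution with risky changes
    return "healthy"         # normal development velocity


# Example: a 432-minute budget with 108 minutes already consumed.
remaining = budget_remaining_pct(432, 108)
print(remaining, budget_state(remaining))  # 75.0 healthy
```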

BUDGET WINDOW: 30-day rolling window (budget recovers continuously as older events age out; there is no hard monthly reset)

Step 5 — Alerting Strategy

SLO-BASED ALERTING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Multi-window, multi-burn-rate alerting:

FAST BURN (page immediately):
  - Window: 5 min
  - Burn rate: 14.4x (exhausts 30-day budget in 2 days)
  - Action: PAGE on-call

SLOW BURN (respond within hours):
  - Window: 6 hours
  - Burn rate: 6x (exhausts budget in 5 days)
  - Action: PAGE on-call during business hours

STEADY BURN (ticket):
  - Window: 3 days
  - Burn rate: 1x (on track to exhaust budget)
  - Action: CREATE TICKET, review in next sprint
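
The burn rates above relate the observed error ratio to the error ratio the SLO allows: a burn rate of 1x consumes the budget in exactly one window, and higher rates consume it proportionally faster. The following sketch shows that arithmetic and a simple rule check. The rule table mirrors the three tiers above; the function names and the window keys (`5m`, `6h`, `3d`) are hypothetical, and it assumes burn rates per window have already been measured by the monitoring platform.

```python
from typing import Dict, List, Tuple


def burn_rate(error_ratio: float, slo_target_pct: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = (100 - slo_target_pct) / 100
    return error_ratio / allowed


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole window's budget is consumed at a constant burn rate."""
    return float("inf") if rate <= 0 else window_days / rate


# (name, window, burn-rate threshold, action), taken from the tiers above.
RULES: List[Tuple[str, str, float, str]] = [
    ("fast", "5m", 14.4, "page on-call"),
    ("slow", "6h", 6.0, "page on-call during business hours"),
    ("steady", "3d", 1.0, "create ticket"),
]


def firing_alerts(rates_by_window: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (name, action) for every rule whose own window exceeds its threshold."""
    return [(name, action)
            for name, window, threshold, action in RULES
            if rates_by_window.get(window, 0.0) >= threshold]


print(round(days_to_exhaustion(14.4), 2))   # 2.08
print(days_to_exhaustion(6.0))              # 5.0
print(firing_alerts({"5m": 20.0, "6h": 3.0, "3d": 1.2}))
```

Each rule is evaluated only against the burn rate measured over its own window, so a short spike can trigger the fast-burn page without also tripping the slow-burn rule.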

[ ] Alerts configured in monitoring platform
[ ] Alert routing configured (PagerDuty / OpsGenie)
[ ] Runbook linked to each alert

Step 6 — Documentation & Alignment

SLO DOCUMENT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ ] SLO document published (wiki / repo)
[ ] Stakeholders reviewed and approved targets
[ ] Dashboard created showing SLO status and error budget
[ ] Error budget policy agreed upon with product team
[ ] Review cadence set: [monthly / quarterly]
[ ] First review date: [date]

Counter-Rationalizations

| Shortcut | Counter | Why |
|----------|---------|-----|
| "We can skip some steps for this case" | Adapt the workflow steps, don't skip them | Skipped steps are where incidents and oversights originate |
| "The user seems to already know what to do" | Complete all workflow phases with the user | The workflow catches blind spots that experience alone misses |
| "This is a minor case, full process is overkill" | Scale the process down, don't turn it off | Minor cases become major when unstructured; the process scales down, it does not disappear |
| "I'll fill in the details later" | Complete each section before moving on | Deferred details are forgotten; real-time capture is more accurate |
| "The template output isn't necessary" | Always produce the structured output format | Structured output enables comparison, audit trails, and handoff to other teams |
