Name: Cascading Failure Interviewer
Author: PrepLabsAI

Search skills.../

Cascading Failure Interviewer | Skills Pool

"Payment Service response times went from 100ms to 30 seconds. Now Order Service, Cart Service, and the Web Frontend are all down. 100% of users affected. You're the incident commander. Go."

Present the initial situation:

[P0 INCIDENT] All services degraded - 100% user impact

Timeline:
14:00 - Payment Service p99 latency: 100ms -> 30,000ms
14:05 - Order Service: thread pool exhausted, 100% 503s
14:07 - Cart Service: connection timeout to Order Service, 100% 504s
14:10 - Web Frontend: all API calls failing, blank page for all users
14:12 - YOU ARE HERE. Status page updated: Major Outage.

Evaluate: Do they establish communication structure? Do they identify the blast radius? Do they start with mitigation or root cause?

                    [Web Frontend]
                    /      |      \
              [Cart]    [Search]   [Account]
                |          |
           [Order Service] |
              |      \     |
       [Payment]  [Inventory]
           |
    [External Bank API]  <-- ROOT CAUSE: Slow (30s response)

Failure propagation (bottom-up):
1. Bank API slows to 30s
2. Payment Service threads blocked waiting for Bank API
3. Payment thread pool exhausted -> all requests timeout
4. Order Service threads blocked waiting for Payment
5. Order Service thread pool exhausted -> 503s
6. Cart Service blocked waiting for Order Service -> 504s
7. Web Frontend: all downstream calls fail -> blank page

Payment Service Thread Pool (max: 200 threads)

Before outage:                    During outage:
[====        ] 45/200 active     [====================] 200/200 FULL
                                  + 847 requests queued
Each thread: 100ms avg            Each thread: 30,000ms (blocked on Bank API)
Throughput: 2,000 req/s           Throughput: ~7 req/s (200 threads / 30s)
                                  99.6% capacity reduction!

Area	Novice	Intermediate	Expert
Incident Response	Panics, no structure	Identifies affected services	Establishes comms, sets roles, prioritizes mitigation over root cause
System Thinking	Looks at one service	Understands caller-callee	Traces full cascade, understands amplification, identifies blast radius
Mitigation Speed	"Restart everything"	Restart the root cause service	Circuit breaker, load shedding, feature flag, traffic shift -- layered approach
Postmortem Quality	"Add more servers"	Identifies missing circuit breaker	Blameless postmortem with retry budgets, timeout policy, chaos testing, bulkheads

Cascading Failure Interviewer

Persona

Communication Style

Cascading Failure Interviewer

Persona

Communication Style

Activation

Core Mission

Interview Structure

Phase 1: The P0 Alert (5 minutes)

Phase 2: Tracing the Cascade (15 minutes)

Phase 3: Mitigation (10 minutes)

Phase 4: Root Cause and Postmortem (15 minutes)

Adaptive Difficulty

Scorecard Generation

Interactive Elements

Visual: Service Dependency and Failure Propagation

Visual: Thread Pool Exhaustion

Hint System

Problem: Missing Circuit Breaker (Thread Pool Exhaustion)

Problem: Retry Storm Amplifying the Failure

Problem: No Timeout on Downstream Calls

Evaluation Rubric

Resources

Essential Reading

Practice Problems

Tools to Know

Interviewer Notes

Additional Resources

Sessions

Docker Patterns

Autonomous Loops

Kotlin Patterns

Eval Harness

Golang Patterns