Java Debugging Prod Incidents
Production incident debugging playbook for Java services: triage with logs/metrics/traces, safe JVM diagnostics (jcmd/JFR/thread dumps), a rollback decision tree, communication, and blameless postmortems. Use during outages or flaky production behavior.
Intent
During incidents, speed matters—but unstructured debugging causes more damage.
This skill provides:
An SRE-style incident workflow (roles, comms, timeline)
A logs/metrics/traces-first diagnosis approach
Safe JVM diagnostics: thread dumps, JFR snippets, jcmd snapshots
A rollback / mitigate decision tree
A blameless postmortem template and “next guardrails” checklist
Scope
In scope
Incident triage and mitigation loop
Observability-first debugging
JVM diagnostics:
thread dumps
JFR capture
GC/heap snapshots
Hypothesis-driven investigation
Rollback / feature-flag mitigation strategy
Postmortem and action items (prevention)
Out of scope
Full infra incident response for Kubernetes/network (separate ops skill)
Pen-test level forensic analysis (separate security response playbook)
When to use
production outage
SLO burn / severe latency spike
error rate spike
memory leak suspected
queue lag runaway
deadlocks / thread pool starvation
“works in staging, fails in prod” mystery
Required inputs (context to attach in Cursor)
Links or snapshots (not raw secrets):
dashboard panels (latency, CPU, GC, error rate)
recent deploys/config changes
key logs around start time
traces for representative failing requests
Service metadata:
version / commit
runtime (container/VM), JDK version
traffic shape changes (if any)
Roles and workflow (SRE-style)
Step 1 — Declare incident and assign roles
Incident Commander (IC): owns decisions and comms
Tech Lead: drives technical investigation
Comms: updates stakeholders/users
Scribe: writes timeline and captures actions
Deliverable: incident channel + timeline doc started.
Step 2 — Stabilize first (stop the bleeding)
rollback to last known good
disable feature via flag
reduce load (rate limit, shed traffic)
scale out only if it is safe and actually helps (it often does not)
isolate failing dependency (circuit breaker)
Rule: prefer reversible mitigations over risky live fixes.
Deliverable: mitigation chosen + tracked in timeline.
Observability-first diagnosis
Step 3 — Define the impact and the symptom precisely
Who is impacted? which endpoints/tenants/regions?
What changed? (deploy/config/traffic/dependency)
Which SLO is burning? (latency, availability)
Deliverable: a one-paragraph symptom statement.
Step 4 — Use the “3 signals” triage order
Metrics (what is broken and when)
Logs (why requests fail, errors, timeouts)
Traces (where time is spent across dependencies)
Quick heuristics:
latency up while CPU is flat: likely I/O waits, downstream slowness, or locks
CPU spikes: hot loop, serialization, logging overhead, contention
GC spikes: allocation storms, memory leak, heap too small
errors spike: upstream/downstream change, auth expiry, config drift
Deliverable: top 2 hypotheses with supporting signals.
JVM diagnostics (safe playbook)
Use JVM-level diagnostics when:
observability is insufficient, OR
you need thread/heap evidence, OR
the service is “alive but stuck”.
Step 5 — Thread dump (fast, low risk)
Use a safe method (depends on permissions); jcmd <pid> Thread.print is often preferred over legacy tools. In the dump, look for:
deadlocks
thread pool starvation
many threads blocked on the same lock
many threads waiting for DB connections
runaway retry/backoff loops
Deliverable: thread dump snippet + interpretation.
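The thread-dump step can be sketched as a small runbook script. jcmd and its Thread.print command are standard JDK tools; the PID lookup and output file name are placeholders you should adapt to your environment:

```shell
# Thread-dump capture sketch. Assumes jcmd from the same JDK as the target
# process is on PATH; PID is a placeholder the operator must set.
PID="${PID:-}"
OUT="threads-$(date +%Y%m%d-%H%M%S).txt"
if [ -n "$PID" ]; then
  # Thread.print is non-destructive; take 2-3 dumps ~10s apart to see movement.
  jcmd "$PID" Thread.print > "$OUT"
  # Quick scan: thread counts per state to spot pool starvation or blocking.
  grep -o 'java.lang.Thread.State: [A-Z_]*' "$OUT" | sort | uniq -c | sort -rn
else
  echo "PID not set; skipping dump"
fi
```

Taking several dumps a few seconds apart matters: a thread that is BLOCKED on the same lock in every dump is evidence, not noise.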
Step 6 — JFR snippet (bounded capture)
Capture 30–120s around the peak symptoms, covering:
CPU + allocation + locks + thread states
This often answers “what is actually happening” quickly.
Deliverable: JFR file + short summary.
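A bounded capture via jcmd's JFR commands might look like the sketch below. The recording name and file name are illustrative; the commented `jfr print` line is one way to inspect the result afterwards:

```shell
# Bounded JFR capture sketch (JDK 11+). PID and file names are placeholders.
PID="${PID:-}"
REC="incident-$(date +%H%M%S).jfr"
if [ -n "$PID" ]; then
  # The 'profile' settings include CPU, allocation, and lock sampling at
  # modest overhead; 'duration' bounds the capture so it cannot run away.
  jcmd "$PID" JFR.start name=incident settings=profile duration=60s filename="$REC"
  jcmd "$PID" JFR.check   # confirm the recording is actually running
  # After it finishes, inspect hot methods and contended monitors, e.g.:
  # jfr print --events jdk.ExecutionSample,jdk.JavaMonitorEnter "$REC" | less
else
  echo "PID not set; skipping JFR capture"
fi
```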
Step 7 — Heap/GC snapshots (only if needed)
capture GC log window
capture class histogram snapshot via jcmd
only capture heap dump if you have storage/privacy plan
Deliverable: evidence bundle for memory hypothesis.
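The memory-evidence order above can be scripted so the cheap snapshots always come before any heap dump. PID and the evidence directory are placeholders:

```shell
# Memory-evidence sketch: cheap snapshots first, heap dump last (and only
# with an approved storage/privacy plan). PID and paths are placeholders.
PID="${PID:-}"
EVIDENCE_DIR="${EVIDENCE_DIR:-/tmp/incident-evidence}"
mkdir -p "$EVIDENCE_DIR"
if [ -n "$PID" ]; then
  jcmd "$PID" GC.heap_info       > "$EVIDENCE_DIR/heap_info.txt"
  # Class histogram: per-class instance counts and shallow sizes.
  jcmd "$PID" GC.class_histogram > "$EVIDENCE_DIR/histogram.txt"
  jstat -gcutil "$PID" 1000 10   > "$EVIDENCE_DIR/gcutil.txt"
  # Heap dump ONLY with a storage/privacy plan -- it pauses the JVM:
  # jcmd "$PID" GC.heap_dump "$EVIDENCE_DIR/heap.hprof"
else
  echo "PID not set; collected nothing"
fi
```

Two class histograms taken minutes apart are often enough to confirm or reject a leak hypothesis without ever touching a full heap dump.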
Hypothesis-driven loop (fast iterations)
Step 8 — Rank hypotheses and test the cheapest first
For each hypothesis, record:
Expected observation if true
Cheap test (canary, toggle, single-node restart, config revert)
Risk assessment
Avoid:
making 10 changes at once
“SSH in and tweak random flags”
“fixing” without evidence
Deliverable: hypothesis table (in timeline).
Rollback / mitigation decision tree
Step 9 — Decide: mitigate now vs fix forward
Prefer rollback/flag-off if:
the change is recent and correlated with the symptom
the fix is uncertain
impact is high
Prefer fix forward only if:
rollback is impossible or too risky
you have a high-confidence minimal patch
you can canary it safely
Deliverable: decision + rationale + next checkpoint time.
After recovery: verification and monitoring
Step 10 — Verify recovery
confirm error rate normal
confirm latency p95/p99 stable
confirm downstream health
confirm no hidden queue lag or retry storms
Deliverable: “recovery confirmation” entry in timeline.
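Recovery checks can be scripted so the all-clear rests on repeated observations rather than one good data point. The health URL, interval, and check count below are hypothetical placeholders:

```shell
# Recovery-watch sketch: poll a health endpoint several times before
# declaring recovery. URL, interval, and count are placeholders.
URL="${URL:-http://localhost:8080/health}"
OK=0
for i in 1 2 3; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$URL" || true)
  [ "$code" = "200" ] && OK=$((OK + 1))
  echo "check $i: HTTP ${code:-none}"
  sleep 1   # use a realistic interval (e.g. 60s) in production
done
echo "healthy checks: $OK/3"
```

In practice you would check error-rate and latency panels the same way; the point is to require several consecutive clean readings before closing the incident.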
Postmortem (blameless) + next guardrails
Step 11 — Write a blameless postmortem
Use a standard structure:
Summary + customer impact
Timeline (UTC + local time if needed)
Root cause and contributing factors
Detection and response analysis
What went well / what went poorly
Action items with owners and deadlines
Step 12 — “Next guardrails” checklist (make incidents less likely)
add missing timeouts and retry limits
add bulkheads / rate limits
add better alerts (SLO-based)
add regression tests
add runbooks for the failure mode
add feature flags for risky paths
enforce safer deploy practices (canary, bake time)
Deliverable: postmortem doc + action item tracker.
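Several of these guardrails can be baked into the JVM launch command so the next incident starts with evidence already flowing. The flags below are standard HotSpot options; the log paths, sizes, and app.jar name are placeholders:

```shell
# Guardrail sketch: always-on GC logging, OOM heap dumps, and a background
# flight recording. Paths, sizes, and app.jar are placeholders.
java \
  -Xlog:'gc*':file=/var/log/app/gc.log:time,uptime:filecount=5,filesize=20m \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/app/ \
  -XX:StartFlightRecording=disk=true,maxsize=200m,dumponexit=true \
  -jar app.jar
```

With a continuous flight recording and rotating GC logs in place, Steps 6 and 7 of this playbook become "dump what is already recorded" instead of "start capturing under pressure".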
Outputs / Artifacts
Incident timeline doc (scribe notes)
Mitigation decision log
Evidence bundle (dashboards/logs/traces + optional JVM artifacts)
Postmortem document (blameless)
“Next guardrails” action list
Definition of Done (DoD)
impact mitigated and recovery verified on dashboards
incident timeline and evidence bundle archived
blameless postmortem published
guardrail action items filed with owners and deadlines
Common failure modes & fixes
Symptom: incident drags on with random changes
Cause: no hypotheses, no IC role, no timeline
Fix: establish IC + scribe, hypothesis loop, safe mitigations
Symptom: recovery but reoccurs
Cause: no guardrails added; missing timeouts/backpressure
Fix: convert root cause into specific engineering controls
Symptom: debugging actions cause more outage
Cause: high-risk production changes and poor rollback
Fix: prefer reversible mitigations and canary; keep changes minimal
Guardrails (What NOT to do)
Do NOT paste secrets/tokens in incident channels.
Do NOT take heap dumps without privacy review and storage plan.
Do NOT run destructive commands on production hosts without explicit approval.
Do NOT “restart everything” without understanding cascading effects.