Java Debugging Prod Incidents
Production incident debugging playbook for Java services: triage with logs/metrics/traces, safe JVM diagnostics (jcmd/JFR/thread dumps), a rollback decision tree, communication, and blameless postmortems. Use during outages or flaky production behavior.
Intent
During incidents, speed matters—but unstructured debugging causes more damage.
This skill provides:
An SRE-style incident workflow (roles, comms, timeline)
A logs/metrics/traces-first diagnosis approach
Safe JVM diagnostics: thread dumps, JFR snippets, jcmd snapshots
A rollback / mitigate decision tree
A blameless postmortem template and “next guardrails” checklist
Scope
In scope
Incident triage and mitigation loop
Observability-first debugging
JVM diagnostics:
thread dumps
JFR capture
GC/heap snapshots
Hypothesis-driven investigation
Rollback / feature-flag mitigation strategy
Postmortem and action items (prevention)
Out of scope
Full infra incident response for Kubernetes/network (separate ops skill)
Pen-test level forensic analysis (separate security response playbook)
When to use
production outage
SLO burn / severe latency spike
error rate spike
memory leak suspected
queue lag runaway
deadlocks / thread pool starvation
“works in staging, fails in prod” mystery
Required inputs (context to attach in Cursor)
Links or snapshots (not raw secrets):
dashboard panels (latency, CPU, GC, error rate)
recent deploys/config changes
key logs around start time
traces for representative failing requests
Service metadata:
version / commit
runtime (container/VM), JDK version
traffic shape changes (if any)
Roles and workflow (SRE-style)
Step 1 — Declare incident and assign roles
Incident Commander (IC): owns decisions and comms
Tech Lead: drives technical investigation
Comms: updates stakeholders/users
Scribe: writes timeline and captures actions
Deliverable: incident channel + timeline doc started.
Step 2 — Stabilize first (stop the bleeding)
rollback to last known good
disable feature via flag
reduce load (rate limit, shed traffic)
scale out only if it is safe and actually helps (it often does not)
isolate failing dependency (circuit breaker)
Rule: prefer reversible mitigations over risky live fixes.
Deliverable: mitigation chosen + tracked in timeline.
Observability-first diagnosis
Step 3 — Define the impact and the symptom precisely
Who is impacted? which endpoints/tenants/regions?
What changed? (deploy/config/traffic/dependency)
Which SLO is burning? (latency, availability)
Deliverable: a one-paragraph symptom statement.
Step 4 — Use the “3 signals” triage order
Metrics (what is broken and when)
Logs (why requests fail, errors, timeouts)
Traces (where time is spent across dependencies)
Quick heuristics:
latency up while CPU is flat: likely I/O waits, downstream slowness, or locks
CPU spikes: hot loop, serialization, logging overhead, contention
GC spikes: allocation storms, memory leak, heap too small
errors spike: upstream/downstream change, auth expiry, config drift
Deliverable: top 2 hypotheses with supporting signals.
JVM diagnostics (safe playbook)
Use JVM-level diagnostics when:
observability is insufficient, OR
you need thread/heap evidence, OR
the service is “alive but stuck”.
Step 5 — Thread dump (fast, low risk)
Use a safe method (depends on permissions); jcmd <pid> Thread.print is often preferred over legacy tools. In the dump, look for:
deadlocks
thread pool starvation
many threads blocked on the same lock
many threads waiting for DB connections
runaway retry/backoff loops
Deliverable: thread dump snippet + interpretation.
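The thread-dump step can be sketched as a small runbook script. jcmd and its Thread.print command are standard JDK tools; the PID lookup and output file name are placeholders you should adapt to your environment:

```shell
# Thread-dump capture sketch. Assumes jcmd from the same JDK as the target
# process is on PATH; PID is a placeholder the operator must set.
PID="${PID:-}"
OUT="threads-$(date +%Y%m%d-%H%M%S).txt"
if [ -n "$PID" ]; then
  # Thread.print is non-destructive; take 2-3 dumps ~10s apart to see movement.
  jcmd "$PID" Thread.print > "$OUT"
  # Quick scan: thread counts per state to spot pool starvation or blocking.
  grep -o 'java.lang.Thread.State: [A-Z_]*' "$OUT" | sort | uniq -c | sort -rn
else
  echo "PID not set; skipping dump"
fi
```

Taking several dumps a few seconds apart matters: a thread that is BLOCKED on the same lock in every dump is evidence, not noise.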
Step 6 — JFR snippet (bounded capture)
Capture 30–120s around the peak symptoms, covering:
CPU + allocation + locks + thread states
This often answers “what is actually happening” quickly.
Deliverable: JFR file + short summary.
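A bounded capture via jcmd's JFR commands might look like the sketch below. The recording name and file name are illustrative; the commented `jfr print` line is one way to inspect the result afterwards:

```shell
# Bounded JFR capture sketch (JDK 11+). PID and file names are placeholders.
PID="${PID:-}"
REC="incident-$(date +%H%M%S).jfr"
if [ -n "$PID" ]; then
  # The 'profile' settings include CPU, allocation, and lock sampling at
  # modest overhead; 'duration' bounds the capture so it cannot run away.
  jcmd "$PID" JFR.start name=incident settings=profile duration=60s filename="$REC"
  jcmd "$PID" JFR.check   # confirm the recording is actually running
  # After it finishes, inspect hot methods and contended monitors, e.g.:
  # jfr print --events jdk.ExecutionSample,jdk.JavaMonitorEnter "$REC" | less
else
  echo "PID not set; skipping JFR capture"
fi
```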
Step 7 — Heap/GC snapshots (only if needed)
capture GC log window
capture class histogram snapshot via jcmd
only capture heap dump if you have storage/privacy plan
Deliverable: evidence bundle for memory hypothesis.
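The memory-evidence order above can be scripted so the cheap snapshots always come before any heap dump. PID and the evidence directory are placeholders:

```shell
# Memory-evidence sketch: cheap snapshots first, heap dump last (and only
# with an approved storage/privacy plan). PID and paths are placeholders.
PID="${PID:-}"
EVIDENCE_DIR="${EVIDENCE_DIR:-/tmp/incident-evidence}"
mkdir -p "$EVIDENCE_DIR"
if [ -n "$PID" ]; then
  jcmd "$PID" GC.heap_info       > "$EVIDENCE_DIR/heap_info.txt"
  # Class histogram: per-class instance counts and shallow sizes.
  jcmd "$PID" GC.class_histogram > "$EVIDENCE_DIR/histogram.txt"
  jstat -gcutil "$PID" 1000 10   > "$EVIDENCE_DIR/gcutil.txt"
  # Heap dump ONLY with a storage/privacy plan -- it pauses the JVM:
  # jcmd "$PID" GC.heap_dump "$EVIDENCE_DIR/heap.hprof"
else
  echo "PID not set; collected nothing"
fi
```

Two class histograms taken minutes apart are often enough to confirm or reject a leak hypothesis without ever touching a full heap dump.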
Hypothesis-driven loop (fast iterations)
Step 8 — Rank hypotheses and test the cheapest first
For each hypothesis, record:
Expected observation if true
Cheap test (canary, toggle, single-node restart, config revert)
Risk assessment
Avoid:
making 10 changes at once
“SSH in and tweak random flags”
“fixing” without evidence
Deliverable: hypothesis table (in timeline).
Rollback / mitigation decision tree
Step 9 — Decide: mitigate now vs fix forward
Prefer rollback/flag-off if:
the change is recent and correlated with the symptom
the fix is uncertain
impact is high
Prefer fix forward only if:
rollback is impossible or too risky
you have a high-confidence minimal patch
you can canary it safely
Deliverable: decision + rationale + next checkpoint time.
After recovery: verification and monitoring
Step 10 — Verify recovery
confirm error rate normal
confirm latency p95/p99 stable
confirm downstream health
confirm no hidden queue lag or retry storms
Deliverable: “recovery confirmation” entry in timeline.
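Recovery checks can be scripted so the all-clear rests on repeated observations rather than one good data point. The health URL, interval, and check count below are hypothetical placeholders:

```shell
# Recovery-watch sketch: poll a health endpoint several times before
# declaring recovery. URL, interval, and count are placeholders.
URL="${URL:-http://localhost:8080/health}"
OK=0
for i in 1 2 3; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$URL" || true)
  [ "$code" = "200" ] && OK=$((OK + 1))
  echo "check $i: HTTP ${code:-none}"
  sleep 1   # use a realistic interval (e.g. 60s) in production
done
echo "healthy checks: $OK/3"
```

In practice you would check error-rate and latency panels the same way; the point is to require several consecutive clean readings before closing the incident.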
Postmortem (blameless) + next guardrails
Step 11 — Write a blameless postmortem
Use a standard structure:
Summary + customer impact
Timeline (UTC + local time if needed)
Root cause and contributing factors
Detection and response analysis
What went well / what went poorly
Action items with owners and deadlines
Step 12 — “Next guardrails” checklist (make incidents less likely)
add missing timeouts and retry limits
add bulkheads / rate limits
add better alerts (SLO-based)
add regression tests
add runbooks for the failure mode
add feature flags for risky paths
enforce safer deploy practices (canary, bake time)
Deliverable: postmortem doc + action item tracker.
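Several of these guardrails can be baked into the JVM launch command so the next incident starts with evidence already flowing. The flags below are standard HotSpot options; the log paths, sizes, and app.jar name are placeholders:

```shell
# Guardrail sketch: always-on GC logging, OOM heap dumps, and a background
# flight recording. Paths, sizes, and app.jar are placeholders.
java \
  -Xlog:'gc*':file=/var/log/app/gc.log:time,uptime:filecount=5,filesize=20m \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/app/ \
  -XX:StartFlightRecording=disk=true,maxsize=200m,dumponexit=true \
  -jar app.jar
```

With a continuous flight recording and rotating GC logs in place, Steps 6 and 7 of this playbook become "dump what is already recorded" instead of "start capturing under pressure".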
Outputs / Artifacts
Incident timeline doc (scribe notes)
Mitigation decision log
Evidence bundle (dashboards/logs/traces + optional JVM artifacts)
Postmortem document (blameless)
“Next guardrails” action list
Definition of Done (DoD)
impact mitigated and recovery verified on dashboards
incident timeline and evidence bundle archived
blameless postmortem published
guardrail action items filed with owners and deadlines
Common failure modes & fixes
Symptom: incident drags on with random changes
Cause: no hypotheses, no IC role, no timeline
Fix: establish IC + scribe, hypothesis loop, safe mitigations
Symptom: recovery but reoccurs
Cause: no guardrails added; missing timeouts/backpressure
Fix: convert root cause into specific engineering controls
Symptom: debugging actions cause more outage
Cause: high-risk production changes and poor rollback
Fix: prefer reversible mitigations and canary; keep changes minimal
Guardrails (What NOT to do)
Do NOT paste secrets/tokens in incident channels.
Do NOT take heap dumps without privacy review and storage plan.
Do NOT run destructive commands on production hosts without explicit approval.
Do NOT “restart everything” without understanding cascading effects.