Investigate platform incidents, perform RCA, create incident documentation, and follow alert runbooks in the Kagenti platform
This skill helps you investigate incidents, perform root cause analysis (RCA), and create comprehensive incident documentation.
```bash
# List all available runbooks
ls -1 docs/runbooks/alerts/*.md

# For a specific alert, check its runbook_url annotation
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    print('Runbook URL:', rule['annotations'].get('runbook_url', 'N/A'))
"
```
Runbooks are located at: docs/runbooks/alerts/<alert-uid>.md
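Because runbooks follow this naming convention, mapping an alert UID to its runbook path can be scripted. A minimal sketch (the UID value is illustrative):

```shell
# Resolve an alert UID to its runbook path; warn if the file is missing
ALERT_UID="prometheus-down"   # illustrative UID
RUNBOOK="docs/runbooks/alerts/${ALERT_UID}.md"
if [ -f "$RUNBOOK" ]; then
  echo "Runbook: $RUNBOOK"
else
  echo "No runbook found for ${ALERT_UID}"
fi
```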
Standard runbook sections:
Example: Follow Prometheus Down runbook:
```bash
# From docs/runbooks/alerts/prometheus-down.md

# 1. Check pod status
kubectl get pods -n observability -l app=prometheus

# 2. Check pod logs
kubectl logs -n observability deployment/prometheus --tail=100

# 3. Check events
kubectl get events -n observability --field-selector involvedObject.name=prometheus --sort-by='.lastTimestamp'

# 4. Test Prometheus endpoint
kubectl exec -n observability deployment/grafana -- \
  curl -s http://prometheus.observability.svc:9090/-/ready
```
Pod Status & Events:
```bash
# Get pod status in namespace
kubectl get pods -n <namespace>

# Detailed pod description
kubectl describe pod <pod-name> -n <namespace>

# Recent events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# All failing pods across the platform
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull|Pending"
```
Logs:
```bash
# Current container logs
kubectl logs -n <namespace> <pod-name> --tail=100

# Previous container (if crashed)
kubectl logs -n <namespace> <pod-name> --previous

# Specific container in pod
kubectl logs -n <namespace> <pod-name> -c <container-name>

# All containers in pod
kubectl logs -n <namespace> <pod-name> --all-containers=true

# Query Loki for errors (last 5 minutes)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name="<namespace>"} |= "error"' \
  --data-urlencode 'limit=100' \
  --data-urlencode "start=$(date -u -v-5M +%s)000000000" \
  --data-urlencode "end=$(date -u +%s)000000000" | python3 -m json.tool
```
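Note that `date -v-5M` is BSD/macOS syntax; GNU date uses `date -d '5 minutes ago'` instead. A portable way to build the nanosecond timestamps Loki expects is to compute them with python3:

```shell
# Build Loki's start/end window (Unix nanoseconds) portably;
# avoids the BSD (-v-5M) vs GNU (-d '5 minutes ago') date split
START=$(python3 -c 'import time; print(int((time.time() - 300) * 1e9))')
END=$(python3 -c 'import time; print(int(time.time() * 1e9))')
echo "start=$START end=$END"
```

Substitute `--data-urlencode "start=$START"` and `--data-urlencode "end=$END"` into the query above.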
Metrics:
```bash
# Check if service is up
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up{job="<job-name>"}' | python3 -m json.tool

# Check replica availability
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{deployment="<name>"}' \
  | python3 -m json.tool

# Check pod restarts
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_container_status_restarts_total{pod=~"<pod-pattern>"}' \
  | python3 -m json.tool
```
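Each query above returns JSON with samples under `data.result`; a short python3 filter can pull out just the labels and values instead of pretty-printing the whole response. A sketch against a canned response (the `RESP` string stands in for real curl output):

```shell
# Extract metric labels and sample values from a Prometheus
# instant-query response; RESP stands in for real curl output
RESP='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"job":"prometheus"},"value":[1700000000,"1"]}]}}'
OUT=$(echo "$RESP" | python3 -c '
import sys, json
resp = json.load(sys.stdin)
for r in resp["data"]["result"]:
    print(r["metric"].get("job", "?"), r["value"][1])
')
echo "$OUT"   # prints: prometheus 1
```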
ArgoCD Application Status:
```bash
# Check application health
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web

# Check sync status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -E "Degraded|OutOfSync"

# View recent sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web
```
Common Root Causes:
**Configuration Error**:
```bash
git log --oneline -10
argocd app diff <app-name> --port-forward ...
kustomize build components/...
```

**Image Issues**:
```bash
docker exec kagenti-demo-control-plane crictl images | grep <image>
kubectl describe pod <pod-name> | grep -A10 "Events"
```

**Resource Constraints**:
```bash
kubectl top pods -n <namespace>
kubectl top nodes
kubectl get events -A | grep OOM
```

**Dependency Failure**:
```bash
kubectl get endpoints -n <namespace>
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- curl http://service-name
```

**mTLS/Network Issues**:
```bash
kubectl get pods -n <namespace>   # should show 2/2 with sidecar injection
kubectl get peerauthentication -A
kubectl logs -n <namespace> <pod-name> -c istio-proxy
```

**Certificate Issues**:
```bash
kubectl get certificate -A
kubectl logs -n cert-manager deployment/cert-manager
```

Create an entry in TODO_INCIDENTS.md:
## Incident #X: [Alert Name] - [Brief Description]
**Status**: 🔴 Active / 🟡 Investigating / 🟢 Resolved
**Detected**: 2025-11-17 08:31:40 UTC
**Severity**: Critical / Warning / Info
**Components Affected**:
- Component 1
- Component 2
### Summary
Brief description of what happened.
### Investigation
**Timeline**:
- 08:31 - Alert fired
- 08:35 - Checked pod status, found CrashLoopBackOff
- 08:40 - Reviewed logs, identified error message
- 08:45 - Identified root cause
**Evidence Collected**:
1. **Pod Status**:
   ```
   NAME            READY   STATUS             RESTARTS
   component-xxx   0/2     CrashLoopBackOff   5
   ```
2. **Error Logs**:
   ```
   ERROR: Failed to connect to database: connection refused
   ```
3. **Events**:
   ```
   Back-off restarting failed container
   ```
### Root Cause Analysis
**Root Cause**: [Specific technical reason]
**Why it happened**:
- Contributing factor 1
- Contributing factor 2
**Why alert fired**:
- PromQL query: `query_here`
- Query returned: `value`
- Threshold: `> threshold`
### Resolution
**Fix Applied**:
```bash
# Commands to fix the issue
kubectl apply -f ...
```

**Verification**:
```bash
# Commands to verify fix
kubectl get pods -n <namespace>
# Output showing healthy state
```

**Time to Resolution**: XX minutes
### 6. Validate Fix
**After applying fix**:
```bash
# 1. Verify pods are healthy
kubectl get pods -n <namespace>
# 2. Check alert stopped firing
kubectl exec -n observability deployment/grafana -- \
curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \
-u admin:admin123 | python3 -c "
import sys, json
alerts = json.load(sys.stdin)
firing = [a for a in alerts if a.get('status', {}).get('state') == 'active']
for a in firing:
    print(f\"{a['labels']['alertname']}: {a['labels']['severity']}\")
"
# 3. Run integration tests
pytest tests/integration/test_<component>.py -v
# 4. Check platform status
./scripts/platform-status.sh
# 5. Verify in Grafana UI
open https://grafana.localtest.me:9443/alerting/list
```
Pod Not Starting:
```bash
# Investigation flow
kubectl get pods -n <namespace>                  # Get pod status
kubectl describe pod <pod-name> -n <namespace>   # Check events
kubectl logs <pod-name> -n <namespace>           # Check logs (if available)

# Common causes:
# - ImagePullBackOff: Image not found or not loaded
# - CrashLoopBackOff: Application exits on startup
# - Pending: Resource constraints or scheduling issues
# - Init:Error: Init container failed
```
Service Unreachable:
```bash
# Investigation flow
kubectl get pods -n <namespace> -l app=<service>   # Check pods
kubectl get svc -n <namespace> <service-name>      # Check service
kubectl get endpoints -n <namespace> <service-name> # Check endpoints
kubectl get httproute -n <namespace>               # Check routes (if using Gateway API)

# Test connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
  -- curl -v http://<service-name>.<namespace>.svc:PORT
```
High Resource Usage:
```bash
# Investigation flow
kubectl top pods -n <namespace> --sort-by=memory   # Memory usage
kubectl top pods -n <namespace> --sort-by=cpu      # CPU usage

# Check limits
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

# Check for OOM kills
kubectl get events -n <namespace> | grep OOM
```
Frequent Pod Restarts:
```bash
# Investigation flow
kubectl get pods -n <namespace>                     # Check RESTARTS column
kubectl describe pod <pod-name> -n <namespace>      # Check restart reason
kubectl logs <pod-name> -n <namespace> --previous   # Logs before crash

# Check restart metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=increase(kube_pod_container_status_restarts_total{pod="<pod-name>"}[1h])'
```
ArgoCD Sync Issues:
```bash
# Investigation flow
argocd app get <app-name> --port-forward ...    # Get status
argocd app diff <app-name> --port-forward ...   # See differences

# Check last sync
argocd app history <app-name> --port-forward ...

# Force sync if needed
argocd app sync <app-name> --force --prune --port-forward ...
```
Use the check-alerts skill to find which alerts are firing:
```
# This automatically invokes the check-alerts skill
"What alerts are currently firing?"
```

Use the check-logs skill to query specific error patterns:
```
# This automatically invokes the check-logs skill
"Show me error logs from keycloak namespace in the last 10 minutes"
```

Use the check-metrics skill to verify service health:
```
# This automatically invokes the check-metrics skill
"What's the CPU usage of pods in observability namespace?"
```
Critical (P0):
High (P1):
Medium (P2):
Low (P3):