K8shealth

Name: K8shealth
Author: kagenti

Check comprehensive platform health including deployments, pods, services, certificates, and resources across the Kagenti platform

kagenti · 183 stars · 18 Feb 2026

Categories: Containers

Skill content

Platform Health Check Skill

This skill helps you perform comprehensive platform health checks and identify issues quickly.

Context-Safe Execution (MANDATORY)

All kubectl/oc commands MUST redirect output to files. Commands below are shown in bare form for readability. When executing, always redirect:

export LOG_DIR="/tmp/kagenti/k8s/${CLUSTER:-local}"
mkdir -p "$LOG_DIR"

# Example: health check script
.github/scripts/verify_deployment.sh > "$LOG_DIR/health-check.log" 2>&1 && echo "OK: healthy" || echo "FAIL (see $LOG_DIR/health-check.log)"

# Example: kubectl commands
kubectl get pods -A > "$LOG_DIR/all-pods.log" 2>&1 && echo "OK" || echo "FAIL"
kubectl get deployments -A > "$LOG_DIR/deployments.log" 2>&1 && echo "OK" || echo "FAIL"

# Analyze results in subagent — NEVER read large output in main context
# Use Task(subagent_type='Explore') to read log files and return summaries
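
The redirect-and-summarize pattern above can be wrapped in a small helper so that every command follows it the same way. The sketch below is an assumption, not part of the Kagenti repository: `run_logged` is a hypothetical name, and it relies only on the `LOG_DIR` convention shown above.

```shell
# Hypothetical helper (not part of the Kagenti repo): run a command, write all
# of its output to $LOG_DIR/<name>.log, and print a single status line.
run_logged() {
  rl_name="$1"
  shift
  rl_dir="${LOG_DIR:-/tmp/kagenti/k8s/${CLUSTER:-local}}"
  mkdir -p "$rl_dir"
  rl_file="$rl_dir/$rl_name.log"
  if "$@" > "$rl_file" 2>&1; then
    echo "OK: $rl_name"
  else
    rl_rc=$?   # exit status of the command that just failed
    echo "FAIL: $rl_name (exit $rl_rc, see $rl_file)"
    return "$rl_rc"
  fi
}

# Usage (requires cluster access):
#   run_logged all-pods kubectl get pods -A
#   run_logged deployments kubectl get deployments -A
```

Only the one-line status reaches the main context; the log file can then be handed to a subagent for analysis.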

When to Use

Quick Health Check

Automated Health Check Script

# Run the comprehensive health check (from CI)
chmod +x .github/scripts/verify_deployment.sh
.github/scripts/verify_deployment.sh

# What it checks:
# ✓ Resource usage (RAM, disk, CPU, Docker containers)
# ✓ Deployment status (weather-tool, weather-service, keycloak, operator)
# ✓ Pod health summary (running, pending, failed, crashloop)
# ✓ Failed pod details with events and error logs
# ✓ Iterates until healthy or timeout (default: 20 iterations × 15s = 5 minutes)

# Configure timeout
MAX_ITERATIONS=30 POLL_INTERVAL=20 .github/scripts/verify_deployment.sh

===================================================================
  Kagenti Deployment Health Monitor
===================================================================

Configuration:
  Max Iterations: 20
  Poll Interval: 15s
  Total Timeout: 300s (5m)

━━━ Resource Usage ━━━
  Memory: 8.23/15.50 GB (53.1% used)
  Disk: 45G/234G (20% used)
  Load Avg (1/5/15m): 2.1 1.8 1.5
  Docker Containers: 12 running

━━━ Deployment Status ━━━
  ✓ weather-tool: 1/1 ready
  ✓ weather-service: 1/1 ready
  ✓ keycloak: 1/1 ready
  ✓ platform-operator: 1 ready

━━━ Pod Health Summary ━━━
  Total Pods: 45
  Running: 43
  Pending: 2

====================================================================
✓ Deployment is HEALTHY
====================================================================
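
The iterate-until-healthy behavior described above (20 iterations × 15 s by default) can be approximated with a small retry loop. This is a sketch under assumptions, not the contents of `verify_deployment.sh`: `poll_until` is a hypothetical function, and only the `MAX_ITERATIONS` and `POLL_INTERVAL` variable names come from the text above.

```shell
# Sketch only: NOT the code of verify_deployment.sh.
# poll_until retries a check command until it succeeds or the iteration
# budget (MAX_ITERATIONS, default 20) is used up, sleeping POLL_INTERVAL
# seconds (default 15) between attempts.
poll_until() {
  pu_max="${MAX_ITERATIONS:-20}"
  pu_interval="${POLL_INTERVAL:-15}"
  pu_i=1
  while [ "$pu_i" -le "$pu_max" ]; do
    if "$@"; then
      echo "healthy after $pu_i iteration(s)"
      return 0
    fi
    echo "iteration $pu_i/$pu_max not healthy"
    # Do not sleep after the final attempt.
    if [ "$pu_i" -lt "$pu_max" ]; then
      sleep "$pu_interval"
    fi
    pu_i=$((pu_i + 1))
  done
  echo "gave up after $pu_max iterations"
  return 1
}

# Usage (requires cluster access):
#   poll_until kubectl rollout status deployment/weather-tool -n team1 --timeout=5s
```

With the defaults, the worst case is roughly 20 × 15 s = 300 s plus the run time of the check itself.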

Run E2E Tests

cd kagenti

# Install test dependencies (first time)
uv pip install -r tests/requirements.txt

# Run all deployment health tests
uv run pytest tests/e2e/test_deployment_health.py -v

# Run only critical tests
uv run pytest tests/e2e/test_deployment_health.py -v --only-critical

# Exclude specific apps
uv run pytest tests/e2e/test_deployment_health.py -v --exclude-app=keycloak

Manual Health Checks

Quick Status Commands

# All pods across all namespaces
kubectl get pods -A

# All pods sorted by status
kubectl get pods -A --sort-by=.status.phase

# Only failing pods
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Pods with high restart count (with -A, RESTARTS is column 5; keep the header row)
kubectl get pods -A | awk 'NR == 1 || $5 + 0 > 3'

# All deployments
kubectl get deployments -A

# All services
kubectl get svc -A

# All namespaces
kubectl get ns
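
Filtering on restart count depends on which column holds RESTARTS: with `-A` the output starts with a NAMESPACE column, so RESTARTS is the fifth field, not the fourth. The filter below is a minimal sketch that reads `kubectl get pods -A` output on stdin; `high_restarts` is a hypothetical name, and the column layout is assumed to be the default (no `-o wide` or custom columns).

```shell
# Hypothetical filter: keep the header plus pods whose RESTARTS value
# (field 5 in `kubectl get pods -A` output) exceeds a threshold.
# "$5 + 0" forces a numeric comparison, so "12" in "12 (3m ago)" still works.
high_restarts() {
  hr_threshold="${1:-3}"
  awk -v t="$hr_threshold" 'NR == 1 || ($5 + 0) > t'
}

# Usage (requires cluster access):
#   kubectl get pods -A | high_restarts 3
```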

Platform Components Status

# Core platform namespaces
kubectl get pods -n kagenti-system       # Platform Operator
kubectl get pods -n keycloak              # Keycloak
kubectl get pods -n istio-system          # Istio
kubectl get pods -n spire-server          # SPIRE
kubectl get pods -n tekton-pipelines      # Tekton
kubectl get pods -n cert-manager          # Cert-Manager

# Agent namespaces
kubectl get pods -n team1                 # Team1 agents/tools
kubectl get pods -n team2                 # Team2 agents/tools

# Optional observability (if addons installed)
kubectl get pods -n observability         # Prometheus, Kiali, Phoenix
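
Running the per-namespace commands above one at a time produces a lot of output. A loop that prints one summary line per namespace may be easier to scan. This is a sketch, with `namespace_summary` as a hypothetical name; the namespace list is copied from the commands above and may differ in your installation.

```shell
# Namespace list copied from the commands above; adjust for your install.
CORE_NAMESPACES="kagenti-system keycloak istio-system spire-server tekton-pipelines cert-manager"

# Hypothetical helper: print one line per namespace with the pod count and the
# number of pods whose STATUS (field 3 without -A) is not Running or Completed.
namespace_summary() {
  for ns_name in "$@"; do
    kubectl get pods -n "$ns_name" --no-headers 2>/dev/null |
      awk -v ns="$ns_name" '
        NF > 0 { total++ }
        NF > 0 && $3 != "Running" && $3 != "Completed" { bad++ }
        END { printf "%s: %d pods, %d not Running/Completed\n", ns, total, bad }'
  done
}

# Usage (requires cluster access; word splitting of the list is intentional):
#   namespace_summary $CORE_NAMESPACES team1 team2
```

A pod can be `Running` while its containers are not ready, so a clean summary here is a first pass, not proof of health.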

Check Specific Components

Weather Tool & Service (Demo Agents)

# Deployments
kubectl get deployment -n team1 weather-tool
kubectl get deployment -n team1 weather-service

# Pods
kubectl get pods -n team1 -l app=weather-tool
kubectl get pods -n team1 -l app=weather-service

# Services & Endpoints
kubectl get svc -n team1 weather-tool
kubectl get endpoints -n team1 weather-tool
kubectl get svc -n team1 weather-service
kubectl get endpoints -n team1 weather-service

# Check logs
kubectl logs -n team1 deployment/weather-tool --tail=50
kubectl logs -n team1 deployment/weather-service --tail=50

Keycloak (Authentication)

# Check deployment/statefulset
kubectl get deployment -n keycloak keycloak 2>/dev/null || kubectl get statefulset -n keycloak keycloak

# Check pods
kubectl get pods -n keycloak -l app=keycloak

# Check logs
kubectl logs -n keycloak deployment/keycloak --tail=50 2>/dev/null || \
kubectl logs -n keycloak statefulset/keycloak --tail=50

# Test Keycloak endpoint
kubectl exec -n keycloak deployment/keycloak -c keycloak -- \
  curl -sf http://localhost:8080/health/ready || echo "Keycloak not ready"

# Access Keycloak UI (`open` is the macOS command; on Linux use xdg-open)
open http://keycloak.localtest.me:8080

Platform Operator

# Check operator deployment
kubectl get deployment -n kagenti-system -l control-plane=controller-manager

# Check operator pods
kubectl get pods -n kagenti-system -l control-plane=controller-manager

# Check operator logs
kubectl logs -n kagenti-system deployment/<operator-name> --tail=100

# Check Component CRDs
kubectl get components -A

Istio Service Mesh

# Istio control plane
kubectl get pods -n istio-system

# Check sidecar injection (should show 2/2 for injected pods)
kubectl get pods -A -o wide | grep "2/2"

# Istio gateway
kubectl get gateway -A

# Virtual services
kubectl get virtualservice -A

# Destination rules
kubectl get destinationrule -A
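
Grepping for "2/2" shows pods that do have two ready containers, but the pods of interest are usually the ones that do not. The filter below is a heuristic sketch: `missing_sidecar` is a hypothetical name, it assumes sidecar-mode injection (not ambient mode), and it treats any pod with fewer than two containers as lacking a sidecar, which is wrong for pods that are not meant to be injected.

```shell
# Hypothetical filter: read `kubectl get pods -n <ns> --no-headers` output on
# stdin and print the names of pods whose READY column (field 2, "ready/total")
# reports fewer than two containers in total.
missing_sidecar() {
  awk 'NF >= 2 { split($2, r, "/"); if (r[2] + 0 < 2) print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n team1 --no-headers | missing_sidecar
```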

SPIRE (Workload Identity)

# SPIRE Server
kubectl get pods -n spire-server

# SPIRE Agents (should be running on nodes)
kubectl get pods -n spire-mgmt

# Check SPIRE Server logs
kubectl logs -n spire-server deployment/spire-server --tail=50

Tekton Pipelines (Build System)

# Tekton components
kubectl get pods -n tekton-pipelines

# Pipeline runs
kubectl get pipelineruns -A

# Task runs
kubectl get taskruns -A

# Recent pipeline runs status
kubectl get pipelineruns -A --sort-by=.metadata.creationTimestamp | tail -10

Resource Usage

# Node resources (if metrics-server installed)
kubectl top nodes

# Pod resources
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20

# Namespace resource usage
kubectl top pods -n team1
kubectl top pods -n keycloak
kubectl top pods -n kagenti-system

# Docker container stats
docker stats --no-stream

Events (Recent Issues)

# All recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Events in specific namespace
kubectl get events -n team1 --sort-by='.lastTimestamp'

# Warning events only
kubectl get events -A --field-selector type=Warning

# Events for specific pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
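
A long list of warning events is easier to triage when grouped by reason. The sketch below counts reasons from `kubectl get events -A --field-selector type=Warning --no-headers` output read on stdin. `warning_reasons` is a hypothetical name, and the column order (NAMESPACE, LAST SEEN, TYPE, REASON, ...) is assumed to be the default table layout.

```shell
# Hypothetical filter: count warning events per REASON (field 4 when -A and
# --no-headers are used) and list the most frequent reasons first.
warning_reasons() {
  awk 'NF >= 4 { count[$4]++ } END { for (r in count) print count[r], r }' |
    sort -k1,1nr -k2,2
}

# Usage (requires cluster access):
#   kubectl get events -A --field-selector type=Warning --no-headers | warning_reasons
```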

Component-Specific Health Checks

Keycloak Authentication

# Check Keycloak readiness
kubectl exec -n keycloak deployment/keycloak -c keycloak -- \
  curl -sf http://localhost:8080/health/ready && echo "✓ Keycloak Ready" || echo "✗ Keycloak Not Ready"

# Get admin credentials
KEYCLOAK_USER=$(kubectl get secret -n keycloak keycloak-initial-admin -o jsonpath='{.data.username}' | base64 -d)
KEYCLOAK_PASS=$(kubectl get secret -n keycloak keycloak-initial-admin -o jsonpath='{.data.password}' | base64 -d)
echo "Username: $KEYCLOAK_USER"
echo "Password: $KEYCLOAK_PASS"

# Test Keycloak OIDC endpoint
curl -s "http://keycloak.localtest.me:8080/realms/master/.well-known/openid-configuration" | python3 -m json.tool

Kagenti UI

# Check UI deployment
kubectl get deployment -n kagenti-system kagenti-ui

# Check UI pods
kubectl get pods -n kagenti-system -l app=kagenti-ui

# Check UI logs
kubectl logs -n kagenti-system deployment/kagenti-ui --tail=50

# Access UI
open http://kagenti-ui.localtest.me:8080

Observability Stack (if addons installed)

# Prometheus
kubectl get pods -n observability -l app=prometheus
kubectl exec -n observability deployment/prometheus -- \
  curl -sf http://localhost:9090/-/ready && echo "✓ Prometheus Ready" || echo "✗ Not Ready"

# Port-forward to access
kubectl port-forward -n observability svc/prometheus 9090:9090 &
open http://localhost:9090

# Kiali
kubectl get pods -n observability -l app=kiali
kubectl port-forward -n observability svc/kiali 20001:20001 &
open http://localhost:20001

# Phoenix (LLM tracing)
kubectl get pods -n observability -l app=phoenix
open http://phoenix.localtest.me:8080

Common Health Issues

Issue: Pods stuck in Pending

# Check pod description for reason
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Insufficient CPU/memory
# - No nodes available
# - Unbound PersistentVolumeClaim
# - Image pull errors

# Check node resources
kubectl top nodes
kubectl describe node <node-name>

Issue: Pods in CrashLoopBackOff

# Check previous logs (before crash)
kubectl logs <pod-name> -n <namespace> --previous

# Check current logs
kubectl logs <pod-name> -n <namespace>

# Check events
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

# Describe pod for error details
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Application error on startup
# - Missing configuration/secrets
# - Dependency not available
# - Liveness/readiness probe failing
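
The four CrashLoopBackOff commands above can be collected into files in one step, which also keeps their output out of the main context. This is a sketch under assumptions: `crash_report` is a hypothetical name, and the `crash-<namespace>-<pod>` directory layout is invented for illustration.

```shell
# Hypothetical helper: gather previous logs, current logs, events, and the pod
# description for one pod into a directory under $LOG_DIR, then print the path.
# Individual kubectl failures (for example, no previous container) are recorded
# in the corresponding file instead of aborting the collection.
crash_report() {
  cr_ns="$1"
  cr_pod="$2"
  cr_dir="${LOG_DIR:-/tmp/kagenti/k8s/${CLUSTER:-local}}/crash-$cr_ns-$cr_pod"
  mkdir -p "$cr_dir"
  kubectl logs "$cr_pod" -n "$cr_ns" --previous > "$cr_dir/previous.log" 2>&1 || true
  kubectl logs "$cr_pod" -n "$cr_ns" > "$cr_dir/current.log" 2>&1 || true
  kubectl get events -n "$cr_ns" --field-selector "involvedObject.name=$cr_pod" \
    > "$cr_dir/events.log" 2>&1 || true
  kubectl describe pod "$cr_pod" -n "$cr_ns" > "$cr_dir/describe.log" 2>&1 || true
  echo "$cr_dir"
}

# Usage (requires cluster access):
#   crash_report team1 <pod-name>
```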

Issue: Deployment not ready

# Check deployment status
kubectl get deployment -n <namespace> <deployment-name>
kubectl describe deployment -n <namespace> <deployment-name>

# Check replica set
kubectl get rs -n <namespace>
kubectl describe rs -n <namespace> <replicaset-name>

# Check pods
kubectl get pods -n <namespace> -l app=<label>

# Force rollout restart
kubectl rollout restart deployment/<deployment-name> -n <namespace>

# Check rollout status
kubectl rollout status deployment/<deployment-name> -n <namespace>

Issue: Service has no endpoints

# Check service
kubectl get svc -n <namespace> <service-name>
kubectl describe svc -n <namespace> <service-name>

# Check endpoints
kubectl get endpoints -n <namespace> <service-name>

# Common causes:
# - No pods with matching labels
# - Pods not ready (failing health checks)
# - Selector mismatch

# Verify pod labels match service selector
kubectl get pods -n <namespace> --show-labels
kubectl get svc -n <namespace> <service-name> -o yaml | grep -A5 selector
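
Comparing labels and selectors by eye is error-prone. One way to test for a selector mismatch is to turn the service's selector into a label query and ask which pods match it. The sketch below assumes a recent kubectl whose jsonpath output prints the selector map as JSON (for example `{"app":"weather-tool"}`); `selector_query` and `pods_for_service` are hypothetical names.

```shell
# Hypothetical filter: convert a JSON selector map such as
# {"app":"weather-tool","tier":"demo"} into a label query app=weather-tool,tier=demo.
# Safe for label maps because label keys and values cannot contain { } " : or ,
selector_query() {
  tr -d '{}" ' | tr ':' '='
}

# Hypothetical helper: list the pods matched by a service's selector.
pods_for_service() {
  ps_ns="$1"
  ps_svc="$2"
  ps_sel=$(kubectl get svc -n "$ps_ns" "$ps_svc" -o jsonpath='{.spec.selector}' | selector_query)
  if [ -z "$ps_sel" ]; then
    echo "service $ps_svc has no selector"
    return 1
  fi
  echo "selector: $ps_sel"
  kubectl get pods -n "$ps_ns" -l "$ps_sel" --show-labels
}

# Usage (requires cluster access):
#   pods_for_service team1 weather-tool
```

If the selector is printed but no pods are listed, the selector and the pod labels do not match; if pods are listed but the service still has no endpoints, check their readiness.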

Issue: High resource usage

# Find top consumers
kubectl top pods -A --sort-by=memory | head -10
kubectl top pods -A --sort-by=cpu | head -10

# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Limits:"

# Check for OOM kills (matches both "OOMKilled" and "OOMKilling" in event text)
kubectl get events -A | grep -i "oomkill"

# Increase resources (edit deployment)
kubectl edit deployment -n <namespace> <deployment-name>
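
Sorting by memory shows the top consumers, but a threshold is often more useful for spotting outliers. The filter below is a sketch that reads `kubectl top pods -A` output on stdin. `memory_over` is a hypothetical name, and it assumes memory is reported in `Mi`, which is the usual unit but not guaranteed; rows in other units, and the header row, are skipped.

```shell
# Hypothetical filter: print "namespace/pod memory" for pods whose MEMORY
# column (field 4 in `kubectl top pods -A` output) exceeds a limit in Mi.
memory_over() {
  awk -v limit="${1:-512}" '
    $4 ~ /Mi$/ {
      mi = $4
      sub(/Mi$/, "", mi)          # strip the unit, keep the number
      if (mi + 0 > limit) print $1 "/" $2, $4
    }'
}

# Usage (requires cluster access and metrics-server):
#   kubectl top pods -A | memory_over 512
```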

Issue: ImagePullBackOff

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Image doesn't exist
# - Wrong image tag
# - No access to registry
# - Network issues

# For Kind cluster, check if image is loaded
docker exec agent-platform-control-plane crictl images | grep <image-name>

# Load image into Kind
kind load docker-image <image-name> --name agent-platform

Automated Monitoring

Watch Commands

# Watch all pods
watch -n 5 'kubectl get pods -A'

# Watch failing pods only
watch -n 5 'kubectl get pods -A | grep -vE "Running|Completed"'

# Watch deployments
watch -n 5 'kubectl get deployments -A'

# Watch specific namespace
watch -n 5 'kubectl get pods -n team1'

# Watch events
watch -n 10 'kubectl get events -A --sort-by=.lastTimestamp | tail -20'

Continuous Health Monitoring

# Run health check in loop
while true; do
  echo "=== Health Check $(date) ==="
  .github/scripts/verify_deployment.sh
  echo "Waiting 5 minutes..."
  sleep 300
done

K8shealth

Platform Health Check Skill

Context-Safe Execution (MANDATORY)

When to Use

Quick Health Check

Automated Health Check Script

Run E2E Tests

Manual Health Checks

Quick Status Commands

Platform Components Status

Check Specific Components

Weather Tool & Service (Demo Agents)

Keycloak (Authentication)

Platform Operator

Istio Service Mesh

SPIRE (Workload Identity)

Tekton Pipelines (Build System)

Resource Usage

Events (Recent Issues)

Component-Specific Health Checks

Keycloak Authentication

Kagenti UI

Observability Stack (if addons installed)

Health Check Checklists

Post-Deployment Health Check

Pre-Change Health Check

Incident Investigation Health Check

Common Health Issues

Issue: Pods stuck in Pending

Issue: Pods in CrashLoopBackOff

Issue: Deployment not ready

Issue: Service has no endpoints

Issue: High resource usage

Issue: ImagePullBackOff

Automated Monitoring

Watch Commands

Continuous Health Monitoring

Integration with Other Skills

Pro Tips

Related Skills
