Scheduled cluster, agent workforce, and pipeline health checks with anomaly detection and auto-response triggers.
Scheduled health checks that feed into Self-Healing (Layer 3), Orchestration (Layer 1), and Decision Engine (Layer 2).
Runs as a scheduled trigger: /schedule neb-monitor every 5m
Can also be run manually: /neb-monitor
!source "$(git rev-parse --show-toplevel 2>/dev/null || echo .)/.env" 2>/dev/null && echo "API_URL=${NEB_TASK_API_URL:-NOT SET}" && echo "COMPANY_ID=${NEB_TASK_COMPANY_ID:-NOT SET}" && echo "API_KEY=${NEB_TASK_API_KEY:+SET}" || echo "No .env file found"
NEVER use the word "paperclip" in any user-facing output. Use "task management platform" or "platform" instead.
Snapshot directory: `.claude/monitoring/snapshots/`
Schema: `.claude/monitoring/schema.md`
Critical namespaces: `nebcore-system`, `crossplane-system`, `argocd`, `istio-system`, `cert-manager`, `monitoring`

Run all seven steps in order. Each step builds data for the final snapshot.
Determine the kubectl context and cluster dynamically:
# Use the current kubectl context (do NOT hardcode cluster names)
KUBE_CONTEXT=$(kubectl config current-context 2>/dev/null)
# If running in a pod, the in-cluster config is used automatically
# If running locally, prefer *-ext contexts for reliability
if kubectl config get-contexts -o name 2>/dev/null | grep -q "\-ext$"; then
KUBE_CONTEXT=$(kubectl config get-contexts -o name | grep "\-ext$" | head -1)
fi
Use the discovered context for all commands. Apply timeout 10s to every kubectl call.
ArgoCD Applications:
timeout 10s kubectl --context "$KUBE_CONTEXT" get apps -n argocd \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.health.status}{"\t"}{.status.sync.status}{"\n"}{end}'
Flag any app where:
- health is not `Healthy`
- sync status is not `Synced`
- `Progressing` for more than 10 minutes (check `.status.operationState.startedAt`)

Record counts: total_apps, healthy, degraded, progressing, stuck_syncs.
Record apps_needing_attention with name, health, sync, and timestamp.
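The counting step above can be sketched as a small awk summarizer over the tab-separated `name<TAB>health<TAB>sync` lines the jsonpath produces. `summarize_apps` is a hypothetical helper name, and `out_of_sync` is used here as a simple approximation of the stuck-sync count (it does not check the 10-minute `Progressing` window):

```shell
# Hypothetical summarizer for the ArgoCD jsonpath output above.
# Input lines: name<TAB>health<TAB>sync
summarize_apps() {
  awk -F'\t' '
    { total++ }
    $2 == "Healthy"     { healthy++ }
    $2 == "Degraded"    { degraded++ }
    $2 == "Progressing" { progressing++ }
    $3 != "Synced"      { out_of_sync++ }
    END { printf "total_apps=%d healthy=%d degraded=%d progressing=%d out_of_sync=%d\n", total, healthy, degraded, progressing, out_of_sync }'
}
```

Usage: pipe the kubectl command into it, e.g. `timeout 10s kubectl get apps -n argocd -o jsonpath=... | summarize_apps`.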
Pod Health (critical namespaces):
for ns in nebcore-system crossplane-system argocd istio-system cert-manager monitoring; do
timeout 10s kubectl --context "$KUBE_CONTEXT" get pods -n $ns \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.namespace}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].state}{"\n"}{end}'
done
Flag pods where:
- phase is not `Running` and not `Succeeded`
- `restartCount` > 5
- state is `CrashLoopBackOff` or `ImagePullBackOff`

Record counts: total, running, crash_loop, image_pull_backoff, pending.
Record pods_needing_attention with name, namespace, status, restarts.
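The pod counts can be derived the same way. A sketch, assuming the five tab-separated fields from the jsonpath above (`flag_pods` and the `high_restarts` counter are illustrative names, not part of the schema):

```shell
# Hypothetical counter for the pod jsonpath output above.
# Input lines: name<TAB>namespace<TAB>phase<TAB>restartCount<TAB>state
flag_pods() {
  awk -F'\t' '
    { total++ }
    $3 == "Running" { running++ }
    $3 == "Pending" { pending++ }
    $4+0 > 5        { high_restarts++ }
    $5 ~ /CrashLoopBackOff/ { crash_loop++ }
    $5 ~ /ImagePullBackOff/ { image_pull++ }
    END { printf "total=%d running=%d pending=%d crash_loop=%d image_pull_backoff=%d high_restarts=%d\n", total, running, pending, crash_loop, image_pull, high_restarts }'
}
```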
Certificates:
timeout 10s kubectl --context "$KUBE_CONTEXT" get certificates -A \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.namespace}{"\t"}{.status.notAfter}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
Flag any certificate where:
- the `Ready` condition is not `True`
- `notAfter` is within the next 7 days (expiring soon, matching the Step 6 response threshold)

Record counts: total, expiring_soon.
Record certs_needing_attention with name, namespace, expires.
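The expiry window can be computed from `notAfter` with epoch arithmetic. A minimal sketch, assuming GNU `date` (`-d`); `days_until` is a hypothetical helper name and the 7-day threshold mirrors the Step 6 response table:

```shell
# days_until: whole days from now until an ISO-8601 timestamp.
# Assumes GNU date; name and threshold are illustrative.
days_until() {
  local not_after="$1"
  echo $(( ( $(date -d "$not_after" +%s) - $(date +%s) ) / 86400 ))
}

# Example: flag a cert when fewer than 7 days remain.
# [ "$(days_until "$NOT_AFTER")" -lt 7 ] && expiring_soon=$((expiring_soon + 1))
```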
Crossplane Managed Resources:
timeout 10s kubectl --context "$KUBE_CONTEXT" get managed \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.namespace}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Synced")].status}{"\n"}{end}'
Flag managed resources where:
- the `Ready` condition is not `True`
- the `Synced` condition is not `True`

Record counts: total_mrs, ready, stuck.
Record mrs_needing_attention with name, namespace, status.
Error handling: If any kubectl command times out or fails, record the section as error: <message> in the snapshot and continue with the next check. Do not abort the entire monitoring cycle.
Source environment variables from .env:
source "$(git rev-parse --show-toplevel)/.env"
Fetch all agents:
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/agents" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
Record total agents and determine active vs idle based on agent status.
Fetch budget data:
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/costs/summary" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
Or derive from individual agent records: sum budgetMonthlyCents and spentMonthlyCents.
Calculate percentage = (spent / total_monthly) * 100.
Set alert: true if percentage > 80%.
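The budget math above can be sketched with integer arithmetic on the cent values the API returns (the sample figures here are placeholders, not real data):

```shell
# Budget calculation sketch; cent values are illustrative placeholders.
spent_cents=45000
budget_cents=50000
percentage=$(( spent_cents * 100 / budget_cents ))
alert=false
[ "$percentage" -gt 80 ] && alert=true
echo "budget: ${percentage}% alert=${alert}"
```

Integer division truncates, which is fine here since the 80% threshold only needs whole-percent precision.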
Fetch open tasks:
# In-progress tasks
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/issues?status=in_progress" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
# Blocked tasks
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/issues?status=blocked" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
# Todo tasks
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/issues?status=todo" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
Orphaned task detection:
A task is orphaned if:
- status is `in_progress`
- `updatedAt` (or last comment timestamp) is older than 2 hours

For each in-progress task, check `updatedAt`. If stale, fetch comments to check for recent activity:
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/issues/${ISSUE_ID}/comments" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
Record: total_open, in_progress, blocked, orphaned.
Record tasks_needing_attention with id, title, status, issue type, timestamp.
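The 2-hour staleness test can be sketched as a small predicate, assuming GNU `date` and an ISO-8601 `updatedAt` value (`is_stale` is a hypothetical helper name):

```shell
# is_stale: true when the given ISO-8601 timestamp is more than
# 2 hours (7200 s) in the past. Assumes GNU date.
is_stale() {
  local updated_at="$1" now
  now=$(date +%s)
  [ $(( now - $(date -d "$updated_at" +%s) )) -gt 7200 ]
}

# Example: is_stale "$UPDATED_AT" && echo "candidate orphan"
```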
Error handling: If the platform API is unreachable, record the agents/tasks sections as error: platform API unreachable and continue.
# Open PRs across the org
gh pr list --state open --json number,title,createdAt,headRepository 2>/dev/null || echo "[]"
Calculate:
- `open_prs`: count of open PRs
- `oldest_pr_age_hours`: age in hours of the oldest open PR
- `recent_validation_failures`: count of PRs with failing checks in the last 24 hours (check CI status via `gh pr checks` if feasible)

Error handling: If `gh` is not authenticated or fails, record the pipeline section as `error: <message>` and continue.
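Given a `createdAt` timestamp extracted from the `gh pr list --json` output, the age calculation is straightforward epoch arithmetic. A sketch, assuming GNU `date` (`pr_age_hours` is a hypothetical helper name):

```shell
# pr_age_hours: whole hours since an ISO-8601 createdAt timestamp.
# Assumes GNU date; helper name is illustrative.
pr_age_hours() {
  echo $(( ( $(date +%s) - $(date -d "$1" +%s) ) / 3600 ))
}
```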
Compile all data from Steps 1-3 into a YAML snapshot following the schema in .claude/monitoring/schema.md.
Save to:
.claude/monitoring/snapshots/$(date +%Y-%m-%d-%H%M%S).yaml
The snapshot must include:
- `timestamp`: current ISO-8601 timestamp
- `cluster`: all cluster health data from Step 1
- `agents`: workforce data from Step 2
- `tasks`: task data from Step 2
- `pipeline`: pipeline data from Step 3
- `anomalies`: populated in Step 5
- `actions_taken`: populated in Step 6

Write the snapshot file, then update it after Steps 5 and 6.
Read the last 3 snapshots from .claude/monitoring/snapshots/ (sorted by filename, which is chronological).
Compare current snapshot values against the average of the last 3:
| Condition | Anomaly tag |
|---|---|
| crash_loop count increased 3x over average | crash_loop_spike |
| degraded app count increased by 3+ over average | argocd_degradation_spike |
| orphaned task count increased by 2+ over average | orphaned_task_spike |
| Budget percentage jumped 10%+ since last check | budget_burn_spike |
| Crossplane stuck count increased by 5+ | crossplane_stuck_spike |
| Certificate expiring_soon increased when previously 0 | cert_expiry_new |
If fewer than 3 prior snapshots exist, skip anomaly detection for rate-based rules (note insufficient_history in anomalies list).
Append detected anomalies to the anomalies list in the snapshot.
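As one concrete example, the 3x crash-loop rule can be checked with awk. The three historical values and the current count below are placeholders, not real snapshot data:

```shell
# Sketch of the crash_loop_spike rule; values are placeholders.
# Average the crash_loop counts from the last three snapshots:
avg=$(printf '%s\n' 1 2 3 | awk '{ s += $1; n++ } END { printf "%.2f", s/n }')
current=7
# Flag a spike when the current count is at least 3x the average:
spike=$(awk -v c="$current" -v a="$avg" 'BEGIN { print (a > 0 && c >= 3*a) ? "crash_loop_spike" : "none" }')
echo "$spike"
```

The `a > 0` guard avoids flagging a spike when the historical average is zero; the other rate-based rules differ only in the comparison.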
For each issue found, determine and log the appropriate response. Since Layers 1-3 may not be fully operational yet, log the intended action and attempt invocation where possible.
| Issue | Response | Integration Point |
|---|---|---|
| ArgoCD app degraded/stuck | Invoke neb-self-heal with app name and namespace | Layer 3: Self-Healing |
| Pod CrashLoopBackOff | Invoke neb-self-heal with pod name, namespace, restart count | Layer 3: Self-Healing |
| Certificate expiring within 7 days | Invoke neb-self-heal with cert name and namespace | Layer 3: Self-Healing |
| Crossplane MR stuck (Ready!=True for >30min) | Invoke neb-self-heal with MR name and status | Layer 3: Self-Healing |
| Orphaned task (in_progress >2h, no activity) | Post wake comment on task: @<assignee> This task appears stalled. Please update status or request help. | Layer 1: Orchestration |
| Blocked task with no blocker linked | Post comment suggesting the assignee document the blocker | Layer 1: Orchestration |
| Budget > 80% | Post alert comment to coordinator agent task | Layer 2: Decision Engine |
| Any anomaly spike detected | Log to snapshot; if neb-self-heal is available, invoke with anomaly context; otherwise log as pending_response | All Layers |
Integration status checking:
Before invoking another skill, check if it exists:
ls .claude/skills/neb-self-heal/SKILL.md 2>/dev/null
If the skill does not exist, log the action as pending: <skill> not available in actions_taken.
Append all actions (taken or pending) to the actions_taken list in the snapshot.
Delete snapshots older than 7 days:
find .claude/monitoring/snapshots/ -name "*.yaml" -mtime +7 -delete
After all steps, display a summary:
=== Monitoring Cycle Complete ===
Timestamp: <ISO-8601>
Snapshot: .claude/monitoring/snapshots/<filename>.yaml
Cluster:
ArgoCD: {healthy}/{total} healthy | {degraded} degraded | {stuck} stuck
Pods: {running}/{total} running | {crash_loop} crash loops | {pending} pending
Certs: {ok}/{total} valid | {expiring_soon} expiring soon
Crossplane: {ready}/{total} ready | {stuck} stuck
Workforce:
Agents: {active}/{total} active
Budget: {percentage}% used {alert_marker}
Tasks: {in_progress} active | {blocked} blocked | {orphaned} orphaned
Pipeline:
PRs: {open_prs} open | oldest: {oldest_pr_age_hours}h
Anomalies: {count} detected
{list each anomaly}
Actions: {count} triggered
{list each action}
If there are no issues at all, display:
=== Monitoring Cycle Complete — All Clear ===
Schedule setup requires a running platform instance. When ready:
/schedule create neb-monitor --cron "*/5 * * * *" --description "Continuous cluster and agent health monitoring"
Verify with:
/schedule list
| Superpowers Skill | When to Use |
|---|---|
/superpowers:verification-before-completion | Before reporting — confirm detected anomalies against actual cluster state, don't report stale data |