Scheduled cluster, agent workforce, and pipeline health checks with anomaly detection and auto-response triggers.
Scheduled health checks that feed into Self-Healing (Layer 3), Orchestration (Layer 1), and Decision Engine (Layer 2).
Runs as a scheduled trigger: /schedule neb-monitor every 5m
Can also be run manually: /neb-monitor
!source "$(git rev-parse --show-toplevel 2>/dev/null || echo .)/.env" 2>/dev/null && echo "API_URL=${NEB_TASK_API_URL:-NOT SET}" && echo "COMPANY_ID=${NEB_TASK_COMPANY_ID:-NOT SET}" && echo "API_KEY=${NEB_TASK_API_KEY:+SET}" || echo "No .env file found"
NEVER use the word "paperclip" in any user-facing output. Use "task management platform" or "platform" instead.
Snapshot directory: `.claude/monitoring/snapshots/`
Schema: `.claude/monitoring/schema.md`
Critical namespaces: `nebcore-system`, `crossplane-system`, `argocd`, `istio-system`, `cert-manager`, `monitoring`

Run all seven steps in order. Each step builds data for the final snapshot.
Determine the kubectl context and cluster dynamically:
# Use the current kubectl context (do NOT hardcode cluster names)
KUBE_CONTEXT=$(kubectl config current-context 2>/dev/null)
# If running in a pod, the in-cluster config is used automatically
# If running locally, prefer *-ext contexts for reliability
if kubectl config get-contexts -o name 2>/dev/null | grep -q "\-ext$"; then
KUBE_CONTEXT=$(kubectl config get-contexts -o name | grep "\-ext$" | head -1)
fi
Use the discovered context for all commands. Apply timeout 10s to every kubectl call.
ArgoCD Applications:
timeout 10s kubectl --context "$KUBE_CONTEXT" get apps -n argocd \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.health.status}{"\t"}{.status.sync.status}{"\n"}{end}'
Flag any app where:
- health is not `Healthy`
- sync status is not `Synced`
- `Progressing` for more than 10 minutes (check `.status.operationState.startedAt`)

Record counts: total_apps, healthy, degraded, progressing, stuck_syncs.
Record apps_needing_attention with name, health, sync, and timestamp.
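The counting step above can be sketched as a small awk summarizer over the tab-separated `name<TAB>health<TAB>sync` lines the jsonpath produces. `summarize_apps` is a hypothetical helper name, and `out_of_sync` is used here as a simple approximation of the stuck-sync count (it does not check the 10-minute `Progressing` window):

```shell
# Hypothetical summarizer for the ArgoCD jsonpath output above.
# Input lines: name<TAB>health<TAB>sync
summarize_apps() {
  awk -F'\t' '
    { total++ }
    $2 == "Healthy"     { healthy++ }
    $2 == "Degraded"    { degraded++ }
    $2 == "Progressing" { progressing++ }
    $3 != "Synced"      { out_of_sync++ }
    END { printf "total_apps=%d healthy=%d degraded=%d progressing=%d out_of_sync=%d\n", total, healthy, degraded, progressing, out_of_sync }'
}
```

Usage: pipe the kubectl command into it, e.g. `timeout 10s kubectl get apps -n argocd -o jsonpath=... | summarize_apps`.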
Pod Health (critical namespaces):
for ns in nebcore-system crossplane-system argocd istio-system cert-manager monitoring; do
timeout 10s kubectl --context "$KUBE_CONTEXT" get pods -n $ns \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.namespace}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].state}{"\n"}{end}'
done
Flag pods where:
- phase is not `Running` and not `Succeeded`
- `restartCount` > 5
- state is `CrashLoopBackOff` or `ImagePullBackOff`

Record counts: total, running, crash_loop, image_pull_backoff, pending.
Record pods_needing_attention with name, namespace, status, restarts.
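The pod counts can be derived the same way. A sketch, assuming the five tab-separated fields from the jsonpath above (`flag_pods` and the `high_restarts` counter are illustrative names, not part of the schema):

```shell
# Hypothetical counter for the pod jsonpath output above.
# Input lines: name<TAB>namespace<TAB>phase<TAB>restartCount<TAB>state
flag_pods() {
  awk -F'\t' '
    { total++ }
    $3 == "Running" { running++ }
    $3 == "Pending" { pending++ }
    $4+0 > 5        { high_restarts++ }
    $5 ~ /CrashLoopBackOff/ { crash_loop++ }
    $5 ~ /ImagePullBackOff/ { image_pull++ }
    END { printf "total=%d running=%d pending=%d crash_loop=%d image_pull_backoff=%d high_restarts=%d\n", total, running, pending, crash_loop, image_pull, high_restarts }'
}
```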
Certificates:
timeout 10s kubectl --context "$KUBE_CONTEXT" get certificates -A \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.namespace}{"\t"}{.status.notAfter}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
Flag any certificate where:
- the `Ready` condition is not `True`
- `notAfter` is within the next 7 days (expiring soon, matching the Step 6 response threshold)

Record counts: total, expiring_soon.
Record certs_needing_attention with name, namespace, expires.
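The expiry window can be computed from `notAfter` with epoch arithmetic. A minimal sketch, assuming GNU `date` (`-d`); `days_until` is a hypothetical helper name and the 7-day threshold mirrors the Step 6 response table:

```shell
# days_until: whole days from now until an ISO-8601 timestamp.
# Assumes GNU date; name and threshold are illustrative.
days_until() {
  local not_after="$1"
  echo $(( ( $(date -d "$not_after" +%s) - $(date +%s) ) / 86400 ))
}

# Example: flag a cert when fewer than 7 days remain.
# [ "$(days_until "$NOT_AFTER")" -lt 7 ] && expiring_soon=$((expiring_soon + 1))
```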
Crossplane Managed Resources:
timeout 10s kubectl --context "$KUBE_CONTEXT" get managed \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.namespace}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Synced")].status}{"\n"}{end}'
Flag managed resources where:
- the `Ready` condition is not `True`
- the `Synced` condition is not `True`

Record counts: total_mrs, ready, stuck.
Record mrs_needing_attention with name, namespace, status.
Error handling: If any kubectl command times out or fails, record the section as error: <message> in the snapshot and continue with the next check. Do not abort the entire monitoring cycle.
Source environment variables from .env:
source "$(git rev-parse --show-toplevel)/.env"
Fetch all agents:
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/agents" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
Record total agents and determine active vs idle based on agent status.
Fetch budget data:
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/costs/summary" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
Or derive from individual agent records: sum budgetMonthlyCents and spentMonthlyCents.
Calculate percentage = (spent / total_monthly) * 100.
Set alert: true if percentage > 80%.
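The budget math above can be sketched with integer arithmetic on the cent values the API returns (the sample figures here are placeholders, not real data):

```shell
# Budget calculation sketch; cent values are illustrative placeholders.
spent_cents=45000
budget_cents=50000
percentage=$(( spent_cents * 100 / budget_cents ))
alert=false
[ "$percentage" -gt 80 ] && alert=true
echo "budget: ${percentage}% alert=${alert}"
```

Integer division truncates, which is fine here since the 80% threshold only needs whole-percent precision.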
Fetch open tasks:
# In-progress tasks
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/issues?status=in_progress" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
# Blocked tasks
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/issues?status=blocked" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
# Todo tasks
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/companies/${NEB_TASK_COMPANY_ID}/issues?status=todo" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
Orphaned task detection:
A task is orphaned if:
- status is `in_progress`
- `updatedAt` (or last comment timestamp) is older than 2 hours

For each in-progress task, check `updatedAt`. If stale, fetch comments to check for recent activity:
curl -sf --max-time 10 \
"${NEB_TASK_API_URL}/api/issues/${ISSUE_ID}/comments" \
-H "Authorization: Bearer ${NEB_TASK_API_KEY}"
Record: total_open, in_progress, blocked, orphaned.
Record tasks_needing_attention with id, title, status, issue type, timestamp.
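The 2-hour staleness test can be sketched as a small predicate, assuming GNU `date` and an ISO-8601 `updatedAt` value (`is_stale` is a hypothetical helper name):

```shell
# is_stale: true when the given ISO-8601 timestamp is more than
# 2 hours (7200 s) in the past. Assumes GNU date.
is_stale() {
  local updated_at="$1" now
  now=$(date +%s)
  [ $(( now - $(date -d "$updated_at" +%s) )) -gt 7200 ]
}

# Example: is_stale "$UPDATED_AT" && echo "candidate orphan"
```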
Error handling: If the platform API is unreachable, record the agents/tasks sections as error: platform API unreachable and continue.
# Open PRs across the org
gh pr list --state open --json number,title,createdAt,headRepository 2>/dev/null || echo "[]"
Calculate:
- `open_prs`: count of open PRs
- `oldest_pr_age_hours`: age in hours of the oldest open PR
- `recent_validation_failures`: count of PRs with failing checks in the last 24 hours (check CI status via `gh pr checks` if feasible)

Error handling: If `gh` is not authenticated or fails, record the pipeline section as `error: <message>` and continue.
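Given a `createdAt` timestamp extracted from the `gh pr list --json` output, the age calculation is straightforward epoch arithmetic. A sketch, assuming GNU `date` (`pr_age_hours` is a hypothetical helper name):

```shell
# pr_age_hours: whole hours since an ISO-8601 createdAt timestamp.
# Assumes GNU date; helper name is illustrative.
pr_age_hours() {
  echo $(( ( $(date +%s) - $(date -d "$1" +%s) ) / 3600 ))
}
```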
Compile all data from Steps 1-3 into a YAML snapshot following the schema in .claude/monitoring/schema.md.
Save to:
.claude/monitoring/snapshots/$(date +%Y-%m-%d-%H%M%S).yaml
The snapshot must include:
- `timestamp`: current ISO-8601 timestamp
- `cluster`: all cluster health data from Step 1
- `agents`: workforce data from Step 2
- `tasks`: task data from Step 2
- `pipeline`: pipeline data from Step 3
- `anomalies`: populated in Step 5
- `actions_taken`: populated in Step 6

Write the snapshot file, then update it after Steps 5 and 6.
Read the last 3 snapshots from .claude/monitoring/snapshots/ (sorted by filename, which is chronological).
Compare current snapshot values against the average of the last 3:
| Condition | Anomaly tag |
|---|---|
| crash_loop count increased 3x over average | crash_loop_spike |
| degraded app count increased by 3+ over average | argocd_degradation_spike |
| orphaned task count increased by 2+ over average | orphaned_task_spike |
| Budget percentage jumped 10%+ since last check | budget_burn_spike |
| Crossplane stuck count increased by 5+ | crossplane_stuck_spike |
| Certificate expiring_soon increased when previously 0 | cert_expiry_new |
If fewer than 3 prior snapshots exist, skip anomaly detection for rate-based rules (note insufficient_history in anomalies list).
Append detected anomalies to the anomalies list in the snapshot.
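As one concrete example, the 3x crash-loop rule can be checked with awk. The three historical values and the current count below are placeholders, not real snapshot data:

```shell
# Sketch of the crash_loop_spike rule; values are placeholders.
# Average the crash_loop counts from the last three snapshots:
avg=$(printf '%s\n' 1 2 3 | awk '{ s += $1; n++ } END { printf "%.2f", s/n }')
current=7
# Flag a spike when the current count is at least 3x the average:
spike=$(awk -v c="$current" -v a="$avg" 'BEGIN { print (a > 0 && c >= 3*a) ? "crash_loop_spike" : "none" }')
echo "$spike"
```

The `a > 0` guard avoids flagging a spike when the historical average is zero; the other rate-based rules differ only in the comparison.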
For each issue found, determine and log the appropriate response. Since Layers 1-3 may not be fully operational yet, log the intended action and attempt invocation where possible.
| Issue | Response | Integration Point |
|---|---|---|
| ArgoCD app degraded/stuck | Invoke neb-self-heal with app name and namespace | Layer 3: Self-Healing |
| Pod CrashLoopBackOff | Invoke neb-self-heal with pod name, namespace, restart count | Layer 3: Self-Healing |
| Certificate expiring within 7 days | Invoke neb-self-heal with cert name and namespace | Layer 3: Self-Healing |
| Crossplane MR stuck (Ready!=True for >30min) | Invoke neb-self-heal with MR name and status | Layer 3: Self-Healing |
| Orphaned task (in_progress >2h, no activity) | Post wake comment on task: @<assignee> This task appears stalled. Please update status or request help. | Layer 1: Orchestration |
| Blocked task with no blocker linked | Post comment suggesting the assignee document the blocker | Layer 1: Orchestration |
| Budget > 80% | Post alert comment to coordinator agent task | Layer 2: Decision Engine |
| Any anomaly spike detected | Log to snapshot; if neb-self-heal is available, invoke with anomaly context; otherwise log as pending_response | All Layers |
Integration status checking:
Before invoking another skill, check if it exists:
ls .claude/skills/neb-self-heal/SKILL.md 2>/dev/null
If the skill does not exist, log the action as pending: <skill> not available in actions_taken.
Append all actions (taken or pending) to the actions_taken list in the snapshot.
Delete snapshots older than 7 days:
find .claude/monitoring/snapshots/ -name "*.yaml" -mtime +7 -delete
After all steps, display a summary:
=== Monitoring Cycle Complete ===
Timestamp: <ISO-8601>
Snapshot: .claude/monitoring/snapshots/<filename>.yaml
Cluster:
ArgoCD: {healthy}/{total} healthy | {degraded} degraded | {stuck} stuck
Pods: {running}/{total} running | {crash_loop} crash loops | {pending} pending
Certs: {ok}/{total} valid | {expiring_soon} expiring soon
Crossplane: {ready}/{total} ready | {stuck} stuck
Workforce:
Agents: {active}/{total} active
Budget: {percentage}% used {alert_marker}
Tasks: {in_progress} active | {blocked} blocked | {orphaned} orphaned
Pipeline:
PRs: {open_prs} open | oldest: {oldest_pr_age_hours}h
Anomalies: {count} detected
{list each anomaly}
Actions: {count} triggered
{list each action}
If there are no issues at all, display:
=== Monitoring Cycle Complete — All Clear ===
Schedule setup requires a running platform instance. When ready:
/schedule create neb-monitor --cron "*/5 * * * *" --description "Continuous cluster and agent health monitoring"
Verify with:
/schedule list
| Superpowers Skill | When to Use |
|---|---|
/superpowers:verification-before-completion | Before reporting — confirm detected anomalies against actual cluster state, don't report stale data |