Audit active or recent Slurm queue state to find likely job-shape misconfigurations that strand shared cluster capacity (CPU, memory, GPU) and block scheduling for others. Use when users ask why resources appear idle, who may be blocking allocation, which jobs/users look misconfigured, or when preparing evidence for neutral outreach. Keep the workflow strictly read-only: inspect and report only, never cancel, edit, reprioritize, or otherwise mutate jobs or cluster state.
Encourage use of multiple agents/subagents when it is likely to improve speed, quality, or confidence.
Split work into clear packets with owners, inputs, acceptance checks, and a synthesis step when parallelizing.
Use single-agent execution when scope is small or coordination overhead outweighs gains.
Overview
Inspect Slurm scheduler state and node packing to identify probable resource-stranding job submissions while avoiding false accusations.
Treat output as evidence-backed candidate attribution, not certainty: label findings by impact and confidence, separate policy effects from user-level misfit, and produce neutral follow-up language.
Cross-skill principles integrated
From cluster-monitor: default to quick-status first for operational questions, return concrete units and timestamps, make scope/identity explicit, and treat legitimate queue waiting (Priority, Resources, dependencies) as normal scheduler behavior until fit/fragmentation evidence shows avoidable stranding.
From investigate: build explicit hypotheses, try to falsify attribution, and report coverage gaps and uncertainty instead of overstating certainty.
From summarize: lead with high-impact findings and clearly separate facts, inferences, and unknowns.
Proactive autonomy and knowledge compounding
Be proactive: immediately take the next highest-value in-scope action when it is clear.
Default to autonomous execution: do not pause for confirmation between normal in-scope steps.
Request user input only when absolutely necessary: ambiguous requirements, material risk tradeoffs, missing required data/access, or destructive/irreversible actions outside policy.
If blocked by command/tool/env failures, attempt high-confidence fallbacks autonomously before escalating (for example rg -> find/grep, python -> python3, alternate repo-native scripts).
When the workflow uses plan/, ensure required plan directories exist before reading/writing them (create when edits are allowed; otherwise use an in-memory fallback and call it out).
Treat transient external failures (network/SSH/remote APIs/timeouts) as retryable by default: run bounded retries with backoff and capture failure evidence before concluding blocked.
On repeated invocations for the same objective, resume from prior findings/artifacts and prioritize net-new progress over rerunning identical work unless verification requires reruns.
Drive work to complete outcomes with verification, not partial handoffs.
Treat iterative execution as the default for non-trivial work; run adaptive loop passes. Example loops (adapt as needed, not rigid): issue-resolution investigate -> plan -> fix -> verify -> battletest -> organise-docs -> git-commit -> re-review; cleanup scan -> prioritize -> clean -> verify -> re-scan; docs audit -> update -> verify -> re-audit.
Keep looping until actual completion criteria are met: no actionable in-scope items remain, verification is green, and confidence is high.
Run organise-docs frequently during execution to capture durable decisions and learnings, not only at the end.
Create small checkpoint commits frequently with git-commit when changes are commit-eligible, checks are green, and repo policy permits commits.
Never squash commits; always use merge commits when integrating branches.
Prefer simplification over added complexity: aggressively remove bloat, redundancy, and over-engineering while preserving correctness.
When you touch code, leave the touched area in a better state than you found it: clearer, simpler, tidier, and at least as performant unless the task requires an explicit trade-off.
Use simple, plain English in user messages, docs, notes, reports, code comments, and other explanatory writing. Avoid jargon, fancy wording, and complex phrasing. When a technical term is needed for correctness, explain it in simple words the first time. Default to short user-facing responses. Think about what the user most wants to know, and lead with that. Do not dump every detail by default. Always include important changes, blockers, verification gaps, and any important assumptions, nuances, principles, or decisions that shaped the work. Add more detail only when the user asks for it or when uncertainty or risk makes it necessary.
Compound knowledge continuously: keep docs/ accurate and up to date, and promote durable learnings and decisions from work into docs.
Long-task checkpoint cadence
For any non-trivial task (including long efforts), run recurring checkpoint cycles instead of waiting for a single end-of-task wrap-up.
At each meaningful milestone with commit-eligible changes, and at least once per major phase, invoke git-commit to create a small logical checkpoint commit once relevant checks are green and repo policy permits commits.
At the same cadence, invoke organise-docs whenever durable learnings/decisions appear, and prune stale plan/ scratch artifacts.
If either checkpoint is blocked (for example failing checks or low-confidence documentation), resolve or record the blocker immediately and retry before expanding scope.
Terminal state contract (must follow)
The skill is complete only when all of the following are true:
Objective completion: the user-requested outcome is achieved, or explicitly marked blocked with concrete blocker evidence.
Workflow completion: every required workflow step is resolved as done, blocked, or not-applicable, with brief evidence or rationale.
Step-level terminal completion: each numbered subtask must have explicit completion evidence (artifact, command output, or written rationale) before advancing.
Verification completion: required checks/validations for this skill are executed, or any unavailable checks are explicitly called out with impact.
Findings completion (where applicable): report only evidence-backed findings; if no high-confidence critical findings are present, explicitly state that.
Loop completion: no actionable in-scope next step remains under the current objective.
Stop only after this terminal contract is satisfied; otherwise continue iterating.
Terminal state examples (adapt to skill)
done: requested outcome is delivered and required checks are completed (for example expected artifact/report produced and required validation command(s) passed).
blocked: progress cannot continue after bounded retries because of a concrete dependency or access issue; blocker evidence and exact unblock action are reported.
not-applicable: an optional step is explicitly skipped with reason (for example no remote configured, so push step is marked not-applicable).
Quick-scan mode (must support)
When the user asks for a fast status answer (for example who is blocking compute right now), run quick-scan mode first.
Quick-scan mode requirements:
Keep it read-only and fast; avoid deep history unless required.
Prefer live scheduler evidence (squeue, sinfo, scontrol show node, scontrol show job -d, sprio); a command sketch follows this list.
Return concrete timestamps and units (for example CPU used/total, GPU used/total, mem used/total).
Distinguish likely blocker, possible blocker, and policy-driven explicitly.
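A minimal quick-scan command sketch, assuming direct shell access to the Slurm host (prefix each command with ssh "$cluster_host" when running remotely):
date -Is                                                    # timestamp the answer
sinfo -h -o "%P %C %G"                                      # per partition: CPUs as allocated/idle/other/total, plus configured GRES (GPUs)
squeue -h -t PENDING -o "%r" | sort | uniq -c | sort -rn    # histogram of pending reasons (Priority, Resources, Dependency, ...)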
Trigger phrases
Use quick-scan mode for prompts like:
who is blocking resources right now
why are gpus idle
which users are stranding capacity
check current cluster blockers
Use deep-attribution mode for prompts like:
deeply research why resources are not fully allocated
separate policy effects from misconfiguration
rank top blockers with confidence
draft outreach messages
Behavioral guardrails (must follow)
Proceed without permission for standard in-scope steps (read/scan/summarize/plan/tests/edits/analysis). Ask clarifying questions only when requirements are ambiguous, missing inputs, or a risky decision cannot be inferred. Require explicit approval only for destructive/irreversible actions, executing untrusted code or installers, remote-state changes (push/deploy/publish), or changes outside the repo environment.
Run a preflight before substantial work: confirm expected cwd, verify required tools with command -v, and verify referenced files/directories exist before reading or searching them.
Keep all operations read-only. Never run scancel, scontrol update, scontrol hold, scontrol release, srun, sbatch, salloc, or any command that mutates scheduler or job state.
Never alter any job, including current user jobs.
Never infer intent from a single snapshot. Capture an initial timestamped snapshot and one refresh snapshot before final attribution.
Never treat queue waiting alone as misconfiguration; require fit/fragmentation evidence on CPU+memory+GPU dimensions.
Separate likely submission misconfiguration from scheduler policy effects (QoS weights, partition defaults, fair-share, dependencies, reservations).
Test at least one disconfirming hypothesis for every high-confidence attribution candidate.
Use neutral language (candidate, possible, likely) and include confidence levels.
Prefer quoted paths and explicit path checks in shell commands to reduce avoidable failures.
If an environment variable is required, check whether it is already set before asking for it or stating it is missing.
If there is nothing left to do, say so explicitly and stop.
If cluster access is blocked, report concrete blocker evidence and exact commands needed to unblock.
Scope and identity (must establish first)
Determine and record (a recording sketch follows these items):
cluster_user: analyst identity for command scope.
cluster_host: cluster endpoint used for evidence.
partition_scope: analyzed partitions (for example training, gpu).
analysis_window: live snapshot time and any historical range.
Unless explicitly asked otherwise, analyze all users in scope, not only the analyst's own jobs, because the goal is shared-capacity attribution.
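A sketch for recording these scope values; the host name and partition list are placeholders, not real defaults:
cluster_host="login.cluster.example"                  # placeholder endpoint used for evidence
cluster_user="$(ssh "$cluster_host" whoami)"          # analyst identity for command scope
partition_scope="training,gpu"                        # partitions under analysis (example)
analysis_window="$(date -Is)"                         # live snapshot time; note any historical range separately
echo "scope: user=$cluster_user host=$cluster_host partitions=$partition_scope window=$analysis_window"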
Workflow
0) Mode selection
Select quick-scan mode for fast answers (for example who is blocking resources right now).
Select deep-attribution mode for root-cause separation, confidence scoring, and outreach-ready evidence.
In ambiguous cases, start with quick-scan and escalate only if the answer remains unclear.
If quick-scan mode is selected, run only the quick-scan workflow and its report; do not run deeper steps unless asked.
0.5) Quick-scan workflow (quick-scan mode only)
Run a lightweight version of the step 1 preflight (identity, connectivity, required commands).
Collect minimum data needed to answer the question:
queue snapshot (squeue),
node fit snapshot (sinfo + scontrol show node),
candidate job ReqTRES/AllocTRES (scontrol show job -d),
priority/QoS context (sprio, sacctmgr show qos) only if attribution depends on it.
Keep it read-only and return the quick-scan report; stop unless the user requests deep-attribution.
1) Preflight and evidence snapshot
Confirm pwd, identity, and connectivity to Slurm host.
Prefer project-native cluster wrappers/scripts when available; fall back to raw ssh + squeue/sacct/scontrol.
Validate required commands (ssh, squeue/sacct via remote if local tools absent, rg, python3).
Capture timestamped scheduler snapshots (a capture sketch follows these items):
squeue for running/pending jobs with user, partition, reason, and allocated node list.
sinfo and scontrol show node for node states and free/allocated CPU, memory, GPU.
scontrol show job -d for suspected jobs to get ReqTRES, AllocTRES, and constraints.
sprio and sacctmgr show qos when priority policy may explain waiting.
If needed, collect recent sacct records for trend confirmation.
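A capture sketch under these assumptions: ssh access to "$cluster_host", a writable plan/ directory (use another scratch location if plan/ edits are not allowed), and a placeholder job id for the suspected job:
for t in ssh rg python3; do command -v "$t" >/dev/null || echo "missing tool: $t" >&2; done   # preflight
ts="$(date -Is)"
out="plan/slurm-snapshots/$ts"
mkdir -p "$out"
# Queue state: job, user, partition, state, reason, CPUs, memory, gres, nodes.
ssh "$cluster_host" 'squeue -t PENDING,RUNNING -o "%i %u %P %T %r %C %m %b %N"' > "$out/squeue.txt"
# Node packing: state, allocated/idle CPUs, allocated vs total memory, GPU usage vs capacity.
ssh "$cluster_host" 'sinfo -N -h -O "NodeList:20,StateLong:12,CPUsState:16,AllocMem:10,Memory:10,GresUsed:30,Gres:20"' > "$out/sinfo.txt"
# Detailed request for a suspected job (job id 123456 is a placeholder).
ssh "$cluster_host" 'scontrol show job -d 123456' > "$out/job-123456.txt"
# Priority/QoS context when waiting may be policy-driven.
ssh "$cluster_host" 'sprio -l' > "$out/sprio.txt"
ssh "$cluster_host" 'sacctmgr show qos' > "$out/qos.txt"
# Optional: recent accounting records for trend confirmation (24-hour window is illustrative).
ssh "$cluster_host" "sacct -a -P -S $(date -d '-1 day' +%F) -o JobID,User,Partition,State,ReqTRES,AllocTRES,Elapsed" > "$out/sacct.txt"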
2) Confirm cluster health before blaming users
Rule out infrastructure issues first: down/drain/fail nodes, reservations, or scheduler outage.
If infra is healthy, continue to attribution.
If infra is unhealthy, report that as primary blocker and do not over-attribute to user misconfiguration.
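A health-check sketch (ssh wrapper omitted; add ssh "$cluster_host" for remote execution):
sinfo -R                                                                 # down/drained/failing nodes with recorded reasons
sinfo -N -h -t down,drain,fail -O "NodeList:20,StateLong:14,Reason:40"   # same, node by node
scontrol show reservation                                                # reservations that hold capacity aside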
3) Build fit and fragmentation view
Compute node-level fit constraints: a pending job must fit CPU + memory + GPU simultaneously.
Identify stranded leftovers by node:
idle GPUs with insufficient free CPU or memory to place queued jobs,
idle CPUs with memory exhausted,
idle resources caused by shape mismatch rather than true free capacity.
Highlight the specific resource dimension that prevents placement per blocked job class.
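A fit-view sketch; the node name is a placeholder, and the analyst compares the three dimensions (CPU, memory, GPU) per node against the shapes of pending jobs:
# Per-node packing: allocated/idle/other/total CPUs, allocated vs total memory (MB), GPU used vs configured.
sinfo -N -h -O "NodeList:20,CPUsState:16,AllocMem:10,Memory:10,GresUsed:30,Gres:20"
# Full TRES detail for a node with idle GPUs but little free CPU or memory (gpu-node-01 is a placeholder).
scontrol show node gpu-node-01 | grep -E "State|CfgTRES|AllocTRES"
# Shapes of the pending jobs that must fit the leftover CPU+memory+GPU on some node.
squeue -h -t PENDING -o "%i %u %r %C %m %b"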
4) Detect candidate misconfiguration patterns
Flag only with evidence. Common high-signal patterns:
gpu=0 jobs on GPU nodes requesting near-full node memory and stranding multiple GPUs.
Jobs with oversized CPU requests relative to GPU count that leave orphaned GPUs on busy nodes.
Jobs with very high memory requests that make residual GPUs/CPUs unschedulable.
Oversized single-job shapes that cannot be packed despite substantial fragmented free capacity.
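A detection sketch for the first pattern (CPU-only jobs holding GPU-node memory); the partition name gpu and the job id are assumptions:
# Running jobs on the gpu partition with no GRES request (%b prints "N/A" when no gres is requested)
# but a concrete memory/CPU footprint; large memory here is a candidate for stranding that node's GPUs.
squeue -p gpu -t RUNNING -h -o "%i %u %b %m %C %N" | awk '$3 == "N/A"'
# Confirm the candidate's exact request before flagging it (123456 is a placeholder job id).
scontrol show job -d 123456 | grep -E "TRES|MinMemory|NumCPUs|NumNodes|Gres"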
Example invocations
[$cluster-blame] quick-scan: identify likely users/jobs currently stranding CPU/GPU/memory, with evidence and confidence.
[$cluster-blame] deep-attribution: explain why resources are idle despite pending jobs, separate policy effects from likely submission misconfiguration, and rank top blockers.
[$cluster-blame] deep-attribution: generate a neutral outreach draft for the top 3 high-confidence blocking candidates.
Decision framing
When a decision is required, always provide:
Background context sufficient to make the decision.
Pros and cons for each viable option.
Your recommendation and the reasoning behind it.
If no decision is required, say so explicitly and continue.
Rationale capture
When you establish an important attribution rule, threshold, or classification convention, capture the rationale in a durable place (docs, runbooks, or tests for parser/analysis logic). Do not rely only on plan/ scratch notes.
Repeat invocations
Resume from prior snapshots/findings and prioritize net-new evidence.
Recheck only changed jobs/nodes unless verification requires a full refresh.
If no material changes occurred, report no material change and keep prior ranking with updated timestamp.
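A resume sketch that diffs the two most recent snapshot directories from the step 1 capture, so only changed jobs/nodes get rechecked (directory layout follows that earlier sketch):
latest="$(ls -1d plan/slurm-snapshots/*/ | sort | tail -n 1)"
previous="$(ls -1d plan/slurm-snapshots/*/ | sort | tail -n 2 | head -n 1)"
# diff prints changed lines (jobs/nodes worth rechecking); an empty diff means no material change.
diff <(sort "$previous/squeue.txt") <(sort "$latest/squeue.txt") && echo "queue: no material change"
diff <(sort "$previous/sinfo.txt") <(sort "$latest/sinfo.txt") && echo "nodes: no material change"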