Name: Dogfood
Author: nicsuzor


Read the relevant framework workflows. Check the /framework skill's workflow router. If the task touches hooks/gates, read 09-session-hook-forensics.md. If it touches skills, read the relevant SKILL.md. If it touches transcripts, read transcript.py and understand the data pipeline.
Verify data sources by sampling. Do not ASSUME what a data file contains. READ one. If you're writing instructions that say "audit files contain X," open an audit file and confirm X is there. If it isn't, find where X actually lives.
Map the full data landscape. For any framework component, there are typically multiple data sources at different levels of detail:
- Pre-rendered markdown transcripts (~/.aops/sessions/transcripts/) — abridged and full forms, ~10K+ files. This is the most accessible source for any session-level analysis.
- Raw session logs (JSONL/JSON in ~/.aops/sessions/client-logs/)
- Hook event logs (JSONL in ~/.aops/sessions/hooks/) — contains SubagentStart/SubagentStop events with verdicts
- Subagent transcripts (referenced by agent_transcript_path in SubagentStop events)
- Session metadata (JSON in ~/.aops/sessions/polecats/)
- Audit files (input documents sent to review agents — INPUT, not output)
The audit file for a component is its INPUT, not its OUTPUT. The output lives in the subagent transcript and the hook event log. The full session transcript (markdown or JSONL) contains the complete conversation including what happened before and after any component fired.
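
Sampling a data source before describing it can be as simple as reading a few records and listing the fields that actually appear. The sketch below assumes a JSONL file with one JSON object per line; the helper names `sample_keys` and `has_field` are hypothetical, and the real field names in your logs may differ.

```python
import json
from collections import Counter
from itertools import islice
from pathlib import Path


def sample_keys(path, limit=5):
    """Read up to `limit` lines of a JSONL file and count top-level keys seen."""
    seen = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in islice(handle, limit):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if isinstance(record, dict):
                seen.update(record.keys())
    return seen


def has_field(path, field, limit=5):
    """True if `field` appears in at least one of the sampled records."""
    return sample_keys(path, limit)[field] > 0
```

If `has_field(path, "verdict")` is false for a sampled audit file, that suggests looking in the hook event log or subagent transcript before writing instructions that depend on the field.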

When a command like `ls *.md` returns nothing on a large directory, try `ls <dir>/ | head` instead. Glob expansion on 10K+ files can fail without any visible output.
Understand the difference between "not in this file" and "doesn't exist." If a data point isn't in the file you're looking at, ask where else it might be before concluding it's unavailable.
Trace the full data flow for multi-component systems. If evaluating a system with multiple components (e.g., gate + subagent + main agent), map how data moves between them. A conclusion about "does X work?" requires knowing which channel delivers the result and whether the recipient sees it. Don't assume the obvious channel is the only one.
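
The large-directory advice above can also be followed without any shell globbing. This is a minimal sketch using `os.scandir`, which streams directory entries instead of expanding them all at once; the function name `peek_directory` and its parameters are hypothetical.

```python
import os
from itertools import islice


def peek_directory(directory, limit=10, suffix=None):
    """Return up to `limit` entry names, streaming rather than globbing."""
    with os.scandir(directory) as entries:
        names = (
            entry.name
            for entry in entries
            if suffix is None or entry.name.endswith(suffix)
        )
        # Materialize inside the `with` block, while the iterator is still open.
        return list(islice(names, limit))
```

An empty result from this function means the directory really has no matching entries, which is a stronger signal than an empty result from a glob.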

Scope the first iteration small. Start with N=2 representative tasks. Verify the pipeline accepts tasks, workers spawn, and output is observable before scaling. Only increase batch size when the first N=2 run completes cleanly. Finding 2 failures from 2 tasks is as informative as 2 failures from 10 — at a fraction of the cost.
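
One way to make the scaling rule mechanical is sketched below. Only the starting size of 2 and the "scale only after a clean run" rule come from the text; the doubling factor, the cap of 10, and the function name `next_batch_size` are assumptions for illustration.

```python
def next_batch_size(previous_size, failures, start=2, factor=2, cap=10):
    """Choose the size of the next run.

    previous_size: size of the last run, or None before the first iteration.
    failures: number of failed tasks in the last run.
    """
    if previous_size is None:
        return start  # first iteration: always start small
    if failures > 0:
        return previous_size  # fix the pipeline before scaling
    return min(previous_size * factor, cap)  # clean run: scale up, bounded
```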

Launch a subagent with ONLY the instruction file as context. Do not brief it verbally — if the instructions need verbal supplementation, they're incomplete.

Agent(
  prompt="You are a contextless reviewer. Your ONLY instructions are in: <path>. Read that file, then follow its instructions exactly. Note any friction.",
  run_in_background=true
)

Do not interfere. Let the agent succeed or fail on the instructions alone. The friction IS the data.

Exception — redirect, don't kill. If new guidance arrives while the agent is running and making progress, use SendMessage to redirect scope rather than TaskStop + restart. Only use TaskStop if the agent is actively causing harm (wrong files, dangerous scope expansion). A scope-narrowing correction (e.g. "do 2 tasks not 10") does not justify a kill-and-restart — send the update and let the agent self-correct. Killing a productive agent discards accumulated work and pays full cold-start cost on the replacement.
Record what happens. When the agent completes, read the full output — not just a summary. Note:
- Did it find the data? If not, what was missing from the instructions?
- Did it understand the task? Where did it misinterpret?
- Where did it get stuck? Was that because the instructions were wrong, ambiguous, or assumed knowledge the agent didn't have?
- Was the output well-structured?
- Was the output useful (not just complete, but actually insightful)?
- Did it discover something you didn't expect? The subagent may find real issues with the system being evaluated, not just issues with the instructions. Both matter.

Read the subagent's output. Evaluate against the original objectives — not "did it follow the steps" but "did it produce something valuable?"

Categorize friction:

| Friction type | Fix |
| --- | --- |
| Missing path or data source | Add to instructions |
| Ambiguous assessment criteria | Sharpen the question |
| Agent couldn't find something | Add discovery commands |
| Agent misunderstood the goal | Rewrite the objective section |
| Agent produced shallow analysis | Add examples of good vs bad analysis |
| Agent went off track | Add guardrails or constraints |
| Instructions too long/complex | Simplify, split into phases |
| Data format wasn't what instructions assumed | Update data source description |
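
If friction is recorded as labels during Phase 3, the table above can be applied as a simple lookup. In this sketch the short label keys and the `plan_fixes` helper are hypothetical; the fix text is copied from the table.

```python
# Hypothetical short labels for the friction types; fixes are from the table.
FRICTION_FIXES = {
    "missing_source": "Add to instructions",
    "ambiguous_criteria": "Sharpen the question",
    "could_not_find": "Add discovery commands",
    "misunderstood_goal": "Rewrite the objective section",
    "shallow_analysis": "Add examples of good vs bad analysis",
    "off_track": "Add guardrails or constraints",
    "too_long": "Simplify, split into phases",
    "wrong_data_format": "Update data source description",
}


def plan_fixes(observed):
    """Map observed friction labels to fixes, keeping order, dropping repeats."""
    fixes = []
    for label in observed:
        fix = FRICTION_FIXES.get(label, f"Uncategorized friction: {label}")
        if fix not in fixes:
            fixes.append(fix)
    return fixes
```

Unknown labels are surfaced rather than dropped, so friction that does not fit the table still reaches the instruction update.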

Update the instructions. Edit them in-place (skill file, or wherever they now live). Be careful not to over-fit to this specific execution — the instructions should work for the category of task, not just this instance.
Optionally re-run. If friction was severe (agent couldn't complete the task), commission a second contextless execution with the updated instructions. If friction was minor (agent completed but output could be better), one iteration may suffice.

Commission the review. Use the critic agent or James (orchestrator):

Agent(
  subagent_type="aops-core:critic",
  prompt="Review the following report against these objectives: <objectives>. The report is at: <path>. Assess depth, accuracy, specificity, and actionability. Be brutal — adequate is not good enough."
)

Evaluate the review. The reviewer's assessment tells you about BOTH the instructions and the execution:
- If the execution was poor but instructions were clear → the task may be too hard for this agent tier
- If the execution was shallow because instructions didn't specify depth → fix the instructions
- If the execution was good but missed important angles → add those angles to instructions
- If the reviewer identifies structural framing errors (e.g., burying the primary finding as one recommendation among many) → add explicit structural requirements to the instructions
- If the reviewer finds the subagent violated framework principles in its own analysis (e.g., keyword-matching recommendations that violate P#49) → add guardrails against those specific anti-patterns

| Failure mode | What happened | Prevention |
| --- | --- | --- |
| Concluded task was impossible when data existed elsewhere | Supervisor accepted "verdicts not in audit files" as "verdicts don't exist", even though session transcripts and hook JSONL contained everything needed | Phase 0: Read framework workflows, verify data by sampling, map the FULL data landscape before writing instructions |
| Didn't read the relevant framework workflow | The forensics workflow documents exactly how to find RBG verdicts in hook logs. Supervisor never read it. | Phase 0: Check the /framework skill's workflow router. If the task touches hooks/gates, read the forensics workflow FIRST. |
| Wrong assumption about data format | Instructions said "audit files contain RBG verdicts"; they contain the input to RBG, not its output | Verify data sources by reading samples BEFORE writing instructions |
| Accepted subagent's "impossible" finding without verification | Subagent reported data gap → supervisor iterated on acknowledging the gap rather than questioning the premise | When a subagent reports a task is impossible or data is missing, VERIFY before accepting. P#26 applies to your subagent's claims too. |
| Silent glob failure hid 10K+ files | `ls ~/.aops/sessions/transcripts/*.md` returned empty because shell glob expansion failed on 10K+ files; supervisor concluded no transcripts existed | Use `ls <dir>/ \| head` instead of `ls <dir>/*.ext` for large directories. When a command returns nothing, question WHY before concluding the data isn't there. |
| Fragile shell commands | `sed` commands broke when file naming convention changed | Describe what to extract, not specific commands. Let the agent figure out the parsing. |
| Quality reviewer finds structural framing errors | Report buries the primary finding as one recommendation among many | Add explicit structural requirements (e.g., "executive summary must state data limitations first") |
| Quality reviewer finds framework principle violations in the subagent's recommendations | Keyword-matching recommendations violated P#49 | Add guardrails against specific anti-patterns in the ground rules |
| Writing draft artifacts before review completes | Skill files written before quality feedback received | Draft early (that's fine), but mark them as drafts and plan to revise after review |
| Unreproducible quantitative claims | Subagent counted "68 events" for a session that had 15; it used a keyword grep that included unrelated event types | Instructions must require documented counting methodology. Quantitative claims need the exact command that produced them, so reviewers can spot-check. |
| Keyword classification of free-text verdicts | Subagent classified RBG verdicts by keyword presence in analytical reasoning, inflating false positive count | If classification requires judgment (OK vs WARN in free text), document rules and acknowledge margin of error. Don't present rough counts as precise. |
| Mischaracterized enforcement architecture | Iteration 2 concluded "zero enforcement" because it examined only the gate system message (always "Compliance verified"), missing that the agent receives verdicts directly via the Agent tool result | Verify the actual delivery mechanism before making claims about enforcement channels. Trace the full data flow: who sends what to whom, and through which channel. |
| Useless early samples wasted deep-review time | Instruction to sample from "earliest week" led to March sessions with empty narratives and free-text verdicts (infrastructure failures, not compliance test cases) | Qualify sampling guidance by data quality: note which periods have usable structured data vs. which are infrastructure-failure era |
| Subagent found real findings but misframed aggregate conclusion | Iteration 2's session-level analysis was correct (RBG accuracy, false positives, etc.) but the aggregate conclusion ("zero enforcement") was wrong because it examined the wrong enforcement channel | When aggregating, verify that the aggregate conclusion follows from the individual findings. A correct finding + wrong causal chain = wrong conclusion. |
| Full batch dispatched on first iteration | Agent dispatched N=10 tasks on iteration 1; user had to intervene. Two failures from 2 tasks is as informative as 2 from 10 at a fraction of the cost. | First iteration scope: cap at N=2. Verify pipeline health before scaling. |
| Killed productive agent on scope correction | Scope narrowing arrived mid-run; agent used `TaskStop` + restart instead of `SendMessage`. Discarded real findings, paid full cold-start cost on replacement. | Use `SendMessage` to redirect a running agent. Only `TaskStop` for active harm (wrong files, dangerous expansion). |
| Created draft spec file instead of editing skill | Agent wrote `specs/drafts/dogfood-instructions.md` as working artifact rather than editing the skill directly. File persisted after session, cluttering the repo. | Work directly in the skill file. Use `specs/drafts/` only for scaffolding; delete it when the session ends. |
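
The "unreproducible quantitative claims" row comes down to counting by an exact field value instead of by keyword. The sketch below shows one way to do that for a JSONL hook log; the field name `event` and the function name `count_events` are assumptions, so confirm the real field name by sampling a log first.

```python
import json
from collections import Counter


def count_events(path, type_field="event"):
    """Count JSONL records by the exact value of `type_field`."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                continue  # ignore non-object lines
            counts[record.get(type_field, "<missing>")] += 1
    return counts
```

A record that merely mentions `SubagentStop` in some other field is not counted as a `SubagentStop` event here, which is the overcounting a keyword grep would produce. Reporting the call and its arguments alongside the number lets a reviewer reproduce the count.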

Dogfood

/dogfood — Delegated Instruction Testing

Purpose

When to Use

Workflow

Phase 0: Know What You're Eating

Phase 1: Research and Draft Instructions

Phase 2: Commission Contextless Execution

Phase 3: Friction Analysis and Iteration

Phase 4: Independent Quality Review

Phase 5: Codify

Key Principles

Scope Note

Common Failure Modes
