Use when evaluating whether a skill measurably improves outputs by running controlled comparisons — with-skill vs without-skill, old version vs new version, or A/B tests across skill variants — where quality is graded on produced artifacts like code, docs, migrations, or commit messages
Test skills by comparing output quality with and without the skill. Run the same prompt both ways, grade the difference, and iterate until the skill consistently produces better results.
Core principle: If you can't measure the output quality difference, you don't know if the skill helps. Subjective impressions are not evidence. Run structured comparisons, grade with assertions, and benchmark the delta.
Use when testing:
Do NOT use for:
For quick tests (see skillproof:skill-testing-discipline and the skillproof:skill-tdd testing tiers), reduce scope:
Quick mode is for cheap validation of a hypothesis, not rigorous comparison. Use the full process below when you need quantitative evidence.
Create 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user for review.
Save test cases to evals/evals.json:
```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```
Don't write assertions yet — just prompts. You'll draft assertions while runs are in progress.
See schemas.md for the full schema.
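As a sketch, the file can be written and sanity-checked programmatically. Field names follow the example above; `schemas.md` remains authoritative, and `save_evals` is an illustrative helper, not part of the plugin's scripts:

```python
import json

# Minimal evals.json matching the example above (schemas.md is authoritative).
evals = {
    "skill_name": "example-skill",
    "evals": [
        {
            "id": 1,
            "prompt": "User's task prompt",
            "expected_output": "Description of expected result",
            "files": [],
        }
    ],
}

def save_evals(path, data):
    # Sanity-check before writing: every eval needs an id and a prompt.
    for e in data["evals"]:
        assert {"id", "prompt"} <= e.keys(), f"incomplete eval: {e}"
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
```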
For each test case, spawn TWO subagents simultaneously — one with the skill, one without. Launch everything at once so it all finishes around the same time.
With-skill run:
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any>
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
Baseline run: same task, no skill. Save outputs to `without_skill/outputs/`.

Old-vs-new comparison: snapshot the old skill first (`cp -r`), point the baseline run at the snapshot, and save to `old_skill/outputs/`.

Write an `eval_metadata.json` for each test case:
```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```
Give each eval a descriptive name based on what it tests, not just "eval-0".
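One way to derive such names (a sketch; the exact naming scheme is up to you, and `eval_dirname` is a hypothetical helper):

```python
import re

def eval_dirname(name, eval_id=0):
    # Turn "Chart axis labels" into "eval-chart-axis-labels";
    # fall back to the numeric id if the name slugs to nothing.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"eval-{slug or eval_id}"
```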
While runs are in progress, draft quantitative assertions: specific, objective checks that a grader can verify against the produced artifacts.
Subjective skills (writing style, design quality) are better evaluated qualitatively. Don't force assertions onto things that need human judgment.
Update eval_metadata.json and evals/evals.json with assertions.
When each subagent task completes, the notification contains total_tokens and duration_ms. Save immediately to timing.json:
```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```
This is the only opportunity to capture this data — it comes through the notification and isn't persisted elsewhere.
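A minimal sketch of that capture step, using the field names from the example above (`save_timing` is illustrative, not a bundled script):

```python
import json

def save_timing(path, total_tokens, duration_ms):
    # Persist immediately: the notification is the only source of these numbers.
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    with open(path, "w") as f:
        json.dump(timing, f, indent=2)
    return timing
```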
Spawn a grader subagent using grader.md:
Save results to `grading.json` in each run directory. The `grading.json` expectations array must use the fields `text`, `passed`, and `evidence`.
For assertions checkable programmatically, write and run a script — scripts are faster and more reliable than eyeballing.
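For example, a script-checkable assertion might look like this. It is a sketch: `run_check` and `has_py_file` are hypothetical, with the result dict mirroring the grading.json expectation fields:

```python
import pathlib

def run_check(outputs_dir, text, predicate):
    # predicate(outputs_dir) -> (passed, evidence); the result uses the
    # grading.json expectation fields: text, passed, evidence.
    passed, evidence = predicate(pathlib.Path(outputs_dir))
    return {"text": text, "passed": passed, "evidence": evidence}

def has_py_file(outputs):
    # Hypothetical assertion: the run produced at least one Python file.
    files = sorted(outputs.glob("**/*.py"))
    return bool(files), f"found {len(files)} .py file(s)"
```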
Aggregate results across all runs:
```
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```
Produces benchmark.json and benchmark.md with pass_rate, time, and tokens for each configuration, with mean ± stddev and delta.
See schemas.md for the benchmark.json format.
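The mean ± stddev and delta computation amounts to the following (a sketch of the arithmetic only, not the actual `aggregate_benchmark` implementation):

```python
import statistics

def summarize(with_skill, without_skill):
    # Pass rates (or times/tokens) per run; delta is with-minus-without mean.
    def stats(xs):
        return statistics.mean(xs), (statistics.stdev(xs) if len(xs) > 1 else 0.0)
    mean_w, sd_w = stats(with_skill)
    mean_wo, sd_wo = stats(without_skill)
    return {
        "with_skill": {"mean": mean_w, "stddev": sd_w},
        "without_skill": {"mean": mean_wo, "stddev": sd_wo},
        "delta": mean_w - mean_wo,
    }
```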
Launch the browser-based reviewer:
```
nohup python ../writing-skills/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
```
For iteration 2+, add `--previous-workspace <workspace>/iteration-<N-1>`.
Cowork / headless: Use --static <output_path> for standalone HTML.
The user sees two tabs:
When done, user clicks "Submit All Reviews" → feedback.json.
```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "chart missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ]
}
```
Empty feedback = user thought it was fine. Focus improvements on test cases with specific complaints.
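Filtering for those runs is straightforward (a sketch against the feedback.json format above; `actionable_reviews` is an illustrative helper):

```python
import json

def actionable_reviews(feedback_path):
    # Empty feedback means the user thought the run was fine;
    # only runs with specific complaints need follow-up.
    with open(feedback_path) as f:
        reviews = json.load(f)["reviews"]
    return [r for r in reviews if r["feedback"].strip()]
```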
The skill will be used many times across many prompts. You and the user iterate on only a few examples because it's fast, but if the skill works only for those examples, it's useless. Rather than making fiddly, overfitted tweaks, try different metaphors and recommend different patterns. It's cheap to try.
Remove things that aren't pulling their weight. Read the transcripts — if the skill makes the agent waste time on unproductive steps, remove those parts and see what happens.
Use theory of mind. Today's LLMs are smart. When given good reasoning, they go beyond rote instructions. If you find yourself writing ALWAYS or NEVER in all caps, reframe: explain the reasoning so the model understands why it matters.
Read transcripts from test runs. If all test cases result in the agent writing a similar helper script, that script should be bundled in scripts/. Write it once, save every future invocation from reinventing the wheel.
The output may look fine while the process was wasteful or fragile.
After improving the skill:
- Re-run all test cases in a fresh `iteration-<N+1>/` directory, including baselines
- Launch the viewer with `--previous-workspace` pointing at the previous iteration

Keep going until the skill consistently produces better results.
For rigorous comparison between two skill versions:
- `comparator.md` — give two outputs labeled A/B without revealing which skill produced which
- `analyzer.md` — analyze WHY the winner won

Optional: this requires subagents, and most users don't need it. The human review loop is usually sufficient.
```
skill-name-workspace/
└── iteration-1/
    ├── eval-descriptive-name/
    │   ├── with_skill/
    │   │   ├── outputs/
    │   │   └── timing.json
    │   ├── without_skill/
    │   │   ├── outputs/
    │   │   └── timing.json
    │   ├── eval_metadata.json
    │   └── grading.json
    ├── benchmark.json
    ├── benchmark.md
    └── feedback.json
```
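A sketch of what creating that layout involves (the actual `init_workspace` script may behave differently; `init_iteration` here is illustrative):

```python
import pathlib

def init_iteration(workspace, n, eval_names):
    # One directory per eval, each with with_skill/ and without_skill/ runs.
    root = pathlib.Path(workspace) / f"iteration-{n}"
    for name in eval_names:
        for config in ("with_skill", "without_skill"):
            (root / f"eval-{name}" / config / "outputs").mkdir(
                parents=True, exist_ok=True
            )
    return root
```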
Claude Code: Full functionality — subagents, browser viewer, all scripts.
Claude.ai: No subagents. Run test cases yourself one at a time. Skip baselines. Present results inline and ask for feedback directly. Skip benchmarking.
Cowork: Subagents work. Use --static for viewer (no browser). Feedback downloads as file.
Run from the skill-testing-output directory within the plugin.
Run scripts from this skill's base directory.
| Script | Purpose | When to use |
|---|---|---|
| `python -m scripts.init_workspace <name> --evals <evals.json>` | Create workspace directory structure | Before running comparisons |
| `python -m scripts.run_comparison --skill-path <path> --evals <evals.json> --workspace <path>` | Run A/B comparisons via `claude -p` | Executing eval runs |
| `python -m scripts.run_comparison --baseline-skill <old-path>` | Compare two skill versions | Old vs new skill testing |
| `python -m scripts.diff_outputs <workspace>/iteration-N/eval-name` | Compare output files between configs | Reviewing output differences |
For grading and benchmarking, use the shared writing-skills scripts:
```
# From the writing-skills sibling directory (../writing-skills):
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```
Prerequisites: Python 3.10+, claude CLI. Activate .venv if available.