Name: Skill Test
Author: databricks-solutions

/skill-test <skill-name> [subcommand]

| Subcommand | Description |
|------------|-------------|
| `run` | Run evaluation against ground truth (default) |
| `regression` | Compare current results against baseline |
| `init` | Initialize test scaffolding for a new skill |
| `add` | Interactive: prompt -> invoke skill -> test -> save |
| `add --trace` | Add test case with trace evaluation |
| `review` | Review pending candidates interactively |
| `review --batch` | Batch approve all pending candidates |
| `baseline` | Save current results as regression baseline |
| `mlflow` | Run full MLflow evaluation with LLM judges |
| `trace-eval` | Evaluate traces against skill expectations |
| `list-traces` | List available traces (MLflow or local) |
| `scorers` | List configured scorers for a skill |
| `scorers update` | Add/remove scorers or update default guidelines |
| `sync` | Sync YAML to Unity Catalog (Phase 2) |

/skill-test databricks-spark-declarative-pipelines run
/skill-test databricks-spark-declarative-pipelines add --trace
/skill-test databricks-spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
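
The command shape above (`/skill-test <skill-name> [subcommand]`, with `run` as the default) can be sketched as a small parser. This is only an illustration, not the framework's actual argument handling: `SkillTestCommand` and `parse_skill_test` are hypothetical names, and treating flags such as `--trace` or `--batch` as options of the preceding subcommand is an assumption.

```python
from typing import List, NamedTuple


class SkillTestCommand(NamedTuple):
    """Hypothetical container for a parsed /skill-test invocation."""
    skill_name: str
    subcommand: str
    options: List[str]


# Two-word subcommands listed in the subcommand table.
TWO_WORD_SUBCOMMANDS = {("scorers", "update")}


def parse_skill_test(line: str) -> SkillTestCommand:
    tokens = line.split()
    if not tokens or tokens[0] != "/skill-test":
        raise ValueError("expected a line starting with /skill-test")
    if len(tokens) < 2:
        raise ValueError("missing <skill-name>")

    skill_name, rest = tokens[1], tokens[2:]

    # No subcommand (or only flags): fall back to the documented default, `run`.
    if not rest or rest[0].startswith("-"):
        return SkillTestCommand(skill_name, "run", rest)

    # `scorers update` is a single subcommand spelled as two words.
    if len(rest) >= 2 and (rest[0], rest[1]) in TWO_WORD_SUBCOMMANDS:
        return SkillTestCommand(skill_name, f"{rest[0]} {rest[1]}", rest[2:])

    return SkillTestCommand(skill_name, rest[0], rest[1:])
```

Under these assumptions, `/skill-test my-new-skill init` parses to the skill `my-new-skill` with subcommand `init`, and a bare `/skill-test my-new-skill` falls back to `run`.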

uv pip install -e .test/

uv run python .test/scripts/{script} {skill_name} [options]

| Subcommand | Script |
|------------|--------|
| `run` | `run_eval.py` |
| `regression` | `regression.py` |
| `init` | `init_skill.py` |
| `add` | `add.py` |
| `review` | `review.py` |
| `baseline` | `baseline.py` |
| `mlflow` | `mlflow_eval.py` |
| `scorers` | `scorers.py` |
| `scorers update` | `scorers_update.py` |
| `sync` | `sync.py` |
| `trace-eval` | `trace_eval.py` |
| `list-traces` | `list_traces.py` |
| `_routing mlflow` | `routing_eval.py` |
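
Because several script names differ from their subcommand (`run` maps to `run_eval.py`, `init` to `init_skill.py`), a lookup table is a convenient way to build the `uv run python` invocation. The sketch below copies the mapping from the table; `SCRIPTS` and `build_command` are hypothetical names, and reading the `_routing mlflow` row as "skill `_routing` with subcommand `mlflow`" is an assumption.

```python
from typing import List, Sequence

# Subcommand -> wrapper script, copied from the table above.
SCRIPTS = {
    "run": "run_eval.py",
    "regression": "regression.py",
    "init": "init_skill.py",
    "add": "add.py",
    "review": "review.py",
    "baseline": "baseline.py",
    "mlflow": "mlflow_eval.py",
    "scorers": "scorers.py",
    "scorers update": "scorers_update.py",
    "sync": "sync.py",
    "trace-eval": "trace_eval.py",
    "list-traces": "list_traces.py",
}


def build_command(skill_name: str, subcommand: str = "run",
                  options: Sequence[str] = ()) -> List[str]:
    """Return the argv list for one wrapper script (hypothetical helper)."""
    # Assumption: the `_routing mlflow` row means skill `_routing` + `mlflow`.
    if skill_name == "_routing" and subcommand == "mlflow":
        script = "routing_eval.py"
    elif subcommand in SCRIPTS:
        script = SCRIPTS[subcommand]
    else:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return ["uv", "run", "python", f".test/scripts/{script}", skill_name, *options]
```

The returned list can be passed to `subprocess.run(...)` from the repository root, which is where the relative `.test/scripts/` path resolves.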

| Subcommand | Action |
|------------|--------|
| `run` | Execute `run(skill_name, ctx)` and display results |
| `regression` | Execute `regression(skill_name, ctx)` and display comparison |
| `init` | Execute `init(skill_name, ctx)` to create scaffolding |
| `add` | Prompt for test input, invoke skill, run `interactive()` |
| `review` | Execute `review(skill_name, ctx)` to review pending candidates |
| `baseline` | Execute `baseline(skill_name, ctx)` to save as regression baseline |
| `mlflow` | Execute `mlflow_eval(skill_name, ctx)` with MLflow logging |
| `scorers` | Execute `scorers(skill_name, ctx)` to list configured scorers |
| `scorers update` | Execute `scorers_update(skill_name, ctx, ...)` to modify scorers |
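
The routing in this table amounts to a dictionary dispatch from subcommand to handler. The sketch below shows that pattern with stand-in handlers only: the real `run`, `regression`, `mlflow_eval`, and related functions live in the `skill_test` package and are not reproduced here, and `ROUTES`, `HANDLERS`, `_stub`, and `dispatch` are hypothetical names. The interactive prompt step that precedes `interactive()` for `add` is omitted.

```python
from typing import Any, Callable, Dict


def _stub(name: str) -> Callable[..., Dict[str, Any]]:
    """Build a stand-in handler that only records how it was called."""
    def handler(skill_name: str, ctx: Any, **kwargs: Any) -> Dict[str, Any]:
        return {"handler": name, "skill_name": skill_name, "extra": kwargs}
    return handler


# Subcommand -> handler name, following the table above.
ROUTES = {
    "run": "run",
    "regression": "regression",
    "init": "init",
    "add": "interactive",
    "review": "review",
    "baseline": "baseline",
    "mlflow": "mlflow_eval",
    "scorers": "scorers",
    "scorers update": "scorers_update",
}

# Stand-ins for the real functions (assumption: they accept skill_name and ctx).
HANDLERS = {name: _stub(name) for name in set(ROUTES.values())}


def dispatch(subcommand: str, skill_name: str, ctx: Any, **kwargs: Any) -> Dict[str, Any]:
    """Look up the handler for a subcommand and call it."""
    if subcommand not in ROUTES:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return HANDLERS[ROUTES[subcommand]](skill_name, ctx, **kwargs)
```

Keeping the routes in one dictionary means a new subcommand needs a single new entry rather than another branch in an `if`/`elif` chain.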

| File Type | Path |
|-----------|------|
| Ground truth | `{repo_root}/.test/skills/{skill-name}/ground_truth.yaml` |
| Candidates | `{repo_root}/.test/skills/{skill-name}/candidates.yaml` |
| Manifest | `{repo_root}/.test/skills/{skill-name}/manifest.yaml` |
| Routing tests | `{repo_root}/.test/skills/_routing/ground_truth.yaml` |
| Baselines | `{repo_root}/.test/baselines/{skill-name}/baseline.yaml` |
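
The path templates above can be resolved with `pathlib`, which keeps every file anchored at the repository root rather than at the skill's own directory. This is a minimal sketch: `skill_test_paths` is a hypothetical helper, not part of the framework, and it only builds paths without checking that the files exist.

```python
from pathlib import Path
from typing import Dict, Union


def skill_test_paths(repo_root: Union[str, Path], skill_name: str) -> Dict[str, Path]:
    """Build the per-skill test file paths from the table above."""
    test_dir = Path(repo_root) / ".test"
    skill_dir = test_dir / "skills" / skill_name
    return {
        "ground_truth": skill_dir / "ground_truth.yaml",
        "candidates": skill_dir / "candidates.yaml",
        "manifest": skill_dir / "manifest.yaml",
        # Routing tests are shared and live under the fixed `_routing` directory.
        "routing_tests": test_dir / "skills" / "_routing" / "ground_truth.yaml",
        "baseline": test_dir / "baselines" / skill_name / "baseline.yaml",
    }
```

For example, `skill_test_paths(repo_root, "databricks-spark-declarative-pipelines")["ground_truth"]` yields a path ending in `.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml`.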

/Users/.../ai-dev-kit/.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml  # CORRECT

/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...  # WRONG

.test/                          # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml              # Package config (pip install -e ".test/")
├── README.md                   # Contributor documentation
├── SKILL.md                    # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh       # Sync script
├── scripts/                    # Wrapper scripts
│   ├── _common.py              # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py           # Trace evaluation
│   ├── list_traces.py          # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/             # Python package
│       ├── cli/                # CLI commands module
│       ├── fixtures/           # Test fixture setup
│       ├── scorers/            # Evaluation scorers
│       ├── grp/                # Generate-Review-Promote pipeline
│       └── runners/            # Evaluation runners
├── skills/                     # Per-skill test definitions
│   ├── _routing/               # Routing test cases
│   └── {skill-name}/           # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                      # Unit tests
├── references/                 # Documentation references
└── baselines/                  # Regression baselines

Databricks Skills Testing Framework
