Testing framework for evaluating Databricks skills. Use when building test cases for skills, running skill evaluations, comparing skill versions, or creating ground truth datasets with the Generate-Review-Promote (GRP) pipeline. Triggers include "test skill", "evaluate skill", "skill regression", "ground truth", "GRP pipeline", "skill quality", and "skill metrics".
Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.
The /skill-test command provides an interactive CLI for testing Databricks skills with real execution on Databricks.
/skill-test <skill-name> [subcommand]
| Subcommand | Description |
|---|---|
| run | Run evaluation against ground truth (default) |
| regression | Compare current results against baseline |
| init | Initialize test scaffolding for a new skill |
| add | Interactive: prompt -> invoke skill -> test -> save |
| add --trace | Add test case with trace evaluation |
| review | Review pending candidates interactively |
| review --batch | Batch approve all pending candidates |
| baseline | Save current results as regression baseline |
| mlflow | Run full MLflow evaluation with LLM judges |
| trace-eval | Evaluate traces against skill expectations |
| list-traces | List available traces (MLflow or local) |
| scorers | List configured scorers for a skill |
| scorers update | Add/remove scorers or update default guidelines |
| sync | Sync YAML to Unity Catalog (Phase 2) |
/skill-test databricks-spark-declarative-pipelines run
/skill-test databricks-spark-declarative-pipelines add --trace
/skill-test databricks-spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
See Workflows for detailed examples of each subcommand.
uv pip install -e .test/
Environment variables for Databricks MLflow:
- DATABRICKS_CONFIG_PROFILE - Databricks CLI profile (default: "DEFAULT")
- MLFLOW_TRACKING_URI - Set to "databricks" for Databricks MLflow
- MLFLOW_EXPERIMENT_NAME - Experiment path (e.g., "/Users/{user}/skill-test")

All subcommands have corresponding scripts in .test/scripts/:
uv run python .test/scripts/{script} {skill_name} [options]
| Subcommand | Script |
|---|---|
| run | run_eval.py |
| regression | regression.py |
| init | init_skill.py |
| add | add.py |
| review | review.py |
| baseline | baseline.py |
| mlflow | mlflow_eval.py |
| scorers | scorers.py |
| scorers update | scorers_update.py |
| sync | sync.py |
| trace-eval | trace_eval.py |
| list-traces | list_traces.py |
| _routing mlflow | routing_eval.py |
Use --help on any script for available options.
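The mapping in the table can be turned into a small launcher helper. The following is a minimal sketch, not part of the package: `SCRIPTS` and `script_command` are hypothetical names, and the `_routing mlflow` row is left out because the source does not show which arguments `routing_eval.py` takes.

```python
from typing import List

# Subcommand-to-script mapping copied from the table above.
# The "_routing mlflow" row is intentionally omitted (arguments unknown).
SCRIPTS = {
    "run": "run_eval.py",
    "regression": "regression.py",
    "init": "init_skill.py",
    "add": "add.py",
    "review": "review.py",
    "baseline": "baseline.py",
    "mlflow": "mlflow_eval.py",
    "scorers": "scorers.py",
    "scorers update": "scorers_update.py",
    "sync": "sync.py",
    "trace-eval": "trace_eval.py",
    "list-traces": "list_traces.py",
}


def script_command(subcommand: str, skill_name: str, *options: str) -> List[str]:
    """Build the argv for the wrapper script behind a subcommand (hypothetical helper)."""
    try:
        script = SCRIPTS[subcommand]
    except KeyError:
        raise ValueError(f"unknown subcommand: {subcommand!r}") from None
    # Mirrors the documented pattern: uv run python .test/scripts/<script> <skill> [options]
    return ["uv", "run", "python", f".test/scripts/{script}", skill_name, *options]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(script_command("run", "databricks-spark-declarative-pipelines")))
```

The returned list can be passed to `subprocess.run` from the repository root; returning a list rather than a string avoids shell quoting issues with skill names or options.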
When /skill-test is invoked, parse arguments and execute the appropriate command.
- args[0] = skill_name (required)
- args[1] = subcommand (optional, default: "run")

| Subcommand | Action |
|---|---|
| run | Execute run(skill_name, ctx) and display results |
| regression | Execute regression(skill_name, ctx) and display comparison |
| init | Execute init(skill_name, ctx) to create scaffolding |
| add | Prompt for test input, invoke skill, run interactive() |
| review | Execute review(skill_name, ctx) to review pending candidates |
| baseline | Execute baseline(skill_name, ctx) to save as regression baseline |
| mlflow | Execute mlflow_eval(skill_name, ctx) with MLflow logging |
| scorers | Execute scorers(skill_name, ctx) to list configured scorers |
| scorers update | Execute scorers_update(skill_name, ctx, ...) to modify scorers |
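The argument rules above (required skill name, optional subcommand defaulting to "run") could be parsed roughly as follows. This is a sketch under assumptions: `parse_skill_test_args` and `KNOWN_SUBCOMMANDS` are hypothetical names, and treating `scorers update` as a two-word subcommand is an interpretation of the table rather than documented behavior. Anything after the subcommand is passed through untouched as options.

```python
from typing import List, Tuple

# Subcommands listed in the action table above.
KNOWN_SUBCOMMANDS = (
    "run", "regression", "init", "add", "review",
    "baseline", "mlflow", "scorers", "scorers update",
)


def parse_skill_test_args(args: List[str]) -> Tuple[str, str, List[str]]:
    """Split /skill-test arguments into (skill_name, subcommand, options)."""
    if not args:
        raise ValueError("skill_name is required")
    skill_name, rest = args[0], list(args[1:])
    if not rest:
        # args[1] is optional and defaults to "run".
        return skill_name, "run", []
    subcommand, rest = rest[0], rest[1:]
    # Assumption: "scorers update" is written as two separate words.
    if subcommand == "scorers" and rest[:1] == ["update"]:
        subcommand, rest = "scorers update", rest[1:]
    if subcommand not in KNOWN_SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return skill_name, subcommand, rest


if __name__ == "__main__":
    print(parse_skill_test_args(["my-new-skill", "init"]))
```

The parsed subcommand can then be looked up in a dictionary of command functions, each called with `skill_name` and the context object as the table describes.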
When running /skill-test <skill-name> init, the command creates the test scaffolding:

- manifest.yaml with appropriate scorers and trace_expectations
- ground_truth.yaml and candidates.yaml templates

Follow with /skill-test <skill-name> add using recommended prompts.
Create CLIContext with MCP tools before calling any command. See Python API for details.
Important: All test files are stored at the repository root level, not relative to this skill's directory.
| File Type | Path |
|---|---|
| Ground truth | {repo_root}/.test/skills/{skill-name}/ground_truth.yaml |
| Candidates | {repo_root}/.test/skills/{skill-name}/candidates.yaml |
| Manifest | {repo_root}/.test/skills/{skill-name}/manifest.yaml |
| Routing tests | {repo_root}/.test/skills/_routing/ground_truth.yaml |
| Baselines | {repo_root}/.test/baselines/{skill-name}/baseline.yaml |
For example, to test databricks-spark-declarative-pipelines in this repository:
/Users/.../ai-dev-kit/.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml
Not relative to the skill definition:
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/... # WRONG
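To make the root-relative rule concrete, the path table can be expressed as a small resolver. This is an illustrative sketch only: `skill_test_paths` is a hypothetical helper, and `/repo` in the usage line is a placeholder for the real repository root.

```python
from pathlib import PurePosixPath
from typing import Dict


def skill_test_paths(repo_root: str, skill_name: str) -> Dict[str, PurePosixPath]:
    """Resolve test file locations relative to the repository root (hypothetical helper)."""
    test_root = PurePosixPath(repo_root) / ".test"
    skill_dir = test_root / "skills" / skill_name
    return {
        "ground_truth": skill_dir / "ground_truth.yaml",
        "candidates": skill_dir / "candidates.yaml",
        "manifest": skill_dir / "manifest.yaml",
        # Routing tests are shared across skills, not stored per skill.
        "routing_tests": test_root / "skills" / "_routing" / "ground_truth.yaml",
        "baseline": test_root / "baselines" / skill_name / "baseline.yaml",
    }


if __name__ == "__main__":
    # "/repo" is a placeholder root used only for illustration.
    for kind, path in skill_test_paths("/repo", "databricks-spark-declarative-pipelines").items():
        print(f"{kind}: {path}")
```

Every path is built from the repository root, never from the skill definition directory under .claude/skills/.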
.test/                          # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml              # Package config (pip install -e ".test/")
├── README.md                   # Contributor documentation
├── SKILL.md                    # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh       # Sync script
├── scripts/                    # Wrapper scripts
│   ├── _common.py              # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py           # Trace evaluation
│   ├── list_traces.py          # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/             # Python package
│       ├── cli/                # CLI commands module
│       ├── fixtures/           # Test fixture setup
│       ├── scorers/            # Evaluation scorers
│       ├── grp/                # Generate-Review-Promote pipeline
│       └── runners/            # Evaluation runners
├── skills/                     # Per-skill test definitions
│   ├── _routing/               # Routing test cases
│   └── {skill-name}/           # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                      # Unit tests
├── references/                 # Documentation references
└── baselines/                  # Regression baselines
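A quick way to check a checkout against this layout is to scan .test/skills/ for skill directories that lack one of the three per-skill files. The sketch below is an assumption-laden illustration: `missing_test_files` is a hypothetical helper, and skipping directories whose names start with an underscore (such as `_routing`) is a choice made here because that directory follows a different layout.

```python
from pathlib import Path
from typing import Dict, List

# Per-skill files shown under skills/{skill-name}/ in the tree above.
EXPECTED_FILES = ("ground_truth.yaml", "candidates.yaml", "manifest.yaml")


def missing_test_files(repo_root: Path) -> Dict[str, List[str]]:
    """Map each skill directory to the expected files it lacks (hypothetical helper)."""
    skills_dir = Path(repo_root) / ".test" / "skills"
    report: Dict[str, List[str]] = {}
    if not skills_dir.is_dir():
        return report
    for skill_dir in sorted(p for p in skills_dir.iterdir() if p.is_dir()):
        # Assumption: underscore-prefixed directories (e.g. _routing) are not skills.
        if skill_dir.name.startswith("_"):
            continue
        missing = [name for name in EXPECTED_FILES if not (skill_dir / name).is_file()]
        if missing:
            report[skill_dir.name] = missing
    return report


if __name__ == "__main__":
    # Run from the repository root; prints nothing if every skill is complete.
    for skill, missing in missing_test_files(Path(".")).items():
        print(f"{skill}: missing {', '.join(missing)}")
```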