Write LLM evaluation spec files with datasets, tasks, and evaluators using the @kbn/evals Playwright fixture. Use when authoring new eval specs, adding datasets or evaluators, or debugging evaluation test failures.
Eval specs use the evaluate Playwright fixture (not test). A spec file follows this structure:
```typescript
import { evaluate, tags, selectEvaluators, type Example, type TaskOutput } from '@kbn/evals';

evaluate.describe('Suite name', { tag: tags.serverless.observability.complete }, () => {
  evaluate.beforeAll(async ({ fetch, log }) => {
    // one-time setup: install docs, create agents, load archives
  });

  evaluate.afterAll(async ({ fetch, log }) => {
    // teardown: uninstall docs, delete agents, unload archives
  });

  evaluate('test name', async ({ executorClient, connector }) => {
    await executorClient.runExperiment(
      { dataset, task },
      evaluators
    );
  });
});
```
When a suite has a custom src/evaluate.ts, import from there instead of @kbn/evals:
```typescript
import { evaluate } from '../src/evaluate';
```
Every evaluate.describe must have a tag. Common choices:
| Tag | When to use |
|---|---|
| tags.serverless.observability.complete | Observability domain evals |
| tags.serverless.security.complete | Security domain evals |
| tags.serverless.search | Search domain evals |
| tags.stateful.classic | Stateful-only evals |
Import tags from @kbn/scout or @kbn/evals (re-exported).
A dataset is an array of examples with typed input, output (expected), and optional metadata:
```typescript
type MyExample = Example<
  { question: string },
  { expectedAnswer: string },
  { tags?: string[] }
>;

const dataset = {
  name: 'my-dataset',
  description: 'What this dataset tests',
  examples: [
    {
      input: { question: 'What is 2+2?' },
      output: { expectedAnswer: '4' },
      metadata: { tags: ['math'] },
    },
  ],
};
```
Keep datasets focused. For local iteration, use --grep to run a subset:
```shell
node scripts/evals start --grep "my test name"
```
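Evaluators passed to runExperiment score each task output against the example's expected output. A minimal self-contained sketch of an exact-match evaluator for the dataset above (the shape shown — a name plus an evaluate function returning a score and explanation — is an illustrative assumption, not the confirmed @kbn/evals evaluator interface):

```typescript
// Hypothetical evaluator shape: field names (name, evaluate, score,
// explanation) are assumptions for illustration, not the confirmed
// @kbn/evals interface.
interface EvaluationResult {
  score: number; // 1 = pass, 0 = fail
  explanation?: string;
}

interface QAOutput {
  answer: string;
}

interface QAExpected {
  expectedAnswer: string;
}

const exactAnswerEvaluator = {
  name: 'exact-answer',
  evaluate({ output, expected }: { output: QAOutput; expected: QAExpected }): EvaluationResult {
    // Trim whitespace so incidental formatting doesn't fail the example.
    const pass = output.answer.trim() === expected.expectedAnswer.trim();
    return {
      score: pass ? 1 : 0,
      explanation: pass
        ? 'answer matches expected'
        : `expected "${expected.expectedAnswer}", got "${output.answer}"`,
    };
  },
};
```

For fuzzy expectations, the same shape can wrap an LLM-judge call instead of a string comparison, returning a graded score rather than 0/1.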
The task function receives an example and returns the output to evaluate: