Orchestrate agent-builder evaluation runs — init ES/Kibana/EDOT stack, collect eval parameters, output the run command, and stop services.
This skill manages the lifecycle of running agent-builder evaluations. It accepts $ARGUMENTS as one of: init or stop.
init: Launch ES, Kibana, and EDOT; collect eval parameters; output the run command
stop: Kill background ES, Kibana, and EDOT processes
init
Follow these steps sequentially. Each step requires confirmation before proceeding.
Use AskUserQuestion to ask the user for the path to their GCS credentials file. The default path is exactly ~/.gcs/gcs.client.default.credentials_file.json — do NOT suggest any other path (not ~/.config/gcloud/..., not application_default_credentials, etc.).
What is the path to your GCS credentials file? (default: ~/.gcs/gcs.client.default.credentials_file.json)
If the user accepts the default or leaves it blank, use $HOME/.gcs/gcs.client.default.credentials_file.json (expand ~ to the user's home directory). Validate that the resolved path starts with /. If it does not, ask again.
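The default handling, tilde expansion, and absolute-path check described above can be sketched as a small shell function. This is a minimal illustration only: the function name resolve_credentials_path and the variable DEFAULT_GCS_CREDENTIALS are hypothetical and not part of the skill.

```shell
# Hypothetical helper; the name is illustrative, not part of the skill.
DEFAULT_GCS_CREDENTIALS="$HOME/.gcs/gcs.client.default.credentials_file.json"

resolve_credentials_path() {
  local input="$1"
  # A blank answer means the user accepted the default path.
  if [ -z "$input" ]; then
    input="$DEFAULT_GCS_CREDENTIALS"
  fi
  # Expand a leading ~ to the user's home directory.
  case "$input" in
    "~")   input="$HOME" ;;
    "~/"*) input="$HOME/${input#??}" ;;
  esac
  # Accept absolute paths only; a non-zero status means "ask again".
  case "$input" in
    /*) printf '%s\n' "$input" ;;
    *)  return 1 ;;
  esac
}
```

A caller would loop on the question until the function succeeds, then pass the printed path to --secure-files.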
Launch Elasticsearch in the background using run_in_background. Include the GCS credentials:
yarn es snapshot --license trial --secure-files gcs.client.default.credentials_file=<GCS_CREDENTIALS_PATH>
Tell the user Elasticsearch is starting up.
Wait for Elasticsearch to become available by polling until the cluster health endpoint responds. Fail after 30 attempts (approximately 2.5 minutes):
MAX_RETRIES=30
COUNT=0
until curl -s -u elastic:changeme http://localhost:9200/_cluster/health | grep -q '"status"'; do
  COUNT=$((COUNT+1))
  if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
    echo "ERROR: Elasticsearch did not become available after $MAX_RETRIES attempts"
    exit 1
  fi
  sleep 5
done
If the poll times out, show the error to the user and suggest checking the Elasticsearch background task output for startup errors.
Once ES is ready, register the GCS snapshot repository with these defaults:
Repository name: agent-builder-datasets
Bucket: agent-builder-datasets
Base path: knowledge_base/snapshot_dt=2026-01-10
curl -s -u elastic:changeme -X PUT "http://localhost:9200/_snapshot/agent-builder-datasets" \
-H "Content-Type: application/json" \
-d '{
"type": "gcs",
"settings": {
"bucket": "agent-builder-datasets",
"base_path": "knowledge_base/snapshot_dt=2026-01-10"
}
}'
Verify registration succeeded by checking the response contains "acknowledged":true. If it fails, show the error to the user and ask if they want to retry or abort.
Tell the user the GCS snapshot repository has been registered.
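One way to check a response for a success flag such as "acknowledged":true is a small grep wrapper. This is a sketch, not the skill's own mechanism: the function name response_has_flag is hypothetical, and it assumes the response body has been captured into a shell variable.

```shell
# Hypothetical helper; the name is illustrative, not part of the skill.
# Returns 0 when the JSON body contains "<flag>": true (whitespace tolerated).
response_has_flag() {
  local flag="$1" body="$2"
  printf '%s' "$body" | grep -Eq "\"$flag\"[[:space:]]*:[[:space:]]*true"
}

# Usage sketch, with RESPONSE holding the output of the PUT request:
#   response_has_flag acknowledged "$RESPONSE" \
#     || echo "Repository registration failed: $RESPONSE"
```

The match is textual rather than a full JSON parse, which is adequate for the flat acknowledgement bodies Elasticsearch returns but would not distinguish nested fields of the same name.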
List available snapshots in the repository:
curl -s -u elastic:changeme "http://localhost:9200/_snapshot/agent-builder-datasets/_all"
Parse the response and present each snapshot as an option using AskUserQuestion. For each snapshot, show the snapshot name, the number of indices, and the snapshot date.
Example options:
manual_test_snapshot_2 - 32 indices, Jan 12
text_retrieval_eval_bm25_elser - 2 indices, Feb 12
Once the user selects a snapshot, restore it:
curl -s -u elastic:changeme -X POST "http://localhost:9200/_snapshot/agent-builder-datasets/<snapshot_name>/_restore" \
-H "Content-Type: application/json" \
-d '{
"indices": "*",
"include_global_state": false
}'
Verify the restore was accepted by checking the response contains "accepted":true. If it fails (e.g., index already exists), show the error and ask the user if they want to close conflicting indices and retry, or abort.
To retry with conflicting indices closed:
curl -s -u elastic:changeme -X POST "http://localhost:9200/<comma_separated_index_names>/_close"
Then re-run the restore command.
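The close request takes a comma-separated list of index names in the URL path. A minimal sketch of building that list from individual names follows; join_indices is a hypothetical helper name, and the index names in the usage comment are placeholders.

```shell
# Hypothetical helper; the name is illustrative, not part of the skill.
# Joins all arguments with commas, e.g. for the <comma_separated_index_names> path segment.
join_indices() {
  local IFS=','
  printf '%s\n' "$*"
}

# Usage sketch (index names are placeholders):
#   curl -s -u elastic:changeme -X POST \
#     "http://localhost:9200/$(join_indices index_a index_b)/_close"
```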
Tell the user the snapshot has been restored.
Launch Kibana in the background using run_in_background:
yarn start --no-base-path
Tell the user Kibana is starting up.
Use AskUserQuestion to confirm Phoenix is running:
Is Phoenix running and ready to receive traces?
Options:
Launch the EDOT collector in the background using run_in_background:
ELASTICSEARCH_HOST=http://localhost:9200 ELASTICSEARCH_USERNAME=elastic ELASTICSEARCH_PASSWORD=changeme node scripts/edot_collector.js
Tell the user EDOT is starting up.
Read config/kibana.dev.yml and parse the xpack.actions.preconfigured section to get the list of available connector IDs and names. These connectors are used for both EVALUATION_CONNECTOR_ID (the judge) and --project (the model being evaluated).
If no connectors are found, tell the user to configure connectors in config/kibana.dev.yml under xpack.actions.preconfigured and abort.
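As a rough illustration of the connector discovery step, the awk sketch below lists id (name) pairs. It rests on assumptions that may not hold for every config file: the section is written as the flat key xpack.actions.preconfigured:, connector IDs are indented by exactly two spaces, and each connector has a name: field at four spaces. The function name list_connectors is hypothetical.

```shell
# Hypothetical helper; the name and the assumed YAML layout are illustrative.
# Prints one "id (name)" line per preconfigured connector found in the file.
list_connectors() {
  awk '
    # Enter the section on its flat top-level key.
    /^xpack\.actions\.preconfigured:/ { in_section = 1; next }
    # Any other top-level key ends the section.
    in_section && /^[^ #]/ { in_section = 0 }
    # Connector IDs sit at exactly two spaces of indentation.
    in_section && /^  [^ #][^:]*:[ ]*$/ {
      id = $1; sub(/:$/, "", id); next
    }
    # The name field directly under a connector supplies its label.
    in_section && id != "" && /^    name:/ {
      name = $0
      sub(/^    name:[ ]*/, "", name)
      gsub(/^"|"$/, "", name)   # strip surrounding double quotes only
      print id " (" name ")"
      id = ""
    }
  ' "$1"
}
```

Empty output from list_connectors config/kibana.dev.yml would correspond to the "no connectors found" case. A nested layout (xpack: then actions: then preconfigured:) is not handled here and would need a real YAML parser.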
Use AskUserQuestion to ask which connector to use as the evaluation judge. Present the discovered connectors as options:
Which connector should be used as the evaluation judge (EVALUATION_CONNECTOR_ID)?
Options: one per discovered connector, using id (name) as the label.
Use AskUserQuestion to ask which connector/model to evaluate. Present the discovered connectors as options:
Which model should be evaluated (--project)?
Options: one per discovered connector, using id (name) as the label.
Use AskUserQuestion to ask which dataset to use:
Which dataset should be used?
Options:
agent-builder: text-retrieval: wix-qa
agent-builder: text-retrieval: elastic-qa
agent-builder: text-retrieval: quick-tester
Using the collected values and the following defaults, output the exact command the user should run in a separate terminal:
SELECTED_EVALUATORS: Precision@K,Recall@K,F1@K,Latency,Input Tokens,Output Tokens,Tool Calls,Factuality,Groundedness,Relevance
RAG_EVAL_K: 10,20,30,40
EVALUATION_REPETITIONS: 1
Display a summary and the command:
Stack is ready!
- Elasticsearch: running (snapshot with GCS credentials)
- Kibana: running (no base path)
- Phoenix: confirmed running
- EDOT: running
Important: Make sure Cloud Connected Mode (CCM) is enabled in Kibana before running the evaluation. Go to Stack Management > Cloud Connected Mode in the Kibana UI and enable it if it is not already active.
Run the following command in a separate terminal to start the evaluation:
TRACING_ES_URL=http://elastic:changeme@localhost:9200 \
SELECTED_EVALUATORS="<value>" \
RAG_EVAL_K=<value> \
KBN_EVALS_EXECUTOR=phoenix \
EVALUATION_CONNECTOR_ID=<value> \
DATASET_NAME="<value>" \
EVALUATION_REPETITIONS=<value> \
KBN_EVALS_SKIP_CONNECTOR_SETUP=true \
node scripts/playwright test \
  --config x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/playwright.config.ts \
  evals/external/external_dataset.spec.ts \
  --project <value>
Substitute the actual user-selected values into the command. The user will copy-paste and run this themselves. Do NOT append any extra notes or warnings after the command block.
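The substitution can be sketched as a function that prints the finished command for copy-pasting. This is illustrative only: build_eval_command is a hypothetical name, and the sketch assumes the listed defaults map to SELECTED_EVALUATORS, RAG_EVAL_K (10,20,30,40), and EVALUATION_REPETITIONS (1).

```shell
# Hypothetical helper; the name is illustrative, not part of the skill.
# Arguments: <judge connector id> <connector id to evaluate> <dataset name>
build_eval_command() {
  local judge="$1" project="$2" dataset="$3"
  # Assumed default evaluator list, taken from the defaults in this document.
  local evaluators="Precision@K,Recall@K,F1@K,Latency,Input Tokens,Output Tokens,Tool Calls,Factuality,Groundedness,Relevance"
  # In an unquoted here-document, \\ prints a single trailing backslash.
  cat <<EOF
TRACING_ES_URL=http://elastic:changeme@localhost:9200 \\
SELECTED_EVALUATORS="$evaluators" \\
RAG_EVAL_K=10,20,30,40 \\
KBN_EVALS_EXECUTOR=phoenix \\
EVALUATION_CONNECTOR_ID=$judge \\
DATASET_NAME="$dataset" \\
EVALUATION_REPETITIONS=1 \\
KBN_EVALS_SKIP_CONNECTOR_SETUP=true \\
node scripts/playwright test \\
  --config x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/playwright.config.ts \\
  evals/external/external_dataset.spec.ts \\
  --project $project
EOF
}
```

The function only prints text; it never runs the evaluation, which matches the requirement that the user runs the command themselves in a separate terminal.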
stop
Kill background ES, Kibana, and EDOT processes that were launched during init.
Run the following commands to find and kill the relevant processes:
# Kill Elasticsearch
pkill -f 'elasticsearch' || true
# Kill Kibana (node process started by yarn start)
pkill -f 'scripts/kibana --dev' || true
# Kill EDOT collector
pkill -f 'edot_collector' || true
Tell the user:
All evaluation stack processes (ES, Kibana, EDOT) have been stopped.
ES, Kibana, and EDOT are launched with run_in_background. Their task IDs are tracked by the session so stop can kill them.
TRACING_ES_URL, KBN_EVALS_EXECUTOR, and KBN_EVALS_SKIP_CONNECTOR_SETUP are not configurable; they are set for the local dev stack.
Always use node scripts/playwright test; never use npx playwright test.
The Playwright config path is x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/playwright.config.ts.
The spec file is evals/external/external_dataset.spec.ts.