Create and manage evaluation datasets for Cortex Agents. Use this workflow to build high-quality datasets from scratch, from production data, or to add questions to existing datasets. Outputs datasets in the format required by Snowflake's native Agent Evaluations (the evaluate-cortex-agent skill).
Prerequisites:
- Snowflake access

Understanding the dataset format:
Snowflake Agent Evaluations require a specific format:
Source table columns:
| Column | Type | Description |
|---|---|---|
| INPUT_QUERY | VARCHAR | The question to ask the agent |
| GROUND_TRUTH | OBJECT | Expected results (structure below) |
GROUND_TRUTH structure:
{
  "ground_truth_output": "Expected answer text"
}
What each field enables:
| Field | Enables Metric |
|---|---|
| ground_truth_output | answer_correctness |
| (none required) | logical_consistency |
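Before loading rows, it can help to sanity-check them against this format. Below is a minimal Python sketch; the column and field names (INPUT_QUERY, GROUND_TRUTH, ground_truth_output) come from the tables above, while the helper itself is illustrative:

```python
# Minimal validator for evaluation-dataset rows, matching the format above.
# Column names come from this section; the helper is illustrative.

def validate_row(row: dict) -> list[str]:
    """Return a list of problems with a candidate dataset row (empty = OK)."""
    problems = []
    query = row.get("INPUT_QUERY")
    if not isinstance(query, str) or not query.strip():
        problems.append("INPUT_QUERY must be a non-empty string")
    gt = row.get("GROUND_TRUTH")
    if not isinstance(gt, dict):
        problems.append("GROUND_TRUTH must be an object, not a plain string")
    elif not isinstance(gt.get("ground_truth_output"), str):
        problems.append("GROUND_TRUTH.ground_truth_output must be a string")
    return problems

good = {"INPUT_QUERY": "What was Q3 revenue?",
        "GROUND_TRUTH": {"ground_truth_output": "Total revenue for Q3 2025 was $2.5M"}}
bad = {"INPUT_QUERY": "What was Q3 revenue?",
       "GROUND_TRUTH": "Total revenue was $2.5M"}  # string, not an object

print(validate_row(good))  # []
print(validate_row(bad))   # ['GROUND_TRUTH must be an object, not a plain string']
```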
Goal: Design and build an evaluation dataset for a new or untested agent.
Gather agent information:
-- Get agent tools
SELECT tool_name, tool_type, tool_spec
FROM <DATABASE>.INFORMATION_SCHEMA.CORTEX_AGENT_TOOLS
WHERE agent_name = '<AGENT_NAME>';
Or extract from agent config:
uv run --project <SKILL_DIR> python <SKILL_DIR>/scripts/get_agent_config.py \
--agent-name AGENT_NAME --database DATABASE --schema SCHEMA \
--connection CONNECTION_NAME --output agent_config.json
Document capabilities:
Before creating questions, understand what the agent is designed to do:
DESCRIBE AGENT <DATABASE>.<SCHEMA>.<AGENT_NAME>;
Check instructions in the agent config for:
⚠️ Common pitfall: Creating analytics questions for a customer-service agent that's programmed to deflect data queries.
Present findings to user:
Agent Instructions Summary:
- Persona: [Customer service / Analytics / etc.]
- Guardrails: [List any restrictions]
- Example questions from instructions: [List sample questions if provided]
I'll design questions that align with this persona and respect these guardrails.
Recommended distribution:
| Category | % | Purpose | Example |
|---|---|---|---|
| Core use cases | 40% | Primary agent purpose | "What was Q3 revenue?" |
| Tool routing | 25% | Verify correct tool selection | "Show ML platform usage" (not general usage) |
| Edge cases | 15% | Boundary conditions | "Revenue for Feb 30th" (invalid date) |
| Ambiguous queries | 10% | Interpretation tests | "Show me recent activity" (vague) |
| Data validation | 10% | Quality checks | "Total for incomplete period" |
For each tool, include:
Target: 10-20 queries, depending on agent complexity.
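The recommended distribution above can be turned into concrete per-category counts for a chosen dataset size. A small sketch (the percentages come from the table above; the rounding rule is illustrative):

```python
# Turn the recommended category percentages (table above) into concrete
# question counts for a target dataset size. Rounding remainders go to the
# largest categories first so the counts always sum to the target.

DISTRIBUTION = {            # percentages from the table above
    "core_use_cases": 40,
    "tool_routing": 25,
    "edge_cases": 15,
    "ambiguous": 10,
    "data_validation": 10,
}

def question_counts(total: int, dist: dict = DISTRIBUTION) -> dict:
    counts = {cat: total * pct // 100 for cat, pct in dist.items()}
    remainder = total - sum(counts.values())
    for cat in sorted(dist, key=dist.get, reverse=True):
        if remainder == 0:
            break
        counts[cat] += 1
        remainder -= 1
    return counts

print(question_counts(20))
# {'core_use_cases': 8, 'tool_routing': 5, 'edge_cases': 3, 'ambiguous': 2, 'data_validation': 2}
```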
Work with user to create questions:
Present proposed questions one category at a time:
Here are proposed evaluation questions for [CATEGORY]:
| # | Question | Expected Tool | Notes |
|---|----------|---------------|-------|
| 1 | [question] | [tool] | [note] |
| 2 | [question] | [tool] | [note] |
Any to add, modify, or remove for this category?
STOP: Get user approval on each category before moving to next.
For each question, gather:
Let's start with core use cases for [TOOL_NAME]:
Question 1: "What was the total revenue for Q3 2025?"
Expected answer: ?
Expected tool: ?
Generate ground truth for each approved question based on:
Present ground truth for review:
| # | Question | Expected Tool(s) | Ground Truth Output |
|---|----------|------------------|---------------------|
| 1 | [question] | [tool] | [concise expected answer] |
Review the ground truth above. Any corrections needed?
STOP: Get user approval on ground truth before creating table.
Expected answer guidelines:
✅ Good (specific, verifiable):
❌ Bad (vague, unverifiable):
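A rough automated filter can flag ground-truth answers that look vague before the review step. This is only a heuristic sketch (the marker words and the "no digits means vague" rule are illustrative assumptions, not a rule from this workflow):

```python
import re

# Heuristic check for vague ground-truth answers, following the guidelines
# above: a good expected answer is specific and verifiable (contains concrete
# values); a bad one hedges. The marker words and rules are illustrative.

VAGUE_MARKERS = {"some", "various", "several", "a lot", "around", "roughly"}

def looks_vague(answer: str) -> bool:
    text = answer.lower()
    has_number = bool(re.search(r"\d", text))
    has_marker = any(marker in text for marker in VAGUE_MARKERS)
    return has_marker or not has_number

print(looks_vague("Total revenue for Q3 2025 was $2.5M"))               # False
print(looks_vague("Revenue was pretty high, around several million"))   # True
```

Flagged answers still need human review; edge-case answers legitimately free of numbers will trip this check.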
Create source table:
CREATE OR REPLACE TABLE <DATABASE>.<SCHEMA>.EVAL_DATASET_<AGENT_NAME> (
question_id INT AUTOINCREMENT,
INPUT_QUERY VARCHAR NOT NULL,
GROUND_TRUTH OBJECT NOT NULL,
category VARCHAR,
author VARCHAR DEFAULT CURRENT_USER(),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
notes VARCHAR
);
Insert questions (use INSERT ... SELECT — Snowflake rejects expressions such as OBJECT_CONSTRUCT inside a VALUES clause):
INSERT INTO EVAL_DATASET_<AGENT_NAME> (INPUT_QUERY, GROUND_TRUTH, category, notes)
SELECT
  'What was the total revenue for Q3 2025?',
  OBJECT_CONSTRUCT('ground_truth_output', 'Total revenue for Q3 2025 was $2.5M'),
  'core_use_case',
  'Basic revenue query';
INSERT INTO EVAL_DATASET_<AGENT_NAME> (INPUT_QUERY, GROUND_TRUTH, category, notes)
SELECT
  'Show ML platform usage for last month',
  OBJECT_CONSTRUCT('ground_truth_output', 'ML Platform had 1,234 executions last month'),
  'tool_routing',
  'Should route to ML tool, not general usage';
Critical format requirements:
- GROUND_TRUTH column is required by SYSTEM$CREATE_EVALUATION_DATASET
- GROUND_TRUTH must be OBJECT or VARIANT (do not use VARCHAR)
- Use OBJECT_CONSTRUCT() to insert ground truth data

Create evaluation dataset:
CALL SYSTEM$CREATE_EVALUATION_DATASET(
'Cortex Agent',
'<DATABASE>.<SCHEMA>.EVAL_DATASET_<AGENT_NAME>',
'<AGENT_NAME>_eval_v1',
OBJECT_CONSTRUCT('query_text', 'INPUT_QUERY', 'ground_truth', 'GROUND_TRUTH')
);
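For a long approved question list, the insert statements can be generated from plain Python data instead of hand-written. A sketch (table and column names match this section; everything else, including the row fields, is illustrative — Snowflake rejects expressions such as OBJECT_CONSTRUCT inside a VALUES clause, so the generator emits the INSERT ... SELECT form):

```python
# Sketch: generate insert SQL for approved questions from plain Python data.
# Table and column names match this section; the row fields are illustrative.
# Quoting is minimal (single quotes doubled) - prefer bound parameters in
# real code.

def insert_sql(table: str, rows: list[dict]) -> str:
    def q(s: str) -> str:
        return "'" + s.replace("'", "''") + "'"

    selects = "\nUNION ALL\n".join(
        f"SELECT {q(r['question'])},\n"
        f"       OBJECT_CONSTRUCT('ground_truth_output', {q(r['answer'])}),\n"
        f"       {q(r['category'])}, {q(r.get('notes', ''))}"
        for r in rows
    )
    return (f"INSERT INTO {table} (INPUT_QUERY, GROUND_TRUTH, category, notes)\n"
            f"{selects};")

print(insert_sql("EVAL_DATASET_X", [
    {"question": "What was Q3 revenue?", "answer": "$2.5M", "category": "core_use_case"},
]))
```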
Deliverables:
Goal: Build evaluation dataset from real production queries.
Option 1: Use Agent Events Explorer (recommended)
uv run --project <SKILL_DIR> streamlit run <SKILL_DIR>/scripts/agent_events_explorer.py -- \
--connection CONNECTION_NAME \
--database DATABASE \
--schema SCHEMA \
--agent AGENT_NAME
Option 2: Query observability logs directly
-- Find recent agent interactions using AI Observability
SELECT DISTINCT
RECORD_ATTRIBUTES:"ai.observability.record_root.input"::STRING AS USER_QUESTION,
RECORD_ATTRIBUTES:"ai.observability.record_root.output"::STRING AS AGENT_RESPONSE,
RECORD_ATTRIBUTES:"ai.observability.record_id"::STRING AS REQUEST_ID
FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_EVENTS(
'<DATABASE>',
'<SCHEMA>',
'<AGENT_NAME>',
'CORTEX AGENT'))
WHERE RECORD_ATTRIBUTES:"ai.observability.span_type" = 'record_root'
AND USER_QUESTION IS NOT NULL
ORDER BY RECORD_ATTRIBUTES:"ai.observability.record_id" DESC
LIMIT 100;
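Once the query above returns rows, duplicates can be collapsed and the most frequent questions surfaced as evaluation candidates. A minimal Python sketch (the attribute key matches the query above; the sample events are illustrative):

```python
from collections import Counter

# Sketch: deduplicate questions pulled from AI Observability events and
# surface the most frequent ones as evaluation candidates. The attribute key
# matches the query above; the sample events are illustrative.

KEY = "ai.observability.record_root.input"

def top_questions(events: list[dict], n: int = 5) -> list[tuple[str, int]]:
    counts = Counter(
        e["RECORD_ATTRIBUTES"][KEY]
        for e in events
        if e.get("RECORD_ATTRIBUTES", {}).get(KEY)
    )
    return counts.most_common(n)

events = [
    {"RECORD_ATTRIBUTES": {KEY: "What was Q3 revenue?"}},
    {"RECORD_ATTRIBUTES": {KEY: "What was Q3 revenue?"}},
    {"RECORD_ATTRIBUTES": {KEY: "Show ML platform usage"}},
    {"RECORD_ATTRIBUTES": {}},  # no input recorded -> skipped
]
print(top_questions(events))
# [('What was Q3 revenue?', 2), ('Show ML platform usage', 1)]
```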
Present findings:
Found [N] unique questions in agent logs. Here are common patterns:
1. [Category]: "example question 1", "example question 2"
2. [Category]: "example question 3"
Would you like to include some of these in your evaluation dataset?
Criteria for good evaluation questions:
Filter examples:
-- Find questions about specific topics
WHERE question ILIKE '%revenue%'
-- Find questions that used specific tools
WHERE RECORD:response LIKE '%tool_name%'
-- Find questions with errors or issues
WHERE RECORD:response:error IS NOT NULL
For each selected question:
Using Agent Events Explorer:
Manual annotation:
CREATE OR REPLACE TABLE EVAL_ANNOTATIONS AS
SELECT
REQUEST_ID,
question,
answer AS actual_answer,
NULL AS expected_answer, -- Fill in manually
NULL AS is_correct -- Fill in manually
FROM production_events;
From annotated data:
CREATE OR REPLACE TABLE EVAL_DATASET_<AGENT_NAME> AS
SELECT
ROW_NUMBER() OVER (ORDER BY timestamp) AS question_id,
question AS INPUT_QUERY,
OBJECT_CONSTRUCT(
'ground_truth_output', expected_answer
) AS GROUND_TRUTH,
CASE WHEN is_correct THEN 'passing' ELSE 'failing' END AS category,
'production_data' AS source
FROM annotated_production_data
WHERE expected_answer IS NOT NULL;
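Because the query above silently drops rows where expected_answer is NULL, it is worth checking how much annotated data you are losing before building the table. A sketch (field names match the annotation table above; the sample rows are illustrative):

```python
# Sketch: check an annotated export before building the dataset table.
# Field names match the annotation table above; sample rows are illustrative.

def annotation_report(rows: list[dict]) -> dict:
    missing = [r for r in rows if r.get("expected_answer") in (None, "")]
    questions = [r["question"] for r in rows]
    duplicates = len(questions) - len(set(questions))
    return {"total": len(rows),
            "missing_answer": len(missing),
            "duplicate_questions": duplicates}

rows = [
    {"question": "What was Q3 revenue?", "expected_answer": "$2.5M"},
    {"question": "Show ML usage", "expected_answer": None},       # will be dropped
    {"question": "What was Q3 revenue?", "expected_answer": "$2.5M"},  # duplicate
]
print(annotation_report(rows))
# {'total': 3, 'missing_answer': 1, 'duplicate_questions': 1}
```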
CALL SYSTEM$CREATE_EVALUATION_DATASET(
'Cortex Agent',
'<DATABASE>.<SCHEMA>.EVAL_DATASET_<AGENT_NAME>', -- source table FQN
'<AGENT_NAME>_eval_v1', -- version
OBJECT_CONSTRUCT('query_text', 'INPUT_QUERY', 'ground_truth', 'GROUND_TRUTH') -- column mapping
);
Deliverables:
Goal: Expand coverage of existing evaluation dataset.
-- Count by category
SELECT category, COUNT(*) as count
FROM EVAL_DATASET_<AGENT_NAME>
GROUP BY category;
-- List all questions
SELECT question_id, INPUT_QUERY, category
FROM EVAL_DATASET_<AGENT_NAME>
ORDER BY question_id;
Identify gaps:
Current Coverage:
- revenue_tool: 5 questions
- usage_tool: 3 questions
- ml_platform_tool: 0 questions ← GAP
- Edge cases: 1 question ← GAP
- Tool routing tests: 2 questions ← Need more
Recommendations:
1. Add 2 questions for ml_platform_tool
2. Add 3 edge case questions
3. Add 2 tool routing tests
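The gap analysis above can also be computed mechanically from per-category counts. A sketch (the target counts are illustrative assumptions, not requirements from this workflow):

```python
# Sketch of the gap analysis above: compare per-category question counts
# against minimum targets and report how many questions to add.
# The targets are illustrative assumptions.

TARGETS = {"core_use_case": 5, "tool_routing": 4, "edge_case": 3, "ambiguous": 2}

def coverage_gaps(current: dict, targets: dict = TARGETS) -> dict:
    """Map each under-covered category to the number of questions needed."""
    return {cat: target - current.get(cat, 0)
            for cat, target in targets.items()
            if current.get(cat, 0) < target}

current = {"core_use_case": 5, "tool_routing": 2, "edge_case": 1}
print(coverage_gaps(current))
# {'tool_routing': 2, 'edge_case': 2, 'ambiguous': 2}
```

Feed it the output of the `GROUP BY category` query above to get a to-do list of questions per category.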
INSERT INTO EVAL_DATASET_<AGENT_NAME> (INPUT_QUERY, GROUND_TRUTH, category, notes)
-- ML platform question (filling coverage gap)
SELECT
  'How many ML models were trained last quarter?',
  OBJECT_CONSTRUCT('ground_truth_output', '47 models were trained in Q4 2025'),
  'core_use_case',
  'New - filling ML tool coverage gap'
UNION ALL
-- Edge case: invalid date
SELECT
  'What was revenue on February 30th, 2025?',
  OBJECT_CONSTRUCT('ground_truth_output', 'February 30th is not a valid date. Please provide a valid date.'),
  'edge_case',
  'Invalid date edge case'
UNION ALL
-- Ambiguous: should ask for clarification
SELECT
  'Show platform usage statistics',
  OBJECT_CONSTRUCT('ground_truth_output', 'I need clarification: Are you asking about ML Platform usage or general Snowflake platform usage?'),
  'ambiguous',
  'Ambiguous - should ask for clarification';
Important: After adding questions, re-register the dataset by passing a new version name to SYSTEM$CREATE_EVALUATION_DATASET.
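Deriving the next version name from the current one, following the `_v1`, `_v2` convention used in this workflow, is a one-liner worth getting right. A sketch (the agent names are illustrative):

```python
import re

# Sketch: derive the next dataset version name from the current one,
# following the _v1, _v2 naming convention used in this workflow.

def next_version(name: str) -> str:
    match = re.search(r"_v(\d+)$", name)
    if not match:
        return name + "_v1"            # no suffix yet: start at v1
    return name[:match.start()] + f"_v{int(match.group(1)) + 1}"

print(next_version("SALES_AGENT_eval_v1"))  # SALES_AGENT_eval_v2
print(next_version("SALES_AGENT_eval"))     # SALES_AGENT_eval_v1
```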
Deliverables:
Do:
Don't:
Do:
Don't:
Minimum targets:
- Version datasets with suffixes (_v1, _v2, etc.)

From adhoc-testing-for-cortex-agent:
To evaluate-cortex-agent:
In optimize-cortex-agent: