Create and manage evaluation datasets for Cortex Agents. Use this workflow to build high-quality datasets from scratch, from production data, or to add questions to existing datasets. Outputs datasets in the format required by Snowflake's native Agent Evaluations (the evaluate-cortex-agent skill).
Prerequisites:
- Snowflake access

Understanding the dataset format:
Snowflake Agent Evaluations require a specific format:
Source table columns:
| Column | Type | Description |
|---|---|---|
| INPUT_QUERY | VARCHAR | The question to ask the agent |
| GROUND_TRUTH | OBJECT | Expected results (structure below) |
GROUND_TRUTH structure:
{
  "ground_truth_output": "Expected answer text"
}
What each field enables:
| Field | Enables Metric |
|---|---|
| ground_truth_output | answer_correctness |
| (none required) | logical_consistency |
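Before loading rows, it can help to sanity-check them against this format. Below is a minimal Python sketch; the column and field names (INPUT_QUERY, GROUND_TRUTH, ground_truth_output) come from the tables above, while the helper itself is illustrative:

```python
# Minimal validator for evaluation-dataset rows, matching the format above.
# Column names come from this section; the helper is illustrative.

def validate_row(row: dict) -> list[str]:
    """Return a list of problems with a candidate dataset row (empty = OK)."""
    problems = []
    query = row.get("INPUT_QUERY")
    if not isinstance(query, str) or not query.strip():
        problems.append("INPUT_QUERY must be a non-empty string")
    gt = row.get("GROUND_TRUTH")
    if not isinstance(gt, dict):
        problems.append("GROUND_TRUTH must be an object, not a plain string")
    elif not isinstance(gt.get("ground_truth_output"), str):
        problems.append("GROUND_TRUTH.ground_truth_output must be a string")
    return problems

good = {"INPUT_QUERY": "What was Q3 revenue?",
        "GROUND_TRUTH": {"ground_truth_output": "Total revenue for Q3 2025 was $2.5M"}}
bad = {"INPUT_QUERY": "What was Q3 revenue?",
       "GROUND_TRUTH": "Total revenue was $2.5M"}  # string, not an object

print(validate_row(good))  # []
print(validate_row(bad))   # ['GROUND_TRUTH must be an object, not a plain string']
```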
Goal: Design and build an evaluation dataset for a new or untested agent.
Gather agent information:
-- Get agent tools
SELECT tool_name, tool_type, tool_spec
FROM <DATABASE>.INFORMATION_SCHEMA.CORTEX_AGENT_TOOLS
WHERE agent_name = '<AGENT_NAME>';
Or extract from agent config:
uv run --project <SKILL_DIR> python <SKILL_DIR>/scripts/get_agent_config.py \
--agent-name AGENT_NAME --database DATABASE --schema SCHEMA \
--connection CONNECTION_NAME --output agent_config.json
Document capabilities:
Before creating questions, understand what the agent is designed to do:
DESCRIBE AGENT <DATABASE>.<SCHEMA>.<AGENT_NAME>;
Check instructions in the agent config for:
⚠️ Common pitfall: Creating analytics questions for a customer-service agent that's programmed to deflect data queries.
Present findings to user:
Agent Instructions Summary:
- Persona: [Customer service / Analytics / etc.]
- Guardrails: [List any restrictions]
- Example questions from instructions: [List sample questions if provided]
I'll design questions that align with this persona and respect these guardrails.
Recommended distribution:
| Category | % | Purpose | Example |
|---|---|---|---|
| Core use cases | 40% | Primary agent purpose | "What was Q3 revenue?" |
| Tool routing | 25% | Verify correct tool selection | "Show ML platform usage" (not general usage) |
| Edge cases | 15% | Boundary conditions | "Revenue for Feb 30th" (invalid date) |
| Ambiguous queries | 10% | Interpretation tests | "Show me recent activity" (vague) |
| Data validation | 10% | Quality checks | "Total for incomplete period" |
For each tool, include:
Target: 10-20 queries, depending on agent complexity.
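The recommended distribution above can be turned into concrete per-category counts for a chosen dataset size. A small sketch (the percentages come from the table above; the rounding rule is illustrative):

```python
# Turn the recommended category percentages (table above) into concrete
# question counts for a target dataset size. Rounding remainders go to the
# largest categories first so the counts always sum to the target.

DISTRIBUTION = {            # percentages from the table above
    "core_use_cases": 40,
    "tool_routing": 25,
    "edge_cases": 15,
    "ambiguous": 10,
    "data_validation": 10,
}

def question_counts(total: int, dist: dict = DISTRIBUTION) -> dict:
    counts = {cat: total * pct // 100 for cat, pct in dist.items()}
    remainder = total - sum(counts.values())
    for cat in sorted(dist, key=dist.get, reverse=True):
        if remainder == 0:
            break
        counts[cat] += 1
        remainder -= 1
    return counts

print(question_counts(20))
# {'core_use_cases': 8, 'tool_routing': 5, 'edge_cases': 3, 'ambiguous': 2, 'data_validation': 2}
```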
Work with user to create questions:
Present proposed questions one category at a time:
Here are proposed evaluation questions for [CATEGORY]:
| # | Question | Expected Tool | Notes |
|---|----------|---------------|-------|
| 1 | [question] | [tool] | [note] |
| 2 | [question] | [tool] | [note] |
Any to add, modify, or remove for this category?
STOP: Get user approval on each category before moving to next.
For each question, gather:
Let's start with core use cases for [TOOL_NAME]:
Question 1: "What was the total revenue for Q3 2025?"
Expected answer: ?
Expected tool: ?
Generate ground truth for each approved question based on:
Present ground truth for review:
| # | Question | Expected Tool(s) | Ground Truth Output |
|---|----------|------------------|---------------------|
| 1 | [question] | [tool] | [concise expected answer] |
Review the ground truth above. Any corrections needed?
STOP: Get user approval on ground truth before creating table.
Expected answer guidelines:
✅ Good (specific, verifiable):
❌ Bad (vague, unverifiable):
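A rough automated filter can flag ground-truth answers that look vague before the review step. This is only a heuristic sketch (the marker words and the "no digits means vague" rule are illustrative assumptions, not a rule from this workflow):

```python
import re

# Heuristic check for vague ground-truth answers, following the guidelines
# above: a good expected answer is specific and verifiable (contains concrete
# values); a bad one hedges. The marker words and rules are illustrative.

VAGUE_MARKERS = {"some", "various", "several", "a lot", "around", "roughly"}

def looks_vague(answer: str) -> bool:
    text = answer.lower()
    has_number = bool(re.search(r"\d", text))
    has_marker = any(marker in text for marker in VAGUE_MARKERS)
    return has_marker or not has_number

print(looks_vague("Total revenue for Q3 2025 was $2.5M"))               # False
print(looks_vague("Revenue was pretty high, around several million"))   # True
```

Flagged answers still need human review; edge-case answers legitimately free of numbers will trip this check.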
Create source table:
CREATE OR REPLACE TABLE <DATABASE>.<SCHEMA>.EVAL_DATASET_<AGENT_NAME> (
question_id INT AUTOINCREMENT,
INPUT_QUERY VARCHAR NOT NULL,
GROUND_TRUTH OBJECT NOT NULL,
category VARCHAR,
author VARCHAR DEFAULT CURRENT_USER(),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
notes VARCHAR
);
Insert questions (use INSERT ... SELECT — Snowflake rejects expressions such as OBJECT_CONSTRUCT inside a VALUES clause):
INSERT INTO EVAL_DATASET_<AGENT_NAME> (INPUT_QUERY, GROUND_TRUTH, category, notes)
SELECT
  'What was the total revenue for Q3 2025?',
  OBJECT_CONSTRUCT('ground_truth_output', 'Total revenue for Q3 2025 was $2.5M'),
  'core_use_case',
  'Basic revenue query';
INSERT INTO EVAL_DATASET_<AGENT_NAME> (INPUT_QUERY, GROUND_TRUTH, category, notes)
SELECT
  'Show ML platform usage for last month',
  OBJECT_CONSTRUCT('ground_truth_output', 'ML Platform had 1,234 executions last month'),
  'tool_routing',
  'Should route to ML tool, not general usage';
Critical format requirements:
- GROUND_TRUTH column is required by SYSTEM$CREATE_EVALUATION_DATASET
- GROUND_TRUTH must be OBJECT or VARIANT (do not use VARCHAR)
- Use OBJECT_CONSTRUCT() to insert ground truth data

Create evaluation dataset:
CALL SYSTEM$CREATE_EVALUATION_DATASET(
'Cortex Agent',
'<DATABASE>.<SCHEMA>.EVAL_DATASET_<AGENT_NAME>',
'<AGENT_NAME>_eval_v1',
OBJECT_CONSTRUCT('query_text', 'INPUT_QUERY', 'ground_truth', 'GROUND_TRUTH')
);
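For a long approved question list, the insert statements can be generated from plain Python data instead of hand-written. A sketch (table and column names match this section; everything else, including the row fields, is illustrative — Snowflake rejects expressions such as OBJECT_CONSTRUCT inside a VALUES clause, so the generator emits the INSERT ... SELECT form):

```python
# Sketch: generate insert SQL for approved questions from plain Python data.
# Table and column names match this section; the row fields are illustrative.
# Quoting is minimal (single quotes doubled) - prefer bound parameters in
# real code.

def insert_sql(table: str, rows: list[dict]) -> str:
    def q(s: str) -> str:
        return "'" + s.replace("'", "''") + "'"

    selects = "\nUNION ALL\n".join(
        f"SELECT {q(r['question'])},\n"
        f"       OBJECT_CONSTRUCT('ground_truth_output', {q(r['answer'])}),\n"
        f"       {q(r['category'])}, {q(r.get('notes', ''))}"
        for r in rows
    )
    return (f"INSERT INTO {table} (INPUT_QUERY, GROUND_TRUTH, category, notes)\n"
            f"{selects};")

print(insert_sql("EVAL_DATASET_X", [
    {"question": "What was Q3 revenue?", "answer": "$2.5M", "category": "core_use_case"},
]))
```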
Deliverables:
Goal: Build evaluation dataset from real production queries.
Option 1: Use Agent Events Explorer (recommended)
uv run --project <SKILL_DIR> streamlit run <SKILL_DIR>/scripts/agent_events_explorer.py -- \
--connection CONNECTION_NAME \
--database DATABASE \
--schema SCHEMA \
--agent AGENT_NAME
Option 2: Query observability logs directly
-- Find recent agent interactions using AI Observability
SELECT DISTINCT
RECORD_ATTRIBUTES:"ai.observability.record_root.input"::STRING AS USER_QUESTION,
RECORD_ATTRIBUTES:"ai.observability.record_root.output"::STRING AS AGENT_RESPONSE,
RECORD_ATTRIBUTES:"ai.observability.record_id"::STRING AS REQUEST_ID
FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_EVENTS(
'<DATABASE>',
'<SCHEMA>',
'<AGENT_NAME>',
'CORTEX AGENT'))
WHERE RECORD_ATTRIBUTES:"ai.observability.span_type" = 'record_root'
AND USER_QUESTION IS NOT NULL
ORDER BY RECORD_ATTRIBUTES:"ai.observability.record_id" DESC
LIMIT 100;
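Once the query above returns rows, duplicates can be collapsed and the most frequent questions surfaced as evaluation candidates. A minimal Python sketch (the attribute key matches the query above; the sample events are illustrative):

```python
from collections import Counter

# Sketch: deduplicate questions pulled from AI Observability events and
# surface the most frequent ones as evaluation candidates. The attribute key
# matches the query above; the sample events are illustrative.

KEY = "ai.observability.record_root.input"

def top_questions(events: list[dict], n: int = 5) -> list[tuple[str, int]]:
    counts = Counter(
        e["RECORD_ATTRIBUTES"][KEY]
        for e in events
        if e.get("RECORD_ATTRIBUTES", {}).get(KEY)
    )
    return counts.most_common(n)

events = [
    {"RECORD_ATTRIBUTES": {KEY: "What was Q3 revenue?"}},
    {"RECORD_ATTRIBUTES": {KEY: "What was Q3 revenue?"}},
    {"RECORD_ATTRIBUTES": {KEY: "Show ML platform usage"}},
    {"RECORD_ATTRIBUTES": {}},  # no input recorded -> skipped
]
print(top_questions(events))
# [('What was Q3 revenue?', 2), ('Show ML platform usage', 1)]
```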
Present findings:
Found [N] unique questions in agent logs. Here are common patterns:
1. [Category]: "example question 1", "example question 2"
2. [Category]: "example question 3"
Would you like to include some of these in your evaluation dataset?
Criteria for good evaluation questions:
Filter examples:
-- Find questions about specific topics
WHERE question ILIKE '%revenue%'
-- Find questions that used specific tools
WHERE RECORD:response LIKE '%tool_name%'
-- Find questions with errors or issues
WHERE RECORD:response:error IS NOT NULL
For each selected question:
Using Agent Events Explorer:
Manual annotation:
CREATE OR REPLACE TABLE EVAL_ANNOTATIONS AS
SELECT
REQUEST_ID,
question,
answer AS actual_answer,
NULL AS expected_answer, -- Fill in manually
NULL AS is_correct -- Fill in manually
FROM production_events;
From annotated data:
CREATE OR REPLACE TABLE EVAL_DATASET_<AGENT_NAME> AS
SELECT
ROW_NUMBER() OVER (ORDER BY timestamp) AS question_id,
question AS INPUT_QUERY,
OBJECT_CONSTRUCT(
'ground_truth_output', expected_answer
) AS GROUND_TRUTH,
CASE WHEN is_correct THEN 'passing' ELSE 'failing' END AS category,
'production_data' AS source
FROM annotated_production_data
WHERE expected_answer IS NOT NULL;
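Because the query above silently drops rows where expected_answer is NULL, it is worth checking how much annotated data you are losing before building the table. A sketch (field names match the annotation table above; the sample rows are illustrative):

```python
# Sketch: check an annotated export before building the dataset table.
# Field names match the annotation table above; sample rows are illustrative.

def annotation_report(rows: list[dict]) -> dict:
    missing = [r for r in rows if r.get("expected_answer") in (None, "")]
    questions = [r["question"] for r in rows]
    duplicates = len(questions) - len(set(questions))
    return {"total": len(rows),
            "missing_answer": len(missing),
            "duplicate_questions": duplicates}

rows = [
    {"question": "What was Q3 revenue?", "expected_answer": "$2.5M"},
    {"question": "Show ML usage", "expected_answer": None},       # will be dropped
    {"question": "What was Q3 revenue?", "expected_answer": "$2.5M"},  # duplicate
]
print(annotation_report(rows))
# {'total': 3, 'missing_answer': 1, 'duplicate_questions': 1}
```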
CALL SYSTEM$CREATE_EVALUATION_DATASET(
'Cortex Agent',
'<DATABASE>.<SCHEMA>.EVAL_DATASET_<AGENT_NAME>', -- source table FQN
'<AGENT_NAME>_eval_v1', -- version
OBJECT_CONSTRUCT('query_text', 'INPUT_QUERY', 'ground_truth', 'GROUND_TRUTH') -- column mapping
);
Deliverables:
Goal: Expand coverage of existing evaluation dataset.
-- Count by category
SELECT category, COUNT(*) as count
FROM EVAL_DATASET_<AGENT_NAME>
GROUP BY category;
-- List all questions
SELECT question_id, INPUT_QUERY, category
FROM EVAL_DATASET_<AGENT_NAME>
ORDER BY question_id;
Identify gaps:
Current Coverage:
- revenue_tool: 5 questions
- usage_tool: 3 questions
- ml_platform_tool: 0 questions ← GAP
- Edge cases: 1 question ← GAP
- Tool routing tests: 2 questions ← Need more
Recommendations:
1. Add 2 questions for ml_platform_tool
2. Add 3 edge case questions
3. Add 2 tool routing tests
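The gap analysis above can also be computed mechanically from per-category counts. A sketch (the target counts are illustrative assumptions, not requirements from this workflow):

```python
# Sketch of the gap analysis above: compare per-category question counts
# against minimum targets and report how many questions to add.
# The targets are illustrative assumptions.

TARGETS = {"core_use_case": 5, "tool_routing": 4, "edge_case": 3, "ambiguous": 2}

def coverage_gaps(current: dict, targets: dict = TARGETS) -> dict:
    """Map each under-covered category to the number of questions needed."""
    return {cat: target - current.get(cat, 0)
            for cat, target in targets.items()
            if current.get(cat, 0) < target}

current = {"core_use_case": 5, "tool_routing": 2, "edge_case": 1}
print(coverage_gaps(current))
# {'tool_routing': 2, 'edge_case': 2, 'ambiguous': 2}
```

Feed it the output of the `GROUP BY category` query above to get a to-do list of questions per category.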
INSERT INTO EVAL_DATASET_<AGENT_NAME> (INPUT_QUERY, GROUND_TRUTH, category, notes)
-- ML platform question (filling coverage gap)
SELECT
  'How many ML models were trained last quarter?',
  OBJECT_CONSTRUCT('ground_truth_output', '47 models were trained in Q4 2025'),
  'core_use_case',
  'New - filling ML tool coverage gap'
UNION ALL
-- Edge case: invalid date
SELECT
  'What was revenue on February 30th, 2025?',
  OBJECT_CONSTRUCT('ground_truth_output', 'February 30th is not a valid date. Please provide a valid date.'),
  'edge_case',
  'Invalid date edge case'
UNION ALL
-- Ambiguous: should ask for clarification
SELECT
  'Show platform usage statistics',
  OBJECT_CONSTRUCT('ground_truth_output', 'I need clarification: Are you asking about ML Platform usage or general Snowflake platform usage?'),
  'ambiguous',
  'Ambiguous - should ask for clarification';
Important: After adding questions, re-register the dataset by passing a new version name to SYSTEM$CREATE_EVALUATION_DATASET.
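Deriving the next version name from the current one, following the `_v1`, `_v2` convention used in this workflow, is a one-liner worth getting right. A sketch (the agent names are illustrative):

```python
import re

# Sketch: derive the next dataset version name from the current one,
# following the _v1, _v2 naming convention used in this workflow.

def next_version(name: str) -> str:
    match = re.search(r"_v(\d+)$", name)
    if not match:
        return name + "_v1"            # no suffix yet: start at v1
    return name[:match.start()] + f"_v{int(match.group(1)) + 1}"

print(next_version("SALES_AGENT_eval_v1"))  # SALES_AGENT_eval_v2
print(next_version("SALES_AGENT_eval"))     # SALES_AGENT_eval_v1
```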
Deliverables:
Do:
Don't:
Do:
Don't:
Minimum targets:
- Version datasets with suffixes (_v1, _v2, etc.)

From adhoc-testing-for-cortex-agent:
To evaluate-cortex-agent:
In optimize-cortex-agent: