Import datasets from HuggingFace and convert them to Coval test sets. Use when the user wants to create test cases from a HuggingFace dataset or repository.
Import $ARGUMENTS from HuggingFace and convert it into Coval test sets with properly structured test cases.
Coval is an AI evaluation platform for testing voice and conversational AI agents. It runs simulations against AI agents and measures performance with configurable metrics.
| Concept | Description |
|---|---|
| Test Set | A collection of test cases, grouped by category or evaluation purpose |
| Test Case | A single evaluation scenario with input (prompt) and optional metadata |
| Persona | High-level user character (system prompt) - separate from test cases |
| Agent | The AI system being evaluated |
Key distinction: personas are configured separately in Coval — this import creates test sets and test cases, not personas.
Base URL: https://api.coval.dev/v1
Fetch the OpenAPI spec before making API calls:
```
# List specs (no auth)
GET https://api.coval.dev/v1/openapi

# Fetch specific spec
GET https://api.coval.dev/v1/openapi/{spec_name}
```
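The two endpoints above can be wrapped in a small helper. A minimal sketch — the URL shapes come from the docs above, but the spec name `test-sets` is a hypothetical example, and the actual fetch is left commented out so you can plug in your preferred HTTP client:

```python
# Build Coval OpenAPI endpoint URLs (URL shapes from the docs above).
BASE_URL = "https://api.coval.dev/v1"

def openapi_url(spec_name=None):
    """Return the spec-list URL, or a specific spec's URL if named."""
    if spec_name is None:
        return f"{BASE_URL}/openapi"
    return f"{BASE_URL}/openapi/{spec_name}"

# Example fetch (requires network; uncomment to run):
# import json, urllib.request
# with urllib.request.urlopen(openapi_url()) as resp:
#     specs = json.load(resp)
```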
If $ARGUMENTS is provided, navigate to it. Otherwise ask:

What is the HuggingFace repository, space, or dataset you want to import?

Then inspect the dataset and report its structure (fields, splits, approximate size) to the user.
Ask these questions to map HuggingFace data to Coval format:
Q1: Input Field
Which field contains the question/prompt for the test case `input`?
Q2: Categorization
How should test cases be organized into test sets?
- By existing category field
- Single test set
- Custom logic
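The first two Q2 options can be sketched as one grouping function. A minimal sketch, assuming rows are plain dicts and that the sample field names (`question`, `category`) are illustrative:

```python
from collections import defaultdict

def group_into_test_sets(rows, category_field=None, default_name="imported"):
    """Group dataset rows into test sets.

    category_field set  -> one test set per distinct category value
    category_field None -> a single test set named default_name
    """
    sets = defaultdict(list)
    for row in rows:
        key = row.get(category_field, default_name) if category_field else default_name
        sets[str(key)].append(row)
    return dict(sets)

# Illustrative rows (field names are assumptions, not a real dataset schema)
rows = [
    {"question": "2+2?", "category": "math"},
    {"question": "Capital of France?", "category": "geography"},
]
```

Custom logic (the third option) would replace the `key` expression with whatever rule the user specifies.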
Q3: Metadata
Which fields should be preserved in the `metadata` JSON? (Recommended: preserve original IDs like `question_id`.)
Q4: Multi-turn (if applicable)
How to handle multi-turn conversations?
- First turn only
- Concatenate turns
- Separate test cases per turn
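The three multi-turn strategies above can be sketched as a single function. A minimal sketch, assuming a conversation is already extracted as a list of user-turn strings:

```python
def flatten_turns(turns, mode="first"):
    """Reduce a multi-turn conversation to a list of test-case inputs.

    mode="first"    -> first turn only (one test case)
    mode="concat"   -> all turns joined into one prompt
    mode="per_turn" -> one test case per turn
    """
    if mode == "first":
        return [turns[0]]
    if mode == "concat":
        return ["\n".join(turns)]
    if mode == "per_turn":
        return list(turns)
    raise ValueError(f"unknown mode: {mode}")
```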
Create Coval-compatible CSVs:
```csv
input,metadata
"Your question here","{""question_id"": ""123"", ""source"": ""mt-bench""}"
```
Requirements:
- `input` column MUST be first
- `metadata` must be a valid JSON string
- Naming: `{source}_{category}.csv`
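The requirements above (column order, JSON-encoded metadata, CSV quote escaping) can be satisfied with the standard library. A minimal sketch, assuming each test case is a dict with `input` and an optional `metadata` dict:

```python
import csv
import io
import json

def to_coval_csv(cases):
    """Serialize test cases to a Coval-compatible CSV string.

    `input` is always the first column; `metadata` is serialized as a
    JSON string, which csv.writer quotes and escapes automatically.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["input", "metadata"])
    for case in cases:
        writer.writerow([case["input"], json.dumps(case.get("metadata", {}))])
    return buf.getvalue()
```

Letting `csv.writer` handle quoting avoids hand-building the doubled `""` escapes shown in the example row.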
Manual: Upload CSVs via Coval dashboard test sets page.
API: Fetch OpenAPI spec and use test set endpoints programmatically.
| Dataset | Description |
|---|---|
| `cais/mmlu` | 15k+ multiple-choice questions across 57 subjects (STEM, humanities, law) |
| `nyu-mll/glue` | Sentence-level tasks: sentiment, entailment, linguistic acceptability |
| `tau/commonsense_qa` | Reasoning tests for everyday world knowledge |
| `Rowan/hellaswag` | Common-sense inference and sentence completion |
| Dataset | Description |
|---|---|
| `openai/gsm8k` | ~8k grade-school math word problems (multi-step arithmetic) |
| `ucinlp/drop` | Reading comprehension with discrete operations |
| `lukaemon/bbh` | BIG-Bench Hard: a challenging reasoning subset |
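For a dataset like `openai/gsm8k`, the mapping to a Coval test case is mostly field renaming. A minimal sketch — the `question`/`answer` field names follow the public gsm8k schema but should be verified against the actual split, and the sample row is made up for illustration:

```python
import json

def gsm8k_row_to_case(row, index):
    """Map one gsm8k-style row to a Coval test case dict.

    The question becomes the `input`; the reference answer and the
    original row index are preserved in `metadata` (see Q3 above).
    """
    return {
        "input": row["question"],
        "metadata": json.dumps({
            "source": "gsm8k",
            "row_index": index,
            "reference_answer": row["answer"],
        }),
    }

# Illustrative placeholder row, not real gsm8k data
sample = {"question": "A farmer has 12 apples and gives away 5. How many remain?",
          "answer": "7"}
```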