Add a Hugging Face dataset to Marin pipelines. Use when asked to add, register, or inspect a new dataset for training.
This skill guides agents in inspecting Hugging Face dataset schemas using Marin's schema inspection tool. Inspection is the first step toward exposing a dataset for training. After capturing the schema, register an ExecutorStep so the dataset can be downloaded in Marin pipelines. For simple datasets, add the step to experiments/pretraining_datasets/__init__.py (see the fineweb_edu entry). If the dataset is multipart or more complex, create a dedicated file (e.g., experiments/pretraining_datasets/nemotron.py) and add the step there instead. Datasets with Hugging Face–exposed subsets or splits should pattern match on an existing multi-subset file such as nemotron.py, defining separate steps for each subset.
Install dependencies with `uv sync --all-packages` rather than ad-hoc `pip install`. From a synced Marin checkout, invoke the tool directly:

```
uv run lib/marin/tools/get_hf_dataset_schema.py <dataset_name> [options]
```

When the environment is not already provisioned, use the `--with` form to pull in the tool's dependencies on the fly:

```
uv run --with datasets --with pyyaml lib/marin/tools/get_hf_dataset_schema.py ...
```
The schema can also be fetched from Python:

```python
from marin.tools.get_hf_dataset_schema import get_schema

schema = get_schema(dataset_name="wikitext", config_name="wikitext-103-v1")
```
```
$ uv run lib/marin/tools/get_hf_dataset_schema.py roneneldan/TinyStories
```

Pass the `--trust_remote_code` flag when a dataset requires it. If a dataset has multiple configs and none is specified, the tool reports the available options:

```json
{
  "error": "Config name is required.",
  "available_configs": ["config1", "config2", ...]
}
```
The tool returns a JSON object with:
```json
{
  "splits": ["train", "validation", ...],
  "text_field_candidates": ["text", "content", ...],
  "features": {
    "text": "string",
    "label": "int64",
    ...
  },
  "sample_row": {
    "text": "Example content...",
    ...
  }
}
```
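Downstream code typically needs the text field name and the split list from this output. A minimal sketch of consuming the tool's JSON, using a hypothetical parsed response (the shape mirrors the schema above, but the field values are illustrative):

```python
import json

# Hypothetical tool output; real output comes from get_hf_dataset_schema.py.
raw = """
{
  "splits": ["train", "validation"],
  "text_field_candidates": ["text", "content"],
  "features": {"text": "string", "label": "int64"},
  "sample_row": {"text": "Example content..."}
}
"""

schema = json.loads(raw)

if "error" in schema:
    # Multi-config datasets require --config_name; the tool lists the options.
    raise SystemExit(f"pick a config from: {schema['available_configs']}")

# Prefer a field literally named "text"; otherwise take the first candidate.
candidates = schema["text_field_candidates"]
text_field = "text" if "text" in candidates else candidates[0]

print(text_field)        # -> text
print(schema["splits"])  # -> ['train', 'validation']
```

The "prefer `text`, else first candidate" rule is an assumption for illustration; confirm the right field by reading `sample_row` before registering the dataset.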
```
$ uv run lib/marin/tools/get_hf_dataset_schema.py roneneldan/TinyStories
{
  "splits": ["train", "validation"],
  "text_field_candidates": ["text"],
  "features": {"text": "string"},
  "sample_row": {"text": "Once upon a time..."}
}
```

```
$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext
{
  "error": "Config name is required.",
  "available_configs": ["wikitext-103-raw-v1", "wikitext-103-v1", ...]
}
```

```
$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext --config_name wikitext-103-v1
{
  "splits": ["train", "validation", "test"],
  "text_field_candidates": ["text"],
  "features": {"text": "string"},
  "sample_row": {"text": "Article content..."}
}
```

```
$ uv run lib/marin/tools/get_hf_dataset_schema.py c4 --config_name en --trust_remote_code
{
  "splits": ["train", "validation"],
  "text_field_candidates": ["text"],
  "features": {"text": "string", "url": "string", "timestamp": "string"},
  "sample_row": {"text": "Web content..."}
}
```
Once the schema is inspected and the dataset is registered (for example in experiments/pretraining_datasets/__init__.py or in a dedicated file), follow the pattern of existing dataset configs to wire up tokenization.
Tool location: `lib/marin/tools/get_hf_dataset_schema.py`