Add a Hugging Face dataset to Marin pipelines. Use when asked to add, register, or inspect a new dataset for training.
This skill guides agents in inspecting Hugging Face dataset schemas using Marin's schema inspection tool. Inspection is the first step toward exposing a dataset for training. After capturing the schema, register an ExecutorStep so the dataset can be downloaded in Marin pipelines. For simple datasets, add the step to experiments/pretraining_datasets/__init__.py (see the fineweb_edu entry). If the dataset is multipart or more complex, create a dedicated file (e.g., experiments/pretraining_datasets/nemotron.py) and add the step there instead. Datasets with Hugging Face–exposed subsets or splits should pattern match on an existing multi-subset file such as nemotron.py, defining separate steps for each subset.
Install dependencies with `uv sync --all-packages` rather than ad-hoc `pip install`. From a synced Marin checkout, invoke the tool directly:

```
uv run lib/marin/tools/get_hf_dataset_schema.py <dataset_name> [options]
```

When the environment is not already provisioned, use the `--with` form to pull in the tool's dependencies on the fly:

```
uv run --with datasets --with pyyaml lib/marin/tools/get_hf_dataset_schema.py ...
```
The schema can also be fetched from Python:

```python
from marin.tools.get_hf_dataset_schema import get_schema

schema = get_schema(dataset_name="wikitext", config_name="wikitext-103-v1")
```
```
$ uv run lib/marin/tools/get_hf_dataset_schema.py roneneldan/TinyStories
```

Pass the `--trust_remote_code` flag when a dataset requires it. If a dataset has multiple configs and none is specified, the tool reports the available options:

```json
{
  "error": "Config name is required.",
  "available_configs": ["config1", "config2", ...]
}
```
The tool returns a JSON object with:
```json
{
  "splits": ["train", "validation", ...],
  "text_field_candidates": ["text", "content", ...],
  "features": {
    "text": "string",
    "label": "int64",
    ...
  },
  "sample_row": {
    "text": "Example content...",
    ...
  }
}
```
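Downstream code typically needs the text field name and the split list from this output. A minimal sketch of consuming the tool's JSON, using a hypothetical parsed response (the shape mirrors the schema above, but the field values are illustrative):

```python
import json

# Hypothetical tool output; real output comes from get_hf_dataset_schema.py.
raw = """
{
  "splits": ["train", "validation"],
  "text_field_candidates": ["text", "content"],
  "features": {"text": "string", "label": "int64"},
  "sample_row": {"text": "Example content..."}
}
"""

schema = json.loads(raw)

if "error" in schema:
    # Multi-config datasets require --config_name; the tool lists the options.
    raise SystemExit(f"pick a config from: {schema['available_configs']}")

# Prefer a field literally named "text"; otherwise take the first candidate.
candidates = schema["text_field_candidates"]
text_field = "text" if "text" in candidates else candidates[0]

print(text_field)        # -> text
print(schema["splits"])  # -> ['train', 'validation']
```

The "prefer `text`, else first candidate" rule is an assumption for illustration; confirm the right field by reading `sample_row` before registering the dataset.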
```
$ uv run lib/marin/tools/get_hf_dataset_schema.py roneneldan/TinyStories
{
  "splits": ["train", "validation"],
  "text_field_candidates": ["text"],
  "features": {"text": "string"},
  "sample_row": {"text": "Once upon a time..."}
}
```

```
$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext
{
  "error": "Config name is required.",
  "available_configs": ["wikitext-103-raw-v1", "wikitext-103-v1", ...]
}
```

```
$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext --config_name wikitext-103-v1
{
  "splits": ["train", "validation", "test"],
  "text_field_candidates": ["text"],
  "features": {"text": "string"},
  "sample_row": {"text": "Article content..."}
}
```

```
$ uv run lib/marin/tools/get_hf_dataset_schema.py c4 --config_name en --trust_remote_code
{
  "splits": ["train", "validation"],
  "text_field_candidates": ["text"],
  "features": {"text": "string", "url": "string", "timestamp": "string"},
  "sample_row": {"text": "Web content..."}
}
```
Once the schema is inspected and the dataset is registered (for example in experiments/pretraining_datasets/__init__.py or in a dedicated file), follow the pattern of existing dataset configs to wire up tokenization.
Tool location: `lib/marin/tools/get_hf_dataset_schema.py`