Multi-source ML dataset discovery. Search HuggingFace Hub, OpenML, GitHub, and paper cross-references for datasets relevant to a research task. Use when asked to "find datasets for", "search ML datasets", or "what datasets exist for".
Use this skill when the user request matches its research workflow scope. Prefer the bundled resources instead of recreating templates or reference material. Keep outputs traceable to project files, citations, scripts, or upstream evidence.
Treat scripts/ as optional helpers. Run them only when their dependencies are available, keep outputs in the project workspace, and explain a manual fallback if execution is blocked.

Search multiple ML dataset sources (HuggingFace Hub, OpenML, GitHub, Semantic Scholar) and return a ranked, deduplicated list of relevant datasets.
Clarify the user's needs (e.g. task, modality, size, and licensing constraints) before searching.
Run the search script with the user's query:

```shell
python3 scripts/search_ml_datasets.py search --query "<query>" --sources huggingface,openml,github,papers --max 30
```
Options:

- `--sources`: Comma-separated list from `huggingface`, `openml`, `github`, `papers`. Default: all four.
- `--max`: Maximum results to return after dedup + ranking. Default: 30.
- `--modality`: Filter by modality (`image`, `text`, `tabular`, `audio`).
- `--workspace`: Output directory. Default: `./datasets/discovery/`

Optionally, also call the HF MCP tool `hub_repo_search` with `repo_types: ["dataset"]` for semantic search to supplement results.
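For reference, the dedup + ranking pass that `--max` is applied after can be sketched roughly as follows. This is a minimal sketch; the `name`/`source`/`score` field names are assumptions for illustration, not the script's actual schema:

```python
# Hypothetical sketch of dedup + ranking across sources.
# Field names (name, source, score) are assumptions, not the script's schema.

def dedup_and_rank(results, max_results=30):
    """Keep the highest-scored entry per dataset name, sorted by score."""
    best = {}
    for r in results:
        key = r["name"].lower()  # case-insensitive dedup key
        if key not in best or r["score"] > best[key]["score"]:
            best[key] = r
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)[:max_results]

results = [
    {"name": "IMDB", "source": "huggingface", "score": 0.92},
    {"name": "imdb", "source": "openml", "score": 0.75},
    {"name": "SST-2", "source": "huggingface", "score": 0.88},
]
ranked = dedup_and_rank(results)
# The duplicate "imdb" collapses into the higher-scored HuggingFace entry.
```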
Show results as a markdown table:
| Name | Source | Downloads | Size | License | Tags | URL |
|---|---|---|---|---|---|---|
Sort by relevance score (highest first).
When the user wants more info on a specific dataset:
```shell
python3 scripts/search_ml_datasets.py detail --dataset-id "huggingface:stanfordnlp/imdb" --workspace ./datasets/discovery/
```

Writes `metadata.json` and `README.md` to `{workspace}/datasets/{source}_{slug}/`.
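The exact `metadata.json` schema is defined by the script; a hypothetical shape, for orientation only (every field name here is an assumption):

```json
{
  "id": "huggingface:stanfordnlp/imdb",
  "source": "huggingface",
  "downloads": 123456,
  "license": "other",
  "tags": ["text-classification", "sentiment"]
}
```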
When the user wants to preview data:
```shell
python3 scripts/search_ml_datasets.py pull --dataset-id "huggingface:stanfordnlp/imdb" --sample-rows 20 --workspace ./datasets/discovery/
```

Writes `sample.jsonl` to `{workspace}/datasets/{source}_{slug}/`.
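`sample.jsonl` is plain JSON Lines, one JSON object per row. A minimal round-trip sketch of that format (the rows here are invented placeholders; real field names depend on the dataset):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows, just to illustrate the JSON Lines shape of sample.jsonl.
demo = [{"text": "great movie", "label": 1}, {"text": "boring", "label": 0}]

with tempfile.TemporaryDirectory() as d:
    sample = Path(d) / "sample.jsonl"
    # One JSON object per line.
    sample.write_text("\n".join(json.dumps(r) for r in demo))
    # Read it back with a plain loop; no extra dependencies needed.
    rows = [json.loads(line) for line in sample.read_text().splitlines()]

print(len(rows))  # → 2
```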
For a full dataset download, confirm with the user first, then use `huggingface-cli download <repo-id> --repo-type dataset` or an equivalent tool for the source.
```
{workspace}/                    # default: ./datasets/discovery/
  search-{YYYY-MM-DD}.json      # search results log
  datasets/
    {source}_{slug}/
      metadata.json             # detailed metadata
      README.md                 # human-readable summary
      sample.jsonl              # sample rows
```
Dependencies:

- `requests` (stdlib-adjacent, universally available)
- `gh` CLI (for the GitHub source only)