Automates ML experiments and the scientific process. Use when the user asks to run experiments, generate hypotheses, interpret results, analyze MLflow metrics, or manage a research workflow. Triggers on "run experiment", "analyze results", "next hypothesis", "research session", "interpret these metrics", "what should I try next".
This skill provides the scientific loop workflow for automated ML experimentation.
All state lives in .research-assistant/ at project root:
```
.research-assistant/
├── config.yaml            # MLflow location, toolkits, NotebookLM notebooks
├── framework-context.md   # Project-level analysis
├── toolkit-context/       # Per-toolkit analysis (if toolkits configured)
│   └── {toolkit}.md
├── experiment-state.md    # Current hypotheses, results
├── research-log.md        # Append-only decisions
└── proposals/             # Code change proposals
```
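As a rough illustration, `config.yaml` might look like the following — a minimal sketch only; the key names and values here are assumptions for illustration, not a documented schema:

```yaml
# Illustrative sketch — actual keys come from the bootstrap Q&A
mlflow:
  tracking_uri: http://localhost:5000
  experiment_name: my-research
toolkits:
  - name: my-toolkit
    path: ../my-toolkit
notebooklm:
  notebooks:
    - "Evolutionary Computation Papers"
```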
When invoked on a new project (no .research-assistant/ exists):
1. Discover the framework
2. Ask the user for configuration
3. Initialize state:
   - `.research-assistant/` directory
   - `config.yaml` with user inputs
   - `framework-context.md` with discoveries
   - `experiment-state.md` and `research-log.md`

Each iteration follows this structure:
1. `hypothesis_init = generate_hypothesis(context, user_steering?)` — context is drawn from `framework-context.md` and `experiment-state.md`
2. `literature = query_notebooklm(context, hypothesis_init)`
3. `hypothesis_final = refine_hypothesis(literature, hypothesis_init)`
4. Execute experiments and update `experiment-state.md` with run IDs
5. **STOP HERE** — do not interpret results from terminal output. Execution outputs are for monitoring only, not analysis.
See references/result-interpretation.md for the complete workflow.
MANDATORY FIRST STEP: Query MLflow before any analysis.
```python
import mlflow

# REQUIRED: fetch data from MLflow first
runs_df = mlflow.search_runs(
    experiment_ids=[...],
    filter_string="tags.hypothesis_id = 'H1'",
)

# Only THEN perform analysis
results = aggregate_metrics(runs_df)
interpretation = analyze(results, hypothesis, context)
```
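`aggregate_metrics` above is not an MLflow API — it stands for whatever summary step the analysis needs. A minimal sketch, assuming `runs_df` has the shape returned by `mlflow.search_runs()` (metric columns prefixed with `metrics.`); the function name and summary statistics are illustrative:

```python
import pandas as pd

def aggregate_metrics(runs_df: pd.DataFrame) -> pd.DataFrame:
    """Summarize every 'metrics.*' column across the fetched runs."""
    metric_cols = [c for c in runs_df.columns if c.startswith("metrics.")]
    return runs_df[metric_cols].agg(["mean", "std", "min", "max"])

# Stand-in DataFrame shaped like mlflow.search_runs() output
runs_df = pd.DataFrame({
    "run_id": ["a", "b", "c"],
    "metrics.best_fitness": [0.91, 0.88, 0.95],
})
summary = aggregate_metrics(runs_df)
print(summary.loc["mean", "metrics.best_fitness"])
```

Working on the DataFrame (rather than eyeballing terminal output) keeps the interpretation tied to the tracked metrics.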
- Query metrics with `mlflow.search_runs()` — see the reference file for patterns.
- Append the full record to `research-log.md`.

DO NOT:

- Use the `querying-mlflow-metrics` skill — that is for GenAI trace metrics (tokens, latency), not ML experiment metrics like `best_fitness`.

Finally, update the context for the next iteration:

`context_{i+1} = update_context(results, hypothesis, interpretation)`

Persist the updated context in `experiment-state.md`.
Load these as needed:
| Skill | When | NOT for |
|---|---|---|
| `querying-mlflow-metrics` | GenAI/LLM trace metrics (tokens, latency) | ML experiment metrics (`best_fitness`, `mean_fitness`) |
| `retrieving-mlflow-traces` | Debugging execution traces | — |
| `notebooklm` | Literature grounding | — |

For ML experiment metrics, use `mlflow.search_runs()` directly — see `references/result-interpretation.md`.
For in-depth guidance on each phase, see:
| Reference | Contents |
|---|---|
| references/hypothesis-generation.md | Prompts, quality criteria, hypothesis categories |
| references/notebooklm-integration.md | Query formulation, citation extraction, follow-ups |
| references/experiment-execution.md | MLflow tagging, error handling, principles |
| references/result-interpretation.md | Statistical analysis, verdict criteria, templates |
For additional capabilities, see:

| Reference | Contents |
|---|---|
| references/project-bootstrap.md | Toolkit linking, two-level analysis, proposal routing |
| references/batch-execution.md | Autonomous batch runs, progress tracking, failure handling |
| references/code-proposals.md | Proposing framework changes, linking to evidence |
| references/status-reporting.md | Status checks, session summaries, state inspection |
| references/git-integration.md | Change detection, reproducibility, commit linking |