Create ShinkaEvolve task scaffolds from a target directory and task description, producing `evaluate.py` and `initial.<ext>` (multi-language). Use when asked to set up new ShinkaEvolve tasks, evaluation harnesses, or baseline programs for ShinkaEvolve.
Create a setup scaffold consisting of an evaluation script and initial solution for an optimization problem given a user's task description. Both ingredients will be used within ShinkaEvolve, a framework combining LLMs with evolutionary algorithms to drive code optimization.
## When to use

Invoke this skill when the user asks to set up a new ShinkaEvolve task, evaluation harness, or baseline program.

## Outputs and guardrails

- `evaluate.py` and `initial.<ext>` exist in the working directory.
- The user's chosen language determines `initial.<ext>` (if omitted, default to Python).
- Never overwrite an existing `evaluate.py` or `initial.<ext>` without consent.
- Write `initial.<ext>` with a clear evolve region (`EVOLVE-BLOCK` markers or language-equivalent comments) and a stable I/O contract.
- Write `evaluate.py`:
  - For Python `initial.py`: call `run_shinka_eval` with `experiment_fn_name`, `get_experiment_kwargs`, `aggregate_metrics_fn`, `num_runs`, and an optional `validate_fn`.
  - For non-Python `initial.<ext>`: run the candidate program directly (usually via `subprocess`) and write `metrics.json` + `correct.json`.

Smoke-test `evaluate.py` before handoff:
```bash
python evaluate.py --program_path initial.<ext> --results_dir /tmp/shinka_eval_smoke
```

Verify that a metrics dict is produced (either from `aggregate_fn` or `metrics.json`) with at least:
- `combined_score` (numeric),
- `public` (dict),
- `private` (dict),
- `extra_data` (dict),
- `text_feedback` (string, can be empty).

Also confirm that `correct.json` exists with `correct` (bool) and `error` (string) fields.
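This contract can be checked with a small helper before wiring up evolution (a sketch; `check_metrics` is a hypothetical name, not part of the shinka package):

```python
def check_metrics(metrics: dict, correct: dict) -> list[str]:
    """Return a list of contract violations (empty list = OK).

    Checks the minimal schema an evaluator must emit: the metrics
    dict keys/types plus the correct.json keys/types.
    """
    problems = []
    score = metrics.get("combined_score")
    if not isinstance(score, (int, float)) or isinstance(score, bool):
        problems.append("combined_score must be numeric")
    for key in ("public", "private", "extra_data"):
        if not isinstance(metrics.get(key), dict):
            problems.append(f"{key} must be a dict")
    if not isinstance(metrics.get("text_feedback"), str):
        problems.append("text_feedback must be a string")
    if not isinstance(correct.get("correct"), bool):
        problems.append("correct must be a bool")
    if not isinstance(correct.get("error"), str):
        problems.append("error must be a string")
    return problems


# A well-formed pair of outputs produces no violations.
problems = check_metrics(
    {"combined_score": 1.0, "public": {}, "private": {},
     "extra_data": {}, "text_feedback": ""},
    {"correct": True, "error": ""},
)
assert problems == []
```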
If the user also wants to run evolution, hand off to the `shinka-run` skill:

- Optionally create `run_evo.py` plus a `shinka.yaml` config with matching language + `init_program_path`.
- Use the `shinka-run` skill to perform the optimization with the agent.

## What is ShinkaEvolve?

A framework developed by SakanaAI that combines LLMs with evolutionary algorithms to propose program mutations, which are then evaluated and archived. The goal is to optimize performance and discover novel scientific insights.
- Repo and documentation: https://github.com/SakanaAI/ShinkaEvolve
- Paper: https://arxiv.org/abs/2212.04180
ShinkaEvolve selects and archives candidates by `combined_score`.

## File overview

| File | Purpose |
|---|---|
| `initial.<ext>` | Starting solution in the chosen language with an evolve region that LLMs mutate |
| `evaluate.py` | Scores candidates and emits metrics/correctness outputs that guide selection |
| `run_evo.py` | (Optional) Launches the evolution loop |
| `shinka.yaml` | (Optional) Config: generations, islands, LLM models, patch types, etc. |
## Installation

Install once before creating/running tasks:
```bash
# Check if shinka is available in the workspace environment
python -c "import shinka"

# If not, install from PyPI
pip install shinka-evolve

# Or with uv
uv pip install shinka-evolve
```
## Choosing a language (`initial.<ext>`)

Shinka supports multiple candidate-program languages. Choose one, then keep the extension, config, and evaluator aligned.
| `evo_config.language` | `initial.<ext>` |
|---|---|
| `python` | `initial.py` |
| `julia` | `initial.jl` |
| `cpp` | `initial.cpp` |
| `cuda` | `initial.cu` |
| `rust` | `initial.rs` |
| `swift` | `initial.swift` |
| `json` / `json5` | `initial.json` |
Rules:
- `evaluate.py` stays the evaluator entrypoint.
- For Python candidates, use `run_shinka_eval` + `experiment_fn_name`.
- For non-Python candidates, run the program via `subprocess` and write `metrics.json` + `correct.json`.
- When creating configs, set `evo_config.language` and a matching `evo_config.init_program_path`.

## `initial.<ext>` (Python example)

```python
import random


# EVOLVE-BLOCK-START
def advanced_algo():
    # Implement the evolving algorithm here.
    return 0.0, ""
# EVOLVE-BLOCK-END


def solve_problem(params):
    return advanced_algo()


def run_experiment(random_seed: int | None = None, **kwargs):
    """Main entrypoint called by the evaluator."""
    if random_seed is not None:
        random.seed(random_seed)
    score, text = solve_problem(kwargs)
    return float(score), text
```
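The seeded entrypoint can be sanity-checked for determinism before handing it to `evaluate.py` (a sketch; `run_experiment` is re-declared here with a random stand-in body so the snippet runs on its own):

```python
import random


def run_experiment(random_seed=None, **kwargs):
    # Stand-in for the entrypoint in initial.py: seed, then score.
    if random_seed is not None:
        random.seed(random_seed)
    return float(random.random()), "ok"


# The same seed must reproduce the same score, so repeated runs
# driven by get_experiment_kwargs are comparable across candidates.
s1, _ = run_experiment(random_seed=42)
s2, _ = run_experiment(random_seed=42)
assert s1 == s2
```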
For non-Python `initial.<ext>`, keep the same idea: a small evolve region plus a deterministic program interface consumed by `evaluate.py`.
## `evaluate.py` (Python `run_shinka_eval` path)

```python
import argparse

import numpy as np
from shinka.core import run_shinka_eval  # required for results storage


def get_kwargs(run_idx: int) -> dict:
    return {"random_seed": int(np.random.randint(0, 1_000_000_000))}


def aggregate_fn(results: list) -> dict:
    scores = [r[0] for r in results]
    texts = [r[1] for r in results if len(r) > 1]
    combined_score = float(np.mean(scores))
    text = texts[0] if texts else ""
    return {
        "combined_score": combined_score,
        "public": {},
        "private": {},
        "extra_data": {},
        "text_feedback": text,
    }


def validate_fn(result):
    # Return (True, None) or (False, "reason")
    return True, None


def main(program_path: str, results_dir: str):
    metrics, correct, err = run_shinka_eval(
        program_path=program_path,
        results_dir=results_dir,
        experiment_fn_name="run_experiment",
        num_runs=3,
        get_experiment_kwargs=get_kwargs,
        aggregate_metrics_fn=aggregate_fn,
        validate_fn=validate_fn,  # optional
    )
    if not correct:
        raise RuntimeError(err or "Evaluation failed")


if __name__ == "__main__":
    # Parse program path & results dir
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    args = parser.parse_args()
    main(program_path=args.program_path, results_dir=args.results_dir)
```
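The aggregator's contract can be exercised in isolation. Below is a stdlib-only mirror of `aggregate_fn` (the original uses `np.mean`), fed with `(score, text)` tuples of the shape `run_experiment` returns:

```python
def aggregate_fn(results: list) -> dict:
    # Mirror of the aggregator above: mean score across runs,
    # first text feedback if any run produced one.
    scores = [r[0] for r in results]
    texts = [r[1] for r in results if len(r) > 1]
    return {
        "combined_score": float(sum(scores) / len(scores)),
        "public": {},
        "private": {},
        "extra_data": {},
        "text_feedback": texts[0] if texts else "",
    }


metrics = aggregate_fn([(1.0, "run A"), (2.0, "run B"), (3.0, "")])
# combined_score is the mean of per-run scores.
assert metrics["combined_score"] == 2.0
assert metrics["text_feedback"] == "run A"
```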
## `evaluate.py` (non-Python `initial.<ext>` path)

```python
import argparse
import json
import os
from pathlib import Path


def main(program_path: str, results_dir: str):
    os.makedirs(results_dir, exist_ok=True)

    # 1) Execute candidate program_path (subprocess / runtime-specific call)
    # 2) Compute task metrics + correctness
    metrics = {
        "combined_score": 0.0,
        "public": {},
        "private": {},
        "extra_data": {},
        "text_feedback": "",
    }
    correct = False
    error = ""

    (Path(results_dir) / "metrics.json").write_text(
        json.dumps(metrics, indent=2), encoding="utf-8"
    )
    (Path(results_dir) / "correct.json").write_text(
        json.dumps({"correct": correct, "error": error}, indent=2), encoding="utf-8"
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    args = parser.parse_args()
    main(program_path=args.program_path, results_dir=args.results_dir)
```
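The "execute the candidate" step can be filled in with a `subprocess` call. One possible convention (an assumption, not a shinka requirement): the candidate prints a JSON object on stdout and exits 0 on success. The demo below substitutes a Python one-liner for a compiled binary so it runs anywhere:

```python
import json
import subprocess
import sys


def run_candidate(cmd: list[str], timeout: float = 60.0):
    """Run a candidate program and parse a JSON object from its stdout.

    Returns (payload, correct, error) in the spirit of the
    metrics.json / correct.json contract above.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None, False, "timeout"
    if proc.returncode != 0:
        return None, False, proc.stderr.strip() or "non-zero exit code"
    try:
        return json.loads(proc.stdout), True, ""
    except json.JSONDecodeError as exc:
        return None, False, f"bad JSON on stdout: {exc}"


# Stand-in candidate: a Python one-liner in place of a compiled program.
payload, ok, err = run_candidate(
    [sys.executable, "-c", 'import json; print(json.dumps({"score": 1.5}))']
)
assert ok and payload["score"] == 1.5
```

The parsed payload would then feed the `metrics` dict before writing `metrics.json` and `correct.json`.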
## `run_evo.py` (async)

See `skills/shinka-setup/scripts/run_evo.py` for an example to edit.
## `shinka.yaml`

See `skills/shinka-setup/scripts/shinka.yaml` for an example to edit.
## Final checklist

- The metrics dict contains the required keys (`combined_score`, `public`, `private`, `extra_data`, `text_feedback`).
- `experiment_fn_name` matches the function name in `initial.py`.
- For non-Python tasks, the evaluator matches the `initial.<ext>` CLI/I/O contract.
- Higher `combined_score` values indicate better performance.