Bootstrap a new realtime eval folder inside this cookbook repo by choosing the right harness from examples/evals/realtime_evals, scaffolding prompt/tools/data files, generating a useful README, and validating it with smoke, full eval, and test runs. Use when a user wants to start a new crawl, walk, or run realtime eval in this repository.
Use this skill when the user wants a new realtime eval scaffold under examples/evals/realtime_evals/.
This skill is repo-specific. Do not copy harness code into the generated folder. The generated eval should point at the shared harnesses already in:

- examples/evals/realtime_evals/crawl_harness
- examples/evals/realtime_evals/walk_harness
- examples/evals/realtime_evals/run_harness

Always ask the user for the minimum set needed to choose and scaffold the eval before you create files, run the scaffold script, or author starter data. Do not skip this just because you can infer a default.
Ask for, at minimum: the harness (crawl, walk, or run), the eval name, and any existing system prompt, tools, starter data, or grading criteria.
If the user does not know which harness they want, explain the options briefly and recommend one. See references/harness-selection.md.
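The selection guidance can be sketched as a small heuristic. The function name and boolean parameters below are illustrative, not part of any harness API:

```python
def recommend_harness(multi_turn: bool, needs_replay_audio: bool) -> str:
    """Heuristic mirroring the guidance: run for multi-turn simulation,
    walk when the audio must carry replay-specific properties, else crawl."""
    if multi_turn:
        return "run"
    if needs_replay_audio:
        return "walk"
    return "crawl"  # simplest path; preferred when in doubt

print(recommend_harness(multi_turn=False, needs_replay_audio=False))  # crawl
```

This mirrors the "prefer the simplest harness" default: only escalate past crawl when the scenario genuinely requires it.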
When the user asks for synthetic audio but does not specify a harness, default to crawl text-to-TTS unless they need the generated audio to carry particular noise, telephony artifacts, speaker characteristics, or other replay-specific properties. Use walk for those cases.

Keep the questions concise and grouped into one short batch whenever possible.
If the user only provides user_text or a short task description, still ask the questions above first. If they answer only partially, then infer the remaining low-risk details, call out the assumptions, and make the scaffold easy to revise later.
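One way to sketch the "infer low-risk details and call out the assumptions" behavior — the field names and the single default here are illustrative, not the scaffold script's schema:

```python
def apply_low_risk_defaults(answers: dict) -> tuple[dict, list[str]]:
    """Fill only safe gaps and record every assumption so the
    scaffold stays easy to revise later."""
    # Illustrative default: prefer crawl when nothing points elsewhere.
    defaults = {"harness": "crawl"}
    filled = dict(answers)
    assumptions = []
    for key, value in defaults.items():
        if not filled.get(key):
            filled[key] = value
            assumptions.append(f"assumed {key}={value!r}; confirm with the user")
    return filled, assumptions

filled, assumptions = apply_low_risk_defaults({"name": "refund_policy"})
```

The point of the returned `assumptions` list is that every inferred value is surfaced to the user rather than silently baked into the scaffold.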
Ask the user for the required inputs first.
Pick the harness:

- crawl: single-turn text-to-TTS.
- walk: replay saved audio or generate audio from text rows.
- run: multi-turn simulation with tool mocks and judge criteria.
- When in doubt, prefer crawl over walk.

Normalize the inputs:

- If rows only provide user_text, infer example_id values and leave optional grading fields blank unless you have enough signal to fill them.

Be proactive when data is missing:
- crawl: you should author 3 starter rows covering one happy path and a couple of nearby variants or edge cases.
- walk: you should author 3 source CSV rows and prepare the audio-generation step so the user can create audio immediately.
- run: you should author 2 starter simulations, not 1.

Run the scaffold script:

```bash
python examples/evals/realtime_evals/skills/bootstrap-realtime-eval/scripts/bootstrap_realtime_eval.py --name "<eval_name>" --harness "<crawl|walk|run>"
```

Add flags for prompt, tools, data, graders, and run-specific fields as needed. Read the script help if you need the exact flag names.
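The starter rows for a crawl eval might look like the sketch below. Only example_id and user_text are taken from this skill's own guidance; the row contents are illustrative, and any expected tool columns from the real crawl_harness schema are omitted here:

```python
import csv
import io

# Three illustrative starter rows: one happy path plus two nearby
# variants/edge cases. Match the real crawl_harness dataset schema
# (including expected tool columns) before committing.
rows = [
    {"example_id": "refund_happy_path",
     "user_text": "I'd like a refund for order 1042."},
    {"example_id": "refund_missing_order_id",
     "user_text": "Can I get my money back? I lost my order number."},
    {"example_id": "refund_out_of_window",
     "user_text": "I bought this a year ago; can I still return it?"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["example_id", "user_text"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Writing via csv.DictWriter (rather than string concatenation) keeps quoting correct when user_text contains commas or quotes.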
Review the generated folder.

Enrich the scaffold:

- For crawl and walk, make the CSV realistic and ensure the expected tool columns are present.
- For walk, if the dataset lacks audio_path, use the shared walk_harness/generate_audio.py flow described in the README.
- For run, make sure simulations.csv and the starter sim_*.json file reflect the user’s scenario, tool mocks, and graders.

Validate before returning:
- Run pytest examples/evals/realtime_evals/tests -q.
- Run a smoke eval from examples/evals/realtime_evals/, pointing --assistant-system-prompt-file and --assistant-tools-file at the generated files.

Treat the task as complete only when:

- examples/evals/realtime_evals/<name>_realtime_eval/README.md, system_prompt.txt, and tools.json exist
- pytest examples/evals/realtime_evals/tests -q has been run

When this skill uncovers a reusable workflow or harness constraint that should guide future bootstrap work, add a short note here.
Keep learnings concise and action-oriented:
Only add items that are likely to help future realtime-eval scaffolding in this repo. Remove stale items when they no longer apply.
- gpt-realtime temperature is unsupported -> Do not add a temperature field or CLI flag when scaffolding gpt-realtime evals -> Avoids invalid config and keeps runs aligned with the realtime harness constraints.
- Default to crawl text-to-TTS and reserve walk for replay-specific audio characteristics like noise or telephony artifacts -> Keeps the bootstrap path simpler unless audio realism is the actual target.