Name: Segway Bench
Author: 200815147

SegwayBench Benchmark Skill

SegwayBench measures how well LLMs perform as the brain of an OpenClaw agent operating Segway delivery robots. It evaluates agents across real-world robot operation scenarios, including querying areas and stations, creating and managing delivery tasks, controlling robot boxes, and handling errors.

Prerequisites

Python 3.10+
uv package manager
OpenClaw instance (this agent)
Segway API credentials configured via segway_auth

Quick Start

cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_area_query,task_05_guidance_create

# Run with safety mode override (force all tasks to use mock)
uv run benchmark.py --model anthropic/claude-sonnet-4 --safety-mode mock_required

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload
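
For scripted runs, the commands above can also be assembled programmatically. The following is a minimal sketch, not part of the skill itself: `build_command` is a hypothetical helper name, and it uses only the flags shown in this section.

```python
from typing import List, Optional


def build_command(
    model: str,
    suite: Optional[str] = None,
    safety_mode: Optional[str] = None,
    upload: bool = True,
) -> List[str]:
    """Assemble the argv for one benchmark run (hypothetical helper)."""
    cmd = ["uv", "run", "benchmark.py", "--model", model]
    if suite is not None:
        cmd += ["--suite", suite]
    if safety_mode is not None:
        cmd += ["--safety-mode", safety_mode]
    if not upload:
        cmd.append("--no-upload")
    return cmd


if __name__ == "__main__":
    # Only print the command here. To execute it, pass the list to
    # subprocess.run(cmd, check=True) from inside the skill directory.
    cmd = build_command(
        "anthropic/claude-sonnet-4", suite="automated-only", upload=False
    )
    print(" ".join(cmd))
```

Returning the command as a list rather than a single string avoids shell quoting issues when a suite value contains commas or a model name contains slashes.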

Available Task Categories

Category	Description
`area_query`	Area and station lookup queries
`robot_query`	Robot list and status queries
`task_create`	Delivery task creation (guidance, pickup/delivery)
`task_manage`	Task cancellation, priority changes, status queries
`box_control`	Robot box open/close and verification
`multi_step`	Multi-step workflows combining queries and actions
`error_handling`	Error scenarios (missing IDs, invalid parameters)

Available Tasks

Task	Category	Description
`task_00_sanity`	Basic	Verify agent works
`task_01_area_query`	area_query	Query available areas
`task_02_station_query`	area_query	Query stations in an area
`task_03_robot_list`	robot_query	List all robots
`task_04_robot_status`	robot_query	Query robot status
`task_05_guidance_create`	task_create	Create guidance delivery task
`task_06_task_cancel`	task_manage	Cancel a delivery task
`task_07_box_open`	box_control	Open robot box
`task_08_multi_query_create`	multi_step	Query then create task
`task_09_error_missing_id`	error_handling	Handle missing ID error
`task_10_error_invalid_area`	error_handling	Handle invalid area error
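
The task IDs in this table are the values accepted by `--suite` as a comma-separated list. The sketch below shows one way such a value could be expanded and validated. It is an illustration under stated assumptions, not the logic in `benchmark.py`: `TASKS`, `resolve_suite`, and `tasks_in_category` are hypothetical names, and because this document does not say which tasks count as automated, that set is supplied by the caller.

```python
from typing import Dict, Iterable, List

# Task IDs and categories copied from the table above.
TASKS: Dict[str, str] = {
    "task_00_sanity": "Basic",
    "task_01_area_query": "area_query",
    "task_02_station_query": "area_query",
    "task_03_robot_list": "robot_query",
    "task_04_robot_status": "robot_query",
    "task_05_guidance_create": "task_create",
    "task_06_task_cancel": "task_manage",
    "task_07_box_open": "box_control",
    "task_08_multi_query_create": "multi_step",
    "task_09_error_missing_id": "error_handling",
    "task_10_error_invalid_area": "error_handling",
}


def resolve_suite(suite: str, automated: Iterable[str] = ()) -> List[str]:
    """Expand a --suite value into an ordered list of task IDs.

    `automated` is an assumption: a caller-supplied collection of task IDs
    that are graded automatically.
    """
    if suite == "all":
        return list(TASKS)
    if suite == "automated-only":
        allowed = set(automated)
        return [task for task in TASKS if task in allowed]
    requested = [part.strip() for part in suite.split(",") if part.strip()]
    unknown = [task for task in requested if task not in TASKS]
    if unknown:
        raise ValueError(f"unknown task IDs: {', '.join(unknown)}")
    return requested


def tasks_in_category(category: str) -> List[str]:
    """Return the task IDs listed under one category, in table order."""
    return [task for task, cat in TASKS.items() if cat == category]
```

Rejecting unknown IDs up front means a typo in a long comma-separated list fails immediately instead of silently running a smaller suite.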

Command Line Options

Option	Description
`--model`	Model identifier (e.g., `anthropic/claude-sonnet-4`)
`--suite`	`all`, `automated-only`, or comma-separated task IDs
`--safety-mode`	Global safety level override: `read_only`, `mock_required`, or `live_allowed`
`--output-dir`	Results directory (default: `results/`)
`--timeout-multiplier`	Scale task timeouts for slower models
`--runs`	Number of runs per task for averaging
`--no-upload`	Skip uploading to leaderboard
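
To make the option table concrete, here is a minimal sketch of a parser that accepts the same flags, written with Python's standard `argparse` module. It is not the parser in `benchmark.py`. Apart from the documented `results/` default, every default, type, and `required` setting below is an assumption made for illustration, and `build_parser` is a hypothetical name.

```python
import argparse

SAFETY_MODES = ["read_only", "mock_required", "live_allowed"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="benchmark.py")
    # Assumption: a model must always be given.
    parser.add_argument("--model", required=True,
                        help="Model identifier, e.g. anthropic/claude-sonnet-4")
    # Assumption: omitting --suite runs every task.
    parser.add_argument("--suite", default="all",
                        help="all, automated-only, or comma-separated task IDs")
    # Assumption: no override is applied unless the flag is given.
    parser.add_argument("--safety-mode", choices=SAFETY_MODES, default=None,
                        help="Global safety level override")
    # Documented default.
    parser.add_argument("--output-dir", default="results/",
                        help="Results directory")
    # Assumption: a multiplier of 1.0 leaves task timeouts unchanged.
    parser.add_argument("--timeout-multiplier", type=float, default=1.0,
                        help="Scale task timeouts for slower models")
    # Assumption: one run per task unless more are requested.
    parser.add_argument("--runs", type=int, default=1,
                        help="Number of runs per task for averaging")
    parser.add_argument("--no-upload", action="store_true",
                        help="Skip uploading to leaderboard")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model", "anthropic/claude-sonnet-4", "--safety-mode", "mock_required"]
    )
    print(args)
```

Restricting `--safety-mode` with `choices` makes `argparse` reject any value other than the three listed modes before a task starts.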
