Name: Hermes Atropos Environments
Author: NousResearch

Hermes Agent Atropos Environments

Guide for building RL environments in the hermes-agent repo that integrate with the Atropos training framework.

Architecture Overview

Atropos BaseEnv (atroposlib/envs/base.py)
    └── HermesAgentBaseEnv (environments/hermes_base_env.py)
            ├── Handles agent loop orchestration
            ├── Handles tool resolution per group
            ├── Handles ToolContext for reward verification
            └── YOUR ENVIRONMENT (environments/your_env.py)
                    Only implements: setup, get_next_item, format_prompt,
                                    compute_reward, evaluate, wandb_log

Hermes environments are special because they run a multi-turn agent loop with tool calling — not just single-turn completions. The base env handles the loop; you implement the task and scoring.
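To make the multi-turn loop concrete, here is a toy, self-contained sketch. It is not the code in `environments/agent_loop.py`: `TOOLS`, `fake_model`, and `run_agent_loop` are hypothetical names, and the message shape assumes the OpenAI-style chat format (`role`, `content`, `tool_calls`) that the reward code in this guide reads.

```python
import json

# Hypothetical tool registry: tool name -> plain Python callable.
TOOLS = {"add": lambda a, b: a + b}


def fake_model(messages):
    """Stand-in for an LLM: request one tool call, then answer with its result."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        call = {"id": "call_0",
                "function": {"name": "add", "arguments": json.dumps({"a": 2, "b": 3})}}
        return {"role": "assistant", "content": "", "tool_calls": [call]}
    return {"role": "assistant", "content": f"The answer is {tool_msgs[-1]['content']}"}


def run_agent_loop(messages, model, max_turns=4):
    """Call the model repeatedly, executing tool calls until it gives a final answer."""
    messages = list(messages)
    for _ in range(max_turns):
        reply = model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            break  # No tool requested: treat this reply as the final answer.
        for call in calls:
            fn = call["function"]
            output = TOOLS[fn["name"]](**json.loads(fn["arguments"]))
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": str(output)})
    return messages


transcript = run_agent_loop([{"role": "user", "content": "What is 2 + 3?"}], fake_model)
print(transcript[-1]["content"])  # final assistant message
```

A single-turn completion would stop after the first model call. Here the transcript grows with tool results until the model stops requesting tools or `max_turns` is reached, and that whole transcript is what the reward function later scores.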

Key flags by provider:

Provider	`--openai.server_type`	`--openai.health_check`	`--openai.api_key`
OpenRouter	`openai`	`false`	`$OPENROUTER_API_KEY`
vLLM (self-hosted)	`vllm`	(default)	(not needed)
Other OpenAI-compatible	`openai`	`false`	As needed
Local Atropos	(default)	(default)	(not needed)
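The table can also be kept as data, so that a launcher script assembles the `--openai.*` flags consistently. The sketch below is only an illustration: the dictionary keys and `build_openai_args` are hypothetical names, and the flag values are copied from the table rather than taken from the repo.

```python
# Flag values copied from the table above; None means "leave the flag at its default".
PROVIDER_FLAGS = {
    "openrouter": {"server_type": "openai", "health_check": "false",
                   "api_key": "$OPENROUTER_API_KEY"},  # shell placeholder, not a real key
    "vllm": {"server_type": "vllm", "health_check": None, "api_key": None},
    "openai-compatible": {"server_type": "openai", "health_check": "false",
                          "api_key": None},  # pass api_key explicitly if the server needs one
    "local-atropos": {"server_type": None, "health_check": None, "api_key": None},
}


def build_openai_args(provider, api_key=None):
    """Return the --openai.* arguments for a provider, skipping defaulted flags."""
    flags = dict(PROVIDER_FLAGS[provider])
    if api_key is not None:
        flags["api_key"] = api_key
    args = []
    for name, value in flags.items():
        if value is not None:
            args += [f"--openai.{name}", value]
    return args


print(" ".join(build_openai_args("openrouter")))
print(build_openai_args("vllm"))
```

Note that `$OPENROUTER_API_KEY` is only expanded when the resulting string is run through a shell; a script that calls `subprocess.run` with an argument list would need to read the key from `os.environ` instead.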

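import random

async def setup(self) -> None:
    """Called once at startup. Load datasets, initialize state."""
    # Try HuggingFace first, fall back to built-in samples
    try:
        from datasets import load_dataset
        ds = load_dataset("your/dataset", split="test")
        self._items = [...]
    except Exception:
        # Copy so that shuffling below leaves the module-level constant untouched
        self._items = list(BUILTIN_SAMPLES)

    # Always split into train/eval. The eval split is capped at half the data
    # so that a small fallback set still leaves items for training.
    random.shuffle(self._items)
    eval_size = min(max(20, int(len(self._items) * 0.1)), len(self._items) // 2)
    self._eval_items = self._items[:eval_size]
    self._items = self._items[eval_size:]
    self._index = 0  # Cursor used by get_next_item()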
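The split logic is easier to test when pulled out of `setup()` into a pure function. The following is a minimal sketch under stated assumptions: `split_train_eval` is a hypothetical helper, and capping the eval split at half of the data is an illustrative choice, not something the repo prescribes.

```python
import random


def split_train_eval(items, eval_fraction=0.1, min_eval=20, seed=0):
    """Shuffle a copy of items and return (train, eval) with a non-empty train set."""
    if len(items) < 2:
        raise ValueError("need at least two items to make a train/eval split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    eval_size = max(min_eval, int(len(shuffled) * eval_fraction))
    eval_size = min(eval_size, len(shuffled) // 2)  # cap so training data remains
    return shuffled[eval_size:], shuffled[:eval_size]


train, held_out = split_train_eval(list(range(30)))
print(len(train), len(held_out))

small_train, small_eval = split_train_eval(list(range(5)))
print(len(small_train), len(small_eval))
```

Without the cap, a dataset of 20 items or fewer would be consumed entirely by the eval split, and `get_next_item()` would then divide by zero when taking the index modulo an empty training list.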

async def get_next_item(self) -> dict:
    """Return next item, cycling through dataset."""
    item = self._items[self._index % len(self._items)]
    self._index += 1
    return item

def format_prompt(self, item: dict) -> str:
    """Convert a dataset item into the user-facing prompt."""
    return f"Research this question: {item['question']}"

async def compute_reward(self, item, result: AgentResult, ctx: ToolContext) -> float:
    # Extract final response (last assistant message with content)
    final_response = ""
    tools_used = []
    for msg in reversed(result.messages):
        if msg.get("role") == "assistant" and msg.get("content") and not final_response:
            final_response = msg["content"]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
                name = fn.get("name", "")
                if name:
                    tools_used.append(name)

    # Score using LLM judge, heuristic, or ToolContext verification
    correctness = await self._llm_judge(item, final_response)
    return correctness
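The loop above collects `tools_used` but scores only correctness. A multi-signal reward could combine both, as in this sketch. The weights, the `multi_signal_reward` helper, and the `web_search` tool name are assumptions made for illustration; the correctness value would come from whatever judge or heuristic the environment uses.

```python
def extract_signals(messages):
    """Return (final_response, tools_used) from OpenAI-style chat messages."""
    final_response, tools_used = "", []
    for msg in reversed(messages):
        if msg.get("role") != "assistant":
            continue
        if msg.get("content") and not final_response:
            final_response = msg["content"]  # last assistant message with text
        for tc in msg.get("tool_calls") or []:
            name = tc.get("function", {}).get("name", "") if isinstance(tc, dict) else ""
            if name:
                tools_used.append(name)
    return final_response, tools_used


def multi_signal_reward(correctness, tools_used, final_response,
                        w_correct=0.8, w_tools=0.1, w_answered=0.1):
    """Weighted sum of three signals, clamped to [0, 1]. Weights are illustrative."""
    used_tool = 1.0 if tools_used else 0.0
    answered = 1.0 if final_response.strip() else 0.0
    score = w_correct * correctness + w_tools * used_tool + w_answered * answered
    return max(0.0, min(1.0, score))


messages = [
    {"role": "user", "content": "Who wrote Dune?"},
    {"role": "assistant", "content": "",
     "tool_calls": [{"function": {"name": "web_search", "arguments": "{}"}}]},
    {"role": "tool", "content": "Frank Herbert"},
    {"role": "assistant", "content": "Frank Herbert wrote Dune."},
]
final, tools = extract_signals(messages)
print(final, tools, multi_signal_reward(1.0, tools, final))
```

Keeping most of the weight on correctness means the auxiliary signals shape behavior without letting the policy earn a high reward from tool use alone.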

# Binary verification (inside compute_reward): run tests in the agent's sandbox
result = ctx.terminal("pytest /workspace/test.py")
return 1.0 if result["exit_code"] == 0 else 0.0
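The binary check can be unit-tested without a sandbox by passing in a stub. In this sketch `FakeToolContext` is a hypothetical stand-in that mimics only the `exit_code` key used above; the `output` key and the choice to score sandbox errors as 0.0 are assumptions.

```python
class FakeToolContext:
    """Hypothetical stand-in for ToolContext; only mimics terminal()'s return shape."""

    def __init__(self, exit_code):
        self._exit_code = exit_code

    def terminal(self, command):
        # "output" is an assumed field; the guide only shows "exit_code".
        return {"exit_code": self._exit_code, "output": f"ran: {command}"}


def binary_reward(ctx, command="pytest /workspace/test.py"):
    """Return 1.0 when the command exits with status 0, else 0.0."""
    try:
        result = ctx.terminal(command)
    except Exception:
        return 0.0  # Treat sandbox failures as a failed verification.
    return 1.0 if result.get("exit_code") == 0 else 0.0


print(binary_reward(FakeToolContext(0)), binary_reward(FakeToolContext(1)))
```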

async def evaluate(self, *args, **kwargs) -> None:
    import time, uuid
    from environments.agent_loop import HermesAgentLoop
    from environments.tool_context import ToolContext

    start_time = time.time()
    tools, valid_names = self._resolve_tools_for_group()
    samples = []

    for item in self._eval_items[:self.config.eval_size]:
        task_id = str(uuid.uuid4())
        messages = []
        if self.config.system_prompt:
            messages.append({"role": "system", "content": self.config.system_prompt})
        messages.append({"role": "user", "content": self.format_prompt(item)})

        agent = HermesAgentLoop(
            server=self.server,
            tool_schemas=tools,
            valid_tool_names=valid_names,
            max_turns=self.config.max_agent_turns,
            task_id=task_id,
            temperature=0.0,  # Deterministic for eval
            max_tokens=self.config.max_token_length,
            extra_body=self.config.extra_body,
        )
        result = await agent.run(messages)

        ctx = ToolContext(task_id)
        try:
            reward = await self.compute_reward(item, result, ctx)
        finally:
            ctx.cleanup()

        samples.append({"prompt": ..., "response": ..., "reward": reward})

    eval_metrics = {"eval/mean_reward": ...}
    await self.evaluate_log(metrics=eval_metrics, samples=samples,
                            start_time=start_time, end_time=time.time())
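The metrics dictionary is left elided above. One plausible way to fill it from the collected samples is sketched below; `summarize_eval` and the `eval/num_samples` key are hypothetical, and only `eval/mean_reward` appears in the original.

```python
def summarize_eval(samples):
    """Aggregate per-sample rewards into a small metrics dict."""
    rewards = [float(s["reward"]) for s in samples]
    if not rewards:
        # Avoid dividing by zero when the eval split is empty.
        return {"eval/mean_reward": 0.0, "eval/num_samples": 0}
    return {
        "eval/mean_reward": sum(rewards) / len(rewards),
        "eval/num_samples": len(rewards),
    }


print(summarize_eval([{"reward": 1.0}, {"reward": 0.0}]))
```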

async def wandb_log(self, wandb_metrics=None):
    if wandb_metrics is None:
        wandb_metrics = {}
    if self._reward_buffer:
        n = len(self._reward_buffer)
        wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n
        self._reward_buffer.clear()
    await super().wandb_log(wandb_metrics)  # MUST call super

# SERVE — Full training loop (connects to Atropos API server)
python environments/my_env.py serve --openai.base_url http://localhost:8000/v1

# PROCESS — Offline data generation (saves JSONL)
python environments/my_env.py process --env.total_steps 10 --env.group_size 1 \
    --env.use_wandb false --env.data_path_to_save_groups output.jsonl \
    --openai.base_url "<USER_BASE_URL>" \
    --openai.model_name "<USER_MODEL>" \
    --openai.server_type <USER_SERVER_TYPE> --openai.health_check false

# EVALUATE — Standalone eval (runs setup + evaluate only)
python environments/my_env.py evaluate --env.eval_size 20 \
    --env.data_dir_to_save_evals /tmp/eval_results \
    --openai.base_url "<USER_BASE_URL>" \
    --openai.model_name "<USER_MODEL>" \
    --openai.server_type <USER_SERVER_TYPE> --openai.health_check false
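After a `process` run, a quick sanity check is to load the saved JSONL and look at what each record contains. The reader below is a generic sketch: it assumes only that the file holds one JSON object per line, and the field name in the demo is a placeholder, not the real output schema.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records


# Demo with a throwaway file; the field names are placeholders, not the real schema.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write('{"example_field": 1}\n\n{"example_field": 2}\n')
demo_records = read_jsonl(tmp.name)
os.unlink(tmp.name)
print(len(demo_records), sorted(demo_records[0]))
```

For a real run, point `read_jsonl` at the path given to `--env.data_path_to_save_groups` and print the keys of the first record to see the actual structure.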

class MyEnv(HermesAgentBaseEnv):
    name = "my-env"
    env_config_cls = MyEnvConfig

    @classmethod
    def config_init(cls): ...          # Default server + env config
    async def setup(self): ...         # Load dataset + train/eval split
    async def get_next_item(self): ... # Cycle through training items
    def format_prompt(self, item): ... # Item → user message string
    async def compute_reward(self, item, result, ctx): ...  # Score rollout
    async def evaluate(self, *args, **kwargs): ...  # Full agent loop eval
    async def wandb_log(self, wandb_metrics=None): ...  # Custom metrics + super()

if __name__ == "__main__":
    MyEnv.cli()
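A small check can confirm that a new environment class defines every method in the skeleton. This is a sketch under assumptions: `missing_methods` and `IncompleteEnv` are hypothetical, and the check looks only at the class's own body, so methods inherited from an intermediate subclass would be reported as missing.

```python
REQUIRED_METHODS = ("setup", "get_next_item", "format_prompt",
                    "compute_reward", "evaluate", "wandb_log")


def missing_methods(env_cls):
    """List required methods that env_cls does not define in its own class body."""
    return [name for name in REQUIRED_METHODS if name not in vars(env_cls)]


class IncompleteEnv:  # Hypothetical example class, not a real environment.
    async def setup(self): ...
    def format_prompt(self, item): ...


print(missing_methods(IncompleteEnv))
```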

File Locations

`environments/hermes_base_env.py`	Base class with agent loop + tool resolution
`environments/agent_loop.py`	`HermesAgentLoop` + `AgentResult` dataclass
`environments/tool_context.py`	`ToolContext` for reward verification
`environments/tool_call_parsers.py`	Phase 2 tool call parsers (hermes, mistral, etc.)
`environments/your_env.py`	Your environment implementation

Inference Setup — Ask the User First

Key flags by provider:

Required Methods

1. `setup()` — Load dataset and initialize state

2. `get_next_item()` — Return next training item

3. `format_prompt(item)` — Convert item to user message

4. `compute_reward(item, result, ctx)` — Score the rollout

5. `evaluate()` — Periodic evaluation with full agent loop

6. `wandb_log()` — Custom metrics logging

Config Class

config_init() — Default Configuration

Three CLI Modes

Common Pitfalls

Reward Function Patterns

LLM Judge (for open-ended tasks)

Binary Verification (for code/terminal tasks)

Multi-Signal (combine multiple indicators)

Testing Your Environment

Minimum Implementation Checklist
