Run or continue model benchmarks, collect measured results, and refresh README/docs benchmark sections from generated artifacts. Use when benchmark tables in model docs need to be created, updated, or corrected.
Use this skill to update benchmark sections in model documentation from real benchmark outputs.
This skill focuses on running or continuing benchmarks, collecting measured results under `results/`, and refreshing benchmark tables from those artifacts.

It does not own sample image export. Use `model-sample-image-export` for that.
Always prefer:

- `tools/experimental/benchmarking/benchmark.py` with an appropriate config file.
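Assuming the stock script accepts a `--config` flag (check its `--help`; both the flag name and the config path below are illustrative, not verified), the invocation can be sketched as:

```python
import subprocess  # only needed if you actually launch the run

# Hypothetical invocation of the stock benchmark entry point.
# The --config flag and the config file path are assumptions.
cmd = [
    "python",
    "tools/experimental/benchmarking/benchmark.py",
    "--config", "configs/my_model_benchmark.yaml",  # hypothetical config
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment inside the repo to run
```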
If the stock benchmark path is insufficient for a specific model, keep any custom outputs under `results/` as well.

Only publish benchmark values when they come from actual artifacts, for example:

- `results/<model>_benchmark.csv`
- files under `runs/` or `results/`

Never infer missing values.
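A minimal sketch of pulling measured values out of a results CSV such as `results/<model>_benchmark.csv` (the column names here are assumptions; match them to the actual artifact):

```python
import csv
import io

# Stand-in for the contents of results/<model>_benchmark.csv;
# the column names are assumptions, not a verified schema.
sample = io.StringIO(
    "category,image_AUROC,pixel_AUROC\n"
    "bottle,0.994,0.981\n"
    "cable,0.962,0.957\n"
)

rows = list(csv.DictReader(sample))
# Keep only values that were actually measured; never infer missing ones.
measured = {r["category"]: float(r["image_AUROC"]) for r in rows if r["image_AUROC"]}
print(measured)
```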
When refreshing benchmark tables, these are the common sections to refresh:
- `### Image-Level AUC`
- `### Pixel-Level AUC`
- `### Image F1 Score`
- `### Pixel F1 Score`

If a README only contains placeholders, replace only the rows supported by measured results.
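The "replace only supported rows" rule can be sketched like this; the table layout and the `TBD` placeholder convention are assumptions about what the README contains:

```python
# Hypothetical README table under a "### Image-Level AUC" heading.
table = [
    "| Category | AUC  |",
    "|----------|------|",
    "| bottle   | TBD  |",
    "| cable    | TBD  |",
]
measured = {"bottle": 0.994}  # only bottle has an artifact-backed value

updated = []
for line in table:
    cells = [c.strip() for c in line.strip("|").split("|")]
    if cells and cells[0] in measured:
        updated.append(f"| {cells[0]:<8} | {measured[cells[0]]:.3f} |")
    else:
        updated.append(line)  # unmeasured rows keep their placeholder
print("\n".join(updated))
```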
If the README benchmark state changes, update the matching docs page under:
- `docs/source/markdown/guides/reference/models/image/<model>.md`
- `docs/source/markdown/guides/reference/models/video/<model>.md`

The docs page may stay shorter than the README, but it must not contradict it.
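A rough consistency check between the README and the docs page might look like this; the extraction is deliberately naive and the two-column row format is an assumption:

```python
import re

# Stand-ins for the README and docs-page benchmark sections.
readme = "| bottle | 0.994 |\n| cable | 0.962 |"
docs_page = "| bottle | 0.994 |"  # docs may be shorter than the README

def extract(text):
    """Map row label -> value for simple two-column markdown rows."""
    pat = re.compile(r"\|\s*(\w+)\s*\|\s*([0-9.]+)\s*\|")
    return {m.group(1): float(m.group(2)) for m in pat.finditer(text)}

readme_vals, docs_vals = extract(readme), extract(docs_page)
# Docs may omit rows, but any row it does keep must match the README.
conflicts = {k for k, v in docs_vals.items() if readme_vals.get(k) != v}
print(sorted(conflicts))
```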
Before finishing:

- confirm every published value traces back to a measured artifact
- confirm the README and the matching docs page do not contradict each other