Name: Cross Run Consistency
Author: mytechsonamy

Cross Run Consistency

Runs the same test N times in one session, diffs the outputs, and classifies non-determinism by root cause. Complements observability's historical flake tracking with an immediate "does this test agree with itself right now?" answer. Gate contract — P0 scenarios must be strict-consistent (same output on N/N runs), no tolerance fuzzing, no silent averaging. PIPELINE-5 step 3.

mytechsonamy, 0 stars, 2026/04/12

Occupation
Category: Debugging

An L2 Truth-Execution skill. It answers a specific question: "If I run this test right now, five times in a row, will the five runs agree with each other?"

That question is different from "has this test been flaky in the past", which is what observability MCP's ob_track_flaky tool answers. Historical flakiness looks backward across time, at runs that were separated by code changes, environment drift, and other noise. Cross-run consistency looks forward, in one session, against an unchanged codebase — any disagreement is pure non-determinism, because nothing else could have caused it.

Flaky tests that only show up historically usually hide behind timing wobble; flaky tests that show up cross-run are faster to diagnose because the search space is smaller.
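The core check described above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: `collect_runs`, `disagreeing_runs`, and `is_strict_consistent` are hypothetical names, and the sketch assumes each run can be reduced to a single output string that is compared byte-for-byte against the first run.

```python
from typing import Callable, List


def collect_runs(run_once: Callable[[], str], n: int = 5) -> List[str]:
    """Execute the same test n times in one session and keep every output."""
    return [run_once() for _ in range(n)]


def disagreeing_runs(outputs: List[str]) -> List[int]:
    """Return the indices of runs whose output differs from the first run."""
    baseline = outputs[0]
    return [i for i, out in enumerate(outputs) if i > 0 and out != baseline]


def is_strict_consistent(outputs: List[str]) -> bool:
    """Strict consistency: every run matches the first one exactly (N/N)."""
    return not disagreeing_runs(outputs)
```

Here `run_once` stands in for whatever actually executes the test, for example a wrapper around the test runner that returns captured output. Because the codebase does not change between calls, any index returned by `disagreeing_runs` points at non-determinism rather than at a code change.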

When You're Invoked

PIPELINE-5 step 3: pre-release, after the regression suite has produced a clean baseline. A cross-run on the critical-path scenarios before shipping catches the last-mile non-determinism that a single green run can hide.
On demand as /vibeflow:cross-run-consistency <scenario-glob> [--runs N] [--mode strict|tolerant].
From another skill, when a test classified in the baseline needs a fresh, session-local reproduction attempt before it is used as a hard blocker.


Input	Required	Notes
Test scenario(s)	yes	Glob or explicit list. Matches the target's `scenario-set.md` ids, or direct file paths for runner-level tests.
Run count `N`	optional	Default: 5. Range: `[3, 50]`. A `--runs 1` is rejected — you can't check consistency with one observation.
Mode	optional	`strict` or `tolerant`. Default: `strict` for P0 tests, `tolerant` for everything else (the default honors the P0 rule). Explicit mode overrides only apply to non-P0 tests.
Tolerance declaration	optional	From `test-strategy.md → crossRunTolerance`. See `references/tolerance-modes.md` §2 for the shape.
`regression-baseline.json`	optional but preferred	Used to resolve test priorities (P0/P1/…) for the gate rule.
`observability` MCP	optional	When present, the skill reads historical flakiness to cross-reference findings — a test that's flaky historically AND cross-run is a stronger signal than either alone.
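The run-count and mode rules in the table above could be resolved roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes that an explicit mode override on a P0 test is ignored rather than rejected, which the table does not specify.

```python
from typing import Optional

DEFAULT_RUNS = 5
MIN_RUNS, MAX_RUNS = 3, 50
VALID_MODES = ("strict", "tolerant")


def resolve_runs(requested: Optional[int] = None) -> int:
    """Apply the default of 5 and enforce the [3, 50] range."""
    if requested is None:
        return DEFAULT_RUNS
    if not MIN_RUNS <= requested <= MAX_RUNS:
        raise ValueError(f"--runs must be in [{MIN_RUNS}, {MAX_RUNS}], got {requested}")
    return requested


def resolve_mode(priority: str, requested_mode: Optional[str] = None) -> str:
    """P0 tests are always strict; other tests default to tolerant."""
    if requested_mode is not None and requested_mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {requested_mode}")
    if priority == "P0":
        # Assumption: an override on a P0 test is ignored, not rejected.
        return "strict"
    return requested_mode or "tolerant"
```

With these rules, `resolve_mode("P0", "tolerant")` still yields `strict`, while `resolve_mode("P2", "strict")` honors the override.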

Condition	Verdict
Every P0 scored 1.0 in strict mode AND overall ≥ threshold	PASS
Every P0 scored 1.0 AND overall < threshold	NEEDS_REVISION
Any P0 scored < 1.0	BLOCKED

Domain	Non-P0 overall threshold
`financial`	0.98
`healthcare`	0.98
`e-commerce`	0.93
`general`	0.90
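Taken together, the verdict and threshold tables above amount to a small decision function. The sketch below is illustrative only: `gate_verdict` is a hypothetical name, and it assumes the P0 scores passed in were already computed in strict mode and that `overall` is the score the domain threshold applies to.

```python
from typing import Iterable

# Per-domain thresholds, copied from the table above.
THRESHOLDS = {
    "financial": 0.98,
    "healthcare": 0.98,
    "e-commerce": 0.93,
    "general": 0.90,
}


def gate_verdict(p0_scores: Iterable[float], overall: float, domain: str) -> str:
    """Map P0 scores and the overall score to PASS, NEEDS_REVISION, or BLOCKED."""
    threshold = THRESHOLDS[domain]
    # Any P0 below 1.0 blocks outright, regardless of the overall score.
    if any(score < 1.0 for score in p0_scores):
        return "BLOCKED"
    if overall >= threshold:
        return "PASS"
    return "NEEDS_REVISION"
```

The P0 check comes first so that a high overall score can never mask a P0 inconsistency; for example, `gate_verdict([1.0, 0.8], 1.0, "general")` returns `BLOCKED`.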

Cross Run Consistency

When You're Invoked

Input Contract

Algorithm

Step 1 — Resolve mode per test

Step 2 — Capture the first run as baseline

Step 3 — Run the remaining N-1 executions

Step 4 — Diff against the baseline

Step 5 — Classify every inconsistency

Step 6 — Compute the consistency score

Step 7 — Apply the gate

Step 8 — Write outputs

Output Contract

`consistency-report.md`

Gate Contract

Non-Goals

Downstream Dependencies
