Name: Sdd Experiment
Author: JNZader

Search skills.../

Sdd Experiment | Skills Pool

mem_get_observation(id)

mem_save(
  title: "sdd/{change-name}/experiment-report",
  topic_key: "sdd/{change-name}/experiment-report",
  type: "architecture",
  project: "{project}",
  content: "{your full experiment report}"
)

1. Verify clean git state:
   git status --porcelain
   -> MUST return empty
   -> If dirty: ABORT. Tell user to commit or stash first.

2. Record baseline commit:
   BASELINE_COMMIT=$(git rev-parse HEAD)
   -> This is your safety net for ALL experiments.

3. Run baseline tests:
   {test_command}
   -> Capture: pass_count, fail_count, duration
   -> If tests already fail: WARN user, record as baseline (experiments must not make it WORSE)

4. Run baseline benchmark (if configured):
   {benchmark_command}
   -> Capture: relevant metrics
   -> If no benchmark configured: skip, use test-only scoring

Detect test runner from (in priority order):
  1. Experiment config test_command (if explicit)
  2. openspec/config.yaml -> rules.apply.test_command
  3. package.json -> scripts.test (npm test / pnpm test)
  4. pyproject.toml / pytest.ini -> pytest
  5. Makefile -> make test
  6. Fallback: ABORT -- cannot experiment without measurable tests

FOR iteration IN 1..max_iterations:

  3a. HYPOTHESIZE
  ├── Review: specs, design, current code state, previous experiment results
  ├── Formulate hypothesis:
  │   "Changing {what} in {where} will improve {metric} because {why}"
  ├── Hypothesis MUST be specific and testable
  ├── Do NOT repeat a hypothesis that was already tried and reverted
  └── If no more meaningful hypotheses: STOP early, go to Step 4

  3b. SNAPSHOT
  ├── Record current state: git diff --stat (log what exists now)
  └── All safety is via BASELINE_COMMIT -- no stash needed

  3c. TRY
  ├── Implement the hypothesized change
  ├── Keep changes minimal and focused on the hypothesis
  └── Do NOT make unrelated changes

  3d. MEASURE
  ├── Run test command: {test_command}
  │   -> Capture: pass_count, fail_count, duration
  ├── Run benchmark (if configured): {benchmark_command}
  │   -> Capture: metrics
  └── Compute confidence score (see Scoring section)

  3e. DECIDE
  ├── IF confidence >= threshold:
  │   ├── KEEP the change
  │   ├── git add -A (stage changes as new baseline for next experiment)
  │   ├── Record: {iteration, hypothesis, KEPT, confidence, measurements}
  │   └── Update baseline metrics for next comparison
  │
  └── IF confidence < threshold:
      ├── REVERT: git checkout -- . && git clean -fd
      ├── Record: {iteration, hypothesis, REVERTED, confidence, measurements}
      └── Baseline metrics unchanged

  3f. LOG
  └── Append experiment result to running log

1. Generate structured experiment report (see Report Format below)
2. Persist report to engram (or filesystem per mode)
3. Return summary to orchestrator

Score range: 0.0 to 1.0

IF any previously-passing test now fails:
  confidence = 0.0  (HARD REJECT -- no test regressions allowed)

ELIF all tests pass:
  IF benchmark configured AND benchmark improved (> 5% gain):
    confidence = 1.0
  ELIF benchmark configured AND benchmark same (within 5%):
    confidence = 0.8
  ELIF benchmark configured AND benchmark regressed:
    confidence = 0.3
  ELIF no benchmark configured:
    IF new tests added AND all pass:
      confidence = 0.85
    ELSE:
      confidence = 0.8

ELIF test count increased (new tests) AND all original tests pass:
  confidence = 0.7

Sdd Experiment

Purpose

What You Receive

Execution and Persistence Contract

Sdd Experiment

Purpose

What You Receive

Execution and Persistence Contract

Experiment Loop Protocol

Step 1: Pre-Flight Checks

Step 2: Detect Test Runner

Step 3: Experiment Loop

Step 4: Post-Loop Report

Confidence Scoring

Configuration

Automation Audit Ops

Github Qa Labels

Jupyter Notebook

Tidb Integrationtest Recorder

Quality Nonconformance

Hugging Face Trackio