Create an objective.json for goal-driven autonomous iteration. Use when you want to optimize toward a measurable target (performance, accuracy, coverage) rather than implement a specific PRD. Triggers on: create objective, optimize for, improve accuracy, reduce latency, increase coverage, goal-driven.
Create a measurable objective for autonomous goal-driven iteration. Unlike PRD mode (which executes a known plan), Objective mode explores approaches autonomously until a target is reached or diminishing returns are detected.
Output: objective.json, written only after user approval. Important: Do NOT start implementing. Just create the objective.json.
Ask questions to understand the optimization goal. Use lettered options where appropriate.
What kind of objective are you trying to achieve?
A. Performance - Reduce latency, improve throughput, decrease resource usage
B. Accuracy - Improve model accuracy, reduce errors, increase precision
C. Coverage - Increase test coverage, documentation coverage, feature coverage
D. Quality - Reduce bugs, improve code quality metrics, decrease technical debt
E. Other: [please specify your optimization goal]
What component or system are you optimizing?
A. A specific endpoint or API
B. A machine learning model
C. The test suite
D. A specific module or subsystem
E. The entire application
F. Other: [please specify]
What is your current baseline? (This helps set realistic targets)
Please provide:
- Current metric value (e.g., "78% accuracy", "450ms latency")
- How it's measured (e.g., "pytest --cov", "benchmark script")
- Any known issues (e.g., "fails on edge cases", "slow on large inputs")
This question requires a free-form answer to capture specifics.
What is your target? Choose one:
A. Specific target value: [e.g., "90% accuracy", "under 100ms"]
B. Percentage improvement: [e.g., "50% faster", "20% fewer errors"]
C. Best achievable within constraints (let the agent explore)
D. Match a benchmark: [e.g., "match competitor X", "industry standard"]
What constraints must be respected? Select all that apply:
A. Performance bounds (e.g., "must not exceed 200ms latency")
B. Resource limits (e.g., "model under 500MB", "no GPU required")
C. API compatibility (e.g., "cannot change public interfaces")
D. No external dependencies (e.g., "cannot add new libraries")
E. Budget limits (e.g., "no paid services", "under $X")
F. Other: [please specify]
G. No constraints
For each selected constraint, ask for specific values.
Additional questions (Performance objectives):
What is the performance bottleneck?
A. CPU-bound computation
B. I/O operations (disk, network)
C. Memory usage
D. External API calls
E. Unknown - needs profiling first
How do you measure performance?
A. I have an existing benchmark script
B. Use standard tools (e.g., pytest-benchmark, time command)
C. Create a benchmark for me
D. Measure via application logs/metrics
Additional questions (Accuracy objectives):
What type of accuracy problem?
A. Classification accuracy (correct predictions)
B. Regression accuracy (prediction error)
C. Detection accuracy (finding all relevant items)
D. Ranking accuracy (correct ordering)
E. Other: [please specify]
What evaluation data do you have?
A. Labeled validation/test set
B. Historical data with known outcomes
C. A/B test capability
D. Manual review process
E. Need to create evaluation data
Additional questions (Coverage objectives):
What kind of coverage?
A. Code coverage (lines/branches tested)
B. Feature coverage (functionality tested)
C. Documentation coverage
D. API endpoint coverage
E. Other: [please specify]
Current coverage tooling:
A. pytest-cov / coverage.py
B. jest --coverage
C. Other tool: [specify]
D. No coverage tool set up
Additional questions (Quality objectives):
What quality metrics matter most?
A. Bug count / error rate
B. Linting violations
C. Type coverage
D. Cyclomatic complexity
E. Technical debt score
F. Other: [please specify]
Based on the answers, propose how success will be measured. Adapt your proposals based on the objective type from Step 1.
Propose a verification command tailored to the objective type:
## Proposed Verification (Performance)
**Command:** `python scripts/benchmark_performance.py --endpoint [endpoint] --iterations 100`
This command will:
- Run [N] iterations of the target operation
- Measure: latency (p50, p95, p99), throughput (req/sec), memory usage
- Output JSON metrics for tracking
**Does this benchmark script exist?**
A. Yes, I have an existing benchmark at: [path]
B. No, please create a benchmark script for me (Recommended)
C. I'll use a standard tool (pytest-benchmark, hyperfine, wrk)
D. I need to modify an existing script: [path]
## Proposed Verification (Accuracy)
**Command:** `python scripts/evaluate_model.py --dataset validation --output-json`
This command will:
- Load the model and validation dataset
- Run predictions on [N] samples
- Calculate: accuracy, precision, recall, F1-score
- Output JSON metrics for tracking
**Does this evaluation script exist?**
A. Yes, I have an existing evaluation at: [path]
B. No, please create an evaluation script for me (Recommended)
C. I'll use a standard tool (sklearn metrics, pytest with assertions)
D. I need to modify an existing script: [path]
## Proposed Verification (Coverage)
**Command:** `pytest --cov=[source_dir] --cov-report=json --cov-report=term`
This command will:
- Run the test suite with coverage tracking
- Measure: line coverage, branch coverage, missing lines
- Output coverage.json for metric extraction
**Is your coverage tooling configured?**
A. Yes, pytest-cov is set up and working
B. No, please help me set up coverage (Recommended)
C. I use a different coverage tool: [specify]
D. I need to configure coverage for specific modules
## Proposed Verification (Quality)
**Command:** `./scripts/quality_metrics.sh`
This script will aggregate:
- Linting violations (ruff, eslint)
- Type coverage (pyright, tsc)
- Complexity metrics (radon, complexity-report)
**Do you have quality metric tooling?**
A. Yes, I have existing quality scripts at: [path]
B. No, please create a quality metrics script for me (Recommended)
C. I want to use specific tools: [list tools]
D. I only care about: [specific metric]
If user selects "create for me" (option B), acknowledge this:
**Benchmark Creation Plan:**
I'll create `scripts/benchmark_[type].py` that:
1. [Specific action for this objective type]
2. Outputs JSON with metrics: [list metrics]
3. Can be run with: `python scripts/benchmark_[type].py`
This will be created as part of the first iteration.
Proceed with benchmark creation? (yes/adjust/no)
Based on the objective type and target value from Step 1:
**Success Criteria:** `[metric] [operator] [value]`
Type-specific examples:
- `latency_p99 < 100` (milliseconds)
- `accuracy >= 0.90` (0-1 scale)
- `line_coverage >= 80` (percentage)
- `linting_violations < 10` (count)

Is this target realistic based on your current state?
A. Yes, this target is achievable
B. Too aggressive - adjust to: [suggest lower target]
C. Too conservative - adjust to: [suggest higher target]
D. I want a different metric entirely: [specify]
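Concretely, a success criteria string like `latency_p99 < 100` can be checked against the benchmark's JSON output in a few lines. This is an illustrative sketch, not Angainor's actual evaluator; `criteria_met` and the supported operator set are assumptions:

```python
import operator
import re

# Comparison operators allowed in a successCriteria expression (assumed set).
OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
       "<": operator.lt, ">": operator.gt}

def criteria_met(expression: str, metrics: dict) -> bool:
    """Evaluate an expression like 'latency_p99 < 100' against benchmark metrics."""
    match = re.fullmatch(r"\s*(\w+)\s*(<=|>=|==|<|>)\s*([\d.]+)\s*", expression)
    if not match:
        raise ValueError(f"Unparseable success criteria: {expression!r}")
    metric, op, value = match.groups()
    if metric not in metrics:
        raise KeyError(f"Metric {metric!r} missing from benchmark output")
    return OPS[op](float(metrics[metric]), float(value))

print(criteria_met("latency_p99 < 100", {"latency_p99": 87.4}))  # True
```

Keeping the expression to a single `metric operator value` triple is what makes this check trivial to automate each iteration.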
Propose metrics relevant to the objective type. The primary metric must match the success criteria.
**Recommended Metrics to Track (Performance):**
1. `latency_p99` (primary) - 99th percentile response time
2. `latency_p50` - Median response time
3. `throughput` - Requests per second
4. `memory_mb` - Peak memory usage
Add, remove, or reorder? (List changes or 'OK')
**Recommended Metrics to Track (Accuracy):**
1. `accuracy` (primary) - Overall correct predictions
2. `precision` - True positives / predicted positives
3. `recall` - True positives / actual positives
4. `f1_score` - Harmonic mean of precision and recall
Add, remove, or reorder? (List changes or 'OK')
**Recommended Metrics to Track (Coverage):**
1. `line_coverage` (primary) - Percentage of lines covered
2. `branch_coverage` - Percentage of branches covered
3. `missing_lines` - Count of uncovered lines
4. `files_covered` - Number of files with any coverage
Add, remove, or reorder? (List changes or 'OK')
**Recommended Metrics to Track (Quality):**
1. `linting_violations` (primary) - Total lint errors
2. `type_coverage` - Percentage of typed code
3. `cyclomatic_complexity` - Average function complexity
4. `tech_debt_hours` - Estimated remediation time
Add, remove, or reorder? (List changes or 'OK')
After proposing the verification approach, present a summary the user can adjust:
## Verification Proposal Summary
| Aspect | Proposed Value | Adjust? |
|--------|---------------|---------|
| Command | `[command]` | A. Keep / B. Change to: ___ |
| Create benchmark? | [Yes/No] | A. Keep / B. Change |
| Success criteria | `[expression]` | A. Keep / B. Change to: ___ |
| Metrics to track | [list] | A. Keep / B. Add: ___ / C. Remove: ___ |
Enter adjustments (e.g., "B. Change command to pytest") or 'OK' to proceed to Step 3.
Propose when the loop should stop. Adapt defaults based on objective type from Step 1.
Choose the default based on objective complexity:
| Objective Type | Default maxIterations | Rationale |
|---|---|---|
| Performance (A) | 12 iterations | Profiling + incremental optimizations, medium exploration |
| Accuracy (B) | 15 iterations | ML tuning needs more exploration, hyperparameter space |
| Coverage (C) | 10 iterations | Test writing is fairly predictable, lower variance |
| Quality (D) | 10 iterations | Refactoring is incremental, bounded scope |
| Other/Complex | 15 iterations | Unknown territory, allow more exploration |
Present the proposal with type-specific defaults filled in:
## Proposed Stopping Conditions
### 1. Maximum Iterations
**Proposed:** [N] iterations (based on [objective type] objectives)
This is your iteration budget. The loop will stop after this many attempts,
regardless of whether the objective is achieved.
Adjust max iterations?
A. Keep [N] iterations (Recommended for [type] objectives)
B. Fewer: 8 iterations (faster, less exploration)
C. More: 20 iterations (thorough, more exploration)
D. Custom: [enter number]
### 2. Plateau Detection
Stops the loop when progress stagnates - no need to keep trying if we're stuck.
**Proposed settings:**
- **Metric to watch:** `[primary metric from success criteria]`
- **Minimum improvement:** 0.01 (1% change required between iterations)
- **Window size:** 3 iterations (check improvement over last 3 attempts)
**How it works:** If `[metric]` improves by less than 1% across 3 consecutive
iterations, the loop stops with a "plateau" status.
Adjust plateau detection?
A. Keep defaults (Recommended)
B. More sensitive: minImprovement=0.02, windowSize=2 (stop sooner if stuck)
C. Less sensitive: minImprovement=0.005, windowSize=5 (allow more experimentation)
D. Custom: minImprovement=[value], windowSize=[value]
E. Disable plateau detection (only stop on success or max iterations)
### 3. Consecutive Failures
Safety stop for experiments that are completely broken (not just not improving).
**Proposed:** 3 consecutive failures
A "failure" means the iteration couldn't produce valid metrics at all
(crashed, timed out, broke constraints).
Adjust consecutive failures limit?
A. Keep 3 failures (Recommended)
B. Stricter: 2 failures (fail fast)
C. Lenient: 5 failures (allow more retries for flaky systems)
D. Custom: [enter number]
After getting user responses, present a summary:
## Stopping Conditions Summary
| Parameter | Value | Meaning |
|-----------|-------|---------|
| Max Iterations | [N] | Stop after [N] attempts maximum |
| Plateau Metric | `[metric]` | Watch this metric for stagnation |
| Min Improvement | [value] | Require at least [value*100]% improvement |
| Window Size | [N] | Check improvement over last [N] iterations |
| Max Failures | [N] | Stop after [N] consecutive broken iterations |
**Estimated behavior:**
- Best case: Objective achieved in ~[N/3] iterations
- Typical case: Plateau detected around iteration [N*0.6]
- Worst case: Loop runs all [N] iterations
Confirm these settings? (yes/adjust/explain more)
Always include this explanation with the initial proposal:
**What these mean (in plain language):**
📊 **Max Iterations** = Your budget
Think of each iteration as one attempt to improve. This is the maximum
number of attempts before giving up. More iterations = more chances to
find improvements, but also more time/resources.
📈 **Plateau Detection** = "Are we stuck?"
If the metric barely changes for several attempts in a row, we're probably
stuck and should stop. The window size is how many attempts to look back,
and min improvement is how much change counts as "real progress."
Example: With windowSize=3 and minImprovement=0.01, if accuracy goes from
0.85 → 0.851 → 0.852 over 3 iterations, that's a plateau (only 0.002 total
improvement, less than 0.01).
❌ **Consecutive Failures** = "Is something broken?"
Different from plateau. A failure means the experiment completely broke
(couldn't even measure the metric). 3 failures in a row suggests
something fundamental is wrong, not just slow progress.
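The three stopping conditions above can be sketched in a few lines. This is an illustrative model of the logic, not Angainor's implementation; `should_stop` and its signature are hypothetical:

```python
def should_stop(history, max_iterations=12, min_improvement=0.01,
                window_size=3, max_failures=3):
    """Return a stop reason, or None to keep iterating.

    history holds the primary metric value from each completed iteration;
    None marks a failed iteration (no valid metrics produced).
    """
    if len(history) >= max_iterations:
        return "max_iterations"
    # Consecutive failures: every one of the last max_failures runs broke.
    if len(history) >= max_failures and all(v is None for v in history[-max_failures:]):
        return "consecutive_failures"
    # Plateau: total movement across the last window_size measurements
    # is smaller than min_improvement.
    window = [v for v in history[-window_size:] if v is not None]
    if len(window) == window_size and abs(window[-1] - window[0]) < min_improvement:
        return "plateau"
    return None

print(should_stop([0.85, 0.851, 0.852]))  # "plateau"
```

Note how the worked example from the plateau explanation (0.85 → 0.851 → 0.852, only +0.002 over three runs) triggers the plateau stop.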
If the user provides adjustments (e.g., "B, C, A"), apply them and re-present the summary for confirmation.
If the user asks "explain more", provide the plain language explanation above.
CRITICAL: Present the complete plan and wait for user approval BEFORE writing any files. Do NOT create objective.json until the user explicitly approves.
## Complete Plan Summary
**Objective:** [description from Step 1]
**Target:** [success criteria from Step 2]
**Verification:** `[command from Step 2]`
**Constraints:**
- [constraint 1]
- [constraint 2]
- (or "None specified" if no constraints)
**Stopping Conditions:**
- Max iterations: [N]
- Plateau: [metric] improving < [threshold] over [window] iterations
- Max consecutive failures: [N]
**Metrics to track:** [list from Step 2]
---
Ready to create objective.json? (yes/no/adjust)
On "adjust" or similar ("change", "modify", "fix"):
What would you like to adjust?
A. Objective description or constraints
B. Verification command or success criteria
C. Stopping conditions (iterations, plateau, failures)
D. Metrics to track
Enter your choice, or describe what you want to change:
After the user specifies adjustments, update the affected sections and re-present the Complete Plan Summary for approval.
On "no" or similar ("cancel", "stop", "never mind"):
Objective creation cancelled. No files were created.
You can restart with /objective when you're ready.
On explicit approval - ONLY clear affirmatives (e.g., "yes", "approve", "create it") trigger file creation.
Any other response should be treated as a request for adjustment or clarification.
Only after receiving explicit approval, write the file to project root:
```json
{
  "objective": {
    "description": "[from answers]",
    "context": "[current state and known issues]",
    "constraints": [
      "[constraint 1]",
      "[constraint 2]"
    ]
  },
  "verification": {
    "command": "[verification command]",
    "successCriteria": "[metric expression]",
    "metricsToTrack": ["metric1", "metric2", "metric3"]
  },
  "stopping": {
    "maxIterations": [N],
    "plateauThreshold": {
      "metric": "[primary metric]",
      "minImprovement": 0.01,
      "windowSize": 3
    },
    "maxConsecutiveFailures": 3
  },
  "status": {
    "state": "pending",
    "iterations": 0,
    "bestMetrics": {},
    "metricHistory": []
  }
}
```
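Before handing the file to the loop, a quick structural check can catch a missing key or a mistyped metric name. A minimal sketch; the required-key table mirrors the schema above, and `validate_objective` is a hypothetical helper, not part of Angainor:

```python
# Required keys per section of objective.json (mirrors the schema above).
REQUIRED = {
    "objective": ("description", "context", "constraints"),
    "verification": ("command", "successCriteria", "metricsToTrack"),
    "stopping": ("maxIterations", "plateauThreshold", "maxConsecutiveFailures"),
    "status": ("state", "iterations", "bestMetrics", "metricHistory"),
}

def validate_objective(data: dict) -> list:
    """Return a list of problems; an empty list means the structure looks valid."""
    problems = []
    for section, keys in REQUIRED.items():
        if section not in data:
            problems.append(f"missing section: {section}")
            continue
        problems += [f"missing key: {section}.{k}" for k in keys if k not in data[section]]
    # The plateau metric should be one of the tracked metrics, or plateau
    # detection will never see a value to compare.
    plateau_metric = data.get("stopping", {}).get("plateauThreshold", {}).get("metric")
    tracked = data.get("verification", {}).get("metricsToTrack", [])
    if plateau_metric and plateau_metric not in tracked:
        problems.append(f"plateau metric {plateau_metric!r} not in metricsToTrack")
    return problems
```

For example, `validate_objective(json.loads(Path("objective.json").read_text()))` returns an empty list when the file matches the schema.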
Print confirmation and next steps in this EXACT format:
✅ objective.json created successfully!
Next steps:
1. Review the generated objective.json
2. Ensure the verification command works: [command]
3. To start: ./angainor.sh --objective
The agent will iterate autonomously, trying different approaches until:
- ✓ Success criteria are met
- ⏸ Progress plateaus (no improvement in consecutive iterations)
- ⏱ Max iterations reached
- ✗ The objective is determined impossible
Note: The ./angainor.sh --objective command uses the maxIterations from the objective.json by default. Users can override with: ./angainor.sh --objective 20
When the user selects "Create a benchmark for me" (option B) in Step 2, you MUST create a benchmark script. This section provides templates for common objective types.
IMPORTANT: Create benchmark scripts in the scripts/ directory with descriptive names. Scripts must:
- Start with a shebang line (#!/usr/bin/env python3)
- Output JSON containing every metric listed in metricsToTrack

For performance objectives, create scripts/benchmark_latency.py:
```python
#!/usr/bin/env python3
"""
API Latency Benchmark - Measures endpoint performance.
Outputs JSON metrics for Angainor objective tracking.

Usage: python scripts/benchmark_latency.py --url URL [--iterations N]
"""
import argparse
import json
import statistics
import sys
import time
import urllib.error
import urllib.request


def measure_latency(url: str, iterations: int = 100) -> dict:
    """Measure latency statistics for a URL."""
    latencies = []
    errors = 0
    for i in range(iterations):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                _ = response.read()
            latency_ms = (time.perf_counter() - start) * 1000
            latencies.append(latency_ms)
        except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e:
            errors += 1
            sys.stderr.write(f"Request {i+1} failed: {e}\n")
    if not latencies:
        return {"error": "All requests failed", "success_rate": 0}
    latencies.sort()
    n = len(latencies)
    return {
        "latency_p50": round(latencies[int(n * 0.50)], 2),
        "latency_p95": round(latencies[int(n * 0.95)], 2),
        "latency_p99": round(latencies[int(n * 0.99)], 2),
        "latency_avg": round(statistics.mean(latencies), 2),
        "throughput": round(n / (sum(latencies) / 1000), 2),
        "success_rate": round((iterations - errors) / iterations, 3),
        "iterations": iterations
    }


def main():
    parser = argparse.ArgumentParser(description="Measure API endpoint latency")
    parser.add_argument("--url", required=True, help="URL to benchmark")
    parser.add_argument("--iterations", type=int, default=100, help="Number of requests")
    args = parser.parse_args()
    metrics = measure_latency(args.url, args.iterations)
    if "error" in metrics:
        print(json.dumps(metrics), file=sys.stderr)
        sys.exit(1)
    print(json.dumps(metrics))


if __name__ == "__main__":
    main()
```
Verification command for objective.json:
python scripts/benchmark_latency.py --url http://localhost:3000/api/endpoint --iterations 100
For accuracy objectives, create scripts/benchmark_accuracy.py:
```python
#!/usr/bin/env python3
"""
Model Accuracy Benchmark - Evaluates model predictions against ground truth.
Outputs JSON metrics for Angainor objective tracking.

Usage: python scripts/benchmark_accuracy.py --data DATA_PATH
"""
import argparse
import json
import sys
from pathlib import Path


def calculate_metrics(predictions: list, ground_truth: list) -> dict:
    """Calculate classification metrics."""
    if len(predictions) != len(ground_truth):
        raise ValueError("Predictions and ground truth must have same length")
    n = len(predictions)
    if n == 0:
        return {"error": "No samples to evaluate"}
    # Overall accuracy
    correct = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    accuracy = correct / n
    # For binary classification, calculate precision/recall.
    # Assumes the positive class is 1 or True.
    tp = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predictions, ground_truth) if p == 0 and g == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    return {
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1_score": round(f1_score, 4),
        "samples": n,
        "correct": correct
    }


def main():
    parser = argparse.ArgumentParser(description="Evaluate model accuracy")
    parser.add_argument("--data", required=True, help="Path to JSON with predictions and ground_truth")
    parser.add_argument("--predictions-key", default="predictions", help="Key for predictions in JSON")
    parser.add_argument("--ground-truth-key", default="ground_truth", help="Key for ground truth in JSON")
    args = parser.parse_args()
    try:
        data = json.loads(Path(args.data).read_text())
        predictions = data.get(args.predictions_key, [])
        ground_truth = data.get(args.ground_truth_key, [])
        metrics = calculate_metrics(predictions, ground_truth)
        if "error" in metrics:
            print(json.dumps(metrics), file=sys.stderr)
            sys.exit(1)
        print(json.dumps(metrics))
    except FileNotFoundError:
        print(json.dumps({"error": f"Data file not found: {args.data}"}), file=sys.stderr)
        sys.exit(1)
    except json.JSONDecodeError as e:
        print(json.dumps({"error": f"Invalid JSON: {e}"}), file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()
```
Verification command for objective.json:
python scripts/benchmark_accuracy.py --data data/validation_results.json
For coverage objectives, create scripts/benchmark_coverage.py:
```python
#!/usr/bin/env python3
"""
Test Coverage Benchmark - Parses coverage.json and outputs metrics.
Outputs JSON metrics for Angainor objective tracking.

Usage: python scripts/benchmark_coverage.py [--coverage-file PATH]

Requires: Run pytest with coverage first:
    pytest --cov=src --cov-report=json
"""
import argparse
import json
import sys
from pathlib import Path


def parse_coverage(coverage_path: str = "coverage.json") -> dict:
    """Parse coverage.json and extract metrics."""
    path = Path(coverage_path)
    if not path.exists():
        return {"error": f"Coverage file not found: {coverage_path}. Run pytest --cov first."}
    data = json.loads(path.read_text())
    totals = data.get("totals", {})
    files = data.get("files", {})
    # Line coverage totals
    line_coverage = totals.get("percent_covered", 0)
    covered_lines = totals.get("covered_lines", 0)
    missing_lines = totals.get("missing_lines", 0)
    total_lines = covered_lines + missing_lines
    # Branch coverage (totals expose covered_branches/num_branches only when
    # the suite runs with branch coverage enabled, e.g. --cov-branch)
    num_branches = totals.get("num_branches", 0)
    branch_coverage = (totals.get("covered_branches", 0) / num_branches * 100) if num_branches else None
    # Files with any coverage
    files_with_coverage = sum(1 for f in files.values() if f.get("summary", {}).get("percent_covered", 0) > 0)
    total_files = len(files)
    metrics = {
        "line_coverage": round(line_coverage, 2),
        "covered_lines": covered_lines,
        "missing_lines": missing_lines,
        "total_lines": total_lines,
        "files_covered": files_with_coverage,
        "total_files": total_files
    }
    if branch_coverage is not None:
        metrics["branch_coverage"] = round(branch_coverage, 2)
    return metrics


def main():
    parser = argparse.ArgumentParser(description="Parse test coverage metrics")
    parser.add_argument("--coverage-file", default="coverage.json", help="Path to coverage.json")
    args = parser.parse_args()
    metrics = parse_coverage(args.coverage_file)
    if "error" in metrics:
        print(json.dumps(metrics), file=sys.stderr)
        sys.exit(1)
    print(json.dumps(metrics))


if __name__ == "__main__":
    main()
```
Verification command for objective.json:
pytest --cov=src --cov-report=json && python scripts/benchmark_coverage.py
When creating a benchmark script:
Determine the benchmark type based on objective type from Step 1:
- Performance: scripts/benchmark_latency.py
- Accuracy: scripts/benchmark_accuracy.py
- Coverage: scripts/benchmark_coverage.py

Use the template above as a starting point, customizing:
- Metric names and calculations so the output matches metricsToTrack in objective.json

Create the script in the scripts/ directory:
Write the script to: scripts/benchmark_[type].py
Make it executable: chmod +x scripts/benchmark_[type].py
Verify it works before finalizing objective.json:
Run: python scripts/benchmark_[type].py --help
Test: python scripts/benchmark_[type].py [args]
Update objective.json with the correct verification command.
After creating the benchmark:
✅ Benchmark script created: scripts/benchmark_[type].py
The script outputs JSON metrics including: [list metrics]
Test the script:
python scripts/benchmark_[type].py --help
python scripts/benchmark_[type].py [example args]
Before saving objective.json:
- Confirm the benchmark script exists in scripts/ and runs successfully
- Confirm the verification command outputs every metric listed in metricsToTrack