Interpret output from the security red team evaluators in scripts/foundry_evals.py (trust boundaries, red team resilience, prompt vulnerability). Parses eval run JSON, flags regressions vs. prior runs, and recommends whether a config change shipped a real improvement or a regression. WHEN: "eval results", "evaluator output", "did the eval pass", "trust boundary score", "compare evals", user pastes JSON from foundry_evals.py, after Step 7 of Deploy.ps1 -Workload foundry.
You read outputs from the security evaluators shipped in commit 20aea89
(trust boundaries, red team resilience, prompt vulnerability) and tell the
user whether the latest run represents an improvement, a regression, or
noise — and which source files to change if regressions are real.
foundry_evals.py).Deploy.ps1 -Workload foundry.scripts/foundry_evals.py — ground truth for evaluator definitions, scoring thresholds, and the dataset schema.config.json → workloads.foundry.evaluations — which evaluators run and their pass/fail thresholds.logs/AIAgentSec_*.log — recent deploy log for the Step 7 output.logs/eval_*.json if persisted — prior runs for trend comparison.infra/guardrails.bicep and agent instructions in config.json — the two surfaces most changes land on.config.json, pass/fail. Call out evaluators with no configured threshold as "informational only".logs/, diff per-evaluator mean scores. Flag any drop > 5% or any previously-passing threshold that now fails.## Summary
- Evaluators run: <list>
- Agents evaluated: <list>
- Dataset size: N rows
- Pass/fail vs. threshold:
- <evaluator>: score X.XX, threshold Y.YY — PASS | FAIL | NO_THRESHOLD
## Failing rows (top 5)
1. Input: "<truncated>"
Response: "<truncated>"
Failed: <evaluators>
Pattern: <classification or "unknown">
## Trend vs. prior run
- <per-evaluator delta, or "no prior run">
- Regressions: <list or "none">
## Recommendation
- <verdict: ship | hold | needs more data>
- <specific file pointer if regression is real>
redteam-analyst. Evaluators produce per-row scores; red team produces ASR.config.json) to pass/fail.