Collaborative prompt sensitivity audit for an LLM-based complaint classifier. Guides a pair through testing how prompt variations affect classification accuracy — rephrasing, few-shot examples, system prompt tone, output format, and more. Provides a labeled dataset, baseline classifier, and evaluation helpers. Coaches rather than solves directly.
v1.0 — March 2026
We use LLMs to classify customer signals — detecting complaints, identifying vulnerability, scoring severity. It works, but we don't know how fragile it is.
Small prompt changes sometimes flip classifications: reword the system instruction, reorder the few-shot examples, change "classify" to "determine" — and suddenly edge cases go the other way. We need to understand how sensitive our classifier is to prompt variations, find the failure modes, and build a more robust prompt.
Our job: systematically test prompt variations against a labeled dataset, measure what breaks, understand why, and produce a prompt that's demonstrably more robust than what we started with.
This is a pair exercise. The agent's job is to be a thinking partner — ask questions, provide scaffolding, help debug, challenge assumptions. Not to write the solution for you.
pip install openai anthropic # at least one
pip install pandas # for analysis
You'll need an API key: OPENAI_API_KEY or ANTHROPIC_API_KEY.
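A quick sanity check can save a failed first run. This helper is a convenience sketch, not part of the provided scripts; it just inspects the two environment variables named above:

```python
import os

def pick_provider(env=os.environ):
    """Return a provider name based on which API key is set (OpenAI first)."""
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return None

# e.g. pick_provider({"ANTHROPIC_API_KEY": "sk-..."}) returns "anthropic"
```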
| File | Purpose |
|---|---|
| `scripts/baseline_classify.py` | Starting point — runs baseline prompt against all signals |
| `scripts/helpers.py` | Data loading, evaluation, comparison utilities (provided) |
| `references/signals.json` | 30 customer signals (text + channel, no labels) |
| `references/signals_labeled.json` | Same signals with ground truth labels |
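It's worth opening the data files yourself before running anything. A minimal loader sketch (the exact record fields aren't specified here, so inspect one record to confirm them):

```python
import json

def load_signals(path):
    """Load a list of signal records from one of the JSON files above."""
    with open(path) as f:
        return json.load(f)

# Usage, with the paths from the table:
# signals = load_signals("references/signals.json")          # 30 records, no labels
# labeled = load_signals("references/signals_labeled.json")  # same signals + labels
# print(signals[0])  # inspect one record to see the actual field names
```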
Run the baseline classifier:
python scripts/baseline_classify.py --provider openai --model gpt-4o-mini
Look at the results. Before doing anything else:
Read `scripts/baseline_classify.py` — what's weak about it? Talk through your observations before moving on.
Goal: Design a set of prompt variations that test specific hypotheses about sensitivity.
Don't just randomly change words. Each variation should test a specific hypothesis. Some starting points to consider — but come up with your own too:
Instruction framing:
Few-shot examples:
Definition sensitivity:
Output format:
For each variation, write down:
Before coding anything: talk through at least 3 hypotheses. The agent should push back on weak hypotheses and ask about the reasoning.
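One lightweight way to keep variations honest is to record each one with its hypothesis and a falsifiable prediction before running anything. A sketch (the class and field names are illustrative, not part of the provided helpers):

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str         # short identifier, e.g. "verb-swap"
    prompt_file: str  # path you'd pass to --prompt
    hypothesis: str   # what you expect this change to affect, and why
    prediction: str   # concrete, checkable expected outcome

variants = [
    Variant(
        name="verb-swap",
        prompt_file="variants/verb_swap.txt",
        hypothesis="'determine' vs 'classify' shifts borderline cases",
        prediction="at most 2 flips, concentrated in ambiguous signals",
    ),
]

for v in variants:
    print(f"{v.name}: {v.hypothesis}")
```

Writing the prediction down first makes it obvious afterward whether a result confirmed the hypothesis or just happened.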
Goal: Execute your prompt variants and collect results.
The baseline script supports a --prompt flag for custom prompt templates:
python scripts/baseline_classify.py --provider openai --model gpt-4o-mini --prompt my_variant.txt
Or modify the script to run multiple variants in sequence. The helpers have comparison tools:
from helpers import compare_runs, print_comparison_summary
# compare_runs(variant_a_predictions, variant_b_predictions)
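To see what a run-to-run comparison is fundamentally measuring, here's a hand-rolled version of the core idea behind `compare_runs` (the provided helper likely reports more; the signal IDs and the dict-of-predictions shape are assumptions, so adapt to what the helpers actually expect):

```python
def count_flips(preds_a, preds_b):
    """Return signal IDs whose predicted label differs between two runs.

    Assumes each argument is a dict mapping signal ID -> predicted label.
    """
    common = preds_a.keys() & preds_b.keys()
    return sorted(sid for sid in common if preds_a[sid] != preds_b[sid])

baseline = {"s1": "complaint", "s2": "not_complaint", "s3": "complaint"}
variant  = {"s1": "complaint", "s2": "complaint",     "s3": "complaint"}
print(count_flips(baseline, variant))  # → ['s2']
```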
Things to think about while running:
Before moving on: collect results for at least 3 variants plus the baseline.
Goal: Understand why certain signals are sensitive and others aren't.
This is the interesting part. Look across all your runs:
Use helpers.evaluate_binary() and helpers.compare_runs() to quantify.
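For the cross-run view, pandas (installed above) makes it easy to spot fragile signals: one row per signal, one column per variant, then count how many distinct labels each signal received. The labels and IDs below are illustrative placeholders for your real run results:

```python
import pandas as pd

# Illustrative predictions: rows = signals, columns = prompt variants.
runs = pd.DataFrame(
    {
        "baseline":  ["complaint", "complaint", "not_complaint"],
        "verb_swap": ["complaint", "not_complaint", "not_complaint"],
        "reordered": ["complaint", "complaint", "not_complaint"],
    },
    index=["s1", "s2", "s3"],
)

# A signal is "fragile" if variants disagree on its label.
fragility = runs.nunique(axis=1)
print(fragility[fragility > 1].index.tolist())  # → ['s2']
```

Sorting signals by how many variants flip them gives you a shortlist of edge cases to read closely in the next step.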
Questions to investigate:
Before moving on: you should be able to say "these types of signals are fragile because X, and these prompt elements have the most impact on accuracy."
Goal: Combine what you learned into a single prompt that's demonstrably better than the baseline.
This isn't about making the biggest prompt — it's about making the most robust one. Consider:
Evaluate your final prompt against the full dataset. Compare to baseline:
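Aggregate accuracy alone can hide regressions: a new prompt can fix some signals while breaking others that were previously correct. A minimal sketch of the comparison (presumably `helpers.evaluate_binary` covers the accuracy part; the hand-rolled version below just makes the per-signal delta explicit, with illustrative data):

```python
def accuracy(preds, labels):
    """Fraction of signals where the prediction matches the ground-truth label."""
    return sum(preds[s] == labels[s] for s in labels) / len(labels)

labels   = {"s1": "complaint", "s2": "not_complaint", "s3": "complaint"}
baseline = {"s1": "complaint", "s2": "complaint",     "s3": "complaint"}
final    = {"s1": "complaint", "s2": "not_complaint", "s3": "complaint"}

print(f"baseline: {accuracy(baseline, labels):.2f}")  # → baseline: 0.67
print(f"final:    {accuracy(final, labels):.2f}")     # → final:    1.00

# Per-signal delta: which signals were fixed, and which newly broke.
fixed  = [s for s in labels if baseline[s] != labels[s] and final[s] == labels[s]]
broken = [s for s in labels if baseline[s] == labels[s] and final[s] != labels[s]]
print(fixed, broken)  # → ['s2'] []
```

Anything in `broken` deserves a sentence in the findings write-up, even if overall accuracy went up.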
Write up outputs/findings.md covering:
Critical: do not answer your own questions. When the walkthrough asks a question, pose it to the user and then stop and wait. Do not offer your interpretation. Just ask and wait.