Create context-limited versions of evaluations. Use when you need to simulate lower context windows or compare model performance at different context limits. Keywords - context, clip, truncate, window, limit, simulate, 8k, 12k, 16k.
You are helping the user create context-limited versions of existing evaluations using the cohort.py tool.
When the user invokes /context-clip <cohort-or-pattern> <eval-pattern> <contexts>, follow these steps:
Important: Use the AskUserQuestion tool for all user confirmations to maintain smooth flow without breaking execution.
Extract from the command:
- Cohort or pattern (e.g., Qwen3-Next-80B, data/m12x/Qwen*, or empty for all)
- Eval pattern and target context(s)
Discover what evals exist using cohort.py list:
# If cohort specified
python cohort.py list data/m12x/<CohortName>
# If glob pattern
python cohort.py list 'data/m12x/<Pattern>*/'
# If no cohort specified, list all
python cohort.py list
Parse the markdown table output to extract cohort, eval_id, label, and groups for each eval.
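A minimal sketch of that parsing step, assuming the output is a standard markdown table whose header row names the fields listed above (the exact column layout of `cohort.py list` is an assumption):

```python
# Hypothetical parser for `cohort.py list` markdown-table output.
# Assumes: header row, a |---| separator row, then one row per eval.
def parse_eval_table(output: str) -> list[dict]:
    rows = []
    lines = [l for l in output.splitlines() if l.strip().startswith("|")]
    if len(lines) < 3:
        return rows
    headers = [h.strip() for h in lines[0].strip("|").split("|")]
    for line in lines[2:]:  # skip the header and separator rows
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
    return rows
```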
Filter the evals list using the user's pattern:
If more than one eval matches, use AskUserQuestion:
Question format:
Which eval should be context-clipped?
Options:
[cohort] [eval_id] Label
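One way to sketch the filtering step above, assuming the user's pattern is treated as case-insensitive keywords that must all appear in the eval's label (the matching rule and field names are assumptions):

```python
# Hypothetical eval filter: every word in the pattern must appear
# (case-insensitively) in the eval's "label" field.
def filter_evals(evals: list[dict], pattern: str) -> list[dict]:
    words = pattern.lower().split()
    return [e for e in evals
            if all(w in e.get("label", "").lower() for w in words)]
```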
Convert user's context specifications to token counts:
If user specified multiple contexts (e.g., "12k and 8k"), process all of them.
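The conversion above can be sketched as follows, assuming a `k` suffix means multiples of 1024 (consistent with the table of common limits, e.g. 12k = 12288) and that bare numbers are already token counts:

```python
import re

# Hypothetical converter from context specs like "12k and 8k" or "16384"
# to token counts. Assumes k = 1024.
def parse_contexts(spec: str) -> list[int]:
    tokens = []
    for num, suffix in re.findall(r"(\d+)\s*([kK]?)", spec):
        tokens.append(int(num) * 1024 if suffix else int(num))
    return tokens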
Use AskUserQuestion to confirm:
Create context-limited versions at: 12288, 8192 tokens?
Options:
- Yes, proceed
- Modify context values
If "Modify context values", ask again with editable input.
Before processing, check if context-limited versions already exist:
python cohort.py list <cohort-path> | grep "ctx"
If found, use AskUserQuestion:
Existing context-limited versions found:
- [eval_id] Label (8k ctx)
- [eval_id] Label (12k ctx)
Options:
- Skip existing, create missing only
- Recreate all (note: the tool still skips any output folders that already exist)
- Cancel
For each target context, run:
source venv/bin/activate
python cohort.py context <cohort-path> \
--eval-id <eval_id> \
--context <context_tokens>
Show progress for each context:
Processing 12288 token limit...
✓ Created 3 result folders, clipped X samples
✓ Updated evals.json
Processing 8192 token limit...
✓ Created 3 result folders, clipped Y samples
✓ Updated evals.json
After processing, list the cohort again to show the new evals:
python cohort.py list <cohort-path>
Show the user the new eval_ids, then inform them:
Context-limited evaluations created successfully!
To analyze, you'll need to:
1. Add the cohort to a dataset config (data/*.json)
2. Rebuild the dataset database
3. Run analysis commands
Example workflow:
# If cohort already in a dataset, just rebuild:
python evaluate.py --dataset data/<dataset>.json
# Then compare performance across contexts:
python analyze.py cluster data/<dataset>.json \
--filters '{"eval_id": ["<original>", "<ctx12k>", "<ctx8k>"]}' \
--stack sampler
# Or examine truncation patterns:
python analyze.py surface data/<dataset>.json \
--filters '{"eval_id": ["<ctx8k>"], "base_task": "arithmetic"}' \
--output-dir research/<project>/
User: /context-clip Qwen3-Next-80B "thinking 16k fp16" 12k
1. Cohort: data/m12x/Qwen3-Next-80B
2. List evals: python cohort.py list data/m12x/Qwen3-Next-80B
3. Found 1 match:
[bc8eef] Qwen3-Next-80B Thinking (FP16, 16k)
4. Confirm: Yes
5. Target contexts: 12288 tokens
6. Execute:
python cohort.py context data/m12x/Qwen3-Next-80B --eval-id bc8eef --context 12288
7. ✓ Created 3 result folders, clipped 335+835+1423 samples
8. ✓ New eval_id: 221ec5
User: /context-clip Qwen3-Next-80B fp16 "12k and 8k"
1. Cohort: data/m12x/Qwen3-Next-80B
2. List evals: python cohort.py list data/m12x/Qwen3-Next-80B
3. Found 2 matches:
[bc8eef] Qwen3-Next-80B Thinking (FP16, 16k)
[5a2b3c] Qwen3-Next-80B Instruct (FP16, 16k)
4. Ask user to pick → [bc8eef]
5. Target contexts: 12288, 8192 tokens
6. Confirm: Yes
7. Execute:
12k: ✓ Created [221ec5] (335+835+1423 samples clipped)
8k: ✓ Created [8fc62e] (1314+3717+5998 samples clipped)
8. Show both new eval_ids and suggest next steps
User: /context-clip "Qwen*" fp16 12k
1. List all: python cohort.py list 'data/m12x/Qwen*/'
2. Found evals in multiple cohorts:
- Qwen3-32B
- Qwen3-Next-80B
- Qwen3-14B
3. Ask user which cohort → Qwen3-32B
4. List evals in that cohort... found 3 fp16 variants
5. Ask user to be more specific
6. User refines: "qwen3-32b fp16 thinking"
7. Found 1 match → confirm → execute
Common context limits and use cases:
| Context | Tokens | Use Case |
|---|---|---|
| 2k | 2048 | Extreme resource constraints |
| 4k | 4096 | Mobile/edge deployment |
| 8k | 8192 | Standard constrained deployment |
| 12k | 12288 | Moderate context tasks |
| 16k | 16384 | Extended reasoning tasks |
Guidelines:
- Always activate the virtual environment before running cohort.py: source venv/bin/activate