Use this skill when evaluating if an AI judge exhibits stable, objective, and logically sound reasoning, particularly against situational and statistical biases. Trigger it when requests mention 'test if the judge is tricked by a sob story', 'check if the evaluator is swayed by flattery', 'verify if the model is too sure too early based on vague clues', 'check if confidence increases just because the chat got longer', or 'test if the judge is just copying the reference answer'. It is designed for meta-evaluation, detecting susceptibility to rhetorical persuasion, 'Length Artifacts' (monotonicity bias), Criteria Entanglement (halo effect), and 'Solution Fixation' induced by references.
[Case 1]
[Case 2]
[Case 3]
To synthesize data for this capability, you must strictly follow a 3-phase pipeline. Do not hallucinate steps. Read the corresponding reference file for each phase sequentially:
Phase 1: Environment Exploration
Read the exploration guidelines to discover raw knowledge seeds:
references/EXPLORATION.md
Phase 2: Trajectory Selection
Once Phase 1 is complete, read the selection criteria to evaluate the trajectory:
references/SELECTION.md
Phase 3: Data Synthesis
Once a trajectory passes Phase 2, read the synthesis instructions to generate the final data:
references/SYNTHESIS.md