Implements red teaming, refusal training, Constitutional AI, and safety RLHF to align LLMs against harmful outputs while preserving helpfulness. Use when designing safety data pipelines, evaluating jailbreak robustness with benchmarks like ToxiGen/BBQ/TruthfulQA, or balancing over-refusal vs harmlessness. Do not use for general RLHF preference optimization unrelated to safety.
Make LLMs reliably refuse harmful requests while remaining maximally helpful on benign ones. This covers the full safety alignment pipeline: red teaming to discover vulnerabilities, generating safety training data, training with safety-specific reward signals, and evaluating robustness across harm categories and multi-turn conversations.
Use this skill when:

- Designing safety data pipelines (red teaming, refusal training, Constitutional AI revisions, safety RLHF)
- Evaluating jailbreak robustness with benchmarks like ToxiGen, BBQ, or TruthfulQA
- Balancing over-refusal against harmlessness

Related skills: preference-optimization, reward-modeling

During training, the safety reward is blended with the helpfulness reward:

r_total = α * r_helpfulness + (1 - α) * r_safety

Outputs:

- Harm Taxonomy — categorized risk areas with severity levels and expected model behaviors
- Red Team Report — attack categories tested, success rates, and example jailbreaks discovered
- Safety Evaluation — ToxiGen/BBQ/TruthfulQA scores, over-refusal rate, multi-turn robustness results
- Training Data Summary — number of safety pairs, category distribution, Constitutional AI revision statistics

Read these only when relevant:
- reward-modeling — building the safety reward model
- eval-dataset-design — designing safety evaluation benchmarks
- preference-optimization — PPO/DPO training loop that consumes safety data
- synthetic-data-generation — generating adversarial and safety training data at scale
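The blended reward r_total = α * r_helpfulness + (1 - α) * r_safety can be sketched as follows. This is a minimal illustration, not the skill's implementation: the two reward functions are hypothetical stubs standing in for separately trained helpfulness and safety reward models, and the name `alpha` is an assumption.

```python
def r_helpfulness(prompt: str, response: str) -> float:
    # Stub: a real helpfulness reward model would score response quality.
    return 1.0 if response else 0.0


def r_safety(prompt: str, response: str) -> float:
    # Stub: a real safety reward model would penalize harmful content.
    return 0.0 if "harmful" in response else 1.0


def r_total(prompt: str, response: str, alpha: float = 0.7) -> float:
    """Blend helpfulness and safety rewards.

    Higher alpha weights helpfulness more (risking harmful compliance);
    lower alpha weights safety more (risking over-refusal).
    """
    return alpha * r_helpfulness(prompt, response) + (1 - alpha) * r_safety(
        prompt, response
    )
```

Tuning alpha is exactly the over-refusal vs. harmlessness trade-off described above: the over-refusal rate and safety scores from the Safety Evaluation artifact are the signals for adjusting it.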