Implements red teaming, refusal training, Constitutional AI, and safety RLHF to align LLMs against harmful outputs while preserving helpfulness. Use when designing safety data pipelines, evaluating jailbreak robustness with benchmarks like ToxiGen/BBQ/TruthfulQA, or balancing over-refusal vs harmlessness. Do not use for general RLHF preference optimization unrelated to safety.
Make LLMs reliably refuse harmful requests while remaining maximally helpful on benign ones. This covers the full safety alignment pipeline: red teaming to discover vulnerabilities, generating safety training data, training with safety-specific reward signals, and evaluating robustness across harm categories and multi-turn conversations.
Use this skill when:

- Designing safety data pipelines (red teaming, refusal training, Constitutional AI revisions, safety RLHF)
- Evaluating jailbreak robustness with benchmarks like ToxiGen, BBQ, or TruthfulQA
- Balancing over-refusal against harmlessness

Related skills: preference-optimization, reward-modeling

During training, the safety reward is blended with the helpfulness reward:

r_total = α * r_helpfulness + (1 - α) * r_safety

Outputs:

- Harm Taxonomy — categorized risk areas with severity levels and expected model behaviors
- Red Team Report — attack categories tested, success rates, and example jailbreaks discovered
- Safety Evaluation — ToxiGen/BBQ/TruthfulQA scores, over-refusal rate, multi-turn robustness results
- Training Data Summary — number of safety pairs, category distribution, Constitutional AI revision statistics

Read these only when relevant:
- reward-modeling — building the safety reward model
- eval-dataset-design — designing safety evaluation benchmarks
- preference-optimization — PPO/DPO training loop that consumes safety data
- synthetic-data-generation — generating adversarial and safety training data at scale
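The blended reward r_total = α * r_helpfulness + (1 - α) * r_safety can be sketched as follows. This is a minimal illustration, not the skill's implementation: the two reward functions are hypothetical stubs standing in for separately trained helpfulness and safety reward models, and the name `alpha` is an assumption.

```python
def r_helpfulness(prompt: str, response: str) -> float:
    # Stub: a real helpfulness reward model would score response quality.
    return 1.0 if response else 0.0


def r_safety(prompt: str, response: str) -> float:
    # Stub: a real safety reward model would penalize harmful content.
    return 0.0 if "harmful" in response else 1.0


def r_total(prompt: str, response: str, alpha: float = 0.7) -> float:
    """Blend helpfulness and safety rewards.

    Higher alpha weights helpfulness more (risking harmful compliance);
    lower alpha weights safety more (risking over-refusal).
    """
    return alpha * r_helpfulness(prompt, response) + (1 - alpha) * r_safety(
        prompt, response
    )
```

Tuning alpha is exactly the over-refusal vs. harmlessness trade-off described above: the over-refusal rate and safety scores from the Safety Evaluation artifact are the signals for adjusting it.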