Train and fine-tune transformer language models using TRL (Transformers Reinforcement Learning). Supports SFT, DPO, GRPO, KTO, RLOO, and reward model training via CLI commands.
You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.
TRL provides CLI commands for post-training foundation models with state-of-the-art techniques.
TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.
Fine-tune language models on instruction-following or conversational datasets.
Full training:
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
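The trl-lib/Capybara dataset used above is in conversational format. As a rough sketch of the row structure SFT consumes (field names follow TRL's conversational "messages" format; the actual turns shown here are made up for illustration):

```python
# Illustrative row in the conversational format expected by trl sft:
# each example is a list of role/content turns.
example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# Every turn must carry both a role and content:
assert all({"role", "content"} <= set(turn) for turn in example["messages"])
```

The chat template of the tokenizer turns these turns into a single training sequence, which is why --eos_token must match the template's end-of-turn token.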
Train with LoRA adapters:
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-4 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
Align models using preference data (chosen/rejected pairs).
Full training:
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-7 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns
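DPO needs preference pairs rather than plain conversations. A minimal sketch of the row structure (field names follow TRL's implicit-prompt preference format used by datasets like trl-lib/ultrafeedback_binarized; the content is invented for illustration):

```python
# Illustrative preference pair: same prompt, a preferred ("chosen") and a
# dispreferred ("rejected") completion.
pair = {
    "chosen": [
        {"role": "user", "content": "Explain gravity briefly."},
        {"role": "assistant", "content": "Gravity is the mutual attraction between masses."},
    ],
    "rejected": [
        {"role": "user", "content": "Explain gravity briefly."},
        {"role": "assistant", "content": "Gravity is when things are heavy."},
    ],
}

# Both sides should share the same prompt turns:
prompt_chosen = [t for t in pair["chosen"] if t["role"] == "user"]
prompt_rejected = [t for t in pair["rejected"] if t["role"] == "user"]
assert prompt_chosen == prompt_rejected
```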
Train with LoRA adapters:
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-6 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns \
--use_peft \
--lora_r 32 \
--lora_alpha 16
Train models using reward functions or an LLM-as-a-judge to evaluate generations and provide rewards.
Basic usage:
trl grpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/gsm8k \
--reward_funcs accuracy_reward \
--output_dir Qwen2-0.5B-GRPO \
--push_to_hub
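A custom reward function for GRPO is just a callable that scores a batch of completions. The sketch below mimics an exact-match accuracy reward in plain Python; the signature (completions plus dataset columns as keyword arguments, returning one float per completion) follows TRL's reward-function convention, but treat the details as an assumption and check the GRPOTrainer docs before relying on them.

```python
# Hypothetical exact-match reward: 1.0 if the completion contains the
# reference answer, else 0.0. Returns one score per completion.
def exact_match_reward(completions, answer, **kwargs):
    rewards = []
    for completion, ref in zip(completions, answer):
        # For conversational completions, use the last assistant turn's text.
        text = completion if isinstance(completion, str) else completion[-1]["content"]
        rewards.append(1.0 if ref.strip() in text else 0.0)
    return rewards

print(exact_match_reward(["The answer is 72."], ["72"]))  # [1.0]
```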
Online RL training where the model generates text and receives rewards based on custom criteria.
Basic usage:
trl rloo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/tldr \
--reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
--output_dir Qwen2-0.5B-RLOO \
--push_to_hub
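RLOO's variance reduction comes from a leave-one-out baseline: each sampled completion's reward is compared against the mean reward of the other samples for the same prompt. A minimal numeric sketch of that computation (for intuition only, not TRL's implementation):

```python
# Leave-one-out advantages for k sampled completions of one prompt:
# advantage_i = r_i - mean(r_j for j != i)
def leave_one_out_advantages(rewards):
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

print(leave_one_out_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that beat their siblings get positive advantages, the rest negative, without needing a learned value function.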
Train a reward model to score text quality for RLHF.
Full training:
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-5 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048
Train with LoRA adapters:
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-4 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_task_type SEQ_CLS \
--lora_r 32 \
--lora_alpha 16
TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.
Example config (sft_config.yaml):
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio
Launch with config:
trl sft --config sft_config.yaml
Override config values:
trl sft --config sft_config.yaml --learning_rate 1.0e-5
TRL integrates with Accelerate for multi-GPU and multi-node training.
Multi-GPU training:
trl sft \
--config sft_config.yaml \
--num_processes 4
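The effective (global) batch size scales with all three knobs, which matters when moving a config between machines. A quick check, assuming one optimizer step per accumulation cycle:

```python
# Effective batch = per-device batch * gradient accumulation steps * processes.
def effective_batch_size(per_device, grad_accum, num_processes):
    return per_device * grad_accum * num_processes

# The YAML example above (batch 8, accumulation 2) launched on 4 GPUs:
print(effective_batch_size(8, 2, 4))  # 64
```

To keep the effective batch constant when adding GPUs, divide --gradient_accumulation_steps by the same factor.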
Use predefined Accelerate configs:
TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3
trl sft \
--config sft_config.yaml \
--accelerate_config zero2
Custom Accelerate config:
# Generate custom config
accelerate config
# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
Fully Sharded Data Parallel (FSDP):
trl sft --config sft_config.yaml --accelerate_config fsdp2
DeepSpeed ZeRO:
trl sft --config sft_config.yaml --accelerate_config zero3
Troubleshooting:
- Out of memory: reduce --per_device_train_batch_size and increase --gradient_accumulation_steps
- Use --use_peft for LoRA training
- Enable --gradient_checkpointing to save memory
- Use --dataset_config for multi-config datasets
- Verify a dataset loads locally: from datasets import load_dataset; ds = load_dataset(name)
- Authenticate with hf auth login before using --push_to_hub
Performance tips:
- Enable --packing for short sequences
- Increase --per_device_train_batch_size if memory allows
- Enable --tf32 for faster computation on Ampere GPUs
- Use --bf16 on supported hardware
- Scale across GPUs with --num_processes
- Tune --temperature and --top_p for generation
- Use --use_peft for faster training and lower memory
- Use --report_to trackio (or --report_to wandb or --report_to tensorboard) for experiment tracking
- Set --output_dir to control where checkpoints are written
When helping users with TRL: