Frequently Asked Questions (FAQ)¶
Common questions and answers about LightRFT.
General Questions¶
Q: What is LightRFT?¶
A: LightRFT (Light Reinforcement Fine-Tuning) is an advanced reinforcement learning framework for fine-tuning Large Language Models (LLMs) and Vision-Language Models (VLMs). It provides efficient and scalable RLHF training with support for multiple algorithms and distributed training strategies.
Q: What are the main differences between LightRFT and OpenRLHF?¶
A: LightRFT extends OpenRLHF with:
Enhanced multimodal (VLM) support
More RL algorithms (GRPO, GSPO, GMPO, REINFORCE++, CPGD, etc.)
Better memory optimization (engine sleep, optimizer offload)
Improved inference engines (vLLM, SGLang with FP8)
Reward model co-location for efficiency
More flexible distributed training strategies
Q: Which models are supported?¶
A: LightRFT supports:
LLMs: Qwen, Qwen2.5, LLaMA, Mistral, and most HuggingFace models
VLMs: Qwen-VL, Qwen2-VL, LLaVA
Custom: Easy to add new models via monkey patching
Q: What hardware is required?¶
A: Minimum requirements:
GPU: NVIDIA GPUs with CUDA 11.8+
Memory: 40GB+ VRAM per GPU recommended (24GB possible with optimizations)
PyTorch: 2.5.1+
Python: 3.8+
For production: 8× A100/H100 80GB recommended
Installation Questions¶
Q: How do I install LightRFT?¶
A: Simple installation:
git clone https://github.com/opendilab/LightRFT.git
cd LightRFT
pip install -r requirements.txt && pip install -e .
Q: Do I need to install vLLM separately?¶
A: No, vLLM is included in the requirements. However, for the latest features, you can install from source.
Training Questions¶
Q: What’s the difference between FSDP and DeepSpeed?¶
A:
FSDP: PyTorch-native, integrates tightly with the PyTorch ecosystem, supports CPU offload
DeepSpeed: More mature, ZeRO-3 optimization, generally faster
Use FSDP for maximum memory efficiency, DeepSpeed for speed.
Q: How do I choose batch sizes?¶
A: Follow this constraint:
train_batch_size >= rollout_batch_size × n_samples_per_prompt
Example for 8 GPUs:
train_batch_size=256
rollout_batch_size=64
n_samples_per_prompt=8
micro_train_batch_size=1
micro_rollout_batch_size=2
Q: Which algorithm should I use?¶
A: By task:
Math/Coding: GRPO, Dr.GRPO
Instruction Following: CPGD, GSPO
Open-ended: FIRE Sampling
Low Memory: GRPO (no critic)
Research: GMPO, REINFORCE++
Q: How many samples per prompt should I use?¶
A: Typical values:
4-8: Standard, good balance
16+: Better quality, slower training
32+: Best-of-N scenarios
More samples give better advantage estimates but increase rollout time.
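As a rough intuition for this trade-off, here is a standalone sketch (not LightRFT code; it simulates binary rewards for a hypothetical prompt) showing that the per-prompt mean reward, which group-based methods use as a baseline, becomes less noisy as the number of samples per prompt grows:
import torch

torch.manual_seed(0)
success_rate = 0.5  # hypothetical per-prompt success probability
for n in (4, 8, 16, 32):
    # Simulate many prompts, each with n binary rewards, and measure how
    # noisy the per-prompt mean (the baseline) is across prompts.
    rewards = (torch.rand(10_000, n) < success_rate).float()
    print(n, rewards.mean(dim=1).std().item())
The spread of the per-prompt mean roughly halves each time the sample count quadruples, which is why larger values help but with diminishing returns.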
Q: Can I use multiple reward models?¶
A: Yes! LightRFT supports:
Multiple reward models in parallel
Reward model co-location (same GPU as training)
Remote reward model servers
Weighted reward combination
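For the weighted combination in the last item, a minimal illustrative sketch (the scorer functions and weights below are hypothetical stand-ins for reward models, not LightRFT APIs):
def combined_reward_fn(responses, labels, weights=(0.7, 0.3)):
    # Hypothetical scorers standing in for two reward models.
    correctness = [1.0 if r.strip() == l.strip() else 0.0 for r, l in zip(responses, labels)]
    brevity = [1.0 / (1.0 + len(r)) for r in responses]
    # Weighted sum of the individual reward signals, one score per response.
    return [weights[0] * c + weights[1] * b for c, b in zip(correctness, brevity)]
A callable like this can be passed as reward_fn, as shown in the custom reward function example later in this FAQ.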
Q: How do I enable multimodal (VLM) training?¶
A: Use the VLM training script:
python train_vl.py \
--pretrain /path/to/Qwen2-VL \
--mixed_mm_data \
--packing_samples
Performance Questions¶
Q: How do I reduce memory usage?¶
A: Use these techniques:
Enable gradient checkpointing (see the sketch after this list):
--gradient_checkpointing
Use FSDP with CPU offload:
--fsdp --fsdp_cpu_offload
Lower engine memory:
--engine_mem_util 0.4
Use ZeRO-3:
--zero_stage 3
Reduce batch sizes
Enable engine sleep:
--enable_engine_sleep
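Conceptually, gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. A standalone HuggingFace Transformers sketch of the same idea (not LightRFT's internal wiring; the model path is a placeholder):
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/path/to/model")  # placeholder path
model.gradient_checkpointing_enable()  # recompute activations in backward to save memory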
Q: How do I speed up training?¶
A:
Increase batch sizes (if memory allows)
Use FP8 inference (vLLM)
Enable Flash Attention (see the sketch after this list):
--flash_attn
Reduce n_samples_per_prompt if possible
Use tensor parallelism for inference:
--engine_tp_size 2
Optimize NCCL:
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
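For reference, Flash Attention on a plain HuggingFace model is typically enabled at load time; a standalone sketch of the idea (not how LightRFT wires --flash_attn internally; the model path is a placeholder):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/model",                         # placeholder path
    torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)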
Q: What’s the typical training speed?¶
A: On 8× A100 80GB:
7B model: ~1000 samples/min
13B model: ~500 samples/min
34B model: ~200 samples/min
70B model: ~50 samples/min
These figures assume FSDP with the optimizations described above.
Q: How do I use multiple nodes?¶
A: Use SLURM or Ray:
# SLURM example
srun -N2 --gres=gpu:8 --ntasks-per-node=8 bash train.sh
# Or use torchrun
torchrun --nproc_per_node=8 \
--nnodes=2 \
--node_rank=$NODE_RANK \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train.py
Algorithm Questions¶
Q: What’s the difference between GRPO and PPO?¶
A:
GRPO: Group-normalized advantages, no critic network
PPO: Uses separate value network (critic)
GRPO is simpler and more memory-efficient.
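A minimal sketch of the group-normalization idea (illustrative only, not LightRFT's exact implementation): rewards for the n samples of each prompt are standardized within their group, so no learned value network is needed.
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: (num_prompts, n_samples_per_prompt) scalar rewards
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_normalized_advantages(rewards))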
Q: When should I use CPGD?¶
A: Use CPGD when:
Fine-tuning pre-trained models
Want to preserve base capabilities
Need controlled policy updates
Preventing catastrophic forgetting
Q: What is Clip Higher?¶
A: An improved clipping scheme with separate upper/lower bounds for positive/negative advantages. Better for:
Noisy rewards
Large distribution shifts
Unstable training
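One common formulation uses a wider upper clipping range than lower range, so samples with positive advantages are clipped less aggressively. A minimal sketch of that idea (the parameter names eps_low and eps_high are hypothetical, not LightRFT arguments):
import torch

def clip_higher_surrogate(ratio, advantages, eps_low=0.2, eps_high=0.28):
    # Asymmetric clipping: the probability ratio may move further above 1 than below it.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.min(ratio * advantages, clipped * advantages).mean()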
Debugging Questions¶
Q: Training crashes with OOM error¶
A: See the Troubleshooting Guide
Q: num_rollouts_per_episodes = 0 error¶
A: Your train_batch_size is too small. Ensure:
train_batch_size >= rollout_batch_size × n_samples_per_prompt
Q: Model not improving / Reward not increasing¶
A: Check:
Learning rate too high/low
KL penalty too large
Reward model quality
Enable reward normalization (see the sketch after this list):
--reward_running_norm
Try a different advantage estimator
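Conceptually, running reward normalization keeps running statistics of observed rewards and standardizes each new batch with them; a standalone sketch of the idea (not LightRFT's implementation of --reward_running_norm):
import torch

class RunningRewardNorm:
    """Track a running mean/std of rewards and standardize new batches."""

    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, rewards: torch.Tensor) -> torch.Tensor:
        for r in rewards.tolist():  # Welford's online mean/variance update
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        return (rewards - self.mean) / (std + self.eps)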
Q: NCCL timeout or hanging¶
A:
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# Pin the network interface; disable InfiniBand transport if it is misconfigured
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
Q: vLLM engine initialization fails¶
A:
Check GPU memory:
--engine_mem_util 0.5
Reduce TP size:
--engine_tp_size 1
Check CUDA compatibility
Update vLLM:
pip install -U vllm
Evaluation Questions¶
Q: How do I evaluate on benchmarks?¶
A: For math benchmarks, use the evaluation scripts in the examples directory:
# Refer to the examples/gsm8k_geo3k directory for evaluation scripts
# See the example training scripts for evaluation configurations
Q: Can I save generation trajectories?¶
A: Yes, use the trajectory saver:
from lightrft.utils import TrajectorySaver
saver = TrajectorySaver(output_dir="./trajectories")
# Automatically saves prompts, responses, rewards
Q: How do I integrate with W&B?¶
A:
python train.py \
--use_wandb your-project \
--wandb_org your-org \
--wandb_run_name experiment-1
Advanced Questions¶
Q: Can I implement custom algorithms?¶
A: Yes! Extend the trainer class:
from lightrft.trainer import SPMDPPOTrainer

class CustomTrainer(SPMDPPOTrainer):
    def compute_advantages(self, *args, **kwargs):
        # Your custom advantage computation
        pass
Q: How do I add a new model architecture?¶
A: Create a monkey patch in lightrft/models/monkey_patch/:
# your_model.py
def patch_your_model(model):
    # Add custom forward methods
    pass
# In apply.py
from .your_model import patch_your_model
Q: Can I use custom reward functions?¶
A: Yes, pass a callable:
def custom_reward_fn(responses, labels):
    # Your reward computation
    return rewards

trainer = SPMDPPOTrainer(
    ...,
    reward_fn=custom_reward_fn
)
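For instance, a minimal concrete reward (assuming responses and labels are lists of strings) that checks whether the reference answer appears in the response:
def contains_answer_reward(responses, labels):
    # 1.0 if the reference answer string appears in the response, else 0.0
    return [1.0 if l.strip() in r else 0.0 for r, l in zip(responses, labels)]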
Q: How do I checkpoint during training?¶
A: Checkpoints are automatic:
--save_path ./checkpoints \
--save_interval 1 \
--max_ckpt_num 3
Resume with:
--load_checkpoint \
--ckpt_path ./checkpoints/episode_5
Contributing Questions¶
Q: How can I contribute to LightRFT?¶
A:
Fork the repository
Create a feature branch
Implement your changes
Add tests
Submit a pull request
See Contributing Guide for details.
Q: How do I report bugs?¶
A: Open an issue on GitHub Issues with:
Environment details (GPU, CUDA, PyTorch versions)
Full error traceback
Minimal reproduction script
Expected vs actual behavior
Q: Where can I get help?¶
A:
GitHub Issues for bugs
Discussions for questions
Documentation for guides
Examples directory for code samples