Frequently Asked Questions (FAQ)¶
Common questions and answers about LightRFT.
General Questions¶
Q: What is LightRFT?¶
A: LightRFT (Light Reinforcement Fine-Tuning) is an advanced reinforcement learning framework designed for the reinforcement fine-tuning of Large Language Models (LLMs) and Vision-Language Models (VLMs). It supports multiple models, algorithms, distributed training strategies, and inference engines, providing efficient and scalable RLHF and RLVR training capabilities.
Q: What are the main differences between LightRFT and OpenRLHF?¶
A: LightRFT extends OpenRLHF with:
Enhanced multimodal (VLM) support
More RL algorithms (GRPO, GSPO, GMPO, REINFORCE++, CPGD, etc.)
Comprehensive Reward Model support, including Scalar Reward Models (SRM) and Generative Reward Models (GRM)
Better memory optimization (engine sleep, optimizer offload)
Improved inference engines (vLLM, SGLang)
Reward model co-location for efficiency
More flexible distributed training strategies, supporting FSDP and DeepSpeed ZeRO
Q: Which models are supported?¶
A: LightRFT supports:
LLM: Qwen, Qwen2.5 and most HuggingFace models
VLM: Qwen-VL, Qwen2-VL
Audio: Qwen2-Audio
Custom: Easily inherit and extend existing model architectures
Q: What hardware is required?¶
A: Minimum requirements:
GPU: NVIDIA GPUs with CUDA 12.8+
Memory: 40GB+ VRAM per GPU recommended (24GB possible with optimizations)
PyTorch: 2.9.1+
Python: 3.12+
For production: 8× A100/H100 80GB recommended
Installation Questions¶
Q: How do I install LightRFT?¶
A: Standard installation (includes SGLang + Flash-Attention):
git clone https://github.com/opendilab/LightRFT.git
cd LightRFT
pip install -e .
Q: How do I install vLLM?¶
A: vLLM is optional. Install it with:
# Option 1: Install as optional dependency
pip install ".[vllm]"
# Option 2: Install vLLM directly
pip install "vllm>=0.13.3"
Note: SGLang is the default inference backend and is already included in the standard installation.
Q: What if Flash-Attention installation fails?¶
A: Try these solutions:
Option 1: Use pre-compiled wheel (Recommended)
# Download from https://github.com/Dao-AILab/flash-attention/releases
# For CUDA 12.x with PyTorch 2.9 and Python 3.12:
pip install flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
Option 2: Use Docker (Easiest)
docker pull opendilab/lightrft:v0.1.0
Training Questions¶
Q: What’s the difference between FSDP and DeepSpeed?¶
A: Both implement Fully Sharded Data Parallelism (ZeRO-3/FSDP), but they differ in design philosophy:
FSDP (PyTorch Native):
Deep Integration: Works seamlessly with the PyTorch ecosystem, including Autograd and torch.compile.
High Flexibility: Offers programmatic control over sharding units via auto_wrap_policy.
Composability: Easier to combine with other native features such as Tensor Parallelism.
DeepSpeed (Microsoft):
All-in-One Toolkit: Provides built-in CPU/NVMe offloading (ZeRO-Infinity) and high-performance optimizers.
Declarative Config: Simple setup via JSON configuration files, abstracting away complexity.
Custom Kernels: Contains many manual CUDA optimizations for peak performance in specific setups.
Recommendation: Use FSDP for native experience, complex model customization, or with torch.compile. Use DeepSpeed for ease of use or extreme model sizes requiring NVMe offloading.
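As a minimal sketch, the two strategies are selected with different launch flags (the flags below also appear in the memory-optimization answer later in this FAQ; exact names may vary by version):
# FSDP (PyTorch native), optionally with CPU offload
python train.py --fsdp --fsdp_cpu_offload ...
# DeepSpeed ZeRO-3
python train.py --zero_stage 3 ...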
Q: Which algorithm should I use?¶
A: By task:
Math/Coding: GRPO, Dr.GRPO
Instruction Following: CPGD, GSPO
Open-ended: FIRE Sampling
Low Memory: GRPO (no critic)
Research: GMPO, REINFORCE++
Q: How many samples per prompt should I use?¶
A: Typical values:
4-8: Standard, good balance
16+: Better quality, slower training
32+: Best-of-N scenarios
More samples = better advantage estimation but slower.
Q: Can I use multiple reward models?¶
A: Yes! LightRFT supports:
Multiple reward models in parallel
Reward model co-location (same GPU as training)
Remote reward model servers
Weighted reward combination (see the sketch below)
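One simple way to combine several reward signals with weights is through the custom reward_fn hook shown later in this FAQ. A minimal sketch; the helper checks and the 0.2/0.8 weights are purely illustrative:
def format_reward(response):
    # Hypothetical check: reward well-formatted answers
    return 1.0 if "\\boxed{" in response else 0.0

def accuracy_reward(response, label):
    # Hypothetical check: reward responses containing the reference answer
    return 1.0 if label.strip() in response else 0.0

def combined_reward_fn(responses, labels):
    # Weighted sum of the two signals
    return [0.2 * format_reward(r) + 0.8 * accuracy_reward(r, l)
            for r, l in zip(responses, labels)]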
Performance Questions¶
Q: How do I reduce memory usage?¶
A: Use these techniques:
Enable gradient checkpointing: --gradient_checkpointing
Use FSDP with CPU offload: --fsdp --fsdp_cpu_offload
Lower engine memory utilization: --engine_mem_util 0.4
Use ZeRO-3: --zero_stage 3
Reduce batch sizes
Enable engine sleep: --enable_engine_sleep
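Putting several of these together, a memory-constrained launch might look like the following sketch (combine only the flags that apply to your setup):
python train.py \
  --gradient_checkpointing \
  --fsdp --fsdp_cpu_offload \
  --engine_mem_util 0.4 \
  --enable_engine_sleep \
  ...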
Q: How do I speed up training?¶
A:
Increase batch sizes (if memory allows)
Use FP8 inference (Work in Progress, only in vLLM)
Enable Flash Attention: --flash_attn
Reduce n_samples_per_prompt if possible
Use tensor parallelism for inference: --engine_tp_size 2
Optimize NCCL: export TORCH_NCCL_AVOID_RECORD_STREAMS=1
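As a sketch, a throughput-oriented run might combine the options above:
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
python train.py \
  --flash_attn \
  --engine_tp_size 2 \
  ...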
Q: What’s the typical training speed?¶
A: On 8× A100 80GB:
7B model: ~1000 samples/min
13B model: ~500 samples/min
34B model: ~200 samples/min
70B model: ~50 samples/min
These figures assume FSDP and the optimizations described above.
Algorithm Questions¶
Q: What’s the difference between GRPO and PPO?¶
A:
GRPO: Group-normalized advantages, no critic network
PPO: Uses separate value network (critic)
GRPO is simpler and more memory-efficient.
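The group-normalized advantage that GRPO uses can be sketched in a few lines (a simplification; the exact normalization in LightRFT may differ, e.g. in how a zero standard deviation is handled):
import torch

def group_normalized_advantages(rewards, eps=1e-6):
    # rewards: shape (n_samples_per_prompt,), the rewards for one prompt's group
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = group_normalized_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))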
Q: What is Clip Higher?¶
A: An improved clipping scheme with separate upper/lower bounds for positive/negative advantages. Better for:
Noisy rewards
Large distribution shifts
Unstable training
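A rough sketch of the idea on the PPO-style surrogate (the epsilon values are illustrative only):
import torch

def clip_higher_surrogate(ratio, advantages, eps_low=0.2, eps_high=0.28):
    # Asymmetric clipping: for positive advantages the looser upper bound
    # (1 + eps_high) binds; for negative advantages the lower bound (1 - eps_low) binds
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.min(ratio * advantages, clipped * advantages).mean()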
Debugging Questions¶
Q: Training crashes with OOM error¶
A: See the Troubleshooting Guide
Q: num_rollouts_per_episodes = 0 error¶
A: Your train_batch_size is too small. Ensure:
train_batch_size >= rollout_batch_size × n_samples_per_prompt
For example, with rollout_batch_size 128 and n_samples_per_prompt 8, train_batch_size must be at least 1024.
Q: Model not improving / Reward not increasing¶
A: Check:
Learning rate too high/low
KL penalty too large
Reward model quality
Enable reward normalization: --reward_running_norm
Try a different advantage estimator
Q: NCCL timeout or hanging¶
A:
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# If it still hangs, pin the network interface and disable InfiniBand
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
Q: vLLM engine initialization fails¶
A:
Check GPU memory: --engine_mem_util 0.5
Reduce TP size: --engine_tp_size 1
Check CUDA compatibility
Update vLLM: pip install -U vllm
Evaluation Questions¶
Q: How do I evaluate on benchmarks?¶
A: For math benchmarks, use the evaluation scripts in the examples directory:
# Refer to the examples/gsm8k_geo3k directory for evaluation scripts
# See the example training scripts for evaluation configurations
Q: Can I save generation trajectories?¶
A: Yes, use the trajectory saver:
from lightrft.utils import TrajectorySaver
saver = TrajectorySaver(output_dir="./trajectories")
# Automatically saves prompts, responses, rewards
Q: How do I integrate with W&B?¶
A:
python train.py \
--use_wandb your-project \
--wandb_org your-org \
--wandb_run_name experiment-1
Advanced Questions¶
Q: Can I implement custom algorithms?¶
A: Yes! Extend the trainer class:
from lightrft.trainer import SPMDPPOTrainer

class CustomTrainer(SPMDPPOTrainer):
    def compute_advantages(self, *args, **kwargs):
        # Your custom advantage computation
        pass
Q: How do I add a new model architecture?¶
A: There are two methods:
Standard approach: Inherit from base classes like ActorLanguage or ActorVL, and add the implementation in the lightrft/models/ directory.
Monkey patching: Create a monkey patch in lightrft/models/monkey_patch/:
# your_model.py
def patch_your_model(model):
    # Add custom forward methods, etc.
    pass

# Register in apply.py
from .your_model import patch_your_model
Q: Can I use custom reward functions?¶
A: Yes, pass a callable:
def custom_reward_fn(responses, labels):
    # Your reward computation, e.g. 1.0 for an exact match with the label, else 0.0
    return [float(r.strip() == l.strip()) for r, l in zip(responses, labels)]

trainer = SPMDPPOTrainer(
    ...,
    reward_fn=custom_reward_fn
)
Q: How do I checkpoint during training?¶
A: Checkpoints are automatic:
--save_path ./checkpoints \
--save_interval 1 \
--max_ckpt_num 3
Resume with:
--load_checkpoint \
--ckpt_path ./checkpoints/episode_5
Contributing Questions¶
Q: How can I contribute to LightRFT?¶
A:
Fork the repository
Create a feature branch
Implement your changes
Add tests
Submit a pull request
See Contributing Guide for details.
Q: How do I report bugs?¶
A: Open an issue on GitHub Issues with:
Environment details (GPU, CUDA, PyTorch versions)
Full error traceback
Minimal reproduction script
Expected vs actual behavior
Q: Where can I get help?¶
A:
GitHub Issues for bugs
Discussions for questions
Documentation for guides
Examples directory for code samples