Frequently Asked Questions (FAQ)¶
Common questions and answers about LightRFT.
General Questions¶
Q: What is LightRFT?¶
A: LightRFT (Light Reinforcement Fine-Tuning) is an advanced reinforcement learning framework for fine-tuning Large Language Models (LLMs) and Vision-Language Models (VLMs). It provides efficient and scalable RLHF training with support for multiple algorithms and distributed training strategies.
Q: What are the main differences between LightRFT and OpenRLHF?¶
A: LightRFT extends OpenRLHF with:
Enhanced multimodal (VLM) support
More RL algorithms (GRPO, GSPO, GMPO, REINFORCE++, CPGD, etc.)
Better memory optimization (engine sleep, optimizer offload)
Improved inference engines (vLLM, SGLang with FP8)
Reward model co-location for efficiency
More flexible distributed training strategies
Q: Which models are supported?¶
A: LightRFT supports:
LLMs: Qwen, Qwen2.5, LLaMA, Mistral, and most HuggingFace models
VLMs: Qwen-VL, Qwen2-VL, LLaVA
Custom: Easy to add new models via monkey patching
Q: What hardware is required?¶
A: Minimum requirements:
GPU: NVIDIA GPUs with CUDA 11.8+
Memory: 40GB+ VRAM per GPU recommended (24GB possible with optimizations)
PyTorch: 2.5.1+
Python: 3.8+
For production: 8× A100/H100 80GB recommended
Installation Questions¶
Q: How do I install LightRFT?¶
A: Simple installation:
git clone https://github.com/opendilab/LightRFT.git
cd LightRFT
pip install -r requirements.txt && pip install -e .
Q: Do I need to install vLLM separately?¶
A: No, vLLM is included in the requirements. However, for the latest features, you can install from source.
Training Questions¶
Q: What’s the difference between FSDP and DeepSpeed?¶
A:
FSDP: PyTorch-native, integrates tightly with the PyTorch ecosystem, supports CPU offload
DeepSpeed: More mature, ZeRO-3 optimization, generally faster
Use FSDP for maximum memory efficiency, DeepSpeed for speed.
Q: How do I choose batch sizes?¶
A: Follow this constraint:
train_batch_size >= rollout_batch_size × n_samples_per_prompt
Example for 8 GPUs:
train_batch_size=256
rollout_batch_size=64
n_samples_per_prompt=8
micro_train_batch_size=1
micro_rollout_batch_size=2
Q: Which algorithm should I use?¶
A: By task:
Math/Coding: GRPO, Dr.GRPO
Instruction Following: CPGD, GSPO
Open-ended: FIRE Sampling
Low Memory: GRPO (no critic)
Research: GMPO, REINFORCE++
Q: How many samples per prompt should I use?¶
A: Typical values:
4-8: Standard, good balance
16+: Better quality, slower training
32+: Best-of-N scenarios
More samples give better advantage estimates but increase rollout time.
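As a rough intuition for this trade-off, here is a standalone sketch (not LightRFT code; it simulates binary rewards for a hypothetical prompt) showing that the per-prompt mean reward, which group-based methods use as a baseline, becomes less noisy as the number of samples per prompt grows:
import torch

torch.manual_seed(0)
success_rate = 0.5  # hypothetical per-prompt success probability
for n in (4, 8, 16, 32):
    # Simulate many prompts, each with n binary rewards, and measure how
    # noisy the per-prompt mean (the baseline) is across prompts.
    rewards = (torch.rand(10_000, n) < success_rate).float()
    print(n, rewards.mean(dim=1).std().item())
The spread of the per-prompt mean roughly halves each time the sample count quadruples, which is why larger values help but with diminishing returns.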
Q: Can I use multiple reward models?¶
A: Yes! LightRFT supports:
Multiple reward models in parallel
Reward model co-location (same GPU as training)
Remote reward model servers
Weighted reward combination
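For the weighted combination in the last item, a minimal illustrative sketch (the scorer functions and weights below are hypothetical stand-ins for reward models, not LightRFT APIs):
def combined_reward_fn(responses, labels, weights=(0.7, 0.3)):
    # Hypothetical scorers standing in for two reward models.
    correctness = [1.0 if r.strip() == l.strip() else 0.0 for r, l in zip(responses, labels)]
    brevity = [1.0 / (1.0 + len(r)) for r in responses]
    # Weighted sum of the individual reward signals, one score per response.
    return [weights[0] * c + weights[1] * b for c, b in zip(correctness, brevity)]
A callable like this can be passed as reward_fn, as shown in the custom reward function example later in this FAQ.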
Q: How do I enable multimodal (VLM) training?¶
A: Use the VLM training script:
python train_vl.py \
--pretrain /path/to/Qwen2-VL \
--mixed_mm_data \
--packing_samples
Performance Questions¶
Q: How do I reduce memory usage?¶
A: Use these techniques:
Enable gradient checkpointing (see the sketch after this list):
--gradient_checkpointing
Use FSDP with CPU offload:
--fsdp --fsdp_cpu_offload
Lower engine memory:
--engine_mem_util 0.4
Use ZeRO-3:
--zero_stage 3
Reduce batch sizes
Enable engine sleep:
--enable_engine_sleep
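Conceptually, gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. A standalone HuggingFace Transformers sketch of the same idea (not LightRFT's internal wiring; the model path is a placeholder):
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/path/to/model")  # placeholder path
model.gradient_checkpointing_enable()  # recompute activations in backward to save memory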
Q: How do I speed up training?¶
A:
Increase batch sizes (if memory allows)
Use FP8 inference (vLLM)
Enable Flash Attention (see the sketch after this list):
--flash_attn
Reduce n_samples_per_prompt if possible
Use tensor parallelism for inference:
--engine_tp_size 2
Optimize NCCL:
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
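For reference, Flash Attention on a plain HuggingFace model is typically enabled at load time; a standalone sketch of the idea (not how LightRFT wires --flash_attn internally; the model path is a placeholder):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/model",                         # placeholder path
    torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)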
Q: What’s the typical training speed?¶
A: On 8× A100 80GB:
7B model: ~1000 samples/min
13B model: ~500 samples/min
34B model: ~200 samples/min
70B model: ~50 samples/min
These figures assume FSDP with the optimizations described above.
Q: How do I use multiple nodes?¶
A: Use SLURM or Ray:
# SLURM example
srun -N2 --gres=gpu:8 --ntasks-per-node=8 bash train.sh
# Or use torchrun
torchrun --nproc_per_node=8 \
--nnodes=2 \
--node_rank=$NODE_RANK \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train.py
Algorithm Questions¶
Q: What’s the difference between GRPO and PPO?¶
A:
GRPO: Group-normalized advantages, no critic network
PPO: Uses separate value network (critic)
GRPO is simpler and more memory-efficient.
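A minimal sketch of the group-normalization idea (illustrative only, not LightRFT's exact implementation): rewards for the n samples of each prompt are standardized within their group, so no learned value network is needed.
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: (num_prompts, n_samples_per_prompt) scalar rewards
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_normalized_advantages(rewards))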
Q: When should I use CPGD?¶
A: Use CPGD when:
Fine-tuning pre-trained models
Want to preserve base capabilities
Need controlled policy updates
Preventing catastrophic forgetting
Q: What is Clip Higher?¶
A: An improved clipping scheme with separate upper/lower bounds for positive/negative advantages. Better for:
Noisy rewards
Large distribution shifts
Unstable training
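One common formulation uses a wider upper clipping range than lower range, so samples with positive advantages are clipped less aggressively. A minimal sketch of that idea (the parameter names eps_low and eps_high are hypothetical, not LightRFT arguments):
import torch

def clip_higher_surrogate(ratio, advantages, eps_low=0.2, eps_high=0.28):
    # Asymmetric clipping: the probability ratio may move further above 1 than below it.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.min(ratio * advantages, clipped * advantages).mean()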
Debugging Questions¶
Q: Training crashes with OOM error¶
A: See the Troubleshooting Guide
Q: num_rollouts_per_episodes = 0 error¶
A: Your train_batch_size is too small. Ensure:
train_batch_size >= rollout_batch_size × n_samples_per_prompt
Q: Model not improving / Reward not increasing¶
A: Check:
Learning rate too high/low
KL penalty too large
Reward model quality
Enable reward normalization (see the sketch after this list):
--reward_running_norm
Try a different advantage estimator
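Conceptually, running reward normalization keeps running statistics of observed rewards and standardizes each new batch with them; a standalone sketch of the idea (not LightRFT's implementation of --reward_running_norm):
import torch

class RunningRewardNorm:
    """Track a running mean/std of rewards and standardize new batches."""

    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, rewards: torch.Tensor) -> torch.Tensor:
        for r in rewards.tolist():  # Welford's online mean/variance update
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        return (rewards - self.mean) / (std + self.eps)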
Q: NCCL timeout or hanging¶
A:
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# Pin the network interface; disable InfiniBand transport if it is misconfigured
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
Q: vLLM engine initialization fails¶
A:
Check GPU memory:
--engine_mem_util 0.5
Reduce TP size:
--engine_tp_size 1
Check CUDA compatibility
Update vLLM:
pip install -U vllm
Evaluation Questions¶
Q: How do I evaluate on benchmarks?¶
A: For math benchmarks, use the evaluation scripts in the examples directory:
# Refer to the examples/gsm8k_geo3k directory for evaluation scripts
# See the example training scripts for evaluation configurations
Q: Can I save generation trajectories?¶
A: Yes, use the trajectory saver:
from lightrft.utils import TrajectorySaver
saver = TrajectorySaver(output_dir="./trajectories")
# Automatically saves prompts, responses, rewards
Q: How do I integrate with W&B?¶
A:
python train.py \
--use_wandb your-project \
--wandb_org your-org \
--wandb_run_name experiment-1
Advanced Questions¶
Q: Can I implement custom algorithms?¶
A: Yes! Extend the trainer class:
from lightrft.trainer import SPMDPPOTrainer

class CustomTrainer(SPMDPPOTrainer):
    def compute_advantages(self, *args, **kwargs):
        # Your custom advantage computation
        pass
Q: How do I add a new model architecture?¶
A: Create a monkey patch in lightrft/models/monkey_patch/:
# your_model.py
def patch_your_model(model):
    # Add custom forward methods
    pass
# In apply.py
from .your_model import patch_your_model
Q: Can I use custom reward functions?¶
A: Yes, pass a callable:
def custom_reward_fn(responses, labels):
    # Your reward computation
    return rewards

trainer = SPMDPPOTrainer(
    ...,
    reward_fn=custom_reward_fn
)
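For instance, a minimal concrete reward (assuming responses and labels are lists of strings) that checks whether the reference answer appears in the response:
def contains_answer_reward(responses, labels):
    # 1.0 if the reference answer string appears in the response, else 0.0
    return [1.0 if l.strip() in r else 0.0 for r, l in zip(responses, labels)]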
Q: How do I checkpoint during training?¶
A: Checkpoints are automatic:
--save_path ./checkpoints \
--save_interval 1 \
--max_ckpt_num 3
Resume with:
--load_checkpoint \
--ckpt_path ./checkpoints/episode_5
Contributing Questions¶
Q: How can I contribute to LightRFT?¶
A:
Fork the repository
Create a feature branch
Implement your changes
Add tests
Submit a pull request
See Contributing Guide for details.
Q: How do I report bugs?¶
A: Open an issue on GitHub Issues with:
Environment details (GPU, CUDA, PyTorch versions)
Full error traceback
Minimal reproduction script
Expected vs actual behavior
Q: Where can I get help?¶
A:
GitHub Issues for bugs
Discussions for questions
Documentation for guides
Examples directory for code samples