Frequently Asked Questions (FAQ)

Common questions and answers about LightRFT.

General Questions

Q: What is LightRFT?

A: LightRFT (Light Reinforcement Fine-Tuning) is an advanced reinforcement learning framework designed for the reinforcement fine-tuning of Large Language Models (LLMs) and Vision-Language Models (VLMs). It supports multiple models, algorithms, distributed training strategies, and inference engines, providing efficient and scalable RLHF and RLVR training capabilities.

Q: What are the main differences between LightRFT and OpenRLHF?

A: LightRFT extends OpenRLHF with:

  • Enhanced multimodal (VLM) support

  • More RL algorithms (GRPO, GSPO, GMPO, REINFORCE++, CPGD, etc.)

  • Comprehensive Reward Model support, including Scalar Reward Models (SRM) and Generative Reward Models (GRM)

  • Better memory optimization (engine sleep, optimizer offload)

  • Improved inference engines (vLLM, SGLang)

  • Reward model co-location for efficiency

  • More flexible distributed training strategies, supporting FSDP and DeepSpeed ZeRO

Q: Which models are supported?

A: LightRFT supports:

  • LLM: Qwen, Qwen2.5 and most HuggingFace models

  • VLM: Qwen-VL, Qwen2-VL

  • Audio: Qwen2-Audio

  • Custom: Easily inherit and extend existing model architectures

Q: What hardware is required?

A: Minimum requirements:

  • GPU: NVIDIA GPUs with CUDA 12.8+

  • Memory: 40GB+ VRAM per GPU recommended (24GB possible with optimizations)

  • PyTorch: 2.9.1+

  • Python: 3.12+

For production: 8× A100/H100 80GB recommended

Installation Questions

Q: How do I install LightRFT?

A: Standard installation (includes SGLang + Flash-Attention):

git clone https://github.com/opendilab/LightRFT.git
cd LightRFT
pip install -e .

Q: How do I install vLLM?

A: vLLM is optional. Install it with:

# Option 1: Install as optional dependency
pip install ".[vllm]"

# Option 2: Install vLLM directly
pip install "vllm>=0.13.3"

Note: SGLang is the default inference backend and is already included in the standard installation.

Q: What if Flash-Attention installation fails?

A: Try these solutions:

Option 1: Use pre-compiled wheel (Recommended)

# Download from https://github.com/Dao-AILab/flash-attention/releases
# For CUDA 12.x with PyTorch 2.9 and Python 3.12:
pip install flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl

Option 2: Use Docker (Easiest)

docker pull opendilab/lightrft:v0.1.0

Training Questions

Q: What’s the difference between FSDP and DeepSpeed?

A: Both implement Fully Sharded Data Parallelism (ZeRO-3/FSDP), but they differ in design philosophy:

  • FSDP (PyTorch Native):

    • Deep Integration: Seamlessly works with PyTorch ecosystem including Autograd and torch.compile.

    • High Flexibility: Offers programmatic control over sharding units via auto_wrap_policy (see the sketch after this comparison).

    • Composability: Easier to combine with other native features like Tensor Parallelism.

  • DeepSpeed (Microsoft):

    • All-in-One Toolkit: Provides built-in CPU/NVMe offloading (ZeRO-Infinity) and high-performance optimizers.

    • Declarative Config: Simple setup via JSON configuration files, abstracting away complexity.

    • Custom Kernels: Contains many manual CUDA optimizations for peak performance in specific setups.

Recommendation: Use FSDP for a native PyTorch experience, complex model customization, or when combining with torch.compile. Use DeepSpeed for ease of use or for extreme model sizes that require NVMe offloading.
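
To illustrate the auto_wrap_policy control mentioned above, here is a minimal sketch in plain PyTorch; it is not what LightRFT's --fsdp flag does internally, and the model name and decoder-layer class are assumptions chosen only for illustration.

# Minimal FSDP sketch (assumes torch.distributed is already initialized, e.g. via torchrun).
# Qwen2DecoderLayer is used only as an example wrapping unit.
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.qwen2.modeling_qwen2 import Qwen2DecoderLayer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16
)

# Shard at decoder-layer granularity: each Qwen2DecoderLayer becomes one FSDP unit.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Qwen2DecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)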

Q: Which algorithm should I use?

A: By task:

  • Math/Coding: GRPO, Dr.GRPO

  • Instruction Following: CPGD, GSPO

  • Open-ended: FIRE Sampling

  • Low Memory: GRPO (no critic)

  • Research: GMPO, REINFORCE++

Q: How many samples per prompt should I use?

A: Typical values:

  • 4-8: Standard, good balance

  • 16+: Better quality, slower training

  • 32+: Best-of-N scenarios

More samples per prompt give more stable advantage estimates but make rollouts slower, as sketched below.
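
To make the trade-off concrete, here is a minimal sketch of group-normalized advantage estimation in plain PyTorch (illustrative only, not LightRFT's internal implementation): a prompt's advantages are its rewards normalized by the group mean and standard deviation, so larger groups yield more reliable statistics.

import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, n_samples_per_prompt), one scalar reward per sampled response.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# With 2 samples the group statistics are very noisy; with 8 they are noticeably more stable.
rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]])
print(group_normalized_advantages(rewards))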

Q: Can I use multiple reward models?

A: Yes! LightRFT supports:

  • Multiple reward models in parallel

  • Reward model co-location (same GPU as training)

  • Remote reward model servers

  • Weighted reward combination (see the sketch below)
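
As a rough illustration of weighted reward combination, the helper below is a hypothetical sketch; the function name, dictionary layout, and weights are assumptions, not LightRFT's actual interface.

from typing import Dict
import torch

def combine_rewards(per_model_rewards: Dict[str, torch.Tensor],
                    weights: Dict[str, float]) -> torch.Tensor:
    # Weighted sum of per-sample rewards from several reward models.
    # per_model_rewards maps a reward-model name to a tensor of shape (batch,).
    total = torch.zeros_like(next(iter(per_model_rewards.values())))
    for name, rewards in per_model_rewards.items():
        total = total + weights.get(name, 1.0) * rewards
    return total

# Example: blend a scalar helpfulness model with a rule-based verifier.
combined = combine_rewards(
    {"helpfulness_srm": torch.tensor([0.7, 0.2]), "verifier": torch.tensor([1.0, 0.0])},
    weights={"helpfulness_srm": 0.3, "verifier": 0.7},
)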

Performance Questions

Q: How do I reduce memory usage?

A: Use these techniques:

  1. Enable gradient checkpointing: --gradient_checkpointing

  2. Use FSDP with CPU offload: --fsdp --fsdp_cpu_offload

  3. Lower engine memory: --engine_mem_util 0.4

  4. Use ZeRO-3: --zero_stage 3

  5. Reduce batch sizes

  6. Enable engine sleep: --enable_engine_sleep

Q: How do I speed up training?

A:

  1. Increase batch sizes (if memory allows)

  2. Use FP8 inference (Work in Progress, only in vLLM)

  3. Enable Flash Attention: --flash_attn

  4. Reduce n_samples_per_prompt if possible

  5. Use tensor parallelism for inference: --engine_tp_size 2

  6. Optimize NCCL: export TORCH_NCCL_AVOID_RECORD_STREAMS=1

Q: What’s the typical training speed?

A: On 8× A100 80GB:

  • 7B model: ~1000 samples/min

  • 13B model: ~500 samples/min

  • 34B model: ~200 samples/min

  • 70B model: ~50 samples/min

With FSDP and optimizations.

Algorithm Questions

Q: What’s the difference between GRPO and PPO?

A:

  • GRPO: Group-normalized advantages, no critic network

  • PPO: Uses separate value network (critic)

GRPO is simpler and more memory-efficient.

Q: What is Clip Higher?

A: An improved clipping scheme with separate upper/lower bounds for positive/negative advantages (see the sketch after this list). Better for:

  • Noisy rewards

  • Large distribution shifts

  • Unstable training
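
A minimal sketch of the idea in plain PyTorch, assuming a standard PPO-style surrogate; the bound values and the exact way LightRFT parameterizes them are assumptions for illustration.

import torch

def clip_higher_surrogate(log_probs, old_log_probs, advantages,
                          eps_low: float = 0.2, eps_high: float = 0.28):
    # Asymmetric clipping: the upper bound (1 + eps_high) is looser than the
    # lower bound (1 - eps_low), so strongly positive advantages keep more
    # gradient signal while the update stays bounded.
    # Returns the surrogate objective to be maximized.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.min(ratio * advantages, clipped * advantages).mean()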

Debugging Questions

Q: Training crashes with OOM error

A: See the Troubleshooting Guide.

Q: num_rollouts_per_episodes = 0 error

A: Your train_batch_size is too small. Ensure:

train_batch_size >= rollout_batch_size × n_samples_per_prompt
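
As an illustrative check with made-up numbers: if rollout_batch_size = 128 and n_samples_per_prompt = 8, each rollout produces 128 × 8 = 1024 samples, so train_batch_size should be at least 1024; with a smaller value the derived num_rollouts_per_episodes rounds down to 0.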

Q: Model not improving / Reward not increasing

A: Check the following:

  1. Learning rate too high or too low

  2. KL penalty too large

  3. Reward model quality

  4. Reward normalization disabled (enable it with --reward_running_norm)

  5. Advantage estimator choice (try a different estimator)

Q: NCCL timeout or hanging

A:

# Turn on verbose logging to locate the hang
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# Pin the network interface and, if needed, disable InfiniBand
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
# If collectives still time out, increase the timeout passed to
# torch.distributed.init_process_group (or your launcher's equivalent)

Q: vLLM engine initialization fails

A:

  1. Check GPU memory: --engine_mem_util 0.5

  2. Reduce TP size: --engine_tp_size 1

  3. Check CUDA compatibility

  4. Update vLLM: pip install -U vllm

Evaluation Questions

Q: How do I evaluate on benchmarks?

A: For math benchmarks, use the evaluation scripts in the examples directory:

# Refer to the examples/gsm8k_geo3k directory for evaluation scripts
# See the example training scripts for evaluation configurations

Q: Can I save generation trajectories?

A: Yes, use the trajectory saver:

from lightrft.utils import TrajectorySaver

saver = TrajectorySaver(output_dir="./trajectories")
# Automatically saves prompts, responses, rewards

Q: How do I integrate with W&B?

A:

python train.py \
    --use_wandb your-project \
    --wandb_org your-org \
    --wandb_run_name experiment-1

Advanced Questions

Q: Can I implement custom algorithms?

A: Yes! Extend the trainer class:

from lightrft.trainer import SPMDPPOTrainer

class CustomTrainer(SPMDPPOTrainer):
    def compute_advantages(self, ...):
        # Your custom advantage computation
        pass

Q: How do I add a new model architecture?

A: There are two methods:

  1. Standard approach: Inherit from base classes like ActorLanguage or ActorVL, and add the implementation in the lightrft/models/ directory.

  2. Monkey Patching: Create a monkey patch in lightrft/models/monkey_patch/:

# your_model.py
def patch_your_model(model):
    # Add custom forward methods, etc.
    pass

# Register in apply.py
from .your_model import patch_your_model

Q: Can I use custom reward functions?

A: Yes, pass a callable:

def custom_reward_fn(responses, labels):
    # Your reward computation
    return rewards

trainer = SPMDPPOTrainer(
    ...,
    reward_fn=custom_reward_fn
)

Q: How do I checkpoint during training?

A: Checkpoints are saved automatically; configure them with:

--save_path ./checkpoints \
--save_interval 1 \
--max_ckpt_num 3

Resume with:

--load_checkpoint \
--ckpt_path ./checkpoints/episode_5

Contributing Questions

Q: How can I contribute to LightRFT?

A:

  1. Fork the repository

  2. Create a feature branch

  3. Implement your changes

  4. Add tests

  5. Submit a pull request

See the Contributing Guide for details.

Q: How do I report bugs?

A: Open an issue on GitHub Issues with:

  • Environment details (GPU, CUDA, PyTorch versions)

  • Full error traceback

  • Minimal reproduction script

  • Expected vs actual behavior

Q: Where can I get help?

A:

  • GitHub Issues for bugs

  • Discussions for questions

  • Documentation for guides

  • Examples directory for code samples

Additional Resources