Supported Algorithms¶
LightRFT supports a rich ecosystem of reinforcement learning algorithms for fine-tuning large language models. This comprehensive guide provides algorithm details and implementation references.
Purpose of This Guide¶
With the rapid pace of algorithmic innovation in the RFT field, this guide helps you:
Quickly identify which algorithms suit your needs
Understand implementation by mapping algorithms to code modules
Plan integration of multiple algorithms by identifying synergies or conflicts
Maintain clarity through documented relationships between algorithms and components
Algorithm Overview with Implementation¶
| Algorithm | Type | Module | Description | Implementation | Paper |
|---|---|---|---|---|---|
| GRPO | Policy Optimization | Advantage Estimation | Uses group-based normalization for advantage estimation without requiring a separate value network | FastExperienceMaker._get_return_advs() | |
| GSPO | Policy Optimization | Policy Loss | Applies importance weighting and clipping at the sequence level rather than the token level | PolicyLoss.forward() | |
| REINFORCE++ | Advantage Estimation | Advantage Estimation | Modifies return and advantage calculation with improved baseline estimation | FastExperienceMaker._get_return_advs() | |
| CPGD | Advantage Estimation | Advantage Estimation | Adds KL-based drift constraint and clipped log-ratio for stable return/advantage computation | FastExperienceMaker._get_return_advs() | |
| FIRE Sampling | Sampling Strategy | Experience Generation | Modifies the sample generation process with filtering and ranking strategies | FastExperienceMaker.generate_samples() | |
| GMPO | Policy Optimization | Policy Loss | Optimizes the geometric mean of token-level objectives for robustness to outlier importance ratios | PolicyLoss.forward() | |
| Dr.GRPO | Policy Optimization | Policy Loss | Introduces an unbiased policy optimization to mitigate length bias and improve token efficiency | PolicyLoss.forward() | |
| DAPO | Policy Optimization | Policy Loss | Introduces decoupled clipping and a dynamic sampling scheme to stabilize large-scale RL optimization | PolicyLoss.forward() | |
| Token-Level Policy | Policy Optimization | Policy Loss | Optimizes policy at token granularity to improve stability and credit assignment | PolicyLoss.forward() | |
| Reward Norm/Clip | Reward Processing | Reward Processing | Applies reward normalization and clipping to stabilize advantage computation | FastExperienceMaker._get_return_advs() | |
| select_high_entropy_tokens | Policy Optimization | Policy Loss | Modifies PolicyLoss to implement high-entropy token selection during training | PolicyLoss.forward() | |
Algorithm Architecture¶
Core Training Components¶
LightRFT’s algorithm implementations are organized around three main modules:
1. Policy Loss Computation (lightrft/trainer/ppo_loss.py)¶
Purpose: Implements PPO policy loss with multiple surrogate objectives
Key Method: forward(log_probs, old_log_probs, advantages, action_mask)
Affected by: GSPO, GMPO, Dr.GRPO, DAPO, Token-Level Policy, select_high_entropy_tokens
Modification Type: Loss function design and token selection strategies
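For orientation, the sketch below shows what a standard PPO clipped surrogate with this forward() signature typically looks like. It is a minimal illustration, not the actual lightrft/trainer/ppo_loss.py code; the clip_eps default and the masked-mean reduction are assumptions.

```python
import torch

def ppo_policy_loss(log_probs, old_log_probs, advantages, action_mask, clip_eps=0.2):
    """Minimal PPO clipped-surrogate sketch matching the forward() signature above."""
    ratio = torch.exp(log_probs - old_log_probs)            # per-token importance ratio
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token_loss = -torch.min(surr1, surr2)               # pessimistic (clipped) objective
    # Average only over valid (non-padding) action tokens.
    return (per_token_loss * action_mask).sum() / action_mask.sum().clamp_min(1)
```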
2. Experience Generation (lightrft/trainer/fast_exp_maker.py)¶
Purpose: Generates experiences using vLLM and other inference backends
Key Methods:
generate_samples(): Sample generation with various strategies
_get_return_advs(): Returns and advantages calculation
Affected by: FIRE Sampling
Modification Type: Sampling strategies and inference optimization
3. Advantage & Reward Processing (lightrft/trainer/fast_exp_maker.py)¶
Purpose: Processes rewards and computes advantages for policy updates
Key Method: _get_return_advs(): Advantage estimation with various baselines
Affected by: GRPO, REINFORCE++, CPGD, Reward Norm/Clip
Modification Type: Advantage estimation methods and reward shaping
Modification Types¶
Algorithmic Changes:
Loss Design: Core objective function modifications
Advantage Estimation: Updates to advantage calculation methods
Sampling Strategy: Changes to sample generation processes
Token Selection: Which tokens are used in training
Reward Shaping: Reward preprocessing and filtering
Implementation Changes:
Efficiency Optimization: Performance improvements (e.g., FP8)
Parameter Tuning: Hyperparameter adjustments
Pipeline Integration: New components or workflow changes
Policy Optimization Algorithms¶
GRPO (Group Relative Policy Optimization)¶
Overview: GRPO uses group-based normalization for advantage estimation, providing stable training without requiring a separate value network.
Implementation: FastExperienceMaker._get_return_advs() - Advantage Estimation module
Modification Type: Advantage Estimation
Key Features:
No critic network required
Group-normalized advantages
Stable training with large batch sizes
Memory efficient
Usage:
python train.py \
--advantage_estimator group_norm \
--n_samples_per_prompt 8 \
--kl_estimator k3
Best For:
Large-scale training with limited memory
Quick prototyping without value network
Math reasoning and coding tasks
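As a rough illustration of the group-based normalization behind --advantage_estimator group_norm (a minimal sketch, not the actual _get_return_advs() implementation), each prompt's n_samples_per_prompt rewards are normalized within their own group:

```python
import torch

def group_norm_advantages(rewards, n_samples_per_prompt, eps=1e-6):
    """GRPO-style advantages: normalize each prompt's rewards within its own group.

    rewards: 1-D tensor of scalar rewards, grouped as
             [prompt0_sample0, ..., prompt0_sampleN-1, prompt1_sample0, ...].
    """
    groups = rewards.view(-1, n_samples_per_prompt)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    advantages = (groups - mean) / (std + eps)
    return advantages.view(-1)

# Example: 2 prompts x 4 samples each
adv = group_norm_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0, 0.5, 0.5, 1.0, 0.0]), 4)
```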
GSPO (Group Sequence Policy Optimization)¶
Overview: GSPO applies importance weighting and clipping at the sequence level rather than the token level, computing a single likelihood ratio per response for more stable group-based policy updates.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Loss Design
Key Features:
Generalized clipping objectives
Adaptive trust region updates
Better sample efficiency
Usage:
python train.py \
--advantage_estimator gspo \
--gspo_alpha 0.1 \
--clip_range 0.2
Best For:
Tasks requiring precise policy control
Multi-task learning scenarios
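Below is a minimal sketch of the sequence-level ratio idea, assuming the standard GSPO formulation; the lightrft implementation and the --gspo_alpha flag may apply clipping and normalization differently.

```python
import torch

def gspo_loss(log_probs, old_log_probs, seq_advantages, action_mask, clip_eps=0.2):
    """Sequence-level clipped surrogate: one importance ratio per response, computed
    from the length-normalized sum of token log-prob differences."""
    lengths = action_mask.sum(dim=-1).clamp_min(1)
    seq_log_ratio = ((log_probs - old_log_probs) * action_mask).sum(dim=-1) / lengths
    seq_ratio = torch.exp(seq_log_ratio)
    surr1 = seq_ratio * seq_advantages                      # one advantage per response
    surr2 = torch.clamp(seq_ratio, 1 - clip_eps, 1 + clip_eps) * seq_advantages
    return -torch.min(surr1, surr2).mean()
```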
GMPO (Geometric-Mean Policy Optimization)¶
Overview: GMPO optimizes the geometric mean of token-level objectives instead of the arithmetic mean, which damps the influence of outlier importance ratios and stabilizes policy updates.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Loss Design
Key Features:
Geometric-mean aggregation of token-level objectives
Robustness to outlier importance ratios
Improved training stability
Usage:
python train.py \
--advantage_estimator gmpo \
--mirror_tau 0.01
Best For:
Research applications requiring theoretical guarantees
Complex reward landscapes
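A minimal sketch of the geometric-mean idea, assuming the GMPO paper's formulation; the clip_log bound is illustrative, and the lightrft implementation (and its --mirror_tau flag) may differ.

```python
import torch

def gmpo_loss(log_probs, old_log_probs, seq_advantages, action_mask, clip_log=1.0):
    """Aggregate token-level ratios by their geometric mean (a mean in log space),
    optionally clipping each token log-ratio first to damp outliers."""
    lengths = action_mask.sum(dim=-1).clamp_min(1)
    token_log_ratio = torch.clamp(log_probs - old_log_probs, -clip_log, clip_log)
    geo_mean_ratio = torch.exp((token_log_ratio * action_mask).sum(dim=-1) / lengths)
    return -(geo_mean_ratio * seq_advantages).mean()        # one advantage per response
```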
Dr.GRPO (Group Relative Policy Optimization Done Right)¶
Overview: Dr.GRPO removes the response-length and reward-standard-deviation normalization terms from the GRPO objective, eliminating the optimization bias that inflates response length and improving token efficiency.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Loss Design (length bias mitigation)
Key Features:
Length bias mitigation
Unbiased objective (no per-response length or reward-std normalization)
Improved token efficiency
Usage:
python train.py \
--advantage_estimator group_norm \
--use_length_penalty \
--length_penalty_coef 0.01
Best For:
Tasks sensitive to response length
Instruction following
Open-ended generation
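A minimal sketch of the bias-removal idea, assuming the Dr.GRPO paper's formulation; the --use_length_penalty flags above are lightrft-specific and may realize this mitigation differently.

```python
import torch

def drgrpo_advantages(rewards, n_samples_per_prompt):
    """Center each group's rewards but skip GRPO's std normalization
    (one of the two biases Dr.GRPO removes)."""
    groups = rewards.view(-1, n_samples_per_prompt)
    return (groups - groups.mean(dim=1, keepdim=True)).view(-1)

def drgrpo_aggregate(per_token_loss, action_mask):
    """Divide by a constant budget (here the max generation length) rather than each
    response's own length, removing the length-normalization bias in the loss."""
    max_gen_len = action_mask.size(-1)
    return (per_token_loss * action_mask).sum(dim=-1).div(max_gen_len).mean()
```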
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)¶
Overview: DAPO uses separate upper and lower clipping bounds for advantage-weighted policy updates combined with dynamic sampling strategies, improving training stability.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Loss Design (decoupled clipping)
Key Features:
Decoupled clipping for positive/negative advantages
Dynamic sampling strategy
Better handling of distribution shifts
Improved stability
Usage:
python train.py \
--use_clip_higher \
--clip_range_higher 0.3 \
--clip_range_lower 0.2
Best For:
Highly noisy reward signals
Large distribution shifts
Challenging domains
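A minimal sketch of the decoupled ("clip-higher") surrogate, matching the --clip_range_higher/--clip_range_lower flags above in spirit; the actual PolicyLoss code and DAPO's dynamic-sampling component are not shown.

```python
import torch

def dapo_clip_loss(log_probs, old_log_probs, advantages, action_mask,
                   clip_lower=0.2, clip_higher=0.3):
    """Decoupled clipping: the upper clip bound is larger than the lower one, so
    low-probability tokens with positive advantage can increase more aggressively."""
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_lower, 1 + clip_higher) * advantages
    per_token_loss = -torch.min(surr1, surr2)
    return (per_token_loss * action_mask).sum() / action_mask.sum().clamp_min(1)
```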
Token-Level Policy¶
Overview: Optimizes policy at token granularity to improve stability and credit assignment.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Token Selection
Key Features:
Token-granular optimization
Improved credit assignment
Better stability in long sequences
Usage: Typically combined with other policy optimization methods through implementation modifications.
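A minimal sketch contrasting sample-level and token-level aggregation of a per-token loss; the actual PolicyLoss implementation may differ in naming and reduction details.

```python
import torch

def sequence_level_mean(per_token_loss, action_mask):
    """Sample-level aggregation: average within each response, then across responses."""
    per_seq = (per_token_loss * action_mask).sum(dim=-1) / action_mask.sum(dim=-1).clamp_min(1)
    return per_seq.mean()

def token_level_mean(per_token_loss, action_mask):
    """Token-level aggregation: every valid token in the batch gets equal weight,
    so tokens in long responses are no longer down-weighted."""
    return (per_token_loss * action_mask).sum() / action_mask.sum().clamp_min(1)
```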
Advantage Estimation Methods¶
REINFORCE++¶
Overview: An improved baseline estimation method that uses control variates to reduce variance in policy gradient estimates.
Implementation: FastExperienceMaker._get_return_advs() - Advantage Estimation module
Modification Type: Advantage Estimation
Key Features:
Lower variance gradients
Faster convergence
Compatible with all policy optimization methods
Usage:
python train.py \
--advantage_estimator reinforce_plus \
--baseline_type value_network
Best For:
High-variance environments
Sparse rewards
Combining with PPO or other on-policy methods
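A minimal sketch of a critic-free, batch-normalized advantage in the spirit of REINFORCE++; the kl_coef value and the exact return definition used in lightrft are assumptions.

```python
import torch

def reinforce_pp_advantages(rewards, kl_penalties, action_mask, kl_coef=0.01, eps=1e-6):
    """Fold a per-token KL penalty into the scalar reward, then use a global batch
    baseline and whitening instead of a learned value network."""
    # Per-response return: final reward minus accumulated KL penalty over valid tokens.
    returns = rewards - kl_coef * (kl_penalties * action_mask).sum(dim=-1)
    advantages = returns - returns.mean()          # batch-mean baseline (control variate)
    return advantages / (returns.std() + eps)      # normalize across the whole batch
```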
CPGD (Clipped Policy Gradient Optimization with Policy Drift)¶
Overview: CPGD constrains policy updates using KL-divergence to prevent catastrophic forgetting and maintain stable training.
Implementation: FastExperienceMaker._get_return_advs() - Advantage Estimation module
Modification Type: Advantage Estimation (KL-constrained)
Key Features:
KL-constrained updates
Prevents catastrophic forgetting
Adaptive constraint adjustment
Usage:
python train.py \
--advantage_estimator cpgd \
--kl_target 0.01 \
--kl_horizon 10000
Best For:
Fine-tuning pre-trained models
Preserving original capabilities
Multi-stage training
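A minimal sketch combining the two ingredients named in the overview table (a clipped log-ratio policy gradient term and a KL-style drift penalty); coefficient names and defaults are illustrative, not lightrft's.

```python
import torch

def cpgd_loss(log_probs, old_log_probs, advantages, action_mask,
              log_ratio_clip=1.0, drift_coef=0.1):
    """Clipped log-ratio policy gradient plus a policy-drift penalty toward the
    behavior policy, averaged over valid tokens."""
    log_ratio = log_probs - old_log_probs
    clipped = torch.clamp(log_ratio, -log_ratio_clip, log_ratio_clip)
    pg_term = -(clipped * advantages * action_mask)
    # Quadratic (k2-style) drift estimate: 0.5 * (log pi - log pi_old)^2 per token.
    drift = 0.5 * log_ratio.pow(2) * action_mask
    return (pg_term + drift_coef * drift).sum() / action_mask.sum().clamp_min(1)
```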
Reward Processing¶
Reward Normalization and Clipping¶
Overview: Standard reward preprocessing techniques to stabilize training.
Implementation: FastExperienceMaker._get_return_advs() - Reward Processing module
Modification Type: Reward Shaping (normalization/clipping)
Key Features:
Running reward statistics
Advantage normalization
Reward clipping
Usage:
python train.py \
--reward_running_norm \
--reward_running_norm_minus_mean \
--reward_clip 10.0 \
--advantage_clip 10.0
Best For:
All training scenarios (recommended baseline)
Reward scale varies across prompts
Training stability
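A minimal sketch of running reward normalization and clipping in the spirit of the flags above; the class name and the Welford-style update are illustrative, not the actual lightrft code.

```python
import torch

class RunningRewardNorm:
    """Track running reward statistics, then center, scale, and clip new rewards."""

    def __init__(self, clip_value=10.0, minus_mean=True, eps=1e-6):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.clip_value, self.minus_mean, self.eps = clip_value, minus_mean, eps

    def update(self, rewards: torch.Tensor) -> torch.Tensor:
        # Welford-style running mean/variance update over all rewards seen so far.
        for r in rewards.flatten().tolist():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        out = (rewards - self.mean) if self.minus_mean else rewards
        out = out / (std + self.eps)
        return out.clamp(-self.clip_value, self.clip_value)
```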
Sampling Strategies¶
FIRE Sampling¶
Overview: FIRE (Filtered and Improved Reward Estimation) combines filtering and ranking strategies for better sample selection.
Implementation: FastExperienceMaker.generate_samples() - Experience Generation module
Modification Type: Sampling Strategy
Key Features:
Multi-stage filtering
Reward-based ranking
Sample efficiency
Usage:
python train.py \
--use_fire_sampling \
--fire_filter_ratio 0.5 \
--fire_rank_method reward
Best For:
Limited computational budgets
High-quality data generation
Best-of-N sampling scenarios
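A hypothetical sketch of the filter-and-rank behavior described above; the function name, keep_ratio parameter, and integration with generate_samples() are assumptions rather than the actual lightrft API.

```python
import torch

def filter_and_rank_samples(samples, rewards, keep_ratio=0.5):
    """Rank candidate responses by reward and keep the top keep_ratio fraction."""
    rewards = torch.as_tensor(rewards, dtype=torch.float)
    k = max(1, int(len(samples) * keep_ratio))
    top_idx = torch.topk(rewards, k).indices.tolist()
    return [samples[i] for i in top_idx], rewards[top_idx]
```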
Implementation Notes¶
All policy loss algorithms modify the PolicyLoss module’s forward() method
Advantage estimation algorithms modify FastExperienceMaker’s _get_return_advs() method
Sampling strategies modify FastExperienceMaker’s generate_samples() method
Reward processing algorithms primarily work within the _get_return_advs() method
Most modifications are in core training loop components rather than peripheral utilities
References¶
For detailed algorithm descriptions and experimental results, refer to the linked papers. Implementation details can be found in the source code:
Policy Loss: lightrft/models/loss.py
Experience Maker: lightrft/trainer/fast_exp_maker.py
vLLM Utils: lightrft/strategy/vllm_utils/