Supported Algorithms¶
LightRFT supports a rich ecosystem of reinforcement learning algorithms for fine-tuning large language models. This comprehensive guide provides algorithm details and implementation references.
Purpose of This Guide¶
With the rapid pace of algorithmic innovation in the RFT field, this guide helps you:
Quickly identify which algorithms suit your needs
Understand implementation by mapping algorithms to code modules
Plan integration of multiple algorithms by identifying synergies or conflicts
Maintain clarity through documented relationships between algorithms and components
Algorithm Overview with Implementation¶
| Algorithm | Type | Module | Description | Implementation | Paper |
|---|---|---|---|---|---|
| GRPO | Policy Optimization | Advantage Estimation | Uses group-based normalization for advantage estimation without requiring a separate value network | FastExperienceMaker._get_return_advs() | |
| GSPO | Policy Optimization | Policy Loss | Applies importance weighting and clipping at the sequence level rather than the token level | PolicyLoss.forward() | |
| REINFORCE++ | Advantage Estimation | Advantage Estimation | Modifies return and advantage calculation with improved baseline estimation | FastExperienceMaker._get_return_advs() | |
| CPGD | Advantage Estimation | Advantage Estimation | Adds KL-based drift constraint and clipped log-ratio for stable return/advantage computation | FastExperienceMaker._get_return_advs() | |
| FIRE Sampling | Sampling Strategy | Experience Generation | Modifies the sample generation process with filtering and ranking strategies | FastExperienceMaker.generate_samples() | |
| GMPO | Policy Optimization | Policy Loss | Optimizes the geometric mean of token-level objectives for robustness to outlier importance ratios | PolicyLoss.forward() | |
| Dr.GRPO | Policy Optimization | Policy Loss | Introduces an unbiased policy optimization to mitigate length bias and improve token efficiency | PolicyLoss.forward() | |
| DAPO | Policy Optimization | Policy Loss | Introduces decoupled clipping and a dynamic sampling scheme to stabilize large-scale RL optimization | PolicyLoss.forward() | |
| Token-Level Policy | Policy Optimization | Policy Loss | Optimizes policy at token granularity to improve stability and credit assignment | PolicyLoss.forward() | |
| Reward Norm/Clip | Reward Processing | Reward Processing | Applies reward normalization and clipping to stabilize advantage computation | FastExperienceMaker._get_return_advs() | |
| select_high_entropy_tokens | Policy Optimization | Policy Loss | Modifies PolicyLoss to implement high-entropy token selection during training | PolicyLoss.forward() | |
Algorithm Architecture¶
Core Training Components¶
LightRFT’s algorithm implementations are organized around three main modules:
1. Policy Loss Computation (lightrft/trainer/ppo_loss.py)¶
Purpose: Implements PPO policy loss with multiple surrogate objectives
Key Method: forward(log_probs, old_log_probs, advantages, action_mask)
Affected by: GSPO, GMPO, Dr.GRPO, DAPO, Token-Level Policy, select_high_entropy_tokens
Modification Type: Loss function design and token selection strategies
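For orientation, the sketch below shows what a standard PPO clipped surrogate with this forward() signature typically looks like. It is a minimal illustration, not the actual lightrft/trainer/ppo_loss.py code; the clip_eps default and the masked-mean reduction are assumptions.

```python
import torch

def ppo_policy_loss(log_probs, old_log_probs, advantages, action_mask, clip_eps=0.2):
    """Minimal PPO clipped-surrogate sketch matching the forward() signature above."""
    ratio = torch.exp(log_probs - old_log_probs)            # per-token importance ratio
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token_loss = -torch.min(surr1, surr2)               # pessimistic (clipped) objective
    # Average only over valid (non-padding) action tokens.
    return (per_token_loss * action_mask).sum() / action_mask.sum().clamp_min(1)
```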
2. Experience Generation (lightrft/trainer/fast_exp_maker.py)¶
Purpose: Generates experiences using vLLM and other inference backends
Key Methods:
generate_samples(): Sample generation with various strategies
_get_return_advs(): Returns and advantages calculation
Affected by: FIRE Sampling
Modification Type: Sampling strategies and inference optimization
3. Advantage & Reward Processing (lightrft/trainer/fast_exp_maker.py)¶
Purpose: Processes rewards and computes advantages for policy updates
Key Method: _get_return_advs(): Advantage estimation with various baselines
Affected by: GRPO, REINFORCE++, CPGD, Reward Norm/Clip
Modification Type: Advantage estimation methods and reward shaping
Modification Types¶
Algorithmic Changes:
Loss Design: Core objective function modifications
Advantage Estimation: Updates to advantage calculation methods
Sampling Strategy: Changes to sample generation processes
Token Selection: Which tokens are used in training
Reward Shaping: Reward preprocessing and filtering
Implementation Changes:
Efficiency Optimization: Performance improvements (e.g., FP8)
Parameter Tuning: Hyperparameter adjustments
Pipeline Integration: New components or workflow changes
Policy Optimization Algorithms¶
GRPO (Group Relative Policy Optimization)¶
Overview: GRPO uses group-based normalization for advantage estimation, providing stable training without requiring a separate value network.
Implementation: FastExperienceMaker._get_return_advs() - Advantage Estimation module
Modification Type: Advantage Estimation
Key Features:
No critic network required
Group-normalized advantages
Stable training with large batch sizes
Memory efficient
Usage:
python train.py \
--advantage_estimator group_norm \
--n_samples_per_prompt 8 \
--kl_estimator k3
Best For:
Large-scale training with limited memory
Quick prototyping without value network
Math reasoning and coding tasks
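As a rough illustration of the group-based normalization behind --advantage_estimator group_norm (a minimal sketch, not the actual _get_return_advs() implementation), each prompt's n_samples_per_prompt rewards are normalized within their own group:

```python
import torch

def group_norm_advantages(rewards, n_samples_per_prompt, eps=1e-6):
    """GRPO-style advantages: normalize each prompt's rewards within its own group.

    rewards: 1-D tensor of scalar rewards, grouped as
             [prompt0_sample0, ..., prompt0_sampleN-1, prompt1_sample0, ...].
    """
    groups = rewards.view(-1, n_samples_per_prompt)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    advantages = (groups - mean) / (std + eps)
    return advantages.view(-1)

# Example: 2 prompts x 4 samples each
adv = group_norm_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0, 0.5, 0.5, 1.0, 0.0]), 4)
```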
GSPO (Group Sequence Policy Optimization)¶
Overview: GSPO applies importance weighting and clipping at the sequence level rather than the token level, computing a single likelihood ratio per response for more stable group-based policy updates.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Loss Design
Key Features:
Generalized clipping objectives
Adaptive trust region updates
Better sample efficiency
Usage:
python train.py \
--advantage_estimator gspo \
--gspo_alpha 0.1 \
--clip_range 0.2
Best For:
Tasks requiring precise policy control
Multi-task learning scenarios
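Below is a minimal sketch of the sequence-level ratio idea, assuming the standard GSPO formulation; the lightrft implementation and the --gspo_alpha flag may apply clipping and normalization differently.

```python
import torch

def gspo_loss(log_probs, old_log_probs, seq_advantages, action_mask, clip_eps=0.2):
    """Sequence-level clipped surrogate: one importance ratio per response, computed
    from the length-normalized sum of token log-prob differences."""
    lengths = action_mask.sum(dim=-1).clamp_min(1)
    seq_log_ratio = ((log_probs - old_log_probs) * action_mask).sum(dim=-1) / lengths
    seq_ratio = torch.exp(seq_log_ratio)
    surr1 = seq_ratio * seq_advantages                      # one advantage per response
    surr2 = torch.clamp(seq_ratio, 1 - clip_eps, 1 + clip_eps) * seq_advantages
    return -torch.min(surr1, surr2).mean()
```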
GMPO (Geometric-Mean Policy Optimization)¶
Overview: GMPO optimizes the geometric mean of token-level objectives instead of the arithmetic mean, which damps the influence of outlier importance ratios and stabilizes policy updates.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Loss Design
Key Features:
Geometric-mean aggregation of token-level objectives
Robustness to outlier importance ratios
Improved training stability
Usage:
python train.py \
--advantage_estimator gmpo \
--mirror_tau 0.01
Best For:
Research applications requiring theoretical guarantees
Complex reward landscapes
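A minimal sketch of the geometric-mean idea, assuming the GMPO paper's formulation; the clip_log bound is illustrative, and the lightrft implementation (and its --mirror_tau flag) may differ.

```python
import torch

def gmpo_loss(log_probs, old_log_probs, seq_advantages, action_mask, clip_log=1.0):
    """Aggregate token-level ratios by their geometric mean (a mean in log space),
    optionally clipping each token log-ratio first to damp outliers."""
    lengths = action_mask.sum(dim=-1).clamp_min(1)
    token_log_ratio = torch.clamp(log_probs - old_log_probs, -clip_log, clip_log)
    geo_mean_ratio = torch.exp((token_log_ratio * action_mask).sum(dim=-1) / lengths)
    return -(geo_mean_ratio * seq_advantages).mean()        # one advantage per response
```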
Dr.GRPO (Group Relative Policy Optimization Done Right)¶
Overview: Dr.GRPO removes the response-length and reward-standard-deviation normalization terms from the GRPO objective, eliminating the optimization bias that inflates response length and improving token efficiency.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Loss Design (length bias mitigation)
Key Features:
Length bias mitigation
Unbiased objective (no per-response length or reward-std normalization)
Improved token efficiency
Usage:
python train.py \
--advantage_estimator group_norm \
--use_length_penalty \
--length_penalty_coef 0.01
Best For:
Tasks sensitive to response length
Instruction following
Open-ended generation
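A minimal sketch of the bias-removal idea, assuming the Dr.GRPO paper's formulation; the --use_length_penalty flags above are lightrft-specific and may realize this mitigation differently.

```python
import torch

def drgrpo_advantages(rewards, n_samples_per_prompt):
    """Center each group's rewards but skip GRPO's std normalization
    (one of the two biases Dr.GRPO removes)."""
    groups = rewards.view(-1, n_samples_per_prompt)
    return (groups - groups.mean(dim=1, keepdim=True)).view(-1)

def drgrpo_aggregate(per_token_loss, action_mask):
    """Divide by a constant budget (here the max generation length) rather than each
    response's own length, removing the length-normalization bias in the loss."""
    max_gen_len = action_mask.size(-1)
    return (per_token_loss * action_mask).sum(dim=-1).div(max_gen_len).mean()
```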
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)¶
Overview: DAPO uses separate upper and lower clipping bounds for advantage-weighted policy updates combined with dynamic sampling strategies, improving training stability.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Loss Design (decoupled clipping)
Key Features:
Decoupled clipping for positive/negative advantages
Dynamic sampling strategy
Better handling of distribution shifts
Improved stability
Usage:
python train.py \
--use_clip_higher \
--clip_range_higher 0.3 \
--clip_range_lower 0.2
Best For:
Highly noisy reward signals
Large distribution shifts
Challenging domains
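A minimal sketch of the decoupled ("clip-higher") surrogate, matching the --clip_range_higher/--clip_range_lower flags above in spirit; the actual PolicyLoss code and DAPO's dynamic-sampling component are not shown.

```python
import torch

def dapo_clip_loss(log_probs, old_log_probs, advantages, action_mask,
                   clip_lower=0.2, clip_higher=0.3):
    """Decoupled clipping: the upper clip bound is larger than the lower one, so
    low-probability tokens with positive advantage can increase more aggressively."""
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_lower, 1 + clip_higher) * advantages
    per_token_loss = -torch.min(surr1, surr2)
    return (per_token_loss * action_mask).sum() / action_mask.sum().clamp_min(1)
```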
Token-Level Policy¶
Overview: Optimizes policy at token granularity to improve stability and credit assignment.
Implementation: PolicyLoss.forward() - Policy Loss module
Modification Type: Token Selection
Key Features:
Token-granular optimization
Improved credit assignment
Better stability in long sequences
Usage: Typically combined with other policy optimization methods through implementation modifications.
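A minimal sketch contrasting sample-level and token-level aggregation of a per-token loss; the actual PolicyLoss implementation may differ in naming and reduction details.

```python
import torch

def sequence_level_mean(per_token_loss, action_mask):
    """Sample-level aggregation: average within each response, then across responses."""
    per_seq = (per_token_loss * action_mask).sum(dim=-1) / action_mask.sum(dim=-1).clamp_min(1)
    return per_seq.mean()

def token_level_mean(per_token_loss, action_mask):
    """Token-level aggregation: every valid token in the batch gets equal weight,
    so tokens in long responses are no longer down-weighted."""
    return (per_token_loss * action_mask).sum() / action_mask.sum().clamp_min(1)
```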
Advantage Estimation Methods¶
REINFORCE++¶
Overview: An improved baseline estimation method that uses control variates to reduce variance in policy gradient estimates.
Implementation: FastExperienceMaker._get_return_advs() - Advantage Estimation module
Modification Type: Advantage Estimation
Key Features:
Lower variance gradients
Faster convergence
Compatible with all policy optimization methods
Usage:
python train.py \
--advantage_estimator reinforce_plus \
--baseline_type value_network
Best For:
High-variance environments
Sparse rewards
Combining with PPO or other on-policy methods
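A minimal sketch of a critic-free, batch-normalized advantage in the spirit of REINFORCE++; the kl_coef value and the exact return definition used in lightrft are assumptions.

```python
import torch

def reinforce_pp_advantages(rewards, kl_penalties, action_mask, kl_coef=0.01, eps=1e-6):
    """Fold a per-token KL penalty into the scalar reward, then use a global batch
    baseline and whitening instead of a learned value network."""
    # Per-response return: final reward minus accumulated KL penalty over valid tokens.
    returns = rewards - kl_coef * (kl_penalties * action_mask).sum(dim=-1)
    advantages = returns - returns.mean()          # batch-mean baseline (control variate)
    return advantages / (returns.std() + eps)      # normalize across the whole batch
```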
CPGD (Clipped Policy Gradient Optimization with Policy Drift)¶
Overview: CPGD constrains policy updates using KL-divergence to prevent catastrophic forgetting and maintain stable training.
Implementation: FastExperienceMaker._get_return_advs() - Advantage Estimation module
Modification Type: Advantage Estimation (KL-constrained)
Key Features:
KL-constrained updates
Prevents catastrophic forgetting
Adaptive constraint adjustment
Usage:
python train.py \
--advantage_estimator cpgd \
--kl_target 0.01 \
--kl_horizon 10000
Best For:
Fine-tuning pre-trained models
Preserving original capabilities
Multi-stage training
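A minimal sketch combining the two ingredients named in the overview table (a clipped log-ratio policy gradient term and a KL-style drift penalty); coefficient names and defaults are illustrative, not lightrft's.

```python
import torch

def cpgd_loss(log_probs, old_log_probs, advantages, action_mask,
              log_ratio_clip=1.0, drift_coef=0.1):
    """Clipped log-ratio policy gradient plus a policy-drift penalty toward the
    behavior policy, averaged over valid tokens."""
    log_ratio = log_probs - old_log_probs
    clipped = torch.clamp(log_ratio, -log_ratio_clip, log_ratio_clip)
    pg_term = -(clipped * advantages * action_mask)
    # Quadratic (k2-style) drift estimate: 0.5 * (log pi - log pi_old)^2 per token.
    drift = 0.5 * log_ratio.pow(2) * action_mask
    return (pg_term + drift_coef * drift).sum() / action_mask.sum().clamp_min(1)
```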
Reward Processing¶
Reward Normalization and Clipping¶
Overview: Standard reward preprocessing techniques to stabilize training.
Implementation: FastExperienceMaker._get_return_advs() - Reward Processing module
Modification Type: Reward Shaping (normalization/clipping)
Key Features:
Running reward statistics
Advantage normalization
Reward clipping
Usage:
python train.py \
--reward_running_norm \
--reward_running_norm_minus_mean \
--reward_clip 10.0 \
--advantage_clip 10.0
Best For:
All training scenarios (recommended baseline)
Reward scale varies across prompts
Training stability
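A minimal sketch of running reward normalization and clipping in the spirit of the flags above; the class name and the Welford-style update are illustrative, not the actual lightrft code.

```python
import torch

class RunningRewardNorm:
    """Track running reward statistics, then center, scale, and clip new rewards."""

    def __init__(self, clip_value=10.0, minus_mean=True, eps=1e-6):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.clip_value, self.minus_mean, self.eps = clip_value, minus_mean, eps

    def update(self, rewards: torch.Tensor) -> torch.Tensor:
        # Welford-style running mean/variance update over all rewards seen so far.
        for r in rewards.flatten().tolist():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        out = (rewards - self.mean) if self.minus_mean else rewards
        out = out / (std + self.eps)
        return out.clamp(-self.clip_value, self.clip_value)
```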
Sampling Strategies¶
FIRE Sampling¶
Overview: FIRE (Filtered and Improved Reward Estimation) combines filtering and ranking strategies for better sample selection.
Implementation: FastExperienceMaker.generate_samples() - Experience Generation module
Modification Type: Sampling Strategy
Key Features:
Multi-stage filtering
Reward-based ranking
Sample efficiency
Usage:
python train.py \
--use_fire_sampling \
--fire_filter_ratio 0.5 \
--fire_rank_method reward
Best For:
Limited computational budgets
High-quality data generation
Best-of-N sampling scenarios
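A hypothetical sketch of the filter-and-rank behavior described above; the function name, keep_ratio parameter, and integration with generate_samples() are assumptions rather than the actual lightrft API.

```python
import torch

def filter_and_rank_samples(samples, rewards, keep_ratio=0.5):
    """Rank candidate responses by reward and keep the top keep_ratio fraction."""
    rewards = torch.as_tensor(rewards, dtype=torch.float)
    k = max(1, int(len(samples) * keep_ratio))
    top_idx = torch.topk(rewards, k).indices.tolist()
    return [samples[i] for i in top_idx], rewards[top_idx]
```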
Implementation Notes¶
All policy loss algorithms modify the PolicyLoss module’s forward() method
Advantage estimation algorithms modify FastExperienceMaker’s _get_return_advs() method
Sampling strategies modify FastExperienceMaker’s generate_samples() method
Reward processing algorithms primarily work within the _get_return_advs() method
Most modifications are in core training loop components rather than peripheral utilities
References¶
For detailed algorithm descriptions and experimental results, refer to the linked papers. Implementation details can be found in the source code:
Policy Loss: lightrft/models/loss.py
Experience Maker: lightrft/trainer/fast_exp_maker.py
vLLM Utils: lightrft/strategy/vllm_utils/