# Supported Algorithms

LightRFT supports a rich ecosystem of reinforcement learning algorithms for fine-tuning large language models. This guide provides algorithm details and implementation references.

### Purpose of This Guide

With the rapid pace of development in the RFT field and a steady stream of new algorithms, this guide helps you:

1. **Quickly identify** which algorithms suit your needs
2. **Understand implementation** by mapping algorithms to code modules
3. **Plan integration** of multiple algorithms by identifying synergies or conflicts
4. **Maintain clarity** through documented relationships between algorithms and components

### Algorithm Overview with Implementation

| Algorithm | Type | Module | Description | Implementation | Paper |
|-----------|------|--------|-------------|----------------|-------|
| **GRPO** | Policy Optimization | Advantage Estimation | Uses group-based normalization for advantage estimation without requiring a separate value network | `FastExperienceMaker._get_return_advs()` | [arXiv:2402.03300](https://arxiv.org/pdf/2402.03300) |
| **GSPO** | Policy Optimization | Policy Loss | Defines the importance ratio and clipping at the sequence level rather than the token level | `PolicyLoss.forward()` | [arXiv:2507.18071](https://arxiv.org/abs/2507.18071) |
| **REINFORCE++** | Advantage Estimation | Advantage Estimation | Modifies return and advantage calculation with improved baseline estimation | `FastExperienceMaker._get_return_advs()` | [arXiv:2501.03262](https://arxiv.org/abs/2501.03262) |
| **CPGD** | Advantage Estimation | Advantage Estimation | Adds a KL-based drift constraint and clipped log-ratio for stable return/advantage computation | `FastExperienceMaker._get_return_advs()` | [arXiv:2505.12504](https://arxiv.org/abs/2505.12504) |
| **FIRE Sampling** | Sampling Strategy | Experience Generation | Modifies the sample generation process with filtering and ranking strategies | `FastExperienceMaker.generate_samples()` | [arXiv:2410.21236](https://arxiv.org/abs/2410.21236) |
| **GMPO** | Policy Optimization | Policy Loss | Optimizes the geometric mean of token-level objectives for more stable updates | `PolicyLoss.forward()` | [arXiv:2507.20673](https://arxiv.org/abs/2507.20673) |
| **Dr.GRPO** | Policy Optimization | Policy Loss | Introduces an unbiased policy optimization objective to mitigate length bias and improve token efficiency | `PolicyLoss.forward()` | [arXiv:2503.20783](https://arxiv.org/abs/2503.20783) |
| **DAPO** | Policy Optimization | Policy Loss | Introduces decoupled clipping and a dynamic sampling scheme to stabilize large-scale RL optimization | `PolicyLoss.forward()` | [arXiv:2503.14476](https://arxiv.org/abs/2503.14476) |
| **Token-Level Policy** | Policy Optimization | Policy Loss | Optimizes the policy at token granularity to improve stability and credit assignment | `PolicyLoss.forward()` | [arXiv:2503.14476](https://arxiv.org/abs/2503.14476) |
| **Reward Norm/Clip** | Reward Processing | Reward Processing | Applies reward normalization and clipping to stabilize advantage computation | `FastExperienceMaker._get_return_advs()` | [GitHub](https://github.com/alibaba/ROLL) |
| **select_high_entropy_tokens** | Policy Optimization | Policy Loss | Modifies `PolicyLoss` to implement high-entropy token selection during training | `PolicyLoss.forward()` | [arXiv:2506.01939](https://arxiv.org/abs/2506.01939) |
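For orientation, the snippet below is a minimal, self-contained sketch of the group-normalized advantage computed by GRPO-style estimators (the first row of the table). It is illustrative only: the function name is hypothetical, and LightRFT's actual logic lives in `FastExperienceMaker._get_return_advs()`.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of GRPO-style group-normalized advantages.

    rewards: (num_prompts, n_samples_per_prompt), one scalar reward per
    sampled response.  Each response's advantage is its reward standardized
    against the mean/std of its own group, so no value network is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)   # per-prompt baseline
    std = rewards.std(dim=-1, keepdim=True)     # per-prompt scale
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(group_normalized_advantages(rewards))  # zero mean within each row (group)
```

Because the baseline comes from sibling responses to the same prompt, no separate value network has to be trained.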
### Algorithm Architecture

#### Core Training Components

LightRFT's algorithm implementations are organized around three main modules:

##### 1. Policy Loss Computation (`lightrft/trainer/ppo_loss.py`)

- **Purpose**: Implements the PPO policy loss with multiple surrogate objectives
- **Key Method**: `forward(log_probs, old_log_probs, advantages, action_mask)`
- **Affected by**: GSPO, GMPO, Dr.GRPO, DAPO, Token-Level Policy, select_high_entropy_tokens
- **Modification Type**: Loss function design and token selection strategies

##### 2. Experience Generation (`lightrft/trainer/fast_exp_maker.py`)

- **Purpose**: Generates experiences using vLLM and other inference backends
- **Key Methods**:
  - `generate_samples()`: Sample generation with various strategies
  - `_get_return_advs()`: Return and advantage calculation
- **Affected by**: FIRE Sampling
- **Modification Type**: Sampling strategies and inference optimization

##### 3. Advantage & Reward Processing (`lightrft/trainer/fast_exp_maker.py`)

- **Purpose**: Processes rewards and computes advantages for policy updates
- **Key Method**: `_get_return_advs()`: Advantage estimation with various baselines
- **Affected by**: GRPO, REINFORCE++, CPGD, Reward Norm/Clip
- **Modification Type**: Advantage estimation methods and reward shaping

#### Modification Types

**Algorithmic Changes**:

- **Loss Design**: Core objective function modifications
- **Advantage Estimation**: Updates to advantage calculation methods
- **Sampling Strategy**: Changes to sample generation processes
- **Token Selection**: Selecting which tokens contribute to the training loss
- **Reward Shaping**: Reward preprocessing and filtering

**Implementation Changes**:

- **Efficiency Optimization**: Performance improvements (e.g., FP8)
- **Parameter Tuning**: Hyperparameter adjustments
- **Pipeline Integration**: New components or workflow changes

### Policy Optimization Algorithms

#### GRPO (Group Relative Policy Optimization)

**Overview**: GRPO uses group-based normalization for advantage estimation, providing stable training without requiring a separate value network.

**Implementation**: `FastExperienceMaker._get_return_advs()` - Advantage Estimation module

**Modification Type**: Advantage Estimation

**Key Features**:
- No critic network required
- Group-normalized advantages
- Stable training with large batch sizes
- Memory efficient

**Usage**:
```bash
python train.py \
    --advantage_estimator group_norm \
    --n_samples_per_prompt 8 \
    --kl_estimator k3
```

**Best For**:
- Large-scale training with limited memory
- Quick prototyping without a value network
- Math reasoning and coding tasks

---

#### GSPO (Group Sequence Policy Optimization)

**Overview**: GSPO defines the importance ratio, clipping, and optimization at the sequence level rather than the token level, giving finer control over policy updates and more stable training on long responses.

**Implementation**: `PolicyLoss.forward()` - Policy Loss module

**Modification Type**: Loss Design

**Key Features**:
- Sequence-level importance ratios and clipping
- More stable updates for long responses
- Better sample efficiency

**Usage**:
```bash
python train.py \
    --advantage_estimator gspo \
    --gspo_alpha 0.1 \
    --clip_range 0.2
```

**Best For**:
- Tasks requiring precise policy control
- Multi-task learning scenarios
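To make the sequence-level idea concrete, here is a hedged, self-contained sketch of a GSPO-style surrogate in PyTorch. It is not LightRFT's `PolicyLoss.forward()` implementation; the function names, the default `clip_eps`, and the assumption that `advantages` holds one scalar per sequence are all illustrative.

```python
import torch

def sequence_level_ratio(log_probs: torch.Tensor,
                         old_log_probs: torch.Tensor,
                         action_mask: torch.Tensor) -> torch.Tensor:
    """Length-normalized sequence-level importance ratio (GSPO-style).

    log_probs, old_log_probs: (batch, seq_len) per-token log-probabilities.
    action_mask: (batch, seq_len), 1 for response tokens, 0 for padding.
    Returns one ratio per sequence: exp(mean per-token log-ratio).
    """
    log_ratio = (log_probs - old_log_probs) * action_mask
    lengths = action_mask.sum(dim=-1).clamp(min=1)
    return torch.exp(log_ratio.sum(dim=-1) / lengths)

def gspo_style_loss(log_probs, old_log_probs, advantages, action_mask,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate applied once per sequence rather than per token."""
    ratio = sequence_level_ratio(log_probs, old_log_probs, action_mask)
    surr1 = ratio * advantages                                  # advantages: (batch,)
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()

# Tiny example: batch of 2 sequences, 3 tokens each.
lp = torch.log(torch.tensor([[0.5, 0.4, 0.3], [0.2, 0.6, 0.9]]))
old = torch.log(torch.tensor([[0.4, 0.4, 0.4], [0.3, 0.5, 0.8]]))
mask = torch.ones(2, 3)
adv = torch.tensor([1.0, -0.5])
print(gspo_style_loss(lp, old, adv, mask))
```

By contrast, a token-level PPO loss clips the per-token ratio `exp(log_probs - old_log_probs)` directly; the GSPO paper argues that a sequence-level ratio better matches sequence-level rewards and avoids variance accumulating over long generations.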
---

#### GMPO (Geometric-Mean Policy Optimization)

**Overview**: GMPO optimizes the geometric mean, rather than the arithmetic mean, of token-level objectives, which dampens the influence of outlier importance ratios and yields more stable policy updates.

**Implementation**: `PolicyLoss.forward()` - Policy Loss module

**Modification Type**: Loss Design

**Key Features**:
- Geometric-mean aggregation of token-level objectives
- Reduced sensitivity to outlier importance ratios
- More stable training

**Usage**:
```bash
python train.py \
    --advantage_estimator gmpo \
    --mirror_tau 0.01
```

**Best For**:
- Training runs that suffer from unstable importance ratios
- Complex reward landscapes

---

#### Dr.GRPO (Group Relative Policy Optimization Done Right)

**Overview**: Dr.GRPO removes the length and reward-standard-deviation normalization terms that bias the GRPO objective, mitigating the tendency toward progressively longer responses and improving token efficiency.

**Implementation**: `PolicyLoss.forward()` - Policy Loss module

**Modification Type**: Loss Design (length bias mitigation)

**Key Features**:
- Unbiased policy optimization objective
- Length bias mitigation
- Improved token efficiency and response quality

**Usage**:
```bash
python train.py \
    --advantage_estimator group_norm \
    --use_length_penalty \
    --length_penalty_coef 0.01
```

**Best For**:
- Tasks sensitive to response length
- Instruction following
- Open-ended generation

---

#### DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)

**Overview**: DAPO uses separate upper and lower clipping bounds for advantage-weighted policy updates, combined with dynamic sampling strategies, to improve training stability at scale.

**Implementation**: `PolicyLoss.forward()` - Policy Loss module

**Modification Type**: Loss Design (decoupled clipping)

**Key Features**:
- Decoupled upper/lower clipping ranges (clip-higher)
- Dynamic sampling strategy
- Better handling of distribution shifts
- Improved stability

**Usage**:
```bash
python train.py \
    --use_clip_higher \
    --clip_range_higher 0.3 \
    --clip_range_lower 0.2
```

**Best For**:
- Highly noisy reward signals
- Large distribution shifts
- Challenging domains

---

#### Token-Level Policy

**Overview**: Optimizes the policy at token granularity to improve stability and credit assignment.

**Implementation**: `PolicyLoss.forward()` - Policy Loss module

**Modification Type**: Token Selection

**Key Features**:
- Token-granular optimization
- Improved credit assignment
- Better stability in long sequences

**Usage**: Typically combined with other policy optimization methods through implementation modifications.

### Advantage Estimation Methods

#### REINFORCE++

**Overview**: REINFORCE++ brings PPO-style stabilization techniques (token-level KL penalties, clipping, and global batch-wise advantage normalization) to REINFORCE, reducing gradient variance without a critic network.

**Implementation**: `FastExperienceMaker._get_return_advs()` - Advantage Estimation module

**Modification Type**: Advantage Estimation

**Key Features**:
- Lower-variance gradients
- Faster convergence
- Compatible with all policy optimization methods

**Usage**:
```bash
python train.py \
    --advantage_estimator reinforce_plus \
    --baseline_type value_network
```

**Best For**:
- High-variance environments
- Sparse rewards
- Combining with PPO or other on-policy methods
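For contrast with GRPO's per-group normalization shown earlier, the following hedged sketch illustrates the global, batch-level advantage normalization associated with REINFORCE++-style estimators. The function name is illustrative and this is not LightRFT's `_get_return_advs()` code.

```python
import torch

def global_normalized_advantages(returns: torch.Tensor,
                                 eps: float = 1e-8) -> torch.Tensor:
    """REINFORCE++-style variance reduction without a critic.

    returns: flat tensor of per-response returns (e.g., reward minus a KL
    penalty), pooled across the whole batch rather than per prompt group.
    Normalizing against batch statistics acts as a simple baseline and keeps
    gradient scales comparable across updates.
    """
    return (returns - returns.mean()) / (returns.std() + eps)

# Example: one normalization over a mixed batch drawn from several prompts.
returns = torch.tensor([1.0, 0.0, 0.5, 0.9, 0.1, 0.3])
print(global_normalized_advantages(returns))
```

Compared with per-group normalization, the batch-level statistics remain informative even when all responses to a single prompt receive identical rewards.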
---

#### CPGD (Clipped Policy Gradient Optimization with Policy Drift)

**Overview**: CPGD adds a KL-based policy-drift constraint and clips the log-ratio of policy updates, keeping the policy close to its reference and preventing catastrophic forgetting during training.

**Implementation**: `FastExperienceMaker._get_return_advs()` - Advantage Estimation module

**Modification Type**: Advantage Estimation (KL-constrained)

**Key Features**:
- KL-constrained updates
- Prevents catastrophic forgetting
- Adaptive constraint adjustment

**Usage**:
```bash
python train.py \
    --advantage_estimator cpgd \
    --kl_target 0.01 \
    --kl_horizon 10000
```

**Best For**:
- Fine-tuning pre-trained models
- Preserving original capabilities
- Multi-stage training

### Reward Processing

#### Reward Normalization and Clipping

**Overview**: Standard reward preprocessing techniques to stabilize training.

**Implementation**: `FastExperienceMaker._get_return_advs()` - Reward Processing module

**Modification Type**: Reward Shaping (normalization/clipping)

**Key Features**:
- Running reward statistics
- Advantage normalization
- Reward clipping

**Usage**:
```bash
python train.py \
    --reward_running_norm \
    --reward_running_norm_minus_mean \
    --reward_clip 10.0 \
    --advantage_clip 10.0
```

**Best For**:
- All training scenarios (recommended baseline)
- Reward scales that vary across prompts
- Training stability

### Sampling Strategies

#### FIRE Sampling

**Overview**: FIRE (Flaming-hot Initiation with Regular Execution) modifies the rollout process, e.g. by sampling the initial token at a high temperature before continuing with regular decoding, and combines this with filtering and ranking of the resulting samples.

**Implementation**: `FastExperienceMaker.generate_samples()` - Experience Generation module

**Modification Type**: Sampling Strategy

**Key Features**:
- High-temperature initiation for more diverse rollouts
- Reward-based filtering and ranking
- Sample efficiency

**Usage**:
```bash
python train.py \
    --use_fire_sampling \
    --fire_filter_ratio 0.5 \
    --fire_rank_method reward
```

**Best For**:
- Limited computational budgets
- High-quality data generation
- Best-of-N sampling scenarios

### Implementation Notes

- All policy loss algorithms modify the **PolicyLoss** module's `forward()` method
- Advantage estimation algorithms modify **FastExperienceMaker**'s `_get_return_advs()` method
- Sampling strategies modify **FastExperienceMaker**'s `generate_samples()` method
- Reward processing algorithms also work primarily within the `_get_return_advs()` method
- Most modifications live in core training-loop components rather than peripheral utilities

### References

For detailed algorithm descriptions and experimental results, refer to the linked papers. Implementation details can be found in the source code:

- Policy Loss: `lightrft/models/loss.py`
- Experience Maker: `lightrft/trainer/fast_exp_maker.py`
- vLLM Utils: `lightrft/strategy/vllm_utils/`
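As a final illustration of the Reward Processing options above, here is a hedged, self-contained sketch of running reward normalization with clipping, in the spirit of `--reward_running_norm`, `--reward_running_norm_minus_mean`, and `--reward_clip`. The class and its methods are hypothetical and do not correspond to LightRFT's API.

```python
import torch

class RunningRewardNormalizer:
    """Sketch of running reward normalization and clipping.

    Maintains running mean/variance across batches (Welford's algorithm),
    optionally subtracts the mean, and clips the normalized reward to a
    fixed range to stabilize downstream advantage computation.
    """

    def __init__(self, clip: float = 10.0, minus_mean: bool = True, eps: float = 1e-8):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.clip, self.minus_mean, self.eps = clip, minus_mean, eps

    def update(self, rewards: torch.Tensor) -> None:
        # Welford update of running mean and squared deviations.
        for r in rewards.flatten().tolist():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards: torch.Tensor) -> torch.Tensor:
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        centered = rewards - self.mean if self.minus_mean else rewards
        return torch.clamp(centered / (std + self.eps), -self.clip, self.clip)

# Example: update statistics on a batch, then normalize and clip it.
normalizer = RunningRewardNormalizer()
batch = torch.tensor([2.0, -1.0, 0.5, 3.0])
normalizer.update(batch)
print(normalizer.normalize(batch))
```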