
Strategy Usage Guide

Overview

LightRFT’s strategy module wraps the distributed training backends and extends them with additional features for efficient reinforcement learning fine-tuning. The strategy provides a unified interface for managing:

  • Distributed Training Backends: DeepSpeed ZeRO and FSDP (Fully Sharded Data Parallel)

  • Inference Engine Integration: vLLM and SGLang for high-throughput generation

  • Memory Optimization: Optimizer offloading, gradient accumulation, and engine sleep modes

  • Sequence Parallelism: Efficient handling of long sequences across multiple GPUs

Core API Extensions

LightRFT adds the following key methods to the strategy interface:

Method                           Purpose
------------------------------   ----------------------------------------------------
setup_inference_engine()         Initialize vLLM or SGLang inference engine
update_engine_weights()          Synchronize actor model weights to inference engine
gather_and_generate()            Distributed generation with automatic prompt gathering
maybe_load_optimizer()           Load optimizer states from CPU (FSDP only)
maybe_offload_optimizer()        Offload optimizer states to CPU (FSDP only)
wakeup_inference_engine()        Wake up inference engine from sleep mode
maybe_sleep_inference_engine()   Put inference engine to sleep to save memory

Creating a Strategy

Basic Setup

Use the factory function get_strategy() to create a strategy instance:

from lightrft.strategy import get_strategy
from lightrft.utils import add_arguments

def train(args):
    # Create strategy (automatically selects DeepSpeed or FSDP based on args)
    strategy = get_strategy(args)

    # Setup inference engine for generation
    strategy.setup_inference_engine(args, engine_type='vllm')

    # Access the engine if needed
    vllm_engine = strategy.inference_engine

    # Create trainer (actor, critic, reward_model, optimizers, and schedulers
    # are assumed to be constructed earlier in the script)
    trainer = SPMDPPOTrainer(
        strategy=strategy,
        actor=actor,
        critic=critic,
        reward_model=reward_model,
        initial_model=initial_model,
        ema_model=ema_model,
        actor_optim=actor_optim,
        critic_optim=critic_optim,
        actor_scheduler=actor_scheduler,
        critic_scheduler=critic_scheduler,
        ...
    )

Strategy Selection

The strategy type is automatically determined by configuration arguments:

  • FSDP: Set --fsdp flag

  • DeepSpeed: Default when --fsdp is not set (configurable via --zero_stage)

Using Strategy in Trainers

Standard Training Operations

The strategy provides standard distributed training operations:

# Backward pass
strategy.backward(loss, model, optimizer)

# Optimizer step with gradient clipping
strategy.optimizer_step(optimizer, model, scheduler, name="actor")

# Distributed communication
averaged_value = strategy.all_reduce(local_value, op="mean")
gathered_values = strategy.all_gather(local_value)

Memory-Optimized Training

For FSDP-based training, use optimizer offloading to reduce GPU memory:

def ppo_train(self, global_steps=0):
    torch.cuda.synchronize()
    train_begin = time.time()

    # Load optimizer states from CPU to GPU (FSDP only)
    self.strategy.maybe_load_optimizer(self.actor_optim)

    # Perform training
    train_ret = super().ppo_train(global_steps)

    # Offload optimizer states from GPU to CPU (FSDP only)
    self.strategy.maybe_offload_optimizer(self.actor_optim)

    torch.cuda.synchronize()
    self.strategy.print(f"PPO Train TIMECOST {time.time() - train_begin}")

    # Synchronize actor weights to inference engine
    self.strategy.update_engine_weights(self.actor)

    return train_ret

Engine Weight Synchronization

After training updates, synchronize model weights to the inference engine:

# Update inference engine with latest actor weights
strategy.update_engine_weights(actor)

This ensures that the inference engine uses the most recent model parameters for generation.
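The ordering matters: generation should always run on weights that reflect the latest PPO update. The sketch below illustrates one possible per-iteration ordering using only the calls documented on this page; num_steps, prompt_token_ids, and sampling_params are illustrative placeholders, and the weight sync may instead happen inside ppo_train(), as in the override shown above.

# Minimal sketch of the per-iteration ordering (illustrative only).
for step in range(num_steps):
    # Rollout: generate responses with the current engine weights.
    outputs = strategy.gather_and_generate(
        sampling_params=sampling_params,
        all_prompt_token_ids=prompt_token_ids,
        sleep_engine=True,
    )

    # Train: run PPO updates on the collected experience.
    trainer.ppo_train(global_steps=step)

    # Sync: push the updated actor weights to the inference engine
    # so that the next rollout uses the new policy.
    strategy.update_engine_weights(actor)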

Using Strategy in Experience Makers

Text Generation (LLM)

Use gather_and_generate() for distributed text generation:

# Tokenize prompts (without padding for efficiency)
all_prompt_token_ids = self.tokenize_fn(
    all_prompts,
    self.prompt_max_len,
    padding=False
)["input_ids"]

# Generate responses with automatic distribution
all_outputs = self.strategy.gather_and_generate(
    sampling_params=sampling_params,
    all_prompt_token_ids=all_prompt_token_ids,
    sleep_engine=True  # Automatically sleep engine after generation
)

if dist.get_rank(self.vllm_mp_group) == 0:  # dist is torch.distributed
    self.strategy.print(f"Generated {len(all_outputs)} outputs")

Multimodal Generation (VLM)

For vision-language models with images:

# Generate with multimodal inputs
all_outputs = self.strategy.gather_and_generate(
    sampling_params=sampling_params,
    all_prompts=all_prompts,        # Text prompts
    all_images=all_images,          # Image data
    images_num=images_num,          # Number of images per prompt
    sleep_engine=True
)

How gather_and_generate() Works

The method performs the following operations:

  1. Gather: Collects prompts from all ranks within the tensor-parallel group to rank 0

    • Example: With world_size=8 and engine_tp_size=4, ranks [0,1,2,3] gather to rank 0, and ranks [4,5,6,7] gather to rank 4

  2. Generate: Executes inference using the vLLM/SGLang engine on the gathered prompts

  3. Distribute: Scatters the generated outputs back to the originating ranks in the same order

  4. Sleep Management: Automatically handles engine sleep/wake cycles based on the sleep_engine parameter

Note

Users don’t need to manually manage engine sleep states when using this interface.
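The grouping in step 1 follows simple rank arithmetic. The snippet below is a plain-Python illustration of the example above, not LightRFT code:

# Each rank belongs to tensor-parallel group rank // engine_tp_size, and
# prompts are gathered to the first rank of that group (which runs the engine).
world_size = 8
engine_tp_size = 4

for rank in range(world_size):
    group_id = rank // engine_tp_size
    gather_target = group_id * engine_tp_size
    print(f"rank {rank} -> group {group_id}, gathers to rank {gather_target}")

# ranks 0-3 gather to rank 0; ranks 4-7 gather to rank 4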

Required Arguments

Add LightRFT-specific arguments to your argument parser:

from lightrft.utils import add_arguments
import argparse

# Create parser
parser = argparse.ArgumentParser()

# Add LightRFT arguments
add_arguments(parser)

# Parse arguments
args = parser.parse_args()

Key Arguments

Inference Engine Configuration:

--engine_tp_size 4              # Tensor parallelism size for inference engine
--engine_mem_util 0.85          # GPU memory utilization for KV cache (0.0-1.0)
--engine_type vllm              # Engine type: 'vllm' or 'sglang'
--enable_engine_sleep           # Enable engine sleep mode (default: True)
--disable_engine_sleep          # Disable engine sleep mode

Distributed Training:

--fsdp                          # Use FSDP instead of DeepSpeed
--zero_stage 2                  # DeepSpeed ZeRO stage (1, 2, or 3)
--fsdp_cpu_offload              # Offload FSDP optimizer states to CPU
--adam_offload                  # Offload Adam optimizer states
--sp_size 2                     # Sequence parallelism size

Training Optimization:

--packing_samples               # Pack multiple samples into sequences
--use_mp_opt                    # Use mixed precision optimizer (FSDP)
--fused_linear_logprob          # Fused linear layer and logprob computation
--chunk_size 4096               # Chunk size for fused operations

Monitoring:

--log_dir ./logs                # Directory for logs and visualizations
--plot_every 10                 # Plot generation length distribution every N steps

Strategy Implementation Details

Available Strategies

LightRFT provides two main strategy implementations:

  1. DeepspeedStrategy (default)

    • Uses DeepSpeed ZeRO for memory-efficient training

    • Configurable ZeRO stages (1, 2, or 3)

    • Supports gradient accumulation and mixed precision

    • Best for: General RLHF training, well-established workflows

  2. FSDPV2Strategy (when --fsdp is set)

    • Uses PyTorch’s Fully Sharded Data Parallel

    • Supports CPU offloading for optimizer states

    • Native PyTorch implementation that integrates directly with the rest of the PyTorch ecosystem

    • Best for: Maximum memory efficiency, PyTorch-native workflows

Strategy Selection Logic

# In get_strategy() function
if args.fsdp:
    strategy = FSDPV2Strategy(...)
else:
    strategy = DeepspeedStrategy(...)

Engine Sleep/Wake Mechanism

The strategy provides automatic memory management through engine sleep modes:

# Engine lifecycle management
strategy.setup_inference_engine(args, engine_type='vllm')  # Creates and wakes engine
strategy.maybe_sleep_inference_engine()                     # Sleep to save memory
strategy.wakeup_inference_engine()                          # Wake for generation

Important

When using gather_and_generate() with sleep_engine=True, the sleep/wake cycle is handled automatically.

Configuration Examples

High-Throughput Setup (8 GPUs, DeepSpeed)

# Using DeepSpeed ZeRO-2 with large tensor parallelism
python train.py \
    --zero_stage 2 \
    --engine_tp_size 4 \
    --engine_mem_util 0.9 \
    --enable_engine_sleep \
    --micro_train_batch_size 1 \
    --train_batch_size 128

Memory-Efficient Setup (8 GPUs, FSDP with CPU Offload)

# Using FSDP with CPU offloading for maximum memory efficiency
python train.py \
    --fsdp \
    --fsdp_cpu_offload \
    --use_mp_opt \
    --engine_tp_size 2 \
    --engine_mem_util 0.85 \
    --enable_engine_sleep \
    --micro_train_batch_size 1 \
    --train_batch_size 64

Vision-Language Model Setup

# Training VLMs with multimodal data
python train_vl.py \
    --fsdp \
    --engine_tp_size 4 \
    --mixed_mm_data \
    --packing_samples \
    --enable_engine_sleep \
    --plot_every 20

Best Practices

1. Tensor Parallelism Configuration

  • Set engine_tp_size to match your model size and GPU count

  • For 7B models: engine_tp_size=1 or 2

  • For 13B-70B models: engine_tp_size=4 or 8

  • Ensure world_size % engine_tp_size == 0
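The last constraint is easy to verify up front. A small sanity check, with illustrative placeholder values:

# Check that the inference engine tensor-parallel size divides the world size.
import torch.distributed as dist

world_size = dist.get_world_size() if dist.is_initialized() else 8  # example value
engine_tp_size = 4

assert world_size % engine_tp_size == 0, (
    f"world_size ({world_size}) must be divisible by engine_tp_size ({engine_tp_size})"
)
num_engine_replicas = world_size // engine_tp_size  # independent engine groups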

2. Memory Management

  • Enable engine sleep mode for memory-constrained setups: --enable_engine_sleep

  • Adjust engine_mem_util based on available memory (0.5-0.9)

  • Use FSDP with CPU offload for maximum memory savings: --fsdp --fsdp_cpu_offload

3. Performance Optimization

  • Use --packing_samples for varied sequence lengths

  • Enable --fused_linear_logprob for large vocabulary models

  • Set appropriate micro_train_batch_size to saturate GPU utilization

4. Debugging and Monitoring

  • Use --plot_every with --log_dir to track generation length distribution

  • Monitor memory with strategy.report_memory(prefix="checkpoint_name")

  • Check engine status with strategy.inference_engine_status
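For example (call names as referenced above; the exact output format depends on the LightRFT version):

# Periodic monitoring inside the training loop.
strategy.report_memory(prefix="after_rollout")  # log current GPU memory usage
strategy.print(f"engine status: {strategy.inference_engine_status}")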

Advanced Features

Sequence Parallelism

Enable sequence parallelism for very long sequences:

# In arguments
--sp_size 4  # Split sequence across 4 GPUs

The strategy automatically creates sequence-parallel groups and handles communication.
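Conceptually, ranks are partitioned into sequence-parallel groups of size sp_size, and each group splits the tokens of one long sequence across its GPUs. The snippet below sketches one plausible grouping for illustration only; it is not the actual LightRFT group-creation code.

# With world_size=8 and --sp_size 4, a contiguous grouping looks like this.
world_size = 8
sp_size = 4
sp_groups = [list(range(i, i + sp_size)) for i in range(0, world_size, sp_size)]
print(sp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]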

Custom Reward Models

For multiple reward models or remote reward APIs:

# Multiple reward models
reward_models = [reward_model_1, reward_model_2, reward_model_3]
strategy = get_strategy(args)

# Models are automatically sharded across GPUs
prepared_rms = [strategy.prepare_model(rm, shard_size=8) for rm in reward_models]

Mixed Precision Training

Control mixed precision behavior:

# Enable BF16 training
--bf16

# Use mixed precision optimizer (FSDP)
--use_mp_opt

Troubleshooting

Common Issues

Issue: Out of memory during generation

  • Solution: Reduce engine_mem_util or increase engine_tp_size

Issue: Engine not updating with new weights

  • Solution: Ensure update_engine_weights() is called after training

Issue: Slow generation speed

  • Solution: Increase micro_rollout_batch_size or reduce engine_tp_size

Issue: FSDP optimizer offload errors

  • Solution: Verify you’re using the FSDP strategy (--fsdp) and that maybe_load_optimizer() and maybe_offload_optimizer() are called in matching pairs

API Reference

For detailed API documentation, see:

  • lightrft.strategy.strategy_base.StrategyBase - Base strategy class

  • lightrft.strategy.get_strategy() - Strategy factory function

  • lightrft.utils.add_arguments() - Argument configuration