Configuration Parameters¶
This comprehensive guide covers all configuration parameters available in LightRFT. Parameters are organized by category for easy reference.
Table of Contents¶
Model Parameters
Training Parameters
Batch Size Configuration
Algorithm Parameters
Distributed Training
Inference Engine
Memory Optimization
Logging and Monitoring
Checkpoint Management
Reward Processing
Multimodal (VLM) Parameters
Complete Example Configuration
Parameter Validation
Environment Variables
See Also
Model Parameters¶
--pretrain¶
Type: str
Required: Yes
Description: Path to pre-trained model and tokenizer
Example: /path/to/Qwen2.5-7B-Instruct
--reward_pretrain¶
Type: str
Default: Same as --pretrain
Description: Path to reward model
Example: /path/to/reward-model
--remote_rm_url¶
Type: str
Default: None
Description: URL for remote reward model server
Example: http://localhost:5000
--max_len¶
Type: int
Default: 4096
Description: Maximum sequence length (prompt + response)
--prompt_max_len¶
Type: int
Default: 2048
Description: Maximum prompt length
Training Parameters¶
--num_episodes¶
Type: int
Default: 1
Description: Total number of training episodes
Recommended: 10-100 for most tasks
--max_epochs¶
Type: int
Default: 1
Description: Number of training epochs per episode
Recommended: 1-3
--actor_learning_rate¶
Type: float
Default: 5e-7
Description: Learning rate for actor (policy) model
Recommended Range: 1e-7 to 5e-6
--critic_learning_rate¶
Type: float
Default: 9e-6
Description: Learning rate for critic (value) model
Recommended Range: 1e-6 to 1e-5
--lr_warmup_ratio¶
Type: float
Default: 0.03
Description: Ratio of warmup steps to total steps
Range: 0.0 to 0.1
--max_norm¶
Type: float
Default: 1.0
Description: Maximum gradient norm for clipping
--l2¶
Type: float
Default: 0.0
Description: L2 regularization coefficient
--adam_betas¶
Type: tuple[float, float]
Default: (0.9, 0.95)
Description: Adam optimizer beta parameters
Batch Size Configuration¶
Important Constraint¶
Rule: train_batch_size <= rollout_batch_size × n_samples_per_prompt
Each rollout must generate at least one full training batch of experiences; see the sketch after the example configurations below.
--train_batch_size (TBS)¶
Type: int
Required: Yes
Description: Global training batch size across all GPUs
Example: 256
Calculation: micro_train_batch_size × num_gpus × gradient_accumulation_steps
--micro_train_batch_size¶
Type: int
Default: 1
Description: Per-GPU batch size for training
Typical Values: 1, 2, 4
--rollout_batch_size (RBS)¶
Type: int
Required: Yes
Description: Global batch size for experience generation
Example: 64
Note: Must be divisible by number of GPUs
--micro_rollout_batch_size¶
Type: int
Default: 2
Description: Per-GPU batch size for rollout
Typical Values: 2, 4, 8
Example Configurations¶
Configuration 1: 8 GPUs, Memory-Constrained
--train_batch_size 128 \
--micro_train_batch_size 1 \
--rollout_batch_size 64 \
--micro_rollout_batch_size 2
Configuration 2: 8 GPUs, High-Throughput
--train_batch_size 512 \
--micro_train_batch_size 2 \
--rollout_batch_size 256 \
--micro_rollout_batch_size 8
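These relationships can be checked mechanically. The sketch below is a hypothetical helper (not LightRFT's actual validation code) that asserts the constraints and derives the implied gradient-accumulation steps, using Configuration 1 together with --n_samples_per_prompt 8 as in the complete example later in this guide:

def check_batch_config(train_bs, micro_train_bs, rollout_bs,
                       micro_rollout_bs, n_samples_per_prompt, num_gpus):
    """Sanity-check the batch-size relationships described above (sketch)."""
    # Each rollout yields rollout_bs * n_samples_per_prompt experiences;
    # they must fill at least one global training batch.
    assert train_bs <= rollout_bs * n_samples_per_prompt
    # Global batch sizes must shard evenly across GPUs.
    assert rollout_bs % (micro_rollout_bs * num_gpus) == 0
    assert train_bs % (micro_train_bs * num_gpus) == 0
    # Gradient accumulation steps implied by the configuration.
    return train_bs // (micro_train_bs * num_gpus)

# Configuration 1 with --n_samples_per_prompt 8 (as in the complete example):
print(check_batch_config(128, 1, 64, 2, n_samples_per_prompt=8, num_gpus=8))  # 16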
Algorithm Parameters¶
--advantage_estimator¶
Type: str
Choices: group_norm, reinforce, cpgd, gspo, gmpo
Default: group_norm
Description: Method for advantage estimation
Recommendation:
group_norm: General purpose (GRPO)
reinforce: Low variance needed
cpgd: Preserve base capabilities
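For orientation, group_norm computes GRPO-style advantages: each reward is normalized against the mean and standard deviation of the other responses sampled for the same prompt. A minimal sketch of the idea (not LightRFT's internals):

import torch

def group_norm_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages: normalize within each prompt's sample group.

    rewards: (num_prompts, n_samples_per_prompt) scalar rewards.
    """
    mean = rewards.mean(dim=-1, keepdim=True)   # per-prompt mean
    std = rewards.std(dim=-1, keepdim=True)     # per-prompt std
    return (rewards - mean) / (std + eps)

# Two prompts, four samples each (the default --n_samples_per_prompt).
r = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                  [0.2, 0.4, 0.6, 0.8]])
print(group_norm_advantages(r))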
--n_samples_per_prompt¶
Type: int
Default: 4
Description: Number of responses to sample per prompt
Typical Values: 4, 8, 16
Note: Higher values give better advantage estimates but slower generation
--kl_estimator¶
Type: str
Choices: k1, k2, k3
Default: k3
Description: KL divergence estimator type
Recommendation: k3 for most cases
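The k1/k2/k3 names follow Schulman's "Approximating KL Divergence" note. Assuming per-token log-probabilities of the sampled tokens under the policy and a frozen reference model, the estimators look like this (sketch only):

import torch

def kl_penalty(logp: torch.Tensor, ref_logp: torch.Tensor, kind: str = "k3") -> torch.Tensor:
    """Per-token estimators of KL(policy || reference), tokens sampled
    from the policy. logp / ref_logp are log-probs of the sampled tokens."""
    log_ratio = ref_logp - logp            # log r, with r = p_ref / p_policy
    if kind == "k1":
        return -log_ratio                  # unbiased, high variance
    if kind == "k2":
        return 0.5 * log_ratio ** 2        # biased, low variance
    if kind == "k3":
        return log_ratio.exp() - 1.0 - log_ratio  # unbiased, low variance
    raise ValueError(f"unknown estimator: {kind}")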
--init_kl_coef¶
Type: float
Default: 0.001
Description: Initial KL penalty coefficient
Range: 0.0001 to 0.01
--kl_target¶
Type: float
Default: 0.01
Description: Target KL divergence (for CPGD)
--clip_range¶
Type: float
Default: 0.2
Description: PPO clipping range
Range: 0.1 to 0.3
--clip_range_higher¶
Type: float
Default: 0.3
Description: Upper clipping range (Clip Higher algorithm)
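Together, --clip_range and --clip_range_higher define an asymmetric PPO clip: the importance ratio is clamped to [1 - clip_range, 1 + clip_range_higher], giving positive-advantage tokens more headroom to grow. A sketch of that loss (not LightRFT's exact implementation):

import torch

def clipped_policy_loss(logp, old_logp, advantages,
                        clip_range=0.2, clip_range_higher=0.3):
    """Asymmetric PPO clipping: the ratio may rise to 1 + clip_range_higher
    but may fall only to 1 - clip_range (sketch of the idea)."""
    ratio = (logp - old_logp).exp()
    clipped = ratio.clamp(1.0 - clip_range, 1.0 + clip_range_higher)
    return -torch.min(ratio * advantages, clipped * advantages).mean()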
--temperature¶
Type: float
Default: 1.0
Description: Sampling temperature
Range: 0.6 to 1.2
Note: Lower = more deterministic
--top_p¶
Type: float
Default: 0.9
Description: Nucleus sampling probability
Range: 0.8 to 1.0
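With --engine_type vllm, these sampling options correspond to fields of vLLM's SamplingParams; the mapping below is illustrative (LightRFT constructs the sampling parameters internally, and max_tokens is an assumed value):

from vllm import SamplingParams

# Illustrative mapping of the sampling flags onto vLLM's SamplingParams.
params = SamplingParams(
    temperature=0.6,   # --temperature: lower = more deterministic
    top_p=0.9,         # --top_p: nucleus sampling mass
    max_tokens=2048,   # bounded by --max_len minus the prompt length
)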
Distributed Training¶
--zero_stage¶
Type: int
Choices: 1, 2, 3
Default: 2
Description: DeepSpeed ZeRO optimization stage
Recommendation:
Stage 1: Optimizer state partitioning
Stage 2: + Gradient partitioning (recommended)
Stage 3: + Parameter partitioning (max memory saving)
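For orientation, --zero_stage selects the stage field of DeepSpeed's zero_optimization config section. A minimal config fragment as a Python dict (assumed values; LightRFT assembles its DeepSpeed config internally):

# Minimal DeepSpeed config fragment matching --zero_stage 2, --bf16,
# and --max_norm 1.0 (illustrative only).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # --micro_train_batch_size
    "gradient_accumulation_steps": 16,     # derived from the batch sizes
    "bf16": {"enabled": True},             # --bf16
    "zero_optimization": {"stage": 2},     # --zero_stage
    "gradient_clipping": 1.0,              # --max_norm
}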
--fsdp¶
Action: store_true
Default: False
Description: Use FSDP instead of DeepSpeed
When to Use: PyTorch-native workflows, maximum memory efficiency
--fsdp_cpu_offload¶
Action: store_true
Default: False
Description: Offload FSDP optimizer states to CPU
Note: Reduces GPU memory at cost of speed
--bf16¶
Action: store_true
Default: Typically enabled
Description: Use bfloat16 mixed precision
--gradient_checkpointing¶
Action: store_true
Default: False
Description: Enable gradient checkpointing
Note: Trades computation for memory
--sp_size¶
Type: int
Default: 1
Description: Sequence parallelism size
Recommendation: 1 for typical lengths; 2 or 4 for very long sequences
Inference Engine¶
--engine_type¶
Type: str
Choices: vllm, sglang
Default: vllm
Description: Inference engine type
--engine_tp_size¶
Type: int
Default: 1
Description: Tensor parallelism size for inference engine
Recommendation:
7B models: 1 or 2
13B-34B models: 2 or 4
70B+ models: 4 or 8
Constraint: world_size % engine_tp_size == 0
--engine_mem_util¶
Type: float
Default: 0.5
Range: 0.3 to 0.9
Description: GPU memory utilization for KV cache
Recommendation:
High memory: 0.8-0.9
Medium memory: 0.5-0.7
Low memory: 0.3-0.5
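With the vLLM engine, --engine_tp_size and --engine_mem_util correspond to the tensor_parallel_size and gpu_memory_utilization arguments of vllm.LLM. An illustrative standalone construction (LightRFT creates the engine for you):

from vllm import LLM

# Illustrative engine construction mirroring the flags above.
llm = LLM(
    model="/path/to/Qwen2.5-7B-Instruct",  # path as in --pretrain
    tensor_parallel_size=1,                # --engine_tp_size
    gpu_memory_utilization=0.5,            # --engine_mem_util
)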
--enable_engine_sleep¶
Action: store_true
Default: True
Description: Enable inference engine sleep mode
Note: Saves memory when engine not in use
--disable_engine_sleep¶
Action: store_false
Dest: enable_engine_sleep
Description: Disable engine sleep mode
--rm_use_engine¶
Action: store_true
Default: False
Description: Use inference engine for reward model
When to Use: High-throughput reward computation
Memory Optimization¶
--adam_offload¶
Action: store_true
Default: False
Description: Offload Adam optimizer states to CPU
--use_mp_opt¶
Action: store_true
Default: False
Description: Use mixed precision optimizer (FSDP)
--packing_samples¶
Action: store_true
Default: False
Description: Pack multiple samples into sequences
When to Use: Datasets with varied sequence lengths, to improve GPU utilization
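A greedy first-fit sketch of the packing idea (hypothetical helper, not LightRFT's implementation); samples stay separate within each packed row so attention boundaries and position ids can be rebuilt per sample:

def pack_samples(samples: list[list[int]], max_len: int) -> list[list[list[int]]]:
    """Greedy first-fit packing of tokenized samples into rows of at most
    max_len tokens. Sketch only."""
    rows: list[list[list[int]]] = []
    for sample in sorted(samples, key=len, reverse=True):
        for row in rows:
            if sum(map(len, row)) + len(sample) <= max_len:
                row.append(sample)   # fits into an existing row
                break
        else:
            rows.append([sample])    # open a new row
    return rows

# Four samples of uneven length packed into rows of at most 8 tokens.
print(pack_samples([[1] * 5, [2] * 3, [3] * 6, [4] * 2], max_len=8))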
--fused_linear_logprob¶
Action: store_true
Default: False
Description: Fused linear layer and logprob computation
Note: Saves memory for large vocabulary models
--chunk_size¶
Type: int
Default: 4096
Description: Chunk size for fused operations
Logging and Monitoring¶
--log_dir¶
Type: str
Default: None
Description: Directory for logs and visualizations
--plot_every¶
Type: int
Default: 10
Description: Plot generation length distribution every N steps
--use_tensorboard¶
Type: str
Default: None
Description: TensorBoard log directory
--use_wandb¶
Type: str
Default: None
Description: Weights & Biases project name
--wandb_org¶
Type: str
Default: None
Description: W&B organization name
--wandb_run_name¶
Type: str
Default: Auto-generated
Description: W&B run name
Checkpoint Management¶
--save_path¶
Type: str
Required: Yes
Description: Directory to save checkpoints
--ckpt_path¶
Type: str
Default: None
Description: Path to load checkpoint from
--load_checkpoint¶
Action: store_true
Default: False
Description: Enable checkpoint loading
--save_interval¶
Type: int
Default: 1
Description: Save checkpoint every N episodes
--max_ckpt_num¶
Type: int
Default: 3
Description: Maximum number of checkpoints to keep
--max_ckpt_mem¶
Type: int
Default: 1000
Description: Maximum total checkpoint storage in GB
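The retention policy implied by --save_interval and --max_ckpt_num can be pictured as a simple rotation; the helper below is hypothetical, not LightRFT's code:

import os
import shutil

def prune_checkpoints(ckpt_dir: str, max_ckpt_num: int = 3) -> None:
    """Keep only the newest max_ckpt_num checkpoint directories (sketch)."""
    ckpts = sorted(
        (os.path.join(ckpt_dir, name) for name in os.listdir(ckpt_dir)),
        key=os.path.getmtime,              # oldest first
    )
    for old in ckpts[:-max_ckpt_num]:      # everything except the newest N
        shutil.rmtree(old)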
Reward Processing¶
--reward_running_norm¶
Action: store_true
Default: False
Description: Apply running normalization to rewards
--reward_running_norm_minus_mean¶
Action: store_true
Default: False
Description: Subtract mean during reward normalization
--advantages_norm¶
Action: store_true
Default: False
Description: Normalize advantages
--advantage_clip¶
Type: float
Default: 0.0
Description: Clip advantages (0 = no clipping)
--reward_clip¶
Type: float
Default: 0.0
Description: Clip rewards (0 = no clipping)
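A sketch tying the reward flags together (running normalization, optional mean subtraction, optional clipping); hypothetical class, not LightRFT's internals:

import torch

class RunningRewardNorm:
    """Mirrors the intent of --reward_running_norm,
    --reward_running_norm_minus_mean and --reward_clip. Sketch only."""

    def __init__(self, minus_mean: bool = False, clip: float = 0.0):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.minus_mean, self.clip = minus_mean, clip

    def __call__(self, rewards: torch.Tensor) -> torch.Tensor:
        # Welford update of running mean/variance over all rewards seen.
        for r in rewards.flatten().tolist():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        out = rewards - self.mean if self.minus_mean else rewards
        out = out / (std + 1e-8)
        if self.clip > 0.0:                  # 0 disables clipping
            out = out.clamp(-self.clip, self.clip)
        return out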
Multimodal (VLM) Parameters¶
--mixed_mm_data¶
Action: store_true
Default: False
Description: Handle mixed multimodal and text-only data
--processor¶
Type: str
Default: Auto-detected
Description: Multimodal processor type
Complete Example Configuration¶
Math Reasoning (GSM8K, MATH)¶
In the commands below, the `# ...` section labels are backtick command substitutions that expand to nothing, so they can sit inside the line continuations; a plain # comment on its own line would break the command.
python train.py \
`# Model` \
--pretrain /path/to/Qwen2.5-7B-Instruct \
--reward_pretrain /path/to/reward-model \
--max_len 4096 \
--prompt_max_len 2048 \
`# Training` \
--num_episodes 20 \
--max_epochs 1 \
--actor_learning_rate 5e-7 \
--critic_learning_rate 9e-6 \
`# Batch Size` \
--train_batch_size 128 \
--micro_train_batch_size 1 \
--rollout_batch_size 64 \
--micro_rollout_batch_size 2 \
`# Algorithm` \
--advantage_estimator group_norm \
--n_samples_per_prompt 8 \
--kl_estimator k3 \
--init_kl_coef 0.001 \
--temperature 0.6 \
`# Distributed` \
--zero_stage 2 \
--bf16 \
--gradient_checkpointing \
`# Engine` \
--engine_type vllm \
--engine_tp_size 1 \
--engine_mem_util 0.85 \
--enable_engine_sleep \
`# Memory` \
--adam_offload \
--fused_linear_logprob \
`# Logging` \
--use_tensorboard ./tb_logs \
--plot_every 10 \
`# Checkpoint` \
--save_path ./checkpoints \
--save_interval 1 \
--max_ckpt_num 3 \
`# Reward` \
--reward_running_norm \
--advantages_norm
Multimodal VLM Training¶
python train_vl.py \
`# Model` \
--pretrain /path/to/Qwen2-VL-7B-Instruct \
--max_len 4096 \
`# Training` \
--num_episodes 10 \
--actor_learning_rate 1e-6 \
`# Batch Size` \
--train_batch_size 64 \
--micro_train_batch_size 1 \
--rollout_batch_size 32 \
--micro_rollout_batch_size 1 \
`# Algorithm` \
--advantage_estimator group_norm \
--n_samples_per_prompt 4 \
`# Distributed` \
--fsdp \
--fsdp_cpu_offload \
--gradient_checkpointing \
`# Engine` \
--engine_tp_size 4 \
--engine_mem_util 0.6 \
`# VLM Specific` \
--mixed_mm_data \
--packing_samples
Parameter Validation¶
LightRFT performs automatic validation of parameters. Common validation rules:
Batch Size: train_batch_size <= rollout_batch_size × n_samples_per_prompt
Divisibility: Batch sizes must be divisible by number of GPUs
Parallelism: Engine TP size must divide world size evenly
Learning Rate: Actor LR typically < Critic LR
KL Target: Should be small (0.001-0.01) for stable training
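Expressed as plain assertions, the rules read as follows (hypothetical function over parsed args; the soft learning-rate rule is shown as a hard check for brevity):

def validate(args, world_size: int) -> None:
    """Mirror of the validation rules above (sketch, not LightRFT's code)."""
    assert args.train_batch_size <= args.rollout_batch_size * args.n_samples_per_prompt
    assert args.train_batch_size % world_size == 0
    assert args.rollout_batch_size % world_size == 0
    assert world_size % args.engine_tp_size == 0
    assert args.actor_learning_rate < args.critic_learning_rate  # typical, not strict
    assert 0.001 <= args.kl_target <= 0.01, "keep the KL target small for stability"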
Environment Variables¶
Useful environment variables for optimization:
# NCCL optimization
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
# CUDA optimization
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# Debugging
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
See Also¶
Algorithm Guide - Detailed algorithm descriptions
Strategy Usage - Distributed training strategies
Installation - Setup instructions