Configuration Parameters

This comprehensive guide covers all configuration parameters available in LightRFT. Parameters are organized by category for easy reference.

Table of Contents

  1. Model Parameters

  2. Training Parameters

  3. Batch Size Configuration

  4. Algorithm Parameters

  5. Distributed Training

  6. Inference Engine

  7. Memory Optimization

  8. Logging and Monitoring

  9. Checkpoint Management

  10. Reward Processing

  11. Multimodal (VLM) Parameters

  12. Complete Example Configuration

  13. Parameter Validation

  14. Environment Variables

Model Parameters

--pretrain

  • Type: str

  • Required: Yes

  • Description: Path to pre-trained model and tokenizer

  • Example: /path/to/Qwen2.5-7B-Instruct

--reward_pretrain

  • Type: str

  • Default: Same as --pretrain

  • Description: Path to reward model

  • Example: /path/to/reward-model

--remote_rm_url

  • Type: str

  • Default: None

  • Description: URL for remote reward model server

  • Example: http://localhost:5000

--max_len

  • Type: int

  • Default: 4096

  • Description: Maximum sequence length (prompt + response)

--prompt_max_len

  • Type: int

  • Default: 2048

  • Description: Maximum prompt length
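Together, the model parameters form the first block of a launch command. A minimal sketch (paths are placeholders):

--pretrain /path/to/Qwen2.5-7B-Instruct \
--reward_pretrain /path/to/reward-model \
--max_len 4096 \
--prompt_max_len 2048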

Training Parameters

--num_episodes

  • Type: int

  • Default: 1

  • Description: Total number of training episodes

  • Recommended: 10-100 for most tasks

--max_epochs

  • Type: int

  • Default: 1

  • Description: Number of training epochs per episode

  • Recommended: 1-3

--actor_learning_rate

  • Type: float

  • Default: 5e-7

  • Description: Learning rate for actor (policy) model

  • Recommended Range: 1e-7 to 5e-6

--critic_learning_rate

  • Type: float

  • Default: 9e-6

  • Description: Learning rate for critic (value) model

  • Recommended Range: 1e-6 to 1e-5

--lr_warmup_ratio

  • Type: float

  • Default: 0.03

  • Description: Ratio of warmup steps to total steps

  • Range: 0.0 to 0.1

--max_norm

  • Type: float

  • Default: 1.0

  • Description: Maximum gradient norm for clipping

--l2

  • Type: float

  • Default: 0.0

  • Description: L2 regularization coefficient

--adam_betas

  • Type: tuple[float, float]

  • Default: (0.9, 0.95)

  • Description: Adam optimizer beta parameters
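A representative training-parameter block using the defaults and recommended values above (values are illustrative, not prescriptive):

--num_episodes 20 \
--max_epochs 1 \
--actor_learning_rate 5e-7 \
--critic_learning_rate 9e-6 \
--lr_warmup_ratio 0.03 \
--max_norm 1.0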

Batch Size Configuration

Important Constraint

Rule: rollout_batch_size × n_samples_per_prompt >= train_batch_size

Each rollout step generates rollout_batch_size × n_samples_per_prompt experiences, and that pool must cover at least one full training batch.

--train_batch_size (TBS)

  • Type: int

  • Required: Yes

  • Description: Global training batch size across all GPUs

  • Example: 256

  • Calculation: micro_train_batch_size × num_gpus × gradient_accumulation_steps

--micro_train_batch_size

  • Type: int

  • Default: 1

  • Description: Per-GPU batch size for training

  • Typical Values: 1, 2, 4

--rollout_batch_size (RBS)

  • Type: int

  • Required: Yes

  • Description: Global batch size for experience generation

  • Example: 64

  • Note: Must be divisible by the number of GPUs

--micro_rollout_batch_size

  • Type: int

  • Default: 2

  • Description: Per-GPU batch size for rollout

  • Typical Values: 2, 4, 8

Example Configurations

Configuration 1: 8 GPUs, Memory-Constrained

--train_batch_size 128 \
--micro_train_batch_size 1 \
--rollout_batch_size 64 \
--micro_rollout_batch_size 2

Configuration 2: 8 GPUs, High-Throughput

--train_batch_size 512 \
--micro_train_batch_size 2 \
--rollout_batch_size 256 \
--micro_rollout_batch_size 8
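Before launching, a configuration can be sanity-checked with a few lines of shell arithmetic. This sketch uses illustrative variable names, not LightRFT parameters:

NUM_GPUS=8
TBS=128; MICRO_TBS=1
RBS=64;  N_SAMPLES=4
# Experiences generated per rollout step
echo "experiences per step: $(( RBS * N_SAMPLES ))"            # 256
# Must satisfy: rollout_batch_size * n_samples_per_prompt >= train_batch_size
(( RBS * N_SAMPLES >= TBS )) && echo "batch constraint OK"
# Gradient accumulation steps implied by the train_batch_size calculation
echo "grad accum steps: $(( TBS / (MICRO_TBS * NUM_GPUS) ))"   # 16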

Algorithm Parameters

--advantage_estimator

  • Type: str

  • Choices: group_norm, reinforce, cpgd, gspo, gmpo

  • Default: group_norm

  • Description: Method for advantage estimation

  • Recommendation:

    • group_norm: General purpose (GRPO)

    • reinforce: Low variance needed

    • cpgd: Preserve base capabilities

--n_samples_per_prompt

  • Type: int

  • Default: 4

  • Description: Number of responses to sample per prompt

  • Typical Values: 4, 8, 16

  • Note: Higher values improve the per-prompt advantage estimate but increase generation time proportionally

--kl_estimator

  • Type: str

  • Choices: k1, k2, k3

  • Default: k3

  • Description: KL divergence estimator type

  • Recommendation: k3 for most cases

--init_kl_coef

  • Type: float

  • Default: 0.001

  • Description: Initial KL penalty coefficient

  • Range: 0.0001 to 0.01

--kl_target

  • Type: float

  • Default: 0.01

  • Description: Target KL divergence (for CPGD)

--clip_range

  • Type: float

  • Default: 0.2

  • Description: PPO clipping range

  • Range: 0.1 to 0.3

--clip_range_higher

  • Type: float

  • Default: 0.3

  • Description: Upper clipping range (Clip Higher algorithm)

--temperature

  • Type: float

  • Default: 1.0

  • Description: Sampling temperature

  • Range: 0.6 to 1.2

  • Note: Lower = more deterministic

--top_p

  • Type: float

  • Default: 0.9

  • Description: Nucleus sampling probability

  • Range: 0.8 to 1.0
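Putting the algorithm parameters together, a GRPO-style block might look like the following sketch (values are the defaults listed above; tune per task):

--advantage_estimator group_norm \
--n_samples_per_prompt 8 \
--kl_estimator k3 \
--init_kl_coef 0.001 \
--clip_range 0.2 \
--temperature 1.0 \
--top_p 0.9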

Distributed Training

--zero_stage

  • Type: int

  • Choices: 1, 2, 3

  • Default: 2

  • Description: DeepSpeed ZeRO optimization stage

  • Recommendation:

    • Stage 1: Optimizer state partitioning

    • Stage 2: + Gradient partitioning (recommended)

    • Stage 3: + Parameter partitioning (max memory saving)

--fsdp

  • Action: store_true

  • Default: False

  • Description: Use FSDP instead of DeepSpeed

  • When to Use: PyTorch-native workflows, maximum memory efficiency

--fsdp_cpu_offload

  • Action: store_true

  • Default: False

  • Description: Offload FSDP optimizer states to CPU

  • Note: Reduces GPU memory usage at the cost of speed

--bf16

  • Action: store_true

  • Default: Typically enabled

  • Description: Use bfloat16 mixed precision

--gradient_checkpointing

  • Action: store_true

  • Default: False

  • Description: Enable gradient checkpointing

  • Note: Trades computation for memory

--sp_size

  • Type: int

  • Default: 1

  • Description: Sequence parallelism size

  • Recommendation: 1, 2, 4 for very long sequences
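Two illustrative distributed setups, a default DeepSpeed path and an FSDP path for long sequences (these flag combinations are sketches, not tuned recipes):

# DeepSpeed ZeRO-2 (default path)
--zero_stage 2 \
--bf16 \
--gradient_checkpointing

# FSDP with sequence parallelism for very long sequences
--fsdp \
--gradient_checkpointing \
--sp_size 2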

Inference Engine

--engine_type

  • Type: str

  • Choices: vllm, sglang

  • Default: vllm

  • Description: Inference engine type

--engine_tp_size

  • Type: int

  • Default: 1

  • Description: Tensor parallelism size for inference engine

  • Recommendation:

    • 7B models: 1 or 2

    • 13B-34B models: 2 or 4

    • 70B+ models: 4 or 8

  • Constraint: world_size % engine_tp_size == 0
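A minimal shell check of this constraint (variable names are hypothetical):

WORLD_SIZE=8
ENGINE_TP_SIZE=4
if (( WORLD_SIZE % ENGINE_TP_SIZE != 0 )); then
    echo "error: --engine_tp_size must divide the world size" >&2
    exit 1
fi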

--engine_mem_util

  • Type: float

  • Default: 0.5

  • Range: 0.3 to 0.9

  • Description: GPU memory utilization for KV cache

  • Recommendation:

    • High memory: 0.8 - 0.9

    • Medium memory: 0.5 - 0.7

    • Low memory: 0.3 - 0.5

--enable_engine_sleep

  • Action: store_true

  • Default: True

  • Description: Enable inference engine sleep mode

  • Note: Frees GPU memory when the engine is not in use

--disable_engine_sleep

  • Action: store_false

  • Dest: enable_engine_sleep

  • Description: Disable engine sleep mode

--rm_use_engine

  • Action: store_true

  • Default: False

  • Description: Use inference engine for reward model

  • When to Use: High-throughput reward computation
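A typical engine block for a single-node 7B run, combining the flags above (values are illustrative):

--engine_type vllm \
--engine_tp_size 1 \
--engine_mem_util 0.85 \
--enable_engine_sleep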

Memory Optimization

--adam_offload

  • Action: store_true

  • Default: False

  • Description: Offload Adam optimizer states to CPU

--use_mp_opt

  • Action: store_true

  • Default: False

  • Description: Use mixed precision optimizer (FSDP)

--packing_samples

  • Action: store_true

  • Default: False

  • Description: Pack multiple samples into sequences

  • When to Use: Datasets with highly varied sequence lengths; improves GPU utilization

--fused_linear_logprob

  • Action: store_true

  • Default: False

  • Description: Fuse the output linear projection with log-probability computation

  • Note: Avoids materializing the full logits tensor, saving memory for large-vocabulary models

--chunk_size

  • Type: int

  • Default: 4096

  • Description: Chunk size for fused operations
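For a memory-constrained run, these options can be combined. An illustrative sketch (each flag trades speed for memory as noted above):

--adam_offload \
--gradient_checkpointing \
--packing_samples \
--fused_linear_logprob \
--chunk_size 4096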

Logging and Monitoring

--log_dir

  • Type: str

  • Default: None

  • Description: Directory for logs and visualizations

--plot_every

  • Type: int

  • Default: 10

  • Description: Plot generation length distribution every N steps

--use_tensorboard

  • Type: str

  • Default: None

  • Description: TensorBoard log directory

--use_wandb

  • Type: str

  • Default: None

  • Description: Weights & Biases project name

--wandb_org

  • Type: str

  • Default: None

  • Description: W&B organization name

--wandb_run_name

  • Type: str

  • Default: Auto-generated

  • Description: W&B run name
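For example, to log to both TensorBoard and Weights & Biases (the project, organization, and run names are placeholders):

--use_tensorboard ./tb_logs \
--use_wandb my-project \
--wandb_org my-org \
--wandb_run_name qwen7b-grpo-run1

The TensorBoard logs can then be viewed with tensorboard --logdir ./tb_logs.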

Checkpoint Management

--save_path

  • Type: str

  • Required: Yes

  • Description: Directory to save checkpoints

--ckpt_path

  • Type: str

  • Default: None

  • Description: Path to load checkpoint from

--load_checkpoint

  • Action: store_true

  • Default: False

  • Description: Enable checkpoint loading

--save_interval

  • Type: int

  • Default: 1

  • Description: Save checkpoint every N episodes

--max_ckpt_num

  • Type: int

  • Default: 3

  • Description: Maximum number of checkpoints to keep

--max_ckpt_mem

  • Type: int

  • Default: 1000

  • Description: Maximum total storage used by saved checkpoints, in GB
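A sketch of resuming an interrupted run using the flags above (paths are placeholders; the checkpoint layout under --save_path depends on your run):

python train.py \
    --pretrain /path/to/Qwen2.5-7B-Instruct \
    --train_batch_size 128 \
    --rollout_batch_size 64 \
    --save_path ./checkpoints \
    --ckpt_path ./checkpoints \
    --load_checkpoint \
    --save_interval 1 \
    --max_ckpt_num 3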

Reward Processing

--reward_running_norm

  • Action: store_true

  • Default: False

  • Description: Apply running normalization to rewards

--reward_running_norm_minus_mean

  • Action: store_true

  • Default: False

  • Description: Subtract mean during reward normalization

--advantages_norm

  • Action: store_true

  • Default: False

  • Description: Normalize advantages

--advantage_clip

  • Type: float

  • Default: 0.0

  • Description: Clip advantages (0 = no clipping)

--reward_clip

  • Type: float

  • Default: 0.0

  • Description: Clip rewards (0 = no clipping)
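An illustrative reward-processing block enabling normalization with mild clipping (the clip values are placeholders, not recommendations):

--reward_running_norm \
--reward_running_norm_minus_mean \
--advantages_norm \
--advantage_clip 5.0 \
--reward_clip 10.0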

Multimodal (VLM) Parameters

--mixed_mm_data

  • Action: store_true

  • Default: False

  • Description: Handle mixed multimodal and text-only data

--processor

  • Type: str

  • Default: Auto-detected

  • Description: Multimodal processor type

Complete Example Configuration

Note: comment lines cannot appear inside a backslash-continued shell command (a # would terminate the command midway), so each example below lists its flags as one uninterrupted block, ordered as in the sections above.

Math Reasoning (GSM8K, MATH)

python train.py \
    --pretrain /path/to/Qwen2.5-7B-Instruct \
    --reward_pretrain /path/to/reward-model \
    --max_len 4096 \
    --prompt_max_len 2048 \
    --num_episodes 20 \
    --max_epochs 1 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --train_batch_size 128 \
    --micro_train_batch_size 1 \
    --rollout_batch_size 64 \
    --micro_rollout_batch_size 2 \
    --advantage_estimator group_norm \
    --n_samples_per_prompt 8 \
    --kl_estimator k3 \
    --init_kl_coef 0.001 \
    --temperature 0.6 \
    --zero_stage 2 \
    --bf16 \
    --gradient_checkpointing \
    --engine_type vllm \
    --engine_tp_size 1 \
    --engine_mem_util 0.85 \
    --enable_engine_sleep \
    --adam_offload \
    --fused_linear_logprob \
    --use_tensorboard ./tb_logs \
    --plot_every 10 \
    --save_path ./checkpoints \
    --save_interval 1 \
    --max_ckpt_num 3 \
    --reward_running_norm \
    --advantages_norm
Multimodal VLM Training

python train_vl.py \
    --pretrain /path/to/Qwen2-VL-7B-Instruct \
    --max_len 4096 \
    --num_episodes 10 \
    --actor_learning_rate 1e-6 \
    --train_batch_size 64 \
    --micro_train_batch_size 1 \
    --rollout_batch_size 32 \
    --micro_rollout_batch_size 1 \
    --advantage_estimator group_norm \
    --n_samples_per_prompt 4 \
    --fsdp \
    --fsdp_cpu_offload \
    --gradient_checkpointing \
    --engine_tp_size 4 \
    --engine_mem_util 0.6 \
    --save_path ./checkpoints \
    --mixed_mm_data \
    --packing_samples

Parameter Validation

LightRFT performs automatic validation of parameters. Common validation rules:

  1. Batch Size: rollout_batch_size × n_samples_per_prompt >= train_batch_size

  2. Divisibility: Batch sizes must be divisible by number of GPUs

  3. Engine Parallelism: engine_tp_size must divide the world size evenly

  4. Learning Rate: Actor LR typically < Critic LR

  5. KL Target: Should be small (0.001-0.01) for stable training
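These rules can also be checked up front in a launcher script. A minimal sketch of rules 1-3, using illustrative values and variable names:

NUM_GPUS=8; ENGINE_TP=2
TBS=128; RBS=64; N_SAMPLES=4
(( RBS * N_SAMPLES >= TBS ))                     || { echo "rule 1 violated" >&2; exit 1; }
(( TBS % NUM_GPUS == 0 && RBS % NUM_GPUS == 0 )) || { echo "rule 2 violated" >&2; exit 1; }
(( NUM_GPUS % ENGINE_TP == 0 ))                  || { echo "rule 3 violated" >&2; exit 1; }
echo "configuration passes basic validation"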

Environment Variables

Useful environment variables for optimization:

# NCCL optimization
export TORCH_NCCL_AVOID_RECORD_STREAMS=1

# CUDA optimization
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Debugging
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO

See Also