# Configuration Parameters

This comprehensive guide covers all configuration parameters available in LightRFT. Parameters are organized by category for easy reference.

## Table of Contents

1. [Model Parameters](#model-parameters)
2. [Training Parameters](#training-parameters)
3. [Batch Size Configuration](#batch-size-configuration)
4. [Algorithm Parameters](#algorithm-parameters)
5. [Distributed Training](#distributed-training)
6. [Inference Engine](#inference-engine)
7. [Memory Optimization](#memory-optimization)
8. [Logging and Monitoring](#logging-and-monitoring)
9. [Checkpoint Management](#checkpoint-management)
10. [Reward Processing](#reward-processing)
11. [Multimodal (VLM) Parameters](#multimodal-vlm-parameters)

## Model Parameters

### `--pretrain`

- **Type**: `str`
- **Required**: Yes
- **Description**: Path to the pre-trained model and tokenizer
- **Example**: `/path/to/Qwen2.5-7B-Instruct`

### `--reward_pretrain`

- **Type**: `str`
- **Default**: Same as `--pretrain`
- **Description**: Path to the reward model
- **Example**: `/path/to/reward-model`

### `--remote_rm_url`

- **Type**: `str`
- **Default**: `None`
- **Description**: URL of a remote reward model server
- **Example**: `http://localhost:5000`

### `--max_len`

- **Type**: `int`
- **Default**: `4096`
- **Description**: Maximum sequence length (prompt + response)

### `--prompt_max_len`

- **Type**: `int`
- **Default**: `2048`
- **Description**: Maximum prompt length

## Training Parameters

### `--num_episodes`

- **Type**: `int`
- **Default**: `1`
- **Description**: Total number of training episodes
- **Recommended**: 10-100 for most tasks

### `--max_epochs`

- **Type**: `int`
- **Default**: `1`
- **Description**: Number of training epochs per episode
- **Recommended**: 1-3

### `--actor_learning_rate`

- **Type**: `float`
- **Default**: `5e-7`
- **Description**: Learning rate for the actor (policy) model
- **Recommended Range**: `1e-7` to `5e-6`

### `--critic_learning_rate`

- **Type**: `float`
- **Default**: `9e-6`
- **Description**: Learning rate for the critic (value) model
- **Recommended Range**: `1e-6` to `1e-5`

### `--lr_warmup_ratio`

- **Type**: `float`
- **Default**: `0.03`
- **Description**: Ratio of warmup steps to total training steps
- **Range**: `0.0` to `0.1`

### `--max_norm`

- **Type**: `float`
- **Default**: `1.0`
- **Description**: Maximum gradient norm for gradient clipping

### `--l2`

- **Type**: `float`
- **Default**: `0.0`
- **Description**: L2 regularization coefficient

### `--adam_betas`

- **Type**: `tuple[float, float]`
- **Default**: `(0.9, 0.95)`
- **Description**: Beta parameters for the Adam optimizer

## Batch Size Configuration

### Important Constraint

**Rule**: `train_batch_size <= rollout_batch_size × n_samples_per_prompt`

In other words, each rollout must generate at least one full training batch of experiences. Both example configurations below satisfy this with the default `--n_samples_per_prompt` of `4`.

### `--train_batch_size` (TBS)

- **Type**: `int`
- **Required**: Yes
- **Description**: Global training batch size across all GPUs
- **Example**: `256`
- **Calculation**: `micro_train_batch_size × num_gpus × gradient_accumulation_steps`

### `--micro_train_batch_size`

- **Type**: `int`
- **Default**: `1`
- **Description**: Per-GPU batch size for training
- **Typical Values**: `1`, `2`, `4`

### `--rollout_batch_size` (RBS)

- **Type**: `int`
- **Required**: Yes
- **Description**: Global batch size for experience generation
- **Example**: `64`
- **Note**: Must be divisible by the number of GPUs

### `--micro_rollout_batch_size`

- **Type**: `int`
- **Default**: `2`
- **Description**: Per-GPU batch size for rollout
- **Typical Values**: `2`, `4`, `8`

### Example Configurations

**Configuration 1: 8 GPUs, Memory-Constrained**

```bash
--train_batch_size 128 \
--micro_train_batch_size 1 \
--rollout_batch_size 64 \
--micro_rollout_batch_size 2
```

**Configuration 2: 8 GPUs, High-Throughput**

```bash
--train_batch_size 512 \
--micro_train_batch_size 2 \
--rollout_batch_size 256 \
--micro_rollout_batch_size 8
```
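
To make the arithmetic concrete, the sketch below works through Configuration 1, assuming a hypothetical 8-GPU node and the default `--n_samples_per_prompt` of `4`. It is illustrative only, not part of LightRFT.

```bash
#!/usr/bin/env bash
# Hypothetical values for Configuration 1 -- substitute your own.
NUM_GPUS=8
TBS=128        # --train_batch_size
MICRO_TBS=1    # --micro_train_batch_size
RBS=64         # --rollout_batch_size
N_SAMPLES=4    # --n_samples_per_prompt (default)

# Experiences generated per rollout: 64 * 4 = 256,
# i.e. two full training batches of 128.
echo "experiences per rollout: $(( RBS * N_SAMPLES ))"

# From TBS = micro_train_batch_size * num_gpus * gradient_accumulation_steps:
# 128 / (1 * 8) = 16 accumulation steps per optimizer update.
echo "gradient accumulation steps: $(( TBS / (MICRO_TBS * NUM_GPUS) ))"
```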
## Algorithm Parameters

### `--advantage_estimator`

- **Type**: `str`
- **Choices**: `group_norm`, `reinforce`, `cpgd`, `gspo`, `gmpo`
- **Default**: `group_norm`
- **Description**: Method for advantage estimation
- **Recommendation**:
  - `group_norm`: General purpose (GRPO)
  - `reinforce`: When low variance is needed
  - `cpgd`: When preserving base-model capabilities matters

### `--n_samples_per_prompt`

- **Type**: `int`
- **Default**: `4`
- **Description**: Number of responses sampled per prompt
- **Typical Values**: `4`, `8`, `16`
- **Note**: Higher values improve advantage estimates but slow generation

### `--kl_estimator`

- **Type**: `str`
- **Choices**: `k1`, `k2`, `k3`
- **Default**: `k3`
- **Description**: KL divergence estimator type
- **Recommendation**: `k3` for most cases

### `--init_kl_coef`

- **Type**: `float`
- **Default**: `0.001`
- **Description**: Initial KL penalty coefficient
- **Range**: `0.0001` to `0.01`

### `--kl_target`

- **Type**: `float`
- **Default**: `0.01`
- **Description**: Target KL divergence (used by CPGD)

### `--clip_range`

- **Type**: `float`
- **Default**: `0.2`
- **Description**: PPO clipping range
- **Range**: `0.1` to `0.3`

### `--clip_range_higher`

- **Type**: `float`
- **Default**: `0.3`
- **Description**: Upper clipping range (Clip-Higher algorithm)

### `--temperature`

- **Type**: `float`
- **Default**: `1.0`
- **Description**: Sampling temperature
- **Range**: `0.6` to `1.2`
- **Note**: Lower values give more deterministic sampling

### `--top_p`

- **Type**: `float`
- **Default**: `0.9`
- **Description**: Nucleus sampling probability
- **Range**: `0.8` to `1.0`

## Distributed Training

### `--zero_stage`

- **Type**: `int`
- **Choices**: `1`, `2`, `3`
- **Default**: `2`
- **Description**: DeepSpeed ZeRO optimization stage
- **Recommendation**:
  - Stage 1: Optimizer state partitioning
  - Stage 2: + gradient partitioning (recommended)
  - Stage 3: + parameter partitioning (maximum memory saving)

### `--fsdp`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Use FSDP instead of DeepSpeed
- **When to Use**: PyTorch-native workflows, maximum memory efficiency

### `--fsdp_cpu_offload`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Offload FSDP optimizer states to CPU
- **Note**: Reduces GPU memory at the cost of speed

### `--bf16`

- **Action**: `store_true`
- **Default**: Typically enabled
- **Description**: Use bfloat16 mixed precision

### `--gradient_checkpointing`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Enable gradient checkpointing
- **Note**: Trades computation for memory

### `--sp_size`

- **Type**: `int`
- **Default**: `1`
- **Description**: Sequence parallelism size
- **Recommendation**: `1` for typical workloads; `2` or `4` for very long sequences

## Inference Engine

### `--engine_type`

- **Type**: `str`
- **Choices**: `vllm`, `sglang`
- **Default**: `vllm`
- **Description**: Inference engine type

### `--engine_tp_size`

- **Type**: `int`
- **Default**: `1`
- **Description**: Tensor parallelism size for the inference engine
- **Recommendation**:
  - 7B models: `1` or `2`
  - 13B-34B models: `2` or `4`
  - 70B+ models: `4` or `8`
- **Constraint**: `world_size % engine_tp_size == 0`

### `--engine_mem_util`

- **Type**: `float`
- **Default**: `0.5`
- **Range**: `0.3` to `0.9`
- **Description**: GPU memory utilization for the KV cache
- **Recommendation**:
  - High memory headroom: `0.8` to `0.9`
  - Medium memory headroom: `0.5` to `0.7`
  - Low memory headroom: `0.3` to `0.5`

### `--enable_engine_sleep`

- **Action**: `store_true`
- **Default**: `True`
- **Description**: Enable inference engine sleep mode
- **Note**: Saves memory when the engine is not in use

### `--disable_engine_sleep`

- **Action**: `store_false`
- **Dest**: `enable_engine_sleep`
- **Description**: Disable engine sleep mode

### `--rm_use_engine`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Use the inference engine for the reward model
- **When to Use**: High-throughput reward computation
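
Putting the engine knobs together: a hypothetical single-node run serving a 7B policy with medium memory headroom might pair the flags as below. The divisibility check mirrors the `--engine_tp_size` constraint above; the values are assumptions, not defaults.

```bash
# Hypothetical values: 8-GPU node, 7B policy, medium memory headroom.
WORLD_SIZE=8
ENGINE_TP=2

# --engine_tp_size must divide the world size evenly.
if (( WORLD_SIZE % ENGINE_TP != 0 )); then
    echo "error: engine_tp_size must divide world_size" >&2
    exit 1
fi

ENGINE_FLAGS=(--engine_type vllm
              --engine_tp_size "${ENGINE_TP}"
              --engine_mem_util 0.6
              --enable_engine_sleep)
# Combine with the usual model/training flags when launching:
# python train.py "${ENGINE_FLAGS[@]}" ...
```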
## Memory Optimization

### `--adam_offload`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Offload Adam optimizer states to CPU

### `--use_mp_opt`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Use the mixed-precision optimizer (FSDP)

### `--packing_samples`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Pack multiple samples into each sequence
- **When to Use**: Datasets with varied sequence lengths, to improve GPU utilization

### `--fused_linear_logprob`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Fuse the final linear layer with log-probability computation
- **Note**: Saves memory for models with large vocabularies

### `--chunk_size`

- **Type**: `int`
- **Default**: `4096`
- **Description**: Chunk size for fused operations

## Logging and Monitoring

### `--log_dir`

- **Type**: `str`
- **Default**: `None`
- **Description**: Directory for logs and visualizations

### `--plot_every`

- **Type**: `int`
- **Default**: `10`
- **Description**: Plot the generation-length distribution every N steps

### `--use_tensorboard`

- **Type**: `str`
- **Default**: `None`
- **Description**: TensorBoard log directory

### `--use_wandb`

- **Type**: `str`
- **Default**: `None`
- **Description**: Weights & Biases project name

### `--wandb_org`

- **Type**: `str`
- **Default**: `None`
- **Description**: W&B organization name

### `--wandb_run_name`

- **Type**: `str`
- **Default**: Auto-generated
- **Description**: W&B run name

## Checkpoint Management

### `--save_path`

- **Type**: `str`
- **Required**: Yes
- **Description**: Directory in which to save checkpoints

### `--ckpt_path`

- **Type**: `str`
- **Default**: `None`
- **Description**: Path to load a checkpoint from

### `--load_checkpoint`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Enable checkpoint loading

### `--save_interval`

- **Type**: `int`
- **Default**: `1`
- **Description**: Save a checkpoint every N episodes

### `--max_ckpt_num`

- **Type**: `int`
- **Default**: `3`
- **Description**: Maximum number of checkpoints to keep

### `--max_ckpt_mem`

- **Type**: `int`
- **Default**: `1000`
- **Description**: Maximum total checkpoint storage in GB

## Reward Processing

### `--reward_running_norm`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Apply running normalization to rewards

### `--reward_running_norm_minus_mean`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Subtract the running mean during reward normalization

### `--advantages_norm`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Normalize advantages

### `--advantage_clip`

- **Type**: `float`
- **Default**: `0.0`
- **Description**: Clip advantages (`0` disables clipping)

### `--reward_clip`

- **Type**: `float`
- **Default**: `0.0`
- **Description**: Clip rewards (`0` disables clipping)

## Multimodal (VLM) Parameters

### `--mixed_mm_data`

- **Action**: `store_true`
- **Default**: `False`
- **Description**: Handle datasets that mix multimodal and text-only samples

### `--processor`

- **Type**: `str`
- **Default**: Auto-detected
- **Description**: Multimodal processor type
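
Before the full recipes below, here is a minimal sketch of resuming an interrupted run with the checkpoint and logging flags from this page. The paths are placeholders, and the required model and batch-size flags are elided for brevity.

```bash
# Hypothetical resume invocation; paths are placeholders.
python train.py \
    --pretrain /path/to/Qwen2.5-7B-Instruct \
    --save_path ./checkpoints \
    --load_checkpoint \
    --ckpt_path ./checkpoints/latest \
    --save_interval 1 \
    --max_ckpt_num 3 \
    --use_tensorboard ./tb_logs
    # ... plus the required batch-size flags
```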
## Complete Example Configuration

### Math Reasoning (GSM8K, MATH)

The flags are grouped in a bash array so the category comments remain valid shell syntax (a `#` comment after a `\` line continuation would otherwise break the command).

```bash
args=(
    # Model
    --pretrain /path/to/Qwen2.5-7B-Instruct
    --reward_pretrain /path/to/reward-model
    --max_len 4096
    --prompt_max_len 2048

    # Training
    --num_episodes 20
    --max_epochs 1
    --actor_learning_rate 5e-7
    --critic_learning_rate 9e-6

    # Batch Size
    --train_batch_size 128
    --micro_train_batch_size 1
    --rollout_batch_size 64
    --micro_rollout_batch_size 2

    # Algorithm
    --advantage_estimator group_norm
    --n_samples_per_prompt 8
    --kl_estimator k3
    --init_kl_coef 0.001
    --temperature 0.6

    # Distributed
    --zero_stage 2
    --bf16
    --gradient_checkpointing

    # Engine
    --engine_type vllm
    --engine_tp_size 1
    --engine_mem_util 0.85
    --enable_engine_sleep

    # Memory
    --adam_offload
    --fused_linear_logprob

    # Logging
    --use_tensorboard ./tb_logs
    --plot_every 10

    # Checkpoint
    --save_path ./checkpoints
    --save_interval 1
    --max_ckpt_num 3

    # Reward
    --reward_running_norm
    --advantages_norm
)
python train.py "${args[@]}"
```

### Multimodal VLM Training

```bash
args=(
    # Model
    --pretrain /path/to/Qwen2-VL-7B-Instruct
    --max_len 4096

    # Training
    --num_episodes 10
    --actor_learning_rate 1e-6

    # Batch Size
    --train_batch_size 64
    --micro_train_batch_size 1
    --rollout_batch_size 32
    --micro_rollout_batch_size 1

    # Algorithm
    --advantage_estimator group_norm
    --n_samples_per_prompt 4

    # Distributed
    --fsdp
    --fsdp_cpu_offload
    --gradient_checkpointing

    # Engine
    --engine_tp_size 4
    --engine_mem_util 0.6

    # VLM Specific
    --mixed_mm_data
    --packing_samples
)
python train_vl.py "${args[@]}"
```

## Parameter Validation

LightRFT performs automatic validation of parameters. Common validation rules:

1. **Batch Size**: `train_batch_size <= rollout_batch_size × n_samples_per_prompt`
2. **Divisibility**: Global batch sizes must be divisible by the number of GPUs
3. **Engine Parallelism**: The engine TP size must divide the world size evenly
4. **Learning Rate**: The actor LR is typically lower than the critic LR
5. **KL Target**: Should be small (0.001-0.01) for stable training

A bash sketch of these checks appears in the appendix at the end of this page.

## Environment Variables

Useful environment variables for optimization:

```bash
# NCCL optimization
export TORCH_NCCL_AVOID_RECORD_STREAMS=1

# CUDA optimization
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Debugging
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
```

## See Also

- [Algorithm Guide](algorithms.md) - Detailed algorithm descriptions
- [Strategy Usage](../best_practice/strategy_usage.md) - Distributed training strategies
- [Installation](../installation/index.rst) - Setup instructions
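
## Appendix: Validation Sketch

A minimal bash rendering of the validation rules above, using hypothetical values. LightRFT performs its own checks at startup; this sketch only illustrates the arithmetic.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical configuration -- substitute your own values.
NUM_GPUS=8
TBS=128; RBS=64; N_SAMPLES=8
ENGINE_TP=1

# Rule 1: each rollout must yield at least one full training batch.
(( TBS <= RBS * N_SAMPLES )) || { echo "train_batch_size too large" >&2; exit 1; }

# Rule 2: global batch sizes must split evenly across GPUs.
(( RBS % NUM_GPUS == 0 )) || { echo "rollout_batch_size not divisible by GPUs" >&2; exit 1; }

# Rule 3: the engine TP size must divide the world size evenly.
(( NUM_GPUS % ENGINE_TP == 0 )) || { echo "engine_tp_size must divide world_size" >&2; exit 1; }

echo "configuration looks consistent"
```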