Troubleshooting Guide¶
This guide helps you diagnose and resolve common issues when using LightRFT.
Quick Diagnosis¶
Use this flowchart to quickly identify your issue:
Issue Type?
├─ Installation/Setup → See [Installation Issues](#installation-issues)
├─ Out of Memory → See [Memory Issues](#memory-issues)
├─ Training Issues → See [Training Problems](#training-problems)
├─ Performance → See [Performance Issues](#performance-issues)
└─ Distributed Training → See [Distributed Issues](#distributed-training-issues)
Installation Issues¶
Problem: Package import errors¶
Symptoms:
ModuleNotFoundError: No module named 'lightrft'
Solution:
# Ensure you're in the correct directory
cd /path/to/LightRFT
pip install -r requirements.txt
pip install -e .
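A quick import check confirms the editable install is visible to Python (the `__version__` attribute is an assumption; the fallback covers packages that don't define it):
# Should print a version string or "ok", not raise
python -c "import lightrft; print(getattr(lightrft, '__version__', 'ok'))"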
Problem: CUDA version mismatch¶
Symptoms:
RuntimeError: CUDA error: no kernel image is available for execution on the device
Solution:
# Check CUDA version
nvcc --version
python -c "import torch; print(torch.version.cuda)"
# Reinstall PyTorch with correct CUDA version
pip install torch==2.5.1+cu118 --index-url https://download.pytorch.org/whl/cu118
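Then confirm the rebuilt PyTorch can actually launch a kernel on your GPU; this runs a real CUDA op, so a lingering mismatch will raise here:
# Exercise a CUDA kernel to verify the build matches the GPU
python -c "import torch; x = torch.randn(2, 2, device='cuda'); print(x @ x)"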
Problem: vLLM installation fails¶
Symptoms:
ERROR: Failed building wheel for vllm
Solution:
# Install build dependencies
pip install ninja packaging wheel
# Install without build isolation (helps when the wheel build fails)
pip install vllm --no-build-isolation
# Or use pre-built wheel
pip install vllm==0.5.3.post1
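Either way, verify the result imports cleanly:
python -c "import vllm; print(vllm.__version__)"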
Memory Issues¶
Problem: Out of Memory (OOM) Errors¶
Symptoms:
RuntimeError: CUDA out of memory
torch.cuda.OutOfMemoryError
Solution Strategy (try in order):
1. Reduce Batch Sizes
# Before
--micro_train_batch_size 2
--micro_rollout_batch_size 4
# After
--micro_train_batch_size 1
--micro_rollout_batch_size 2
2. Enable Gradient Checkpointing
--gradient_checkpointing
Trades ~20% speed for ~50% memory savings.
3. Lower Engine Memory
# Before
--engine_mem_util 0.9
# After
--engine_mem_util 0.5 # Or 0.4 for very low memory
4. Use FSDP with CPU Offload
--fsdp \
--fsdp_cpu_offload \
--use_mp_opt
5. Enable Adam Offload
--adam_offload
6. Use ZeRO-3
--zero_stage 3
7. Reduce Model/Sequence Length
--max_len 2048 # Instead of 4096
--prompt_max_len 1024
Complete Low-Memory Configuration:
python train.py \
--micro_train_batch_size 1 \
--micro_rollout_batch_size 1 \
--gradient_checkpointing \
--engine_mem_util 0.4 \
--fsdp \
--fsdp_cpu_offload \
--adam_offload \
--max_len 2048 \
--use_mp_opt
Problem: vLLM Engine OOM¶
Symptoms:
Failed to allocate memory for KV cache
Solution:
# Reduce KV cache memory
--engine_mem_util 0.3
# Increase tensor parallelism
--engine_tp_size 2 # or 4
# Enable engine sleep
--enable_engine_sleep
# Use smaller max length
--max_len 2048
Problem: Memory Leak During Training¶
Symptoms:
Memory gradually increases
Eventually OOMs after several episodes
Solution:
# Reduce memory retained by NCCL collectives
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
# Clear cache periodically (see the sketch after this block)
# Add to training code:
torch.cuda.empty_cache()
# Use engine sleep
--enable_engine_sleep
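For the periodic cache clearing, a minimal sketch of where the call belongs; the episode loop, num_episodes, and run_episode are placeholders for your trainer's structure:
import gc
import torch

for episode in range(num_episodes):  # num_episodes: placeholder
    run_episode()             # placeholder: rollout + training for one episode
    gc.collect()              # drop Python-side references first
    torch.cuda.empty_cache()  # return cached blocks to the allocator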
Training Problems¶
Problem: num_rollouts_per_episodes = 0¶
Symptoms:
AssertionError: num_rollouts_per_episodes should be > 0
Root Cause:
train_batch_size < rollout_batch_size × n_samples_per_prompt
Solution:
# Ensure TBS >= RBS × n_samples
# Example: RBS=64, n_samples=8
--train_batch_size 512 # Must be >= 64×8=512
--rollout_batch_size 64
--n_samples_per_prompt 8
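The constraint can be checked before launching; a minimal sketch of the arithmetic, where integer division mirrors the assertion above:
train_batch_size = 512
rollout_batch_size = 64
n_samples_per_prompt = 8

num_rollouts_per_episodes = train_batch_size // (rollout_batch_size * n_samples_per_prompt)
assert num_rollouts_per_episodes > 0, \
    "train_batch_size must be >= rollout_batch_size * n_samples_per_prompt"
print(num_rollouts_per_episodes)  # 1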
Problem: Training Not Converging¶
Symptoms:
Reward not increasing
Loss oscillating
Model not improving
Diagnosis & Solutions:
1. Check Learning Rate
# If too high (loss spikes):
--actor_learning_rate 1e-7 # Lower
# If too low (no progress):
--actor_learning_rate 1e-6 # Higher
2. Enable Reward Normalization
--reward_running_norm \
--reward_running_norm_minus_mean \
--advantages_norm
3. Check KL Penalty
# If KL too large (policy not updating):
--init_kl_coef 0.0001 # Lower
# If KL too small (instability):
--init_kl_coef 0.01 # Higher
4. Try Different Algorithm
# Switch from GRPO to CPGD
--advantage_estimator cpgd \
--kl_target 0.01
5. Check Reward Model Quality
# Test reward model separately
python test_reward_model.py --model /path/to/rm
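A rough standalone check, assuming a single-logit, sequence-classification-style reward model (the path and prompts are illustrative):
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

rm_path = "/path/to/rm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(rm_path)
model = AutoModelForSequenceClassification.from_pretrained(rm_path).eval()

good = "Q: What is 2+2? A: 4"
bad = "Q: What is 2+2? A: 5"
with torch.no_grad():
    scores = [model(**tokenizer(t, return_tensors="pt")).logits.item() for t in (good, bad)]
print(scores)  # a sane RM should score the correct answer higher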
Problem: NaN Loss or Gradients¶
Symptoms:
Loss: nan
Gradient: nan
Solution:
# 1. Enable gradient clipping
--max_norm 1.0
# 2. Lower learning rate
--actor_learning_rate 1e-7
# 3. Use BF16 instead of FP16
--bf16
# 4. Enable reward clipping
--reward_clip 10.0
# 5. Check for division by zero
--advantages_norm # Normalizes before use
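To find where the NaNs originate, a small gradient probe can be dropped in after the backward pass (a sketch; model stands in for the actor):
import torch

def report_nonfinite_grads(model):
    """Print every parameter whose gradient contains NaN or Inf."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient: {name}")

# call between loss.backward() and optimizer.step()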
Problem: Training Extremely Slow¶
Symptoms:
< 100 samples/min on 8×A100
Each episode takes hours
Solutions:
1. Profile Bottleneck
# Add profiling
with torch.profiler.profile() as prof:
    trainer.fit()
print(prof.key_averages())
2. Check Data Loading
# Increase workers
--num_workers 8
# Use faster dataloader
--dataloader_pin_memory
3. Optimize Generation
# Use FP8 inference
--engine_type vllm # vLLM supports FP8
# Increase TP for generation
--engine_tp_size 2
# Reduce max length if possible
--max_len 2048
4. Reduce Logging
# Don't log every step
--log_interval 100
Distributed Training Issues¶
Problem: NCCL Timeout¶
Symptoms:
RuntimeError: NCCL timeout
[E ProcessGroupNCCL.cpp] Caught collective operation timeout
Solution:
# Increase timeout
export NCCL_TIMEOUT=1800
# Debug NCCL
export NCCL_DEBUG=INFO
# Try different network interface
export NCCL_SOCKET_IFNAME=eth0
# Disable InfiniBand if issues
export NCCL_IB_DISABLE=1
# Fall back to the gloo backend to rule out NCCL
# (pass backend="gloo" to init_process_group)
Problem: Distributed Initialization Hanging¶
Symptoms:
Script hangs at “Initializing process group”
No error message
Solution:
# 1. Check network connectivity
ping $MASTER_ADDR
# 2. Check port availability
nc -zv $MASTER_ADDR $MASTER_PORT
# 3. Set correct environment variables
export MASTER_ADDR=192.168.1.1
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=0 # unique per process: 0 to WORLD_SIZE-1
# 4. Use explicit init_method
torchrun --nproc_per_node=8 \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train.py
# 5. Enable debug logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL
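If the basics above check out, a minimal all_reduce smoke test isolates whether the problem is LightRFT or the cluster itself (a standalone sketch; the file name is illustrative):
# smoke_test.py -- run with: torchrun --nproc_per_node=8 smoke_test.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # try "gloo" with CPU tensors to rule out NCCL
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # sums across all ranks
print(f"rank {dist.get_rank()}: {x.item()}")  # expect world_size on every rank
dist.destroy_process_group()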
Problem: Uneven GPU Utilization¶
Symptoms:
Some GPUs at 100%, others idle
Slow training despite multiple GPUs
Solution:
# 1. Check batch size divisibility
# Ensure batch_size % world_size == 0
# 2. Use tensor parallelism
--engine_tp_size 2 # Splits model across GPUs
# 3. Check for pipeline bubbles
# Ensure train_batch_size is large enough
# 4. Monitor GPU utilization
nvidia-smi dmon -i 0,1,2,3,4,5,6,7 -s u
# 5. Use sequence parallelism for long sequences
--sp_size 2
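For the divisibility check in step 1, a tiny pre-launch script suffices (values are examples; the micro-batch check is an assumption about how per-rank batches are split):
world_size = 8
train_batch_size = 512
micro_train_batch_size = 1

assert train_batch_size % world_size == 0, "batch must split evenly across ranks"
per_rank = train_batch_size // world_size
assert per_rank % micro_train_batch_size == 0, "micro-batches must divide the per-rank batch"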
Problem: Multi-Node Training Fails¶
Symptoms:
Works on single node
Fails on multiple nodes
Solution:
# 1. Use SLURM
srun -N2 --gres=gpu:8 --ntasks-per-node=8 bash train.sh
# 2. Or explicit torchrun
# On each node:
torchrun \
--nproc_per_node=8 \
--nnodes=2 \
--node_rank=$SLURM_NODEID \
--master_addr=$MASTER_ADDR \
--master_port=29500 \
train.py
# 3. Check firewall rules
# Ensure ports are open between nodes
# 4. Use shared filesystem
# Ensure all nodes can access model/data paths
Performance Issues¶
Problem: Low GPU Utilization¶
Symptoms:
GPU utilization < 80%
Training slower than expected
Solutions:
1. Increase Batch Size
--micro_train_batch_size 2 # Double it
--micro_rollout_batch_size 4
2. Reduce CPU Bottleneck
--num_workers 8
--prefetch_factor 2
3. Enable Flash Attention
--flash_attn
4. Use Fused Kernels
--fused_linear_logprob
Problem: Generation Too Slow¶
Symptoms:
Rollout phase takes majority of time
< 100 tokens/sec generation
Solutions:
1. Use vLLM
--engine_type vllm # Instead of HF
--engine_tp_size 2
2. Optimize KV Cache
--engine_mem_util 0.9 # If memory allows
3. Use FP8 (if supported)
# vLLM can use FP8 (e.g., an FP8 KV cache) on H100-class GPUs
--engine_type vllm
4. Reduce Samples
--n_samples_per_prompt 4 # Instead of 8
Inference Engine Issues¶
Problem: vLLM Engine Fails to Initialize¶
Symptoms:
Failed to initialize vLLM engine
RuntimeError: Cannot allocate memory
Solution:
# 1. Check GPU memory
nvidia-smi
# 2. Reduce memory allocation
--engine_mem_util 0.5
# 3. Use smaller TP size
--engine_tp_size 1
# 4. Check model compatibility
# Some models need specific vLLM versions
# 5. Update vLLM
pip install -U vllm
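To separate engine problems from LightRFT itself, vLLM can be started standalone (a minimal sketch; the model path is a placeholder, and gpu_memory_utilization plays the role of --engine_mem_util):
# Standalone vLLM smoke test
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", gpu_memory_utilization=0.5, tensor_parallel_size=1)
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)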
Problem: Engine Not Updating Weights¶
Symptoms:
Policy model updates but generations don’t change
Rewards stay constant
Solution:
# Ensure update_engine_weights is called after each training phase
self.strategy.update_engine_weights(self.actor)
# Check in the training loop:
def ppo_train(self):
    ...
    # After the optimizer step, push updated weights to the engine
    self.strategy.update_engine_weights(self.actor)
Problem: Engine Sleep/Wake Issues¶
Symptoms:
Training hangs after generation
“Engine already sleeping” errors
Solution:
# 1. Disable engine sleep for debugging
--disable_engine_sleep
# 2. Or use automatic management
# gather_and_generate handles sleep/wake automatically
all_outputs = self.strategy.gather_and_generate(
    ...,
    sleep_engine=True,  # automatic sleep/wake management
)
Checkpoint Issues¶
Problem: Cannot Load Checkpoint¶
Symptoms:
FileNotFoundError: Checkpoint not found
RuntimeError: Error loading state dict
Solution:
# 1. Check checkpoint path
ls -la /path/to/checkpoint
# 2. Load with relaxed matching
--load_checkpoint \
--ckpt_path /path/to/checkpoint
# 3. Skip optimizer states if incompatible
# Edit code to load model weights only:
model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
Problem: Checkpoint Saving Fails¶
Symptoms:
OSError: Disk quota exceeded
RuntimeError: Cannot save checkpoint
Solution:
# 1. Check disk space
df -h
# 2. Limit checkpoint number
--max_ckpt_num 3
# 3. Set max checkpoint size
--max_ckpt_mem 1000 # GB
# 4. Use different save path
--save_path /path/with/space
Debugging Tips¶
Enable Debug Logging¶
# PyTorch distributed
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# NCCL
export NCCL_DEBUG=INFO
# CUDA
export CUDA_LAUNCH_BLOCKING=1
Memory Profiling¶
import torch
# Track memory allocation
torch.cuda.memory._record_memory_history()
# Training loop
...
# Dump memory snapshot
torch.cuda.memory._dump_snapshot("memory.pickle")
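# Inspect memory.pickle with PyTorch's viewer: https://pytorch.org/memory_viz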
Performance Profiling¶
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    trainer.fit()

# View results
print(prof.key_averages().table(sort_by="cuda_time_total"))
Debugging Checklist¶
When reporting bugs, include:
Hardware: GPU model, count, memory
Software: CUDA, PyTorch, vLLM versions
Full command with all arguments
Full error traceback
Environment variables set
Minimal reproduction script
What you’ve tried already
Getting Help¶
If you can’t resolve the issue:
Check the FAQ: FAQ
Search existing issues: GitHub Issues
Ask the community: GitHub Discussions
Report a bug: open a new issue and include the debugging checklist above
Common Error Messages Reference¶
| Error Message | Section | Quick Fix |
|---|---|---|
| `CUDA out of memory` | [Memory Issues](#memory-issues) | Reduce batch size, enable checkpointing |
| `NCCL timeout` | [Distributed Training Issues](#distributed-training-issues) | Increase `NCCL_TIMEOUT` |
| `num_rollouts_per_episodes should be > 0` | [Training Problems](#training-problems) | Increase `train_batch_size` |
| `Failed to allocate memory for KV cache` | [Memory Issues](#memory-issues) | Reduce `engine_mem_util` |
| `Loss: nan` | [Training Problems](#training-problems) | Lower learning rate, clip gradients |
See Also¶
FAQ - Frequently asked questions
Configuration - All parameters
Best Practices - Optimization tips