Troubleshooting Guide

This guide helps you diagnose and resolve common issues when using LightRFT.

Quick Diagnosis

Use this flowchart to quickly identify your issue:

Issue Type?
├─ Installation/Setup → See [Installation Issues](#installation-issues)
├─ Out of Memory → See [Memory Issues](#memory-issues)
├─ Training Issues → See [Training Problems](#training-problems)
├─ Performance → See [Performance Issues](#performance-issues)
└─ Distributed Training → See [Distributed Issues](#distributed-training-issues)

Installation Issues

Problem: Package import errors

Symptoms:

ModuleNotFoundError: No module named 'lightrft'

Solution:

# Ensure you're in the correct directory
cd /path/to/LightRFT
pip install -r requirements.txt
pip install -e .

Problem: CUDA version mismatch

Symptoms:

RuntimeError: CUDA error: no kernel image is available for execution

Solution:

# Check CUDA version
nvcc --version
python -c "import torch; print(torch.version.cuda)"

# Reinstall PyTorch with correct CUDA version
pip install torch==2.5.1+cu118 --index-url https://download.pytorch.org/whl/cu118

Problem: vLLM installation fails

Symptoms:

ERROR: Failed building wheel for vllm

Solution:

# Install build dependencies
pip install ninja packaging wheel

# Install vLLM from source
pip install vllm --no-build-isolation

# Or use pre-built wheel
pip install vllm==0.5.3.post1

Memory Issues

Problem: Out of Memory (OOM) Errors

Symptoms:

RuntimeError: CUDA out of memory
torch.cuda.OutOfMemoryError

Solution Strategy (try in order):

1. Reduce Batch Sizes

# Before
--micro_train_batch_size 2
--micro_rollout_batch_size 4

# After
--micro_train_batch_size 1
--micro_rollout_batch_size 2

2. Enable Gradient Checkpointing

--gradient_checkpointing

Trades ~20% speed for ~50% memory savings.

3. Lower Engine Memory

# Before
--engine_mem_util 0.9

# After
--engine_mem_util 0.5  # Or 0.4 for very low memory

4. Use FSDP with CPU Offload

--fsdp \
--fsdp_cpu_offload \
--use_mp_opt

5. Enable Adam Offload

--adam_offload

6. Use ZeRO-3

--zero_stage 3

7. Reduce Model/Sequence Length

--max_len 2048  # Instead of 4096
--prompt_max_len 1024

Complete Low-Memory Configuration:

python train.py \
    --micro_train_batch_size 1 \
    --micro_rollout_batch_size 1 \
    --gradient_checkpointing \
    --engine_mem_util 0.4 \
    --fsdp \
    --fsdp_cpu_offload \
    --adam_offload \
    --max_len 2048 \
    --use_mp_opt

Problem: vLLM Engine OOM

Symptoms:

Failed to allocate memory for KV cache

Solution:

# Reduce KV cache memory
--engine_mem_util 0.3

# Increase tensor parallelism
--engine_tp_size 2  # or 4

# Enable engine sleep
--enable_engine_sleep

# Use smaller max length
--max_len 2048

Problem: Memory Leak During Training

Symptoms:

  • Memory gradually increases

  • Eventually OOMs after several episodes

Solution:

# Enable NCCL optimization
export TORCH_NCCL_AVOID_RECORD_STREAMS=1

# Clear cache periodically
# Add to training code:
torch.cuda.empty_cache()

# Use engine sleep
--enable_engine_sleep
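
If the leak persists, a small helper that runs garbage collection and releases cached CUDA blocks between episodes can help confirm whether the growth comes from allocator caching or from objects that are genuinely kept alive. This is an illustrative sketch, not a LightRFT API; call it wherever your training loop finishes an episode.

import gc
import torch

def clear_gpu_cache() -> None:
    """Release Python garbage and cached (unused) CUDA blocks."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Example usage (placeholders, adapt to your loop):
# for episode in range(num_episodes):   # num_episodes: your own loop bound
#     run_one_episode()                 # placeholder for rollout + update
#     clear_gpu_cache()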

Training Problems

Problem: num_rollouts_per_episodes = 0

Symptoms:

AssertionError: num_rollouts_per_episodes should be > 0

Root Cause: train_batch_size < rollout_batch_size × n_samples_per_prompt

Solution:

# Ensure TBS >= RBS × n_samples
# Example: RBS=64, n_samples=8
--train_batch_size 512  # Must be >= 64×8=512
--rollout_batch_size 64
--n_samples_per_prompt 8
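
For reference, here is a minimal sketch of the arithmetic behind this assertion. Variable names mirror the CLI flags; the exact computation inside LightRFT may differ.

train_batch_size = 512
rollout_batch_size = 64
n_samples_per_prompt = 8

samples_per_rollout = rollout_batch_size * n_samples_per_prompt   # 64 * 8 = 512
num_rollouts_per_episodes = train_batch_size // samples_per_rollout
assert num_rollouts_per_episodes > 0, (
    f"train_batch_size ({train_batch_size}) must be >= "
    f"rollout_batch_size * n_samples_per_prompt ({samples_per_rollout})"
)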

Problem: Training Not Converging

Symptoms:

  • Reward not increasing

  • Loss oscillating

  • Model not improving

Diagnosis & Solutions:

1. Check Learning Rate

# If too high (loss spikes):
--actor_learning_rate 1e-7  # Lower

# If too low (no progress):
--actor_learning_rate 1e-6  # Higher

2. Enable Reward Normalization

--reward_running_norm \
--reward_running_norm_minus_mean \
--advantages_norm

3. Check KL Penalty

# If KL too large (policy not updating):
--init_kl_coef 0.0001  # Lower

# If KL too small (instability):
--init_kl_coef 0.01  # Higher

4. Try Different Algorithm

# Switch from GRPO to CPGD
--advantage_estimator cpgd \
--kl_target 0.01

5. Check Reward Model Quality

# Test reward model separately
python test_reward_model.py --model /path/to/rm
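
If you do not have a dedicated test script, a quick sanity check is to score a prompt with a clearly good and a clearly bad response and verify the ordering. The sketch below assumes the reward model exposes a sequence-classification head and that a plain prompt + response concatenation matches its expected input format; adapt the path and template to your model.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "/path/to/rm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path).eval()

prompt = "What is 2 + 2?"
responses = {"good": "2 + 2 = 4.", "bad": "2 + 2 = 5."}

scores = {}
with torch.no_grad():
    for name, resp in responses.items():
        inputs = tokenizer(prompt + "\n" + resp, return_tensors="pt")
        scores[name] = model(**inputs).logits.squeeze().item()

print(scores)
assert scores["good"] > scores["bad"], "Reward model does not prefer the better answer"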

Problem: NaN Loss or Gradients

Symptoms:

Loss: nan
Gradient: nan

Solution:

# 1. Enable gradient clipping
--max_norm 1.0

# 2. Lower learning rate
--actor_learning_rate 1e-7

# 3. Use BF16 instead of FP16
--bf16

# 4. Enable reward clipping
--reward_clip 10.0

# 5. Check for division by zero
--advantages_norm  # Normalizes before use
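
To pinpoint where the NaNs first appear, a debugging hook like the following (illustrative, not part of LightRFT) can be called right after loss.backward() to check gradients before the optimizer step.

import torch

def grads_are_finite(model: torch.nn.Module) -> bool:
    """Return False and report the first parameter with a non-finite gradient."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"Non-finite gradient in {name}")
            return False
    return True

# Sketch of usage inside the training step:
# loss.backward()
# if grads_are_finite(actor):                              # 'actor' = your policy model
#     torch.nn.utils.clip_grad_norm_(actor.parameters(), 1.0)
#     optimizer.step()
# optimizer.zero_grad()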

Problem: Training Extremely Slow

Symptoms:

  • < 100 samples/min on 8×A100

  • Each episode takes hours

Solutions:

1. Profile Bottleneck

# Add profiling
with torch.profiler.profile() as prof:
    trainer.fit()
print(prof.key_averages())

2. Check Data Loading

# Increase workers
--num_workers 8

# Use faster dataloader
--dataloader_pin_memory

3. Optimize Generation

# Use FP8 inference
--engine_type vllm  # vLLM supports FP8

# Increase TP for generation
--engine_tp_size 2

# Reduce max length if possible
--max_len 2048

4. Reduce Logging

# Don't log every step
--log_interval 100

Distributed Training Issues

Problem: NCCL Timeout

Symptoms:

RuntimeError: NCCL timeout
[E ProcessGroupNCCL.cpp] Caught collective operation timeout

Solution:

# Increase timeout
export NCCL_TIMEOUT=1800

# Debug NCCL
export NCCL_DEBUG=INFO

# Try different network interface
export NCCL_SOCKET_IFNAME=eth0

# Disable InfiniBand if issues
export NCCL_IB_DISABLE=1

# Fall back to the GLOO backend for debugging
# (pass backend="gloo" to torch.distributed.init_process_group)

Problem: Distributed Initialization Hanging

Symptoms:

  • Script hangs at “Initializing process group”

  • No error message

Solution:

# 1. Check network connectivity
ping $MASTER_ADDR

# 2. Check port availability
nc -zv $MASTER_ADDR $MASTER_PORT

# 3. Set correct environment variables
export MASTER_ADDR=192.168.1.1
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=0  # one value per process: 0 to 7 when WORLD_SIZE=8

# 4. Use explicit init_method
torchrun --nproc_per_node=8 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py

# 5. Enable debug logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL
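
If the environment looks correct but initialization still hangs, a standalone smoke test (independent of LightRFT) can isolate the problem: if the script below also hangs, the issue is in the cluster/NCCL setup rather than in the training code. Save it as, e.g., dist_smoke_test.py and launch it with torchrun using the same addresses and world size.

import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)  # sums across all ranks
    print(f"rank {dist.get_rank()}: all_reduce -> {t.item()} (expected: world size)")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()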

Problem: Uneven GPU Utilization

Symptoms:

  • Some GPUs at 100%, others idle

  • Slow training despite multiple GPUs

Solution:

# 1. Check batch size divisibility
# Ensure batch_size % world_size == 0

# 2. Use tensor parallelism
--engine_tp_size 2  # Splits model across GPUs

# 3. Check for pipeline bubbles
# Ensure train_batch_size is large enough

# 4. Monitor GPU utilization
nvidia-smi dmon -i 0,1,2,3,4,5,6,7 -s u

# 5. Use sequence parallelism for long sequences
--sp_size 2
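
For the divisibility check in step 1, a tiny sketch with placeholder values:

world_size = 8                      # total number of GPUs/processes
batch_sizes = {"train_batch_size": 512, "rollout_batch_size": 64}

for name, bs in batch_sizes.items():
    assert bs % world_size == 0, (
        f"{name}={bs} is not divisible by world_size={world_size}; "
        "some ranks would receive uneven work"
    )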

Problem: Multi-Node Training Fails

Symptoms:

  • Works on single node

  • Fails on multiple nodes

Solution:

# 1. Use SLURM
srun -N2 --gres=gpu:8 --ntasks-per-node=8 bash train.sh

# 2. Or explicit torchrun
# On each node:
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=29500 \
    train.py

# 3. Check firewall rules
# Ensure ports are open between nodes

# 4. Use shared filesystem
# Ensure all nodes can access model/data paths

Performance Issues

Problem: Low GPU Utilization

Symptoms:

  • GPU utilization < 80%

  • Training slower than expected

Solutions:

1. Increase Batch Size

--micro_train_batch_size 2  # Double it
--micro_rollout_batch_size 4

2. Reduce CPU Bottleneck

--num_workers 8
--prefetch_factor 2

3. Enable Flash Attention

--flash_attn

4. Use Fused Kernels

--fused_linear_logprob

Problem: Generation Too Slow

Symptoms:

  • Rollout phase takes majority of time

  • < 100 tokens/sec generation

Solutions:

1. Use vLLM

--engine_type vllm  # Instead of HF
--engine_tp_size 2

2. Optimize KV Cache

--engine_mem_util 0.9  # If memory allows

3. Use FP8 (if supported)

# On H100-class GPUs, vLLM can run FP8 inference (e.g. an FP8 KV cache)
--engine_type vllm

4. Reduce Samples

--n_samples_per_prompt 4  # Instead of 8

Inference Engine Issues

Problem: vLLM Engine Fails to Initialize

Symptoms:

Failed to initialize vLLM engine
RuntimeError: Cannot allocate memory

Solution:

# 1. Check GPU memory
nvidia-smi

# 2. Reduce memory allocation
--engine_mem_util 0.5

# 3. Use smaller TP size
--engine_tp_size 1

# 4. Check model compatibility
# Some models need specific vLLM versions

# 5. Update vLLM
pip install -U vllm

Problem: Engine Not Updating Weights

Symptoms:

  • Policy model updates but generations don’t change

  • Rewards stay constant

Solution:

# Ensure update_engine_weights is called
self.strategy.update_engine_weights(self.actor)

# Check in training loop:
def ppo_train(self):
    ...
    # After training
    self.strategy.update_engine_weights(self.actor)

Problem: Engine Sleep/Wake Issues

Symptoms:

  • Training hangs after generation

  • “Engine already sleeping” errors

Solution:

# 1. Disable engine sleep for debugging
--disable_engine_sleep

# 2. Or use automatic management
# gather_and_generate handles sleep/wake automatically
all_outputs = self.strategy.gather_and_generate(
    ...,
    sleep_engine=True  # Automatic management
)

Checkpoint Issues

Problem: Cannot Load Checkpoint

Symptoms:

FileNotFoundError: Checkpoint not found
RuntimeError: Error loading state dict

Solution:

# 1. Check checkpoint path
ls -la /path/to/checkpoint

# 2. Load with relaxed matching
--load_checkpoint \
--ckpt_path /path/to/checkpoint

# 3. Skip optimizer states if incompatible
# Edit code to load the model weights only:
model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))

Problem: Checkpoint Saving Fails

Symptoms:

OSError: Disk quota exceeded
RuntimeError: Cannot save checkpoint

Solution:

# 1. Check disk space
df -h

# 2. Limit checkpoint number
--max_ckpt_num 3

# 3. Set max checkpoint size
--max_ckpt_mem 1000  # GB

# 4. Use different save path
--save_path /path/with/space

Debugging Tips

Enable Debug Logging

# PyTorch distributed
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# NCCL
export NCCL_DEBUG=INFO

# CUDA
export CUDA_LAUNCH_BLOCKING=1

Memory Profiling

import torch

# Track memory allocation
torch.cuda.memory._record_memory_history()

# Training loop
...

# Dump memory snapshot
torch.cuda.memory._dump_snapshot("memory.pickle")
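
The resulting memory.pickle snapshot can be inspected interactively with PyTorch's memory visualizer at https://pytorch.org/memory_viz.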

Performance Profiling

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True
) as prof:
    trainer.fit()

# View results
print(prof.key_averages().table(sort_by="cuda_time_total"))
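
For a timeline view, prof.export_chrome_trace("trace.json") writes a trace that can be opened in chrome://tracing or Perfetto.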

Debugging Checklist

When reporting bugs, include:

  • Hardware: GPU model, count, memory

  • Software: CUDA, PyTorch, vLLM versions

  • Full command with all arguments

  • Full error traceback

  • Environment variables set

  • Minimal reproduction script

  • What you’ve tried already

Getting Help

If you can’t resolve the issue:

  1. Check the FAQ

  2. Search existing GitHub Issues

  3. Ask the community in GitHub Discussions

  4. Report a bug: open a new issue and include the debugging checklist above

Common Error Messages Reference

| Error Message | Section | Quick Fix |
|---------------|---------|-----------|
| CUDA out of memory | Memory Issues | Reduce batch size, enable gradient checkpointing |
| num_rollouts_per_episodes = 0 | Training Problems | Increase train_batch_size |
| NCCL timeout | Distributed Training Issues | export NCCL_TIMEOUT=1800 |
| Failed to initialize vLLM | Inference Engine Issues | Reduce engine_mem_util |
| NaN loss | Training Problems | Lower learning rate, clip gradients |

See Also