# Troubleshooting Guide

This guide helps you diagnose and resolve common issues when using LightRFT.

## Quick Diagnosis

Use this flowchart to quickly identify your issue:

```
Issue Type?
├─ Installation/Setup → See [Installation Issues](#installation-issues)
├─ Out of Memory → See [Memory Issues](#memory-issues)
├─ Training Issues → See [Training Problems](#training-problems)
├─ Performance → See [Performance Issues](#performance-issues)
└─ Distributed Training → See [Distributed Issues](#distributed-training-issues)
```

## Installation Issues

### Problem: Package import errors

**Symptoms**:
```
ModuleNotFoundError: No module named 'lightrft'
```

**Solution**:
```bash
# Ensure you're in the correct directory
cd /path/to/LightRFT
pip install -r requirements.txt
pip install -e .
```

### Problem: CUDA version mismatch

**Symptoms**:
```
RuntimeError: CUDA error: no kernel image is available for execution on the device
```

**Solution**:
```bash
# Check CUDA version
nvcc --version
python -c "import torch; print(torch.version.cuda)"

# Reinstall PyTorch with the correct CUDA version
pip install torch==2.5.1+cu118 --index-url https://download.pytorch.org/whl/cu118
```

### Problem: vLLM installation fails

**Symptoms**:
```
ERROR: Failed building wheel for vllm
```

**Solution**:
```bash
# Install build dependencies
pip install ninja packaging wheel

# Install vLLM without build isolation (needed when compiling it yourself)
pip install vllm --no-build-isolation

# Or use a pre-built wheel
pip install vllm==0.5.3.post1
```
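After installation, a quick environment sanity check can catch version mismatches before they surface as runtime errors. This is a minimal sketch, not part of LightRFT; it only assumes `torch`, `vllm`, and the editable `lightrft` install are present:

```python
# sanity_check.py -- illustrative environment check, not shipped with LightRFT
import torch

print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available(), "| GPUs:", torch.cuda.device_count())

try:
    import vllm
    print("vLLM:", vllm.__version__)
except ImportError as e:
    print("vLLM not importable:", e)

try:
    import lightrft  # should succeed after `pip install -e .`
    print("lightrft imported from:", lightrft.__file__)
except ImportError as e:
    print("lightrft not importable:", e)
```

If the CUDA build printed here does not match `nvcc --version`, reinstall PyTorch as shown above before debugging anything else.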
## Memory Issues

### Problem: Out of Memory (OOM) Errors

**Symptoms**:
```
RuntimeError: CUDA out of memory
torch.cuda.OutOfMemoryError
```

**Solution Strategy** (try in order):

**1. Reduce Batch Sizes**
```bash
# Before
--micro_train_batch_size 2
--micro_rollout_batch_size 4

# After
--micro_train_batch_size 1
--micro_rollout_batch_size 2
```

**2. Enable Gradient Checkpointing**
```bash
--gradient_checkpointing
```
Trades ~20% speed for ~50% memory savings.

**3. Lower Engine Memory**
```bash
# Before
--engine_mem_util 0.9

# After
--engine_mem_util 0.5  # Or 0.4 for very low memory
```

**4. Use FSDP with CPU Offload**
```bash
--fsdp \
--fsdp_cpu_offload \
--use_mp_opt
```

**5. Enable Adam Offload**
```bash
--adam_offload
```

**6. Use ZeRO-3**
```bash
--zero_stage 3
```

**7. Reduce Model/Sequence Length**
```bash
--max_len 2048        # Instead of 4096
--prompt_max_len 1024
```

**Complete Low-Memory Configuration**:
```bash
python train.py \
  --micro_train_batch_size 1 \
  --micro_rollout_batch_size 1 \
  --gradient_checkpointing \
  --engine_mem_util 0.4 \
  --fsdp \
  --fsdp_cpu_offload \
  --adam_offload \
  --max_len 2048 \
  --use_mp_opt
```

### Problem: vLLM Engine OOM

**Symptoms**:
```
Failed to allocate memory for KV cache
```

**Solution**:
```bash
# Reduce KV cache memory
--engine_mem_util 0.3

# Increase tensor parallelism
--engine_tp_size 2  # or 4

# Enable engine sleep
--enable_engine_sleep

# Use a smaller max length
--max_len 2048
```

### Problem: Memory Leak During Training

**Symptoms**:
- Memory gradually increases
- Eventually OOMs after several episodes

**Solution**:
```bash
# Enable NCCL optimization
export TORCH_NCCL_AVOID_RECORD_STREAMS=1

# Use engine sleep
--enable_engine_sleep
```
In addition, clear the CUDA cache periodically from the training code with `torch.cuda.empty_cache()`.
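A sketch of that periodic cleanup is shown below. This is illustrative only, not LightRFT API; the helper name and its placement in the loop are assumptions to adapt to your own training code:

```python
import gc
import torch

def free_cached_memory(episode: int, every: int = 1) -> None:
    """Illustrative helper: release Python garbage and cached CUDA blocks
    between episodes so gradual memory growth does not turn into an OOM."""
    if episode % every != 0:
        return
    gc.collect()              # drop unreachable Python objects still holding tensors
    torch.cuda.empty_cache()  # return cached blocks to the CUDA driver
    # Optional: log usage to distinguish a real leak from normal cache growth
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[episode {episode}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Hypothetical placement inside a training loop:
# for episode in range(num_episodes):
#     run_rollout_and_update(...)
#     free_cached_memory(episode)
```

If `allocated` keeps climbing even after this call, the growth is held by live references (e.g. accumulated logs of tensors) rather than the allocator cache, and the fix belongs in the training code itself.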
## Training Problems

### Problem: `num_rollouts_per_episodes = 0`

**Symptoms**:
```
AssertionError: num_rollouts_per_episodes should be > 0
```

**Root Cause**: `train_batch_size` < `rollout_batch_size × n_samples_per_prompt`

**Solution**:
```bash
# Ensure TBS >= RBS × n_samples
# Example: RBS=64, n_samples=8
--train_batch_size 512      # Must be >= 64×8=512
--rollout_batch_size 64
--n_samples_per_prompt 8
```

### Problem: Training Not Converging

**Symptoms**:
- Reward not increasing
- Loss oscillating
- Model not improving

**Diagnosis & Solutions**:

**1. Check the Learning Rate**
```bash
# If too high (loss spikes):
--actor_learning_rate 1e-7  # Lower

# If too low (no progress):
--actor_learning_rate 1e-6  # Higher
```

**2. Enable Reward Normalization**
```bash
--reward_running_norm \
--reward_running_norm_minus_mean \
--advantages_norm
```

**3. Check the KL Penalty**
```bash
# If KL too large (policy not updating):
--init_kl_coef 0.0001  # Lower

# If KL too small (instability):
--init_kl_coef 0.01    # Higher
```

**4. Try a Different Algorithm**
```bash
# Switch from GRPO to CPGD
--advantage_estimator cpgd \
--kl_target 0.01
```

**5. Check Reward Model Quality**
```bash
# Test the reward model separately
python test_reward_model.py --model /path/to/rm
```

### Problem: NaN Loss or Gradients

**Symptoms**:
```
Loss: nan
Gradient: nan
```

**Solution**:
```bash
# 1. Enable gradient clipping
--max_norm 1.0

# 2. Lower the learning rate
--actor_learning_rate 1e-7

# 3. Use BF16 instead of FP16
--bf16

# 4. Enable reward clipping
--reward_clip 10.0

# 5. Guard against division by zero
--advantages_norm  # Normalizes advantages before use
```

### Problem: Training Extremely Slow

**Symptoms**:
- < 100 samples/min on 8×A100
- Each episode takes hours

**Solutions**:

**1. Profile the Bottleneck**
```python
# Add profiling around the training entry point
import torch

with torch.profiler.profile() as prof:
    trainer.fit()
print(prof.key_averages())
```

**2. Check Data Loading**
```bash
# Increase workers
--num_workers 8

# Use a faster dataloader
--dataloader_pin_memory
```

**3. Optimize Generation**
```bash
# Use vLLM (supports FP8 inference on capable GPUs)
--engine_type vllm

# Increase TP for generation
--engine_tp_size 2

# Reduce max length if possible
--max_len 2048
```

**4. Reduce Logging**
```bash
# Don't log every step
--log_interval 100
```

## Distributed Training Issues

### Problem: NCCL Timeout

**Symptoms**:
```
RuntimeError: NCCL timeout
[E ProcessGroupNCCL.cpp] Caught collective operation timeout
```

**Solution**:
```bash
# Increase the timeout
export NCCL_TIMEOUT=1800

# Debug NCCL
export NCCL_DEBUG=INFO

# Try a different network interface
export NCCL_SOCKET_IFNAME=eth0

# Disable InfiniBand if it causes issues
export NCCL_IB_DISABLE=1

# For debugging, fall back to the gloo backend
# (pass backend="gloo" to init_process_group; there is no NCCL env var for this)
```

### Problem: Distributed Initialization Hanging

**Symptoms**:
- Script hangs at "Initializing process group"
- No error message

**Solution**:
```bash
# 1. Check network connectivity
ping $MASTER_ADDR

# 2. Check port availability
nc -zv $MASTER_ADDR $MASTER_PORT

# 3. Set the correct environment variables
export MASTER_ADDR=192.168.1.1
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=0  # 0 to 7 for each GPU

# 4. Use an explicit init method
torchrun --nproc_per_node=8 \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  train.py

# 5. Enable debug logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```

### Problem: Uneven GPU Utilization

**Symptoms**:
- Some GPUs at 100%, others idle
- Slow training despite multiple GPUs

**Solution**:
```bash
# 1. Check batch size divisibility
# Ensure batch_size % world_size == 0

# 2. Use tensor parallelism
--engine_tp_size 2  # Splits the model across GPUs

# 3. Check for pipeline bubbles
# Ensure train_batch_size is large enough

# 4. Monitor GPU utilization
nvidia-smi dmon -i 0,1,2,3,4,5,6,7 -s u

# 5. Use sequence parallelism for long sequences
--sp_size 2
```

### Problem: Multi-Node Training Fails

**Symptoms**:
- Works on a single node
- Fails on multiple nodes

**Solution**:
```bash
# 1. Use SLURM
srun -N2 --gres=gpu:8 --ntasks-per-node=8 bash train.sh

# 2. Or launch torchrun explicitly on each node:
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=$SLURM_NODEID \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  train.py

# 3. Check firewall rules
# Ensure ports are open between nodes

# 4. Use a shared filesystem
# Ensure all nodes can access model/data paths
```
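If initialization keeps hanging or timing out, it helps to rule LightRFT out entirely and test the process group in isolation. The following is a minimal sketch in plain PyTorch (not a LightRFT script); it assumes it is launched with `torchrun`, which sets the usual rendezvous environment variables:

```python
# dist_smoke_test.py -- run with: torchrun --nproc_per_node=8 dist_smoke_test.py
# Minimal process-group check, independent of LightRFT.
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # use backend="gloo" to bypass NCCL while debugging

    # One all-reduce: if this returns, rendezvous, networking, and NCCL are working.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, sum = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this script hangs too, the problem is in the cluster setup (NCCL, firewall, interface selection) rather than in the training code, and the environment variables above are the right place to look.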
## Performance Issues

### Problem: Low GPU Utilization

**Symptoms**:
- GPU utilization < 80%
- Training slower than expected

**Solutions**:

**1. Increase Batch Size**
```bash
--micro_train_batch_size 2   # Double it
--micro_rollout_batch_size 4
```

**2. Reduce the CPU Bottleneck**
```bash
--num_workers 8
--prefetch_factor 2
```

**3. Enable Flash Attention**
```bash
--flash_attn
```

**4. Use Fused Kernels**
```bash
--fused_linear_logprob
```

### Problem: Generation Too Slow

**Symptoms**:
- Rollout phase takes the majority of time
- < 100 tokens/sec generation

**Solutions**:

**1. Use vLLM**
```bash
--engine_type vllm  # Instead of HF
--engine_tp_size 2
```

**2. Optimize the KV Cache**
```bash
--engine_mem_util 0.9  # If memory allows
```

**3. Use FP8 (if supported)**
```bash
# vLLM can use FP8 on H100-class GPUs
--engine_type vllm
```

**4. Reduce Samples**
```bash
--n_samples_per_prompt 4  # Instead of 8
```

## Inference Engine Issues

### Problem: vLLM Engine Fails to Initialize

**Symptoms**:
```
Failed to initialize vLLM engine
RuntimeError: Cannot allocate memory
```

**Solution**:
```bash
# 1. Check GPU memory
nvidia-smi

# 2. Reduce memory allocation
--engine_mem_util 0.5

# 3. Use a smaller TP size
--engine_tp_size 1

# 4. Check model compatibility
# Some models need specific vLLM versions

# 5. Update vLLM
pip install -U vllm
```

### Problem: Engine Not Updating Weights

**Symptoms**:
- Policy model updates but generations don't change
- Rewards stay constant

**Solution**:
```python
# Ensure update_engine_weights is called
self.strategy.update_engine_weights(self.actor)

# Check in the training loop:
def ppo_train(self):
    ...
    # After training
    self.strategy.update_engine_weights(self.actor)
```

### Problem: Engine Sleep/Wake Issues

**Symptoms**:
- Training hangs after generation
- "Engine already sleeping" errors

**Solution**:
```bash
# 1. Disable engine sleep for debugging
--disable_engine_sleep
```
```python
# 2. Or rely on automatic management:
# gather_and_generate handles sleep/wake automatically
all_outputs = self.strategy.gather_and_generate(
    ...,
    sleep_engine=True  # Automatic management
)
```

## Checkpoint Issues

### Problem: Cannot Load Checkpoint

**Symptoms**:
```
FileNotFoundError: Checkpoint not found
RuntimeError: Error loading state dict
```

**Solution**:
```bash
# 1. Check the checkpoint path
ls -la /path/to/checkpoint

# 2. Load with relaxed matching
--load_checkpoint \
--ckpt_path /path/to/checkpoint

# 3. Skip optimizer states if incompatible
# Edit the code to load model weights only, e.g.
#   model.load_state_dict(torch.load(ckpt_path))
# (see the sketch below)
```
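A slightly fuller version of that model-only load is sketched below. It assumes `model` is your already-instantiated policy model, and the checkpoint key layout (a `"model"` wrapper vs. a flat state dict) is an assumption; inspect your checkpoint first and adapt accordingly:

```python
import torch

ckpt_path = "/path/to/checkpoint/model.pt"  # hypothetical file name
ckpt = torch.load(ckpt_path, map_location="cpu")

# Some checkpoints wrap the weights, e.g. {"model": ..., "optimizer": ...}
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

# strict=False tolerates missing/unexpected keys (e.g. renamed modules);
# inspect the returned lists instead of silently ignoring them.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```

Skipping optimizer state means the optimizer restarts from scratch, so expect a brief warm-up period after resuming.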
### Problem: Checkpoint Saving Fails

**Symptoms**:
```
OSError: Disk quota exceeded
RuntimeError: Cannot save checkpoint
```

**Solution**:
```bash
# 1. Check disk space
df -h

# 2. Limit the number of checkpoints
--max_ckpt_num 3

# 3. Set a maximum checkpoint size
--max_ckpt_mem 1000  # GB

# 4. Use a different save path
--save_path /path/with/space
```

## Debugging Tips

### Enable Debug Logging

```bash
# PyTorch distributed
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# NCCL
export NCCL_DEBUG=INFO

# CUDA
export CUDA_LAUNCH_BLOCKING=1
```

### Memory Profiling

```python
import torch

# Track memory allocation
torch.cuda.memory._record_memory_history()

# Training loop
...

# Dump memory snapshot
torch.cuda.memory._dump_snapshot("memory.pickle")
```

### Performance Profiling

```python
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True
) as prof:
    trainer.fit()

# View results
print(prof.key_averages().table(sort_by="cuda_time_total"))
```

### Debugging Checklist

When reporting bugs, include:

- [ ] Hardware: GPU model, count, memory
- [ ] Software: CUDA, PyTorch, vLLM versions
- [ ] Full command with all arguments
- [ ] Full error traceback
- [ ] Environment variables set
- [ ] Minimal reproduction script
- [ ] What you've tried already

## Getting Help

If you can't resolve the issue:

1. **Check FAQ**: [FAQ](faq.md)
2. **Search Issues**: [GitHub Issues](https://github.com/opendilab/LightRFT/issues)
3. **Ask Community**: GitHub Discussions
4. **Report Bug**: Open a new issue with the debugging info above

## Common Error Messages Reference

| Error Message | Section | Quick Fix |
|---------------|---------|-----------|
| `CUDA out of memory` | [Memory Issues](#memory-issues) | Reduce batch size, enable checkpointing |
| `num_rollouts_per_episodes = 0` | [Training Problems](#training-problems) | Increase `train_batch_size` |
| `NCCL timeout` | [Distributed Issues](#distributed-training-issues) | `export NCCL_TIMEOUT=1800` |
| `Failed to initialize vLLM` | [Inference Engine Issues](#inference-engine-issues) | Reduce `engine_mem_util` |
| `NaN loss` | [Training Problems](#training-problems) | Lower learning rate, clip gradients |

## See Also

- [FAQ](faq.md) - Frequently asked questions
- [Configuration](../user_guide/configuration.md) - All parameters
- [Best Practices](../best_practice/strategy_usage.md) - Optimization tips