GRPO Training with GSM8K & Geo3K Datasets¶

This tutorial walks you through using LightRFT to run GRPO (Group Relative Policy Optimization) training on two math reasoning datasets: GSM8K (text-only) and Geo3K (multi-modal with geometry diagrams).

Overview¶

Item	GSM8K	Geo3K
Task	Grade-school math word problems	Geometry problem solving
Modality	Text-only	Multi-modal (text + image)
Source	openai/gsm8k	hiyouga/geometry3k
Train / Test	7,473 / 1,319	~2,100 / 601
Reward	Pure rule-based (no neural reward model)	Pure rule-based (no neural reward model)
Base Model	Qwen2.5-0.5B-Instruct (or larger)	Qwen2.5-VL-7B-Instruct

Both tasks use a pure rule-based reward mechanism:

Format reward (10%): Checks for <think>...</think> reasoning tags followed by \boxed{} answer notation.
Accuracy reward (90%): Verifies the final answer against the ground truth using mathruler.

No separate neural reward model is required.

1. Dataset Preprocessing¶

Before training, convert the raw HuggingFace datasets into LightRFT-compatible parquet format.

1.1 GSM8K Preprocessing¶

python examples/gsm8k_geo3k/data_preprocess/gsm8k.py \
    --local_save_dir ~/data/gsm8k

The script performs the following steps:

Loads openai/gsm8k from HuggingFace (or a local path via --local_dataset_path).
Extracts the numerical answer from the #### ANSWER pattern in each solution (commas are stripped, e.g. 1,000 → 1000).
Wraps each question into a chat-structured prompt:

{
  "prompt": [
    {"role": "system", "content": "You FIRST think about the reasoning process step by step ... The final answer MUST BE put in \\boxed{} after the reasoning."},
    {"role": "user", "content": "<original question>"}
  ],
  "extra_info": {
    "label": "gsm8k_rule",
    "reference": "<extracted numerical answer>"
  }
}

Saves train.parquet and test.parquet to the specified directory.

1.2 Geo3K Preprocessing¶

python examples/gsm8k_geo3k/data_preprocess/geo3k.py \
    --local_save_dir ~/data/geo3k

The script performs the following steps:

Loads hiyouga/geometry3k from HuggingFace (or a local path via --local_dataset_path).
Preserves the geometry diagram images from the dataset.
Wraps each problem into a chat-structured prompt with image references:

{
  "prompt": [
    {"role": "system", "content": "You FIRST think about the reasoning process ... The final answer MUST BE put in \\boxed{}."},
    {"role": "user", "content": "<geometry problem text>"}
  ],
  "images": ["<PIL Image>"],
  "extra_info": {
    "label": "geo3k_rule",
    "reference": "<ground truth answer>"
  }
}

Saves train.parquet and test.parquet to the specified directory.

1.3 Output Format Summary¶

Both preprocessing scripts produce a unified schema:

Field	Description
`prompt`	Chat-structured list `[{role, content}, ...]`
`images`	List of PIL images (Geo3K only; absent for GSM8K)
`extra_info.label`	`"gsm8k_rule"` or `"geo3k_rule"` — maps to the reward RECIPE
`extra_info.reference`	Ground truth answer string for rule-based evaluation

2. Training¶

2.1 GSM8K — Text-Only GRPO Training¶

# Edit paths in the script first, then:
bash examples/gsm8k_geo3k/run_grpo_gsm8k_qwen2.5_0.5b.sh

Or launch directly with torchrun:

torchrun --nnodes 1 --nproc-per-node 8 \
    examples/gsm8k_geo3k/train_colocate.py \
    --pretrain Qwen/Qwen2.5-0.5B-Instruct \
    --prompt_data /path/to/gsm8k_dataset \
    --input_key prompt --label_key label \
    --text_only \
    --loss_agg_mode seq-mean-token-mean \
    --advantage_estimator group_norm \
    --n_samples_per_prompt 5 \
    --num_episodes 30 \
    --max_epochs 1 \
    --train_batch_size 128 \
    --rollout_batch_size 128 \
    --micro_train_batch_size 4 \
    --micro_rollout_batch_size 4 \
    --actor_learning_rate 1e-6 \
    --lr_warmup_ratio 0.03 \
    --init_kl_coef 0.01 \
    --kl_estimator k3 \
    --use_kl_loss \
    --l2 1.0e-2 \
    --fsdp --zero_stage 3 --bf16 \
    --flash_attn --gradient_checkpointing \
    --apply_chat_template \
    --freeze_prefix \
    --adam_offload \
    --rm_use_engine \
    --reward_pretrain "{}" \
    --engine_type sglang --engine_tp_size 2 \
    --engine_mem_util 0.6 --enable_engine_sleep \
    --eval_steps 20 --eval_split test \
    --max_eval_samples 1319 \
    --save_path results/gsm8k_grpo \
    --save_steps 20 --max_ckpt_num 3 \
    --system_prompt 'A conversation between the User and Assistant. The User asks a question, and the Assistant provides a solution. The Assistant first thinks through the reasoning process internally with self-reflection and consistency check and then gives the final analysis and answer. The reasoning process should be enclosed within <think></think>, followed directly by the final thought and answer, the final answer MUST BE put in \\boxed{}, like this: <think> reasoning process here </think> final thought and \\boxed{answer} here.'

2.3 Geo3K — LoRA GRPO Training (Parameter-Efficient)¶

For resource-constrained environments, use LoRA to fine-tune only a small fraction of parameters:

bash examples/gsm8k_geo3k/run_grpo_geo3k_lora_qwen2.5_vl_7b.sh

Key LoRA-specific parameters:

--lora_rank 128 \
--lora_alpha 256 \
--target_modules all-linear

3. Key Hyperparameters¶

Parameter	GSM8K Default	Geo3K Default	Description
`--n_samples_per_prompt`	5	8	Number of rollout samples per prompt (GRPO group size)
`--num_episodes`	30	20	Total training episodes
`--train_batch_size`	128	128	Global training batch size
`--rollout_batch_size`	128	128	Global rollout batch size
`--micro_train_batch_size`	4	4	Per-GPU micro training batch size
`--micro_rollout_batch_size`	4	8	Per-GPU micro rollout batch size
`--actor_learning_rate`	1e-6	1e-6	Actor learning rate
`--lr_warmup_ratio`	0.03	0.03	Learning rate warmup ratio
`--init_kl_coef`	0.01	0.01	KL divergence penalty coefficient
`--kl_estimator`	k3	k3	KL estimator type
`--prompt_max_len`	1024	1024	Maximum prompt length
`--generate_max_len`	2048	2048	Maximum generation length
`--advantage_estimator`	group_norm	group_norm	GRPO advantage estimation
`--engine_tp_size`	2	2	Inference engine tensor parallelism
`--l2`	1e-2	1e-2	L2 regularization weight
`--text_only`	Yes	No	Text-only mode (no image processing)
`--mixed_mm_data`	No	Yes	Enable multi-modal data processing

4. Reward Mechanism¶

4.1 RECIPE Configuration¶

The reward system is driven by a label-based RECIPE mapping defined in examples/gsm8k_geo3k/reward_models_utils.py:

RECIPE = {
    "geo3k_rule": [("geo3k_rule", None, 1.0)],
    "gsm8k_rule": [("gsm8k_rule", None, 1.0)],
}

Each sample’s label field (set during preprocessing) determines which reward function is applied.

4.2 Format Reward (10% weight)¶

Validates the model output follows the required reasoning format:

<think> reasoning process here </think> final thought and \boxed{answer}

The check uses regex to verify:

<think>...</think> tags are present.
\boxed{...} notation is present.
The </think> closing tag appears before \boxed{}.

def format_reward_fn(sol: str) -> float:
    think_match = re.search(r'<think>.*?</think>', sol, re.DOTALL)
    boxed_match = re.search(r'\\boxed\{.*?\}', sol, re.DOTALL)
    if think_match and boxed_match:
        return 1.0 if think_match.end() <= boxed_match.start() else 0.0
    return 0.0

4.3 Accuracy Reward (90% weight)¶

Extracts the answer from \boxed{} and compares it against the ground truth:

def accuracy_reward_fn(sol: str, gt: str) -> float:
    from mathruler.grader import extract_boxed_content, grade_answer
    pred = extract_boxed_content(sol)
    return 1.0 if grade_answer(pred, gt) else 0.0

mathruler.grader handles numeric equivalence, fraction simplification, and other mathematical normalization.

4.4 Combined Reward¶

final_reward = 0.9 * accuracy_reward + 0.1 * format_reward

4.5 Response Extraction¶

Before reward computation, the assistant’s response is extracted from the full chat transcript to avoid false positives from system prompt examples:

def extract_response(text: str) -> str:
    # Finds the last <|im_start|>assistant ... <|im_end|> segment
    ...

5. Training Monitoring (W&B)¶

5.1 Enable W&B Logging¶

Set the following in your training script:

export WANDB_API_KEY="your_api_key"
export WANDB_PROJECT="LightRFT-Experiments"
export WANDB_MODE="online"  # "offline" for local-only logging

5.2 Key Metrics¶

rollout/reward¶

GSM8K (Qwen2.5-0.5B-Instruct):

GSM8K rollout/reward curve

Geo3K (Qwen2.5-VL-7B-Instruct):

Geo3K rollout/reward curve

Expected trend: The rollout/reward curve should show a steady upward trend over training steps. In early episodes, the reward typically starts low as the model has not yet learned the correct format and reasoning patterns. As training progresses, the reward should increase smoothly and monotonically with minor fluctuations, eventually converging to a stable level. If the reward plateaus early or drops sharply, consider adjusting the KL coefficient or learning rate.

eval/accuracy¶

GSM8K (Qwen2.5-0.5B-Instruct):

GSM8K eval/accuracy curve

Geo3K (Qwen2.5-VL-7B-Instruct):

Geo3K eval/accuracy curve

Expected trend: The eval/accuracy curve reflects the model’s actual problem-solving ability on the held-out test set. It should correlate with the reward curve but may lag slightly. Accuracy is expected to gradually improve from the base model’s initial level, showing an overall upward trend before eventually converging. The eval curve is noisier than the reward curve due to smaller evaluation sample sizes. Sudden drops may indicate overfitting or KL divergence issues — consider adjusting the KL coefficient or learning rate accordingly.

5.3 Additional Useful Metrics¶

Metric	Description
`rollout/format_reward`	Format compliance rate
`rollout/accuracy_reward`	Answer correctness rate
`train/actor_loss`	Actor policy loss (should decrease)
`train/kl_divergence`	KL divergence from reference policy (should stay bounded)
`train/entropy`	Policy entropy (gradual decrease indicates learning)

6. Tips & Troubleshooting¶

OOM: Reduce micro_train_batch_size / micro_rollout_batch_size, or lower --engine_mem_util.
Slow convergence: Increase --n_samples_per_prompt for better GRPO advantage estimation.
Format reward stuck at 0: Verify the system prompt is correctly passed via --system_prompt and --apply_chat_template.
Geo3K image loading errors: Ensure the preprocessed parquet files contain valid PIL image objects and --images_key images is set.
LoRA training: Use --lora_rank 128 --lora_alpha 256 for a good balance between efficiency and capacity.

GRPO Training with GSM8K & Geo3K Datasets¶

Overview¶

1. Dataset Preprocessing¶

1.1 GSM8K Preprocessing¶

1.2 Geo3K Preprocessing¶

1.3 Output Format Summary¶

2. Training¶

2.1 GSM8K — Text-Only GRPO Training¶

2.2 Geo3K — Multi-Modal GRPO Training¶

2.3 Geo3K — LoRA GRPO Training (Parameter-Efficient)¶

3. Key Hyperparameters¶

4. Reward Mechanism¶

4.1 RECIPE Configuration¶

4.2 Format Reward (10% weight)¶

4.3 Accuracy Reward (90% weight)¶

4.4 Combined Reward¶

4.5 Response Extraction¶

5. Training Monitoring (W&B)¶

5.1 Enable W&B Logging¶

5.2 Key Metrics¶

rollout/reward¶

eval/accuracy¶

5.3 Additional Useful Metrics¶

6. Tips & Troubleshooting¶

Related Resources¶