
lightrft.trainer.experience_maker

class lightrft.trainer.experience_maker.Experience(sequences: torch.Tensor, action_log_probs: torch.Tensor, base_action_log_probs: torch.Tensor, values: torch.Tensor, returns: torch.Tensor | None, advantages: torch.Tensor | None, attention_mask: torch.LongTensor | None, action_mask: torch.BoolTensor | None, info: dict | None, kl: torch.Tensor | None = None, action_entropy: torch.Tensor | None = None)[source]

Bases: object

Experience is a batch of data containing sequences and associated RL training information.

All tensors in a batch share the same sequence length and number of actions; sequences are left-padded.

Tensor shapes:
  • sequences: (B, S) where B is batch size, S is sequence length

  • action_log_probs: (B, A) where A is number of actions

  • values: (B, A)

  • returns: (B, A)

  • advantages: (B, A)

  • attention_mask: (B, S)

  • action_mask: (B, A)

  • kl: (B, A)

  • action_entropy: (B, A) - Entropy values for high-entropy token filtering

Parameters:
  • sequences (torch.Tensor) – Token sequences including both prompt and response.

  • action_log_probs (torch.Tensor) – Log probabilities of actions from the current policy.

  • base_action_log_probs (torch.Tensor) – Log probabilities from the reference (initial) policy.

  • values (torch.Tensor) – Value estimates from the critic.

  • returns (Optional[torch.Tensor]) – Discounted returns for each action.

  • advantages (Optional[torch.Tensor]) – Advantage estimates for each action.

  • attention_mask (Optional[torch.LongTensor]) – Mask indicating valid tokens in sequences.

  • action_mask (Optional[torch.BoolTensor]) – Mask indicating action (response) tokens.

  • info (Optional[dict]) – Dictionary containing additional information (rewards, lengths, etc.).

  • kl (Optional[torch.Tensor]) – KL divergence between current and reference policy.

  • action_entropy (Optional[torch.Tensor]) – Entropy values for each action token, used for high-entropy token filtering. When provided, enables training only on high-entropy tokens (forking tokens that determine reasoning directions), improving training efficiency. Shape: (B, A). See: https://arxiv.org/abs/2506.01939

action_entropy: torch.Tensor | None = None
action_log_probs: torch.Tensor
action_mask: torch.BoolTensor | None
advantages: torch.Tensor | None
attention_mask: torch.LongTensor | None
base_action_log_probs: torch.Tensor
info: dict | None
kl: torch.Tensor | None = None
pin_memory()[source]

Pin all tensors in memory for faster GPU transfer.

Returns:

Self with pinned tensors.

Return type:

Experience

returns: torch.Tensor | None
sequences: torch.Tensor
to_device(device: torch.device) None

Move all tensors in the experience to the specified device.

Parameters:

device (torch.device) – Target device.

values: torch.Tensor
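The shape conventions above can be illustrated with a small standalone sketch. Note this is pure Python with no dependency on lightrft or torch, and the helper `action_mask_from` is illustrative, not part of the API:

```python
def action_mask_from(attention_mask, num_actions):
    """Derive an action mask of shape (B, S): the final `num_actions`
    valid (attention_mask == 1) positions of each row are response tokens."""
    masks = []
    for row in attention_mask:
        mask = [0] * len(row)
        remaining = num_actions
        for i in range(len(row) - 1, -1, -1):
            if remaining and row[i] == 1:
                mask[i] = 1
                remaining -= 1
        masks.append(mask)
    return masks

# B=2, S=6, A=2; left padding means leading zeros in attention_mask
attention_mask = [
    [0, 0, 1, 1, 1, 1],  # 2 pad tokens, 2 prompt tokens, 2 response tokens
    [0, 1, 1, 1, 1, 1],  # 1 pad token, 3 prompt tokens, 2 response tokens
]
action_mask = action_mask_from(attention_mask, num_actions=2)
# Because of left padding, responses align at the right edge:
# [[0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 1]]
```

This is why left padding is convenient here: the action (response) tokens of every sample occupy the same trailing positions, so per-action tensors such as `action_log_probs` and `advantages` can share the `(B, A)` shape.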
class lightrft.trainer.experience_maker.NaiveExperienceMaker(actor: ActorLanguage, critic: torch.nn.Module, reward_model: torch.nn.Module, initial_model: ActorLanguage, tokenizer, prompt_max_len: int, kl_controller, strategy, remote_rm_url: List[str] | None = None, reward_fn: Callable | None = None, reward_fn_label_map: Dict | None = None, reward_recipe: Dict | None = None)[source]

Bases: ABC

A naive experience maker for reinforcement learning.

This class is responsible for generating experiences (sequences of prompts, actions, rewards, etc.) which are then used to train the actor and critic models. It orchestrates the interaction between the actor, critic, reward model, and the initial reference model to produce the data needed for a single step of PPO (or a similar RL algorithm).

Parameters:
  • actor (ActorLanguage) – The policy model to be trained.

  • critic (nn.Module) – The value model to be trained.

  • reward_model (nn.Module) – The reward model used to score generated responses.

  • initial_model (ActorLanguage) – The reference model for KL divergence calculation (typically a frozen copy of the SFT model).

  • tokenizer (Tokenizer) – The tokenizer for encoding and decoding text.

  • prompt_max_len (int) – The maximum length of input prompts after tokenization.

  • kl_controller (KLController) – The controller for managing the KL penalty coefficient.

  • strategy (Strategy) – The training strategy containing configurations and distributed training logic.

  • remote_rm_url (List[str], optional) – A list of URLs for remote reward models, defaults to None.

  • reward_fn (Callable, optional) – A custom reward function, defaults to None.

  • reward_fn_label_map (Dict, optional) – A map for reward function labels, defaults to None.

  • reward_recipe (Dict, optional) – A dictionary defining how to combine different reward sources, defaults to None.

generate_samples(all_prompts: List[str], **generate_kwargs) List[Samples]

Generate samples and return in batches.

Parameters:
  • all_prompts (List[str]) – List of prompt strings.

  • generate_kwargs (dict) – Additional generation parameters.

Returns:

List of Samples objects.

Return type:

List[Samples]

get_advantages_and_returns(values: torch.Tensor, rewards: torch.Tensor, action_mask: torch.Tensor, gamma: float, lambd: float) Tuple[torch.Tensor, torch.Tensor]

Compute advantages and returns from rewards and values using GAE.

Calculated as in the original PPO paper: https://arxiv.org/abs/1707.06347. Note that the rewards may already include a KL divergence penalty term.

Advantages formula:
Adv1 =   R1 + γ * λ * R2 + γ^2 * λ^2 * R3 + …
       - V1 + γ * (1 - λ) * V2 + γ^2 * λ * (1 - λ) * V3 + …

Returns formula:
Ret1 =   R1 + γ * λ * R2 + γ^2 * λ^2 * R3 + …
            + γ * (1 - λ) * V2 + γ^2 * λ * (1 - λ) * V3 + …

Parameters:
  • values (torch.Tensor) – Tensor of shape (batch_size, response_size).

  • rewards (torch.Tensor) – Tensor of shape (batch_size, response_size).

  • action_mask (torch.Tensor) – Tensor of shape (batch_size, response_size).

  • gamma (float) – Discount factor.

  • lambd (float) – GAE lambda parameter.

Returns:

Tuple of (advantages, returns), both of shape (batch_size, response_size).

Return type:

Tuple[torch.Tensor, torch.Tensor]
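The GAE recursion behind these formulas can be sketched in pure Python for a single trajectory (illustrative only; the real method operates on batched torch tensors and respects `action_mask`):

```python
def gae(values, rewards, gamma, lambd):
    """Generalized Advantage Estimation over one trajectory.

    delta_t = r_t + gamma * V_{t+1} - V_t   (V beyond the last step is 0)
    A_t     = delta_t + gamma * lambd * A_{t+1}
    Ret_t   = A_t + V_t
    """
    T = len(rewards)
    advantages = [0.0] * T
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lambd * last_adv
        advantages[t] = last_adv
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

adv, ret = gae(values=[0.5, 0.4, 0.3], rewards=[0.0, 0.0, 1.0],
               gamma=0.99, lambd=0.95)
```

Expanding the recursion for `t = 1` reproduces the closed-form sums shown above: the reward terms pick up `γλ` per step while the value terms pick up the `(1 - λ)` weights.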

get_cumulative_returns(rewards: torch.Tensor, action_mask: torch.Tensor, gamma: float) torch.Tensor

Compute cumulative returns from rewards using REINFORCE.

REINFORCE uses cumulative returns without GAE (Generalized Advantage Estimation).

Parameters:
  • rewards (torch.Tensor) – Tensor of shape (batch_size, response_size).

  • action_mask (torch.Tensor) – Binary mask tensor of shape (batch_size, response_size).

  • gamma (float) – Discount factor.

Returns:

Returns tensor of shape (batch_size, response_size).

Return type:

torch.Tensor
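The cumulative-return recursion can be sketched in pure Python for a single trajectory (illustrative; the real method is batched and masks out padding via `action_mask`):

```python
def cumulative_returns(rewards, gamma):
    """Discounted cumulative return, computed right to left:
    G_t = r_t + gamma * G_{t+1}, with G beyond the last step = 0."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With a single terminal reward, earlier steps see it discounted:
g = cumulative_returns([0.0, 0.0, 1.0], gamma=0.9)
# g ≈ [0.81, 0.9, 1.0]
```

Unlike GAE, no value estimates enter this computation, which is why REINFORCE-style updates need no critic.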

make_experience(samples: Samples) Experience

Turn samples into experience by calculating log probs, values, rewards, and KL divergence.

Parameters:

samples (Samples) – Samples object containing sequences and metadata.

Returns:

Experience object with all computed values.

Return type:

Experience

make_experience_list(all_prompts: str | List[str], **generate_kwargs) List[Experience]

Make a list of experiences, batched by micro_rollout_batch_size.

This method first generates response sequences and rewards for the given prompts. If the rewards need processing or filtering, the rollout is processed as a whole. Finally, advantages and returns are computed for each experience.

Parameters:
  • all_prompts (Union[str, List[str]]) – Prompts to generate responses for.

  • generate_kwargs (dict) – Additional generation parameters (gamma, lambd, etc.).

Returns:

List of Experience objects.

Return type:

List[Experience]

process_experiences(experiences: List[Experience]) Tuple[List[Experience], List[torch.Tensor]]

Process experiences for reward shaping and filtering.

This hook can filter out experiences or apply additional processing to their rewards.

Parameters:

experiences (List[Experience]) – List of Experience objects.

Returns:

Tuple of (processed experiences, processed rewards).

Return type:

Tuple[List[Experience], List[torch.Tensor]]

tokenize_fn(texts, max_length, padding=True, device=None)[source]

Tokenize input texts.

Parameters:
  • texts (List[str]) – List of text strings to tokenize.

  • max_length (int) – Maximum sequence length.

  • padding (bool) – Whether to apply padding, defaults to True.

  • device (torch.device or str, optional) – Target device for tensors, defaults to None.

Returns:

Tokenized batch (as dict if padding=True, otherwise as list).

Return type:

dict or list
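Since sequences are left-padded (see Experience above), the padded output can be sketched in pure Python. This is a stand-in, not the real method, which delegates to the configured tokenizer; the pad id 0 and the helper name are assumptions:

```python
def left_pad(batch_ids, max_length, pad_id=0):
    """Left-pad variable-length token id lists to a fixed length and
    build the matching attention mask (1 = real token, 0 = padding)."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[-max_length:]          # keep the rightmost tokens if too long
        pad = max_length - len(ids)
        input_ids.append([pad_id] * pad + ids)
        attention_mask.append([0] * pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = left_pad([[5, 6], [7, 8, 9]], max_length=4)
# input_ids:      [[0, 0, 5, 6], [0, 7, 8, 9]]
# attention_mask: [[0, 0, 1, 1], [0, 1, 1, 1]]
```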

class lightrft.trainer.experience_maker.Samples(sequences: torch.Tensor, attention_mask: torch.LongTensor | None, action_mask: torch.BoolTensor | None, num_actions: int | torch.Tensor, packed_seq_lens: torch.Tensor | None, response_length: torch.Tensor, total_length: torch.Tensor, prompts: list[str], labels: list[str], pad_len: int | None)[source]

Bases: object

Samples is a batch of data that can be in batched or packed format.

The batched format applies padding to sequences, while the packed format concatenates prompt and response without padding.

Tensor shapes (batched / packed):
  • sequences: (B, S) or (1, total_length) - tokens of both prompt and response

  • attention_mask: (B, S) or (1, total_length) - attention mask for sequences

  • action_mask: (B, A) or None - response mask showing which part is the response

  • num_actions: int or (B,) - number of actions (tokens) in the response

  • packed_seq_lens: None or (B,) - length of each sample in packed format

  • response_length: (B,) - number of tokens in the response

  • total_length: (B,) - total number of tokens in sequences

  • prompts: list[str] - the prompts used to generate responses

  • labels: list[str] - ground truth labels (if available)

Parameters:
  • sequences (torch.Tensor) – Token sequences including both prompt and response.

  • attention_mask (Optional[torch.LongTensor]) – Attention mask for sequences.

  • action_mask (Optional[torch.BoolTensor]) – Mask indicating action (response) tokens.

  • num_actions (Union[int, torch.Tensor]) – Number of actions per sample.

  • packed_seq_lens (Optional[torch.Tensor]) – Sequence lengths for packed format.

  • response_length (torch.Tensor) – Length of each response.

  • total_length (torch.Tensor) – Total length of each sequence.

  • prompts (list[str]) – List of prompt strings.

  • labels (list[str]) – List of label strings.

  • pad_len (Optional[int]) – Padding length applied.

action_mask: torch.BoolTensor | None
attention_mask: torch.LongTensor | None
labels: list[str]
num_actions: int | torch.Tensor
packed_seq_lens: torch.Tensor | None
pad_len: int | None
prompts: list[str]
response_length: torch.Tensor
sequences: torch.Tensor
total_length: torch.Tensor
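The two layouts described above can be contrasted with a small standalone sketch (pure Python; the real Samples holds torch tensors):

```python
# Batched format: pad each sequence to a common length S -> shape (B, S).
# Packed format: concatenate all sequences -> shape (1, total_length),
# with packed_seq_lens recording each sample's length.

samples = [[11, 12, 13], [21, 22]]      # two variable-length sequences

# Batched: left padding with pad id 0
S = max(len(s) for s in samples)
batched = [[0] * (S - len(s)) + s for s in samples]
# batched: [[11, 12, 13], [0, 21, 22]]

# Packed: no padding at all
packed = [tok for s in samples for tok in s]
packed_seq_lens = [len(s) for s in samples]
# packed: [11, 12, 13, 21, 22], packed_seq_lens: [3, 2]
```

The packed layout trades indexing convenience for memory: no pad tokens are stored or computed over, at the cost of needing `packed_seq_lens` to locate sample boundaries.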
lightrft.trainer.experience_maker.clip_filter_like_weight_func(rewards, clip_filter_like_weight_clip_eps=3.0, lamda=1.0)[source]

Compute clip-filter-like weights for rewards.

This function applies a weighting scheme similar to the clip-filter method used in early RLHF implementations, where samples with zero variance are given special weights.

Parameters:
  • rewards (torch.Tensor) – Reward tensor of shape [batch_size, n_samples].

  • clip_filter_like_weight_clip_eps (float) – Maximum clipping value for weights, defaults to 3.0.

  • lamda (float) – Weight value for samples with zero variance, defaults to 1.0.

Returns:

Weight tensor of the same shape as rewards.

Return type:

torch.Tensor

lightrft.trainer.experience_maker.pin_memory(tensor: torch.Tensor | list[torch.Tensor])[source]

Pin tensor(s) in memory for faster GPU transfer.

Parameters:

tensor (Union[torch.Tensor, list[torch.Tensor]]) – Tensor or list of tensors to pin.

Returns:

Pinned tensor(s).

Return type:

Union[torch.Tensor, list[torch.Tensor]]

lightrft.trainer.experience_maker.to(tensor: torch.Tensor | list[torch.Tensor], device)[source]

Move tensor(s) to the specified device.

Parameters:
  • tensor (Union[torch.Tensor, list[torch.Tensor]]) – Tensor or list of tensors to move.

  • device (torch.device or str) – Target device.

Returns:

Tensor(s) on the target device.

Return type:

Union[torch.Tensor, list[torch.Tensor]]