lightrft.trainer.experience_maker¶
- class lightrft.trainer.experience_maker.Experience(sequences: torch.Tensor, action_log_probs: torch.Tensor, base_action_log_probs: torch.Tensor, values: torch.Tensor, returns: torch.Tensor | None, advantages: torch.Tensor | None, attention_mask: torch.LongTensor | None, action_mask: torch.BoolTensor | None, info: dict | None, kl: torch.Tensor | None = None, action_entropy: torch.Tensor | None = None)[source]¶
Bases: object
Experience is a batch of data containing sequences and associated RL training information.
These data should have the same sequence length and number of actions. Left padding for sequences is applied.
- Tensor shapes:
sequences: (B, S) where B is batch size, S is sequence length
action_log_probs: (B, A) where A is number of actions
values: (B, A)
returns: (B, A)
advantages: (B, A)
attention_mask: (B, S)
action_mask: (B, A)
kl: (B, A)
action_entropy: (B, A) - Entropy values for high-entropy token filtering
- Parameters:
sequences (torch.Tensor) – Token sequences including both prompt and response.
action_log_probs (torch.Tensor) – Log probabilities of actions from the current policy.
base_action_log_probs (torch.Tensor) – Log probabilities from the reference (initial) policy.
values (torch.Tensor) – Value estimates from the critic.
returns (Optional[torch.Tensor]) – Discounted returns for each action.
advantages (Optional[torch.Tensor]) – Advantage estimates for each action.
attention_mask (Optional[torch.LongTensor]) – Mask indicating valid tokens in sequences.
action_mask (Optional[torch.BoolTensor]) – Mask indicating action (response) tokens.
info (Optional[dict]) – Dictionary containing additional information (rewards, lengths, etc.).
kl (Optional[torch.Tensor]) – KL divergence between current and reference policy.
action_entropy (Optional[torch.Tensor]) – Entropy values for each action token, used for high-entropy token filtering. When provided, enables training only on high-entropy tokens (forking tokens that determine reasoning directions), improving training efficiency. Shape: (B, A). See: https://arxiv.org/abs/2506.01939
- action_entropy: torch.Tensor | None = None¶
- action_log_probs: torch.Tensor¶
- action_mask: torch.BoolTensor | None¶
- advantages: torch.Tensor | None¶
- attention_mask: torch.LongTensor | None¶
- base_action_log_probs: torch.Tensor¶
- info: dict | None¶
- kl: torch.Tensor | None = None¶
- pin_memory()[source]¶
Pin all tensors in memory for faster GPU transfer.
- Returns:
Self with pinned tensors.
- Return type:
Experience
- returns: torch.Tensor | None¶
- sequences: torch.Tensor¶
- to_device(device: torch.device) → None¶
Move all tensors in the experience to the specified device.
- Parameters:
device (torch.device) – Target device.
- values: torch.Tensor¶
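The shape conventions above can be checked with a toy batch (a sketch; the token values and pad id 0 are invented for illustration):

```python
import torch

B, S, A = 2, 6, 3  # batch size, sequence length, number of actions

# Left-padded sequences: the response occupies the last A positions
sequences = torch.tensor([[0, 0, 5, 6, 7, 8],
                          [1, 2, 3, 4, 5, 6]])
attention_mask = (sequences != 0).long()           # (B, S) valid-token mask
action_mask = torch.ones(B, A, dtype=torch.bool)   # (B, A) response-token mask

# Every per-action tensor in an Experience shares the (B, A) shape
action_log_probs = torch.randn(B, A)
values = torch.randn(B, A)

assert sequences.shape == (B, S)
assert attention_mask.shape == sequences.shape
assert action_log_probs.shape == action_mask.shape == (B, A)
```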
- class lightrft.trainer.experience_maker.NaiveExperienceMaker(actor: ActorLanguage, critic: torch.nn.Module, reward_model: torch.nn.Module, initial_model: ActorLanguage, tokenizer, prompt_max_len: int, kl_controller, strategy, remote_rm_url: List[str] | None = None, reward_fn: Callable | None = None, reward_fn_label_map: Dict | None = None, reward_recipe: Dict | None = None)[source]¶
Bases: ABC
A naive experience maker for reinforcement learning.
This class is responsible for generating experiences (sequences of prompts, actions, rewards, etc.) which are then used to train the actor and critic models. It orchestrates the interaction between the actor, critic, reward model, and the initial reference model to produce the data needed for a single step of PPO (or a similar RL algorithm).
- Parameters:
actor (ActorLanguage) – The policy model to be trained.
critic (nn.Module) – The value model to be trained.
reward_model (nn.Module) – The reward model used to score generated responses.
initial_model (ActorLanguage) – The reference model for KL divergence calculation (typically a frozen copy of the SFT model).
tokenizer (Tokenizer) – The tokenizer for encoding and decoding text.
prompt_max_len (int) – The maximum length of input prompts after tokenization.
kl_controller (KLController) – The controller for managing the KL penalty coefficient.
strategy (Strategy, optional) – The training strategy containing configurations and distributed training logic, defaults to None.
remote_rm_url (List[str], optional) – A list of URLs for remote reward models, defaults to None.
reward_fn (Callable, optional) – A custom reward function, defaults to None.
reward_fn_label_map (Dict, optional) – A map for reward function labels, defaults to None.
reward_recipe (Dict, optional) – A dictionary defining how to combine different reward sources, defaults to None.
- generate_samples(all_prompts: List[str], **generate_kwargs) → List[Samples]¶
Generate samples and return in batches.
- Parameters:
all_prompts (List[str]) – List of prompt strings.
generate_kwargs (dict) – Additional generation parameters.
- Returns:
List of Samples objects.
- Return type:
List[Samples]
- get_advantages_and_returns(values: torch.Tensor, rewards: torch.Tensor, action_mask: torch.Tensor, gamma: float, lambd: float) → Tuple[torch.Tensor, torch.Tensor]¶
Compute advantages and returns from rewards and values using GAE.
Calculated as in the original PPO paper: https://arxiv.org/abs/1707.06347 Note that rewards may include a KL divergence loss term.
- Advantages formula:
Adv1 = R1 + γ * λ * R2 + γ^2 * λ^2 * R3 + …
     - V1 + γ * (1 - λ) * V2 + γ^2 * λ * (1 - λ) * V3 + …
- Returns formula:
Ret1 = R1 + γ * λ * R2 + γ^2 * λ^2 * R3 + …
          + γ * (1 - λ) * V2 + γ^2 * λ * (1 - λ) * V3 + …
- Parameters:
values (torch.Tensor) – Tensor of shape (batch_size, response_size).
rewards (torch.Tensor) – Tensor of shape (batch_size, response_size).
action_mask (torch.Tensor) – Tensor of shape (batch_size, response_size).
gamma (float) – Discount factor.
lambd (float) – GAE lambda parameter.
- Returns:
Tuple of (advantages, returns), both of shape (batch_size, response_size).
- Return type:
Tuple[torch.Tensor, torch.Tensor]
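The GAE recursion behind these formulas can be sketched as follows (an illustrative reimplementation, not the library's exact code; masking details may differ):

```python
import torch

def gae_advantages_and_returns(values, rewards, action_mask, gamma, lambd):
    """GAE over response tokens: delta_t = r_t + gamma * V_{t+1} - V_t,
    A_t = delta_t + gamma * lambd * A_{t+1}, and Ret_t = A_t + V_t."""
    values = values * action_mask
    rewards = rewards * action_mask
    last_gae = torch.zeros(rewards.size(0))
    advantages_reversed = []
    T = rewards.size(1)
    for t in reversed(range(T)):
        next_values = values[:, t + 1] if t < T - 1 else 0.0
        delta = rewards[:, t] + gamma * next_values - values[:, t]
        last_gae = delta + gamma * lambd * last_gae
        advantages_reversed.append(last_gae)
    advantages = torch.stack(advantages_reversed[::-1], dim=1)
    returns = advantages + values
    return advantages, returns
```

With gamma = lambd = 1 and zero values, the advantage of every token collapses to the undiscounted sum of future rewards, which is a quick sanity check for the recursion.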
- get_cumulative_returns(rewards: torch.Tensor, action_mask: torch.Tensor, gamma: float) → Tuple[torch.Tensor, torch.Tensor]¶
Compute cumulative returns from rewards using REINFORCE.
REINFORCE uses cumulative returns without GAE (Generalized Advantage Estimation).
- Parameters:
rewards (torch.Tensor) – Tensor of shape (batch_size, response_size).
action_mask (torch.Tensor) – Binary mask tensor of shape (batch_size, response_size).
gamma (float) – Discount factor.
- Returns:
Cumulative returns tensor of shape (batch_size, response_size).
- Return type:
torch.Tensor
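The discounted reward-to-go that REINFORCE uses can be sketched as a simple backward pass (an illustrative reimplementation, not the library's exact code):

```python
import torch

def cumulative_returns(rewards, action_mask, gamma):
    """Discounted reward-to-go: Ret_t = r_t + gamma * Ret_{t+1} (no GAE)."""
    rewards = rewards * action_mask
    returns = torch.zeros_like(rewards)
    running = torch.zeros(rewards.size(0))
    for t in reversed(range(rewards.size(1))):
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns
```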
- make_experience(samples: Samples) → Experience¶
Turn samples into experience by calculating log probs, values, rewards, and KL divergence.
- Parameters:
samples (Samples) – Samples object containing sequences and metadata.
- Returns:
Experience object with all computed values.
- Return type:
Experience
- make_experience_list(all_prompts: str | List[str], **generate_kwargs) → List[Experience]¶
Make a list of experiences, batched by micro_rollout_batch_size.
This method first generates the response sequences and rewards for the given prompts. If any reward processing or filtering is required, the rollout is processed as a whole. Finally, advantages and returns are computed for each experience.
- Parameters:
all_prompts (Union[str, List[str]]) – Prompts to generate responses for.
generate_kwargs (dict) – Additional generation parameters (gamma, lambd, etc.).
- Returns:
List of Experience objects.
- Return type:
List[Experience]
- process_experiences(experiences: List[Experience]) → Tuple[List[Experience], List[torch.Tensor]]¶
Process experiences for reward shaping and filtering.
This can be used to filter out some experiences or do some processing on the rewards.
- Parameters:
experiences (List[Experience]) – List of Experience objects.
- Returns:
Tuple of (processed experiences, processed rewards).
- Return type:
Tuple[List[Experience], List[torch.Tensor]]
- tokenize_fn(texts, max_length, padding=True, device=None)[source]¶
Tokenize input texts.
- Parameters:
texts (List[str]) – List of text strings to tokenize.
max_length (int) – Maximum sequence length.
padding (bool) – Whether to apply padding, defaults to True.
device (torch.device or str, optional) – Target device for tensors, defaults to None.
- Returns:
Tokenized batch (as dict if padding=True, otherwise as list).
- Return type:
dict or list
- class lightrft.trainer.experience_maker.Samples(sequences: torch.Tensor, attention_mask: torch.LongTensor | None, action_mask: torch.BoolTensor | None, num_actions: int | torch.Tensor, packed_seq_lens: torch.Tensor | None, response_length: torch.Tensor, total_length: torch.Tensor, prompts: list[str], labels: list[str], pad_len: int | None)[source]¶
Bases: object
Samples is a batch of data that can be in batched or packed format.
The batched format applies padding to sequences, while the packed format concatenates prompt and response without padding.
- Tensor shapes (batched / packed):
sequences: (B, S) or (1, total_length) - tokens of both prompt and response
attention_mask: (B, S) or (1, total_length) - attention mask for sequences
action_mask: (B, A) or None - response mask showing which part is the response
num_actions: int or (B,) - number of actions (tokens) in the response
packed_seq_lens: None or (B,) - length of each sample in packed format
response_length: (B,) - number of tokens in the response
total_length: (B,) - total number of tokens in sequences
prompts: list[str] - the prompts used to generate responses
labels: list[str] - ground truth labels (if available)
- Parameters:
sequences (torch.Tensor) – Token sequences including both prompt and response.
attention_mask (Optional[torch.LongTensor]) – Attention mask for sequences.
action_mask (Optional[torch.BoolTensor]) – Mask indicating action (response) tokens.
num_actions (Union[int, torch.Tensor]) – Number of actions per sample.
packed_seq_lens (Optional[torch.Tensor]) – Sequence lengths for packed format.
response_length (torch.Tensor) – Length of each response.
total_length (torch.Tensor) – Total length of each sequence.
prompts (list[str]) – List of prompt strings.
labels (list[str]) – List of label strings.
pad_len (Optional[int]) – Padding length applied.
- action_mask: torch.BoolTensor | None¶
- attention_mask: torch.LongTensor | None¶
- labels: list[str]¶
- num_actions: int | torch.Tensor¶
- packed_seq_lens: torch.Tensor | None¶
- pad_len: int | None¶
- prompts: list[str]¶
- response_length: torch.Tensor¶
- sequences: torch.Tensor¶
- total_length: torch.Tensor¶
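The two layouts can be illustrated with a toy pair of samples (a sketch; the token ids and pad id 0 are invented for illustration):

```python
import torch

# Batched format: two samples left-padded to a common length S = 5
batched = torch.tensor([[0, 0, 11, 12, 13],
                        [21, 22, 23, 24, 25]])
attention_mask = (batched != 0).long()          # (B, S)

# Packed format: the same tokens concatenated with no padding
packed = torch.tensor([[11, 12, 13, 21, 22, 23, 24, 25]])  # (1, total_length)
packed_seq_lens = torch.tensor([3, 5])                     # per-sample lengths

# The packed tensor holds exactly the non-pad tokens of the batch
assert packed.numel() == int(attention_mask.sum())
assert int(packed_seq_lens.sum()) == packed.size(1)
```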
- lightrft.trainer.experience_maker.clip_filter_like_weight_func(rewards, clip_filter_like_weight_clip_eps=3.0, lamda=1.0)[source]¶
Compute clip-filter-like weights for rewards.
This function applies a weighting scheme similar to the clip-filter method used in early RLHF implementations, where samples with zero variance are given special weights.
- Parameters:
rewards (torch.Tensor) – Reward tensor of shape [batch_size, n_samples].
clip_filter_like_weight_clip_eps (float) – Maximum clipping value for weights, defaults to 3.0.
lamda (float) – Weight value for samples with zero variance, defaults to 1.0.
- Returns:
Weight tensor of the same shape as rewards.
- Return type:
torch.Tensor
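A minimal sketch consistent with the description above. The inverse-spread weighting is an assumption made for illustration; only the clipping ceiling (clip_filter_like_weight_clip_eps) and the zero-variance fallback (lamda) are taken from the documented behavior:

```python
import torch

def clip_filter_like_weight(rewards, clip_eps=3.0, lamda=1.0):
    # rewards: [batch_size, n_samples] - n_samples responses per prompt
    std = rewards.std(dim=-1, keepdim=True)
    # Hypothetical choice: weight inversely to reward spread, clipped at
    # clip_eps; zero-variance groups (all samples scored identically)
    # fall back to the flat weight lamda.
    weights = torch.where(
        std > 0,
        (1.0 / std.clamp(min=1e-8)).clamp(max=clip_eps),
        torch.full_like(std, lamda),
    )
    return weights.expand_as(rewards)
```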
- lightrft.trainer.experience_maker.pin_memory(tensor: torch.Tensor | list[torch.Tensor])[source]¶
Pin tensor(s) in memory for faster GPU transfer.
- Parameters:
tensor (Union[torch.Tensor, list[torch.Tensor]]) – Tensor or list of tensors to pin.
- Returns:
Pinned tensor(s).
- Return type:
Union[torch.Tensor, list[torch.Tensor]]
- lightrft.trainer.experience_maker.to(tensor: torch.Tensor | list[torch.Tensor], device)[source]¶
Move tensor(s) to the specified device.
- Parameters:
tensor (Union[torch.Tensor, list[torch.Tensor]]) – Tensor or list of tensors to move.
device (torch.device or str) – Target device.
- Returns:
Tensor(s) on the target device.
- Return type:
Union[torch.Tensor, list[torch.Tensor]]
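Both module-level helpers follow the same tensor-or-list dispatch pattern; a minimal sketch:

```python
import torch

def to(tensor, device):
    # Recurse into lists, otherwise move the single tensor
    if isinstance(tensor, list):
        return [to(t, device) for t in tensor]
    return tensor.to(device)

def pin_memory(tensor):
    # Same dispatch; page-locks host memory for faster async H2D copies
    # (requires a CUDA-enabled build to actually pin)
    if isinstance(tensor, list):
        return [pin_memory(t) for t in tensor]
    return tensor.pin_memory()
```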