lightrft.trainer.experience_maker_vl

class lightrft.trainer.experience_maker_vl.ExperienceVL(sequences: torch.Tensor, pixel_values: torch.Tensor | None = None, image_grid_thws: torch.Tensor | None = None, raw_images: List[Image] | None = None, pixel_values_videos: torch.Tensor | None = None, video_grid_thws: torch.Tensor | None = None, action_log_probs: torch.Tensor = None, base_action_log_probs: torch.Tensor = None, values: torch.Tensor = None, returns: torch.Tensor | None = None, advantages: torch.Tensor | None = None, attention_mask: torch.LongTensor | None = None, action_mask: torch.BoolTensor | None = None, info: dict | None = None, kl: torch.Tensor | None = None, action_entropy: torch.Tensor | None = None)[source]

Bases: object

Experience is a batch of data for Vision-Language models.

All samples in the batch share the same sequence length and number of actions; sequences are left-padded.

Tensor shapes:
  • sequences: (B, S) where B is batch size, S is sequence length

  • pixel_values: (B * h, w) - image pixels processed by HF processor

  • image_grid_thws: (B, 3) - image grid dimensions (t, h, w)

  • raw_images: Optional[List[Image.Image]] - raw images before processing

  • pixel_values_videos: (B * f, c * h * w) - video pixels processed by HF processor

  • video_grid_thws: (B, 3) - video grid dimensions (t, h, w)

  • action_log_probs: (B, A) where A is number of actions

  • base_action_log_probs: (B, A)

  • values: (B, A)

  • returns: (B, A)

  • advantages: (B, A)

  • attention_mask: (B, S)

  • action_mask: (B, A)

  • kl: (B, A)

  • action_entropy: (B, A) - Entropy values for high-entropy token filtering

Parameters:
  • sequences (torch.Tensor) – Token sequences including both prompt and response.

  • pixel_values (Optional[torch.Tensor]) – Image pixel values processed by HF processor, defaults to None.

  • image_grid_thws (Optional[torch.Tensor]) – Image grid thw, defaults to None.

  • raw_images (Optional[List[Image.Image]]) – Raw image data list, defaults to None.

  • pixel_values_videos (Optional[torch.Tensor]) – Video pixel values processed by HF processor, defaults to None.

  • video_grid_thws (Optional[torch.Tensor]) – Video grid thw, defaults to None.

  • action_log_probs (torch.Tensor) – Log probabilities of actions from the current policy, defaults to None.

  • base_action_log_probs (torch.Tensor) – Log probabilities from the reference policy, defaults to None.

  • values (torch.Tensor) – Value estimates from the critic, defaults to None.

  • returns (Optional[torch.Tensor]) – Discounted returns for each action, defaults to None.

  • advantages (Optional[torch.Tensor]) – Advantage estimates for each action, defaults to None.

  • attention_mask (Optional[torch.LongTensor]) – Mask indicating valid tokens in sequences, defaults to None.

  • action_mask (Optional[torch.BoolTensor]) – Mask indicating action (response) tokens, defaults to None.

  • info (Optional[dict]) – Dictionary containing additional information, defaults to None.

  • kl (Optional[torch.Tensor]) – KL divergence between current and reference policy, defaults to None.

  • action_entropy (Optional[torch.Tensor]) – Entropy values for each action token, used for high-entropy token filtering. When provided, enables training only on high-entropy tokens (forking tokens that determine reasoning directions), improving training efficiency. Shape: (B, A). See: https://arxiv.org/abs/2506.01939
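
The filtering this enables can be sketched as follows. This is a minimal illustration, not the module's implementation: high_entropy_mask is a hypothetical helper, and the 20% keep ratio follows the cited paper.

    import torch

    def high_entropy_mask(action_entropy: torch.Tensor,
                          action_mask: torch.Tensor,
                          keep_ratio: float = 0.2) -> torch.Tensor:
        """Sketch (assumption, not library code): keep only the top
        `keep_ratio` fraction of highest-entropy response tokens."""
        # Push padded positions to -inf so they never pass the threshold.
        masked = action_entropy.masked_fill(~action_mask, float("-inf"))
        # One shared k for the batch, derived from the longest response;
        # per-sample quantiles would be a straightforward refinement.
        k = max(1, int(keep_ratio * int(action_mask.sum(dim=-1).max())))
        threshold = masked.topk(k, dim=-1).values[:, -1:]  # (B, 1)
        return (masked >= threshold) & action_mask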

action_entropy: torch.Tensor | None = None
action_log_probs: torch.Tensor = None
action_mask: torch.BoolTensor | None = None
advantages: torch.Tensor | None = None
attention_mask: torch.LongTensor | None = None
base_action_log_probs: torch.Tensor = None
image_grid_thws: torch.Tensor | None = None
info: dict | None = None
kl: torch.Tensor | None = None
pin_memory()[source]

Pin all tensors in memory for faster GPU transfer.

Returns:

Self with pinned tensors.

Return type:

ExperienceVL

pixel_values: torch.Tensor | None = None
pixel_values_videos: torch.Tensor | None = None
raw_images: List[Image] | None = None
returns: torch.Tensor | None = None
sequences: torch.Tensor
to_device(device: torch.device)

Move all tensors in the experience to the specified device.

Parameters:

device (torch.device) – Target device.

Returns:

Self with tensors moved to device.

Return type:

ExperienceVL

values: torch.Tensor = None
video_grid_thws: torch.Tensor | None = None
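
A typical host-to-GPU flow with the pin_memory and to_device helpers might look like this sketch; exp stands for any populated ExperienceVL instance:

    import torch

    # Pin host tensors, then move the whole experience to the GPU.
    exp = exp.pin_memory()
    exp = exp.to_device(torch.device("cuda"))
    policy_inputs = (exp.sequences, exp.attention_mask, exp.action_mask)
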
class lightrft.trainer.experience_maker_vl.NaiveExperienceMakerVL(actor: ActorVL, critic: torch.nn.Module, reward_model: torch.nn.Module, initial_model: ActorVL, tokenizer, processor, prompt_max_len: int, kl_controller, strategy=None, remote_rm_url: list[str] = None, reward_fn=None)[source]

Bases: ABC

A naive experience maker for Vision-Language reinforcement learning.

This class is responsible for generating experiences (sequences of prompts, actions, rewards, etc.) which are then used to train the actor and critic models for Vision-Language tasks.

Parameters:
  • actor (ActorVL) – The Vision-Language policy model to be trained.

  • critic (nn.Module) – The value model to be trained.

  • reward_model (nn.Module) – The reward model used to score generated responses.

  • initial_model (ActorVL) – The reference model for KL divergence calculation.

  • tokenizer (Tokenizer) – The tokenizer for encoding and decoding text.

  • processor (Processor) – The processor for handling multi-modal inputs.

  • prompt_max_len (int) – The maximum length of input prompts after tokenization.

  • kl_controller (KLController) – The controller for managing the KL penalty coefficient.

  • strategy (Strategy, optional) – The training strategy containing configurations, defaults to None.

  • remote_rm_url (list[str], optional) – A list of URLs for remote reward models, defaults to None.

  • reward_fn (Callable, optional) – A custom reward function, defaults to None.

generate_samples(all_prompts: List[str], all_images, all_references, all_labels, **generate_kwargs) List[SamplesVL]

Generate samples and return in batches.

Parameters:
  • all_prompts (List[str]) – List of prompt strings.

  • all_images (List) – List of images corresponding to prompts.

  • all_references (List[str]) – List of reference texts.

  • all_labels (List[str]) – List of ground truth labels.

  • generate_kwargs (dict) – Additional generation parameters.

Returns:

List of SamplesVL objects.

Return type:

List[SamplesVL]

get_advantages_and_returns(values: torch.Tensor, rewards: torch.Tensor, action_mask: torch.Tensor, gamma: float, lambd: float) Tuple[torch.Tensor, torch.Tensor]

Compute advantages and returns from rewards and values using GAE.

Calculated as in the original PPO paper (https://arxiv.org/abs/1707.06347). Note that rewards may include a KL divergence penalty term.

Advantages formula:
Adv1 = R1 + γ * λ * R2 + γ^2 * λ^2 * R3 + …
     - V1 + γ * (1 - λ) * V2 + γ^2 * λ * (1 - λ) * V3 + …

Returns formula:
Ret1 = R1 + γ * λ * R2 + γ^2 * λ^2 * R3 + …
          + γ * (1 - λ) * V2 + γ^2 * λ * (1 - λ) * V3 + …

Parameters:
  • values (torch.Tensor) – Tensor of shape (batch_size, response_size).

  • rewards (torch.Tensor) – Tensor of shape (batch_size, response_size).

  • action_mask (torch.Tensor) – Tensor of shape (batch_size, response_size).

  • gamma (float) – Discount factor.

  • lambd (float) – GAE lambda parameter.

Returns:

Tuple of (advantages, returns), both of shape (batch_size, response_size).

Return type:

Tuple[torch.Tensor, torch.Tensor]
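
For reference, the masked GAE recursion documented above can be written as the following minimal sketch; it restates the textbook formula and is not the verbatim library code:

    import torch

    def gae_sketch(values, rewards, action_mask, gamma=1.0, lambd=0.95):
        """Right-to-left GAE over response tokens, masked to valid positions."""
        mask = action_mask.float()
        values = values * mask
        rewards = rewards * mask
        last_gae = 0.0
        advantages_reversed = []
        T = rewards.size(1)
        for t in reversed(range(T)):
            next_values = values[:, t + 1] if t < T - 1 else 0.0
            # delta_t = r_t + gamma * V_{t+1} - V_t
            delta = rewards[:, t] + gamma * next_values - values[:, t]
            # A_t = delta_t + gamma * lambda * A_{t+1}
            last_gae = delta + gamma * lambd * last_gae
            advantages_reversed.append(last_gae)
        advantages = torch.stack(advantages_reversed[::-1], dim=1)
        returns = advantages + values
        return advantages.detach(), returns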

get_cumulative_returns(rewards: torch.Tensor, action_mask: torch.Tensor, gamma: float) torch.Tensor

Compute cumulative returns from rewards using REINFORCE.

REINFORCE uses cumulative returns without GAE (Generalized Advantage Estimation).

Parameters:
  • rewards (torch.Tensor) – Tensor of shape (batch_size, response_size).

  • action_mask (torch.Tensor) – Binary mask tensor of shape (batch_size, response_size).

  • gamma (float) – Discount factor.

Returns:

Cumulative returns tensor of shape (batch_size, response_size).

Return type:

torch.Tensor
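
The accumulation described here is the standard discounted sum computed right to left; a minimal sketch:

    import torch

    def cumulative_returns_sketch(rewards, action_mask, gamma=1.0):
        """G_t = r_t + gamma * G_{t+1}, masked to valid response tokens."""
        mask = action_mask.float()
        rewards = rewards * mask
        returns = torch.zeros_like(rewards)
        running = torch.zeros(rewards.size(0), device=rewards.device)
        for t in reversed(range(rewards.size(1))):
            running = rewards[:, t] + gamma * running
            returns[:, t] = running
        return returns * mask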

make_experience(samples: SamplesVL) ExperienceVL

Turn samples into experience by calculating log probs, values, rewards, and KL divergence.

Parameters:

samples (SamplesVL) – Samples object containing sequences and metadata.

Returns:

ExperienceVL object with all computed values.

Return type:

ExperienceVL

make_experience_list(all_prompts: str | List[str], all_images, all_references, all_labels, **generate_kwargs) List[ExperienceVL]

Make a list of experiences, batched by micro_rollout_batch_size.

This method first generates the response sequences and rewards for the given prompts. If the rewards require processing or filtering, the rollout is processed as a whole. Advantages and returns are then calculated for each experience.

Parameters:
  • all_prompts (Union[str, List[str]]) – Prompts to generate responses for.

  • all_images (List) – Images corresponding to prompts.

  • all_references (List[str]) – Reference texts for evaluation.

  • all_labels (List[str]) – Ground truth labels.

  • generate_kwargs (dict) – Additional generation parameters (gamma, lambd, etc.).

Returns:

List of ExperienceVL objects.

Return type:

List[ExperienceVL]
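
A hedged usage sketch follows; maker stands for a configured experience maker, and the prompt, image, and generation kwargs are illustrative values rather than required ones:

    import torch
    from PIL import Image

    image = Image.new("RGB", (224, 224))  # placeholder image
    experiences = maker.make_experience_list(
        all_prompts=["Describe the image."],
        all_images=[image],
        all_references=["A cat sitting on a mat."],
        all_labels=["cat"],
        max_new_tokens=512,
        temperature=1.0,
        gamma=1.0,
        lambd=0.95,
    )
    for exp in experiences:
        exp.to_device(torch.device("cuda"))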

process_experiences(experiences: List[ExperienceVL]) Tuple[List[ExperienceVL], List[torch.Tensor]]

Process experiences for reward shaping and filtering.

This hook can be used to filter out experiences or to apply additional processing to the rewards.

Parameters:

experiences (List[ExperienceVL]) – List of ExperienceVL objects.

Returns:

Tuple of (processed experiences, processed rewards).

Return type:

Tuple[List[ExperienceVL], List[torch.Tensor]]

processor_fn(texts, images, max_length, padding=True, device=None)[source]

Process multi-modal inputs (text and images).

Parameters:
  • texts (List[str]) – List of text strings to process.

  • images (List[Image.Image]) – List of images to process.

  • max_length (int) – Maximum sequence length.

  • padding (bool) – Whether to apply padding, defaults to True.

  • device (torch.device or str, optional) – Target device for tensors, defaults to None.

Returns:

Processed batch (as dict if padding=True, otherwise as list).

Return type:

dict or list
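
Internally this is expected to wrap a Hugging Face multi-modal processor; a rough stand-alone equivalent, assuming a Qwen2-VL-style AutoProcessor (the model name and vision placeholder tokens are assumptions):

    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
    texts = ["<|vision_start|><|image_pad|><|vision_end|>Describe the image."]
    images = [Image.new("RGB", (224, 224))]
    batch = processor(
        text=texts,
        images=images,
        padding=True,
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )
    # Typical keys: input_ids, attention_mask, pixel_values, image_grid_thw.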

tokenize_fn(texts, max_length, padding=True, device=None)[source]

Tokenize input texts.

Parameters:
  • texts (List[str]) – List of text strings to tokenize.

  • max_length (int) – Maximum sequence length.

  • padding (bool) – Whether to apply padding, defaults to True.

  • device (torch.device or str, optional) – Target device for tensors, defaults to None.

Returns:

Tokenized batch (as dict if padding=True, otherwise as list).

Return type:

dict or list
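
A rough text-only equivalent using a Hugging Face tokenizer (a sketch under assumed defaults, not the exact implementation):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
    batch = tokenizer(
        ["What is shown in the picture?"],
        max_length=1024,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    # Optional device move, mirroring the `device` parameter above:
    batch = {k: v.to("cpu") for k, v in batch.items()}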

class lightrft.trainer.experience_maker_vl.SamplesVL(sequences: torch.Tensor, attention_mask: torch.LongTensor | None = None, action_mask: torch.BoolTensor | None = None, pixel_values: torch.Tensor | None = None, image_grid_thws: torch.Tensor | None = None, raw_images: List[Image] | None = None, image_num: List[int] | None = None, pixel_values_videos: torch.Tensor | None = None, video_grid_thws: torch.Tensor | None = None, video_num: List[int] | None = None, num_actions: int | torch.Tensor = None, packed_seq_lens: torch.Tensor | None = None, response_length: torch.Tensor = None, total_length: torch.Tensor = None, references: List[str] | None = None, labels: List[str] | None = None, prompts: list[str] = None, output_texts: list[str] = None)[source]

Bases: object

Samples is a batch of data for Vision-Language models.

Samples can be stored in two formats: batched or packed. The batched format pads sequences to a common length, while the packed format concatenates prompts and responses without padding.

Tensor shapes (batched / packed):
  • sequences: (B, S) or (1, total_length) - tokens of both prompt and response

  • attention_mask: (B, S) or (1, total_length) - attention mask for sequences

  • action_mask: (B, A) or None - response mask showing which part is the response

  • pixel_values: Optional[torch.Tensor] - image pixels processed by HF processor

  • image_grid_thws: Optional[torch.Tensor] - image grid thw

  • raw_images: Optional[List[Image.Image]] - raw image data list

  • pixel_values_videos: Optional[torch.Tensor] - video pixels processed by HF processor

  • video_grid_thws: Optional[torch.Tensor] - video grid thw

  • num_actions: int or (B,) - number of actions (tokens) in the response

  • packed_seq_lens: None or (B,) - length of each sample in packed format

  • response_length: (B,) - number of tokens in the response

  • total_length: (B,) - total number of tokens in sequences

  • prompts: list[str] - the prompts used to generate responses

  • references: Optional[List[str]] - reference texts

  • labels: Optional[List[str]] - ground truth labels

  • output_texts: list[str] - generated output texts

  • image_num: Optional[List[int]] - number of images per sample

  • video_num: Optional[List[int]] - number of videos per sample

Parameters:
  • sequences (torch.Tensor) – Token sequences including both prompt and response.

  • attention_mask (Optional[torch.LongTensor]) – Attention mask for sequences, defaults to None.

  • action_mask (Optional[torch.BoolTensor]) – Mask indicating action (response) tokens, defaults to None.

  • pixel_values (Optional[torch.Tensor]) – Image pixels processed by HF processor, defaults to None.

  • image_grid_thws (Optional[torch.Tensor]) – Image grid thw, defaults to None.

  • raw_images (Optional[List[Image.Image]]) – Raw image data list, defaults to None.

  • num_actions (Union[int, torch.Tensor]) – Number of actions per sample, defaults to None.

  • packed_seq_lens (Optional[torch.Tensor]) – Sequence lengths for packed format, defaults to None.

  • response_length (torch.Tensor) – Length of each response, defaults to None.

  • total_length (torch.Tensor) – Total length of each sequence, defaults to None.

  • references (Optional[List[str]]) – Reference texts, defaults to None.

  • labels (Optional[List[str]]) – Ground truth labels, defaults to None.

  • prompts (list[str]) – List of prompt strings, defaults to None.

  • output_texts (list[str]) – Generated output texts, defaults to None.

  • image_num (Optional[List[int]]) – Number of images per sample, defaults to None.

  • video_num (Optional[List[int]]) – Number of videos per sample, defaults to None.

action_mask: torch.BoolTensor | None = None
attention_mask: torch.LongTensor | None = None
image_grid_thws: torch.Tensor | None = None
image_num: List[int] | None = None
labels: List[str] | None = None
num_actions: int | torch.Tensor = None
output_texts: list[str] = None
packed_seq_lens: torch.Tensor | None = None
pixel_values: torch.Tensor | None = None
pixel_values_videos: torch.Tensor | None = None
prompts: list[str] = None
raw_images: List[Image] | None = None
references: List[str] | None = None
response_length: torch.Tensor = None
sequences: torch.Tensor
total_length: torch.Tensor = None
video_grid_thws: torch.Tensor | None = None
video_num: List[int] | None = None
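
To make the batched/packed distinction concrete, here is a hypothetical packed-format example; the token values are illustrative only. Two samples of lengths 5 and 7 are concatenated into a single row, with packed_seq_lens recording the boundaries:

    import torch

    sequences = torch.tensor([[11, 12, 13, 14, 15, 21, 22, 23, 24, 25, 26, 27]])
    attention_mask = torch.ones_like(sequences)  # (1, total_length); no padding
    packed_seq_lens = torch.tensor([5, 7])       # per-sample lengths
    num_actions = torch.tensor([2, 3])           # response tokens per sample
    response_length = torch.tensor([2, 3])
    total_length = torch.tensor([5, 7])
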
lightrft.trainer.experience_maker_vl.cumulative_product(data: List[int] | int | ndarray | torch.Tensor) int[source]

Compute the total product of the elements of a one-dimensional list, NumPy array, or torch tensor; a single integer is returned unchanged.

Parameters:

data (Union[List[int], int, np.ndarray, torch.Tensor]) – Input can be an integer, a list of integers, or a tensor (NumPy/torch).

Returns:

The product of all input elements.

Return type:

int
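
Given the documented signature (an int for any accepted input), a plausible sketch of the behavior is the following; this is an assumption, not the library source:

    import numpy as np
    import torch

    def cumulative_product_sketch(data):
        """Reduce the input to the product of its elements; a bare int
        passes through unchanged (assumed behavior)."""
        if isinstance(data, int):
            return data
        if isinstance(data, torch.Tensor):
            return int(torch.prod(data).item())
        return int(np.prod(np.asarray(data)))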

lightrft.trainer.experience_maker_vl.pin_memory(tensor: torch.Tensor | list[torch.Tensor])[source]

Pin tensor(s) in memory for faster GPU transfer.

Parameters:

tensor (Union[torch.Tensor, list[torch.Tensor]]) – Tensor or list of tensors to pin.

Returns:

Pinned tensor(s).

Return type:

Union[torch.Tensor, list[torch.Tensor]]

lightrft.trainer.experience_maker_vl.to(tensor: torch.Tensor | list[torch.Tensor], device)[source]

Move tensor(s) to the specified device.

Parameters:
  • tensor (Union[torch.Tensor, list[torch.Tensor]]) – Tensor or list of tensors to move.

  • device (torch.device or str) – Target device.

Returns:

Tensor(s) on the target device.

Return type:

Union[torch.Tensor, list[torch.Tensor]]
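
Both module-level helpers are, by their documented contracts, thin wrappers over the corresponding tensor methods; a minimal sketch of that behavior:

    import torch

    def pin_memory_sketch(tensor):
        """Pin a tensor, or each tensor in a list, for faster GPU transfer."""
        if isinstance(tensor, list):
            return [t.pin_memory() for t in tensor]
        return tensor.pin_memory()

    def to_sketch(tensor, device):
        """Move a tensor, or each tensor in a list, to the target device."""
        if isinstance(tensor, list):
            return [t.to(device) for t in tensor]
        return tensor.to(device)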