lightrft.trainer.experience_maker_vl¶
- class lightrft.trainer.experience_maker_vl.ExperienceVL(sequences: torch.Tensor, pixel_values: torch.Tensor | None = None, image_grid_thws: torch.Tensor | None = None, raw_images: List[Image] | None = None, pixel_values_videos: torch.Tensor | None = None, video_grid_thws: torch.Tensor | None = None, action_log_probs: torch.Tensor = None, base_action_log_probs: torch.Tensor = None, values: torch.Tensor = None, returns: torch.Tensor | None = None, advantages: torch.Tensor | None = None, attention_mask: torch.LongTensor | None = None, action_mask: torch.BoolTensor | None = None, info: dict | None = None, kl: torch.Tensor | None = None, action_entropy: torch.Tensor | None = None)[source]¶
Bases: object

Experience is a batch of data for Vision-Language models.
All samples in a batch share the same sequence length and number of actions; sequences are left-padded.
- Tensor shapes:
sequences: (B, S) where B is batch size, S is sequence length
pixel_values: (B * h, w) - image pixels processed by HF processor
image_grid_thws: (B, 3) - image grid thw
raw_images: Optional[List[Image.Image]] - raw images before processing
pixel_values_videos: (B * f, c * h * w) - video pixels processed by HF processor
video_grid_thws: (B, 3) - video grid thw
action_log_probs: (B, A) where A is number of actions
base_action_log_probs: (B, A)
values: (B, A)
returns: (B, A)
advantages: (B, A)
attention_mask: (B, S)
action_mask: (B, A)
kl: (B, A)
action_entropy: (B, A) - Entropy values for high-entropy token filtering
- Parameters:
sequences (torch.Tensor) – Token sequences including both prompt and response.
pixel_values (Optional[torch.Tensor]) – Image pixel values processed by HF processor, defaults to None.
image_grid_thws (Optional[torch.Tensor]) – Image grid thw, defaults to None.
raw_images (Optional[List[Image.Image]]) – Raw image data list, defaults to None.
pixel_values_videos (Optional[torch.Tensor]) – Video pixel values processed by HF processor, defaults to None.
video_grid_thws (Optional[torch.Tensor]) – Video grid thw, defaults to None.
action_log_probs (torch.Tensor) – Log probabilities of actions from the current policy, defaults to None.
base_action_log_probs (torch.Tensor) – Log probabilities from the reference policy, defaults to None.
values (torch.Tensor) – Value estimates from the critic, defaults to None.
returns (Optional[torch.Tensor]) – Discounted returns for each action, defaults to None.
advantages (Optional[torch.Tensor]) – Advantage estimates for each action, defaults to None.
attention_mask (Optional[torch.LongTensor]) – Mask indicating valid tokens in sequences, defaults to None.
action_mask (Optional[torch.BoolTensor]) – Mask indicating action (response) tokens, defaults to None.
info (Optional[dict]) – Dictionary containing additional information, defaults to None.
kl (Optional[torch.Tensor]) – KL divergence between current and reference policy, defaults to None.
action_entropy (Optional[torch.Tensor]) – Entropy values for each action token, used for high-entropy token filtering. When provided, enables training only on high-entropy tokens (forking tokens that determine reasoning directions), improving training efficiency. Shape: (B, A). See: https://arxiv.org/abs/2506.01939
- action_entropy: torch.Tensor | None = None¶
- action_log_probs: torch.Tensor = None¶
- action_mask: torch.BoolTensor | None = None¶
- advantages: torch.Tensor | None = None¶
- attention_mask: torch.LongTensor | None = None¶
- base_action_log_probs: torch.Tensor = None¶
- image_grid_thws: torch.Tensor | None = None¶
- info: dict | None = None¶
- kl: torch.Tensor | None = None¶
- pin_memory()[source]¶
Pin all tensors in memory for faster GPU transfer.
- Returns:
Self with pinned tensors.
- Return type:
ExperienceVL
- pixel_values: torch.Tensor | None = None¶
- pixel_values_videos: torch.Tensor | None = None¶
- raw_images: List[Image] | None = None¶
- returns: torch.Tensor | None = None¶
- sequences: torch.Tensor¶
- to_device(device: torch.device)¶
Move all tensors in the experience to the specified device.
- Parameters:
device (torch.device) – Target device.
- Returns:
Self with tensors moved to device.
- Return type:
ExperienceVL
- values: torch.Tensor = None¶
- video_grid_thws: torch.Tensor | None = None¶
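The pattern behind pin_memory() and to_device() — iterate the dataclass fields, transform each tensor, skip unset ones, and return self — can be illustrated with a minimal stand-in. MiniExperience below is hypothetical (not part of lightrft) and mirrors only a few of ExperienceVL's fields:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class MiniExperience:
    """Hypothetical stand-in mirroring a few ExperienceVL fields."""

    sequences: torch.Tensor
    action_log_probs: Optional[torch.Tensor] = None
    attention_mask: Optional[torch.Tensor] = None

    def to_device(self, device: torch.device) -> "MiniExperience":
        # Move every tensor field, skipping unset (None) fields.
        for name in ("sequences", "action_log_probs", "attention_mask"):
            value = getattr(self, name)
            if value is not None:
                setattr(self, name, value.to(device))
        return self


exp = MiniExperience(sequences=torch.zeros(2, 8, dtype=torch.long))
exp = exp.to_device(torch.device("cpu"))
```

Returning self is what lets callers chain the call, e.g. `experience.to_device(device).sequences`.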
- class lightrft.trainer.experience_maker_vl.NaiveExperienceMakerVL(actor: ActorVL, critic: torch.nn.Module, reward_model: torch.nn.Module, initial_model: ActorVL, tokenizer, processor, prompt_max_len: int, kl_controller, strategy=None, remote_rm_url: list[str] = None, reward_fn=None)[source]¶
Bases: ABC

A naive experience maker for Vision-Language reinforcement learning.
This class is responsible for generating experiences (sequences of prompts, actions, rewards, etc.) which are then used to train the actor and critic models for Vision-Language tasks.
- Parameters:
actor (ActorVL) – The Vision-Language policy model to be trained.
critic (nn.Module) – The value model to be trained.
reward_model (nn.Module) – The reward model used to score generated responses.
initial_model (ActorVL) – The reference model for KL divergence calculation.
tokenizer (Tokenizer) – The tokenizer for encoding and decoding text.
processor (Processor) – The processor for handling multi-modal inputs.
prompt_max_len (int) – The maximum length of input prompts after tokenization.
kl_controller (KLController) – The controller for managing the KL penalty coefficient.
strategy (Strategy, optional) – The training strategy containing configurations, defaults to None.
remote_rm_url (list[str], optional) – A list of URLs for remote reward models, defaults to None.
reward_fn (Callable, optional) – A custom reward function, defaults to None.
- generate_samples(all_prompts: List[str], all_images, all_references, all_labels, **generate_kwargs) List[SamplesVL]¶
Generate samples and return in batches.
- Parameters:
all_prompts (List[str]) – List of prompt strings.
all_images (List) – List of images corresponding to prompts.
all_references (List[str]) – List of reference texts.
all_labels (List[str]) – List of ground truth labels.
generate_kwargs (dict) – Additional generation parameters.
- Returns:
List of SamplesVL objects.
- Return type:
List[SamplesVL]
- get_advantages_and_returns(values: torch.Tensor, rewards: torch.Tensor, action_mask: torch.Tensor, gamma: float, lambd: float) Tuple[torch.Tensor, torch.Tensor]¶
Compute advantages and returns from rewards and values using GAE.
Calculated as in the original PPO paper (https://arxiv.org/abs/1707.06347). Note that rewards may include a KL divergence loss term.
- Advantages formula:
Adv1 = R1 + γ * λ * R2 + γ^2 * λ^2 * R3 + …
       - V1 + γ * (1 - λ) * V2 + γ^2 * λ * (1 - λ) * V3 + …
- Returns formula:
Ret1 = R1 + γ * λ * R2 + γ^2 * λ^2 * R3 + …
       + γ * (1 - λ) * V2 + γ^2 * λ * (1 - λ) * V3 + …
- Parameters:
values (torch.Tensor) – Tensor of shape (batch_size, response_size).
rewards (torch.Tensor) – Tensor of shape (batch_size, response_size).
action_mask (torch.Tensor) – Tensor of shape (batch_size, response_size).
gamma (float) – Discount factor.
lambd (float) – GAE lambda parameter.
- Returns:
Tuple of (advantages, returns), both of shape (batch_size, response_size).
- Return type:
Tuple[torch.Tensor, torch.Tensor]
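The series above come from the standard backward GAE recursion. A minimal sketch consistent with those formulas (not necessarily the library's exact implementation):

```python
import torch


def get_advantages_and_returns(values, rewards, action_mask, gamma, lambd):
    if action_mask is not None:
        values = values * action_mask
        rewards = rewards * action_mask
    lastgaelam = 0.0
    advantages_reversed = []
    response_length = rewards.size(1)
    # Backward recursion: delta_t = r_t + gamma * V_{t+1} - V_t,
    # A_t = delta_t + gamma * lambd * A_{t+1}.
    for t in reversed(range(response_length)):
        nextvalues = values[:, t + 1] if t < response_length - 1 else 0.0
        delta = rewards[:, t] + gamma * nextvalues - values[:, t]
        lastgaelam = delta + gamma * lambd * lastgaelam
        advantages_reversed.append(lastgaelam)
    advantages = torch.stack(advantages_reversed[::-1], dim=1)
    returns = advantages + values
    return advantages.detach(), returns
```

With γ = λ = 1 this degenerates to Monte-Carlo returns minus the value baseline, which is a quick sanity check for the recursion.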
- get_cumulative_returns(rewards: torch.Tensor, action_mask: torch.Tensor, gamma: float) Tuple[torch.Tensor, torch.Tensor]¶
Compute cumulative returns from rewards using REINFORCE.
REINFORCE uses cumulative returns without GAE (Generalized Advantage Estimation).
- Parameters:
rewards (torch.Tensor) – Tensor of shape (batch_size, response_size).
action_mask (torch.Tensor) – Binary mask tensor of shape (batch_size, response_size).
gamma (float) – Discount factor.
- Returns:
Returns tensor of shape (batch_size, response_size).
- Return type:
torch.Tensor
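The REINFORCE-style return is the usual discounted backward recursion G_t = r_t + γ * G_{t+1}. A minimal sketch under that assumption (the library's exact code may differ):

```python
import torch


def get_cumulative_returns(rewards, action_mask, gamma):
    if action_mask is not None:
        rewards = rewards * action_mask
    response_length = rewards.size(1)
    returns = torch.zeros_like(rewards)
    cumulative = torch.zeros(rewards.size(0))
    # Backward recursion: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(response_length)):
        cumulative = rewards[:, t] + gamma * cumulative
        returns[:, t] = cumulative
    return returns
```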
- make_experience(samples: SamplesVL) ExperienceVL¶
Turn samples into experience by calculating log probs, values, rewards, and KL divergence.
- Parameters:
samples (SamplesVL) – Samples object containing sequences and metadata.
- Returns:
ExperienceVL object with all computed values.
- Return type:
ExperienceVL
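The per-token KL stored on the experience is commonly the simple log-ratio (k1) estimator between policy and reference log probs. A hedged sketch of that estimator (lightrft may use a different one):

```python
import torch


def compute_approx_kl(action_log_probs, base_action_log_probs, action_mask=None):
    # k1 estimator: log p_policy(a) - log p_ref(a), masked to response tokens.
    log_ratio = action_log_probs - base_action_log_probs
    if action_mask is not None:
        log_ratio = log_ratio * action_mask
    return log_ratio
```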
- make_experience_list(all_prompts: str | List[str], all_images, all_references, all_labels, **generate_kwargs) List[ExperienceVL]¶
Make a list of experiences, batched by micro_rollout_batch_size.
This method first generates response sequences and rewards for the given prompts. If the rewards need processing or filtering, the rollout is then processed as a whole. Finally, advantages and returns are computed for each experience.
- Parameters:
all_prompts (Union[str, List[str]]) – Prompts to generate responses for.
all_images (List) – Images corresponding to prompts.
all_references (List[str]) – Reference texts for evaluation.
all_labels (List[str]) – Ground truth labels.
generate_kwargs (dict) – Additional generation parameters (gamma, lambd, etc.).
- Returns:
List of ExperienceVL objects.
- Return type:
List[ExperienceVL]
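The micro-batching step described above amounts to slicing the prompt list into chunks of micro_rollout_batch_size and building one experience per chunk. A tiny illustrative sketch (function name is hypothetical):

```python
def micro_batches(items, micro_rollout_batch_size):
    """Yield consecutive chunks of at most micro_rollout_batch_size items."""
    for i in range(0, len(items), micro_rollout_batch_size):
        yield items[i : i + micro_rollout_batch_size]


chunks = list(micro_batches(list(range(5)), 2))
```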
- process_experiences(experiences: List[ExperienceVL]) Tuple[List[ExperienceVL], List[torch.Tensor]]¶
Process experiences for reward shaping and filtering.
This can be used to filter out some experiences or do some processing on the rewards.
- Parameters:
experiences (List[ExperienceVL]) – List of ExperienceVL objects.
- Returns:
Tuple of (processed experiences, processed rewards).
- Return type:
Tuple[List[ExperienceVL], List[torch.Tensor]]
- processor_fn(texts, images, max_length, padding=True, device=None)[source]¶
Process multi-modal inputs (text and images).
- Parameters:
texts (List[str]) – List of text strings to process.
images (List[Image.Image]) – List of images to process.
max_length (int) – Maximum sequence length.
padding (bool) – Whether to apply padding, defaults to True.
device (torch.device or str, optional) – Target device for tensors, defaults to None.
- Returns:
Processed batch (as dict if padding=True, otherwise as list).
- Return type:
dict or list
- tokenize_fn(texts, max_length, padding=True, device=None)[source]¶
Tokenize input texts.
- Parameters:
texts (List[str]) – List of text strings to tokenize.
max_length (int) – Maximum sequence length.
padding (bool) – Whether to apply padding, defaults to True.
device (torch.device or str, optional) – Target device for tensors, defaults to None.
- Returns:
Tokenized batch (as dict if padding=True, otherwise as list).
- Return type:
dict or list
- class lightrft.trainer.experience_maker_vl.SamplesVL(sequences: torch.Tensor, attention_mask: torch.LongTensor | None = None, action_mask: torch.BoolTensor | None = None, pixel_values: torch.Tensor | None = None, image_grid_thws: torch.Tensor | None = None, raw_images: List[Image] | None = None, image_num: List[int] | None = None, pixel_values_videos: torch.Tensor | None = None, video_grid_thws: torch.Tensor | None = None, video_num: List[int] | None = None, num_actions: int | torch.Tensor = None, packed_seq_lens: torch.Tensor | None = None, response_length: torch.Tensor = None, total_length: torch.Tensor = None, references: List[str] | None = None, labels: List[str] | None = None, prompts: list[str] = None, output_texts: list[str] = None)[source]¶
Bases: object

Samples is a batch of data for Vision-Language models.
There can be 2 formats to store the samples, batched or packed. The batched format means padding is applied to the sequences, while the packed format will concatenate the prompt and response without padding.
- Tensor shapes (batched / packed):
sequences: (B, S) or (1, total_length) - tokens of both prompt and response
attention_mask: (B, S) or (1, total_length) - attention mask for sequences
action_mask: (B, A) or None - response mask showing which part is the response
pixel_values: Optional[torch.Tensor] - image pixels processed by HF processor
image_grid_thws: Optional[torch.Tensor] - image grid thw
raw_images: Optional[List[Image.Image]] - raw image data list
pixel_values_videos: Optional[torch.Tensor] - video pixels processed by HF processor
video_grid_thws: Optional[torch.Tensor] - video grid thw
num_actions: int or (B,) - number of actions (tokens) in the response
packed_seq_lens: None or (B,) - length of each sample in packed format
response_length: (B,) - number of tokens in the response
total_length: (B,) - total number of tokens in sequences
prompts: list[str] - the prompts used to generate responses
references: Optional[List[str]] - reference texts
labels: Optional[List[str]] - ground truth labels
output_texts: list[str] - generated output texts
image_num: Optional[List[int]] - image numbers
video_num: Optional[List[int]] - video numbers
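The two storage formats can be shown with toy tensors (token values are arbitrary; only the layout matters). Batched rows are padded to a common length, while the packed format concatenates all samples into one row and records per-sample lengths:

```python
import torch

PAD = 0

# Batched: (B, S) with left padding to the common length S = 5.
batched = torch.tensor([
    [PAD, PAD, 11, 12, 13],
    [21, 22, 23, 24, 25],
])

# Packed: (1, total_length) with no padding; packed_seq_lens gives
# the length of each sample inside the single row.
packed = torch.tensor([[11, 12, 13, 21, 22, 23, 24, 25]])
packed_seq_lens = torch.tensor([3, 5])
```

Packing avoids wasting compute on pad tokens, at the cost of needing packed_seq_lens to recover sample boundaries.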
- Parameters:
sequences (torch.Tensor) – Token sequences including both prompt and response.
attention_mask (Optional[torch.LongTensor]) – Attention mask for sequences, defaults to None.
action_mask (Optional[torch.BoolTensor]) – Mask indicating action (response) tokens, defaults to None.
pixel_values (Optional[torch.Tensor]) – Image pixels processed by HF processor, defaults to None.
image_grid_thws (Optional[torch.Tensor]) – Image grid thw, defaults to None.
raw_images (Optional[List[Image.Image]]) – Raw image data list, defaults to None.
num_actions (Union[int, torch.Tensor]) – Number of actions per sample, defaults to None.
packed_seq_lens (Optional[torch.Tensor]) – Sequence lengths for packed format, defaults to None.
response_length (torch.Tensor) – Length of each response, defaults to None.
total_length (torch.Tensor) – Total length of each sequence, defaults to None.
references (Optional[List[str]]) – Reference texts, defaults to None.
labels (Optional[List[str]]) – Ground truth labels, defaults to None.
prompts (list[str]) – List of prompt strings, defaults to None.
output_texts (list[str]) – Generated output texts, defaults to None.
image_num (Optional[List[int]]) – Image numbers, defaults to None.
video_num (Optional[List[int]]) – Video numbers, defaults to None.
- action_mask: torch.BoolTensor | None = None¶
- attention_mask: torch.LongTensor | None = None¶
- image_grid_thws: torch.Tensor | None = None¶
- image_num: List[int] | None = None¶
- labels: List[str] | None = None¶
- num_actions: int | torch.Tensor = None¶
- output_texts: list[str] = None¶
- packed_seq_lens: torch.Tensor | None = None¶
- pixel_values: torch.Tensor | None = None¶
- pixel_values_videos: torch.Tensor | None = None¶
- prompts: list[str] = None¶
- raw_images: List[Image] | None = None¶
- references: List[str] | None = None¶
- response_length: torch.Tensor = None¶
- sequences: torch.Tensor¶
- total_length: torch.Tensor = None¶
- video_grid_thws: torch.Tensor | None = None¶
- video_num: List[int] | None = None¶
- lightrft.trainer.experience_maker_vl.cumulative_product(data: List[int] | int | ndarray | torch.Tensor) int[source]¶
Compute the cumulative product of a one-dimensional list, a tensor, or a single integer.
- Parameters:
data (Union[List[int], int, np.ndarray, torch.Tensor]) – Input can be an integer, a list of integers, or a tensor (NumPy/torch).
- Returns:
The product of all input elements, reduced to a single integer (the input itself if it is a single integer).
- Return type:
int
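A sketch of the contract this helper is documented with, assuming it reduces the input to a single integer product (the real implementation may differ):

```python
import math

import numpy as np
import torch


def cumulative_product(data):
    """Sketch: collapse list/tensor/int input to one integer product."""
    if isinstance(data, int):
        return data
    if isinstance(data, (np.ndarray, torch.Tensor)):
        data = data.tolist()
    return math.prod(int(x) for x in data)
```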
- lightrft.trainer.experience_maker_vl.pin_memory(tensor: torch.Tensor | list[torch.Tensor])[source]¶
Pin tensor(s) in memory for faster GPU transfer.
- Parameters:
tensor (Union[torch.Tensor, list[torch.Tensor]]) – Tensor or list of tensors to pin.
- Returns:
Pinned tensor(s).
- Return type:
Union[torch.Tensor, list[torch.Tensor]]
- lightrft.trainer.experience_maker_vl.to(tensor: torch.Tensor | list[torch.Tensor], device)[source]¶
Move tensor(s) to the specified device.
- Parameters:
tensor (Union[torch.Tensor, list[torch.Tensor]]) – Tensor or list of tensors to move.
device (torch.device or str) – Target device.
- Returns:
Tensor(s) on the target device.
- Return type:
Union[torch.Tensor, list[torch.Tensor]]
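Both module-level helpers accept either a tensor or a list of tensors, which suggests a simple recursive shape. A hedged sketch (the CUDA-availability guard in pin_memory is an assumption, added so CPU-only environments don't crash; the library's code may behave differently):

```python
import torch


def to(tensor, device):
    # Recurse into lists so nested tensor collections move together.
    if isinstance(tensor, list):
        return [to(t, device) for t in tensor]
    return tensor.to(device)


def pin_memory(tensor):
    # Page-locked memory speeds up host-to-GPU copies; pinning
    # needs CUDA, so fall through on CPU-only machines (assumption).
    if isinstance(tensor, list):
        return [pin_memory(t) for t in tensor]
    return tensor.pin_memory() if torch.cuda.is_available() else tensor
```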