lightrft.models.actor_vl¶
Vision-Language Actor Model Module for Reinforcement Learning.
This module provides the ActorVL class, which implements an actor model specifically designed for vision-language tasks in reinforcement learning scenarios. The actor is responsible for generating actions (text sequences) based on visual inputs (images and videos) and textual prompts.
The module supports various optimization techniques, including:

- LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
- Flash Attention 2.0 for improved performance
- DeepSpeed integration for distributed training
- Sample packing for efficient batch processing
Key Features:

- Multi-modal input processing (text + vision)
- Flexible model loading from pretrained checkpoints
- Support for various vision-language model architectures
- Gradient checkpointing for memory optimization
- MoE (Mixture of Experts) model support
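Sample packing, listed above, concatenates variable-length sequences into one flat batch and records per-sample lengths so the per-sample boundaries can be recovered later. The sketch below illustrates only the idea in plain Python; the actual implementation operates on tensors and passes the lengths to the model as `packed_seq_lens` (function names here are hypothetical):

```python
def pack_samples(samples):
    """Concatenate variable-length token lists into one flat sequence,
    recording each sample's length for later unpacking."""
    packed, packed_seq_lens = [], []
    for tokens in samples:
        packed.extend(tokens)
        packed_seq_lens.append(len(tokens))
    return packed, packed_seq_lens


def unpack_samples(packed, packed_seq_lens):
    """Recover the original per-sample sequences from the flat list."""
    out, offset = [], 0
    for n in packed_seq_lens:
        out.append(packed[offset:offset + n])
        offset += n
    return out


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed, lens = pack_samples(samples)
# packed == [1, 2, 3, 4, 5, 6, 7, 8, 9]; lens == [3, 2, 4]
assert unpack_samples(packed, lens) == samples
```

Packing avoids padding every sample to the longest sequence in the batch, which is the efficiency gain referred to above.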
- class lightrft.models.actor_vl.ActorVL(*args: Any, **kwargs: Any)[source]¶
Bases: Module

Vision-Language Actor model for reinforcement learning applications.
This class serves as a foundation for implementing vision-language actor models in RL. Such models generate text sequences (actions) conditioned on visual inputs (images and videos) and textual prompts. The model supports various optimization techniques, including LoRA adaptation, quantization, and distributed training.
The actor model can be initialized either from a pretrained model path or from an existing model instance, providing flexibility in model deployment scenarios.
- Parameters:
pretrain_or_model (Union[str, nn.Module]) – Either a string path to a pretrained model or a model instance
use_flash_attention_2 (bool) – Whether to utilize Flash Attention 2.0 for improved performance
bf16 (bool) – Enable bfloat16 precision for model computations
lora_rank (int) – Rank for LoRA adaptation (0 disables LoRA)
lora_alpha (int) – Alpha parameter for LoRA scaling
lora_dropout (float) – Dropout rate for LoRA layers
target_modules (Optional[list]) – List of target modules for applying LoRA (auto-detected if None)
ds_config (Optional[dict]) – Configuration for DeepSpeed distributed training
device_map (Optional[dict]) – Device mapping for loading the model onto specific devices
packing_samples (bool) – Whether to pack samples during training for efficiency
Example:
# Initialize with a pretrained model path
actor = ActorVL(
    pretrain_or_model="microsoft/LLaVA-1.5-7b-hf",
    use_flash_attention_2=True,
    lora_rank=16,
    lora_alpha=32,
)

# Generate responses
sequences, attention_mask, action_mask = actor.generate(
    input_ids=input_tensor,
    pixel_values=image_tensor,
    image_grid_thw=grid_tensor,
    max_new_tokens=100,
)
- forward(sequences: torch.LongTensor, num_actions: int | list[int] | None = None, attention_mask: torch.Tensor | None = None, pixel_values: torch.Tensor | None = None, image_grid_thw: torch.Tensor | None = None, pixel_values_videos: torch.Tensor | None = None, video_grid_thw: torch.Tensor | None = None, return_output=False, packed_seq_lens: list[int] | None = None) torch.Tensor[source]¶
Forward pass to compute action log probabilities for reinforcement learning.
This method processes input sequences and visual information to compute log probabilities of actions (tokens) for RL training. It supports both standard and packed sequence formats and can return either just the action log probabilities or the full model output.
- Parameters:
sequences (torch.LongTensor) – Input token sequences
num_actions (Optional[Union[int, list[int]]]) – Number of action tokens to extract log probs for
attention_mask (Optional[torch.Tensor]) – Attention mask for the sequences
pixel_values (Optional[torch.Tensor]) – Preprocessed pixel values of input images
image_grid_thw (Optional[torch.Tensor]) – Image grid dimensions (time, height, width)
pixel_values_videos (Optional[torch.Tensor]) – Preprocessed pixel values of input videos
video_grid_thw (Optional[torch.Tensor]) – Video grid dimensions
return_output (bool) – Whether to return the full model output along with log probs
packed_seq_lens (Optional[list[int]]) – Sequence lengths for packed samples
- Returns:
Action log probabilities or tuple of (action_log_probs, output) if return_output=True
- Return type:
torch.Tensor
Example:
# Compute action log probabilities for RL training
log_probs = actor(
    sequences=token_sequences,
    num_actions=10,
    pixel_values=image_tensor,
    image_grid_thw=grid_tensor,
)

# Get both log probs and full output
log_probs, output = actor(
    sequences=token_sequences,
    num_actions=10,
    pixel_values=image_tensor,
    image_grid_thw=grid_tensor,
    return_output=True,
)
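Conceptually, `num_actions` selects the trailing (generated) tokens whose log probabilities are needed for RL training, discarding the prompt portion of each sequence. A simplified illustration with plain Python lists (the real method computes these from model logits; the helper name is hypothetical):

```python
def take_action_log_probs(token_log_probs, num_actions):
    """Keep only the log probs of the last ``n`` tokens of each sequence,
    i.e. the generated response, dropping the prompt tokens."""
    return [seq[-n:] for seq, n in zip(token_log_probs, num_actions)]


# Two sequences of per-token log probs; the last 2 and last 1 tokens
# respectively are the generated actions.
log_probs = [[-0.1, -0.2, -0.3, -0.4], [-0.5, -0.6, -0.7]]
actions = take_action_log_probs(log_probs, [2, 1])
assert actions == [[-0.3, -0.4], [-0.7]]
```

With packed samples, the same selection is applied per packed segment using `packed_seq_lens` to locate each sample's boundary.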
- generate(input_ids: torch.Tensor, pixel_values: torch.Tensor | None = None, image_grid_thw: torch.Tensor | None = None, pixel_values_videos: torch.Tensor | None = None, video_grid_thw: torch.Tensor | None = None, **kwargs) Tuple[torch.LongTensor, torch.LongTensor] | Tuple[torch.LongTensor, torch.LongTensor, torch.BoolTensor]¶
Generate text sequences based on input text and visual information.
This method performs text generation conditioned on both textual prompts and visual inputs. It handles the generation process with various sampling strategies and returns the generated sequences along with attention masks and action masks for RL training.
- Parameters:
input_ids (torch.Tensor) – Input token IDs representing the text prompt
pixel_values (Optional[torch.Tensor]) – Preprocessed pixel values of input images
image_grid_thw (Optional[torch.Tensor]) – Image grid dimensions (time, height, width)
pixel_values_videos (Optional[torch.Tensor]) – Preprocessed pixel values of input videos
video_grid_thw (Optional[torch.Tensor]) – Video grid dimensions
kwargs (dict) – Additional generation parameters (top_k, top_p, temperature, etc.)
- Returns:
Tuple containing generated sequences, attention mask, and action mask
- Return type:
Union[Tuple[torch.LongTensor, torch.LongTensor], Tuple[torch.LongTensor, torch.LongTensor, torch.BoolTensor]]
Example:
sequences, attention_mask, action_mask = actor.generate(
    input_ids=torch.tensor([[1, 2, 3]]),
    pixel_values=image_tensor,
    image_grid_thw=torch.tensor([[1, 24, 24]]),
    max_new_tokens=50,
    temperature=0.8,
    do_sample=True,
)
- gradient_checkpointing_disable()[source]¶
Disable gradient checkpointing to use normal forward/backward computation.
This method restores the default behavior where all intermediate activations are stored during the forward pass for use in the backward pass. This increases memory usage but reduces computation time.
Example:
# Disable gradient checkpointing
actor.gradient_checkpointing_disable()
- gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant': False})[source]¶
Enable gradient checkpointing to reduce memory usage during training.
Gradient checkpointing trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them. This is particularly useful for training large vision-language models with limited GPU memory.
- Parameters:
gradient_checkpointing_kwargs (dict) – Additional arguments for gradient checkpointing
Example:
# Enable gradient checkpointing with default settings
actor.gradient_checkpointing_enable()

# Enable with custom settings
actor.gradient_checkpointing_enable({"use_reentrant": True})
- modality = 'vision'¶
- print_trainable_parameters()[source]¶
Print information about trainable parameters in the model.
This method displays the number and percentage of trainable parameters, which is particularly useful when using parameter-efficient methods like LoRA. It helps monitor the efficiency of the fine-tuning approach.
Example:
# Print trainable parameter statistics
actor.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 7,241,732,096 || trainable%: 0.058
- process_sequences(sequences: torch.Tensor, input_len: int, eos_token_id: int, pad_token_id: int) Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶
Called by trainer/fast_exp_maker.py.
Process generated sequences to create proper attention and action masks.
This method post-processes the generated sequences to ensure proper handling of end-of-sequence tokens and creates masks needed for reinforcement learning training. It handles edge cases like multiple EOS tokens and ensures consistent sequence formatting.
- Parameters:
sequences (torch.Tensor) – Generated token sequences
input_len (int) – Length of the input prompt
eos_token_id (int) – End-of-sequence token ID
pad_token_id (int) – Padding token ID
- Returns:
Tuple of processed sequences, attention mask, and action mask
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
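Example:

The kind of post-processing described above can be sketched for a single sequence with plain Python lists. This is a simplified, hypothetical version (the actual method operates on batched tensors and may define the action mask with a different length or offset):

```python
def process_sequence(tokens, input_len, eos_token_id, pad_token_id):
    """Truncate the response after its first EOS token, pad the
    remainder, and build attention and action masks."""
    seq = list(tokens)
    end = len(seq)  # one past the last valid token
    # Find the first EOS at or after the end of the prompt.
    for i in range(input_len, len(seq)):
        if seq[i] == eos_token_id:
            end = i + 1
            break
    # Everything after the first response EOS becomes padding.
    for i in range(end, len(seq)):
        seq[i] = pad_token_id
    # Attention mask covers prompt + response up to and including EOS.
    attention_mask = [1 if i < end else 0 for i in range(len(seq))]
    # Action mask covers only the generated (response) tokens.
    action_mask = [1 if input_len <= i < end else 0 for i in range(len(seq))]
    return seq, attention_mask, action_mask


# Prompt [5, 6]; response [7, 8, EOS, 9] with a stray token after EOS.
seq, attn, act = process_sequence([5, 6, 7, 8, 2, 9], input_len=2,
                                  eos_token_id=2, pad_token_id=0)
assert seq == [5, 6, 7, 8, 2, 0]
assert attn == [1, 1, 1, 1, 1, 0]
assert act == [0, 0, 1, 1, 1, 0]
```

This illustrates the edge case mentioned above: any tokens sampled after the first EOS in the response (including further EOS tokens) are masked out and replaced with padding.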