lightrft.models.actor_language¶
Actor model implementation for reinforcement learning with language models.
This module provides the ActorLanguage class, which serves as a foundation for implementing actor models in reinforcement learning scenarios. The actor is responsible for selecting actions based on the learned policy. ActorLanguage is text-only (vision-language inputs are handled by ActorVL) and supports optimization techniques such as LoRA adaptation, quantization, and distributed training with DeepSpeed.
The module handles model initialization from pretrained checkpoints or existing model instances, applies optimizations such as Flash Attention, and provides methods for text generation and forward passes with action log-probability computation.
- class lightrft.models.actor_language.ActorLanguage(*args: Any, **kwargs: Any)[source]¶
Bases: Module

A general Actor model for reinforcement learning with language models. This implementation is text-only (modality = 'text'); for vision-language (VL) models, use ActorVL.
This class serves as a foundation for implementing various actor models, which are responsible for selecting actions based on the policy learned from the environment. It supports advanced features like LoRA adaptation, quantization, Flash Attention, and distributed training.
- Parameters:
pretrain_or_model (Union[str, nn.Module]) – A pretrained model path/name or a model instance to be used as the actor.
use_flash_attention_2 (bool) – Whether to utilize Flash Attention 2.0 for improved performance.
bf16 (bool) – Enable bfloat16 precision for model computations.
lora_rank (int) – Rank for LoRA adaptation. Set to 0 to disable LoRA.
lora_alpha (int) – Alpha parameter for LoRA scaling.
lora_dropout (float) – Dropout rate for LoRA layers.
target_modules (Optional[List[str]]) – List of target modules for applying LoRA. If None, auto-detects linear modules.
ds_config (Optional[dict]) – Configuration for DeepSpeed, enabling model partitioning across multiple GPUs.
device_map (Optional[dict]) – Device mapping for loading the model onto specific devices.
packing_samples (bool) – Whether to pack samples during training for efficiency.
Example:
```python
# Initialize with a pretrained model
actor = ActorLanguage(
    pretrain_or_model="microsoft/DialoGPT-medium",
    lora_rank=16,
    lora_alpha=32,
    use_flash_attention_2=True,
)

# Generate text
input_ids = torch.tensor([[1, 2, 3, 4]])
sequences, attention_mask, action_mask = actor.generate(
    input_ids=input_ids,
    max_new_tokens=50,
    temperature=0.7,
)
```
- forward(sequences: torch.LongTensor, num_actions: int | list[int] | None = None, attention_mask: torch.Tensor | None = None, return_output: bool = False, packed_seq_lens: list[int] | None = None)[source]¶
Forward pass through the actor model.
Computes action log probabilities for reinforcement learning training. Supports both regular and packed sequence processing for efficient training.
NOTE: This is a text-only model. It does NOT accept pixel_values, image_grid_thw, pixel_values_videos, or video_grid_thw parameters. Use ActorVL for multimodal inputs.
- Parameters:
sequences (torch.LongTensor) – Input token sequences.
num_actions (Optional[Union[int, List[int]]]) – Number of action tokens to extract log probabilities for.
attention_mask (Optional[torch.Tensor]) – Attention mask for the sequences.
return_output (bool) – Whether to return the full model output along with action log probabilities.
packed_seq_lens (Optional[List[int]]) – Sequence lengths for packed samples.
- Returns:
Action log probabilities, optionally with full model output.
- Return type:
Union[torch.Tensor, Tuple[torch.Tensor, dict]]
Example:
```python
sequences = torch.tensor([[1, 2, 3, 4, 5]])
attention_mask = torch.ones_like(sequences)
action_log_probs = actor.forward(
    sequences=sequences,
    num_actions=2,
    attention_mask=attention_mask,
)
```
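When packing_samples is enabled, multiple variable-length sequences are concatenated into a single row and their lengths are passed via packed_seq_lens. A minimal sketch of preparing packed inputs (illustrative only; the num_actions values below are hypothetical, and the exact packing convention may differ from the library's):

```python
import torch

# Three variable-length sequences packed into one row (no padding needed).
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6, 7, 8, 9])]
packed = torch.cat(seqs).unsqueeze(0)      # shape (1, 9)
packed_seq_lens = [len(s) for s in seqs]   # [3, 2, 4]
attention_mask = torch.ones_like(packed)   # every position is a real token

# Number of action (response) tokens per packed sequence, e.g.:
num_actions = [1, 1, 2]

# action_log_probs = actor.forward(
#     packed,
#     num_actions=num_actions,
#     attention_mask=attention_mask,
#     packed_seq_lens=packed_seq_lens,
# )
```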
- generate(input_ids: torch.Tensor, pixel_values: torch.Tensor | None = None, image_grid_thw: torch.Tensor | None = None, **kwargs) Tuple[torch.LongTensor, torch.LongTensor] | Tuple[torch.LongTensor, torch.LongTensor, torch.BoolTensor]¶
Generate text sequences using the actor model.
Performs text generation with various decoding strategies and returns processed sequences with attention masks and action masks for reinforcement learning.
- Parameters:
input_ids (torch.Tensor) – Input token IDs for generation.
pixel_values (Optional[torch.Tensor]) – Pixel values for vision-language models (currently unused).
image_grid_thw (Optional[torch.Tensor]) – Image grid dimensions for vision-language models (currently unused).
kwargs – Additional generation parameters including max_new_tokens, temperature, etc.
- Returns:
Tuple containing generated sequences, attention mask, and action mask.
- Return type:
Union[Tuple[torch.LongTensor, torch.LongTensor], Tuple[torch.LongTensor, torch.LongTensor, torch.BoolTensor]]
Example:
```python
input_ids = torch.tensor([[1, 2, 3]])
sequences, attention_mask, action_mask = actor.generate(
    input_ids=input_ids,
    max_new_tokens=20,
    temperature=0.8,
    do_sample=True,
)
```
- gradient_checkpointing_disable()[source]¶
Disable gradient checkpointing.
Turns off gradient checkpointing to use standard backpropagation, which uses more memory but is computationally faster.
Example:
actor.gradient_checkpointing_disable()
- gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant': False})[source]¶
Enable gradient checkpointing for memory-efficient training.
Activates gradient checkpointing to reduce memory usage during backpropagation at the cost of additional computation.
- Parameters:
gradient_checkpointing_kwargs (dict) – Configuration parameters for gradient checkpointing.
Example:
actor.gradient_checkpointing_enable({"use_reentrant": False})
- modality = 'text'¶
- print_trainable_parameters()[source]¶
Print information about trainable parameters in the model.
Displays the number and percentage of trainable parameters, which is particularly useful when using techniques like LoRA that only train a subset of parameters.
Example:
```python
actor.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 125,000,000 || trainable%: 3.36
```
- process_sequences(sequences: torch.Tensor, input_len: int, eos_token_id: int, pad_token_id: int) Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶
Called by trainer/fast_exp_maker.py.
Process generated sequences to create proper attention and action masks.
This method post-processes the generated sequences to ensure proper handling of end-of-sequence tokens and creates masks needed for reinforcement learning training. It handles edge cases like multiple EOS tokens and ensures consistent sequence formatting.
- Parameters:
sequences (torch.Tensor) – Generated token sequences
input_len (int) – Length of the input prompt
eos_token_id (int) – End-of-sequence token ID
pad_token_id (int) – Padding token ID
- Returns:
Tuple of processed sequences, attention mask, and action mask
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
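The mask construction described above can be sketched in plain PyTorch. This is a simplified re-implementation for illustration, not the library's actual code; it omits edge cases such as left-padded prompts and assumes right-padded batches:

```python
import torch


def process_sequences_sketch(sequences, input_len, eos_token_id, pad_token_id):
    """Illustrative sketch: build attention and action masks for generated sequences."""
    batch, seq_len = sequences.shape
    # Tokens that are neither EOS nor padding count as real content.
    content = sequences.ne(eos_token_id) & sequences.ne(pad_token_id)
    positions = torch.arange(seq_len).unsqueeze(0).expand(batch, -1)
    # Index of the last content token per row; the EOS right after it is kept,
    # so duplicate trailing EOS tokens are masked out.
    last_content = (content * positions).max(dim=1, keepdim=True).values
    eos_index = (last_content + 1).clamp(max=seq_len - 1)
    attention_mask = (positions <= eos_index).long()
    # The action mask covers the generated part (positions >= input_len)
    # that is still attended.
    action_mask = attention_mask[:, input_len:].bool()
    return sequences, attention_mask, action_mask
```

For example, with a prompt of length 2 followed by one generated token, an EOS, and two pads, the attention mask covers the first four positions and the action mask covers the two generated positions.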