lightrft.strategy.deepspeed.deepspeed¶
DeepSpeed Strategy Module for lightrft.
This module provides a DeepSpeed-based strategy for training and inference in lightrft. It handles model initialization, optimization, parameter management, and checkpoint operations using DeepSpeed’s distributed training capabilities. The strategy supports various DeepSpeed features including ZeRO optimization stages, mixed precision training, and parameter offloading.
ModelOptimPair¶
- lightrft.strategy.deepspeed.deepspeed.ModelOptimPair¶
alias of
Tuple[Module, Optimizer]
ModelOrModelOptimPair¶
- lightrft.strategy.deepspeed.deepspeed.ModelOrModelOptimPair¶
alias of
Union[Module, Tuple[Module, Optimizer]]
DeepspeedStrategy¶
- class lightrft.strategy.deepspeed.deepspeed.DeepspeedStrategy(seed: int = 42, max_norm: float = 0.0, micro_train_batch_size: int = 1, train_batch_size: int = 1, zero_stage: int = 2, bf16: bool = True, args=None)[source]¶
DeepSpeed implementation of the training strategy.
Modified from https://github.com/OpenRLHF/OpenRLHF, with these changes: 1. inherits from StrategyBase and adds some APIs; 2. removes ring-attention-related code.
- Parameters:
seed (int) – Random seed for reproducibility
max_norm (float) – Maximum gradient norm for gradient clipping (0.0 means no clipping)
micro_train_batch_size (int) – Batch size for a single GPU/process
train_batch_size (int) – Global batch size across all GPUs/processes
zero_stage (int) – DeepSpeed ZeRO optimization stage (0, 1, 2, or 3)
bf16 (bool) – Whether to use bfloat16 precision
args (object) – Additional arguments for configuration
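In DeepSpeed, the global batch size, per-GPU micro batch size, world size, and gradient accumulation steps are tied together by `train_batch_size == micro_batch_per_gpu * grad_accum * world_size`. A minimal sketch of that relationship, assuming the strategy derives accumulation steps the standard DeepSpeed way (the helper name here is illustrative, not a lightrft API):

```python
def grad_accum_steps(train_batch_size: int,
                     micro_train_batch_size: int,
                     world_size: int) -> int:
    """Derive gradient accumulation steps from the batch-size settings.

    DeepSpeed requires:
        train_batch_size == micro_batch_per_gpu * grad_accum * world_size
    """
    per_step = micro_train_batch_size * world_size
    assert train_batch_size % per_step == 0, \
        "global batch size must be divisible by micro_batch * world_size"
    return train_batch_size // per_step

# e.g. global batch 64, micro batch 2 per GPU, 8 GPUs -> 4 accumulation steps
print(grad_accum_steps(64, 2, 8))
```

With the defaults (`train_batch_size=1`, `micro_train_batch_size=1`) on a single process, this yields one accumulation step, i.e. no accumulation.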
- __init__(seed: int = 42, max_norm: float = 0.0, micro_train_batch_size: int = 1, train_batch_size: int = 1, zero_stage: int = 2, bf16: bool = True, args=None) None[source]¶
Initialize strategy with common parameters.
- Parameters:
seed (int) – Random seed for reproducibility
max_norm (float) – Maximum gradient norm for clipping
micro_train_batch_size (int) – Batch size for each training step
train_batch_size (int) – Total batch size for training
zero_stage (int) – DeepSpeed ZeRO optimization stage (0, 1, 2, or 3)
bf16 (bool) – Whether to use bfloat16 precision
args (Any, usually argparse.Namespace) – Additional configuration arguments
- backward(loss: torch.Tensor, model: torch.nn.Module, optimizer: torch.optim.Optimizer, **kwargs) None[source]¶
Perform backward pass to compute gradients.
- Parameters:
loss (torch.Tensor) – The loss tensor to backpropagate
model (nn.Module) – The model being trained
optimizer (optim.Optimizer) – The optimizer for the model
kwargs – Additional arguments
- create_optimizer(model, **kwargs) torch.optim.Optimizer[source]¶
Create an optimizer for the given model.
- Parameters:
model (nn.Module) – The model to create an optimizer for
kwargs – Additional arguments for the optimizer
- Returns:
The created optimizer
- Return type:
Optimizer
- get_ds_eval_config(offload=False)[source]¶
Get the DeepSpeed configuration for evaluation.
- Parameters:
offload (bool) – Whether to offload parameters to CPU
- Returns:
DeepSpeed configuration dictionary
- Return type:
dict
- get_ds_train_config(is_actor)[source]¶
Get the DeepSpeed configuration for training.
- Parameters:
is_actor (bool) – Whether the model is an actor model
- Returns:
DeepSpeed configuration dictionary
- Return type:
dict
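The exact dictionary returned by get_ds_train_config is not reproduced in this reference, but a training config built from the strategy's parameters would plausibly resemble the following sketch. The helper name is hypothetical; the keys (`train_micro_batch_size_per_gpu`, `zero_optimization`, `bf16`, `gradient_clipping`) are standard DeepSpeed configuration fields:

```python
def make_train_config(zero_stage: int = 2, bf16: bool = True,
                      micro_batch: int = 1, grad_accum: int = 1,
                      max_norm: float = 0.0) -> dict:
    # Illustrative config mirroring common DeepSpeed setups; the actual
    # dict produced by get_ds_train_config may differ in detail.
    cfg = {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "zero_optimization": {"stage": zero_stage},
        "bf16": {"enabled": bf16},
        "steps_per_print": 100,
    }
    if max_norm > 0.0:
        # 0.0 means no clipping, matching the strategy's max_norm semantics
        cfg["gradient_clipping"] = max_norm
    return cfg

cfg = make_train_config(zero_stage=3, bf16=True, max_norm=1.0)
```

Such a dict is what you would pass as the `config` argument to `deepspeed.initialize`.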
- init_model_context()[source]¶
Context manager for initializing a model with DeepSpeed configuration.
This sets up the HfDeepSpeedConfig for use with Hugging Face models.
- load_ckpt(model, load_dir, tag=None, load_module_strict=True, load_optimizer_states=True, load_lr_scheduler_states=True, load_module_only=False, **_kwargs)[source]¶
Load a DeepSpeed model checkpoint from the specified directory.
This function wraps DeepSpeed’s checkpoint loading functionality with error handling.
- Parameters:
model (deepspeed.DeepSpeedEngine) – The DeepSpeed model to load the checkpoint into
load_dir (str) – Directory from which to load the checkpoint
tag (str, optional) – Optional tag to specify which checkpoint to load
load_module_strict (bool, default=True) – Whether to strictly enforce that the keys in the model state dict match the keys in the checkpoint
load_optimizer_states (bool, default=True) – Whether to load optimizer states from checkpoint
load_lr_scheduler_states (bool, default=True) – Whether to load learning rate scheduler states
load_module_only (bool, default=False) – Whether to load only the module weights and not optimizer or scheduler states
- Returns:
A tuple containing the checkpoint path and loaded states
- Return type:
Tuple[str, dict]
- Raises:
AssertionError – If model is not a DeepSpeedEngine
Exception – If loading the checkpoint fails
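When `tag` is None, DeepSpeed's checkpoint loader conventionally reads the tag from a `latest` file in the checkpoint directory. A sketch of that resolution step (illustrative only; load_ckpt itself delegates the actual loading to DeepSpeed):

```python
import os
from typing import Optional

def resolve_tag(load_dir: str, tag: Optional[str]) -> Optional[str]:
    """Resolve the checkpoint tag the way DeepSpeed does when tag is None:
    read it from the 'latest' marker file in the checkpoint directory."""
    if tag is not None:
        return tag
    latest_path = os.path.join(load_dir, "latest")
    if os.path.isfile(latest_path):
        with open(latest_path) as f:
            return f.read().strip()
    return None  # no checkpoint marker found
```

An explicit tag always wins; otherwise the `latest` marker decides which checkpoint subdirectory is loaded.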
- load_model(model: torch.nn.Module, path: str, map_location='cpu', strict: bool = False, key_replace_fn=None) None[source]¶
Load model weights from a file.
- Parameters:
model (nn.Module) – The model to load weights into
path (str) – Path to the saved model weights
map_location (str or torch.device) – Device to load the weights to
strict (bool) – Whether to strictly enforce that the keys in state_dict match the model
key_replace_fn (callable) – Function to modify state dict keys
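A typical use of `key_replace_fn` is remapping parameter names between wrapped and unwrapped models. Assuming the callable receives the whole state dict and returns a modified one (the doc only says it modifies keys), a sketch:

```python
def strip_module_prefix(state_dict: dict) -> dict:
    """Example key_replace_fn: drop the 'module.' prefix that DDP/DeepSpeed
    wrappers prepend to parameter names, so a checkpoint saved from a
    wrapped model loads into an unwrapped one."""
    prefix = "module."
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state_dict.items()}

sd = {"module.encoder.weight": 1, "head.bias": 2}
print(strip_module_prefix(sd))  # {'encoder.weight': 1, 'head.bias': 2}
```

Combined with `strict=False` (the default here), this tolerates checkpoints whose key sets only partially match the target model.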
- moving_average(model, model_ema, beta=0.992, device='cpu')[source]¶
Update model_ema parameters with exponential moving average of model parameters.
- Parameters:
model (nn.Module) – Source model
model_ema (nn.Module) – Target model for EMA
beta (float) – EMA decay factor
device (str) – Device to perform operations on
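Per parameter, an EMA update under the common convention is `ema = beta * ema + (1 - beta) * param`. A pure-Python sketch of that update (the real method iterates over torch tensors and, under ZeRO-3, must gather sharded parameters first; the exact update it applies may differ in detail):

```python
def ema_update(params, ema_params, beta=0.992):
    """Exponential moving average over a flat list of parameter values:
    ema = beta * ema + (1 - beta) * param, elementwise."""
    return [beta * e + (1.0 - beta) * p for p, e in zip(params, ema_params)]

# beta close to 1.0 means the EMA model tracks the trained model slowly
print(ema_update([1.0, 2.0], [0.0, 0.0], beta=0.5))  # [0.5, 1.0]
```

With the default `beta=0.992`, roughly 0.8% of each new parameter value is mixed in per call, giving a smoothed copy of the model for evaluation.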
- optimizer_step(optimizer: torch.optim.Optimizer, model: torch.nn.Module, scheduler, name='model', **kwargs) None[source]¶
Perform an optimization step.
- Parameters:
optimizer (optim.Optimizer) – The optimizer to step
model (nn.Module) – The model being trained
scheduler – The learning rate scheduler
name (str) – Name identifier for the model
kwargs – Additional arguments

- prepare(*models_or_model_optim_pairs: torch.nn.Module | Tuple[torch.nn.Module, torch.optim.Optimizer], is_rlhf=False) List[torch.nn.Module | Tuple[torch.nn.Module, torch.optim.Optimizer]] | torch.nn.Module | Tuple[torch.nn.Module, torch.optim.Optimizer][source]¶
Prepare models and optimizers for DeepSpeed training.
Expected input format for RLHF: ((actor, actor_optim, actor_scheduler), (critic, critic_optim, critic_scheduler), reward_models, initial_model)
- Parameters:
models_or_model_optim_pairs – Models or (model, optimizer, scheduler) tuples
is_rlhf (bool) – Whether this is for RLHF training
- Returns:
Prepared models or model-optimizer pairs
- Return type:
Union[List[ModelOrModelOptimPair], ModelOrModelOptimPair]
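The signature accepts both bare modules and (model, optimizer, scheduler) tuples, and returns a single item or a list depending on how many inputs were given. A sketch of that dispatch shape, with placeholder values standing in for real modules (illustrative only, not lightrft internals):

```python
def normalize_inputs(*models_or_pairs):
    """Sketch of how a prepare-style API treats mixed inputs: a bare model
    is wrapped for inference only, while a (model, optimizer, scheduler)
    tuple is kept together for training."""
    out = []
    for item in models_or_pairs:
        if isinstance(item, tuple):
            out.append(item)          # training: (model, optimizer, scheduler)
        else:
            out.append((item, None))  # inference-only model, no optimizer
    # Mirror the Union return type: single input -> single result
    return out if len(out) > 1 else out[0]
```

So `prepare(model)` hands back one prepared model, while `prepare((actor, opt, sched), ref_model)` hands back a list in the same order.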
- prepare_reward_models(reward_models)[source]¶
Prepare reward models for DeepSpeed evaluation.
- Parameters:
reward_models (List[nn.Module]) – List of reward models to prepare
- Returns:
List of prepared reward models
- Return type:
List[nn.Module]
- save_ckpt(model, save_dir, tag=None, max_num=3, max_mem=1000, client_state={}, save_latest=True, **_kwargs)[source]¶
Save a DeepSpeed model checkpoint with automatic management of checkpoint storage.
This function manages checkpoint storage by limiting the number of checkpoints and total storage size. It automatically removes the oldest checkpoints when limits are exceeded.
- Parameters:
model (deepspeed.DeepSpeedEngine) – The DeepSpeed model to save
save_dir (str) – Directory where checkpoints will be saved
tag (str, optional) – Optional tag for the checkpoint (e.g., iteration number)
max_num (int, default=3) – Maximum number of checkpoints to keep
max_mem (int, default=1000) – Maximum storage size for checkpoints in GB
client_state (dict, default={}) – Additional client state to save with the checkpoint
save_latest (bool, default=True) – Whether to save a symbolic link to the latest checkpoint
_kwargs (dict) – Unused keyword arguments (e.g. optimizer, scheduler), accepted for interface compatibility
- Returns:
None
- Raises:
AssertionError – If model is not a DeepSpeedEngine
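The retention policy described above (cap the checkpoint count, delete oldest first) can be sketched as follows. This is illustrative, not the method's actual implementation; the real save_ckpt also enforces the `max_mem` size budget in GB:

```python
import os
import shutil

def prune_checkpoints(save_dir: str, max_num: int) -> None:
    """Keep at most max_num checkpoint subdirectories under save_dir,
    deleting the oldest (by modification time) first."""
    tags = [d for d in os.listdir(save_dir)
            if os.path.isdir(os.path.join(save_dir, d))]
    # Oldest first by modification time
    tags.sort(key=lambda d: os.path.getmtime(os.path.join(save_dir, d)))
    while len(tags) > max_num:
        shutil.rmtree(os.path.join(save_dir, tags.pop(0)))
```

Running this before (or after) each `engine.save_checkpoint` call keeps disk usage bounded across a long training run.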
- save_model(model: torch.nn.Module, tokenizer, output_dir, **kwargs) None[source]¶
Save a model, its configuration, and tokenizer to the specified output directory.
Handles special cases for DeepSpeed ZeRO-2/3 parameters and PEFT models. For ZeRO parallelism, it gathers distributed parameters before saving. For PEFT models, it saves adapter weights appropriately based on the DeepSpeed stage.
- Parameters:
model (nn.Module) – The model to save
tokenizer (PreTrainedTokenizer or similar) – The tokenizer to save
output_dir (str) – Directory where the model, config, and tokenizer will be saved
kwargs – Additional arguments to pass to the model’s save_pretrained method
- Returns:
None