lightrft.strategy.deepspeed.deepspeed

DeepSpeed Strategy Module for lightrft.

This module provides a DeepSpeed-based strategy for training and inference in lightrft. It handles model initialization, optimization, parameter management, and checkpoint operations using DeepSpeed’s distributed training capabilities. The strategy supports various DeepSpeed features including ZeRO optimization stages, mixed precision training, and parameter offloading.

ModelOptimPair

lightrft.strategy.deepspeed.deepspeed.ModelOptimPair

alias of Tuple[Module, Optimizer]

ModelOrModelOptimPair

lightrft.strategy.deepspeed.deepspeed.ModelOrModelOptimPair

alias of Union[Module, Tuple[Module, Optimizer]]

DeepspeedStrategy

class lightrft.strategy.deepspeed.deepspeed.DeepspeedStrategy(seed: int = 42, max_norm: float = 0.0, micro_train_batch_size: int = 1, train_batch_size: int = 1, zero_stage: int = 2, bf16: bool = True, args=None)[source]

DeepSpeed implementation of the training strategy.

Modified from https://github.com/OpenRLHF/OpenRLHF, with the following changes:

  1. Inherits from StrategyBase and adds some additional APIs.

  2. Removes the ring-attention-related code.

Parameters:
  • seed (int) – Random seed for reproducibility

  • max_norm (float) – Maximum gradient norm for gradient clipping (0.0 means no clipping)

  • micro_train_batch_size (int) – Batch size for a single GPU/process

  • train_batch_size (int) – Global batch size across all GPUs/processes

  • zero_stage (int) – DeepSpeed ZeRO optimization stage (0, 1, 2, or 3)

  • bf16 (bool) – Whether to use bfloat16 precision

  • args (object) – Additional arguments for configuration
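The batch-size parameters are linked by DeepSpeed's batching constraint: the global train_batch_size must equal the per-GPU micro batch size times the gradient-accumulation steps times the world size. A minimal sketch of that arithmetic (the helper name grad_accum_steps is illustrative, not part of lightrft):

```python
def grad_accum_steps(train_batch_size: int,
                     micro_train_batch_size: int,
                     world_size: int) -> int:
    """Derive gradient-accumulation steps from the batch sizes.

    DeepSpeed requires:
        train_batch_size ==
            micro_batch_per_gpu * gradient_accumulation_steps * world_size
    """
    per_optimizer_step = micro_train_batch_size * world_size
    assert train_batch_size % per_optimizer_step == 0, \
        "global batch size must be divisible by micro_batch * world_size"
    return train_batch_size // per_optimizer_step

# e.g. global batch 128, micro batch 1 per GPU, 8 GPUs -> 16 accumulation steps
print(grad_accum_steps(128, 1, 8))  # → 16
```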

__init__(seed: int = 42, max_norm: float = 0.0, micro_train_batch_size: int = 1, train_batch_size: int = 1, zero_stage: int = 2, bf16: bool = True, args=None) None[source]

Initialize strategy with common parameters.

Parameters:
  • seed (int) – Random seed for reproducibility

  • max_norm (float) – Maximum gradient norm for clipping

  • micro_train_batch_size (int) – Per-GPU batch size for each training step

  • train_batch_size (int) – Global batch size across all GPUs/processes

  • args (Any, usually argparse.Namespace) – Additional configuration arguments

backward(loss: torch.Tensor, model: torch.nn.Module, optimizer: torch.optim.Optimizer, **kwargs) None[source]

Perform backward pass to compute gradients.

Parameters:
  • loss (torch.Tensor) – The loss tensor to backpropagate

  • model (nn.Module) – The model being trained

  • optimizer (optim.Optimizer) – The optimizer for the model

  • kwargs – Additional arguments

create_optimizer(model, **kwargs) torch.optim.Optimizer[source]

Create an optimizer for the given model.

Parameters:
  • model (nn.Module) – The model to create an optimizer for

  • kwargs – Additional arguments for the optimizer

Returns:

The created optimizer

Return type:

Optimizer

get_ds_eval_config(offload=False)[source]

Get the DeepSpeed configuration for evaluation.

Parameters:

offload (bool) – Whether to offload parameters to CPU

Returns:

DeepSpeed configuration dictionary

Return type:

dict

get_ds_train_config(is_actor)[source]

Get the DeepSpeed configuration for training.

Parameters:

is_actor (bool) – Whether the model is an actor model

Returns:

DeepSpeed configuration dictionary

Return type:

dict
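The exact dictionary this method returns depends on the strategy's settings, but a ZeRO-2 bf16 training config typically has the following shape. This is a hand-written sketch using DeepSpeed's JSON config keys; make_ds_train_config is an illustrative helper, not the actual implementation:

```python
def make_ds_train_config(zero_stage: int = 2, bf16: bool = True,
                         micro_batch: int = 1, grad_accum: int = 1,
                         max_norm: float = 0.0) -> dict:
    # Illustrative only: mirrors the shape of DeepSpeed's config schema.
    cfg = {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "steps_per_print": 100,
        "zero_optimization": {
            "stage": zero_stage,
            # With offloading enabled this would be {"device": "cpu"}.
            "offload_optimizer": {"device": "none"},
        },
        "bf16": {"enabled": bf16},
    }
    if max_norm > 0.0:
        # max_norm == 0.0 means no gradient clipping.
        cfg["gradient_clipping"] = max_norm
    return cfg

cfg = make_ds_train_config(zero_stage=2, bf16=True, max_norm=1.0)
```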

init_model_context()[source]

Context manager for initializing a model with DeepSpeed configuration.

This sets up the HfDeepSpeedConfig for use with Hugging Face models.

load_ckpt(model, load_dir, tag=None, load_module_strict=True, load_optimizer_states=True, load_lr_scheduler_states=True, load_module_only=False, **_kwargs)[source]

Load a DeepSpeed model checkpoint from the specified directory.

This function wraps DeepSpeed’s checkpoint loading functionality with error handling.

Parameters:
  • model (deepspeed.DeepSpeedEngine) – The DeepSpeed model to load the checkpoint into

  • load_dir (str) – Directory from which to load the checkpoint

  • tag (str, optional) – Optional tag to specify which checkpoint to load

  • load_module_strict (bool, default=True) – Whether to strictly enforce that the keys in the model state dict match the keys in the checkpoint

  • load_optimizer_states (bool, default=True) – Whether to load optimizer states from checkpoint

  • load_lr_scheduler_states (bool, default=True) – Whether to load learning rate scheduler states

  • load_module_only (bool, default=False) – Whether to load only the module weights and not optimizer or scheduler states

Returns:

A tuple containing the checkpoint path and loaded states

Return type:

Tuple[str, dict]

Raises:
  • AssertionError – If model is not a DeepSpeedEngine

  • Exception – If loading the checkpoint fails

load_model(model: torch.nn.Module, path: str, map_location='cpu', strict: bool = False, key_replace_fn=None) None[source]

Load model weights from a file.

Parameters:
  • model (nn.Module) – The model to load weights into

  • path (str) – Path to the saved model weights

  • map_location (str or torch.device) – Device to load the weights to

  • strict (bool) – Whether to strictly enforce that the keys in state_dict match the model

  • key_replace_fn (callable) – Function to modify state dict keys

moving_average(model, model_ema, beta=0.992, device='cpu')[source]

Update model_ema parameters with exponential moving average of model parameters.

Parameters:
  • model (nn.Module) – Source model

  • model_ema (nn.Module) – Target model for EMA

  • beta (float) – EMA decay factor

  • device (str) – Device to perform operations on
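The per-parameter update is the standard exponential moving average, ema ← beta · ema + (1 − beta) · param. A minimal pure-Python sketch of that update (the real method operates on torch tensors and, under ZeRO-3, must gather sharded parameters before updating):

```python
def ema_update(params, ema_params, beta=0.992):
    """In-place EMA update: ema <- beta * ema + (1 - beta) * param."""
    for i, (p, e) in enumerate(zip(params, ema_params)):
        ema_params[i] = beta * e + (1.0 - beta) * p
    return ema_params

# After one update from ema=0.0 toward param=1.0 with beta=0.9,
# the EMA moves a (1 - beta) = 0.1 fraction of the way.
ema = ema_update([1.0], [0.0], beta=0.9)
```

A large beta (such as the default 0.992) makes the EMA model change slowly, smoothing over noisy per-step updates.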

optimizer_step(optimizer: torch.optim.Optimizer, model: torch.nn.Module, scheduler, name='model', **kwargs) None[source]

Perform an optimization step.

Parameters:
  • optimizer (optim.Optimizer) – The optimizer to step

  • model (nn.Module) – The model being trained

  • scheduler – The learning rate scheduler

  • name (str) – Name identifier for the model

  • kwargs – Additional arguments

prepare(*models_or_model_optim_pairs: torch.nn.Module | Tuple[torch.nn.Module, torch.optim.Optimizer], is_rlhf=False) List[torch.nn.Module | Tuple[torch.nn.Module, torch.optim.Optimizer]] | torch.nn.Module | Tuple[torch.nn.Module, torch.optim.Optimizer][source]

Prepare models and optimizers for DeepSpeed training.

Expected input format for RLHF:

((actor, actor_optim, actor_scheduler), (critic, critic_optim, critic_scheduler), reward_models, initial_model)

Parameters:
  • models_or_model_optim_pairs – Models or (model, optimizer, scheduler) tuples

  • is_rlhf (bool) – Whether this is for RLHF training

Returns:

Prepared models or model-optimizer pairs

Return type:

Union[List[ModelOrModelOptimPair], ModelOrModelOptimPair]
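The RLHF nesting described above can be illustrated with placeholder values (strings stand in for the real modules, optimizers, and schedulers; the prepare call itself is shown commented out since it needs a live DeepSpeed environment):

```python
# Placeholders stand in for real nn.Module / Optimizer / scheduler objects.
actor_pack = ("actor", "actor_optim", "actor_scheduler")
critic_pack = ("critic", "critic_optim", "critic_scheduler")
reward_models = ["reward_model_0", "reward_model_1"]
initial_model = "initial_model"

# The single positional argument expected when is_rlhf=True:
rlhf_inputs = (actor_pack, critic_pack, reward_models, initial_model)

# prepared = strategy.prepare(rlhf_inputs, is_rlhf=True)
# -> engines returned in the same layout as the input
```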

prepare_reward_models(reward_models)[source]

Prepare reward models for DeepSpeed evaluation.

Parameters:

reward_models (List[nn.Module]) – List of reward models to prepare

Returns:

List of prepared reward models

Return type:

List[nn.Module]

save_ckpt(model, save_dir, tag=None, max_num=3, max_mem=1000, client_state={}, save_latest=True, **_kwargs)[source]

Save a DeepSpeed model checkpoint with automatic management of checkpoint storage.

This function manages checkpoint storage by limiting the number of checkpoints and total storage size. It automatically removes the oldest checkpoints when limits are exceeded.

Parameters:
  • model (deepspeed.DeepSpeedEngine) – The DeepSpeed model to save

  • save_dir (str) – Directory where checkpoints will be saved

  • tag (str, optional) – Optional tag for the checkpoint (e.g., iteration number)

  • max_num (int, default=3) – Maximum number of checkpoints to keep

  • max_mem (int, default=1000) – Maximum storage size for checkpoints in GB

  • client_state (dict, default={}) – Additional client state to save with the checkpoint

  • save_latest (bool, default=True) – Whether to save a symbolic link to the latest checkpoint

  • _kwargs (dict) – Unused keyword arguments (e.g., optimizer, scheduler), accepted for interface compatibility

Returns:

None

Raises:

AssertionError – If model is not a DeepSpeedEngine
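The retention policy (drop the oldest checkpoints once a limit is exceeded) can be sketched in plain Python. This is an illustrative stand-in for the count-based half of the policy only; the real method also enforces the max_mem size budget:

```python
import os
import shutil
import tempfile

def prune_checkpoints(save_dir: str, max_num: int) -> None:
    """Remove the oldest checkpoint subdirectories until at most max_num remain."""
    ckpts = sorted(
        (d for d in os.listdir(save_dir)
         if os.path.isdir(os.path.join(save_dir, d))),
        key=lambda d: os.path.getmtime(os.path.join(save_dir, d)),
    )
    while len(ckpts) > max_num:
        shutil.rmtree(os.path.join(save_dir, ckpts.pop(0)))

# Demo: create five fake checkpoint dirs with increasing mtimes, keep the newest three.
root = tempfile.mkdtemp()
for i in range(5):
    path = os.path.join(root, f"step_{i}")
    os.makedirs(path)
    os.utime(path, (i, i))  # force a deterministic modification order
prune_checkpoints(root, max_num=3)
print(sorted(os.listdir(root)))  # → ['step_2', 'step_3', 'step_4']
```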

save_model(model: torch.nn.Module, tokenizer, output_dir, **kwargs) None[source]

Save a model, its configuration, and tokenizer to the specified output directory.

Handles special cases for DeepSpeed ZeRO-2/3 parameters and PEFT models. For ZeRO parallelism, it gathers distributed parameters before saving. For PEFT models, it saves adapter weights appropriately based on the DeepSpeed stage.

Parameters:
  • model (nn.Module) – The model to save

  • tokenizer (PreTrainedTokenizer or similar) – The tokenizer to save

  • output_dir (str) – Directory where the model, config, and tokenizer will be saved

  • kwargs – Additional arguments to pass to the model’s save_pretrained method

Returns:

None

unwrap_model(model) torch.nn.Module[source]

Unwrap the model from any wrappers to access the base model.

Parameters:

model (nn.Module) – The model to unwrap

Returns:

The unwrapped model

Return type:

nn.Module