lightrft.strategy.sglang_utils

This module provides functionality for initializing and configuring an SGLang generation engine for RLHF and RLVR applications. It handles distributed training setup, device coordination, and engine initialization with appropriate parameters.

The main component is the get_sglang_engine function which creates and returns a configured RLGenerationEngine instance based on the provided arguments, taking into account the distributed training environment.

get_sglang_engine

lightrft.strategy.sglang_utils.get_sglang_engine(model_name_or_path: str, engine_mem_util: float, enable_engine_sleep: bool = True, tp_size: int = 1, skip_tokenizer_init: bool = False, dtype: str = 'bfloat16', disable_cuda_graph: bool = False)[source]

Initialize and configure an SGLang generation engine with distributed processing support.

This function creates a RLGenerationEngine instance with proper distributed training configuration, including tensor parallelism setup, device coordination, and memory management. It handles the complex initialization process required for distributed inference in RLHF scenarios.

The function automatically detects the distributed environment settings from environment variables and configures the engine accordingly. It sets up tensor parallel groups, manages GPU allocation, and initializes the engine with optimized parameters for high-throughput generation.

Parameters:
  • model_name_or_path (str) – Path to the model or Hugging Face model identifier

  • engine_mem_util (float) – Memory utilization fraction for the engine (0.0 to 1.0)

  • enable_engine_sleep (bool) – Whether to enable memory saver mode that releases KV cache when memory is limited

  • tp_size (int) – Tensor parallelism size for distributed inference

  • skip_tokenizer_init (bool) – Whether to skip tokenizer initialization for faster startup. Defaults to False because the tokenizer is needed to process text inputs (and is required for VLM use cases)

  • dtype (str) – Data type for model weights and computations ("bfloat16" or "float16")

  • disable_cuda_graph (bool) – Whether to disable CUDA graph optimization

Returns:

Configured RLGenerationEngine instance ready for distributed inference

Return type:

RLGenerationEngine

Raises:
  • AssertionError – If PyTorch distributed is not initialized

  • AssertionError – If world size is not evenly divisible by tensor parallelism size

Example:

>>> # Initialize engine for single GPU
>>> engine = get_sglang_engine(
...     model_name_or_path="meta-llama/Llama-2-7b-hf",
...     engine_mem_util=0.8,
...     tp_size=1
... )

>>> # Initialize engine with tensor parallelism
>>> engine = get_sglang_engine(
...     model_name_or_path="meta-llama/Llama-2-70b-hf",
...     engine_mem_util=0.9,
...     tp_size=4,
...     enable_engine_sleep=False,
...     dtype="float16"
... )