lightrft.strategy.sglang_utils

This module provides functionality for initializing and configuring an SGLang generation engine for RLHF and RLVR applications. It handles distributed training setup, device coordination, and engine initialization with appropriate parameters.

The main component is the get_sglang_engine function which creates and returns a configured RLGenerationEngine instance based on the provided arguments, taking into account the distributed training environment.

get_sglang_engine

lightrft.strategy.sglang_utils.get_sglang_engine(model_name_or_path: str, engine_mem_util: float, enable_engine_sleep: bool = True, tp_size: int = 1, skip_tokenizer_init: bool = False, dtype: str = 'bfloat16', disable_cuda_graph: bool = False)[source]

Initialize and configure an SGLang generation engine with distributed processing support.

This function creates a RLGenerationEngine instance with proper distributed training configuration, including tensor parallelism setup, device coordination, and memory management. It handles the complex initialization process required for distributed inference in RLHF scenarios.

The function automatically detects the distributed environment settings from environment variables and configures the engine accordingly. It sets up tensor parallel groups, manages GPU allocation, and initializes the engine with optimized parameters for high-throughput generation.

Parameters:
  • model_name_or_path (str) – Path to the model or Hugging Face model identifier

  • engine_mem_util (float) – Memory utilization fraction for the engine (0.0 to 1.0)

  • enable_engine_sleep (bool) – Whether to enable memory saver mode that releases KV cache when memory is limited

  • tp_size (int) – Tensor parallelism size for distributed inference

  • skip_tokenizer_init (bool) – Whether to skip tokenizer initialization for faster startup. Defaults to False because the tokenizer is needed to process text inputs (and is required for VLM use cases)

  • dtype (str) – Data type for model weights and computations ("bfloat16" or "float16")

  • disable_cuda_graph (bool) – Whether to disable CUDA graph optimization

Returns:

Configured RLGenerationEngine instance ready for distributed inference

Return type:

RLGenerationEngine

Raises:
  • AssertionError – If PyTorch distributed is not initialized

  • AssertionError – If world size is not evenly divisible by tensor parallelism size

Example:

>>> # Initialize engine for single GPU
>>> engine = get_sglang_engine(
...     model_name_or_path="meta-llama/Llama-2-7b-hf",
...     engine_mem_util=0.8,
...     tp_size=1
... )

>>> # Initialize engine with tensor parallelism
>>> engine = get_sglang_engine(
...     model_name_or_path="meta-llama/Llama-2-70b-hf",
...     engine_mem_util=0.9,
...     tp_size=4,
...     enable_engine_sleep=False,
...     dtype="float16"
... )