lightrft.strategy.sglang_utils¶
This module provides functionality for initializing and configuring an SGLang generation engine for RLHF and RLVR applications. It handles distributed training setup, device coordination, and engine initialization with appropriate parameters.
The main component is the get_sglang_engine function which creates and returns a configured RLGenerationEngine instance based on the provided arguments, taking into account the distributed training environment.
get_sglang_engine¶
- lightrft.strategy.sglang_utils.get_sglang_engine(model_name_or_path: str, engine_mem_util: float, enable_engine_sleep: bool = True, tp_size: int = 1, skip_tokenizer_init: bool = False, dtype: str = 'bfloat16', disable_cuda_graph: bool = False)[source]¶
Initialize and configure an SGLang generation engine with distributed processing support.
This function creates an RLGenerationEngine instance with proper distributed training configuration, including tensor parallelism setup, device coordination, and memory management. It handles the complex initialization process required for distributed inference in RLHF scenarios.
The function automatically detects the distributed environment settings from environment variables and configures the engine accordingly. It sets up tensor parallel groups, manages GPU allocation, and initializes the engine with optimized parameters for high-throughput generation.
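As a rough illustration of the bookkeeping described above, the sketch below derives a rank's tensor-parallel group from launcher-provided environment variables. The `infer_tp_layout` helper and its default values are hypothetical, shown only to make the rank-to-group mapping concrete; the real engine delegates group setup to SGLang and PyTorch distributed.

```python
import os


def infer_tp_layout(world_size: int, rank: int, tp_size: int):
    """Hypothetical sketch: which TP group a rank falls into.

    Ranks are partitioned into consecutive blocks of tp_size;
    this mirrors the layout get_sglang_engine must coordinate,
    not its actual implementation.
    """
    assert world_size % tp_size == 0, "world size must be divisible by tp_size"
    group_id = rank // tp_size  # index of this rank's TP group
    group_ranks = list(range(group_id * tp_size, (group_id + 1) * tp_size))
    return group_id, group_ranks


# These values are normally set by the launcher (e.g. torchrun);
# the defaults here are illustrative only.
world_size = int(os.environ.get("WORLD_SIZE", "8"))
rank = int(os.environ.get("RANK", "5"))
print(infer_tp_layout(world_size, rank, tp_size=4))
```

With a world size of 8 and `tp_size=4`, ranks 0-3 form one tensor-parallel group and ranks 4-7 another.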
- Parameters:
model_name_or_path (str) – Path to the model or Hugging Face model identifier
engine_mem_util (float) – Memory utilization fraction for the engine (0.0 to 1.0)
enable_engine_sleep (bool) – Whether to enable memory-saver mode, which releases the KV cache when memory is limited
tp_size (int) – Tensor parallelism size for distributed inference
skip_tokenizer_init (bool) – Whether to skip tokenizer initialization for faster startup. Defaults to False as the tokenizer is needed to process text inputs (required for VLM use cases)
dtype (str) – Data type for model weights and computations (“bfloat16” or “float16”)
disable_cuda_graph (bool) – Whether to disable CUDA graph optimization
- Returns:
Configured RLGenerationEngine instance ready for distributed inference
- Return type:
RLGenerationEngine
- Raises:
AssertionError – If PyTorch distributed is not initialized
AssertionError – If world size is not evenly divisible by tensor parallelism size
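The second assertion exists because each tensor-parallel group must contain exactly tp_size ranks. A minimal sketch of that guard (the `check_divisibility` helper is hypothetical, mirroring the documented assertion rather than the actual source):

```python
def check_divisibility(world_size: int, tp_size: int) -> None:
    # Every TP group needs exactly tp_size ranks, so the world
    # size must split evenly into groups.
    assert world_size % tp_size == 0, (
        f"world_size={world_size} is not divisible by tp_size={tp_size}"
    )


check_divisibility(8, 4)  # 2 full groups: OK
try:
    check_divisibility(6, 4)  # 1.5 groups: rejected
except AssertionError as exc:
    print(exc)
```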
Example:
>>> # Initialize engine for single GPU
>>> engine = get_sglang_engine(
...     model_name_or_path="meta-llama/Llama-2-7b-hf",
...     engine_mem_util=0.8,
...     tp_size=1
... )
>>> # Initialize engine with tensor parallelism
>>> engine = get_sglang_engine(
...     model_name_or_path="meta-llama/Llama-2-70b-hf",
...     engine_mem_util=0.9,
...     tp_size=4,
...     enable_engine_sleep=False,
...     dtype="float16"
... )