lightrft.strategy.sglang_utils.sgl_model_saver
This module provides memory management utilities for SGLang model execution.
The module implements memory saving functionality by allowing temporary release and restoration of GPU memory occupied by model weights and states. This is particularly useful in scenarios where multiple models or processes need to share limited GPU memory resources efficiently.
This module is designed to be compatible with different versions of SGLang:
- For SGLang v0.5.6.post2+: Uses built-in methods from SchedulerUpdateWeightsMixin
- For older versions: Provides backward-compatible monkey patching
The module automatically detects which approach to use based on the SGLang version.
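The version check described above can be illustrated with a small sketch. It is not the module's actual detection code: the helper names `parse_version` and `needs_monkey_patch` are hypothetical, and only the `0.5.6.post2` threshold comes from the text. The parser deliberately handles only plain `X.Y.Z` and `X.Y.Z.postN` strings and raises on anything else.

```python
import re

# Threshold taken from the description above; everything else is a hypothetical sketch.
BUILTIN_SUPPORT_VERSION = "0.5.6.post2"


def parse_version(text):
    """Parse 'X.Y.Z' or 'X.Y.Z.postN' into a comparable tuple (X, Y, Z, N)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    major, minor, patch, post = match.groups()
    # A missing post-release segment sorts before any '.postN' of the same release.
    return (int(major), int(minor), int(patch), int(post or 0))


def needs_monkey_patch(installed_version):
    """Return True when the installed version predates the built-in methods."""
    return parse_version(installed_version) < parse_version(BUILTIN_SUPPORT_VERSION)
```

In a real environment the installed version string would typically come from the package metadata; a full PEP 440 parser would be a more robust choice than the regular expression used here.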
- lightrft.strategy.sglang_utils.sgl_model_saver.release_memory_occupation(self, recv_req: sglang.srt.managers.io_struct.ReleaseMemoryOccupationReqInput)
Release memory occupation by stashing model weights and states to CPU memory.
This method temporarily frees GPU memory by moving model parameters and static states to CPU memory. It’s designed to be used when the model is temporarily not needed, allowing other processes or models to utilize the freed GPU memory.
Compatible with both old and new SGLang versions by detecting the model runner location.
- The method performs the following operations:
1. Validates the memory saver adapter
2. Exports and stashes the model’s static state
3. Clones model parameters to CPU memory if not already done
4. Pauses the memory saver adapter
5. Flushes the model cache
- Parameters:
recv_req (ReleaseMemoryOccupationReqInput) – Request input for releasing memory occupation
- Returns:
Response indicating successful memory release
- Return type:
ReleaseMemoryOccupationReqOutput
- Example::
>>> scheduler = Scheduler(...)
>>> req = ReleaseMemoryOccupationReqInput()
>>> response = scheduler.release_memory_occupation(req)
>>> # GPU memory is now freed for other uses
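Both methods are described as detecting the model runner location to stay compatible across SGLang versions. One way such a lookup could work is sketched below. The attribute paths in `CANDIDATE_PATHS` and the helper name `find_model_runner` are assumptions made for illustration only; they are not confirmed SGLang layouts and not this module's code.

```python
from types import SimpleNamespace

# Hypothetical attribute paths; the real layouts depend on the SGLang version.
CANDIDATE_PATHS = (
    ("tp_worker", "model_runner"),
    ("tp_worker", "worker", "model_runner"),
)


def find_model_runner(scheduler, candidate_paths=CANDIDATE_PATHS):
    """Return the first object reachable through one of the candidate paths."""
    for path in candidate_paths:
        obj = scheduler
        for name in path:
            obj = getattr(obj, name, None)
            if obj is None:
                break  # this path does not exist on the object; try the next one
        else:
            return obj  # every attribute in the path resolved
    raise AttributeError("no model runner found on scheduler")


# Stand-in objects that mimic two different nesting layouts.
flat_layout = SimpleNamespace(tp_worker=SimpleNamespace(model_runner="runner-a"))
nested_layout = SimpleNamespace(
    tp_worker=SimpleNamespace(worker=SimpleNamespace(model_runner="runner-b"))
)
```

Probing for attributes instead of branching on a version string keeps the lookup working when a layout change lands in a release that the version check does not know about.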
- lightrft.strategy.sglang_utils.sgl_model_saver.resume_memory_occupation(self, recv_req: sglang.srt.managers.io_struct.ResumeMemoryOccupationReqInput)
Resume memory occupation by restoring model weights and states from CPU memory.
This method restores the model to its fully operational state by loading back the previously stashed model parameters and static states from CPU memory to GPU. It should be called after release_memory_occupation() when the model needs to be used again.
Compatible with both old and new SGLang versions by detecting the model runner location.
- The method performs the following operations:
1. Validates the memory saver adapter
2. Resumes the memory saver adapter
3. Imports the previously stashed static state
4. Restores model parameters from CPU to GPU
5. Cleans up temporary static state storage
- Parameters:
recv_req (ResumeMemoryOccupationReqInput) – Request input for resuming memory occupation
- Returns:
Response indicating successful memory restoration
- Return type:
ResumeMemoryOccupationReqOutput
- Example::
>>> scheduler = Scheduler(...)
>>> # After previously calling release_memory_occupation()
>>> req = ResumeMemoryOccupationReqInput()
>>> response = scheduler.resume_memory_occupation(req)
>>> # Model is now ready for inference again
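Because resume_memory_occupation() is meant to follow release_memory_occupation(), callers may want to pair the two so that the model is restored even if the intervening work fails. The sketch below shows one possible pattern using a context manager. `gpu_memory_released` and `FakeScheduler` are hypothetical names introduced here; `FakeScheduler` only records call order and stands in for a real scheduler.

```python
from contextlib import contextmanager


@contextmanager
def gpu_memory_released(scheduler, release_req, resume_req):
    """Release GPU memory on entry and always resume it on exit."""
    scheduler.release_memory_occupation(release_req)
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises.
        scheduler.resume_memory_occupation(resume_req)


class FakeScheduler:
    """Stand-in that records call order; not an SGLang class."""

    def __init__(self):
        self.calls = []

    def release_memory_occupation(self, recv_req):
        self.calls.append("release")

    def resume_memory_occupation(self, recv_req):
        self.calls.append("resume")
```

With a real scheduler, the two request arguments would be the ReleaseMemoryOccupationReqInput and ResumeMemoryOccupationReqInput objects shown in the examples above, and the body of the `with` block would hold whatever work needs the freed GPU memory.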