lightrft.strategy.utils.ckpt_utils¶
Utility module for finding the latest checkpoint directory in machine learning training workflows.
This module provides functionality to locate the most recent checkpoint directory based on a naming pattern that includes a step number. It’s commonly used in deep learning training scenarios where checkpoints are saved periodically with incremental step numbers.
find_latest_checkpoint_dir¶
- lightrft.strategy.utils.ckpt_utils.find_latest_checkpoint_dir(load_dir: str, prefix: str = 'global_step') → str | None[source]¶
Finds the latest subdirectory within the specified directory whose name matches the ‘<prefix><number>’ format.
This function is particularly useful in machine learning training scenarios where checkpoints are saved with incremental step numbers. It searches through all subdirectories in the given path and returns the one with the highest step number that matches the specified prefix pattern.
If no matching subdirectory is found, returns the original load_dir.
- Parameters:
load_dir (str) – The path to the parent directory containing checkpoint subdirectories.
prefix (str, optional) – The expected prefix string at the beginning of checkpoint directory names. Defaults to “global_step”.
- Returns:
The full path to the latest checkpoint subdirectory. Returns load_dir if no matching subdirectory is found. Returns None if load_dir is invalid (does not exist or is not a directory).
- Return type:
str or None
Example:
# Find latest checkpoint with default prefix "global_step"
latest_dir = find_latest_checkpoint_dir("/path/to/checkpoints")
# Returns: "/path/to/checkpoints/global_step1000" (if it's the highest numbered)

# Find latest checkpoint with custom prefix
latest_dir = find_latest_checkpoint_dir("/path/to/models", prefix="step_")
# Returns: "/path/to/models/step_500" (if it's the highest numbered)

# Handle case where directory doesn't exist
result = find_latest_checkpoint_dir("/nonexistent/path")
# Returns: None