lightrft.datasets.utils¶
Utility functions for dataset processing.
Parts of this file are adapted from Open-Reasoner-Zero: https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero
- class lightrft.datasets.utils.BaseDataHandler[source]¶
Bases:
ABC
Base class for data handlers.
- abstract get_media_info(item: Dict[str, Any]) Dict[str, Dict[str, str]][source]¶
Extract path information for all media referenced by the raw item.
- Parameters:
item (Dict[str, Any]) – The raw data item.
- Returns:
A dict where keys are logical names (e.g. ‘init_image’) and values are path dicts.
- Return type:
Dict[str, Dict[str, str]]
- Example::
>>> item = {'init_image_path': '/path/img.jpg', 'video_path': '/path/vid.mp4'}
>>> visual_info = get_media_info(item)
>>> print(visual_info)
{'init_image': {'image_local_path': '/path/img.jpg'}, 'video': {'video_local_path': '/path/vid.mp4'}}
- abstract load_data(path: str) List[Dict[str, Any]][source]¶
Load all data items from a data file, e.g. a JSON file or a Parquet file.
- Parameters:
path (str) – The path to load data from.
- Returns:
A list of raw data items.
- Return type:
List[Dict[str, Any]]
- abstract parse_item(item: Dict[str, Any], media_content: Dict[str, Any], config: Dict[str, Any]) Tuple[List[Dict], List[Dict], Dict] | Tuple[List[Dict], Dict][source]¶
Parse the raw item and the loaded media_content into the standard format.
- Parameters:
item (Dict[str, Any]) – The raw data item.
media_content (Dict[str, Any]) – A dict containing loaded content (e.g. PIL Images, Video paths).
config (Dict[str, Any]) – A dict of additional configuration options (e.g. prompt templates, max_pixels).
- Returns:
A tuple containing message lists and a metadata dictionary.
- For point-wise scoring data (e.g., Scalar Reward Model training/evaluation): return (messages_chosen, messages_rejected, other).
- For pair-wise ranking data (e.g., Generative Reward Model training/evaluation): return (messages, other).
The other dictionary contains metadata and can optionally include:
- “preference”: (str) the ground-truth preferred choice (“A”, “B”, or “C”).
- “task_type”: (str) the type of task (e.g., “text-to-video”).
- “reward_rule_label”: (str) a label used in RL to identify which reward function or reward model to apply to this specific sample when performing reinforcement fine-tuning.
- Return type:
Union[Tuple[List[Dict], List[Dict], Dict], Tuple[List[Dict], Dict]]
- lightrft.datasets.utils.exist_and_not_none(d, key)[source]¶
Check if a key exists in dictionary and its value is not None.
- Parameters:
d (dict) – Dictionary to check.
key (Any) – Key to look for.
- Returns:
True if key exists and value is not None.
- Return type:
bool
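The documented behavior can be sketched in one line (an assumed implementation, not necessarily the library's exact code):

```python
def exist_and_not_none(d: dict, key) -> bool:
    """Return True only when ``key`` is present in ``d`` and maps to a non-None value."""
    return key in d and d[key] is not None
```

Note that falsy-but-present values such as `0` or `""` still return True; only a missing key or an explicit `None` returns False.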
- lightrft.datasets.utils.extract_answer(text: str) str | None[source]¶
Extract the content inside <answer>…</answer> from a given text.
- Parameters:
text (str) – The input text containing the <answer> tags.
- Returns:
The extracted string inside the <answer> tags, or None if not found.
- Return type:
Union[str, None]
- Example::
>>> text = "The result is <answer>Image 1 is better</answer> based on the evaluation."
>>> answer = extract_answer(text)
>>> print(answer)
Image 1 is better
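A regex-based sketch of the documented behavior (an assumed implementation; the library's actual parsing may differ):

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Return the content between the first <answer>...</answer> pair, or None."""
    match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return match.group(1) if match else None
```

`re.DOTALL` lets the answer span multiple lines, and the non-greedy `.*?` stops at the first closing tag.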
- lightrft.datasets.utils.find_subsequence(lst: List[int], sub: List[int]) int[source]¶
Find the first index where sub appears in lst.
This function is used to find a marker token sequence (e.g. assistant-start) in the token id list so prompt and response can be separated for label masking.
Complexity: implements the KMP algorithm: O(n + m) time, O(m) extra space.
- Parameters:
lst (List[int]) – Sequence to search (e.g., list of token ids).
sub (List[int]) – Subsequence (pattern) to find.
- Returns:
Index of first occurrence or -1 if not found.
- Return type:
int
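A self-contained KMP sketch of the documented search (an assumed implementation with the stated O(n + m) time and O(m) space bounds):

```python
from typing import List


def find_subsequence(lst: List[int], sub: List[int]) -> int:
    """Return the index of the first occurrence of sub in lst, or -1."""
    if not sub:
        return 0
    # Failure table: fail[i] is the length of the longest proper prefix of
    # sub[:i+1] that is also a suffix of it.
    fail = [0] * len(sub)
    k = 0
    for i in range(1, len(sub)):
        while k > 0 and sub[i] != sub[k]:
            k = fail[k - 1]
        if sub[i] == sub[k]:
            k += 1
        fail[i] = k
    # Scan lst once, falling back via the table on mismatch.
    k = 0
    for i, tok in enumerate(lst):
        while k > 0 and tok != sub[k]:
            k = fail[k - 1]
        if tok == sub[k]:
            k += 1
        if k == len(sub):
            return i - len(sub) + 1
    return -1
```

For label masking, response tokens would start at `find_subsequence(input_ids, marker) + len(marker)` when the marker is found.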
- lightrft.datasets.utils.get_task_instructions(handler: Any, config: Dict[str, Any]) str[source]¶
Select task instruction based on task type from handler and config.
- Parameters:
handler – Data handler instance.
config – Configuration dictionary which contains ‘task_instruction’.
- Returns:
The selected task instruction.
- lightrft.datasets.utils.load_multimodal_content(media_info: Dict) Dict[source]¶
Load multimodal content (images, videos, audios, etc.) specified by media_info.
- Keys in each entry can include:
- ‘image_local_path’ | ‘image_bytes’
- ‘video_local_path’
- ‘audio_local_path’
Returns a dict mapping names to loaded objects or paths.
- Parameters:
media_info (Dict[str, Dict[str, Any]]) – Example: {‘init_image’: {‘image_local_path’: ‘/path/img.jpg’}, ‘video’: {‘video_local_path’: ‘/path/vid.mp4’}, ‘audio’: {‘audio_local_path’: ‘/path/audio.wav’}}
- Returns:
A dict mapping the same keys to loaded objects, for example:
- images (from path or bytes) are returned as PIL.Image.Image
- videos are returned as the original local path (str)
- audios are returned as the original local path (str)
If a key cannot be loaded, it is omitted from the result.
- Return type:
Dict[str, Any]
- lightrft.datasets.utils.zero_pad_sequences(sequences, side: str = 'left', value=0) torch.Tensor[source]¶
Pad a list of 1D/2D tensors on the last dimension and stack them.
- Parameters:
sequences (Iterable[torch.Tensor]) – Iterable of torch.Tensor objects. Each tensor’s last dimension is treated as the sequence length to be padded.
side (str) – Side to apply padding, either “left” or “right”
value (int | float) – Padding value
- Returns:
Stacked tensor with shape (N, …) where sequences are padded to equal length
- Return type:
torch.Tensor
Example:
>>> seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
>>> zero_pad_sequences(seqs, side="left", value=0)
tensor([[1, 2, 3],
        [0, 4, 5]])
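The documented behavior can be sketched with `torch.nn.functional.pad` (an assumed implementation): pad each tensor's last dimension up to the longest length, then stack along a new leading dimension.

```python
import torch
import torch.nn.functional as F


def zero_pad_sequences(sequences, side: str = "left", value=0) -> torch.Tensor:
    """Pad each tensor's last dim to the max length, then stack along dim 0."""
    assert side in ("left", "right")
    max_len = max(seq.size(-1) for seq in sequences)
    padded = []
    for seq in sequences:
        pad_len = max_len - seq.size(-1)
        # F.pad's last pair of pad amounts applies to the last dimension:
        # (amount_before, amount_after).
        pad = (pad_len, 0) if side == "left" else (0, pad_len)
        padded.append(F.pad(seq, pad, value=value))
    return torch.stack(padded, dim=0)
```

Left padding is the usual choice for decoder-only models so that the most recent tokens align at the right edge of the batch.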