lightrft.datasets.utils¶
Utility functions for dataset processing.
Parts of this file are adapted from Open-Reasoner-Zero: https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero
- class lightrft.datasets.utils.BaseDataHandler[source]¶
Bases:
ABC
Base class for data handlers.
- abstract get_media_info(item: Dict[str, Any]) Dict[str, Dict[str, str]][source]¶
Extract path information for all media referenced by the raw item.
- Parameters:
item (Dict[str, Any]) – The raw data item.
- Returns:
A dict where keys are logical names (e.g. ‘init_image’) and values are path dicts.
- Return type:
Dict[str, Dict[str, str]]
- Example::
>>> item = {'init_image_path': '/path/img.jpg', 'video_path': '/path/vid.mp4'}
>>> visual_info = get_media_info(item)
>>> print(visual_info)
{'init_image': {'image_local_path': '/path/img.jpg'}, 'video': {'video_local_path': '/path/vid.mp4'}}
- abstract load_data(path: str) List[Dict[str, Any]][source]¶
Load all data items from a data file, e.g. a JSON file or a Parquet file.
- Parameters:
path (str) – The path to load data from.
- Returns:
A list of raw data items.
- Return type:
List[Dict[str, Any]]
- abstract parse_item(item: Dict[str, Any], media_content: Dict[str, Any], config: Dict[str, Any]) Tuple[List[Dict], List[Dict], Dict] | Tuple[List[Dict], Dict][source]¶
Parse the raw item and the loaded media_content into the standard format.
- Parameters:
item (Dict[str, Any]) – The raw data item.
media_content (Dict[str, Any]) – A dict containing loaded content (e.g. PIL Images, Video paths).
config (Dict[str, Any]) – A dict of additional configuration options (e.g. prompt templates, max_pixels).
- Returns:
A tuple containing message lists and a metadata dictionary.
- For point-wise scoring data (e.g., Scalar Reward Model training/evaluation): return (messages_chosen, messages_rejected, other).
- For pair-wise ranking data (e.g., Generative Reward Model training/evaluation): return (messages, other).
The other dictionary contains metadata and can optionally include:
- “preference”: (str) the ground-truth preferred choice (“A”, “B”, or “C”).
- “task_type”: (str) the type of task (e.g., “text-to-video”).
- “reward_rule_label”: (str) a label used in RL to identify which reward function or reward model to apply to this specific sample when performing reinforcement fine-tuning.
- Return type:
Union[Tuple[List[Dict], List[Dict], Dict], Tuple[List[Dict], Dict]]
- lightrft.datasets.utils.exist_and_not_none(d, key)[source]¶
Check if a key exists in dictionary and its value is not None.
- Parameters:
d (dict) – Dictionary to check.
key (Any) – Key to look for.
- Returns:
True if key exists and value is not None.
- Return type:
bool
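The documented behavior can be sketched in one line (an assumed implementation, not necessarily the library's exact code):

```python
def exist_and_not_none(d: dict, key) -> bool:
    """Return True only when ``key`` is present in ``d`` and maps to a non-None value."""
    return key in d and d[key] is not None
```

Note that falsy-but-present values such as `0` or `""` still return True; only a missing key or an explicit `None` returns False.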
- lightrft.datasets.utils.extract_answer(text: str) str | None[source]¶
Extract the content inside <answer>…</answer> from a given text.
- Parameters:
text (str) – The input text containing the <answer> tags.
- Returns:
The extracted string inside the <answer> tags, or None if not found.
- Return type:
Union[str, None]
- Example::
>>> text = "The result is <answer>Image 1 is better</answer> based on the evaluation."
>>> answer = extract_answer(text)
>>> print(answer)
Image 1 is better
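A regex-based sketch of the documented behavior (an assumed implementation; the library's actual parsing may differ):

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Return the content between the first <answer>...</answer> pair, or None."""
    match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return match.group(1) if match else None
```

`re.DOTALL` lets the answer span multiple lines, and the non-greedy `.*?` stops at the first closing tag.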
- lightrft.datasets.utils.find_subsequence(lst: List[int], sub: List[int]) int[source]¶
Find the first index where sub appears in lst.
This function is used to find a marker token sequence (e.g. assistant-start) in the token id list so prompt and response can be separated for label masking.
Complexity: implements the KMP algorithm: O(n + m) time, O(m) extra space.
- Parameters:
lst (List[int]) – Sequence to search (e.g., list of token ids).
sub (List[int]) – Subsequence (pattern) to find.
- Returns:
Index of first occurrence or -1 if not found.
- Return type:
int
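A self-contained KMP sketch of the documented search (an assumed implementation with the stated O(n + m) time and O(m) space bounds):

```python
from typing import List


def find_subsequence(lst: List[int], sub: List[int]) -> int:
    """Return the index of the first occurrence of sub in lst, or -1."""
    if not sub:
        return 0
    # Failure table: fail[i] is the length of the longest proper prefix of
    # sub[:i+1] that is also a suffix of it.
    fail = [0] * len(sub)
    k = 0
    for i in range(1, len(sub)):
        while k > 0 and sub[i] != sub[k]:
            k = fail[k - 1]
        if sub[i] == sub[k]:
            k += 1
        fail[i] = k
    # Scan lst once, falling back via the table on mismatch.
    k = 0
    for i, tok in enumerate(lst):
        while k > 0 and tok != sub[k]:
            k = fail[k - 1]
        if tok == sub[k]:
            k += 1
        if k == len(sub):
            return i - len(sub) + 1
    return -1
```

For label masking, response tokens would start at `find_subsequence(input_ids, marker) + len(marker)` when the marker is found.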
- lightrft.datasets.utils.get_task_instructions(handler: Any, config: Dict[str, Any]) str[source]¶
Select task instruction based on task type from handler and config.
- Parameters:
handler – Data handler instance.
config – Configuration dictionary which contains ‘task_instruction’.
- Returns:
The selected task instruction.
- lightrft.datasets.utils.load_multimodal_content(media_info: Dict) Dict[source]¶
Load multimodal content (images, videos, audios, etc.) specified by media_info.
- Keys in each entry can include:
- ‘image_local_path’ | ‘image_bytes’
- ‘video_local_path’
- ‘audio_local_path’
Returns a dict mapping names to loaded objects or paths.
- Parameters:
media_info (Dict[str, Dict[str, Any]]) – Example: {‘init_image’: {‘image_local_path’: ‘/path/img.jpg’}, ‘video’: {‘video_local_path’: ‘/path/vid.mp4’}, ‘audio’: {‘audio_local_path’: ‘/path/audio.wav’}}
- Returns:
A dict mapping the same keys to loaded objects, for example:
- images (from path or bytes) are returned as PIL.Image.Image
- videos are returned as the original local path (str)
- audios are returned as the original local path (str)
If a key cannot be loaded, it is omitted from the result.
- Return type:
Dict[str, Any]
- lightrft.datasets.utils.zero_pad_sequences(sequences, side: str = 'left', value=0) torch.Tensor[source]¶
Pad a list of 1D/2D tensors on the last dimension and stack them.
- Parameters:
sequences (Iterable[torch.Tensor]) – Iterable of torch.Tensor objects. Each tensor’s last dimension is treated as the sequence length to be padded.
side (str) – Side to apply padding, either “left” or “right”
value (int | float) – Padding value
- Returns:
Stacked tensor with shape (N, …) where sequences are padded to equal length
- Return type:
torch.Tensor
Example:
>>> seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
>>> zero_pad_sequences(seqs, side="left", value=0)
tensor([[1, 2, 3],
        [0, 4, 5]])
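The documented behavior can be sketched with `torch.nn.functional.pad` (an assumed implementation): pad each tensor's last dimension up to the longest length, then stack along a new leading dimension.

```python
import torch
import torch.nn.functional as F


def zero_pad_sequences(sequences, side: str = "left", value=0) -> torch.Tensor:
    """Pad each tensor's last dim to the max length, then stack along dim 0."""
    assert side in ("left", "right")
    max_len = max(seq.size(-1) for seq in sequences)
    padded = []
    for seq in sequences:
        pad_len = max_len - seq.size(-1)
        # F.pad's last pair of pad amounts applies to the last dimension:
        # (amount_before, amount_after).
        pad = (pad_len, 0) if side == "left" else (0, pad_len)
        padded.append(F.pad(seq, pad, value=value))
    return torch.stack(padded, dim=0)
```

Left padding is the usual choice for decoder-only models so that the most recent tokens align at the right edge of the batch.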