lightrft.datasets.utils

Utility functions for dataset processing.

Parts of this file are adapted from Open-Reasoner-Zero: https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero

class lightrft.datasets.utils.BaseDataHandler[source]

Bases: ABC

Base class for data handlers.

abstract get_media_info(item: Dict[str, Any]) → Dict[str, Dict[str, str]][source]

Extract path information for every piece of media referenced by the raw item.

Parameters:

item (Dict[str, Any]) – The raw data item.

Returns:

A dict where keys are logical names (e.g. 'init_image') and values are path dicts.

Return type:

Dict[str, Dict[str, str]]

Example:

>>> item = {'init_image_path': '/path/img.jpg', 'video_path': '/path/vid.mp4'}
>>> visual_info = handler.get_media_info(item)
>>> print(visual_info)
{'init_image': {'image_local_path': '/path/img.jpg'}, 'video': {'video_local_path': '/path/vid.mp4'}}

abstract load_data(path: str) → List[Dict[str, Any]][source]

Load all data items from a data config file, e.g. a JSON or Parquet file.

Parameters:

path (str) – The path to load data from.

Returns:

A list of raw data items.

Return type:

List[Dict[str, Any]]

abstract parse_item(item: Dict[str, Any], media_content: Dict[str, Any], config: Dict[str, Any]) → Tuple[List[Dict], List[Dict], Dict] | Tuple[List[Dict], Dict][source]

Parse the raw item and the loaded media_content into the standard format.

Parameters:
  • item (Dict[str, Any]) – The raw data item.

  • media_content (Dict[str, Any]) – A dict containing loaded content (e.g. PIL Images, Video paths).

  • config (Dict[str, Any]) – A dict of additional configuration options (e.g. prompt templates, max_pixels).

Returns:

A tuple containing message lists and a metadata dictionary:
  • For point-wise scoring data (e.g., Scalar Reward Model training/evaluation): returns (messages_chosen, messages_rejected, other).

  • For pair-wise ranking data (e.g., Generative Reward Model training/evaluation): returns (messages, other).

The other dictionary contains metadata, and can optionally include:
  • "preference": (str) The ground-truth preferred choice ("A", "B", or "C").

  • "task_type": (str) The type of task (e.g., "text-to-video").

  • "reward_rule_label": (str) A label used in RL to identify which reward function or reward model to apply to this specific sample when performing reinforcement fine-tuning.

Return type:

Union[Tuple[List[Dict], List[Dict], Dict], Tuple[List[Dict], Dict]]
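As an illustration of the contract above, a minimal handler for JSON-backed prompt/response data might look like the following sketch. The class name and the raw-item field names (`prompt`, `response`, `init_image_path`) are illustrative assumptions, not part of lightrft:

```python
import json
from typing import Any, Dict, List, Tuple


class JsonPromptHandler:
    """Hypothetical BaseDataHandler subclass for a JSON array of items.

    Field names ('prompt', 'response', 'init_image_path') are assumptions
    about the raw data format, not part of the library.
    """

    def load_data(self, path: str) -> List[Dict[str, Any]]:
        # The data config file is assumed to be a JSON array of dicts.
        with open(path) as f:
            return json.load(f)

    def get_media_info(self, item: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
        info: Dict[str, Dict[str, str]] = {}
        if item.get("init_image_path"):
            info["init_image"] = {"image_local_path": item["init_image_path"]}
        return info

    def parse_item(
        self,
        item: Dict[str, Any],
        media_content: Dict[str, Any],
        config: Dict[str, Any],
    ) -> Tuple[List[Dict], Dict]:
        # Pair-wise ranking form: a single message list plus metadata.
        messages = [
            {"role": "user", "content": item["prompt"]},
            {"role": "assistant", "content": item["response"]},
        ]
        other = {"task_type": config.get("task_type", "text-to-image")}
        return messages, other
```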

lightrft.datasets.utils.exist_and_not_none(d, key)[source]

Check if a key exists in dictionary and its value is not None.

Parameters:
  • d (dict) – Dictionary to check.

  • key (Any) – Key to look for.

Returns:

True if key exists and value is not None.

Return type:

bool
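The documented behavior can be sketched in one line (a plausible reimplementation, not the library source):

```python
from typing import Any


def exist_and_not_none(d: dict, key: Any) -> bool:
    # True only when the key is present AND maps to a non-None value.
    return key in d and d[key] is not None
```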

lightrft.datasets.utils.extract_answer(text: str) → str | None[source]

Extract the content inside <answer>…</answer> from a given text.

Parameters:

text (str) – The input text containing the <answer> tags.

Returns:

The extracted string inside the <answer> tags, or None if not found.

Return type:

Union[str, None]

Example:

>>> text = "The result is <answer>Image 1 is better</answer> based on the evaluation."
>>> answer = extract_answer(text)
>>> print(answer)
Image 1 is better

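A plausible implementation of this behavior with a non-greedy regular expression (a sketch, not the library source):

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    # Non-greedy match takes only the first <answer>...</answer> pair;
    # re.DOTALL lets the answer span multiple lines.
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1) if match else None
```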
lightrft.datasets.utils.find_subsequence(lst: List[int], sub: List[int]) → int[source]

Find the first index at which sub appears in lst. This function is used to find a marker token sequence (e.g. the assistant-start tokens) in a token-id list, so that the prompt and the response can be separated for label masking.

Complexity: Implements the KMP algorithm: O(n + m) time, O(m) extra space.

Parameters:
  • lst (List[int]) – Sequence to search (e.g., list of token ids).

  • sub (List[int]) – Subsequence (pattern) to find.

Returns:

Index of first occurrence or -1 if not found.

Return type:

int
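A self-contained KMP implementation matching this contract might look like the following sketch (not the library source):

```python
from typing import List


def find_subsequence(lst: List[int], sub: List[int]) -> int:
    """Return the first index of sub in lst, or -1 if absent (KMP)."""
    if not sub:
        return 0
    # Failure table: fail[i] = length of the longest proper prefix of
    # sub[:i+1] that is also a suffix of it.
    fail = [0] * len(sub)
    k = 0
    for i in range(1, len(sub)):
        while k and sub[i] != sub[k]:
            k = fail[k - 1]
        if sub[i] == sub[k]:
            k += 1
        fail[i] = k
    # Scan lst, reusing the failure table to avoid re-examining tokens.
    k = 0
    for i, tok in enumerate(lst):
        while k and tok != sub[k]:
            k = fail[k - 1]
        if tok == sub[k]:
            k += 1
        if k == len(sub):
            return i - len(sub) + 1
    return -1
```

In practice `sub` would be the token ids of a marker such as the assistant-start sequence, and the returned index marks where the response begins.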

lightrft.datasets.utils.get_task_instructions(handler: Any, config: Dict[str, Any]) → str[source]

Select task instruction based on task type from handler and config.

Parameters:
  • handler – Data handler instance.

  • config – Configuration dictionary which contains 'task_instruction'.

Returns:

The selected task instruction.

lightrft.datasets.utils.load_multimodal_content(media_info: Dict) → Dict[source]

Load multimodal content (images, videos, audios, etc.) specified by media_info.

Keys in each entry can include:
  • 'image_local_path' | 'image_bytes'

  • 'video_local_path'

  • 'audio_local_path'

Returns a dict mapping names to loaded objects or paths.

Parameters:

media_info (Dict[str, Dict[str, Any]]) – Example: {'init_image': {'image_local_path': '/path/img.jpg'}, 'video': {'video_local_path': '/path/vid.mp4'}, 'audio': {'audio_local_path': '/path/audio.wav'}}

Returns:

A dict mapping the same keys to loaded objects, for example:
  • images (from a path or bytes) are returned as PIL.Image.Image

  • videos are returned as the original local path (str)

  • audios are returned as the original local path (str)

If a key cannot be loaded, it is omitted from the result.

Return type:

Dict[str, Any]

lightrft.datasets.utils.zero_pad_sequences(sequences, side: str = 'left', value=0) → torch.Tensor[source]

Pad a list of 1D/2D tensors on the last dimension and stack them.

Parameters:
  • sequences (Iterable[torch.Tensor]) – Iterable of torch.Tensor objects. Each tensor’s last dimension is treated as the sequence length to be padded.

  • side (str) – Side to apply padding, either "left" or "right".

  • value (int | float) – Padding value.

Returns:

Stacked tensor with shape (N, …) where sequences are padded to equal length

Return type:

torch.Tensor

Example:

>>> seqs = [torch.tensor([1,2,3]), torch.tensor([4,5])]
>>> zero_pad_sequences(seqs, side="left", value=0)
tensor([[1, 2, 3],
        [0, 4, 5]])
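
A plausible implementation using torch.nn.functional.pad (a sketch under the documented contract, not the library source):

```python
import torch
import torch.nn.functional as F


def zero_pad_sequences(sequences, side: str = "left", value=0) -> torch.Tensor:
    assert side in ("left", "right")
    # Pad every tensor's last dimension up to the longest sequence,
    # then stack along a new leading batch dimension.
    max_len = max(seq.size(-1) for seq in sequences)
    padded = []
    for seq in sequences:
        pad_len = max_len - seq.size(-1)
        # F.pad takes (left, right) padding amounts for the last dim.
        pad = (pad_len, 0) if side == "left" else (0, pad_len)
        padded.append(F.pad(seq, pad, value=value))
    return torch.stack(padded, dim=0)
```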