lightrft.utils.processor¶
- lightrft.utils.processor.conditional_sft_processor(args: Any, objs: List[Dict[str, Any]]) → List[Dict[str, Any]][source]¶
Process data for Conditional SFT by prepending reward information to inputs.
Implements the Conditional SFT approach from the paper: “Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning” (https://arxiv.org/abs/2308.12050)
This technique conditions the model on reward scores during training, allowing it to generate outputs of varying quality based on the specified reward threshold.
- Parameters:
args (Any) – Arguments object with a ‘reward_template’ attribute and a ‘normalize_reward’ flag.
objs (List[Dict[str, Any]]) – List of training examples with ‘input’, ‘output’, and ‘reward’ keys.
- Returns:
Processed list of training examples.
- Return type:
List[Dict[str, Any]]
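The transformation can be sketched as follows. This is a minimal illustration, not the library's implementation: the template string shown is hypothetical (the real processor reads it from args.reward_template), and prefix placement is an assumption.

```python
from typing import Any, Dict, List

def conditional_sft_sketch(reward_template: str,
                           objs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Prepend a formatted reward score to each example's input (in place)."""
    for obj in objs:
        obj["input"] = reward_template.format(reward=obj["reward"]) + obj["input"]
    return objs

examples = [
    {"input": "Q: 2+2?", "output": "4", "reward": 0.9},
    {"input": "Q: capital of France?", "output": "Paris", "reward": 0.3},
]
# "<rm_score>: {reward:.1f} " is a hypothetical template for illustration
processed = conditional_sft_sketch("<rm_score>: {reward:.1f} ", examples)
# each input now carries its reward, e.g. "<rm_score>: 0.9 Q: 2+2?"
```

At inference time, prepending a high reward value to the prompt steers the conditioned model toward its higher-quality behavior.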
- lightrft.utils.processor.get_processor(name: str) → Callable[[Any, List[Dict[str, Any]]], List[Dict[str, Any]]][source]¶
Retrieve a data processor function by name.
- Parameters:
name (str) – Name of the processor (‘rs’, ‘csft’, or ‘iter_dpo’).
- Returns:
The corresponding processor function.
- Return type:
Callable[[Any, List[Dict[str, Any]]], List[Dict[str, Any]]]
- Raises:
ValueError – If no processor is registered under the given name.
- lightrft.utils.processor.iterative_dpo_processor(args: Any, objs: List[Dict[str, Any]]) → List[Dict[str, Any]][source]¶
Process data for Iterative DPO by creating chosen/rejected pairs per input.
Implements the Iterative DPO approach from: “Online Iterative Reinforcement Learning from Human Feedback with General Preference Model” (https://github.com/RLHFlow/Online-RLHF)
For each unique input, this technique tracks the highest-reward (chosen) and lowest-reward (rejected) outputs to create preference pairs for Direct Preference Optimization (DPO) training. This enables iterative improvement through online RLHF.
- Parameters:
args (Any) – Arguments object (unused but kept for API consistency).
objs (List[Dict[str, Any]]) – List of examples with ‘input’, ‘output’, and ‘reward’ keys.
- Returns:
List of preference pairs with ‘prompt’, ‘chosen’, ‘rejected’, and reward values.
- Return type:
List[Dict[str, Any]]
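The pairing logic can be sketched as below. The ‘prompt’, ‘chosen’, and ‘rejected’ keys follow the documented return value; the reward key names and the decision to skip inputs without a reward gap are assumptions, not the library's exact behavior.

```python
from typing import Any, Dict, List

def iterative_dpo_sketch(objs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build one chosen/rejected preference pair per unique input."""
    best: Dict[str, Dict[str, Any]] = {}
    worst: Dict[str, Dict[str, Any]] = {}
    for obj in objs:
        key = obj["input"]
        if key not in best or obj["reward"] > best[key]["reward"]:
            best[key] = obj
        if key not in worst or obj["reward"] < worst[key]["reward"]:
            worst[key] = obj
    # Skip inputs with no reward gap (e.g. only one sampled output).
    return [
        {
            "prompt": key,
            "chosen": best[key]["output"],
            "rejected": worst[key]["output"],
            "chosen_reward": best[key]["reward"],
            "rejected_reward": worst[key]["reward"],
        }
        for key in best
        if best[key]["reward"] > worst[key]["reward"]
    ]

samples = [
    {"input": "q1", "output": "a", "reward": 0.2},
    {"input": "q1", "output": "b", "reward": 0.8},
    {"input": "q2", "output": "c", "reward": 0.5},
]
pairs = iterative_dpo_sketch(samples)
# q1 yields chosen "b" / rejected "a"; q2 has a single output, so no pair
```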
- lightrft.utils.processor.rejection_sampling_processor(args: Any, objs: List[Dict[str, Any]]) → List[Dict[str, Any]][source]¶
Process data using Rejection Sampling by selecting highest-reward output per input.
Implements the Rejection Sampling approach from the paper: “Llama 2: Open Foundation and Fine-Tuned Chat Models” (https://arxiv.org/abs/2307.09288)
This technique filters multiple candidate outputs per input, keeping only the one with the highest reward score. This creates a high-quality training dataset by rejecting lower-quality samples.
- Parameters:
args (Any) – Arguments object (unused but kept for API consistency).
objs (List[Dict[str, Any]]) – List of examples with ‘input’, ‘output’, and ‘reward’ keys.
- Returns:
List of examples with only the highest-reward output per unique input.
- Return type:
List[Dict[str, Any]]
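The filtering step amounts to a per-input argmax over rewards, sketched here under the assumption that examples with the same ‘input’ string belong to the same prompt:

```python
from typing import Any, Dict, List

def rejection_sampling_sketch(objs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the highest-reward output for each unique input."""
    best: Dict[str, Dict[str, Any]] = {}
    for obj in objs:
        key = obj["input"]
        if key not in best or obj["reward"] > best[key]["reward"]:
            best[key] = obj
    return list(best.values())

samples = [
    {"input": "q1", "output": "a", "reward": 0.2},
    {"input": "q1", "output": "b", "reward": 0.8},
    {"input": "q2", "output": "c", "reward": 0.5},
]
filtered = rejection_sampling_sketch(samples)
# q1 keeps output "b" (reward 0.8); q2 keeps its only output "c"
```

The filtered set can then be used directly for standard SFT on the surviving high-reward samples.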
- lightrft.utils.processor.reward_normalization(objs: List[Dict[str, Any]]) → None[source]¶
Normalize reward values across a list of objects using z-score normalization.
This function applies standardization (z-score normalization) to reward values, transforming them to have zero mean and unit variance. This helps stabilize training by ensuring rewards are on a consistent scale.
- Parameters:
objs (List[Dict[str, Any]]) – List of dictionaries, each containing a ‘reward’ key.
- Returns:
None (modifies objs in-place).
- Return type:
None
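In-place z-score normalization can be sketched as follows; the choice of population versus sample standard deviation and the zero-variance fallback are assumptions, not necessarily what the library does.

```python
import statistics
from typing import Any, Dict, List

def reward_normalization_sketch(objs: List[Dict[str, Any]]) -> None:
    """Standardize 'reward' values to zero mean and unit variance, in place."""
    rewards = [obj["reward"] for obj in objs]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std; the library may use sample std
    for obj in objs:
        # Fall back to 0.0 when all rewards are identical (std == 0).
        obj["reward"] = (obj["reward"] - mean) / std if std > 0 else 0.0

objs = [{"reward": 1.0}, {"reward": 3.0}]
reward_normalization_sketch(objs)
# rewards become -1.0 and 1.0 (mean 2.0, population std 1.0)
```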