lightrft.utils.processor¶
- lightrft.utils.processor.conditional_sft_processor(args: Any, objs: List[Dict[str, Any]]) → List[Dict[str, Any]][source]¶
Process data for Conditional SFT by prepending reward information to inputs.
Implements the Conditional SFT approach from the paper: “Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning” (https://arxiv.org/abs/2308.12050)
This technique conditions the model on reward scores during training, allowing it to generate outputs of varying quality based on the specified reward threshold.
- Parameters:
args (Any) – Arguments object with a ‘reward_template’ attribute and a ‘normalize_reward’ flag.
objs (List[Dict[str, Any]]) – List of training examples with ‘input’, ‘output’, and ‘reward’ keys.
- Returns:
Processed list of training examples.
- Return type:
List[Dict[str, Any]]
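The transformation can be sketched as follows. This is a minimal illustration, not the library's implementation: the template string shown is hypothetical (the real processor reads it from args.reward_template), and prefix placement is an assumption.

```python
from typing import Any, Dict, List

def conditional_sft_sketch(reward_template: str,
                           objs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Prepend a formatted reward score to each example's input (in place)."""
    for obj in objs:
        obj["input"] = reward_template.format(reward=obj["reward"]) + obj["input"]
    return objs

examples = [
    {"input": "Q: 2+2?", "output": "4", "reward": 0.9},
    {"input": "Q: capital of France?", "output": "Paris", "reward": 0.3},
]
# "<rm_score>: {reward:.1f} " is a hypothetical template for illustration
processed = conditional_sft_sketch("<rm_score>: {reward:.1f} ", examples)
# each input now carries its reward, e.g. "<rm_score>: 0.9 Q: 2+2?"
```

At inference time, prepending a high reward value to the prompt steers the conditioned model toward its higher-quality behavior.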
- lightrft.utils.processor.get_processor(name: str) → Callable[[Any, List[Dict[str, Any]]], List[Dict[str, Any]]][source]¶
Retrieve a data processor function by name.
- Parameters:
name (str) – Name of the processor (‘rs’, ‘csft’, or ‘iter_dpo’).
- Returns:
The corresponding processor function.
- Return type:
Callable[[Any, List[Dict[str, Any]]], List[Dict[str, Any]]]
- Raises:
ValueError – If no processor is registered under the given name.
- lightrft.utils.processor.iterative_dpo_processor(args: Any, objs: List[Dict[str, Any]]) → List[Dict[str, Any]][source]¶
Process data for Iterative DPO by creating chosen/rejected pairs per input.
Implements the Iterative DPO approach from: “Online Iterative Reinforcement Learning from Human Feedback with General Preference Model” (https://github.com/RLHFlow/Online-RLHF)
For each unique input, this technique tracks the highest-reward (chosen) and lowest-reward (rejected) outputs to create preference pairs for Direct Preference Optimization (DPO) training. This enables iterative improvement through online RLHF.
- Parameters:
args (Any) – Arguments object (unused but kept for API consistency).
objs (List[Dict[str, Any]]) – List of examples with ‘input’, ‘output’, and ‘reward’ keys.
- Returns:
List of preference pairs with ‘prompt’, ‘chosen’, ‘rejected’, and reward values.
- Return type:
List[Dict[str, Any]]
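The pairing logic can be sketched as below. The ‘prompt’, ‘chosen’, and ‘rejected’ keys follow the documented return value; the reward key names and the decision to skip inputs without a reward gap are assumptions, not the library's exact behavior.

```python
from typing import Any, Dict, List

def iterative_dpo_sketch(objs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build one chosen/rejected preference pair per unique input."""
    best: Dict[str, Dict[str, Any]] = {}
    worst: Dict[str, Dict[str, Any]] = {}
    for obj in objs:
        key = obj["input"]
        if key not in best or obj["reward"] > best[key]["reward"]:
            best[key] = obj
        if key not in worst or obj["reward"] < worst[key]["reward"]:
            worst[key] = obj
    # Skip inputs with no reward gap (e.g. only one sampled output).
    return [
        {
            "prompt": key,
            "chosen": best[key]["output"],
            "rejected": worst[key]["output"],
            "chosen_reward": best[key]["reward"],
            "rejected_reward": worst[key]["reward"],
        }
        for key in best
        if best[key]["reward"] > worst[key]["reward"]
    ]

samples = [
    {"input": "q1", "output": "a", "reward": 0.2},
    {"input": "q1", "output": "b", "reward": 0.8},
    {"input": "q2", "output": "c", "reward": 0.5},
]
pairs = iterative_dpo_sketch(samples)
# q1 yields chosen "b" / rejected "a"; q2 has a single output, so no pair
```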
- lightrft.utils.processor.rejection_sampling_processor(args: Any, objs: List[Dict[str, Any]]) → List[Dict[str, Any]][source]¶
Process data using Rejection Sampling by selecting highest-reward output per input.
Implements the Rejection Sampling approach from the paper: “Llama 2: Open Foundation and Fine-Tuned Chat Models” (https://arxiv.org/abs/2307.09288)
This technique filters multiple candidate outputs per input, keeping only the one with the highest reward score. This creates a high-quality training dataset by rejecting lower-quality samples.
- Parameters:
args (Any) – Arguments object (unused but kept for API consistency).
objs (List[Dict[str, Any]]) – List of examples with ‘input’, ‘output’, and ‘reward’ keys.
- Returns:
List of examples with only the highest-reward output per unique input.
- Return type:
List[Dict[str, Any]]
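The filtering step amounts to a per-input argmax over rewards, sketched here under the assumption that examples with the same ‘input’ string belong to the same prompt:

```python
from typing import Any, Dict, List

def rejection_sampling_sketch(objs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the highest-reward output for each unique input."""
    best: Dict[str, Dict[str, Any]] = {}
    for obj in objs:
        key = obj["input"]
        if key not in best or obj["reward"] > best[key]["reward"]:
            best[key] = obj
    return list(best.values())

samples = [
    {"input": "q1", "output": "a", "reward": 0.2},
    {"input": "q1", "output": "b", "reward": 0.8},
    {"input": "q2", "output": "c", "reward": 0.5},
]
filtered = rejection_sampling_sketch(samples)
# q1 keeps output "b" (reward 0.8); q2 keeps its only output "c"
```

The filtered set can then be used directly for standard SFT on the surviving high-reward samples.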
- lightrft.utils.processor.reward_normalization(objs: List[Dict[str, Any]]) → None[source]¶
Normalize reward values across a list of objects using z-score normalization.
This function applies standardization (z-score normalization) to reward values, transforming them to have zero mean and unit variance. This helps stabilize training by ensuring rewards are on a consistent scale.
- Parameters:
objs (List[Dict[str, Any]]) – List of dictionaries, each containing a ‘reward’ key.
- Returns:
None (modifies objs in-place).
- Return type:
None
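In-place z-score normalization can be sketched as follows; the choice of population versus sample standard deviation and the zero-variance fallback are assumptions, not necessarily what the library does.

```python
import statistics
from typing import Any, Dict, List

def reward_normalization_sketch(objs: List[Dict[str, Any]]) -> None:
    """Standardize 'reward' values to zero mean and unit variance, in place."""
    rewards = [obj["reward"] for obj in objs]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std; the library may use sample std
    for obj in objs:
        # Fall back to 0.0 when all rewards are identical (std == 0).
        obj["reward"] = (obj["reward"] - mean) / std if std > 0 else 0.0

objs = [{"reward": 1.0}, {"reward": 3.0}]
reward_normalization_sketch(objs)
# rewards become -1.0 and 1.0 (mean 2.0, population std 1.0)
```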