lightrft.trainer.kl_controller
- class lightrft.trainer.kl_controller.AdaptiveKLController(init_kl_coef: float, target: float, horizon: int)
Bases: object
Adaptive KL controller for PPO training.
Implements the adaptive KL penalty coefficient adjustment described in “Fine-Tuning Language Models from Human Preferences” (https://arxiv.org/pdf/1909.08593.pdf).
This controller dynamically adjusts the KL penalty coefficient based on how the current KL divergence compares to a target value, helping maintain stable training while preventing the policy from deviating too far from the reference.
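The adjustment described above can be sketched as a proportional controller. This is a minimal illustration under stated assumptions, not the lightrft implementation: the class name `SketchAdaptiveKL`, the `value` attribute, and the `update(current_kl, n_steps)` method are hypothetical. The clipping of the relative error to [-0.2, 0.2] follows the cited paper; scaling the step by `n_steps / horizon` follows common open-source implementations and is assumed here.

```python
class SketchAdaptiveKL:
    """Illustrative stand-in (hypothetical); not the lightrft class itself."""

    def __init__(self, init_kl_coef: float, target: float, horizon: int):
        self.value = init_kl_coef  # current KL penalty coefficient
        self.target = target       # desired KL divergence
        self.horizon = horizon     # number of steps over which adaptation is spread

    def update(self, current_kl: float, n_steps: int) -> float:
        # Relative error between measured KL and the target, clipped so that
        # a single noisy measurement cannot move the coefficient too far.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        # Multiplicative update: grows when KL is above target, shrinks when below.
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


ctl = SketchAdaptiveKL(init_kl_coef=0.2, target=6.0, horizon=100)
ctl.update(current_kl=12.0, n_steps=10)  # KL above target: coefficient grows
ctl.update(current_kl=3.0, n_steps=10)   # KL below target: coefficient shrinks
```

A larger `horizon` makes each update smaller, so the coefficient reacts more slowly to deviations from `target`; when the measured KL equals the target, the coefficient is left unchanged.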