TREX¶
Overview¶
TREX (Trajectory-ranked Reward Extrapolation) was first proposed in Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations. It uses ranked demonstrations to infer a reward function. Unlike previous methods, TREX seeks a reward function that explains the ranking over demonstrations, rather than one that justifies the demonstrations themselves. This approach has been shown empirically to be able to learn good policies from highly suboptimal demonstrations. In addition, the predicted reward depends only on observations, not on actions.
Quick Facts¶
TREX is an Inverse Reinforcement Learning (IRL) algorithm and can be combined with any RL algorithm.
Demonstrations used in TREX require ranking information.
The reward function optimization can be viewed as a simple binary classification problem.
TREX is able to infer a meaningful reward function even when noisy, time-based rankings are provided.
Pseudo-code¶
In the pseudo-code, θ denotes the parameters of the reward function and τ_i denotes a trajectory sampled from the demonstrations. The function L is the loss used to train the reward function, which is explained below.
Key Equations or Key Graphs¶
The loss for the reward function is:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau_i, \tau_j \sim \Pi} \left[ \xi\left( P\left(\hat{J}_\theta(\tau_i) < \hat{J}_\theta(\tau_j)\right),\ \tau_i \prec \tau_j \right) \right]$$

where τ_i, τ_j are trajectories in the demonstrations and τ_i ≺ τ_j means that τ_j is ranked higher than τ_i. This loss function is essentially a binary classification loss: its goal is to predict which of the two trajectories is better. The probability P is approximated by a softmax over the predicted returns:

$$P\left(\hat{J}_\theta(\tau_i) < \hat{J}_\theta(\tau_j)\right) \approx \frac{\exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}{\exp \sum_{s \in \tau_i} \hat{r}_\theta(s) + \exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}$$

The final loss function is in cross-entropy form:

$$\mathcal{L}(\theta) = - \sum_{\tau_i \prec \tau_j} \log \frac{\exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}{\exp \sum_{s \in \tau_i} \hat{r}_\theta(s) + \exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}$$
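As a concrete illustration, here is a minimal PyTorch sketch of this pairwise loss; the network and function names are illustrative assumptions, not DI-engine's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Illustrative reward network: maps a single observation to a scalar reward."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (T, obs_dim) -> per-step rewards: (T, 1)
        return self.net(obs)

def trex_pair_loss(reward_net: RewardNet,
                   tau_i: torch.Tensor,
                   tau_j: torch.Tensor) -> torch.Tensor:
    """Cross-entropy ranking loss for one pair, assuming tau_j is ranked higher than tau_i."""
    # Predicted returns: sum of predicted per-step rewards along each trajectory.
    ret_i = reward_net(tau_i).sum()
    ret_j = reward_net(tau_j).sum()
    # Treat the two predicted returns as logits of a 2-way classifier;
    # the "correct class" is the better trajectory (index 1, i.e. tau_j).
    logits = torch.stack([ret_i, ret_j]).unsqueeze(0)   # shape (1, 2)
    label = torch.tensor([1])                           # tau_j is preferred
    return F.cross_entropy(logits, label)

# Usage: two random trajectories of length 50 in an 8-dim observation space.
net = RewardNet(obs_dim=8)
loss = trex_pair_loss(net, torch.randn(50, 8), torch.randn(50, 8))
loss.backward()
```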
Extensions¶
TREX can work with noisy ranked demonstrations and is relatively robust to noise of up to around 15% pairwise errors.
TREX can work with time-based rankings. We can simply use data generated while training an RL agent and estimate the rankings from the time at which each trajectory was generated (trajectories collected later in training are assumed to be better).
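For instance, a minimal illustration of time-based ranking (the data layout is assumed purely for illustration):

```python
# Hypothetical time-based ranking: demonstrations from later training checkpoints
# are assumed to be better, so the checkpoint iteration serves as the ranking key.
demos = [
    {"obs": [[0.1, 0.2]], "checkpoint_iter": 3000},
    {"obs": [[0.3, 0.1]], "checkpoint_iter": 1000},
    {"obs": [[0.0, 0.4]], "checkpoint_iter": 2000},
]
# Sort from worst (earliest) to best (latest); the position in `ranked` is the rank.
ranked = sorted(demos, key=lambda d: d["checkpoint_iter"])
```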
Implementations¶
TREX can be combined with common RL methods such as PPO and SAC.
Given demonstrations generated from RL algorithms or human knowledge (only observations and rankings are needed), TREX will infer the reward function of the environment. Then the reward function can be applied to RL algorithms like PPO or SAC to estimate rewards while training.
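The sketch below illustrates this relabeling step, assuming a trained reward network and a list of transition dicts; the data layout and names are illustrative, not DI-engine's actual interface (which exposes this through the estimate method documented below):

```python
import torch
import torch.nn as nn

# Illustrative learned reward model (stands in for the trained TREX reward network).
reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

# Hypothetical collected transitions: observation plus the environment reward.
transitions = [
    {"obs": torch.randn(8), "action": 0, "reward": 1.0, "done": False}
    for _ in range(32)
]

# Replace the environment reward with the reward predicted from the observation only,
# then hand the relabeled transitions to the usual PPO/SAC update.
with torch.no_grad():
    for t in transitions:
        t["reward"] = reward_net(t["obs"].unsqueeze(0)).item()
```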
In practice, demonstrations are generated by pretrained policies in our implementation. First, we train an RL agent and save a checkpoint every 10000 iterations. Then we use these different agents to generate demonstrations and rank them according to their final rewards. Following the paper, we also implement several data augmentation techniques. In the default setting, we randomly sample sub-trajectories (snippets) from the demonstrations to generate new ones, so that the number of generated demonstrations equals the number of original ones. The final dataset for training the reward function contains both the original demonstrations and the generated ones. Note that later states in a trajectory are usually better than earlier ones, so when sampling a pair of snippets we ensure that the snippet taken from the higher-ranked trajectory starts no earlier than the snippet taken from the lower-ranked trajectory.
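A minimal sketch of this snippet-based augmentation is shown below, assuming demonstrations are stored as observation sequences sorted from worst to best; all names are illustrative, not DI-engine's actual code:

```python
import random

def sample_snippet_pair(ranked_demos, snippet_len=50):
    """ranked_demos: list of observation sequences, sorted from worst to best."""
    # Pick two distinct demonstrations; index j refers to the better (higher-ranked) one.
    i, j = sorted(random.sample(range(len(ranked_demos)), 2))
    worse, better = ranked_demos[i], ranked_demos[j]
    max_start = min(len(worse), len(better)) - snippet_len
    # Later states are usually better, so the snippet from the better trajectory
    # must start no earlier than the snippet from the worse trajectory.
    start_i = random.randint(0, max_start)
    start_j = random.randint(start_i, max_start)
    snip_i = worse[start_i:start_i + snippet_len]
    snip_j = better[start_j:start_j + snippet_len]
    return snip_i, snip_j   # label: snip_j is preferred

# Usage with dummy demonstrations (lists of fake 1-dim observations).
demos = [[[float(t)] for t in range(200)] for _ in range(5)]
pair = sample_snippet_pair(demos, snippet_len=50)
```

Constraining the start indices this way keeps the pairwise labels consistent with the assumption that progress generally increases over time within a trajectory.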
The reward model takes observations as input and outputs the predicted reward value. The default reward model is defined as follows:
- class ding.reward_model.TrexRewardModel(config: EasyDict, device: str, tb_logger: SummaryWriter)
- Overview:
  The TREX reward model class (https://arxiv.org/pdf/1904.06387.pdf)
- Interface:
  estimate, train, load_expert_data, collect_data, clear_data, __init__, _train
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
|---|---|---|---|---|---|
| 1 | type | str | trex | Reward model register name, refer to registry REWARD_MODEL_REGISTRY | |
| 3 | learning_rate | float | 0.00001 | Learning rate for optimizer | |
| 4 | update_per_collect | int | 100 | Number of updates per collect | |
| 5 | num_trajs | int | 0 | Number of downsampled full trajectories | |
| 6 | num_snippets | int | 6000 | Number of short subtrajectories to sample | |
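For reference, a minimal sketch of how these fields might appear in a config dict; only the fields from the table above are used, and the nesting under reward_model is an assumption about the surrounding config layout (see the config linked in the benchmark below for the exact format):

```python
from easydict import EasyDict

# Illustrative reward-model config using only the documented fields.
trex_reward_model_config = EasyDict(dict(
    reward_model=dict(
        type='trex',             # registered name in REWARD_MODEL_REGISTRY
        learning_rate=1e-5,      # optimizer learning rate
        update_per_collect=100,  # reward-model updates per data collection
        num_trajs=0,             # number of downsampled full trajectories
        num_snippets=6000,       # number of short sub-trajectories to sample
    ),
))
```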
Benchmark¶
| environment | best mean reward | PPO | TREX+PPO | config link |
|---|---|---|---|---|
| LunarLander | 2M env_step, reward 200 | | | |
We use the agents saved every 1000 iterations (1000-9000) during the PPO training process to generate demonstrations. Although the return of the best demonstration used is only about -50 (highly suboptimal), the agent trained with the TREX reward can reach a much better score (about 200).
Reference¶
Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum: “Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations”, 2019; arXiv:1904.06387. https://arxiv.org/abs/1904.06387