
HER

Overview

HER (Hindsight Experience Replay) was first proposed in Hindsight Experience Replay to deal with the challenge caused by sparse rewards. HER can be easily combined with any off-policy RL algorithm, allowing sample-efficient learning from rewards that are sparse and binary, and therefore avoiding the need for complicated reward engineering.

HER’s core idea is to learn from failures. HER observes that even if a task’s final goal is not achieved, the episode is still useful, because other goals might be achieved along the way. So HER creates a new episode from the original one, relabeling it with new rewards computed under a changed desired goal.
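
As a minimal illustration of this relabeling (assumed field names such as achieved_goal and desired_goal, not DI-engine's actual API), a single transition can be given a new goal and the sparse binary reward scheme used in the original paper:

    import numpy as np

    def relabel_transition(transition: dict, new_goal: np.ndarray) -> dict:
        # Reward is 0 when the achieved goal matches the new desired goal,
        # and -1 otherwise (the sparse, binary scheme of the original HER paper).
        reached = np.array_equal(transition['achieved_goal'], new_goal)
        return {
            **transition,
            'desired_goal': new_goal,
            'reward': 0.0 if reached else -1.0,
        }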

Quick Facts

  1. HER is not a separate algorithm; it can be combined with any off-policy algorithm.

  2. HER needs to recompute rewards after changing the desired goal, but the reward function used for relabeling does not need to be the same as the env’s reward function.

  3. In DI-engine’s implementation, HER collects episode data and pushes it into the buffer, then samples episodes and modifies their rewards, which is explained in detail in the Pseudo-code part.

  4. HER has different strategies to create a virtual goal. They all come from the same episode as the transition being replayed, but differ as follows (see the sketch after this list):

    • future: a state after the transition

    • episode: any state in the episode

    • final: the final state of the episode
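
A hedged sketch of how a virtual goal could be picked under each strategy (a hypothetical helper for illustration, not DI-engine code). Here episode is a list of transitions and t is the index of the transition being replayed:

    import random

    def pick_virtual_goal(episode: list, t: int, strategy: str):
        if strategy == 'future':
            idx = random.randint(t, len(episode) - 1)   # a state after the transition
        elif strategy == 'episode':
            idx = random.randint(0, len(episode) - 1)   # any state in the episode
        elif strategy == 'final':
            idx = len(episode) - 1                      # the final state of the episode
        else:
            raise ValueError(f'unknown strategy: {strategy}')
        return episode[idx]['achieved_goal']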

Pseudo-code

../_images/HER_psuedo_code.png

Note

Compared with the pseudo-code proposed in the original paper, DI-engine adopts a slightly different way to implement HER. In the original paper, before a transition is pushed into the buffer, it is used to create several transitions with different goals. However, this has two disadvantages: 1) Memory waste: all transitions (the original one and the virtual ones) are stored in the buffer. 2) Fixed dataset: once the transitions are pushed into the buffer, they will not be changed.

Thus, in DI-engine, we first push the original transitions (in whole-episode format) into the buffer. After episodes are sampled out, we create more transitions based on virtual goals. Goals are randomly created at every sampling, so even if the same episode is sampled out several times, the newly created training data will not be the same.
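
Putting the pieces together, a rough sketch of this sample-time relabeling (assumed buffer.sample interface, reusing the two illustrative helpers sketched earlier; not DI-engine's actual code) could look like:

    import random

    def sample_with_her(buffer, episode_size: int, sample_per_episode: int, strategy: str):
        train_data = []
        # Only original episodes are stored; relabeling happens at sampling time,
        # so each sampling of the same episode yields different virtual goals.
        for episode in buffer.sample(episode_size):
            for _ in range(sample_per_episode):
                t = random.randint(0, len(episode) - 1)        # transition to replay
                goal = pick_virtual_goal(episode, t, strategy)
                train_data.append(relabel_transition(episode[t], goal))
        return train_data   # length == episode_size * sample_per_episode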

Implementations

The implementation in DI-engine is not in ding.policy like many other algorithms, but in ding.reward_model.her_reward_model:

class ding.reward_model.her_reward_model.HerRewardModel(cfg: dict, cuda: bool = False)[source]
Overview:

Hindsight Experience Replay model.

Note

  • her_strategy (str): Type of strategy that HER uses, should be in [‘final’, ‘future’, ‘episode’]

  • her_replay_k (int): Number of new episodes generated by an original episode. (Not used in episodic HER)

  • episode_size (int): How many episodes to sample in one iteration.

  • sample_per_episode (int): How many new samples are generated from an episode.

Note

In HER, we require the whole episode trajectory in order to change the goals. However, episode lengths differ and may have high variance. As a result, we recommend that you use only some of the transitions in each complete episode, by specifying episode_size and sample_per_episode in the config. Therefore, in one iteration, batch_size would be episode_size * sample_per_episode.
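
For example, a configuration sketch using the fields documented above (field names follow the docstring; the values are illustrative, not taken from the official configs):

    her_model_cfg = dict(
        her_strategy='future',     # one of ['final', 'future', 'episode']
        her_replay_k=1,            # new episodes generated per original episode (unused in episodic HER)
        episode_size=32,           # episodes sampled in one iteration
        sample_per_episode=4,      # new samples generated from each episode
    )
    # Effective batch_size per iteration: episode_size * sample_per_episode = 32 * 4 = 128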

Benchmark

Benchmark and comparison of HER-DQN algorithm

environment      best mean reward    evaluation results                    config link

BitFlip-15bit    0.9                 ../_images/bitflip15bit-herdqn.png    config_link

BitFlip-20bit    0.9                 ../_images/bitflip20bit-herdqn.png    config_link

P.S.

The above results are obtained by running the same configuration on 3 different random seeds (0, 1, 2).

Reference

  • Andrychowicz M, Wolski F, Ray A, et al. Hindsight experience replay[J]. arXiv preprint arXiv:1707.01495, 2017.

Other Public Implementations