PLR

Overview

PLR was proposed in Prioritized Level Replay. It is a method for sampling training levels that exploits the differences in learning potential among levels to improve both sample efficiency and generalization.

Quick Facts

  1. PLR is designed for multi-level environments.

  2. PLR updates the policy and the level scores simultaneously.

  3. In the DI-engine implementation, PLR is combined with the PPG algorithm.

  4. PLR supports the policy entropy, policy min-margin, policy least-confidence, 1-step TD error, and GAE score functions (see the sketch after this list).
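
The following is a minimal, illustrative sketch of how such score functions can be computed from a collected trajectory; the exact forms used in DI-engine and in the paper may differ slightly. The array shapes are assumptions made for illustration: probs has shape (T, num_actions) with per-step action probabilities, rewards has length T, and values has length T + 1 (including a bootstrap value).

    import numpy as np

    def policy_entropy_score(probs):
        # Mean entropy of the action distribution over the trajectory.
        return float(np.mean(-np.sum(probs * np.log(probs + 1e-12), axis=-1)))

    def policy_min_margin_score(probs):
        # One minus the gap between the two most probable actions, averaged over steps.
        sorted_p = np.sort(probs, axis=-1)
        return float(np.mean(1.0 - (sorted_p[:, -1] - sorted_p[:, -2])))

    def policy_least_confidence_score(probs):
        # One minus the maximum action probability, averaged over steps.
        return float(np.mean(1.0 - np.max(probs, axis=-1)))

    def one_step_td_score(rewards, values, gamma=0.999):
        # Mean magnitude of the 1-step TD error r_t + gamma * V(s_{t+1}) - V(s_t).
        deltas = rewards + gamma * values[1:] - values[:-1]
        return float(np.mean(np.abs(deltas)))

    def gae_score(rewards, values, gamma=0.999, lam=0.95):
        # Mean magnitude of the Generalized Advantage Estimate over the trajectory.
        deltas = rewards + gamma * values[1:] - values[:-1]
        gae, magnitudes = 0.0, []
        for delta in deltas[::-1]:
            gae = delta + gamma * lam * gae
            magnitudes.append(abs(gae))
        return float(np.mean(magnitudes))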

Key Graphs

Game levels are determined by a random seed and can vary in navigational layout, visual appearance, and starting positions of entities. PLR selectively samples the next training level based on an estimated learning potential of replaying each level anew. The next level is either sampled from a distribution with support over unseen levels (top), which could be the environment’s (perhaps implicit) full training-level distribution, or alternatively, sampled from the replay distribution, which prioritizes levels based on future learning potential (bottom).

../_images/PLR_pic.png

Key Equations

Levels are scored according to their learning potential:

../_images/PLR_Score.png
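
For reference, the GAE-based variant of this score (as reconstructed from the paper) is the average magnitude of the Generalized Advantage Estimate over the trajectory, with TD error \(\delta_t\), discount \(\gamma\), and GAE parameter \(\lambda\):

\[S_{i}=\operatorname{score}(\tau, \pi)=\frac{1}{T} \sum_{t=0}^{T}\left|\sum_{k=t}^{T}(\gamma \lambda)^{k-t} \delta_{k}\right|\]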

Given level scores, we use normalized outputs of a prioritization function \(h\) evaluated over these scores and tuned using a temperature parameter \(\beta\) to define the score-prioritized distribution \(P_{S}\left(\Lambda_{\text {train }}\right)\) over the training levels, under which

\[P_{S}\left(l_{i} \mid \Lambda_{\text {seen }}, S\right)=\frac{h\left(S_{i}\right)^{1 / \beta}}{\sum_{j} h\left(S_{j}\right)^{1 / \beta}}\]
\[h\left(S_{i}\right)=1 / \operatorname{rank}\left(S_{i}\right)\]

where \(\operatorname{rank}\left(S_{i}\right)\) is the rank of level score \(S_{i}\) among all scores sorted in descending order.
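A minimal sketch of this rank-based prioritization, assuming the scores are stored in a NumPy array (the temperature value below is illustrative only):

    import numpy as np

    def score_prioritized_probs(scores, beta=0.1):
        # h(S_i) = 1 / rank(S_i), where rank 1 is the highest score; the
        # temperature beta sharpens or flattens the resulting distribution.
        scores = np.asarray(scores, dtype=np.float64)
        ranks = np.empty(len(scores))
        ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
        weights = (1.0 / ranks) ** (1.0 / beta)
        return weights / weights.sum()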

As the scores used to parameterize \(P_{S}\) are a function of the state of the policy at the time the associated level was last played, they come to reflect a gradually more off-policy measure the longer they remain without an update through replay. We mitigate this drift towards “off-policy-ness” by explicitly mixing the sampling distribution with a staleness prioritized distribution \(P_{C}\) :

\[P_{C}\left(l_{i} \mid \Lambda_{\text {seen }}, C, c\right)=\frac{c-C_{i}}{\sum_{C_{j} \in C}\left(c-C_{j}\right)}\]
\[P_{\text {replay }}\left(l_{i}\right)=(1-\rho) \cdot P_{S}\left(l_{i} \mid \Lambda_{\text {seen }}, S\right)+\rho \cdot P_{C}\left(l_{i} \mid \Lambda_{\text {seen }}, C, c\right)\]
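
Continuing the sketch above, the staleness distribution and the mixed replay distribution can be written as follows. Here last_played stores, for each seen level, the episode count \(C_i\) at which its score was last updated, episode_count is the current count \(c\), and rho is the staleness coefficient \(\rho\) (the default value is illustrative only):

    import numpy as np

    def replay_probs(scores, last_played, episode_count, rho=0.1, beta=0.1):
        # P_S: score-prioritized distribution from the previous sketch.
        p_s = score_prioritized_probs(scores, beta)
        # P_C: staleness-prioritized distribution, favoring levels whose
        # scores were computed many episodes ago (c - C_i).
        staleness = float(episode_count) - np.asarray(last_played, dtype=np.float64)
        if staleness.sum() > 0:
            p_c = staleness / staleness.sum()
        else:
            p_c = np.full(len(scores), 1.0 / len(scores))
        # P_replay: mixture of the two distributions.
        return (1.0 - rho) * p_s + rho * p_c

    # Sampling a seen level to replay:
    # idx = np.random.choice(len(scores), p=replay_probs(scores, last_played, episode_count))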

Pseudo-code

Policy-gradient training loop with PLR

../_images/PLR_1.png

Experience collection with PLR

../_images/PLR_2.png
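
The two procedures above can be folded into a single illustrative loop. The helper names below (env_factory, policy.collect, policy.update, compute_score) are hypothetical placeholders rather than DI-engine APIs, and the replay probability is illustrative:

    import numpy as np

    def plr_train(env_factory, policy, compute_score, num_episodes, p_replay=0.5):
        seen_levels, scores, last_played = [], [], []
        for episode in range(num_episodes):
            # Decide whether to replay a seen level or to sample a new one.
            if seen_levels and np.random.rand() < p_replay:
                idx = np.random.choice(
                    len(seen_levels),
                    p=replay_probs(scores, last_played, episode),
                )
                level = seen_levels[idx]
            else:
                level = env_factory.sample_new_level()  # unseen-level distribution
                seen_levels.append(level)
                scores.append(0.0)
                last_played.append(episode)
                idx = len(seen_levels) - 1
            # Collect a trajectory on the chosen level and update the policy.
            trajectory = policy.collect(env_factory.make(level))
            policy.update(trajectory)
            # Re-score the level using the freshly collected trajectory
            # (e.g. gae_score from the Quick Facts sketch).
            scores[idx] = compute_score(trajectory)
            last_played[idx] = episode
        return seen_levels, scores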

Benchmark

Benchmark of the PLR algorithm

  environment: BigFish
  evaluation results: ../_images/PLR_result.png
  config link: config_link_p

References

Minqi Jiang, Edward Grefenstette, Tim Rocktäschel: “Prioritized Level Replay”, 2021; arXiv:2010.03934.

Other Public Implementations