R2D2
^^^^^^^

Overview
---------

R2D2 was first proposed in `Recurrent experience replay in distributed reinforcement learning `_.
In RNN training with experience replay, RL algorithms usually suffer from representational drift and recurrent state staleness. R2D2 employs two techniques, **stored states** and **burn-in**, to mitigate these effects. The R2D2 agent integrates these findings to achieve significant advances in the state of the art on Atari-57 and to match the state of the art on DMLab-30. The authors claim that Recurrent Replay Distributed DQN (R2D2) is the first agent to achieve this using a single network architecture and a fixed set of hyper-parameters.

Quick Facts
-------------

1. R2D2 is an **off-policy**, **model-free**, and **value-based** RL algorithm.

2. R2D2 is essentially a DQN-based algorithm that uses a distributed framework, double Q networks, a dueling architecture, an n-step TD loss, and prioritized experience replay.

3. R2D2 currently supports only **discrete** action spaces and uses **eps-greedy** exploration, the same as DQN.

4. R2D2 uses the **stored state** and **burn-in** techniques to mitigate the effects of representational drift and recurrent state staleness.

5. The DI-engine implementation of R2D2 provides a **res_link** key to support a residual link in the recurrent Q network.

Key Equations or Key Graphs
---------------------------

The R2D2 agent is most similar to Ape-X: it is built upon prioritized distributed replay and n-step double Q-learning (with n = 5), generates experience with a large number of actors (typically 256), and learns from batches of replayed experience with a single learner. The Q network of R2D2 uses the dueling network architecture and adds an LSTM layer after the convolutional stack. Instead of regular :math:`(s, a, r, s')` transition tuples, R2D2 stores fixed-length (m = 80) sequences of :math:`(s, a, r)` in replay, with adjacent sequences overlapping each other by 40 time steps and never crossing episode boundaries.

Specifically, the n-step targets used in R2D2 are:

.. image:: images/r2d2_q_targets.png
   :align: center
   :scale: 30%

Here, :math:`\theta^{-}` denotes the target network parameters, which are copied from the online network parameters :math:`\theta` every 2500 learner steps.

R2D2 uses a mixture of the max and mean absolute n-step TD-errors :math:`\delta_i` over the sequence as the prioritization metric for prioritized experience replay:

.. image:: images/r2d2_priority.png
   :align: center
   :scale: 30%

.. note::
   In our DI-engine implementation, at each unroll step, the input to the `LSTM-based Q network `_ is just the **observation** and the **last hidden state**, excluding the reward and the one-hot action. For more details about how to use RNN in DI-engine, users can refer to `How to use RNN `_; for the data arrangement process in R2D2, users can refer to the section `data-arrangement `_; for the burn-in technique in R2D2, users can refer to the section `burn-in-in-r2d2 `_.
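To make the two quantities above concrete, the following is a minimal PyTorch sketch of the n-step double-Q target and the mixed max/mean sequence priority. All function and variable names here are illustrative rather than DI-engine's API; ``h`` is the invertible value rescaling used by R2D2 in place of reward clipping, and the defaults (:math:`\gamma = 0.997`, :math:`\eta = 0.9`) follow the hyper-parameters reported in the paper.

.. code-block:: python

    import torch


    def h(x, eps=1e-3):
        """Invertible value rescaling R2D2 applies instead of reward clipping."""
        return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1.0) - 1.0) + eps * x


    def h_inv(x, eps=1e-3):
        """Closed-form inverse of ``h``."""
        return torch.sign(x) * (
            ((torch.sqrt(1.0 + 4.0 * eps * (torch.abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)) ** 2
            - 1.0
        )


    def n_step_double_q_target(rewards, q_next_online, q_next_target, done, gamma=0.997):
        """n-step double-Q target for a single time step.

        rewards:       (n,) rewards r_t, ..., r_{t+n-1}
        q_next_online: (A,) Q(s_{t+n}, .; theta), selects the bootstrap action
        q_next_target: (A,) Q(s_{t+n}, .; theta^-), evaluates it
        done:          1.0 if the episode terminated within the n steps, else 0.0
        """
        n = rewards.shape[0]
        a_star = q_next_online.argmax()           # double Q-learning: online net selects
        bootstrap = h_inv(q_next_target[a_star])  # target net evaluates, in un-rescaled space
        discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
        ret = (discounts * rewards).sum() + (gamma ** n) * (1.0 - done) * bootstrap
        return h(ret)                             # targets live in rescaled space


    def sequence_priority(td_errors, eta=0.9):
        """Sequence priority: eta * max_i |delta_i| + (1 - eta) * mean_i |delta_i|."""
        abs_delta = td_errors.abs()
        return eta * abs_delta.max() + (1.0 - eta) * abs_delta.mean()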
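The stored-state and burn-in techniques referenced in the note can likewise be summarized in a few lines. The sketch below assumes a hypothetical recurrent Q network with signature ``q_net(obs_seq, state) -> (q_values, next_state)``; ``stored_state`` stands for the hidden state the actor saved when it generated the first step of the replayed sequence.

.. code-block:: python

    import torch


    def burned_in_q_values(q_net, obs_seq, stored_state, burn_in=40):
        """Refresh the recurrent state on a burn-in prefix, then compute Q values.

        obs_seq:      (T, B, *obs_shape) replayed observation sequence (T = 80 in R2D2)
        stored_state: recurrent state saved by the actor at the start of the sequence
        """
        with torch.no_grad():
            # Burn-in: replay the prefix only to produce a start state, mitigating
            # representational drift and state staleness; no gradients flow here.
            _, state = q_net(obs_seq[:burn_in], stored_state)
        # Train on the remainder of the sequence from the burned-in state.
        q_values, _ = q_net(obs_seq[burn_in:], state)
        return q_values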
Extensions
-----------

R2D2 can be combined with:

- Learning from demonstrations

  Users can refer to the `R2D3 paper `_ and the `R2D3 doc `_ of our `R2D3 implementation `_. R2D3 is an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions.

- Transformers

  Transformer-based agents take advantage of their powerful attention mechanism to learn better policies in environments where long-term memory is beneficial. Users can refer to the `GTrXL paper `_ and the `r2d2_gtrxl doc `_ of our `GTrXL implementation `_.

Implementations
----------------

The default config of ``R2D2Policy`` is defined as follows:

.. autoclass:: ding.policy.r2d2.R2D2Policy
   :noindex:

The network interface R2D2 uses is defined as follows (an illustrative sketch of such a recurrent dueling Q network appears at the end of this page):

.. autoclass:: ding.model.template.q_learning.DRQN
   :members: forward
   :noindex:

Benchmark
-----------

.. list-table:: Benchmark and comparison of the R2D2 algorithm
   :widths: 25 15 30 15 15
   :header-rows: 1

   * - environment
     - best mean reward
     - evaluation results
     - config link
     - comparison
   * - | Pong
       | (PongNoFrameskip-v4)
     - 20
     - .. image:: images/benchmark/pong_r2d2.png
     - `config_link_p `_
     -
   * - | Qbert
       | (QbertNoFrameskip-v4)
     - 6000
     - .. image:: images/benchmark/qbert_r2d2_cfg2.png
     - `config_link_q `_
     -
   * - | SpaceInvaders
       | (SpaceInvadersNoFrameskip-v4)
     - 1400
     - .. image:: images/benchmark/spaceinvaders_r2d2.png
     - `config_link_s `_
     -

References
----------

- Kapturowski, S., Ostrovski, G., Quan, J., et al. (2019). Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations.

- Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

- Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1).

- Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (pp. 1995-2003). PMLR.

- Horgan, D., Quan, J., Budden, D., et al. (2018). Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.

Other Public Implementations
----------------------------

- `seed_rl `_
- `ray `_
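Illustrative Network Sketch
----------------------------

For readers who want to see the overall shape of the network R2D2 trains, below is a minimal, self-contained PyTorch sketch of a recurrent dueling Q network over vector observations: an encoder, an LSTM after it, and dueling value/advantage heads. It illustrates the architecture described in the Implementations section; it is not DI-engine's actual ``DRQN`` code, and all names are ours.

.. code-block:: python

    import torch
    import torch.nn as nn


    class RecurrentDuelingQNet(nn.Module):
        """Encoder -> LSTM -> dueling heads, the rough shape of R2D2's Q network."""

        def __init__(self, obs_dim: int, action_dim: int, hidden: int = 128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            self.lstm = nn.LSTM(hidden, hidden)             # expects (T, B, F) input
            self.value = nn.Linear(hidden, 1)               # state value V(s)
            self.advantage = nn.Linear(hidden, action_dim)  # advantages A(s, a)

        def forward(self, obs_seq, state=None):
            # obs_seq: (T, B, obs_dim); state: optional LSTM (h, c) tuple
            x = self.encoder(obs_seq)
            x, next_state = self.lstm(x, state)
            v, adv = self.value(x), self.advantage(x)
            # Dueling aggregation: Q = V + A - mean(A)
            q = v + adv - adv.mean(dim=-1, keepdim=True)
            return q, next_state


    # Usage: unroll an 80-step sequence with batch size 4 over a 16-dim observation.
    net = RecurrentDuelingQNet(obs_dim=16, action_dim=6)
    q, state = net(torch.randn(80, 4, 16))  # feed ``state`` back in at the next unroll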