R2D2
^^^^^^^

Overview
---------

R2D2 was first proposed in `Recurrent experience replay in distributed reinforcement learning `_.
In RNN training with experience replay, RL algorithms usually suffer from representational drift and recurrent state staleness. R2D2 employs two techniques, **stored states** and **burn-in**, to mitigate these effects. The R2D2 agent integrates these findings to achieve significant advances in the state of the art on Atari-57 and to match the state of the art on DMLab-30. The authors claim that Recurrent Replay Distributed DQN (R2D2) is the first agent to achieve this using a single network architecture and a fixed set of hyper-parameters.

Quick Facts
-------------

1. R2D2 is an **off-policy**, **model-free**, and **value-based** RL algorithm.

2. R2D2 is essentially a DQN-based algorithm that uses a distributed framework, double Q networks, a dueling architecture, an n-step TD loss, and prioritized experience replay.

3. R2D2 currently supports only **discrete** action spaces and uses **eps-greedy** exploration, the same as DQN.

4. R2D2 uses the **stored state** and **burn-in** techniques to mitigate the effects of representational drift and recurrent state staleness.

5. The DI-engine implementation of R2D2 provides a **res_link** key to support a residual link in the recurrent Q network.

Key Equations or Key Graphs
---------------------------

The R2D2 agent is most similar to Ape-X: it is built upon prioritized distributed replay and n-step double Q-learning (with n = 5), generates experience with a large number of actors (typically 256), and learns from batches of replayed experience with a single learner. The Q network of R2D2 uses the dueling network architecture and adds an LSTM layer after the convolutional stack. Instead of regular :math:`(s, a, r, s')` transition tuples, R2D2 stores fixed-length (m = 80) sequences of :math:`(s, a, r)` in replay, with adjacent sequences overlapping each other by 40 time steps and never crossing episode boundaries.

Specifically, the n-step targets used in R2D2 are:

.. image:: images/r2d2_q_targets.png
   :align: center
   :scale: 30%

Here, :math:`\theta^{-}` denotes the target network parameters, which are copied from the online network parameters :math:`\theta` every 2500 learner steps.

R2D2 uses a mixture of the max and mean absolute n-step TD-errors :math:`\delta_i` over the sequence as the prioritization metric for prioritized experience replay:

.. image:: images/r2d2_priority.png
   :align: center
   :scale: 30%

.. note::
   In our DI-engine implementation, at each unroll step, the input to the `LSTM-based Q network `_ is just the **observation** and the **last hidden state**, excluding the reward and the one-hot action. For more details about how to use RNN in DI-engine, users can refer to `How to use RNN `_; for the data arrangement process in R2D2, users can refer to the section `data-arrangement `_; for the burn-in technique in R2D2, users can refer to the section `burn-in-in-r2d2 `_.
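To make the two quantities above concrete, the following is a minimal PyTorch sketch of the n-step double-Q target and the mixed max/mean sequence priority. All function and variable names here are illustrative rather than DI-engine's API; ``h`` is the invertible value rescaling used by R2D2 in place of reward clipping, and the defaults (:math:`\gamma = 0.997`, :math:`\eta = 0.9`) follow the hyper-parameters reported in the paper.

.. code-block:: python

    import torch


    def h(x, eps=1e-3):
        """Invertible value rescaling R2D2 applies instead of reward clipping."""
        return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1.0) - 1.0) + eps * x


    def h_inv(x, eps=1e-3):
        """Closed-form inverse of ``h``."""
        return torch.sign(x) * (
            ((torch.sqrt(1.0 + 4.0 * eps * (torch.abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)) ** 2
            - 1.0
        )


    def n_step_double_q_target(rewards, q_next_online, q_next_target, done, gamma=0.997):
        """n-step double-Q target for a single time step.

        rewards:       (n,) rewards r_t, ..., r_{t+n-1}
        q_next_online: (A,) Q(s_{t+n}, .; theta), selects the bootstrap action
        q_next_target: (A,) Q(s_{t+n}, .; theta^-), evaluates it
        done:          1.0 if the episode terminated within the n steps, else 0.0
        """
        n = rewards.shape[0]
        a_star = q_next_online.argmax()           # double Q-learning: online net selects
        bootstrap = h_inv(q_next_target[a_star])  # target net evaluates, in un-rescaled space
        discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
        ret = (discounts * rewards).sum() + (gamma ** n) * (1.0 - done) * bootstrap
        return h(ret)                             # targets live in rescaled space


    def sequence_priority(td_errors, eta=0.9):
        """Sequence priority: eta * max_i |delta_i| + (1 - eta) * mean_i |delta_i|."""
        abs_delta = td_errors.abs()
        return eta * abs_delta.max() + (1.0 - eta) * abs_delta.mean()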
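The stored-state and burn-in techniques referenced in the note can likewise be summarized in a few lines. The sketch below assumes a hypothetical recurrent Q network with signature ``q_net(obs_seq, state) -> (q_values, next_state)``; ``stored_state`` stands for the hidden state the actor saved when it generated the first step of the replayed sequence.

.. code-block:: python

    import torch


    def burned_in_q_values(q_net, obs_seq, stored_state, burn_in=40):
        """Refresh the recurrent state on a burn-in prefix, then compute Q values.

        obs_seq:      (T, B, *obs_shape) replayed observation sequence (T = 80 in R2D2)
        stored_state: recurrent state saved by the actor at the start of the sequence
        """
        with torch.no_grad():
            # Burn-in: replay the prefix only to produce a start state, mitigating
            # representational drift and state staleness; no gradients flow here.
            _, state = q_net(obs_seq[:burn_in], stored_state)
        # Train on the remainder of the sequence from the burned-in state.
        q_values, _ = q_net(obs_seq[burn_in:], state)
        return q_values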
Extensions
-----------

R2D2 can be combined with:

- Learning from demonstrations

  Users can refer to the `R2D3 paper `_ and the `R2D3 doc `_ of our `R2D3 implementation `_. R2D3 is an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions.

- Transformers

  Transformer-based agents take advantage of their powerful attention mechanism to learn better policies in environments where long-term memory is beneficial. Users can refer to the `GTrXL paper `_ and the `r2d2_gtrxl doc `_ of our `GTrXL implementation `_.

Implementations
----------------

The default config of ``R2D2Policy`` is defined as follows:

.. autoclass:: ding.policy.r2d2.R2D2Policy
   :noindex:

The network interface R2D2 uses is defined as follows (an illustrative sketch of such a recurrent dueling Q network appears at the end of this page):

.. autoclass:: ding.model.template.q_learning.DRQN
   :members: forward
   :noindex:

Benchmark
-----------

.. list-table:: Benchmark and comparison of the R2D2 algorithm
   :widths: 25 15 30 15 15
   :header-rows: 1

   * - environment
     - best mean reward
     - evaluation results
     - config link
     - comparison
   * - | Pong
       | (PongNoFrameskip-v4)
     - 20
     - .. image:: images/benchmark/pong_r2d2.png
     - `config_link_p `_
     -
   * - | Qbert
       | (QbertNoFrameskip-v4)
     - 6000
     - .. image:: images/benchmark/qbert_r2d2_cfg2.png
     - `config_link_q `_
     -
   * - | SpaceInvaders
       | (SpaceInvadersNoFrameskip-v4)
     - 1400
     - .. image:: images/benchmark/spaceinvaders_r2d2.png
     - `config_link_s `_
     -

References
----------

- Kapturowski, S., Ostrovski, G., Quan, J., et al. (2019). Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations.

- Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

- Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1).

- Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (pp. 1995-2003). PMLR.

- Horgan, D., Quan, J., Budden, D., et al. (2018). Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.

Other Public Implementations
----------------------------

- `seed_rl `_
- `ray `_
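Illustrative Network Sketch
----------------------------

For readers who want to see the overall shape of the network R2D2 trains, below is a minimal, self-contained PyTorch sketch of a recurrent dueling Q network over vector observations: an encoder, an LSTM after it, and dueling value/advantage heads. It illustrates the architecture described in the Implementations section; it is not DI-engine's actual ``DRQN`` code, and all names are ours.

.. code-block:: python

    import torch
    import torch.nn as nn


    class RecurrentDuelingQNet(nn.Module):
        """Encoder -> LSTM -> dueling heads, the rough shape of R2D2's Q network."""

        def __init__(self, obs_dim: int, action_dim: int, hidden: int = 128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            self.lstm = nn.LSTM(hidden, hidden)             # expects (T, B, F) input
            self.value = nn.Linear(hidden, 1)               # state value V(s)
            self.advantage = nn.Linear(hidden, action_dim)  # advantages A(s, a)

        def forward(self, obs_seq, state=None):
            # obs_seq: (T, B, obs_dim); state: optional LSTM (h, c) tuple
            x = self.encoder(obs_seq)
            x, next_state = self.lstm(x, state)
            v, adv = self.value(x), self.advantage(x)
            # Dueling aggregation: Q = V + A - mean(A)
            q = v + adv - adv.mean(dim=-1, keepdim=True)
            return q, next_state


    # Usage: unroll an 80-step sequence with batch size 4 over a 16-dim observation.
    net = RecurrentDuelingQNet(obs_dim=16, action_dim=6)
    q, state = net(torch.randn(80, 4, 16))  # feed ``state`` back in at the next unroll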