C51

Overview

C51 was first proposed in A Distributional Perspective on Reinforcement Learning. Unlike previous works, C51 models the complete distribution of a Q-value rather than only its expectation. The authors designed a distributional Bellman operator, which preserves multimodality in the value distribution and is believed to achieve more stable learning and to mitigate the negative effects of learning from a non-stationary policy.

Quick Facts

  1. C51 is a model-free and value-based RL algorithm.

  2. C51 only supports discrete action spaces.

  3. C51 is an off-policy algorithm.

  4. Usually, C51 uses eps-greedy or multinomial sampling for exploration (see the sketch after this list).

  5. C51 can be equipped with an RNN.
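
As a minimal illustration of fact 4, the sketch below (an assumption-based example, not the framework's actual exploration code) selects actions eps-greedily from the expected Q-values implied by a learned categorical distribution:

    import torch

    def select_action(logits, support, eps):
        """Epsilon-greedy action selection on top of a categorical value distribution.

        logits:  (batch, num_actions, n_atom) unnormalized atom logits per action
        support: (n_atom,) atom values z_i
        eps:     exploration probability
        """
        probs = torch.softmax(logits, dim=-1)          # per-action atom probabilities p_i
        q = (probs * support).sum(-1)                   # expected Q(s, a) = sum_i p_i * z_i
        greedy = q.argmax(dim=-1)                       # greedy action per sample
        random_a = torch.randint_like(greedy, q.shape[-1])
        explore = torch.rand(q.shape[0], device=q.device) < eps
        return torch.where(explore, random_a, greedy)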

Pseudo-code

../_images/C51.png

Note

C51 models the value distribution with a discrete distribution whose support is a set of N atoms: \(z_i = V_\min + i * \Delta z, i = 0,1,...,N-1\), where \(\Delta z = (V_\max - V_\min) / (N-1)\). Each atom \(z_i\) carries a parameterized probability \(p_i\). The Bellman update of C51 projects the distribution of \(r + \gamma * z_j^{\left(t+1\right)}\) onto the support \(z_i^{\left(t\right)}\).
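
For example, with the commonly used Atari setting (\(N = 51\), \(V_\min = -10\), \(V_\max = 10\)), the support can be built as follows (a plain PyTorch sketch):

    import torch

    n_atom, v_min, v_max = 51, -10.0, 10.0
    delta_z = (v_max - v_min) / (n_atom - 1)
    support = v_min + delta_z * torch.arange(n_atom)  # atoms z_0, ..., z_{N-1}
    # equivalently: support = torch.linspace(v_min, v_max, n_atom)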

Key Equations or Key Graphs

The Bellman target of C51 is derived by projecting the returned distribution \(r + \gamma * z_j\) onto the current support \(z_i\). Given a sample transition \((x, a, r, x')\), we compute the Bellman update \(\hat{T} z_j := r + \gamma z_j\) for each atom \(z_j\), and then distribute its probability \(p_{j}(x', \pi(x'))\) to the immediate neighbors \(p_{i}(x, \pi(x))\):

\[\left(\Phi \hat{T} Z_{\theta}(x, a)\right)_{i}=\sum_{j=0}^{N-1}\left[1-\frac{\left|\left[\hat{\mathcal{T}} z_{j}\right]_{V_{\mathrm{MIN}}}^{V_{\mathrm{MAX}}}-z_{i}\right|}{\Delta z}\right]_{0}^{1} p_{j}\left(x^{\prime}, \pi\left(x^{\prime}\right)\right)\]
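
A minimal PyTorch sketch of this projection (the function name and tensor layout are assumptions, not the framework's API):

    import torch

    def project_distribution(next_probs, rewards, dones, gamma, support):
        """Project r + gamma * z_j onto the fixed support {z_i} (categorical projection).

        next_probs: (batch, n_atom) probabilities p_j(x', pi(x'))
        rewards:    (batch,) immediate rewards r
        dones:      (batch,) 1.0 if the episode terminated, else 0.0
        support:    (n_atom,) atoms z_i
        """
        n_atom = support.shape[0]
        v_min, v_max = support[0].item(), support[-1].item()
        delta_z = (v_max - v_min) / (n_atom - 1)

        # Bellman update T z_j = r + gamma * z_j, clipped to [V_MIN, V_MAX]
        tz = rewards.unsqueeze(1) + gamma * (1.0 - dones).unsqueeze(1) * support
        tz = tz.clamp(v_min, v_max)

        # Fractional position of each updated atom on the fixed support
        b = ((tz - v_min) / delta_z).clamp(0, n_atom - 1)
        lower, upper = b.floor().long(), b.ceil().long()

        # Split each probability p_j between the two neighboring atoms z_lower, z_upper
        target = torch.zeros_like(next_probs)
        target.scatter_add_(1, lower, next_probs * (upper.float() - b))
        target.scatter_add_(1, upper, next_probs * (b - lower.float()))
        # If b lands exactly on an atom, both weights above are zero; restore the mass.
        target.scatter_add_(1, lower, next_probs * (lower == upper).float())
        return target

The resulting target distribution is then matched against the predicted \(p_i(x, a)\) with a cross-entropy loss.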

Extensions

  • C51 can be combined with:
    • PER (Prioritized Experience Replay)

    • Multi-step TD-loss

    • Double (target) network

    • Dueling head

    • RNN

Implementation

Tip

Our benchmark result of C51 uses the same hyper-parameters as DQN, except for C51's exclusive hyper-parameter n_atom, which is empirically set to 51.

The default config of C51 is defined as follows:
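
A rough sketch of the kind of fields such a config typically contains (the key names below are illustrative assumptions, not the framework's exact ones):

    # Illustrative C51 config sketch; key names are assumptions and may not
    # match the framework's actual config.
    c51_config = dict(
        model=dict(
            obs_shape=[4, 84, 84],   # stacked Atari frames
            action_shape=6,
            v_min=-10,               # lower bound of the value support
            v_max=10,                # upper bound of the value support
            n_atom=51,               # number of atoms in the categorical distribution
        ),
        discount_factor=0.99,
        nstep=3,                     # multi-step TD target length
        learn=dict(
            batch_size=32,
            learning_rate=0.0001,
            target_update_freq=500,  # steps between target network syncs
        ),
        other=dict(
            eps=dict(start=1.0, end=0.05, decay=250000),  # eps-greedy schedule
        ),
    )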

The network interface C51 used is defined as follows:
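
A minimal PyTorch sketch of a C51-style head (an illustrative stand-in, not the framework's actual network class):

    import torch
    import torch.nn as nn

    class C51Head(nn.Module):
        """Maps an encoded observation to a categorical value distribution per action."""

        def __init__(self, feature_dim, num_actions, n_atom=51, v_min=-10.0, v_max=10.0):
            super().__init__()
            self.num_actions, self.n_atom = num_actions, n_atom
            self.fc = nn.Linear(feature_dim, num_actions * n_atom)
            self.register_buffer("support", torch.linspace(v_min, v_max, n_atom))

        def forward(self, feature):
            logits = self.fc(feature).view(-1, self.num_actions, self.n_atom)
            dist = torch.softmax(logits, dim=-1)   # p_i(x, a) over the atoms
            q = (dist * self.support).sum(-1)      # expected Q-values for action selection
            return {"q_value": q, "distribution": dist}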

Benchmark

Benchmark and comparison of the C51 algorithm:

  environment                                  | best mean reward | evaluation results               | config link   | comparison
  Pong (PongNoFrameskip-v4)                    | 20.6             | ../_images/c51_pong.png          | config_link_p | Tianshou (20)
  Qbert (QbertNoFrameskip-v4)                  | 20006            | ../_images/c51_qbert.png         | config_link_q | Tianshou (16245)
  SpaceInvaders (SpaceInvadersNoFrameskip-v4)  | 2766             | ../_images/c51_spaceinvaders.png | config_link_s | Tianshou (988.5)

P.S.:

  1. The above results are obtained by running the same configuration with five different random seeds (0, 1, 2, 3, 4).

  2. For discrete-action-space algorithms such as DQN, the Atari environment set (including sub-environments such as Pong) is generally used for testing, and performance is typically evaluated by the highest mean reward reached within 10M env_steps of training. For more details about Atari, please refer to the Atari Env Tutorial.

Other Public Implementations