C51

Overview

C51 was first proposed in A Distributional Perspective on Reinforcement Learning. Unlike previous work, C51 estimates the full distribution of the Q-value rather than only its expectation. The authors design a distributional Bellman operator that preserves the multimodality of the value distribution, which is believed to enable more stable learning and to mitigate the negative effects of learning from a non-stationary policy.

Key Points

  1. C51 is a model-free and value-based RL algorithm.

  2. C51 only supports discrete action spaces.

  3. C51 is an off-policy algorithm.

  4. Usually, C51 uses eps-greedy or multinomial sampling for exploration (see the minimal sketch right after this list).

  5. C51 can be combined with recurrent neural networks (RNN).
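
A minimal eps-greedy sketch for point 4 (illustrative only; eps_greedy_action is a hypothetical helper rather than a DI-engine API, and q_values stands for the expected Q-values, i.e. the logit output of the C51 network):

import random

import torch

def eps_greedy_action(q_values: torch.Tensor, eps: float) -> int:
    """With probability eps pick a uniformly random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.randrange(q_values.shape[0])  # explore
    return int(q_values.argmax().item())            # exploit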

Pseudo-code

../_images/C51.png

Note

C51 models the value distribution with a discrete distribution whose support set consists of N atoms: \(z_i = V_{\min} + i \Delta z, \; i = 0, 1, \ldots, N-1\), where \(\Delta z = (V_{\max} - V_{\min}) / (N - 1)\). Each atom \(z_i\) has a parameterized probability \(p_i\). The Bellman update of C51 projects the distribution of \(r + \gamma z_j^{\left(t+1\right)}\) onto the support \(z_i^t\).
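
For example, with the default values from the config below (v_min = -10, v_max = 10, n_atom = 51), the support and the expected Q-value can be computed as follows (a minimal sketch, not DI-engine code):

import torch

v_min, v_max, n_atom = -10.0, 10.0, 51          # defaults from the config table below
delta_z = (v_max - v_min) / (n_atom - 1)        # spacing between neighbouring atoms
support = torch.linspace(v_min, v_max, n_atom)  # z_i = v_min + i * delta_z
probs = torch.full((n_atom,), 1.0 / n_atom)     # a parameterized probability p_i per atom (uniform here)
q_value = (support * probs).sum()               # the Q-value is the expectation of the distribution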

Key Equations or Key Graphs

The Bellman target of C51 is obtained by projecting the return distribution \(r + \gamma z_j\) onto the current support \(z_i\). Given a sampled transition \((x, a, r, x')\), we compute the Bellman update \(\hat{T} z_j := r + \gamma z_j\) for each atom \(z_j\), and then distribute its probability \(p_{j}(x', \pi(x'))\) to the neighbouring atoms \(p_{i}(x, \pi(x))\):

\[\left(\Phi \hat{T} Z_{\theta}(x, a)\right)_{i}=\sum_{j=0}^{N-1}\left[1-\frac{\left|\left[\hat{\mathcal{T}} z_{j}\right]_{V_{\mathrm{MIN}}}^{V_{\mathrm{MAX}}}-z_{i}\right|}{\Delta z}\right]_{0}^{1} p_{j}\left(x^{\prime}, \pi\left(x^{\prime}\right)\right)\]
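
A minimal PyTorch sketch of this projection (illustrative only, not DI-engine's internal implementation; the function name categorical_projection and its argument layout are assumptions):

import torch

def categorical_projection(next_dist, rewards, dones, gamma, support, v_min, v_max):
    """Project r + gamma * z_j onto the fixed support {z_i}, as in the equation above.

    next_dist: (B, n_atom) probabilities p_j(x', pi(x'))
    rewards:   (B,) immediate rewards r
    dones:     (B,) 1.0 for terminal transitions, else 0.0
    support:   (n_atom,) atom values z_i
    """
    n_atom = support.shape[0]
    delta_z = (v_max - v_min) / (n_atom - 1)

    # \hat{T} z_j = r + gamma * z_j, clipped to [v_min, v_max]
    tz = rewards.unsqueeze(1) + gamma * (1.0 - dones).unsqueeze(1) * support.unsqueeze(0)
    tz = tz.clamp(v_min, v_max)

    # Fractional position b of each projected atom on the fixed support
    b = (tz - v_min) / delta_z
    lower, upper = b.floor().long(), b.ceil().long()

    # Split the probability mass p_j between the two neighbouring atoms z_lower and z_upper
    proj = torch.zeros_like(next_dist)
    proj.scatter_add_(1, lower, next_dist * (upper.float() - b))
    proj.scatter_add_(1, upper, next_dist * (b - lower.float()))
    # If b lands exactly on an atom, lower == upper and both terms above are zero,
    # so put the whole mass back onto that atom.
    proj.scatter_add_(1, lower, next_dist * (lower == upper).float())
    return proj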

Extensions

  • C51 can be combined with:
    • Prioritized Experience Replay (PER)

    • Multi-step temporal-difference (TD) loss (see the small n-step sketch after this list)

    • Double Target Network

    • Dueling head

    • Recurrent neural networks (RNN)
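
For the multi-step TD loss, the n-step target simply replaces the single reward r with a discounted sum of n consecutive rewards before the projection (a small illustrative helper, not DI-engine code):

def n_step_return(rewards, gamma):
    """Compute sum_{k=0}^{n-1} gamma^k * r_k over n consecutive rewards."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

# e.g. a 3-step target uses n_step_return([r0, r1, r2], gamma) plus gamma**3 times
# the projected value distribution at the state three steps ahead.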

Implementation

Tip

Our benchmark results for C51 use the same hyperparameters as DQN, except for n_atom, which is specific to C51. This is also the common way to configure C51.

The default config of C51 is defined as follows:

class ding.policy.c51.C51Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of C51 algorithm.

Config:

ID | Symbol | Type | Default Value | Description | Other (Shape)
1 | type | str | c51 | RL policy register name, refer to registry POLICY_REGISTRY | this arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | this arg can be different from modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | priority sample, update priority
5 | model.v_min | float | -10 | Value of the smallest atom in the support set |
6 | model.v_max | float | 10 | Value of the largest atom in the support set |
7 | model.n_atom | int | 51 | Number of atoms in the support set of the value distribution |
8 | other.eps.start | float | 0.95 | Start value for epsilon decay |
9 | other.eps.end | float | 0.1 | End value for epsilon decay |
10 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | may be 1 in sparse reward envs
11 | nstep | int | 1 | N-step reward discount sum for target q_value estimation |
12 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after one collection by the collector; only valid in serial training | this arg can vary across envs; a bigger value means more off-policy
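
For instance, overriding only the distribution-specific keys from the table above might look like this (a hedged sketch: the nesting simply mirrors the dotted names in the table and is not the complete DI-engine config schema):

from easydict import EasyDict

c51_config = EasyDict(dict(
    cuda=False,
    priority=False,
    model=dict(
        v_min=-10,   # smallest atom in the support set
        v_max=10,    # largest atom in the support set
        n_atom=51,   # the only C51-specific knob relative to DQN
    ),
    discount_factor=0.97,
    nstep=1,
))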

The network interface used by C51 is defined as follows:

class ding.model.template.q_learning.C51DQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51)[source]
Overview:

The neural network structure and computation graph of C51DQN, which combines distributional RL and DQN. You can refer to https://arxiv.org/pdf/1707.06887.pdf for more details. C51DQN is composed of an encoder and a head: the encoder extracts features from the observation, and the head computes the distribution of the Q-value.

Interfaces:

__init__, forward

Note

Current C51DQN supports two types of encoder: FCEncoder and ConvEncoder.

forward(x: Tensor) → Dict[source]
Overview:

C51DQN forward computation graph, input observation tensor to predict q_value and its distribution.

Arguments:
  • x (torch.Tensor): The input observation tensor data.

Returns:
  • outputs (Dict): The output of C51DQN’s forward, including q_value and distribution.

ReturnsKeys:
  • logit (torch.Tensor): Discrete Q-value output of each possible action dimension.

  • distribution (torch.Tensor): Q-Value discretized distribution, i.e., probability of each uniformly spaced atom Q-value, such as dividing [-10, 10] into 51 uniform spaces.

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.

  • logit (torch.Tensor): \((B, M)\), where M is action_shape.

  • distribution(torch.Tensor): \((B, M, P)\), where P is n_atom.

Examples:
>>> model = C51DQN(128, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 128)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> # default head_hidden_size: int = 64,
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int = 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])

Note

For consistency and compatibility, we name all the outputs of the network which are related to action selections as logit.

Note

For convenience, we recommend that the number of atoms be odd, so that the middle atom lies exactly at the midpoint of the value range.

Benchmark

Benchmark and comparison of c51 algorithm

environment | best mean reward | evaluation results | config link | comparison
Pong (PongNoFrameskip-v4) | 20.6 | ../_images/c51_pong.png | config_link_p | Tianshou(20)
Qbert (QbertNoFrameskip-v4) | 20006 | ../_images/c51_qbert.png | config_link_q | Tianshou(16245)
SpaceInvaders (SpaceInvadersNoFrameskip-v4) | 2766 | ../_images/c51_spaceinvaders.png | config_link_s | Tianshou(988.5)

P.S.:

  1. The above results were obtained by running the same configuration with five different random seeds (0, 1, 2, 3, 4).

  2. For algorithms with discrete action spaces such as DQN, the Atari environment suite (including the sub-environment Pong) is commonly used for testing, and Atari environments are usually evaluated by the highest mean reward achieved within 10M environment steps of training. For more details about Atari, please refer to the Atari Env Tutorial.

Other Open-Source Implementations