QRDQN¶

概述¶

QR (Quantile Regression, 分位数回归) DQN 在 Distributional Reinforcement Learning with Quantile Regression 中被提出，它继承了学习 q 值分布的思想。与使用离散原子来近似分布密度函数不同， QRDQN 直接回归 q 值的一组离散分位数。

核心要点¶

QRDQN 是一种 无模型（model-free） 和 基于值（value-based） 的强化学习算法。
QRDQN 仅支持 离散动作空间 。
QRDQN 是一种 异策略（off-policy） 算法。
通常情况下， QRDQN 使用 eps-greedy 或 多项式采样 进行探索。
QRDQN 可以与循环神经网络 (RNN) 结合使用。

关键方程或关键框图¶

C51 (Categorical 51) 使用N个固定位置来近似其概率分布，并调整它们的概率，而 QRDQN 将固定的均匀概率分配给N个可调整的位置。基于这一点， QRDQN 使用分位数回归来随机调整分布的位置，以使其与目标分布的 Wasserstein 距离最小化。

分位数回归损失是一种非对称凸损失函数，用于量化回归问题。对于给定的分位数 \(\tau \in [0, 1]\) ，该损失函数以权重 \(\tau\) 惩罚过估计误差，以权重 \(1−\tau\) 惩罚欠估计误差. 对于一个分布 \(Z\) 和给定的分位数 \(\tau\)，分位数函数 \(F_Z^{−1}(\tau)\) 的值可以被描述为分位数回归损失的最小化器：

\[\begin{split}\begin{array}{r} \mathcal{L}_{\mathrm{QR}}^{\tau}(\theta):=\mathbb{E}_{\hat{z} \sim Z}\left[\rho_{\tau}(\hat{Z}-\theta)\right], \text { where } \\ \rho_{\tau}(u)=u\left(\tau-\delta_{\{u<0\}}\right), \forall u \in \mathbb{R} \end{array}\end{split}\]

上述提到的损失在零点处不平滑，这可能会限制在使用非线性函数逼近时的性能。因此，在 QRDQN 的 Bellman 更新过程中应用了一种修改后的分位数 Huber 损失，称为 quantile huber loss 损失(即伪代码中的方程式10)。

\[\rho^{\kappa}_{\tau}(u)=L_{\kappa}(u)\lvert \tau-\delta_{\{u<0\}} \rvert\]

在这里 \(L_{\kappa}\) 是 Huber 损失.

Note

与 DQN 相比， QRDQN 具有以下区别:

神经网络架构: QRDQN 的输出层大小为M x N，其中M是离散动作空间的大小，N是一个超参数，表示分位数目标的数量。

使用分位数 Huber 损失替代 DQN 损失函数。

在原始的 QRDQN 论文中，将 RMSProp 优化器替换为 Adam 优化器。而在 DI-engine 中，我们始终使用 Adam 优化器。

伪代码¶

扩展¶

QRDQN可以与以下技术相结合使用:
- 优先经验回放 (Prioritized Experience Replay)
- 多步时序差分 (TD)损失
- 双目标网络 (Double Target Network)
- 循环神经网络 (RNN)

实现¶

Tip

在我们的基准结果中， QRDQN 使用与 DQN 相同的超参数，除了 QRDQN 的专属超参数——“分位数的数量” ，该超参数经验性地设置为32。

QRDQN 的默认配置可以如下定义:

class ding.policy.qrdqn.QRDQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]

Overview:

Policy class of QRDQN algorithm. QRDQN (https://arxiv.org/pdf/1710.10044.pdf) is a distributional RL algorithm, which is an extension of DQN. The main idea of QRDQN is to use quantile regression to estimate the quantile of the distribution of the return value, and then use the quantile to calculate the quantile loss.

Config:

ID	Symbol	Type	Default Value	Description	Other(Shape)
1	`type`	str	qrdqn	RL policy register name, refer to registry `POLICY_REGISTRY`	this arg is optional, a placeholder
2	`cuda`	bool	False	Whether to use cuda for network	this arg can be diff- erent from modes
3	`on_policy`	bool	False	Whether the RL algorithm is on-policy or off-policy
4	`priority`	bool	True	Whether use priority(PER)	priority sample, update priority
6	`other.eps` `.start`	float	0.05	Start value for epsilon decay. It’s small because rainbow use noisy net.
7	`other.eps` `.end`	float	0.05	End value for epsilon decay.
8	`discount_` `factor`	float	0.97, [0.95, 0.999]	Reward’s future discount factor, aka. gamma	may be 1 when sparse reward env
9	`nstep`	int	3, [3, 5]	N-step reward discount sum for target q_value estimation
10	`learn.update` `per_collect`	int	3	How many updates(iterations) to train after collector’s one collection. Only valid in serial training	this args can be vary from envs. Bigger val means more off-policy
11	`learn.kappa`	float	/	Threshold of Huber loss

QRDQN 使用的网络接口可以如下定义:

class ding.model.template.q_learning.QRDQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None)[source]

Overview:: The neural network structure and computation graph of QRDQN, which combines distributional RL and DQN. You can refer to Distributional Reinforcement Learning with Quantile Regression https://arxiv.org/pdf/1710.10044.pdf for more details.
Interfaces:: __init__, forward

forward(x: Tensor) → Dict[source]

Overview:

Use observation tensor to predict QRDQN’s output. Parameter updates with QRDQN’s MLPs forward setup.

Arguments:

x (torch.Tensor):
The encoded embedding tensor with (B, N=hidden_size).

Returns:

outputs (Dict):
Run with encoder and head. Return the result prediction dictionary.

ReturnsKeys:

logit (torch.Tensor): Logit tensor with same size as input x.
q (torch.Tensor): Q valye tensor tensor of size (B, N, num_quantiles)
tau (torch.Tensor): tau tensor of size (B, N, 1)

Shapes:

x (torch.Tensor): \((B, N)\), where B is batch size and N is head_hidden_size.
logit (torch.FloatTensor): \((B, M)\), where M is action_shape.
tau (torch.Tensor): \((B, M, 1)\)

Examples:

>>> model = QRDQN(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles : int = 32
>>> assert outputs['q'].shape == torch.Size([4, 64, 32])
>>> assert outputs['tau'].shape == torch.Size([4, 32, 1])

QRDQN 的贝尔曼更新在ding/rl_utils/td.py模块的qrdqn_nstep_td_error函数中实现。

基准¶

Benchmark and comparison of QRDQN algorithm¶
environment	best mean reward	config link	comparison
Pong (PongNoFrameskip-v4)	20	config_link_p	Tianshou (20)
Qbert (QbertNoFrameskip-v4)	18306	config_link_q	Tianshou (14990)
SpaceInvaders (SpaceInvadersNoFrame skip-v4)	2231	config_link_s	Tianshou (938)

P.S.:

上述结果是通过在五个不同的随机种子 (0, 1, 2, 3, 4)上运行相同的配置获得的。
对于像 QRDQN 这样的离散动作空间算法，通常使用 Atari 环境集进行测试(包括子环境 Pong ) ，而 Atari 环境通常通过训练10M个环境步骤的最高平均奖励来评估。有关 Atari 的更多详细信息, 请参阅 Atari Env Tutorial .

参考文献¶

(QRDQN) Will Dabney, Mark Rowland, Marc G. Bellemare, Rémi Munos: “Distributional Reinforcement Learning with Quantile Regression”, 2017; arXiv:1710.10044. https://arxiv.org/pdf/1710.10044

其他开源实现¶

Tianshou