
PPG

Overview

PPG was proposed in Phasic Policy Gradient. In earlier methods, one had to choose whether the policy and value function representations should be trained separately or share a network: separate training avoids interference between the two objectives, while sharing allows useful features to be reused by both. PPG achieves the best of both worlds by splitting the optimization into two phases, one that advances training and one that distills features.

Key Points

  1. PPG is a model-free, policy-based reinforcement learning algorithm.

  2. PPG supports both discrete and continuous action spaces.

  3. PPG supports both off-policy and on-policy modes.

  4. PPG has two value networks.

  5. In the DI-engine implementation, the off-policy PPG uses two buffers that differ only in their data-reuse constraint (the data "max_use").

Key Graphs

PPG uses separate policy and value networks to reduce interference between their objectives. The policy network includes an auxiliary value head, which is used to distill value knowledge into the policy network. The concrete network structure is shown below:

../_images/ppg_net_zh.png

Key Equations

The optimization in PPG consists of two phases, the policy phase and the auxiliary phase. In the policy phase, the policy network and the value network are updated in the same way as in PPO. In the auxiliary phase, value knowledge is distilled into the policy network with a joint loss:

\[\mathcal{L}^{\mathrm{joint}}=\mathcal{L}^{\mathrm{aux}}+\beta_{\mathrm{clone}} \cdot \hat{\mathbb{E}}_{t}\left[\mathrm{KL}\left[\pi_{\theta_{\mathrm{old}}}\left(\cdot \mid s_{t}\right), \pi_{\theta}\left(\cdot \mid s_{t}\right)\right]\right]\]

The joint loss optimizes the auxiliary objective (distillation) while preserving the original policy via the KL-divergence constraint (the second term). The auxiliary loss is defined as:

\[\mathcal{L}^{\mathrm{aux}}=\frac{1}{2} \cdot \hat{\mathbb{E}}_{t}\left[\left(V_{\theta_{\pi}}\left(s_{t}\right)-\hat{V}_{t}^{\mathrm{targ}}\right)^{2}\right]\]
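The joint loss can be written down directly in PyTorch. The sketch below is illustrative and not DI-engine's actual implementation; the tensor names (value_pred, value_target, old_logit, new_logit) and the beta_clone weight are assumptions for a discrete action space:

    import torch
    import torch.nn.functional as F

    def auxiliary_joint_loss(value_pred, value_target, old_logit, new_logit, beta_clone=1.0):
        # L^aux = 0.5 * E[(V_theta_pi(s_t) - V_targ)^2], the value distillation term
        aux_loss = 0.5 * F.mse_loss(value_pred, value_target)
        # Behavioral-cloning term: KL between the frozen old policy and the current policy
        old_dist = torch.distributions.Categorical(logits=old_logit)
        new_dist = torch.distributions.Categorical(logits=new_logit)
        kl = torch.distributions.kl_divergence(old_dist, new_dist).mean()
        # L^joint = L^aux + beta_clone * E[KL(pi_old || pi_theta)]
        return aux_loss + beta_clone * kl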

Pseudo-code

on-policy training procedure

The following flowchart shows how PPG alternates between the policy phase and the auxiliary phase.

../_images/PPG_zh.png

Note

In the auxiliary phase, PPG also performs additional training of the value network.
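The alternation in the flowchart can be summarized with the following illustrative loop; the names collect_data, ppo_update, and auxiliary_update, as well as the constants N_pi and E_aux, are placeholders rather than DI-engine APIs:

    # One outer iteration = N_pi policy-phase updates followed by one auxiliary phase.
    for phase in range(num_phases):
        replay = []
        for _ in range(N_pi):                      # policy phase
            batch = collect_data(policy)           # roll out the current policy
            ppo_update(policy, value_net, batch)   # clipped PPO policy loss + value loss
            replay.append(batch)                   # keep states/targets for the auxiliary phase
        for _ in range(E_aux):                     # auxiliary phase
            # joint loss (value distillation + KL to the old policy) plus extra value-network training
            auxiliary_update(policy, value_net, replay)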

off-policy training procedure

DI-engine implements PPG with two buffers that have different data-reuse constraints ("max_use"). The policy buffer provides data for the policy phase, while the value buffer provides data for the auxiliary phase. The overall training procedure is similar to off-policy PPO, except that an additional auxiliary phase is executed at a fixed frequency.
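A hedged sketch of what such a two-buffer configuration might look like; the key names (multi_buffer, policy, value, max_use) and the sizes are assumptions that should be checked against DI-engine's actual off-policy PPG config files:

    # Illustrative buffer section of an off-policy PPG config (key names and values are placeholders).
    buffer_config = dict(
        replay_buffer=dict(
            multi_buffer=True,
            policy=dict(                 # feeds the policy phase
                replay_buffer_size=100000,
                max_use=3,               # each sample reused at most 3 times
            ),
            value=dict(                  # feeds the auxiliary phase
                replay_buffer_size=100000,
                max_use=10,              # value data can be reused more often
            ),
        ),
    )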

Extensions

  • PPG can be combined with:

    • GAE or other advantage-estimation methods (a generic GAE sketch is given after this list)

    • Multiple replay buffers with different maximum data-reuse (max_use) limits

  • In the procgen environment, PPO (or PPG) + UCB-DrAC + PLR is one of the best-performing combinations.
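For reference, a minimal generic GAE(lambda) implementation (independent of DI-engine's own gae helper and its data format) might look like:

    import torch

    def compute_gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
        # rewards, values, dones: shape (T,); next_value: bootstrap value of the state after the last step
        advantages = torch.zeros_like(rewards)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            v_next = next_value if t == len(rewards) - 1 else values[t + 1]
            delta = rewards[t] + gamma * v_next * (1 - dones[t]) - values[t]
            gae = delta + gamma * lam * (1 - dones[t]) * gae
            advantages[t] = gae
        return advantages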

Implementations

The default config is defined as follows:

class ding.policy.ppg.PPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of PPG algorithm. PPG is a policy gradient algorithm with auxiliary phase training. The auxiliary phase training is proposed to distill the value into the policy network, while making sure the policy network does not change the action predictions (kl div loss). Paper link: https://arxiv.org/abs/2009.04416.

Interface:

_init_learn, _data_preprocess_learn, _forward_learn, _state_dict_learn, _load_state_dict_learn, _init_collect, _forward_collect, _process_transition, _get_train_sample, _get_batch_size, _init_eval, _forward_eval, default_model, _monitor_vars_learn, learn_aux.

Config:

ID | Symbol | Type | Default Value | Description | Other (Shape)
1 | type | str | ppg | RL policy register name, refer to registry POLICY_REGISTRY | this arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for the network | this arg can differ between modes
3 | on_policy | bool | True | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling weight to correct the biased update | IS weight
6 | learn.update_per_collect | int | 5 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | this arg can vary across envs; a bigger value means more off-policy
7 | learn.value_weight | float | 1.0 | The loss weight of the value network | policy network weight is set to 1
8 | learn.entropy_weight | float | 0.01 | The loss weight of entropy regularization | policy network weight is set to 1
9 | learn.clip_ratio | float | 0.2 | PPO clip ratio |
10 | learn.adv_norm | bool | False | Whether to use advantage normalization over a whole training batch |
11 | learn.aux_freq | int | 5 | The frequency (in normal update steps) of auxiliary phase training |
12 | learn.aux_train_epoch | int | 6 | The number of training epochs of the auxiliary phase |
13 | learn.aux_bc_weight | int | 1 | The loss weight of behavioral cloning in the auxiliary phase |
14 | collect.discount_factor | float | 0.99 | Reward's future discount factor, aka. gamma | may be 1 in sparse-reward envs
15 | collect.gae_lambda | float | 0.95 | GAE lambda factor balancing bias and variance (1-step TD vs. MC) |
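As an illustration only, the fields in the table above could be assembled into a config fragment like the one below; in practice it is merged with PPGPolicy.default_config() plus model and environment settings before the policy is constructed, and the exact workflow should be checked against DI-engine's example configs:

    from easydict import EasyDict

    # Illustrative PPG config fragment built from the fields documented above.
    ppg_config = EasyDict(dict(
        type='ppg',
        cuda=False,
        on_policy=True,
        learn=dict(
            update_per_collect=5,
            value_weight=1.0,
            entropy_weight=0.01,
            clip_ratio=0.2,
            adv_norm=False,
            aux_freq=5,
            aux_train_epoch=6,
            aux_bc_weight=1,
        ),
        collect=dict(
            discount_factor=0.99,
            gae_lambda=0.95,
        ),
    ))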

The network used by PPG is defined as follows:

class ding.model.template.ppg.PPG(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, impala_cnn_encoder: bool = False)[source]
Overview:

Phasic Policy Gradient (PPG) model from the paper Phasic Policy Gradient (https://arxiv.org/abs/2009.04416). This module contains the VAC module and an auxiliary critic module.

Interfaces:

forward, compute_actor, compute_critic, compute_actor_critic

compute_actor(x: Tensor) Dict[source]
Overview:

Use actor to compute action logits.

Arguments:
  • x (torch.Tensor): The input observation tensor data.

Returns:
  • output (Dict): The output data containing action logits.

ReturnsKeys:
  • logit (torch.Tensor): The predicted action logit tensor. For a discrete action space it is a real-valued tensor with one entry per possible action; for a continuous action space it contains the mu and sigma of a Gaussian distribution, with one mu/sigma pair per continuous action dimension. A hybrid action space combines discrete and continuous actions, so the logit is a dict with action_type and action_args.

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.

  • output (Dict): logit: \((B, A)\), where B is batch size and A is the action space size.

compute_actor_critic(x: Tensor) Dict[source]
Overview:

Use actor and critic to compute action logits and value.

Arguments:
  • x (torch.Tensor): The input observation tensor data.

Returns:
  • outputs (Dict): The output dict of PPG’s forward computation graph for both actor and critic, including logit and value.

ReturnsKeys:
  • logit (torch.Tensor): The predicted action logit tensor. For a discrete action space it is a real-valued tensor with one entry per possible action; for a continuous action space it contains the mu and sigma of a Gaussian distribution, with one mu/sigma pair per continuous action dimension. A hybrid action space combines discrete and continuous actions, so the logit is a dict with action_type and action_args.

  • value (torch.Tensor): The predicted state value tensor.

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.

  • output (Dict): value: \((B, 1)\), where B is batch size.

  • output (Dict): logit: \((B, A)\), where B is batch size and A is the action space size.

Note

The compute_actor_critic interface aims to save computation when the actor and critic share the same encoder.

compute_critic(x: Tensor) Dict[source]
Overview:

Use critic to compute value.

Arguments:
  • x (torch.Tensor): The input observation tensor data.

Returns:
  • output (Dict): The output dict of VAC’s forward computation graph for critic, including value.

ReturnsKeys:
  • necessary: value

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.

  • output (Dict): value: \((B, 1)\), where B is batch size.
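A small usage sketch of the interfaces above (the observation size, action size, and batch size are illustrative):

    import torch
    from ding.model.template.ppg import PPG

    # Discrete action space: 8-dimensional observation, 4 actions.
    model = PPG(obs_shape=8, action_shape=4, action_space='discrete')
    obs = torch.randn(16, 8)                     # batch of 16 observations

    actor_out = model.compute_actor(obs)         # {'logit': (B, A)}
    critic_out = model.compute_critic(obs)       # {'value': (B, 1)}
    both = model.compute_actor_critic(obs)       # shares one encoder pass for both heads
    print(actor_out['logit'].shape, critic_out['value'].shape)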

Benchmark

Benchmark and comparison of PPG algorithm

environment | best mean reward | evaluation results | config link | comparison
Pong (PongNoFrameskip-v4) | 20 | ../_images/ppg_pong.png | config_link_p | DI-engine PPO off-policy (20)
Qbert (QbertNoFrameskip-v4) | 17775 | ../_images/ppg_qbert.png | config_link_q | DI-engine PPO off-policy (16400)
SpaceInvaders (SpaceInvadersNoFrameskip-v4) | 1213 | ../_images/ppg_spaceinvaders.png | config_link_s | DI-engine PPO off-policy (1200)

References

Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman: “Phasic Policy Gradient”, 2020; arXiv:2009.04416.

Other open-source implementations