PPG

Overview

PPG was proposed in the paper Phasic Policy Gradient. In prior methods, one must choose between a shared network and separate networks to represent the policy and the value function. Separate networks avoid interference between objectives, while a shared network allows useful features to be shared. PPG achieves the best of both worlds by splitting optimization into two phases: one that advances training and one that distills features.

Quick Facts

  1. PPG is a model-free and policy-based RL algorithm.

  2. PPG supports both discrete and continuous action spaces.

  3. PPG supports off-policy mode and on-policy mode.

  4. There are two value networks in PPG.

  5. In the DI-engine implementation, off-policy PPG uses two buffers, which differ only in their maximum data usage limit (max_use).

Key Graphs

PPG uses disjoint policy and value networks to reduce interference between objectives. The policy network includes an auxiliary value head, which is used to distill value knowledge into the policy network. The concrete network architecture is shown below:

../_images/ppg_net.png

Key Equations

The optimization of PPG alternates between two phases: a policy phase and an auxiliary phase. During the policy phase, the policy network and the value network are updated similarly to PPO. During the auxiliary phase, the value knowledge is distilled into the policy network with the joint loss:

\[L^{joint}=L^{aux}+\beta_{clone} \cdot \hat{\mathbb{E}}_{t}\left[KL\left[\pi_{\theta_{old}}\left(\cdot \mid s_{t}\right), \pi_{\theta}\left(\cdot \mid s_{t}\right)\right]\right]\]

The joint loss optimizes the auxiliary objective (distillation) while preserving the original policy through the KL-divergence constraint (i.e. the second term). The auxiliary loss is defined as:

\[L^{aux}=\frac{1}{2} \cdot \hat{\mathbb{E}}_{t}\left[\left(V_{\theta_{\pi}}\left(s_{t}\right)-\hat{V}_{t}^{\mathrm{targ}}\right)^{2}\right]\]
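The two losses above can be checked by hand on a toy batch. The following is a minimal pure-Python sketch, not DI-engine's implementation: the function names (kl_divergence, aux_loss, joint_loss) are hypothetical, the policy is assumed to be a discrete distribution given as a list of probabilities, and beta_clone is assumed to be 1.

```python
import math


def kl_divergence(p_old, p_new):
    """KL[p_old || p_new] for two discrete distributions given as probability lists."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def aux_loss(aux_values, value_targets):
    """0.5 * mean squared error between auxiliary value head outputs and value targets."""
    n = len(aux_values)
    return 0.5 * sum((v - t) ** 2 for v, t in zip(aux_values, value_targets)) / n


def joint_loss(aux_values, value_targets, old_probs, new_probs, beta_clone=1.0):
    """L_joint = L_aux + beta_clone * mean over t of KL[pi_old(.|s_t), pi_new(.|s_t)]."""
    kl = sum(kl_divergence(po, pn) for po, pn in zip(old_probs, new_probs)) / len(old_probs)
    return aux_loss(aux_values, value_targets) + beta_clone * kl


# Toy batch of two states (illustrative numbers only).
aux_values = [1.0, 0.0]        # outputs of the auxiliary value head
value_targets = [0.0, 0.0]     # value targets
old_probs = [[0.5, 0.5], [0.9, 0.1]]
shifted_probs = [[0.6, 0.4], [0.8, 0.2]]

# Unchanged policy: the KL term vanishes and only the auxiliary loss remains.
loss_unchanged = joint_loss(aux_values, value_targets, old_probs, old_probs)
# Changed policy: the KL term adds a positive penalty.
loss_shifted = joint_loss(aux_values, value_targets, old_probs, shifted_probs)
```

When the policy is unchanged the joint loss reduces to the auxiliary loss, which is why the KL term acts purely as a penalty on drifting away from the pre-auxiliary-phase policy.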

Pseudo-code

on-policy training procedure

The following flow chart shows how PPG alternates between the policy phase and the auxiliary phase:

../_images/PPG.png

Note

During the auxiliary phase, PPG also takes the opportunity to perform additional training on the value network.
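The alternation between phases can be sketched as a simple schedule. This is an illustrative sketch only: ppg_schedule is a hypothetical helper, and it assumes that the auxiliary phase is triggered after every aux_freq policy-phase updates and runs for aux_train_epoch epochs, matching the meaning of the config fields of the same names.

```python
def ppg_schedule(num_iterations, aux_freq=5, aux_train_epoch=6):
    """Return the ordered list of phase steps for a toy PPG run."""
    steps = []
    for iteration in range(1, num_iterations + 1):
        # Policy phase: PPO-style update of the policy and value networks.
        steps.append("policy")
        if iteration % aux_freq == 0:
            # Auxiliary phase: several epochs of joint-loss distillation,
            # plus additional training of the value network.
            steps.extend(["aux"] * aux_train_epoch)
    return steps


schedule = ppg_schedule(num_iterations=10, aux_freq=5, aux_train_epoch=2)
num_policy_steps = schedule.count("policy")
num_aux_steps = schedule.count("aux")
```

With these toy settings, the first auxiliary epochs appear directly after the fifth policy update, and again after the tenth.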

off-policy training procedure

DI-engine also implements off-policy PPG with two buffers that have different data-use constraints (max_use): the policy buffer supplies data for the policy phase, while the value buffer supplies data for the auxiliary phase. The overall training procedure is similar to off-policy PPO, but executes an additional auxiliary phase at a fixed frequency.
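To make the role of max_use concrete, here is a toy buffer that discards a sample once it has been drawn max_use times. MaxUseBuffer is a hypothetical class written for this illustration and is not DI-engine's buffer API; the max_use values 1 and 3 are arbitrary.

```python
import random


class MaxUseBuffer:
    """Toy replay buffer that drops a sample after it has been used max_use times."""

    def __init__(self, max_use, seed=0):
        self.max_use = max_use
        self._items = []  # each entry is [sample, use_count]
        self._rng = random.Random(seed)

    def push(self, sample):
        self._items.append([sample, 0])

    def sample(self, batch_size):
        batch = self._rng.sample(self._items, min(batch_size, len(self._items)))
        for entry in batch:
            entry[1] += 1
        # Evict every sample that has reached its usage limit.
        self._items = [e for e in self._items if e[1] < self.max_use]
        return [e[0] for e in batch]

    def __len__(self):
        return len(self._items)


# Two buffers receive the same transitions but differ only in max_use.
policy_buffer = MaxUseBuffer(max_use=1)
value_buffer = MaxUseBuffer(max_use=3)
for transition in range(4):
    policy_buffer.push(transition)
    value_buffer.push(transition)

policy_batch = policy_buffer.sample(4)
value_batch = value_buffer.sample(4)
policy_left = len(policy_buffer)   # every sample was used once and evicted
value_left = len(value_buffer)     # samples can still be drawn twice more
```

The same data therefore stays available to one phase for longer than to the other, which is the only difference between the two buffers.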

Extensions

  • PPG can be combined with:

    • GAE or other advantage estimation methods

    • Multiple buffers with different max_use values

  • PPO (or PPG) + UCB-DrAC + PLR is one of the most powerful methods in the Procgen environment.
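As a reminder of what GAE computes, the following is a minimal pure-Python sketch of generalized advantage estimation over a single trajectory segment. It is not DI-engine's implementation: gae_advantages is a hypothetical function, and it assumes that the segment contains no episode termination and that next_value is the value estimate of the state following the last step.

```python
def gae_advantages(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory segment without terminations."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        # One-step TD error at time t.
        delta = rewards[t] + gamma * v_next - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.0, 0.0]

# lam = 0 keeps only the one-step TD error (low variance, more bias).
td_like = gae_advantages(rewards, values, next_value=0.0, gamma=1.0, lam=0.0)
# lam = 1 recovers the Monte Carlo return minus the value (more variance, less bias).
mc_like = gae_advantages(rewards, values, next_value=0.0, gamma=1.0, lam=1.0)
```

The two extreme settings of lam show the bias-variance trade-off that the gae_lambda config field controls.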

Implementation

The default config is defined as follows:

class ding.policy.ppg.PPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)
Overview:

Policy class of the PPG algorithm. PPG is a policy gradient algorithm with an auxiliary training phase. The auxiliary phase distills the value function into the policy network while ensuring, through a KL-divergence loss, that the policy network's action predictions do not change. Paper link: https://arxiv.org/abs/2009.04416.

Interface:

_init_learn, _data_preprocess_learn, _forward_learn, _state_dict_learn, _load_state_dict_learn, _init_collect, _forward_collect, _process_transition, _get_train_sample, _get_batch_size, _init_eval, _forward_eval, default_model, _monitor_vars_learn, learn_aux.

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
|----|--------|------|---------------|-------------|---------------|
| 1 | type | str | ppg | RL policy register name; refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for the network | This arg can be different from modes |
| 3 | on_policy | bool | True | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling weight to correct biased update | IS weight |
| 6 | learn.update_per_collect | int | 5 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy |
| 7 | learn.value_weight | float | 1.0 | The loss weight of the value network | Policy network weight is set to 1 |
| 8 | learn.entropy_weight | float | 0.01 | The loss weight of entropy regularization | Policy network weight is set to 1 |
| 9 | learn.clip_ratio | float | 0.2 | PPO clip ratio | |
| 10 | learn.adv_norm | bool | False | Whether to use advantage normalization over a whole training batch | |
| 11 | learn.aux_freq | int | 5 | The frequency (in normal update times) of auxiliary phase training | |
| 12 | learn.aux_train_epoch | int | 6 | The training epochs of the auxiliary phase | |
| 13 | learn.aux_bc_weight | int | 1 | The loss weight of behavioral cloning in the auxiliary phase | |
| 14 | collect.discount_factor | float | 0.99 | Reward's future discount factor, aka gamma | May be 1 in sparse-reward envs |
| 15 | collect.gae_lambda | float | 0.95 | GAE lambda factor for the balance of bias and variance (1-step TD and MC) | |
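The defaults listed in the config table can be written out as a nested Python dict. This is a sketch under assumptions: the nesting is inferred from the dotted symbol names (for example learn.aux_freq), the variable names are hypothetical, and the real default config may contain additional fields that the table does not list.

```python
import copy

# Default values transcribed from the config table; nesting inferred from dotted names.
ppg_default_config = dict(
    type='ppg',
    cuda=False,
    on_policy=True,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=5,
        value_weight=1.0,
        entropy_weight=0.01,
        clip_ratio=0.2,
        adv_norm=False,
        aux_freq=5,
        aux_train_epoch=6,
        aux_bc_weight=1,
    ),
    collect=dict(
        discount_factor=0.99,
        gae_lambda=0.95,
    ),
)

# Override a few fields on a copy so that the defaults stay untouched.
custom_config = copy.deepcopy(ppg_default_config)
custom_config['learn']['aux_freq'] = 8          # run the auxiliary phase less often
custom_config['collect']['discount_factor'] = 1.0  # e.g. for a sparse-reward env
```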

The network interface used by PPG is defined as follows:

class ding.model.template.ppg.PPG(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, impala_cnn_encoder: bool = False)
Overview:

Phasic Policy Gradient (PPG) model from the paper Phasic Policy Gradient (https://arxiv.org/abs/2009.04416). This module contains a VAC module and an auxiliary critic module.

Interfaces:

forward, compute_actor, compute_critic, compute_actor_critic

compute_actor(x: Tensor) → Dict
Overview:

Use actor to compute action logits.

Arguments:
  • x (torch.Tensor): The input observation tensor data.

Returns:
  • output (Dict): The output data containing action logits.

ReturnsKeys:
  • logit (torch.Tensor): The predicted action logit tensor. For a discrete action space, it is a real-valued tensor with one entry per possible action. For a continuous action space, it holds the mu and sigma of the Gaussian distribution, with one mu and one sigma per continuous action dimension. A hybrid action space combines discrete and continuous action spaces, so the logit is a dict with action_type and action_args.

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.

  • output (Dict): logit: \((B, A)\), where B is batch size and A is the action space size.

compute_actor_critic(x: Tensor) → Dict
Overview:

Use actor and critic to compute action logits and value.

Arguments:
  • x (torch.Tensor): The input observation tensor data.

Returns:
  • outputs (Dict): The output dict of PPG’s forward computation graph for both actor and critic, including logit and value.

ReturnsKeys:
  • logit (torch.Tensor): The predicted action logit tensor. For a discrete action space, it is a real-valued tensor with one entry per possible action. For a continuous action space, it holds the mu and sigma of the Gaussian distribution, with one mu and one sigma per continuous action dimension. A hybrid action space combines discrete and continuous action spaces, so the logit is a dict with action_type and action_args.

  • value (torch.Tensor): The predicted state value tensor.

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.

  • output (Dict): value: \((B, 1)\), where B is batch size.

  • output (Dict): logit: \((B, A)\), where B is batch size and A is the action space size.

Note

The compute_actor_critic interface is intended to save computation when the actor and critic share an encoder.

compute_critic(x: Tensor) → Dict
Overview:

Use critic to compute value.

Arguments:
  • x (torch.Tensor): The input observation tensor data.

Returns:
  • output (Dict): The output dict of VAC’s forward computation graph for critic, including value.

ReturnsKeys:
  • value (torch.Tensor): The predicted state value tensor.

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.

  • output (Dict): value: \((B, 1)\), where B is batch size.

Benchmark

Benchmark and comparison of PPG algorithm

| environment | best mean reward | evaluation results | config link | comparison |
|-------------|------------------|--------------------|-------------|------------|
| Pong (PongNoFrameskip-v4) | 20 | ../_images/ppg_pong.png | config_link_p | DI-engine PPO off-policy (20) |
| Qbert (QbertNoFrameskip-v4) | 17775 | ../_images/ppg_qbert.png | config_link_q | DI-engine PPO off-policy (16400) |
| SpaceInvaders (SpaceInvadersNoFrameskip-v4) | 1213 | ../_images/ppg_spaceinvaders.png | config_link_s | DI-engine PPO off-policy (1200) |

References

Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman: “Phasic Policy Gradient”, 2020; arXiv:2009.04416.

Other Public Implementations