PPG¶
Overview¶
PPG was proposed in the paper Phasic Policy Gradient. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives, while using a shared network allows useful features to be shared. PPG achieves the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features.
Quick Facts¶
PPG is a model-free and policy-based RL algorithm.
PPG supports both discrete and continuous action spaces.
PPG supports off-policy mode and on-policy mode.
There are two value networks in PPG.
In the implementation of DI-engine, we use two buffers for off-policy PPG, which differ only in their maximum data usage limit (max_use).
Key Graphs¶
PPG utilizes disjoint policy and value networks to reduce interference between objectives. The policy network includes an auxiliary value head which is used to distill value knowledge into the policy network; the concrete network architecture is shown as follows:
Key Equations¶
The optimization of PPG alternates between two phases, a policy phase and an auxiliary phase. During the policy phase, the policy network and the value network are updated similarly to PPO. During the auxiliary phase, value knowledge is distilled into the policy network with a joint loss: it optimizes the auxiliary (distillation) objective while preserving the original policy through a KL-divergence constraint (the second term below), and the auxiliary loss itself is a squared-error value regression on the policy network's auxiliary value head.
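In the notation of the original paper, with \(\beta_{clone}\) weighting the KL term, the two losses are:

\[ L^{joint} = L^{aux} + \beta_{clone} \cdot \hat{\mathbb{E}}_t \big[ \mathrm{KL} \big[ \pi_{\theta_{old}}(\cdot \mid s_t), \, \pi_{\theta}(\cdot \mid s_t) \big] \big] \]

\[ L^{aux} = \frac{1}{2} \hat{\mathbb{E}}_t \Big[ \big( V_{\theta_{\pi}}(s_t) - \hat{V}_t^{targ} \big)^2 \Big] \]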
Pseudo-code¶
on-policy training procedure¶
The following flow charts show how PPG alternates between the policy phase and the auxiliary phase:
Note
During the auxiliary phase, PPG also takes the opportunity to perform additional training on the value network.
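Since the flow charts are images, the following is a self-contained sketch of the control flow, using toy linear networks and random data in place of a real environment and the full PPO machinery. The hyperparameter names (N_PI, E_AUX, BETA_CLONE) follow the paper's notation; this is an illustration of the phase alternation, not DI-engine's trainer code.

```python
# Minimal sketch of PPG's phase alternation (toy data, not DI-engine code).
import torch
import torch.nn.functional as F

obs_dim, act_dim = 8, 4
policy_net = torch.nn.Linear(obs_dim, act_dim)     # policy head (action logits)
aux_value_head = torch.nn.Linear(obs_dim, 1)       # auxiliary value head on the policy side
value_net = torch.nn.Linear(obs_dim, 1)            # separate value network
policy_opt = torch.optim.Adam(
    list(policy_net.parameters()) + list(aux_value_head.parameters()), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

N_PI, E_AUX, BETA_CLONE = 4, 6, 1.0
buffer = []  # rollouts kept until the next auxiliary phase

for iteration in range(8):
    # "Collect" a toy on-policy batch (observations and value targets).
    obs, value_target = torch.randn(32, obs_dim), torch.randn(32, 1)
    buffer.append((obs, value_target))

    # ---- policy phase: PPO-style update (the clipped policy surrogate is
    # omitted for brevity; only the value regression is shown) ----
    value_loss = F.mse_loss(value_net(obs), value_target)
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # ---- auxiliary phase: every N_PI iterations, distill value into the policy net ----
    if (iteration + 1) % N_PI == 0:
        with torch.no_grad():  # snapshot pi_old before the auxiliary epochs
            old_logits = [policy_net(o) for o, _ in buffer]
        for _ in range(E_AUX):
            for (obs_b, target_b), old_logit in zip(buffer, old_logits):
                aux_loss = 0.5 * F.mse_loss(aux_value_head(obs_b), target_b)
                kl = F.kl_div(F.log_softmax(policy_net(obs_b), dim=-1),
                              F.softmax(old_logit, dim=-1), reduction='batchmean')
                policy_opt.zero_grad()
                (aux_loss + BETA_CLONE * kl).backward()
                policy_opt.step()
                # the separate value network also gets extra training here (see the note above)
                v_loss = F.mse_loss(value_net(obs_b), target_b)
                value_opt.zero_grad(); v_loss.backward(); value_opt.step()
        buffer.clear()
```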
off-policy training procedure¶
DI-engine also implements off-policy PPG with two buffers that have different data-use constraints (max_use): the policy buffer offers data for the policy phase, while the value buffer provides data for the auxiliary phase. The whole training procedure is similar to off-policy PPO, but an additional auxiliary phase is executed at a fixed frequency.
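A hypothetical configuration fragment illustrating the two buffers is shown below; the key names (policy, value, replay_buffer_size) and the concrete values are only illustrative, not necessarily DI-engine's exact schema, with max_use being the constraint described above.

```python
# Hypothetical two-buffer layout for off-policy PPG (illustrative keys and values).
replay_buffer = dict(
    # feeds the policy phase: fresh data, reused at most a few times
    policy=dict(replay_buffer_size=100000, max_use=3),
    # feeds the auxiliary phase: data may be reused more often
    value=dict(replay_buffer_size=100000, max_use=10),
)
```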
Extensions¶
PPG can be combined with:
GAE or other advantage estimation methods (a minimal GAE sketch follows this list)
Multiple buffers with different max_use
PPO (or PPG) + UCB-DrAC + PLR is one of the most powerful method combinations in the procgen environment.
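As a concrete reference for the first item, here is a minimal standalone sketch of Generalized Advantage Estimation; it is not DI-engine's implementation, and gamma / lam correspond to collect.discount_factor and collect.gae_lambda in the config table below.

```python
# Standalone GAE sketch for a single trajectory of length T.
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor, next_value: torch.Tensor,
        gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """rewards: (T,); values: (T,); next_value: scalar bootstrap value."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae_t = 0.0
    values_ext = torch.cat([values, next_value.reshape(1)])
    for t in reversed(range(T)):
        # TD residual at step t, then exponentially-weighted accumulation
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        gae_t = delta + gamma * lam * gae_t
        advantages[t] = gae_t
    return advantages

# Example: random rewards/values for a 5-step trajectory.
adv = gae(torch.randn(5), torch.randn(5), torch.tensor(0.0))
```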
Implementation¶
The default config is defined as follows:
- class ding.policy.ppg.PPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
- Overview:
Policy class of PPG algorithm. PPG is a policy gradient algorithm with auxiliary phase training. The auxiliary phase training is proposed to distill the value into the policy network, while making sure the policy network does not change the action predictions (kl div loss). Paper link: https://arxiv.org/abs/2009.04416.
- Interface:
_init_learn, _data_preprocess_learn, _forward_learn, _state_dict_learn, _load_state_dict_learn, _init_collect, _forward_collect, _process_transition, _get_train_sample, _get_batch_size, _init_eval, _forward_eval, default_model, _monitor_vars_learn, learn_aux.
- Config:
ID | Symbol | Type | Default Value | Description | Other(Shape)
---|---|---|---|---|---
1 | type | str | ppg | RL policy register name, refer to registry POLICY_REGISTRY | this arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | this arg can be different from modes
3 | on_policy | bool | True | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. | IS weight
6 | learn.update_per_collect | int | 5 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | this arg can vary from env to env. Bigger value means more off-policy
7 | learn.value_weight | float | 1.0 | The loss weight of the value network | policy network weight is set to 1
8 | learn.entropy_weight | float | 0.01 | The loss weight of entropy regularization | policy network weight is set to 1
9 | learn.clip_ratio | float | 0.2 | PPO clip ratio |
10 | learn.adv_norm | bool | False | Whether to use advantage norm in a whole training batch |
11 | learn.aux_freq | int | 5 | The frequency (in normal update times) of auxiliary phase training |
12 | learn.aux_train_epoch | int | 6 | The training epochs of the auxiliary phase |
13 | learn.aux_bc_weight | int | 1 | The loss weight of behavioral cloning in the auxiliary phase |
14 | collect.discount_factor | float | 0.99 | Reward's future discount factor, aka. gamma | may be 1 in sparse reward envs
15 | collect.gae_lambda | float | 0.95 | GAE lambda factor for the balance of bias and variance (1-step td and mc) |
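For illustration, a config fragment using the fields listed above might look like the following sketch; how such a dict is merged into a full DI-engine experiment config (e.g. via compile_config) is assumed here rather than prescribed, and the values simply echo the defaults.

```python
# Illustrative PPG config fragment built from the fields in the table above.
from easydict import EasyDict

ppg_config = EasyDict(dict(
    type='ppg',
    cuda=False,
    on_policy=True,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=5,
        value_weight=1.0,
        entropy_weight=0.01,
        clip_ratio=0.2,
        adv_norm=False,
        aux_freq=5,          # run the auxiliary phase every 5 normal updates
        aux_train_epoch=6,
        aux_bc_weight=1,
    ),
    collect=dict(
        discount_factor=0.99,
        gae_lambda=0.95,
    ),
))
```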
The network interface used by PPG is defined as follows:
- class ding.model.template.ppg.PPG(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, impala_cnn_encoder: bool = False)[source]
- Overview:
Phasic Policy Gradient (PPG) model from the paper Phasic Policy Gradient (https://arxiv.org/abs/2009.04416). This module contains a VAC module and an auxiliary critic module.
- Interfaces:
forward, compute_actor, compute_critic, compute_actor_critic
- compute_actor(x: Tensor) -> Dict [source]
- Overview:
Use the actor to compute action logits.
- Arguments:
x (torch.Tensor): The input observation tensor data.
- Returns:
output (Dict): The output data containing action logits.
- ReturnsKeys:
logit (torch.Tensor): The predicted action logit tensor. For a discrete action space, it is a real-valued tensor with the same dimension as the number of possible action choices; for a continuous action space, it is the mu and sigma of the Gaussian distribution, and the number of mu and sigma equals the number of continuous actions. A hybrid action space is a combination of discrete and continuous action spaces, so in that case the logit is a dict with action_type and action_args.
- Shapes:
x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.
output (Dict): logit: \((B, A)\), where B is batch size and A is the action space size.
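For illustration, a minimal usage sketch of compute_actor with a discrete action space, assuming DI-engine and PyTorch are installed; the shapes follow the documentation above.

```python
# Minimal usage sketch of PPG.compute_actor (discrete action space assumed).
import torch
from ding.model.template.ppg import PPG

model = PPG(obs_shape=8, action_shape=4, action_space='discrete')
obs = torch.randn(16, 8)       # (B, N) = (16, 8)
out = model.compute_actor(obs)
print(out['logit'].shape)      # expected (16, 4), i.e. (B, A)
```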
- compute_actor_critic(x: Tensor) -> Dict [source]
- Overview:
Use the actor and critic to compute action logits and value.
- Arguments:
x (torch.Tensor): The input observation tensor data.
- Returns:
outputs (Dict): The output dict of PPG's forward computation graph for both actor and critic, including logit and value.
- ReturnsKeys:
logit (torch.Tensor): The predicted action logit tensor. For a discrete action space, it is a real-valued tensor with the same dimension as the number of possible action choices; for a continuous action space, it is the mu and sigma of the Gaussian distribution, and the number of mu and sigma equals the number of continuous actions. A hybrid action space is a combination of discrete and continuous action spaces, so in that case the logit is a dict with action_type and action_args.
value (torch.Tensor): The predicted state value tensor.
- Shapes:
x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.
output (Dict): value: \((B, 1)\), where B is batch size.
output (Dict): logit: \((B, A)\), where B is batch size and A is the action space size.
Note
The compute_actor_critic interface aims to save computation when the encoder is shared between actor and critic.
- compute_critic(x: Tensor) -> Dict [source]
- Overview:
Use the critic to compute the value.
- Arguments:
x (torch.Tensor): The input observation tensor data.
- Returns:
output (Dict): The output dict of the VAC's forward computation graph for the critic, including value.
- ReturnsKeys:
necessary: value
- Shapes:
x (torch.Tensor): \((B, N)\), where B is batch size and N is the input feature size.
output (Dict): value: \((B, 1)\), where B is batch size.
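Analogously, a small sketch of compute_actor_critic and compute_critic (again assuming DI-engine and PyTorch are installed); the expected shapes are the documented ones above.

```python
# Usage sketch of compute_actor_critic and compute_critic with a shared encoder.
import torch
from ding.model.template.ppg import PPG

model = PPG(obs_shape=8, action_shape=4, share_encoder=True)
obs = torch.randn(16, 8)                         # (B, N)
both = model.compute_actor_critic(obs)           # one pass for both logit and value
print(both['logit'].shape, both['value'].shape)  # documented shapes: (B, A) and (B, 1)
value_only = model.compute_critic(obs)
print(value_only['value'].shape)                 # documented shape: (B, 1)
```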
Benchmark¶
environment | best mean reward | evaluation results | config link | comparison
---|---|---|---|---
Pong (PongNoFrameskip-v4) | 20 | | | DI-engine PPO off-policy(20)
Qbert (QbertNoFrameskip-v4) | 17775 | | | DI-engine PPO off-policy(16400)
SpaceInvaders (SpaceInvadersNoFrameskip-v4) | 1213 | | | DI-engine PPO off-policy(1200)
References¶
Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman: “Phasic Policy Gradient”, 2020; arXiv:2009.04416.