
D4PG

Overview

D4PG, proposed in the paper Distributed Distributional Deterministic Policy Gradients, is an actor-critic, model-free policy gradient algorithm that extends DDPG. Its improvements over DDPG include N-step returns, prioritized experience replay, and a distributional value function. Moreover, training is parallelized with multiple distributed workers all writing into the same replay table. The authors found that these simple modifications contribute to the overall performance of the algorithm, with N-step returns bringing the biggest performance gain and the prioritized replay buffer being the least crucial one.

Quick Facts

  1. D4PG is only used for environments with continuous action spaces (e.g. MuJoCo).

  2. D4PG is an off-policy algorithm.

  3. D4PG uses a distributional critic.

  4. D4PG is a model-free, actor-critic RL algorithm, which optimizes the actor network and the critic network separately.

  5. Usually, D4PG uses an Ornstein-Uhlenbeck process or Gaussian noise (the default in our implementation) for exploration; a sketch of the latter follows this list.
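
As a minimal sketch of the Gaussian option (illustrative only, not the ding API; the sigma value corresponds to collect.noise_sigma in the config below):

import torch

# Gaussian exploration noise added to the deterministic action during collection.
sigma = 0.1
action = torch.tanh(torch.randn(4, 6))                                  # deterministic actions in [-1, 1]
noisy_action = (action + sigma * torch.randn_like(action)).clamp(-1.0, 1.0)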

Key Equations or Key Graphs

The D4PG algorithm maintains a distributional critic \(Z_\pi(s, a)\), which models the return as a random variable such that \(Q(s, a)=\mathbb{E}[Z_\pi(s, a)]\). \(Z\) is usually a categorical distribution over Q with 51 atoms.
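
As a minimal illustration (not the ding API), the expected Q value can be recovered from such a categorical distribution by summing the atom values weighted by their probabilities; v_min, v_max and the batch size here are arbitrary:

import torch

# Categorical value distribution with 51 atoms on [v_min, v_max].
v_min, v_max, n_atom = -10.0, 10.0, 51
support = torch.linspace(v_min, v_max, n_atom)      # atom locations z_i
logits = torch.randn(4, n_atom)                     # critic output for a batch of 4 (s, a) pairs
probs = torch.softmax(logits, dim=-1)               # p_i(s, a), sums to 1 over atoms
q_value = (probs * support).sum(dim=-1)             # Q(s, a) = E[Z(s, a)] = sum_i p_i * z_i, shape (4,)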

Accordingly, the distributional Bellman operator can be defined as:

\[\begin{aligned} (\mathcal{T}_{\pi} Z)(s, a)=r(s, a)+\gamma\mathbb{E}[Z(s',\pi(s'))|(s, a)] \end{aligned}\]

The distributional variant of the operator takes functions which map from state-action pairs to distributions, and returns a function of the same form. The loss used to learn the critic distribution parameters is defined as \(L(w) = \mathbb{E}[d(\mathcal{T}_{\pi_{\theta'}} Z_{w'}(s, a), Z_{w}(s, a))]\) for some metric \(d\) that measures the distance between two distributions.
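
For a categorical critic, a common choice for \(d\) is the cross-entropy between the current distribution and the Bellman-shifted target distribution projected back onto the fixed support (as in C51). Below is a minimal, self-contained sketch of that projection and loss; the function and variable names are illustrative and do not correspond to the ding implementation:

import torch

def project_target(next_probs, rewards, dones, gamma, support):
    """Project the Bellman-shifted target distribution back onto the fixed atom support."""
    n_atom = support.numel()
    v_min, v_max = support[0].item(), support[-1].item()
    delta_z = (v_max - v_min) / (n_atom - 1)

    # Shifted atoms T z_i = r + gamma * z_i (terminal transitions drop the bootstrap term)
    tz = rewards.unsqueeze(1) + gamma * (1.0 - dones).unsqueeze(1) * support.unsqueeze(0)
    b = (tz.clamp(v_min, v_max) - v_min) / delta_z          # fractional atom index
    lower, upper = b.floor().long(), b.ceil().long()

    # Split each shifted atom's probability mass between its two neighbouring atoms
    target = torch.zeros_like(next_probs)
    target.scatter_add_(1, lower, next_probs * (upper.float() - b))
    target.scatter_add_(1, upper, next_probs * (b - lower.float()))
    # If a shifted atom lands exactly on a support point, give it all of the mass
    target.scatter_add_(1, lower, next_probs * (upper == lower).float())
    return target

def critic_loss(current_log_probs, target_probs):
    """Cross-entropy between the projected target distribution and the current one."""
    return -(target_probs * current_log_probs).sum(dim=-1).mean()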

Finally, the actor update is done by taking the expectation with respect to the action-value distribution:

\[\begin{split}\begin{aligned} \nabla_\theta J(\theta) &\approx \mathbb{E}_{\rho^\pi} [\nabla_a Q_w(s, a) \nabla_\theta \pi_{\theta}(s) \rvert_{a=\pi_{\theta}(s)}] \\ &= \mathbb{E}_{\rho^\pi} [\mathbb{E}[\nabla_a Z_w(s, a)] \nabla_\theta \pi_{\theta}(s) \rvert_{a=\pi_{\theta}(s)}] \end{aligned}\end{split}\]
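
In code, the actor step reduces to maximizing the expected Q value read off the critic distribution. A minimal sketch under hypothetical toy networks (the real default model is QACDIST, documented below; this is not the ding training loop):

import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(8, 2), nn.Tanh())            # s -> a
critic = nn.Linear(8 + 2, 51)                                # (s, a) -> atom logits
support = torch.linspace(-10.0, 10.0, 51)

obs = torch.randn(32, 8)
action = actor(obs)                                          # a = pi_theta(s)
probs = torch.softmax(critic(torch.cat([obs, action], dim=-1)), dim=-1)
q_value = (probs * support).sum(dim=-1)                      # Q(s, a) = E[Z_w(s, a)]
actor_loss = -q_value.mean()                                 # ascend on the expected Q
actor_loss.backward()                                        # in practice only the actor optimizer is stepped with this loss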

When calculating the TD error, D4PG uses an N-step TD target to incorporate rewards from more future steps:

\[r(s_0, a_0) + \mathbb{E}[\sum_{n=1}^{N-1} \gamma^n r(s_n, a_n) + \gamma^N Q(s_N, \mu_\theta(s_N)) \vert s_0, a_0 ]\]
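
A minimal sketch of building this N-step target from a length-N segment of a trajectory (variable names are illustrative):

import torch

def nstep_target(rewards, gamma, bootstrap_q, done):
    """N-step target: discounted reward sum plus a discounted bootstrap from the target critic.

    rewards: tensor of shape (N,) holding r(s_0, a_0) ... r(s_{N-1}, a_{N-1});
    bootstrap_q: scalar tensor Q(s_N, mu_theta(s_N)); done: 1.0 if s_N is terminal, else 0.0.
    """
    n = rewards.numel()
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    target = (discounts * rewards).sum()
    return target + (gamma ** n) * bootstrap_q * (1.0 - done)

# Example: a 3-step target with gamma = 0.99
print(nstep_target(torch.tensor([1.0, 0.5, 0.2]), 0.99, torch.tensor(4.0), 0.0))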

D4PG samples from a prioritized replay buffer with a non-uniform probability \(p_i\). This requires the use of importance sampling, implemented by weighting the critic update by a factor of \((R p_i)^{-1}\), where \(R\) is the size of the replay buffer.
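
For example, a sketch of how such importance weights could scale the per-sample critic loss (names and numbers are illustrative, not the ding buffer API):

import torch

R = 100000                                          # replay buffer size
p = torch.tensor([4e-5, 3e-5, 2e-5, 1e-5])          # sampling probabilities p_i of the drawn batch
per_sample_loss = torch.tensor([1.2, 0.7, 0.9, 0.3])

weights = (R * p) ** -1                             # importance weights (R * p_i)^-1
weights = weights / weights.max()                   # normalize by the max weight for stability (common practice)
critic_loss = (weights * per_sample_loss).mean()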

Note

D4PG utilizes multiple parallel independent actors, gathering experience and feeding data into the same replay buffer. However, our implementation only makes use of a single actor.

Pseudocode

../_images/D4PG_algo.png

source: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/#d4pg

Implementations

The default config is defined as follows:

class ding.policy.d4pg.D4PGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of the D4PG algorithm. D4PG is a variant of DDPG that uses a distributional critic; the critic is implemented with a categorical value distribution over a fixed set of atoms. Paper link: https://arxiv.org/abs/1804.08617.

Property:

learn_mode, collect_mode, eval_mode

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
| -- | ------ | ---- | ------------- | ----------- | ------------- |
| 1 | type | str | d4pg | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | True | Whether to use cuda for network | |
| 3 | random_collect_size | int | 25000 | Number of randomly collected training samples in replay buffer when training starts | Default to 25000 for DDPG/TD3, 10000 for SAC |
| 5 | learn.learning_rate_actor | float | 1e-3 | Learning rate for actor network (aka. policy) | |
| 6 | learn.learning_rate_critic | float | 1e-3 | Learning rate for critic network (aka. Q-network) | |
| 7 | learn.actor_update_freq | int | 1 | How many times the actor network updates per critic network update | Default 1 |
| 8 | learn.noise | bool | False | Whether to add noise to the target network's action | Default False for D4PG; target policy smoothing regularization in the TD3 paper |
| 9 | learn.ignore_done | bool | False | Whether to ignore the done flag | Use ignore_done only in the halfcheetah env |
| 10 | learn.target_theta | float | 0.005 | Used for soft update of the target network | aka. interpolation factor in polyak averaging for target networks |
| 11 | collect.noise_sigma | float | 0.1 | Controls the sigma of the noise added to actions during collection | Noise is sampled from a Gaussian distribution |
| 12 | model.v_min | float | -10 | Value of the smallest atom in the support set | |
| 13 | model.v_max | float | 10 | Value of the largest atom in the support set | |
| 14 | model.n_atom | int | 51 | Number of atoms in the support set of the value distribution | |
| 15 | nstep | int | 3, [1, 5] | N-step reward discount sum for target q_value estimation | |
| 16 | priority | bool | True | Whether to use prioritized experience replay (PER) | Priority sample and priority update |
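
To show how these fields fit together, here is a hedged sketch of a policy config overriding a few of the defaults above; the nesting follows the dotted names in the table, and this is illustrative rather than a verbatim ding config file:

from easydict import EasyDict

# Illustrative only: override a few D4PG defaults from the table above.
d4pg_policy_config = EasyDict(dict(
    cuda=True,
    priority=True,                  # prioritized experience replay
    nstep=3,                        # N-step return length
    random_collect_size=25000,
    model=dict(
        v_min=-10,
        v_max=10,
        n_atom=51,                  # atoms of the value distribution
    ),
    learn=dict(
        learning_rate_actor=1e-3,
        learning_rate_critic=1e-3,
        actor_update_freq=1,
        noise=False,
        target_theta=0.005,
    ),
    collect=dict(
        noise_sigma=0.1,            # sigma of the Gaussian exploration noise
    ),
))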

Model

Here we provide an example of the QACDIST model, which is the default model for D4PG.

class ding.model.template.qac_dist.QACDIST(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'regression', critic_head_type: str = 'categorical', actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51)[source]
Overview:

The QAC model with distributional Q-value.

Interfaces:

__init__, forward, compute_actor, compute_critic

compute_actor(inputs: Tensor) Dict[source]
Overview:

Use the encoded embedding tensor to predict the actor output ('compute_actor' mode).

Arguments:
  • inputs (torch.Tensor):

    The encoded embedding tensor, determined with given hidden_size, i.e. (B, N=hidden_size). hidden_size = actor_head_hidden_size

  • mode (str): Name of the forward mode.

Returns:
  • outputs (Dict): Outputs of forward pass encoder and head.

ReturnsKeys (either):
  • action (torch.Tensor): Continuous action tensor with same size as action_shape.

  • logit (torch.Tensor):

    Logit tensor encoding mu and sigma, both with same size as input x.

Shapes:
  • inputs (torch.Tensor): \((B, N0)\), B is batch size and N0 corresponds to hidden_size

  • action (torch.Tensor): \((B, N0)\)

  • logit (list): 2 elements, mu and sigma, each with shape \((B, N0)\).

  • q_value (torch.FloatTensor): \((B, )\), B is batch size.

Examples:
>>> # Regression mode
>>> model = QACDIST(64, 64, 'regression')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 64])
>>> # Reparameterization Mode
>>> model = QACDIST(64, 64, 'reparameterization')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> actor_outputs['logit'][0].shape  # mu
torch.Size([4, 64])
>>> actor_outputs['logit'][1].shape  # sigma
torch.Size([4, 64])
compute_critic(inputs: Dict) Dict[source]
Overview:

Use the observation and action tensors to predict the Q-value and its distribution ('compute_critic' mode).

Arguments:
  • obs, action encoded tensors.

  • mode (str): Name of the forward mode.

Returns:
  • outputs (Dict): Q-value output and distribution.

ReturnKeys:
  • q_value (torch.Tensor): Q value tensor with same size as batch size.

  • distribution (torch.Tensor): Q value distribution tensor.

Shapes:
  • obs (torch.Tensor): \((B, N1)\), where B is batch size and N1 is obs_shape

  • action (torch.Tensor): \((B, N2)\), where B is batch size and N2 is action_shape

  • q_value (torch.FloatTensor): \((B, N2)\), where B is batch size and N2 is action_shape

  • distribution (torch.FloatTensor): \((B, 1, N3)\), where B is batch size and N3 is num_atom

Examples:
>>> # Categorical mode
>>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)}
>>> model = QACDIST(obs_shape=(N, ), action_shape=1, action_space='regression',
...                 critic_head_type='categorical', n_atom=51)
>>> q_value = model(inputs, mode='compute_critic')  # q value
>>> assert q_value['q_value'].shape == torch.Size([4, 1])
>>> assert q_value['distribution'].shape == torch.Size([4, 1, 51])
forward(inputs: Tensor | Dict, mode: str) Dict[source]
Overview:

Use the observation or action tensors to predict the output, dispatching to compute_actor or compute_critic according to mode.

Arguments:
Forward with 'compute_actor':
  • inputs (torch.Tensor):

    The encoded embedding tensor, determined with given hidden_size, i.e. (B, N=hidden_size). Whether hidden_size is actor_head_hidden_size or critic_head_hidden_size depends on mode.

Forward with 'compute_critic', inputs (Dict) Necessary Keys:
  • obs, action encoded tensors.

  • mode (str): Name of the forward mode.

Returns:
  • outputs (Dict): Outputs of network forward.

    Forward with 'compute_actor', Necessary Keys (either):
    • action (torch.Tensor): Action tensor with same size as input x.

    • logit (torch.Tensor):

      Logit tensor encoding mu and sigma, both with same size as input x.

    Forward with 'compute_critic', Necessary Keys:
    • q_value (torch.Tensor): Q value tensor with same size as batch size.

    • distribution (torch.Tensor): Q value distribution tensor.

Actor Shapes:
  • inputs (torch.Tensor): \((B, N0)\), B is batch size and N0 corresponds to hidden_size

  • action (torch.Tensor): \((B, N0)\)

  • q_value (torch.FloatTensor): \((B, )\), where B is batch size.

Critic Shapes:
  • obs (torch.Tensor): \((B, N1)\), where B is batch size and N1 is obs_shape

  • action (torch.Tensor): \((B, N2)\), where B is batch size and N2 is action_shape

  • q_value (torch.FloatTensor): \((B, N2)\), where B is batch size and N2 is action_shape

  • distribution (torch.FloatTensor): \((B, 1, N3)\), where B is batch size and N3 is num_atom

Actor Examples:
>>> # Regression mode
>>> model = QACDIST(64, 64, 'regression')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 64])
>>> # Reparameterization Mode
>>> model = QACDIST(64, 64, 'reparameterization')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> actor_outputs['logit'][0].shape  # mu
torch.Size([4, 64])
>>> actor_outputs['logit'][1].shape  # sigma
torch.Size([4, 64])
Critic Examples:
>>> # Categorical mode
>>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)}
>>> model = QACDIST(obs_shape=(N, ), action_shape=1, action_space='regression',
...                 critic_head_type='categorical', n_atom=51)
>>> q_value = model(inputs, mode='compute_critic')  # q value
>>> assert q_value['q_value'].shape == torch.Size([4, 1])
>>> assert q_value['distribution'].shape == torch.Size([4, 1, 51])

Benchmark

| environment | best mean reward | evaluation results | config link | comparison |
| ----------- | ---------------- | ------------------ | ----------- | ---------- |
| Halfcheetah (Halfcheetah-v3) | 13000 | ../_images/halfcheetah_d4pg.png | config_link_ha | |
| Walker2d (Walker2d-v2) | 5300 | ../_images/walker2d_d4pg.png | config_link_w | |
| Hopper (Hopper-v2) | 3500 | ../_images/hopper_d4pg.png | config_link_ho | |

Other Public Implementations

References

  • Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, Timothy Lillicrap: "Distributed Distributional Deterministic Policy Gradients", 2018; arXiv:1804.08617. https://arxiv.org/abs/1804.08617