ding.model¶
Common¶
Please refer to ding/model/common
for more details.
create_model¶
- ding.model.create_model(cfg: EasyDict) Module [source]¶
- Overview:
Create a neural network model according to the given EasyDict-type cfg.
- Arguments:
    - cfg (EasyDict): User's model config. The key import_names is used to import modules, and the key type is used to indicate which model to create.
- Returns:
    - (torch.nn.Module): The created neural network model.
- Examples:
>>> cfg = EasyDict({
>>>     'import_names': ['ding.model.template.q_learning'],
>>>     'type': 'dqn',
>>>     'obs_shape': 4,
>>>     'action_shape': 2,
>>> })
>>> model = create_model(cfg)
Tip
This method will not modify the passed-in cfg; it deepcopies cfg and then modifies the copy.
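A minimal sketch (not part of the original docs) illustrating this behavior, reusing the config keys from the example above:
>>> cfg = EasyDict({
>>>     'import_names': ['ding.model.template.q_learning'],
>>>     'type': 'dqn',
>>>     'obs_shape': 4,
>>>     'action_shape': 2,
>>> })
>>> model = create_model(cfg)
>>> # the user's cfg keeps its original keys because create_model works on a deepcopy
>>> assert 'type' in cfg and 'import_names' in cfg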
ConvEncoder¶
- class ding.model.ConvEncoder(obs_shape: SequenceType, hidden_size_list: SequenceType = [32, 64, 64, 128], activation: Module | None = ReLU(), kernel_size: SequenceType = [8, 4, 3], stride: SequenceType = [4, 2, 1], padding: SequenceType | None = None, layer_norm: bool | None = False, norm_type: str | None = None)[source]¶
- Overview:
The Convolution Encoder is used to encode 2-dim image observations.
- Interfaces:
    __init__, forward.
- __init__(obs_shape: SequenceType, hidden_size_list: SequenceType = [32, 64, 64, 128], activation: Module | None = ReLU(), kernel_size: SequenceType = [8, 4, 3], stride: SequenceType = [4, 2, 1], padding: SequenceType | None = None, layer_norm: bool | None = False, norm_type: str | None = None) None [source]¶
- Overview:
Initialize the Convolution Encoder according to the provided arguments.
- Arguments:
    - obs_shape (SequenceType): Sequence of in_channel, plus one or more input sizes.
    - hidden_size_list (SequenceType): Sequence of hidden_size of subsequent conv layers and the final dense layer.
    - activation (nn.Module): Type of activation to use in the conv layers and ResBlock. Default is nn.ReLU().
    - kernel_size (SequenceType): Sequence of kernel_size of subsequent conv layers.
    - stride (SequenceType): Sequence of stride of subsequent conv layers.
    - padding (SequenceType): Padding added to all four sides of the input for each conv layer. See nn.Conv2d for more details. Default is None.
    - layer_norm (bool): Whether to use DreamerLayerNorm, a special normalization trick proposed in DreamerV3.
    - norm_type (str): Type of normalization to use. See ding.torch_utils.network.ResBlock for more details. Default is None.
- forward(x: Tensor) Tensor [source]¶
- Overview:
Return output 1D embedding tensor of the env’s 2D image observation.
- Arguments:
    - x (torch.Tensor): Raw 2D observation of the environment.
- Returns:
    - outputs (torch.Tensor): Output embedding tensor.
- Shapes:
    - x: \((B, C, H, W)\), where B is batch size, C is channel, H is height, and W is width.
    - outputs: \((B, N)\), where N = hidden_size_list[-1].
- Examples:
>>> conv = ConvEncoder(
>>>     obs_shape=(4, 84, 84),
>>>     hidden_size_list=[32, 64, 64, 128],
>>>     activation=nn.ReLU(),
>>>     kernel_size=[8, 4, 3],
>>>     stride=[4, 2, 1],
>>>     padding=None,
>>>     layer_norm=False,
>>>     norm_type=None
>>> )
>>> x = torch.randn(1, 4, 84, 84)
>>> output = conv(x)
FCEncoder¶
- class ding.model.FCEncoder(obs_shape: int, hidden_size_list: SequenceType, res_block: bool = False, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None)[source]¶
- Overview:
The fully connected encoder is used to encode 1-dim input variables.
- Interfaces:
    __init__, forward.
- __init__(obs_shape: int, hidden_size_list: SequenceType, res_block: bool = False, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None) None [source]¶
- Overview:
Initialize the FC Encoder according to arguments.
- Arguments:
    - obs_shape (int): Observation shape.
    - hidden_size_list (SequenceType): Sequence of hidden_size of subsequent FC layers.
    - res_block (bool): Whether to use res_block. Default is False.
    - activation (nn.Module): Type of activation to use in ResFCBlock. Default is nn.ReLU().
    - norm_type (str): Type of normalization to use. See ding.torch_utils.network.ResFCBlock for more details. Default is None.
    - dropout (float): Dropout rate of the dropout layer. If None, no dropout layer is used.
- forward(x: Tensor) Tensor [source]¶
- Overview:
Return output embedding tensor of the env observation.
- Arguments:
    - x (torch.Tensor): Env raw observation.
- Returns:
    - outputs (torch.Tensor): Output embedding tensor.
- Shapes:
    - x: \((B, M)\), where M = obs_shape.
    - outputs: \((B, N)\), where N = hidden_size_list[-1].
- Examples:
>>> fc = FCEncoder(
>>>     obs_shape=4,
>>>     hidden_size_list=[32, 64, 64, 128],
>>>     activation=nn.ReLU(),
>>>     norm_type=None,
>>>     dropout=None
>>> )
>>> x = torch.randn(1, 4)
>>> output = fc(x)
IMPALAConvEncoder¶
- class ding.model.IMPALAConvEncoder(obs_shape: SequenceType, channels: SequenceType = (16, 32, 32), outsize: int = 256, scale_ob: float = 255.0, nblock: int = 2, final_relu: bool = True, **kwargs)[source]¶
- Overview:
IMPALA CNN encoder, which is used in the IMPALA algorithm. See IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, https://arxiv.org/pdf/1802.01561.pdf for more details.
- Interface:
    __init__, forward, output_shape.
- __init__(obs_shape: SequenceType, channels: SequenceType = (16, 32, 32), outsize: int = 256, scale_ob: float = 255.0, nblock: int = 2, final_relu: bool = True, **kwargs) None [source]¶
- Overview:
Initialize the IMPALA CNN encoder according to arguments.
- Arguments:
    - obs_shape (SequenceType): 2D image observation shape.
    - channels (SequenceType): The channel numbers of a series of IMPALA CNN blocks. Each element of the sequence is the output channel number of one IMPALA CNN block.
    - outsize (int): The output size of the final linear layer, i.e., the dimension of the 1D embedding vector.
    - scale_ob (float): The scale of the input observation, used to normalize it, such as dividing by 255.0 for raw image observations.
    - nblock (int): The number of residual blocks in each IMPALA CNN block.
    - final_relu (bool): Whether to use ReLU activation on the final output of the encoder.
    - kwargs (Dict[str, Any]): Other arguments for IMPALACnnDownStack.
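No example is given here in the original docs; the following is a minimal usage sketch under the assumption that the encoder takes a (B, C, H, W) observation tensor and returns a (B, outsize) embedding:
>>> encoder = IMPALAConvEncoder(obs_shape=(4, 84, 84))
>>> x = torch.randn(8, 4, 84, 84)
>>> output = encoder(x)
>>> # default outsize: int = 256
>>> assert output.shape == torch.Size([8, 256])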
DiscreteHead¶
- class ding.model.DiscreteHead(hidden_size: int, output_size: int, layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, noise: bool | None = False)[source]¶
- Overview:
The DiscreteHead is used to generate discrete action logit or Q-value logit, which is often used in Q-learning algorithms or actor-critic algorithms for discrete action spaces.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the DiscreteHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to DiscreteHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - dropout (float): The dropout rate. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with DiscreteHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword logit (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
- Examples:
>>> head = DiscreteHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 64])
DistributionHead¶
- class ding.model.DistributionHead(hidden_size: int, output_size: int, layer_num: int = 1, n_atom: int = 51, v_min: float = -10, v_max: float = 10, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False, eps: float | None = 1e-06)[source]¶
- Overview:
The DistributionHead is used to generate the distribution of Q-value. This module is used in the C51 algorithm.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, n_atom: int = 51, v_min: float = -10, v_max: float = 10, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False, eps: float | None = 1e-06) None [source]¶
- Overview:
Init the DistributionHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to DistributionHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value distribution.
    - n_atom (int): The number of atoms (discrete supports). Default is 51.
    - v_min (float): Min value of atoms. Default is -10.
    - v_max (float): Max value of atoms. Default is 10.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
    - eps (float): Small constant used for numerical stability.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with DistributionHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor) and distribution (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - distribution: \((B, M, n_atom)\).
- Examples:
>>> head = DistributionHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom is 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
RainbowHead¶
- class ding.model.RainbowHead(hidden_size: int, output_size: int, layer_num: int = 1, n_atom: int = 51, v_min: float = -10, v_max: float = 10, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = True, eps: float | None = 1e-06)[source]¶
- Overview:
The RainbowHead is used to generate the distribution of Q-value. This module is used in Rainbow DQN.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, n_atom: int = 51, v_min: float = -10, v_max: float = 10, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = True, eps: float | None = 1e-06) None [source]¶
- Overview:
Init the RainbowHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to RainbowHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - n_atom (int): The number of atoms (discrete supports). Default is 51.
    - v_min (float): Min value of atoms. Default is -10.
    - v_max (float): Max value of atoms. Default is 10.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default True.
    - eps (float): Small constant used for numerical stability.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with RainbowHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor) and distribution (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - distribution: \((B, M, n_atom)\).
- Examples:
>>> head = RainbowHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom is 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
QRDQNHead¶
- class ding.model.QRDQNHead(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False)[source]¶
- Overview:
The QRDQNHead (Quantile Regression DQN) is used to output action quantiles.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the QRDQNHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to QRDQNHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): The number of quantiles. Default is 32.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with QRDQNHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor), q (torch.Tensor), and tau (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - q: \((B, M, num_quantiles)\).
    - tau: \((B, num_quantiles, 1)\).
- Examples:
>>> head = QRDQNHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([4, 64, 32])
>>> assert outputs['tau'].shape == torch.Size([4, 32, 1])
QuantileHead¶
- class ding.model.QuantileHead(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, beta_function_type: str | None = 'uniform', activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False)[source]¶
- Overview:
The QuantileHead is used to output action quantiles. This module is used in IQN.
- Interfaces:
    __init__, forward, quantile_net.
Note
The difference between QuantileHead and QRDQNHead is that QuantileHead models the state-action quantile function as a mapping from state-actions and samples from some base distribution, while QRDQNHead approximates random returns by a uniform mixture of Dirac functions.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, beta_function_type: str | None = 'uniform', activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the QuantileHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to QuantileHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): The number of quantiles.
    - quantile_embedding_size (int): The embedding size of a quantile.
    - beta_function_type (str): Type of beta function. See ding.rl_utils.beta_function.py for more details. Default is uniform.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor, num_quantiles: int | None = None) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with QuantileHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor), q (torch.Tensor), and quantiles (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - q: \((num_quantiles, B, M)\).
    - quantiles: \((quantile_embedding_size, 1)\).
- Examples:
>>> head = QuantileHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([32, 4, 64])
>>> assert outputs['quantiles'].shape == torch.Size([128, 1])
- quantile_net(quantiles: Tensor) Tensor [source]¶
- Overview:
Deterministic parametric function trained to reparameterize samples from a base distribution. By repeated Bellman update iterations of Q-learning, the optimal action-value function is estimated.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor of parametric sample.
- Returns:
    - quantile_net (torch.Tensor): Quantile network output tensor after reparameterization.
- Shapes:
    - quantile_net: \((quantile_embedding_size, M)\), where M = output_size.
- Examples:
>>> head = QuantileHead(64, 64)
>>> quantiles = torch.randn(128, 1)
>>> qn_output = head.quantile_net(quantiles)
>>> assert isinstance(qn_output, torch.Tensor)
>>> # default quantile_embedding_size: int = 128
>>> assert qn_output.shape == torch.Size([128, 64])
FQFHead¶
- class ding.model.FQFHead(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False)[source]¶
- Overview:
The FQFHead is used to output action quantiles. This module is used in FQF.
- Interfaces:
    __init__, forward, quantile_net.
Note
The implementation of FQFHead is based on the paper https://arxiv.org/abs/1911.02140. The difference between FQFHead and QuantileHead is that, in FQF, N adjustable quantile values for N adjustable quantile fractions are estimated to approximate the quantile function, and the distribution of the return is approximated by a weighted mixture of N Dirac functions. In IQN, by contrast, the state-action quantile function is modeled as a mapping from state-actions and samples from some base distribution.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the FQFHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to FQFHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): The number of quantiles.
    - quantile_embedding_size (int): The embedding size of a quantile.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor, num_quantiles: int | None = None) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with FQFHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor), q (torch.Tensor), quantiles (torch.Tensor), quantiles_hats (torch.Tensor), q_tau_i (torch.Tensor), and entropies (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - q: \((B, num_quantiles, M)\).
    - quantiles: \((B, num_quantiles + 1)\).
    - quantiles_hats: \((B, num_quantiles)\).
    - q_tau_i: \((B, num_quantiles - 1, M)\).
    - entropies: \((B, 1)\).
- Examples:
>>> head = FQFHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([4, 32, 64])
>>> assert outputs['quantiles'].shape == torch.Size([4, 33])
>>> assert outputs['quantiles_hats'].shape == torch.Size([4, 32])
>>> assert outputs['q_tau_i'].shape == torch.Size([4, 31, 64])
>>> assert outputs['entropies'].shape == torch.Size([4, 1])
- quantile_net(quantiles: Tensor) Tensor [source]¶
- Overview:
Deterministic parametric function trained to reparameterize samples from the quantiles_proposal network. By repeated Bellman update iterations of Q-learning, the optimal action-value function is estimated.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor of parametric sample.
- Returns:
    - quantile_net (torch.Tensor): Quantile network output tensor after reparameterization.
- Examples:
>>> head = FQFHead(64, 64)
>>> quantiles = torch.randn(4, 32)
>>> qn_output = head.quantile_net(quantiles)
>>> assert isinstance(qn_output, torch.Tensor)
>>> # default quantile_embedding_size: int = 128
>>> assert qn_output.shape == torch.Size([4, 32, 64])
DuelingHead¶
- class ding.model.DuelingHead(hidden_size: int, output_size: int, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, noise: bool | None = False)[source]¶
- Overview:
The DuelingHead is used to output discrete action logit. This module is used in Dueling DQN.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the DuelingHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to DuelingHead.
    - output_size (int): The number of outputs.
    - a_layer_num (int): The number of layers used in the network to compute action output.
    - v_layer_num (int): The number of layers used in the network to compute value output.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - dropout (float): The dropout rate of the dropout layer. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with DuelingHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword logit (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
- Examples:
>>> head = DuelingHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
StochasticDuelingHead¶
- class ding.model.StochasticDuelingHead(hidden_size: int, action_shape: int, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False, last_tanh: bool | None = True)[source]¶
- Overview:
The Stochastic Dueling Network is proposed in the ACER paper (arXiv 1611.01224). It is a dueling network architecture for continuous action spaces.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, action_shape: int, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False, last_tanh: bool | None = True) None [source]¶
- Overview:
Init the StochasticDuelingHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to StochasticDuelingHead.
    - action_shape (int): The number of continuous action dimensions, usually an integer value.
    - layer_num (int): The default number of layers used in the network to compute action and value output.
    - a_layer_num (int): The number of layers used in the network to compute action output. Default is layer_num.
    - v_layer_num (int): The number of layers used in the network to compute value output. Default is layer_num.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
    - last_tanh (bool): If True, apply tanh to actions. Default True.
- forward(s: Tensor, a: Tensor, mu: Tensor, sigma: Tensor, sample_size: int = 10) Dict[str, Tensor] [source]¶
- Overview:
Use encoded embedding tensor to run MLP with StochasticDuelingHead and return the prediction dictionary.
- Arguments:
    - s (torch.Tensor): Tensor containing input embedding.
    - a (torch.Tensor): The original continuous behaviour action.
    - mu (torch.Tensor): The mu of the Gaussian reparameterization output of the actor head at the current timestep.
    - sigma (torch.Tensor): The sigma of the Gaussian reparameterization output of the actor head at the current timestep.
    - sample_size (int): The number of samples for continuous actions when computing the Q value.
- Returns:
    - outputs (Dict): Dict containing keywords q_value (torch.Tensor) and v_value (torch.Tensor).
- Shapes:
    - s: \((B, N)\), where B = batch_size and N = hidden_size.
    - a: \((B, A)\), where A = action_size.
    - mu: \((B, A)\).
    - sigma: \((B, A)\).
    - q_value: \((B, 1)\).
    - v_value: \((B, 1)\).
- Examples:
>>> head = StochasticDuelingHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> a = torch.randn(4, 64)
>>> mu = torch.randn(4, 64)
>>> sigma = torch.ones(4, 64)
>>> outputs = head(inputs, a, mu, sigma)
>>> assert isinstance(outputs, dict)
>>> assert outputs['q_value'].shape == torch.Size([4, 1])
>>> assert outputs['v_value'].shape == torch.Size([4, 1])
BranchingHead¶
- class ding.model.BranchingHead(hidden_size: int, num_branches: int = 0, action_bins_per_branch: int = 2, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, norm_type: str | None = None, activation: Module | None = ReLU(), noise: bool | None = False)[source]¶
- Overview:
The BranchingHead is used to generate Q-values with different branches. This module is used in Branching DQN.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, num_branches: int = 0, action_bins_per_branch: int = 2, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, norm_type: str | None = None, activation: Module | None = ReLU(), noise: bool | None = False) None [source]¶
- Overview:
Init the BranchingHead layers according to the provided arguments. This head achieves a linear increase of the number of network outputs with the number of degrees of freedom by allowing a level of independence for each individual action. Therefore, this head is suitable for high-dimensional action spaces.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to BranchingHead.
    - num_branches (int): The number of branches, which is equivalent to the action dimension.
    - action_bins_per_branch (int): The number of action bins in each dimension.
    - layer_num (int): The number of layers used in the network to compute Advantage and Value output.
    - a_layer_num (int): The number of layers used in the network to compute Advantage output.
    - v_layer_num (int): The number of layers used in the network to compute Value output.
    - output_size (int): The number of outputs.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with BranchingHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword logit (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
- Examples:
>>> head = BranchingHead(64, 5, 2)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 5, 2])
RegressionHead¶
- class ding.model.RegressionHead(input_size: int, output_size: int, layer_num: int = 2, final_tanh: bool | None = False, activation: Module | None = ReLU(), norm_type: str | None = None, hidden_size: int | None = None)[source]¶
- Overview:
The RegressionHead is used to regress continuous variables. This module is used for generating the Q-value of continuous actions (DDPG critic), the state value (A2C/PPO), or directly predicting continuous actions (DDPG actor).
- Interfaces:
    __init__, forward.
- __init__(input_size: int, output_size: int, layer_num: int = 2, final_tanh: bool | None = False, activation: Module | None = ReLU(), norm_type: str | None = None, hidden_size: int | None = None) None [source]¶
- Overview:
Init the RegressionHead layers according to the provided arguments.
- Arguments:
    - input_size (int): The input size of the MLP connected to RegressionHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - final_tanh (bool): If True, apply tanh to the output. Default False.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with RegressionHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword pred (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - pred: \((B, M)\), where M = output_size.
- Examples:
>>> head = RegressionHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['pred'].shape == torch.Size([4, 64])
ReparameterizationHead¶
- class ding.model.ReparameterizationHead(input_size: int, output_size: int, layer_num: int = 2, sigma_type: str | None = None, fixed_sigma_value: float | None = 1.0, activation: Module | None = ReLU(), norm_type: str | None = None, bound_type: str | None = None, hidden_size: int | None = None)[source]¶
- Overview:
The ReparameterizationHead is used to generate a Gaussian distribution over continuous variables, parameterized by mu and sigma. This module is often used in stochastic policies, such as PPO and SAC.
- Interfaces:
    __init__, forward.
- __init__(input_size: int, output_size: int, layer_num: int = 2, sigma_type: str | None = None, fixed_sigma_value: float | None = 1.0, activation: Module | None = ReLU(), norm_type: str | None = None, bound_type: str | None = None, hidden_size: int | None = None) None [source]¶
- Overview:
Init the ReparameterizationHead layers according to the provided arguments.
- Arguments:
    - input_size (int): The input size of the MLP connected to ReparameterizationHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - sigma_type (str): Sigma type used. Choose among ['fixed', 'independent', 'conditioned']. Default is None.
    - fixed_sigma_value (float): When choosing the fixed type, the tensor output['sigma'] is filled with this input value. Default is 1.0.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - bound_type (str): Bound type to apply to the output mu. Choose among ['tanh', None]. Default is None.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with ReparameterizationHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords mu (torch.Tensor) and sigma (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - mu: \((B, M)\), where M = output_size.
    - sigma: \((B, M)\).
- Examples:
>>> head = ReparameterizationHead(64, 64, sigma_type='fixed')
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['mu'].shape == torch.Size([4, 64])
>>> assert outputs['sigma'].shape == torch.Size([4, 64])
AttentionPolicyHead¶
- class ding.model.AttentionPolicyHead[source]¶
- Overview:
Cross-attention-type discrete action policy head, which is often used in variable discrete action spaces.
- Interfaces:
    __init__, forward.
- __init__() None [source]¶
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(key: Tensor, query: Tensor) Tensor [source]¶
- Overview:
Use attention-like mechanism to combine key and query tensor to output discrete action logit.
- Arguments:
    - key (torch.Tensor): Tensor containing key embedding.
    - query (torch.Tensor): Tensor containing query embedding.
- Returns:
    - logit (torch.Tensor): Tensor containing output discrete action logit.
- Shapes:
    - key: \((B, N, K)\), where B = batch_size, N = number of possible discrete action choices, and K = hidden_size.
    - query: \((B, K)\).
    - logit: \((B, N)\).
- Examples:
>>> head = AttentionPolicyHead()
>>> key = torch.randn(4, 5, 64)
>>> query = torch.randn(4, 64)
>>> logit = head(key, query)
>>> assert logit.shape == torch.Size([4, 5])
Note
In this head, we assume that the key and query tensors are both normalized.
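A minimal sketch (not from the original docs) of one possible way to satisfy this assumption, normalizing key and query along the feature dimension with torch.nn.functional.normalize before the forward call:
>>> import torch
>>> import torch.nn.functional as F
>>> head = AttentionPolicyHead()
>>> key = F.normalize(torch.randn(4, 5, 64), dim=-1)    # unit-norm key embeddings
>>> query = F.normalize(torch.randn(4, 64), dim=-1)     # unit-norm query embedding
>>> logit = head(key, query)
>>> assert logit.shape == torch.Size([4, 5])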
MultiHead¶
- class ding.model.MultiHead(head_cls: type, hidden_size: int, output_size_list: SequenceType, **head_kwargs)[source]¶
- Overview:
The MultiHead is used to generate multiple similar results. For example, we can combine DistributionHead and MultiHead to generate multi-discrete action space logit.
- Interfaces:
    __init__, forward.
- __init__(head_cls: type, hidden_size: int, output_size_list: SequenceType, **head_kwargs) None [source]¶
- Overview:
Init the MultiHead layers according to the provided arguments.
- Arguments:
    - head_cls (type): The class of head, chosen among [DuelingHead, DistributionHead, QuantileHead, ...].
    - hidden_size (int): The hidden_size of the MLP connected to the Head.
    - output_size_list (SequenceType): Sequence of output_size for multi-discrete action, e.g. [2, 3, 5].
    - head_kwargs (dict): Dict containing class-specific arguments.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with MultiHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword logit (torch.Tensor), where the logit of each output i can be accessed at ['logit'][i].
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, Mi)\), where Mi = output_size corresponding to output i.
- Examples:
>>> head = MultiHead(DuelingHead, 64, [2, 3, 5], v_layer_num=2)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> # output_size_list is [2, 3, 5] as set
>>> # Therefore each dim of logit is as follows
>>> outputs['logit'][0].shape
>>> torch.Size([4, 2])
>>> outputs['logit'][1].shape
>>> torch.Size([4, 3])
>>> outputs['logit'][2].shape
>>> torch.Size([4, 5])
independent_normal_dist¶
- ding.model.independent_normal_dist(logits: List | Dict) Distribution [source]¶
- Overview:
Convert logits of different types into an independent normal distribution.
- Arguments:
    - logits (Union[List, Dict]): The logits to be converted.
- Returns:
    - dist (torch.distributions.Distribution): The converted normal distribution.
- Examples:
>>> logits = [torch.randn(4, 5), torch.ones(4, 5)]
>>> dist = independent_normal_dist(logits)
>>> assert isinstance(dist, torch.distributions.Independent)
>>> assert isinstance(dist.base_dist, torch.distributions.Normal)
>>> assert dist.base_dist.loc.shape == torch.Size([4, 5])
>>> assert dist.base_dist.scale.shape == torch.Size([4, 5])
- Raises:
    - TypeError: If the type of logits is not list or dict.
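The original example covers the list form only; the following sketch is an assumption, not from the original docs, and guesses that the dict form uses the keys mu and sigma:
>>> # assumed dict form with 'mu' and 'sigma' keys (hypothetical illustration)
>>> logits = {'mu': torch.randn(4, 5), 'sigma': torch.ones(4, 5)}
>>> dist = independent_normal_dist(logits)
>>> assert isinstance(dist, torch.distributions.Independent)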
Template¶
Please refer to ding/model/template
for more details.
DQN¶
- class ding.model.DQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, init_bias: float | None = None)[source]¶
- Overview:
The neural network structure and computation graph of the Deep Q Network (DQN) algorithm, which is the most classic value-based RL algorithm for discrete actions. DQN is composed of two parts: encoder and head. The encoder is used to extract features from various observations, and the head is used to compute the Q value of each action dimension.
- Interfaces:
    __init__, forward.
Note
Currently, DQN supports two types of encoder: FCEncoder and ConvEncoder, and two types of head: DiscreteHead and DuelingHead. You can customize your own encoder or head by inheriting this class.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, init_bias: float | None = None) None [source]¶
- Overview:
Initialize the DQN (encoder + head) model according to the corresponding input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
    - action_shape (Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder; the last element must match head_hidden_size.
    - dueling (Optional[bool]): Whether to choose DuelingHead or DiscreteHead (default).
    - head_hidden_size (Optional[int]): The hidden_size of the head network; defaults to None, in which case it is set to the last element of encoder_hidden_size_list.
    - head_layer_num (int): The number of layers used in the head network to compute Q value output.
    - activation (Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization in networks. See ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
    - dropout (Optional[float]): The dropout rate of the dropout layer; if None, the dropout layer is disabled.
    - init_bias (Optional[float]): The initial value of the last layer bias in the head network.
- forward(x: Tensor) Dict [source]¶
- Overview:
DQN forward computation graph, input observation tensor to predict q_value.
- Arguments:
    - x (torch.Tensor): The input observation tensor data.
- Returns:
    - outputs (Dict): The output of DQN's forward pass, including q_value.
- ReturnsKeys:
    - logit (torch.Tensor): Discrete Q-value output of each possible action dimension.
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.
    - logit (torch.Tensor): \((B, M)\), where B is batch size and M is action_shape.
- Examples:
>>> model = DQN(32, 6)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])
Note
For consistency and compatibility, we name all network outputs related to action selection as logit.
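As noted above, DQN also supports ConvEncoder for image observations; the following is a hedged sketch (not from the original docs), assuming that a 3-dim obs_shape makes DQN select the convolutional encoder:
>>> model = DQN(obs_shape=[4, 84, 84], action_shape=6)  # assumed to build a ConvEncoder internally
>>> inputs = torch.randn(4, 4, 84, 84)
>>> outputs = model(inputs)
>>> assert outputs['logit'].shape == torch.Size([4, 6])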
C51DQN¶
- class ding.model.C51DQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51)[source]¶
- Overview:
The neural network structure and computation graph of C51DQN, which combines distributional RL and DQN. You can refer to https://arxiv.org/pdf/1707.06887.pdf for more details. C51DQN is composed of encoder and head: the encoder is used to extract features from the observation, and the head is used to compute the distribution of Q-value.
- Interfaces:
    __init__, forward
Note
Currently, C51DQN supports two types of encoder: FCEncoder and ConvEncoder.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51) None [source]¶
- Overview:
Initialize the C51 model according to the corresponding input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
    - action_shape (Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder; the last element must match head_hidden_size.
    - head_hidden_size (Optional[int]): The hidden_size of the head network; defaults to None, in which case it is set to the last element of encoder_hidden_size_list.
    - head_layer_num (int): The number of layers used in the head network to compute Q value output.
    - activation (Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization in networks. See ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
    - v_min (Optional[float]): The minimum value of the support of the distribution, which is related to the value (discounted sum of reward) scale of the specific environment. Defaults to -10.
    - v_max (Optional[float]): The maximum value of the support of the distribution, which is related to the value (discounted sum of reward) scale of the specific environment. Defaults to 10.
    - n_atom (Optional[int]): The number of atoms in the prediction distribution; 51 is the default value in the paper, and you can also try other values such as 301.
- forward(x: Tensor) Dict [source]¶
- Overview:
C51DQN forward computation graph, input observation tensor to predict q_value and its distribution.
- Arguments:
    - x (torch.Tensor): The input observation tensor data.
- Returns:
    - outputs (Dict): The output of C51DQN's forward pass, including q_value and distribution.
- ReturnsKeys:
    - logit (torch.Tensor): Discrete Q-value output of each possible action dimension.
    - distribution (torch.Tensor): Q-value discretized distribution, i.e., probability of each uniformly spaced atom Q-value, such as dividing [-10, 10] into 51 uniform spaces.
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.
    - logit (torch.Tensor): \((B, M)\), where M is action_shape.
    - distribution (torch.Tensor): \((B, M, P)\), where P is n_atom.
- Examples:
>>> model = C51DQN(128, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 128)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> # default head_hidden_size: int = 64
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int = 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
Note
For consistency and compatibility, we name all network outputs related to action selection as logit.
Note
For convenience, we recommend that the number of atoms be odd, so that the middle atom sits exactly at the center of the value support.
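As a quick check of this recommendation (an illustrative calculation, not from the original docs), with the default v_min=-10, v_max=10 and an odd n_atom=51, the middle atom falls at the center of the support:
>>> support = torch.linspace(-10, 10, 51)   # 51 uniformly spaced atoms
>>> assert abs(support[25].item()) < 1e-6   # the middle atom sits at the center of [-10, 10]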
QRDQN¶
- class ding.model.QRDQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network structure and computation graph of QRDQN, which combines distributional RL and DQN. You can refer to Distributional Reinforcement Learning with Quantile Regression https://arxiv.org/pdf/1710.10044.pdf for more details.
- Interfaces:
    __init__, forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the QRDQN Model according to input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape.
    - action_shape (Union[int, SequenceType]): Action space shape.
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder.
    - head_hidden_size (Optional[int]): The hidden_size to pass to Head.
    - head_layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): Number of quantiles in the prediction distribution.
    - activation (Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization to use. See ding.torch_utils.fc_block for more details.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use observation tensor to predict QRDQN's output. Parameter updates with QRDQN's MLPs forward setup.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
- Returns:
    - outputs (Dict): Run with encoder and head. Return the resulting prediction dictionary.
- ReturnsKeys:
    - logit (torch.Tensor): Logit tensor with the same size as input x.
    - q (torch.Tensor): Q value tensor of size (B, N, num_quantiles).
    - tau (torch.Tensor): tau tensor of size (B, N, 1).
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is head_hidden_size.
    - logit (torch.FloatTensor): \((B, M)\), where M is action_shape.
    - tau (torch.Tensor): \((B, num_quantiles, 1)\).
- Examples:
>>> model = QRDQN(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([4, 64, 32])
>>> assert outputs['tau'].shape == torch.Size([4, 32, 1])
IQN¶
- class ding.model.IQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network structure and computation graph of IQN, which combines distributional RL and DQN. You can refer to paper Implicit Quantile Networks for Distributional Reinforcement Learning https://arxiv.org/pdf/1806.06923.pdf for more details.
- Interfaces:
    __init__, forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the IQN Model according to input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape.
    - action_shape (Union[int, SequenceType]): Action space shape.
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder.
    - head_hidden_size (Optional[int]): The hidden_size to pass to Head.
    - head_layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): Number of quantiles in the prediction distribution.
    - quantile_embedding_size (int): The embedding size of a quantile.
    - activation (Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization to use. See ding.torch_utils.fc_block for more details.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to predict IQN's output. Parameter updates with IQN's MLPs forward setup.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
- Returns:
    - outputs (Dict): Run with encoder and head. Return the resulting prediction dictionary.
- ReturnsKeys:
    - logit (torch.Tensor): Logit tensor with the same size as input x.
    - q (torch.Tensor): Q value tensor of size (num_quantiles, B, M).
    - quantiles (torch.Tensor): quantiles tensor of size (quantile_embedding_size, 1).
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is head_hidden_size.
    - logit (torch.FloatTensor): \((B, M)\), where M is action_shape.
    - quantiles (torch.Tensor): \((P, 1)\), where P is quantile_embedding_size.
- Examples:
>>> model = IQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([32, 4, 64])
>>> # default quantile_embedding_size: int = 128
>>> assert outputs['quantiles'].shape == torch.Size([128, 1])
FQF¶
- class ding.model.FQF(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network structure and computation graph of FQF, which combines distributional RL and DQN. You can refer to paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning https://arxiv.org/pdf/1911.02140.pdf for more details.
- Interface:
    __init__, forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the FQF Model according to input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape.
    - action_shape (Union[int, SequenceType]): Action space shape.
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder.
    - head_hidden_size (Optional[int]): The hidden_size to pass to Head.
    - head_layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): Number of quantiles in the prediction distribution.
    - quantile_embedding_size (int): The embedding size of a quantile.
    - activation (Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization to use. See ding.torch_utils.fc_block for more details.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to predict FQF’s output. Parameter updates with FQF’s MLPs forward setup.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor), q (torch.Tensor), quantiles (torch.Tensor), quantiles_hats (torch.Tensor), q_tau_i (torch.Tensor), and entropies (torch.Tensor).
- Shapes:
x: \((B, N)\), where B is batch size and N is head_hidden_size.
logit: \((B, M)\), where M is action_shape.
q: \((B, num_quantiles, M)\).
quantiles: \((B, num_quantiles + 1)\).
quantiles_hats: \((B, num_quantiles)\).
q_tau_i: \((B, num_quantiles - 1, M)\).
entropies: \((B, 1)\).
- Examples:
>>> model = FQF(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([4, 32, 64])
>>> assert outputs['quantiles'].shape == torch.Size([4, 33])
>>> assert outputs['quantiles_hats'].shape == torch.Size([4, 32])
>>> assert outputs['q_tau_i'].shape == torch.Size([4, 31, 64])
>>> assert outputs['entropies'].shape == torch.Size([4, 1])
BDQ¶
- class ding.model.BDQ(obs_shape: int | SequenceType, num_branches: int = 0, action_bins_per_branch: int = 2, layer_num: int = 3, a_layer_num: int | None = None, v_layer_num: int | None = None, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, norm_type: Module | None = None, activation: Module | None = ReLU())[source]¶
- __init__(obs_shape: int | SequenceType, num_branches: int = 0, action_bins_per_branch: int = 2, layer_num: int = 3, a_layer_num: int | None = None, v_layer_num: int | None = None, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, norm_type: Module | None = None, activation: Module | None = ReLU()) None [source]¶
- Overview:
Init the BDQ (encoder + head) model according to input arguments. Reference paper: Action Branching Architectures for Deep Reinforcement Learning, https://arxiv.org/pdf/1711.08946.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
    - num_branches (int): The number of branches, which is equivalent to the action dimension, such as 6 in MuJoCo's HalfCheetah environment.
    - action_bins_per_branch (int): The number of actions in each dimension.
    - layer_num (int): The number of layers used in the network to compute Advantage and Value output.
    - a_layer_num (int): The number of layers used in the network to compute Advantage output.
    - v_layer_num (int): The number of layers used in the network to compute Value output.
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder; the last element must match head_hidden_size.
    - head_hidden_size (Optional[int]): The hidden_size of the head network.
    - norm_type (Optional[str]): The type of normalization in networks. See ding.torch_utils.fc_block for more details.
    - activation (Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
- forward(x: Tensor) Dict [source]¶
- Overview:
BDQ forward computation graph, input observation tensor to predict q_value.
- Arguments:
    - x (torch.Tensor): Observation inputs.
- Returns:
    - outputs (Dict): BDQ forward outputs, such as q_value.
- ReturnsKeys:
    - logit (torch.Tensor): Discrete Q-value output of each action dimension.
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.
    - logit (torch.FloatTensor): \((B, M)\), where B is batch size and M is num_branches * action_bins_per_branch.
- Examples:
>>> model = BDQ(8, 5, 2) # arguments: 'obs_shape', 'num_branches' and 'action_bins_per_branch'. >>> inputs = torch.randn(4, 8) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 5, 2])
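- Examples (greedy action selection, sketch):
Continuing the example above, a minimal sketch of picking the greedy bin for every action branch from the per-branch Q-values; the variable names are illustrative:
>>> per_branch_action = outputs['logit'].argmax(dim=-1)  # greedy bin index for each branch
>>> assert per_branch_action.shape == torch.Size([4, 5])  # (batch_size, num_branches)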
RainbowDQN¶
- class ding.model.RainbowDQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51)[source]
- Overview:
The neural network structure and computation graph of RainbowDQN, which combines distributional RL and DQN. You can refer to paper Rainbow: Combining Improvements in Deep Reinforcement Learning https://arxiv.org/pdf/1710.02298.pdf for more details.
- Interfaces:
__init__
,forward
Note
RainbowDQN contains dueling architecture by default.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51) None [source]
- Overview:
Init the Rainbow Model according to arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape.action_shape (
Union[int, SequenceType]
): Action space shape.encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
head_hidden_size (
Optional[int]
): Thehidden_size
to pass toHead
.head_layer_num (
int
): The num of layers used in the network to compute Q value outputactivation (
Optional[nn.Module]
): The type of activation function to use inMLP
the afterlayer_fn
, ifNone
then default set tonn.ReLU()
norm_type (
Optional[str]
): The type of normalization to use, seeding.torch_utils.fc_block
for more details. n_atom (
Optional[int]
): Number of atoms in the prediction distribution.
- forward(x: Tensor) Dict [source]
- Overview:
Use the observation tensor to predict Rainbow's output, following Rainbow's MLP forward setup.
- Arguments:
- x (
torch.Tensor
): The input observation tensor data, with shape
(B, N=obs_shape)
.
- Returns:
- outputs (
Dict
): Run
MLP
withRainbowHead
setups and return the result prediction dictionary.
- ReturnsKeys:
logit (
torch.Tensor
): Logit tensor with same size as inputx
.distribution (
torch.Tensor
): Distribution tensor of size(B, N, n_atom)
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N is obs_shape. logit (
torch.FloatTensor
): \((B, M)\), where M is action_shape.distribution(
torch.FloatTensor
): \((B, M, P)\), where P is n_atom.
- Examples:
>>> model = RainbowDQN(64, 64) # arguments: 'obs_shape' and 'action_shape' >>> inputs = torch.randn(4, 64) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) >>> assert outputs['logit'].shape == torch.Size([4, 64]) >>> # default n_atom: int =51 >>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
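- Examples (expected Q-value, sketch):
Continuing the example above, a minimal sketch of collapsing the categorical return distribution into an expected Q-value with the default support (v_min=-10, v_max=10, n_atom=51); the variable names are illustrative:
>>> support = torch.linspace(-10, 10, 51)  # atoms of the value distribution
>>> q = (outputs['distribution'] * support).sum(-1)  # expectation over atoms
>>> assert q.shape == torch.Size([4, 64])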
DRQN¶
- class ding.model.DRQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, lstm_type: str | None = 'normal', activation: Module | None = ReLU(), norm_type: str | None = None, res_link: bool = False)[source]¶
- Overview:
The neural network structure and computation graph of the DRQN (DQN + RNN = DRQN) algorithm, which is the most common DQN variant for sequential data and partially observable environments. The DRQN is composed of three parts:
encoder
,head
andrnn
. Theencoder
is used to extract the feature from various observation, thernn
is used to process the sequential observation and other data, and thehead
is used to compute the Q value of each action dimension.- Interfaces:
__init__
,forward
.
Note
Current
DRQN
supports two types of encoder:FCEncoder
andConvEncoder
, two types of head:DiscreteHead
andDuelingHead
, three types of rnn:normal (LSTM with LayerNorm)
,pytorch
andgru
. You can customize your own encoder, rnn or head by inheriting this class.- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, lstm_type: str | None = 'normal', activation: Module | None = ReLU(), norm_type: str | None = None, res_link: bool = False) None [source]¶
- Overview:
Initialize the DRQN Model according to the corresponding input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
.dueling (
Optional[bool]
): Whether chooseDuelingHead
orDiscreteHead (default)
.head_hidden_size (
Optional[int]
): Thehidden_size
of head network, defaults to None, then it will be set to the last element ofencoder_hidden_size_list
.head_layer_num (
int
): The number of layers used in the head network to compute Q value output.lstm_type (
Optional[str]
): The type of RNN module, now support [‘normal’, ‘pytorch’, ‘gru’].activation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details. you can choose one of [‘BN’, ‘IN’, ‘SyncBN’, ‘LN’]res_link (
bool
): Whether to enable the residual link, which is the skip connection between single-frame data and the sequential data, defaults to False.
- forward(inputs: Dict, inference: bool = False, saved_state_timesteps: list | None = None) Dict [source]¶
- Overview:
DRQN forward computation graph, input observation tensor to predict q_value.
- Arguments:
inputs (
torch.Tensor
): The dict of input data, including observation and previous rnn state. inference (bool): Whether to enable inference forward mode; if True, we unroll only one timestep transition, otherwise we unroll the entire sequence of transitions.
saved_state_timesteps (Optional[list]): When inference is False, we unroll the sequence transitions, and use this list to indicate how to save and return the hidden states.
- ArgumentsKeys:
obs (
torch.Tensor
): The raw observation tensor.prev_state (
list
): The previous rnn state tensor, whose structure depends onlstm_type
.
- Returns:
outputs (
Dict
): The output of DRQN’s forward, including logit (q_value) and next state.
- ReturnsKeys:
logit (
torch.Tensor
): Discrete Q-value output of each possible action dimension.next_state (
list
): The next rnn state tensor, whose structure depends onlstm_type
.
- Shapes:
obs (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
logit (
torch.Tensor
): \((B, M)\), where B is batch size and M isaction_shape
- Examples:
>>> # Init input's Keys:
>>> prev_state = [[torch.randn(1, 1, 64) for __ in range(2)] for _ in range(4)]  # B=4
>>> obs = torch.randn(4, 64)
>>> model = DRQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> outputs = model({'obs': obs, 'prev_state': prev_state}, inference=True)
>>> # Check outputs's Keys
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == (4, 64)
>>> assert len(outputs['next_state']) == 4
>>> assert all([len(t) == 2 for t in outputs['next_state']])
>>> assert all([t[0].shape == (1, 1, 64) for t in outputs['next_state']])
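- Examples (recurrent rollout, sketch):
Continuing the example above, a minimal sketch of rolling the recurrent state forward across successive timesteps in inference mode by feeding next_state back in as prev_state; the variable names are illustrative:
>>> state = prev_state
>>> for _ in range(3):
...     obs_t = torch.randn(4, 64)
...     out = model({'obs': obs_t, 'prev_state': state}, inference=True)
...     state = out['next_state']  # reuse the returned rnn state for the next step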
GTrXLDQN¶
- class ding.model.GTrXLDQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, head_layer_num: int = 1, att_head_dim: int = 16, hidden_size: int = 16, att_head_num: int = 2, att_mlp_num: int = 2, att_layer_num: int = 3, memory_len: int = 64, activation: Module | None = ReLU(), head_norm_type: str | None = None, dropout: float = 0.0, gru_gating: bool = True, gru_bias: float = 2.0, dueling: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 256], encoder_norm_type: str | None = None)[source]¶
- Overview:
The neural network structure and computation graph of Gated Transformer-XL DQN algorithm, which is the enhanced version of DRQN, using Transformer-XL to improve long-term sequential modelling ability. The GTrXL-DQN is composed of three parts:
encoder
,head
andcore
. Theencoder
is used to extract the feature from various observation, thecore
is used to process the sequential observation and other data, and thehead
is used to compute the Q value of each action dimension.- Interfaces:
__init__
,forward
,reset_memory
,get_memory
.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, head_layer_num: int = 1, att_head_dim: int = 16, hidden_size: int = 16, att_head_num: int = 2, att_mlp_num: int = 2, att_layer_num: int = 3, memory_len: int = 64, activation: Module | None = ReLU(), head_norm_type: str | None = None, dropout: float = 0.0, gru_gating: bool = True, gru_bias: float = 2.0, dueling: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 256], encoder_norm_type: str | None = None) None [source]¶
- Overview:
Initialize the GTrXLDQN model according to the corresponding input arguments.
Tip
You can refer to GTrXl class in
ding.torch_utils.network.gtrxl
for more details about the input arguments.- Arguments:
obs_shape (
Union[int, SequenceType]
): Used by Transformer. Observation’s space.action_shape (:obj:Union[int, SequenceType]): Used by Head. Action’s space.
head_layer_num (
int
): Used by Head. Number of layers.att_head_dim (
int
): Used by Transformer.hidden_size (
int
): Used by Transformer and Head.att_head_num (
int
): Used by Transformer.att_mlp_num (
int
): Used by Transformer.att_layer_num (
int
): Used by Transformer.memory_len (
int
): Used by Transformer.activation (
Optional[nn.Module]
): Used by Transformer and Head. ifNone
then default set tonn.ReLU()
.head_norm_type (
Optional[str]
): Used by Head. The type of normalization to use, seeding.torch_utils.fc_block
for more details. dropout (
float
): Used by Transformer.gru_gating (
bool
): Used by Transformer.gru_bias (
float
): Used by Transformer.dueling (
bool
): Used by Head. Make the head dueling.encoder_hidden_size_list(
SequenceType
): Used by Encoder. The collection ofhidden_size
if using a custom convolutional encoder.encoder_norm_type (
Optional[str]
): Used by Encoder. The type of normalization to use, seeding.torch_utils.fc_block
for more details.
- forward(x: Tensor) Dict [source]¶
- Overview:
Let input tensor go through GTrXl and the Head sequentially.
- Arguments:
x (
torch.Tensor
): input tensor of shape (seq_len, bs, obs_shape).
- Returns:
out (
Dict
): runGTrXL
withDiscreteHead
setups and return the result prediction dictionary.
- ReturnKeys:
logit (
torch.Tensor
): discrete Q-value output of each action dimension, shape is (B, action_space).memory (
torch.Tensor
): memory tensor of size(bs x layer_num+1 x memory_len x embedding_dim)
.transformer_out (
torch.Tensor
): output tensor of transformer with same size as inputx
.
- Examples:
>>> # Init input's Keys: >>> obs_dim, seq_len, bs, action_dim = 128, 64, 32, 4 >>> obs = torch.rand(seq_len, bs, obs_dim) >>> model = GTrXLDQN(obs_dim, action_dim) >>> outputs = model(obs) >>> assert isinstance(outputs, dict)
- get_memory() Tensor | None [source]¶
- Overview:
Return the memory of GTrXL.
- Returns:
memory: (
Optional[torch.Tensor]
): output memory or None if memory has not been initialized, whose shape is (layer_num, memory_len, bs, embedding_dim).
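- Examples:
A minimal usage sketch; the memory may be None until it has been initialized (e.g. by a forward pass or reset_memory), and the concrete sizes follow the constructor arguments:
>>> obs_dim, seq_len, bs, action_dim = 128, 64, 32, 4
>>> model = GTrXLDQN(obs_dim, action_dim)
>>> outputs = model(torch.rand(seq_len, bs, obs_dim))
>>> memory = model.get_memory()  # (layer_num, memory_len, bs, embedding_dim) once initialized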
- reset_memory(batch_size: int | None = None, state: Tensor | None = None) None [source]¶
- Overview:
Clear or reset the memory of GTrXL.
- Arguments:
batch_size (
Optional[int]
): The number of samples in a training batch.state (
Optional[torch.Tensor]
): The input memory data, whose shape is (layer_num, memory_len, bs, embedding_dim).
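- Examples:
A minimal usage sketch that only passes batch_size; alternatively, an explicit state tensor of shape (layer_num, memory_len, bs, embedding_dim) can be provided:
>>> model = GTrXLDQN(128, 4)
>>> model.reset_memory(batch_size=32)  # clear the memory for a new batch of 32 sequences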
PG¶
- class ding.model.PG(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network and computation graph of algorithms related to Policy Gradient(PG) (https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf). The PG model is composed of two parts: encoder and head. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding action logit.
- Interface:
__init__
,forward
.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the PG model according to corresponding input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].action_space (
str
): The type of different action spaces, including [‘discrete’, ‘continuous’], then will instantiate corresponding head, includingDiscreteHead
andReparameterizationHead
.encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
.head_hidden_size (
Optional[int]
): Thehidden_size
ofhead
network, defaults to None, it must match the last element ofencoder_hidden_size_list
.head_layer_num (
int
): The num of layers used in thehead
network to compute action.activation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details. you can choose one of [‘BN’, ‘IN’, ‘SyncBN’, ‘LN’]
- Examples:
>>> model = PG((4, 84, 84), 5) >>> inputs = torch.randn(8, 4, 84, 84) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) >>> assert outputs['logit'].shape == (8, 5) >>> assert outputs['dist'].sample().shape == (8, )
- forward(x: Tensor) Dict [source]¶
- Overview:
PG forward computation graph, input observation tensor to predict policy distribution.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict containing the action logit and the policy distribution dist. If the action space is discrete, dist is a Categorical distribution; if the action space is continuous, dist is a Normal distribution.
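- Examples:
Mirroring the constructor example above, the forward call returns both the logit and the corresponding distribution:
>>> model = PG((4, 84, 84), 5)
>>> inputs = torch.randn(8, 4, 84, 84)
>>> outputs = model(inputs)
>>> assert outputs['logit'].shape == (8, 5)
>>> action = outputs['dist'].sample()  # shape (8, )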
VAC¶
- class ding.model.VAC(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, sigma_type: str | None = 'independent', fixed_sigma_value: int | None = 0.3, bound_type: str | None = None, encoder: Module | None = None, impala_cnn_encoder: bool = False)[source]¶
- Overview:
The neural network and computation graph of algorithms related to (state) Value Actor-Critic (VAC), such as A2C/PPO/IMPALA. This model now supports discrete, continuous and hybrid action space. The VAC is composed of four parts:
actor_encoder
,critic_encoder
,actor_head
andcritic_head
. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding value or action logit. In high-dimensional observation space like 2D image, we often use a shared encoder for bothactor_encoder
andcritic_encoder
. In low-dimensional observation space like 1D vector, we often use different encoders.- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
,compute_actor_critic
.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, sigma_type: str | None = 'independent', fixed_sigma_value: int | None = 0.3, bound_type: str | None = None, encoder: Module | None = None, impala_cnn_encoder: bool = False) None [source]¶
- Overview:
Initialize the VAC model according to corresponding input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].action_space (
str
): The type of different action spaces, including [‘discrete’, ‘continuous’, ‘hybrid’], then will instantiate corresponding head, includingDiscreteHead
,ReparameterizationHead
, and hybrid heads.share_encoder (
bool
): Whether to share the observation encoder between actor and critic. encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element is used as the input size ofactor_head
andcritic_head
.actor_head_hidden_size (
Optional[int]
): Thehidden_size
ofactor_head
network, defaults to 64, it is the hidden size of the last layer of theactor_head
network.actor_head_layer_num (
int
): The num of layers used in theactor_head
network to compute action.critic_head_hidden_size (
Optional[int]
): Thehidden_size
ofcritic_head
network, defaults to 64, it is the hidden size of the last layer of thecritic_head
network.critic_head_layer_num (
int
): The num of layers used in thecritic_head
network.activation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details. you can choose one of [‘BN’, ‘IN’, ‘SyncBN’, ‘LN’]sigma_type (
Optional[str]
): The type of sigma in continuous action space, seeding.torch_utils.network.dreamer.ReparameterizationHead
for more details, in A2C/PPO, it defaults toindependent
, which means state-independent sigma parameters.fixed_sigma_value (
Optional[int]
): Ifsigma_type
isfixed
, then use this value as sigma.bound_type (
Optional[str]
): The type of action bound methods in continuous action space, defaults toNone
, which means no bound.encoder (
Optional[torch.nn.Module]
): The encoder module, defaults toNone
, you can define your own encoder module and pass it into VAC to deal with different observation space.impala_cnn_encoder (
bool
): Whether to use IMPALA CNN encoder, defaults toFalse
.
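- Examples (continuous action space, sketch):
A hedged instantiation sketch for the continuous action space; the returned logit is assumed to follow the mu/sigma layout of ReparameterizationHead described above, and all concrete sizes are illustrative:
>>> model = VAC(obs_shape=8, action_shape=3, action_space='continuous')
>>> inputs = torch.randn(4, 8)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> mu, sigma = actor_outputs['logit']['mu'], actor_outputs['logit']['sigma']  # Gaussian parameters
>>> assert mu.shape == sigma.shape == torch.Size([4, 3])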
- compute_actor(x: Tensor) Dict [source]¶
- Overview:
VAC forward computation graph for actor part, input observation tensor to predict action logit.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of VAC’s forward computation graph for actor, includinglogit
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict withaction_type
andaction_args
.
- Shapes:
logit (
torch.Tensor
): \((B, N)\), where B is batch size and N isaction_shape
- Examples:
>>> model = VAC(64, 64) >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
- compute_actor_critic(x: Tensor) Dict [source]¶
- Overview:
VAC forward computation graph for both actor and critic part, input observation tensor to predict action logit and state value.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of VAC’s forward computation graph for both actor and critic, includinglogit
andvalue
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict withaction_type
andaction_args
.value (
torch.Tensor
): The predicted state value tensor.
- Shapes:
logit (
torch.Tensor
): \((B, N)\), where B is batch size and N isaction_shape
value (
torch.Tensor
): \((B, )\), where B is batch size, (B, 1) is squeezed to (B, ).
- Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([4])
>>> assert outputs['logit'].shape == torch.Size([4, 64])
Note
compute_actor_critic
interface aims to save computation when the encoder is shared, and returns the combined output dict.
- compute_critic(x: Tensor) Dict [source]¶
- Overview:
VAC forward computation graph for critic part, input observation tensor to predict state value.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of VAC’s forward computation graph for critic, includingvalue
.
- ReturnsKeys:
value (
torch.Tensor
): The predicted state value tensor.
- Shapes:
value (
torch.Tensor
): \((B, )\), where B is batch size, (B, 1) is squeezed to (B, ).
- Examples:
>>> model = VAC(64, 64) >>> inputs = torch.randn(4, 64) >>> critic_outputs = model(inputs,'compute_critic') >>> assert critic_outputs['value'].shape == torch.Size([4])
- forward(x: Tensor, mode: str) Dict [source]¶
- Overview:
VAC forward computation graph, input observation tensor to predict state value or action logit. Different
mode
will forward with different network modules to get different outputs and save computation.- Arguments:
x (
torch.Tensor
): The input observation tensor data.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
outputs (
Dict
): The output dict of VAC’s forward computation graph, whose key-values vary from differentmode
.
- Examples (Actor):
>>> model = VAC(64, 128) >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 128])
- Examples (Critic):
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs, 'compute_critic')
>>> assert critic_outputs['value'].shape == torch.Size([4])
- Examples (Actor-Critic):
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([4])
>>> assert outputs['logit'].shape == torch.Size([4, 64])
DREAMERVAC¶
- class ding.model.DREAMERVAC(action_shape: int | SequenceType | EasyDict, dyn_stoch=32, dyn_deter=512, dyn_discrete=32, actor_layers=2, value_layers=2, units=512, act='SiLU', norm='LayerNorm', actor_dist='normal', actor_init_std=1.0, actor_min_std=0.1, actor_max_std=1.0, actor_temp=0.1, action_unimix_ratio=0.01)[source]¶
- Overview:
The neural network and computation graph of DreamerV3 (state) Value Actor-Critic (VAC). This model now supports discrete and continuous action spaces.
- Interfaces:
__init__
,forward
.
- __init__(action_shape: int | SequenceType | EasyDict, dyn_stoch=32, dyn_deter=512, dyn_discrete=32, actor_layers=2, value_layers=2, units=512, act='SiLU', norm='LayerNorm', actor_dist='normal', actor_init_std=1.0, actor_min_std=0.1, actor_max_std=1.0, actor_temp=0.1, action_unimix_ratio=0.01) None [source]¶
- Overview:
Initialize the
DREAMERVAC
model according to arguments.- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].
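- Examples:
A minimal instantiation sketch with the default DreamerV3 hyper-parameters; the action shape is illustrative:
>>> model = DREAMERVAC(action_shape=6)
>>> # dyn_stoch, dyn_deter and dyn_discrete describe the DreamerV3 latent layout
>>> # that the actor and value heads are built on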
MAVAC¶
- class ding.model.MAVAC(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType, agent_num: int, actor_head_hidden_size: int = 256, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 512, critic_head_layer_num: int = 1, action_space: str = 'discrete', activation: Module | None = ReLU(), norm_type: str | None = None, sigma_type: str | None = 'independent', bound_type: str | None = None, encoder: Tuple[Module, Module] | None = None)[source]¶
- Overview:
The neural network and computation graph of algorithms related to (state) Value Actor-Critic (VAC) for multi-agent, such as MAPPO(https://arxiv.org/abs/2103.01955). This model now supports discrete and continuous action space. The MAVAC is composed of four parts:
actor_encoder
,critic_encoder
,actor_head
andcritic_head
. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding value or action logit.- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
,compute_actor_critic
.
- __init__(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType, agent_num: int, actor_head_hidden_size: int = 256, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 512, critic_head_layer_num: int = 1, action_space: str = 'discrete', activation: Module | None = ReLU(), norm_type: str | None = None, sigma_type: str | None = 'independent', bound_type: str | None = None, encoder: Tuple[Module, Module] | None = None) None [source]¶
- Overview:
Init the MAVAC Model according to arguments.
- Arguments:
agent_obs_shape (
Union[int, SequenceType]
): Observation’s space for single agent, such as 8 or [4, 84, 84].global_obs_shape (
Union[int, SequenceType]
): Global observation’s space, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape for single agent, such as 6 or [2, 3, 3].agent_num (
int
): The number of agents. This parameter is temporarily reserved and may be required by subsequent changes to the model. actor_head_hidden_size (
Optional[int]
): Thehidden_size
ofactor_head
network, defaults to 256, it must match the last element ofagent_obs_shape
.actor_head_layer_num (
int
): The num of layers used in theactor_head
network to compute action.critic_head_hidden_size (
Optional[int]
): Thehidden_size
ofcritic_head
network, defaults to 512, it must match the last element ofglobal_obs_shape
.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.action_space (
Union[int, SequenceType]
): The type of different action spaces, including [‘discrete’, ‘continuous’], then will instantiate corresponding head, includingDiscreteHead
andReparameterizationHead
.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
the afterlayer_fn
, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details. you can choose one of [‘BN’, ‘IN’, ‘SyncBN’, ‘LN’].sigma_type (
Optional[str]
): The type of sigma in continuous action space, seeding.torch_utils.network.dreamer.ReparameterizationHead
for more details, in MAPPO, it defaults toindependent
, which means state-independent sigma parameters.bound_type (
Optional[str]
): The type of action bound methods in continuous action space, defaults toNone
, which means no bound.encoder (
Optional[Tuple[torch.nn.Module, torch.nn.Module]]
): The encoder module list, defaults toNone
, you can define your own actor and critic encoder module and pass it into MAVAC to deal with different observation space.
- compute_actor(x: Dict) Dict [source]¶
- Overview:
MAVAC forward computation graph for actor part, predicting action logit with agent observation tensor in
x
.- Arguments:
- x (
Dict
): Input data dict with keys [‘agent_state’, ‘action_mask’(optional)]. agent_state: (
torch.Tensor
): Each agent local state(obs).action_mask(optional): (
torch.Tensor
): Whenaction_space
is discrete, action_mask needs to be provided to mask illegal actions.
- Returns:
outputs (
Dict
): The output dict of the forward computation graph for actor, includinglogit
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions.
- Shapes:
logit (
torch.FloatTensor
): \((B, M, N)\), where B is batch size and N isaction_shape
and M isagent_num
.
- Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14) >>> inputs = { 'agent_state': torch.randn(10, 8, 64), 'global_state': torch.randn(10, 8, 128), 'action_mask': torch.randint(0, 2, size=(10, 8, 14)) } >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([10, 8, 14])
- compute_actor_critic(x: Dict) Dict [source]¶
- Overview:
MAVAC forward computation graph for both actor and critic part, input observation to predict action logit and state value.
- Arguments:
x (
Dict
): The input dict containsagent_state
,global_state
and other related info.
- Returns:
outputs (
Dict
): The output dict of MAVAC’s forward computation graph for both actor and critic, includinglogit
andvalue
.
- ReturnsKeys:
logit (
torch.Tensor
): Logit encoding tensor, with same size as inputx
.value (
torch.Tensor
): Q value tensor with same size as batch size.
- Shapes:
logit (
torch.FloatTensor
): \((B, M, N)\), where B is batch size and N isaction_shape
and M isagent_num
.value (
torch.FloatTensor
): \((B, M)\), where B is batch size and M is
.
- Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([10, 8])
>>> assert outputs['logit'].shape == torch.Size([10, 8, 14])
- compute_critic(x: Dict) Dict [source]¶
- Overview:
MAVAC forward computation graph for critic part. Predict state value with global observation tensor in
x
.- Arguments:
- x (
Dict
): Input data dict with keys [‘global_state’]. global_state: (
torch.Tensor
): Global state(obs).
- Returns:
outputs (
Dict
): The output dict of MAVAC’s forward computation graph for critic, includingvalue
.
- ReturnsKeys:
value (
torch.Tensor
): The predicted state value tensor.
- Shapes:
value (
torch.FloatTensor
): \((B, M)\), where B is batch size and M isagent_num
.
- Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14) >>> inputs = { 'agent_state': torch.randn(10, 8, 64), 'global_state': torch.randn(10, 8, 128), 'action_mask': torch.randint(0, 2, size=(10, 8, 14)) } >>> critic_outputs = model(inputs,'compute_critic') >>> assert critic_outputs['value'].shape == torch.Size([10, 8])
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
MAVAC forward computation graph, input observation tensor to predict state value or action logit.
mode
includescompute_actor
,compute_critic
,compute_actor_critic
. Differentmode
will forward with different network modules to get different outputs and save computation.- Arguments:
inputs (
Dict
): The input dict including observation and related info, whose key-values vary from differentmode
.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
outputs (
Dict
): The output dict of MAVAC’s forward computation graph, whose key-values vary from differentmode
.
- Examples (Actor):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14) >>> inputs = { 'agent_state': torch.randn(10, 8, 64), 'global_state': torch.randn(10, 8, 128), 'action_mask': torch.randint(0, 2, size=(10, 8, 14)) } >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([10, 8, 14])
- Examples (Critic):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> critic_outputs = model(inputs, 'compute_critic')
>>> assert critic_outputs['value'].shape == torch.Size([10, 8])
- Examples (Actor-Critic):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([10, 8])
>>> assert outputs['logit'].shape == torch.Size([10, 8, 14])
ContinuousQAC¶
- class ding.model.ContinuousQAC(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, encoder_hidden_size_list: SequenceType | None = None, share_encoder: bool | None = False)[source]¶
- Overview:
The neural network and computation graph of algorithms related to Q-value Actor-Critic (QAC), such as DDPG/TD3/SAC. This model now supports continuous and hybrid action space. The ContinuousQAC is composed of four parts:
actor_encoder
,critic_encoder
,actor_head
andcritic_head
. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding Q-value or action logit. In high-dimensional observation space like 2D image, we often use a shared encoder for bothactor_encoder
andcritic_encoder
. In low-dimensional observation space like 1D vector, we often use different encoders.- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, encoder_hidden_size_list: SequenceType | None = None, share_encoder: bool | None = False) None [source]¶
- Overview:
Initialize the ContinuousQAC Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType, EasyDict]
): Action’s shape, such as 4, (3, ), EasyDict({‘action_type_shape’: 3, ‘action_args_shape’: 4}).action_space (
str
): The type of action space, including [regression
,reparameterization
,hybrid
],regression
is used for DDPG/TD3,reparameterization
is used for SAC andhybrid
for PADDPG.twin_critic (
bool
): Whether to use twin critic, one of tricks in TD3.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the actor network to compute action.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic head.critic_head_layer_num (
int
): The num of layers used in the critic network to compute Q-value.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
, this argument is only used in image observation.share_encoder (
Optional[bool]
): Whether to share encoder between actor and critic.
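- Examples (hybrid action space, sketch):
A hedged sketch of the hybrid configuration; the EasyDict layout follows the action_shape description above, the output keys follow the hybrid ReturnsKeys of compute_actor, and all concrete sizes are illustrative:
>>> from easydict import EasyDict
>>> action_shape = EasyDict({'action_type_shape': 3, 'action_args_shape': 4})
>>> model = ContinuousQAC(64, action_shape, 'hybrid')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs, 'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 3])        # discrete action type logit
>>> assert actor_outputs['action_args'].shape == torch.Size([4, 4])  # continuous action arguments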
- compute_actor(obs: Tensor) Dict[str, Tensor | Dict[str, Tensor]] [source]¶
- Overview:
QAC forward computation graph for actor part, input observation tensor to predict action or action logit.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]
): Actor output dict varying from action_space:regression
,reparameterization
,hybrid
.
- ReturnsKeys (regression):
action (
torch.Tensor
): Continuous action with same size asaction_shape
, usually in DDPG/TD3.
- ReturnsKeys (reparameterization):
logit (
Dict[str, torch.Tensor]
): The predicted reparameterization action logit, usually in SAC. It is a list containing two tensors: mu
andsigma
. The former is the mean of the gaussian distribution, the latter is the standard deviation of the gaussian distribution.
- ReturnsKeys (hybrid):
logit (
torch.Tensor
): The predicted discrete action type logit, it will be the same dimension asaction_type_shape
, i.e., all the possible discrete action types.action_args (
torch.Tensor
): Continuous action arguments with same size asaction_args_shape
.
- Shapes:
obs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds toobs_shape
.action (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toaction_shape
.logit.mu (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toaction_shape
.logit.sigma (
torch.Tensor
): \((B, N1)\), B is batch size.logit (
torch.Tensor
): \((B, N2)\), B is batch size and N2 corresponds toaction_shape.action_type_shape
.action_args (
torch.Tensor
): \((B, N3)\), B is batch size and N3 corresponds toaction_shape.action_args_shape
.
- Examples:
>>> # Regression mode >>> model = ContinuousQAC(64, 6, 'regression') >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['action'].shape == torch.Size([4, 6]) >>> # Reparameterization Mode >>> model = ContinuousQAC(64, 6, 'reparameterization') >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'][0].shape == torch.Size([4, 6]) # mu >>> actor_outputs['logit'][1].shape == torch.Size([4, 6]) # sigma
- compute_critic(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph for critic part, input observation and action tensor to predict Q-value.
- Arguments:
inputs (
Dict[str, torch.Tensor]
): The dict of input data, includingobs
andaction
tensor, also containslogit
andaction_args
tensor in hybrid action_space.
- ArgumentsKeys:
obs: (
torch.Tensor
): Observation tensor data, now supports a batch of 1-dim vector data.action (
Union[torch.Tensor, Dict]
): Continuous action with same size asaction_shape
.logit (
torch.Tensor
): Discrete action logit, only in hybrid action_space.action_args (
torch.Tensor
): Continuous action arguments, only in hybrid action_space.
- Returns:
outputs (
Dict[str, torch.Tensor]
): The output dict of QAC’s forward computation graph for critic, includingq_value
.
- ReturnKeys:
q_value (
torch.Tensor
): Q value tensor with same size as batch size.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
.logit (
torch.Tensor
): \((B, N2)\), B is batch size and N2 corresponds toaction_shape.action_type_shape
.action_args (
torch.Tensor
): \((B, N3)\), B is batch size and N3 corresponds toaction_shape.action_args_shape
.action (
torch.Tensor
): \((B, N4)\), where B is batch size and N4 isaction_shape
.q_value (
torch.Tensor
): \((B, )\), where B is batch size.
- Examples:
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)} >>> model = ContinuousQAC(obs_shape=(8, ),action_shape=1, action_space='regression') >>> assert model(inputs, mode='compute_critic')['q_value'].shape == (4, ) # q value
- forward(inputs: Tensor | Dict[str, Tensor], mode: str) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph, input observation tensor to predict Q-value or action logit. Different
mode
will forward with different network modules to get different outputs and save computation.- Arguments:
inputs (
Union[torch.Tensor, Dict[str, torch.Tensor]]
): The input data for forward computation graph, forcompute_actor
, it is the observation tensor, forcompute_critic
, it is the dict data including obs and action tensor.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
output (
Dict[str, torch.Tensor]
): The output dict of QAC forward computation graph, whose key-values vary in different forward modes.
- Examples (Actor):
>>> # Regression mode >>> model = ContinuousQAC(64, 6, 'regression') >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['action'].shape == torch.Size([4, 6]) >>> # Reparameterization Mode >>> model = ContinuousQAC(64, 6, 'reparameterization') >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'][0].shape == torch.Size([4, 6]) # mu >>> actor_outputs['logit'][1].shape == torch.Size([4, 6]) # sigma
- Examples (Critic):
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)} >>> model = ContinuousQAC(obs_shape=(8, ),action_shape=1, action_space='regression') >>> assert model(inputs, mode='compute_critic')['q_value'].shape == (4, ) # q value
DiscreteQAC¶
- class ding.model.DiscreteQAC(obs_shape: int | SequenceType, action_shape: int | SequenceType, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, encoder_hidden_size_list: SequenceType | None = None, share_encoder: bool | None = False)[source]¶
- Overview:
The neural network and computation graph of algorithms related to discrete action Q-value Actor-Critic (QAC), such as DiscreteSAC. This model now supports only discrete action space. The DiscreteQAC is composed of four parts:
actor_encoder
,critic_encoder
,actor_head
andcritic_head
. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding Q-value or action logit. In high-dimensional observation space like 2D image, we often use a shared encoder for bothactor_encoder
andcritic_encoder
. In low-dimensional observation space like 1D vector, we often use different encoders.- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, encoder_hidden_size_list: SequenceType | None = None, share_encoder: bool | None = False) None [source]¶
- Overview:
Initialize the DiscreteQAC Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType, EasyDict]
): Action’s shape, such as 4, (3, ).twin_critic (
bool
): Whether to use twin critic.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the actor network to compute action.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic head.critic_head_layer_num (
int
): The num of layers used in the critic network to compute Q-value.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
, this argument is only used in image observation.share_encoder (
Optional[bool]
): Whether to share encoder between actor and critic.
- compute_actor(inputs: Tensor) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph for actor part, input observation tensor to predict action or action logit.
- Arguments:
inputs (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict[str, torch.Tensor]
): The output dict of QAC forward computation graph for actor, including discrete actionlogit
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted discrete action type logit, it will be the same dimension asaction_shape
, i.e., all the possible discrete action choices.
- Shapes:
inputs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds toobs_shape
.logit (
torch.Tensor
): \((B, N2)\), B is batch size and N2 corresponds toaction_shape
.
- Examples:
>>> model = DiscreteQAC(64, 6) >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 6])
- compute_critic(inputs: Tensor) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph for critic part, input observation to predict Q-value for each possible discrete action choices.
- Arguments:
inputs (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict[str, torch.Tensor]
): The output dict of QAC forward computation graph for critic, includingq_value
for each possible discrete action choices.
- ReturnKeys:
q_value (
torch.Tensor
): The predicted Q-value for each possible discrete action choices, it will be the same dimension asaction_shape
and used to calculate the loss.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
.q_value (
torch.Tensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
.
- Examples:
>>> model = DiscreteQAC(64, 6, twin_critic=False)
>>> obs = torch.randn(4, 64)
>>> critic_outputs = model(obs, 'compute_critic')
>>> assert critic_outputs['q_value'].shape == torch.Size([4, 6])
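- Examples (greedy action selection, sketch):
Continuing the critic example above, a minimal sketch of greedy action selection from the per-action Q-values; the variable names are illustrative:
>>> greedy_action = critic_outputs['q_value'].argmax(dim=-1)  # pick the best action per sample
>>> assert greedy_action.shape == torch.Size([4])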
- forward(inputs: Tensor, mode: str) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph, input observation tensor to predict Q-value or action logit. Different
mode
will forward with different network modules to get different outputs and save computation.- Arguments:
inputs (
torch.Tensor
): The input observation tensor data.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
output (
Dict[str, torch.Tensor]
): The output dict of QAC forward computation graph, whose key-values vary in different forward modes.
- Examples (Actor):
>>> model = DiscreteQAC(64, 6) >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 6])
- Examples(Critic):
>>> model = DiscreteQAC(64, 6, twin_critic=False)
>>> obs = torch.randn(4, 64)
>>> critic_outputs = model(obs, 'compute_critic')
>>> assert critic_outputs['q_value'].shape == torch.Size([4, 6])
ContinuousMAQAC¶
- class ding.model.ContinuousMAQAC(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network and computation graph of algorithms related to the continuous-action Multi-Agent Q-value Actor-CritiC (MAQAC) model. The model is composed of an actor and a critic, both of which are MLP networks. The actor network is used to predict the action probability distribution, and the critic network is used to predict the Q value of the state-action pair.
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the ContinuousMAQAC model according to the input arguments.
- Arguments:
agent_obs_shape (
Union[int, SequenceType]
): Agent's observation space. global_obs_shape (
Union[int, SequenceType]
): Global observation space. action_shape (
Union[int, SequenceType, EasyDict]
): Action’s space, such as 4, (3, )action_space (
str
): Whether chooseregression
orreparameterization
.twin_critic (
bool
): Whether include twin critic.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor-nn’sHead
.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor’s nn.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic-nn’sHead
.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
the afterlayer_fn
, ifNone
then default set tonn.ReLU()
norm_type (
Optional[str]
): The type of normalization to use, seeding.torch_utils.fc_block
for more details.
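- Examples (regression action space, sketch):
A hedged sketch of the regression action space (as used in DDPG-style algorithms), mirroring the shapes documented in compute_actor below; all concrete sizes are illustrative:
>>> model = ContinuousMAQAC(216, 264, 14, 'regression', twin_critic=False)
>>> data = {'agent_state': torch.randn(32, 8, 216)}  # (B, agent_num, agent_obs_shape)
>>> action = model.compute_actor(data)['action']
>>> assert action.shape == torch.Size([32, 8, 14])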
- compute_actor(inputs: Dict) Dict [source]¶
- Overview:
Use observation tensor to predict action logits.
- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.
- Returns:
outputs (
Dict
): Outputs of network forward.
- ReturnKeys (
action_space == 'regression'
): action (
torch.Tensor
): Action tensor with same size asaction_shape
.
- ReturnKeys (
action_space == 'reparameterization'
): logit (
list
): 2 elements, each is the shape of \((B, A, N3)\), where B is batch size and A is agent num. N3 corresponds toaction_shape
.
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> act_space = 'reparameterization' # 'regression' >>> data = { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> } >>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False) >>> if action_space == 'regression': >>> action = model.compute_actor(data)['action'] >>> elif action_space == 'reparameterization': >>> (mu, sigma) = model.compute_actor(data)['logit']
- compute_critic(inputs: Dict) Dict [source]¶
- Overview:
Use observation tensor and action tensor to predict Q value.
- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
action
(torch.Tensor
): The action tensor data, with shape \((B, A, N3)\), where B is batch size and A is agent num. N3 corresponds toaction_shape
.
- Returns:
outputs (
Dict
): Outputs of network forward.
- ReturnKeys (
twin_critic=True
): q_value (
list
): 2 elements, each is the shape of \((B, A)\), where B is batch size and A is agent num.
- ReturnKeys (
twin_critic=False
): q_value (
torch.Tensor
): \((B, A)\), where B is batch size and A is agent num.
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> act_space = 'reparameterization' # 'regression' >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> }, >>> 'action': torch.randn(B, agent_num, squeeze(action_shape)) >>> } >>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False) >>> value = model.compute_critic(data)['q_value']
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Use observation and action tensor to predict output in
compute_actor
orcompute_critic
mode.- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
action
(torch.Tensor
): The action tensor data, with shape \((B, A, N3)\), where B is batch size and A is agent num. N3 corresponds toaction_shape
.
mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Outputs of network forward, whose key-values will be different for differentmode
,twin_critic
,action_space
.
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> act_space = 'reparameterization' # regression >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> }, >>> 'action': torch.randn(B, agent_num, squeeze(action_shape)) >>> } >>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False) >>> if action_space == 'regression': >>> action = model(data['obs'], mode='compute_actor')['action'] >>> elif action_space == 'reparameterization': >>> (mu, sigma) = model(data['obs'], mode='compute_actor')['logit'] >>> value = model(data, mode='compute_critic')['q_value']
DiscreteMAQAC¶
- class ding.model.DiscreteMAQAC(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network and computation graph of algorithms related to the discrete-action Multi-Agent Q-value Actor-CritiC (MAQAC) model. The model is composed of an actor and a critic, both of which are MLP networks. The actor network is used to predict the action probability distribution, and the critic network is used to predict the Q value of the state-action pair.
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the DiscreteMAQAC Model according to arguments.
- Arguments:
agent_obs_shape (
Union[int, SequenceType]
): Agent’s observation’s space.global_obs_shape (
Union[int, SequenceType]
): Global observation's space. action_shape (
Union[int, SequenceType]
): Action’s space.twin_critic (
bool
): Whether include twin critic.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor-nn’sHead
.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor’s nn.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic-nn’sHead
.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after layer_fn
, ifNone
then default set tonn.ReLU()
norm_type (
Optional[str]
): The type of normalization to use, seeding.torch_utils.fc_block
for more details.
- compute_actor(inputs: Dict) Dict [source]¶
- Overview:
Use observation tensor to predict action logits.
- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
- inputs (
- Returns:
- output (
Dict[str, torch.Tensor]
): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different forward modes. logit (
torch.Tensor
): Action’s output logit (real value range), whose shape is \((B, A, N2)\), where N2 corresponds toaction_shape
.action_mask (
torch.Tensor
): Action mask tensor with same size asaction_shape
.
- output (
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> } >>> } >>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True) >>> logit = model.compute_actor(data)['logit']
- compute_critic(inputs: Dict) Dict [source]¶
- Overview:
Use observation tensor to predict Q value.
- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
- inputs (
- Returns:
- output (
Dict[str, torch.Tensor]
): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different values oftwin_critic
. q_value (
list
): Iftwin_critic=True
, q_value should be 2 elements, each is the shape of \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
. Otherwise, q_value should betorch.Tensor
.
- output (
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> } >>> } >>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True) >>> value = model.compute_critic(data)['q_value']
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Use observation tensor to predict output, with
compute_actor
orcompute_critic
mode.- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
- inputs (
mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
output (
Dict[str, torch.Tensor]
): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different forward modes.
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> } >>> } >>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True) >>> logit = model(data, mode='compute_actor')['logit'] >>> value = model(data, mode='compute_critic')['q_value']
QACDIST¶
- class ding.model.QACDIST(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'regression', critic_head_type: str = 'categorical', actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51)[source]¶
- Overview:
The QAC model with distributional Q-value.
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'regression', critic_head_type: str = 'categorical', actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51) None [source]¶
- Overview:
Init the QAC Distributional Model according to arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s space.action_shape (
Union[int, SequenceType]
): Action’s space.action_space (
str
): Whether chooseregression
orreparameterization
.critic_head_type (
str
): Onlycategorical
.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor-nn’sHead
.- actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor’s nn.
- actor_head_layer_num (
critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic-nn’sHead
.- critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.
- critic_head_layer_num (
- activation (
Optional[nn.Module]
): The type of activation function to use in
MLP
after layer_fn
, ifNone
then default set tonn.ReLU()
- activation (
- norm_type (
Optional[str]
): The type of normalization to use, see
ding.torch_utils.fc_block
for more details.
- norm_type (
v_min (
int
): Value of the smallest atomv_max (
int
): Value of the largest atomn_atom (
int
): Number of atoms in the support
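- Example (illustrative sketch):
A minimal sketch (not part of the DI-engine API; the names atoms, dist and q_value are assumptions for illustration) of how a categorical value distribution defined by v_min, v_max and n_atom is reduced to an expected Q value, which is what the categorical critic head represents.
>>> import torch
>>> v_min, v_max, n_atom = -10., 10., 51
>>> atoms = torch.linspace(v_min, v_max, n_atom)              # fixed support of the distribution
>>> dist = torch.softmax(torch.randn(4, 1, n_atom), dim=-1)   # (B, 1, n_atom) probabilities
>>> q_value = (dist * atoms).sum(-1)                          # expected Q value, shape (B, 1)
>>> assert q_value.shape == torch.Size([4, 1])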
- compute_actor(inputs: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to predict the actor output in
'compute_actor'
mode.- Arguments:
- inputs (
torch.Tensor
): The encoded embedding tensor, determined with given
hidden_size
, i.e.(B, N=hidden_size)
.hidden_size = actor_head_hidden_size
- inputs (
mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Outputs of forward pass encoder and head.
- ReturnsKeys (either):
action (
torch.Tensor
): Continuous action tensor with same size asaction_shape
.- logit (
torch.Tensor
): Logit tensor encoding
mu
andsigma
, both with same size as inputx
.
- logit (
- Shapes:
inputs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds tohidden_size
action (
torch.Tensor
): \((B, N0)\)logit (
list
): 2 elements, mu and sigma, each is the shape of \((B, N0)\).q_value (
torch.FloatTensor
): \((B, )\), B is batch size.
- Examples:
>>> # Regression mode >>> model = QACDIST(64, 64, 'regression') >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['action'].shape == torch.Size([4, 64]) >>> # Reparameterization Mode >>> model = QACDIST(64, 64, 'reparameterization') >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> actor_outputs['logit'][0].shape # mu >>> torch.Size([4, 64]) >>> actor_outputs['logit'][1].shape # sigma >>> torch.Size([4, 64])
- compute_critic(inputs: Dict) Dict [source]¶
- Overview:
Use encoded obs and action tensors to predict Q-value output in
'compute_critic'
mode.- Arguments:
obs
,action
encoded tensors.mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Q-value output and distribution.
- ReturnKeys:
q_value (
torch.Tensor
): Q value tensor with same size as batch size.distribution (
torch.Tensor
): Q value distribution tensor.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
action (
torch.Tensor
): \((B, N2)\), where B is batch size and N2 is``action_shape``q_value (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
distribution (
torch.FloatTensor
): \((B, 1, N3)\), where B is batch size and N3 isnum_atom
- Examples:
>>> # Categorical mode >>> N = 32 >>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)} >>> model = QACDIST(obs_shape=(N, ), action_shape=1, action_space='regression', ... critic_head_type='categorical', n_atom=51) >>> q_value = model(inputs, mode='compute_critic') # q value >>> assert q_value['q_value'].shape == torch.Size([4, 1]) >>> assert q_value['distribution'].shape == torch.Size([4, 1, 51])
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Use observation and action tensors to predict output according to the given forward mode of QACDIST.
- Arguments:
- Forward with
'compute_actor'
: - inputs (
torch.Tensor
): The encoded embedding tensor, determined with given
hidden_size
, i.e.(B, N=hidden_size)
. Whetheractor_head_hidden_size
orcritic_head_hidden_size
depend onmode
.
- inputs (
- Forward with
'compute_critic'
, inputs (Dict) Necessary Keys: obs
,action
encoded tensors.
mode (
str
): Name of the forward mode.
- Forward with
- Returns:
outputs (
Dict
): Outputs of network forward.- Forward with
'compute_actor'
, Necessary Keys (either): action (
torch.Tensor
): Action tensor with same size as inputx
.- logit (
torch.Tensor
): Logit tensor encoding
mu
andsigma
, both with same size as inputx
.
- logit (
- Forward with
'compute_critic'
, Necessary Keys: q_value (
torch.Tensor
): Q value tensor with same size as batch size.distribution (
torch.Tensor
): Q value distribution tensor.
- Forward with
- Actor Shapes:
inputs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds tohidden_size
action (
torch.Tensor
): \((B, N0)\)q_value (
torch.FloatTensor
): \((B, )\), where B is batch size.
- Critic Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
action (
torch.Tensor
): \((B, N2)\), where B is batch size and N2 is``action_shape``q_value (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
distribution (
torch.FloatTensor
): \((B, 1, N3)\), where B is batch size and N3 isnum_atom
- Actor Examples:
>>> # Regression mode >>> model = QACDIST(64, 64, 'regression') >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['action'].shape == torch.Size([4, 64]) >>> # Reparameterization Mode >>> model = QACDIST(64, 64, 'reparameterization') >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> actor_outputs['logit'][0].shape # mu >>> torch.Size([4, 64]) >>> actor_outputs['logit'][1].shape # sigma >>> torch.Size([4, 64])
- Critic Examples:
>>> # Categorical mode >>> N = 32 >>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)} >>> model = QACDIST(obs_shape=(N, ), action_shape=1, action_space='regression', ... critic_head_type='categorical', n_atom=51) >>> q_value = model(inputs, mode='compute_critic') # q value >>> assert q_value['q_value'].shape == torch.Size([4, 1]) >>> assert q_value['distribution'].shape == torch.Size([4, 1, 51])
DiscreteBC¶
- class ding.model.DiscreteBC(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, strides: list | None = None)[source]¶
- Overview:
The DiscreteBC network.
- Interfaces:
__init__
,forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, strides: list | None = None) None [source]¶
- Overview:
Init the DiscreteBC (encoder + head) Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
.dueling (
bool
): Whether chooseDuelingHead
orDiscreteHead(default)
.head_hidden_size (
Optional[int]
): Thehidden_size
of head network.head_layer_num (
int
): The number of layers used in the head network to compute Q value outputactivation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details.strides (
Optional[list]
): The strides for each convolution layers, such as [2, 2, 2]. The length of this argument should be the same asencoder_hidden_size_list
.
- forward(x: Tensor) Dict [source]¶
- Overview:
DiscreteBC forward computation graph, input observation tensor to predict q_value.
- Arguments:
x (
torch.Tensor
): Observation inputs
- Returns:
outputs (
Dict
): DiscreteBC forward outputs, such as q_value.
- ReturnsKeys:
logit (
torch.Tensor
): Discrete Q-value output of each action dimension.
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
logit (
torch.FloatTensor
): \((B, M)\), where B is batch size and M isaction_shape
- Examples:
>>> model = DiscreteBC(32, 6) # arguments: 'obs_shape' and 'action_shape' >>> inputs = torch.randn(4, 32) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])
ContinuousBC¶
- class ding.model.ContinuousBC(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The ContinuousBC network.
- Interfaces:
__init__
,forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the ContinuousBC Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType, EasyDict]
): Action’s shape, such as 4, (3, ), EasyDict({‘action_type_shape’: 3, ‘action_args_shape’: 4}).action_space (
str
): The type of action space, including [regression
,reparameterization
].actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor head.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.
- forward(inputs: Tensor | Dict[str, Tensor]) Dict [source]¶
- Overview:
The unique execution (forward) method of ContinuousBC.
- Arguments:
inputs (
torch.Tensor
): Observation data, defaults to tensor.
- Returns:
output (
Dict
): Output dict data, including different key-values among distinct action_space.
- ReturnsKeys:
action (
torch.Tensor
): action output of actor network, with shape \((B, action_shape)\).logit (
List[torch.Tensor]
): reparameterized action output of actor network, with shape \((B, action_shape)\).
- Shapes:
inputs (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
action (
torch.FloatTensor
): \((B, M)\), where B is batch size and M isaction_shape
logit (
List[torch.FloatTensor]
): \((B, M)\), where B is batch size and M isaction_shape
- Examples (Regression):
>>> model = ContinuousBC(32, 6, action_space='regression') >>> inputs = torch.randn(4, 32) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) and outputs['action'].shape == torch.Size([4, 6])
- Examples (Reparameterization):
>>> model = ContinuousBC(32, 6, action_space='reparameterization') >>> inputs = torch.randn(4, 32) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) and outputs['logit'][0].shape == torch.Size([4, 6]) >>> assert outputs['logit'][1].shape == torch.Size([4, 6])
PDQN¶
- class ding.model.PDQN(obs_shape: int | SequenceType, action_shape: EasyDict, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, multi_pass: bool | None = False, action_mask: list | None = None)[source]¶
- Overview:
The neural network and computation graph of PDQN(https://arxiv.org/abs/1810.06394v1) and MPDQN(https://arxiv.org/abs/1905.04388) algorithms for parameterized action space. This model supports parameterized action space with discrete
action_type
and continuousaction_arg
. In principle, PDQN consists of x network (continuous action parameter network) and Q network (discrete action type network). But for simplicity, the code is split intoencoder
andactor_head
, which contain the encoder and head of the above two networks respectively.- Interface:
__init__
,forward
,compute_discrete
,compute_continuous
.
- __init__(obs_shape: int | SequenceType, action_shape: EasyDict, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, multi_pass: bool | None = False, action_mask: list | None = None) None [source]¶
- Overview:
Init the PDQN (encoder + head) Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
EasyDict
): Action space shape in dict type, such as EasyDict({‘action_type_shape’: 3, ‘action_args_shape’: 5}).encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
.dueling (
bool
): Whether chooseDuelingHead
orDiscreteHead(default)
.head_hidden_size (
Optional[int]
): Thehidden_size
of head network.head_layer_num (
int
): The number of layers used in the head network to compute Q value output.activation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details.multi_pass (
Optional[bool]
): Whether to use multi pass version.action_mask: (
Optional[list]
): An action mask indicating how action args are associated with each discrete action. For example, if there are 3 discrete actions and 4 continuous action args, and the first discrete action is associated with the first continuous action arg, the second discrete action with the second continuous action arg, and the third discrete action with the remaining 2 action args, the action mask will be [[1,0,0,0],[0,1,0,0],[0,0,1,1]] with shape 3x4.
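- Example (illustrative sketch):
A minimal sketch (not DI-engine code; the variable names are assumptions for illustration) of how the action mask from the example above picks out the continuous args that belong to a chosen discrete action.
>>> import torch
>>> # 3 discrete action types, 4 continuous args, matching the mask in the docstring above
>>> action_mask = torch.tensor([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]], dtype=torch.bool)
>>> action_args = torch.randn(4)    # all continuous args predicted by the continuous network
>>> action_type = 2                 # suppose the Q network selects the third discrete action
>>> relevant_args = action_args[action_mask[action_type]]   # keep only its associated args
>>> assert relevant_args.shape == torch.Size([2])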
- compute_continuous(inputs: Tensor) Dict [source]¶
- Overview:
Use observation tensor to predict continuous action args.
- Arguments:
inputs (
torch.Tensor
): Observation inputs.
- Returns:
- outputs (
Dict
): A dict with key ‘action_args’. ‘action_args’ (
torch.Tensor
): The continuous action args.
- outputs (
- Shapes:
inputs (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
.action_args (
torch.Tensor
): \((B, M)\), where M isaction_args_shape
.
- Examples:
>>> act_shape = EasyDict({'action_type_shape': (3, ), 'action_args_shape': (5, )}) >>> model = PDQN(4, act_shape) >>> inputs = torch.randn(64, 4) >>> outputs = model.forward(inputs, mode='compute_continuous') >>> assert outputs['action_args'].shape == torch.Size([64, 5])
- compute_discrete(inputs: Dict | EasyDict) Dict [source]¶
- Overview:
Use observation tensor and continuous action args to predict discrete action types.
- Arguments:
- inputs (
Union[Dict, EasyDict]
): A dict with keys ‘state’, ‘action_args’. state (
torch.Tensor
): Observation inputs.action_args (
torch.Tensor
): Action parameters are used to concatenate with the observation and serve as input to the discrete action type network.
- inputs (
- Returns:
- outputs (
Dict
): A dict with keys ‘logit’, ‘action_args’. ‘logit’: The logit value for each discrete action.
‘action_args’: The continuous action args(same as the inputs[‘action_args’]) for later usage.
- outputs (
- Examples:
>>> act_shape = EasyDict({'action_type_shape': (3, ), 'action_args_shape': (5, )}) >>> model = PDQN(4, act_shape) >>> inputs = {'state': torch.randn(64, 4), 'action_args': torch.randn(64, 5)} >>> outputs = model.forward(inputs, mode='compute_discrete') >>> assert outputs['logit'].shape == torch.Size([64, 3]) >>> assert outputs['action_args'].shape == torch.Size([64, 5])
- forward(inputs: Tensor | Dict | EasyDict, mode: str) Dict [source]¶
- Overview:
PDQN forward computation graph, input observation tensor to predict q_value for discrete actions and values for continuous action_args.
- Arguments:
inputs (
Union[torch.Tensor, Dict, EasyDict]
): Inputs including observation and other info according to mode.mode (
str
): Name of the forward mode.
- Shapes:
inputs (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
.
DecisionTransformer¶
- class ding.model.DecisionTransformer(state_dim: int | SequenceType, act_dim: int, n_blocks: int, h_dim: int, context_len: int, n_heads: int, drop_p: float, max_timestep: int = 4096, state_encoder: Module | None = None, continuous: bool = False)[source]¶
- Overview:
The implementation of decision transformer.
- Interfaces:
__init__
,forward
,configure_optimizers
- __init__(state_dim: int | SequenceType, act_dim: int, n_blocks: int, h_dim: int, context_len: int, n_heads: int, drop_p: float, max_timestep: int = 4096, state_encoder: Module | None = None, continuous: bool = False)[source]¶
- Overview:
Initialize the DecisionTransformer Model according to input arguments.
- Arguments:
state_dim (
Union[int, SequenceType]
): Dimension of state, such as 128 or (4, 84, 84).act_dim (
int
): The dimension of actions, such as 6.n_blocks (
int
): The number of transformer blocks in the decision transformer, such as 3.h_dim (
int
): The dimension of the hidden layers, such as 128.context_len (
int
): The max context length of the attention, such as 6.n_heads (
int
): The number of heads in calculating attention, such as 8.drop_p (
float
): The drop rate of the drop-out layer, such as 0.1.max_timestep (
int
): The max length of the total sequence, defaults to be 4096.state_encoder (
Optional[nn.Module]
): The encoder to pre-process the given input. If it is set to None, the raw state will be pushed into the transformer.continuous (
bool
): Whether the action space is continuous, defaults to beFalse
.
- forward(timesteps: Tensor, states: Tensor, actions: Tensor, returns_to_go: Tensor, tar: int | None = None) Tuple[Tensor, Tensor, Tensor] [source]¶
- Overview:
Forward computation graph of the decision transformer, input a sequence tensor and return a tensor with the same shape.
- Arguments:
timesteps (
torch.Tensor
): The timestep for input sequence.states (
torch.Tensor
): The sequence of states.actions (
torch.Tensor
): The sequence of actions.returns_to_go (
torch.Tensor
): The sequence of return-to-go.tar (
Optional[int]
): Whether to predict action, regardless of index.
- Returns:
output (
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
): Output contains three tensors, they are correspondingly the predicted states, predicted actions and predicted return-to-go.
- Examples:
>>> B, T = 4, 6 >>> state_dim = 3 >>> act_dim = 2 >>> DT_model = DecisionTransformer( state_dim=state_dim, act_dim=act_dim, n_blocks=3, h_dim=8, context_len=T, n_heads=2, drop_p=0.1, ) >>> timesteps = torch.randint(0, 100, [B, 3 * T - 1, 1], dtype=torch.long) # B x T >>> states = torch.randn([B, T, state_dim]) # B x T x state_dim >>> actions = torch.randint(0, act_dim, [B, T, 1]) >>> action_target = torch.randint(0, act_dim, [B, T, 1]) >>> returns_to_go = torch.tensor([1, 0.8, 0.6, 0.4, 0.2, 0.]).repeat([B, 1]).unsqueeze(-1).float() >>> traj_mask = torch.ones([B, T], dtype=torch.long) # B x T >>> actions = actions.squeeze(-1) >>> state_preds, action_preds, return_preds = DT_model.forward( timesteps=timesteps, states=states, actions=actions, returns_to_go=returns_to_go ) >>> assert state_preds.shape == torch.Size([B, T, state_dim]) >>> assert return_preds.shape == torch.Size([B, T, 1]) >>> assert action_preds.shape == torch.Size([B, T, act_dim])
LanguageTransformer¶
- class ding.model.LanguageTransformer(model_name: str = 'bert-base-uncased', add_linear: bool = False, embedding_size: int = 128, freeze_encoder: bool = True, hidden_dim: int = 768, norm_embedding: bool = False)[source]¶
- Overview:
The LanguageTransformer network. Download a pre-trained language model and add head on it. In the default case, we use BERT model as the text encoder, whose bi-directional character is good for obtaining the embedding of the whole sentence.
- Interfaces:
__init__
,forward
- __init__(model_name: str = 'bert-base-uncased', add_linear: bool = False, embedding_size: int = 128, freeze_encoder: bool = True, hidden_dim: int = 768, norm_embedding: bool = False) None [source]¶
- Overview:
Init the LanguageTransformer Model according to input arguments.
- Arguments:
model_name (
str
): The base language model name in huggingface, such as “bert-base-uncased”.add_linear (
bool
): Whether to add a linear layer on the top of language model, defaults to beFalse
.embedding_size (
int
): The embedding size of the added linear layer, such as 128.freeze_encoder (
bool
): Whether to freeze the encoder language model while training, defaults to beTrue
.hidden_dim (
int
): The embedding dimension of the encoding model (e.g. BERT). This value should correspond to the model you use. For bert-base-uncased, this value is 768.norm_embedding (
bool
): Whether to normalize the embedding vectors. Default to beFalse
.
- forward(train_samples: List[str], candidate_samples: List[str] | None = None, mode: str = 'compute_actor') Dict [source]¶
- Overview:
LanguageTransformer forward computation graph, input two lists of strings and predict their matching scores. Different
mode
will forward with different network modules to get different outputs.- Arguments:
train_samples (
List[str]
): One list of strings.candidate_samples (
Optional[List[str]]
): The other list of strings to calculate matching scores.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
output (
Dict
): Output dict data, including the logit of matching scores and the correspondingtorch.distributions.Categorical
object.
- Examples:
>>> test_pids = [1] >>> cand_pids = [0, 2, 4] >>> problems = [ "This is problem 0", "This is the first question", "Second problem is here", "Another problem", "This is the last problem" ] >>> ctxt_list = [problems[pid] for pid in test_pids] >>> cands_list = [problems[pid] for pid in cand_pids] >>> model = LanguageTransformer(model_name="bert-base-uncased", add_linear=True, embedding_size=256) >>> scores = model(ctxt_list, cands_list) >>> assert scores.shape == (1, 3)
Mixer¶
- class ding.model.Mixer(agent_num: int, state_dim: int, mixing_embed_dim: int, hypernet_embed: int = 64, activation: Module = ReLU())[source]¶
- Overview:
Mixer network in QMIX, which mixes up the independent q_value of each agent into a total q_value. The weights (but not the biases) of the Mixer network are restricted to be non-negative and produced by separate hypernetworks. Each hypernetwork takes the global state s as input and generates the weights of one layer of the Mixer network.
- Interface:
__init__
,forward
.
- __init__(agent_num: int, state_dim: int, mixing_embed_dim: int, hypernet_embed: int = 64, activation: Module = ReLU())[source]¶
- Overview:
Initialize mixer network proposed in QMIX according to arguments. Each hypernetwork consists of linear layers, followed by an absolute activation function, to ensure that the Mixer network weights are non-negative.
- Arguments:
agent_num (
int
): The number of agent, such as 8.state_dim(
int
): The dimension of global observation state, such as 16.mixing_embed_dim (
int
): The dimension of mixing state embedding, such as 128.hypernet_embed (
int
): The dimension of hypernet embedding, default to 64.activation (
nn.Module
): Activation function in network, defaults to nn.ReLU().
- forward(agent_qs, states)[source]¶
- Overview:
Forward computation graph of pymarl mixer network. Mix up the input independent q_value of each agent to a total q_value with weights generated by hypernetwork according to global
states
.- Arguments:
agent_qs (
torch.FloatTensor
): The independent q_value of each agent.states (
torch.FloatTensor
): The embedding vector of global state.
- Returns:
q_tot (
torch.FloatTensor
): The total mixed q_value.
- Shapes:
agent_qs (
torch.FloatTensor
): \((B, N)\), where B is batch size and N is agent_num.states (
torch.FloatTensor
): \((B, M)\), where M is embedding_size.q_tot (
torch.FloatTensor
): \((B, )\).
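- Example (illustrative sketch):
A minimal usage sketch, assuming the constructor and forward signatures documented above; the concrete sizes are arbitrary.
>>> import torch
>>> B, agent_num, state_dim, embed_dim = 4, 8, 16, 32
>>> mixer = Mixer(agent_num=agent_num, state_dim=state_dim, mixing_embed_dim=embed_dim)
>>> agent_qs = torch.randn(B, agent_num)    # independent q_value of each agent
>>> states = torch.randn(B, state_dim)      # global state fed to the hypernetworks
>>> q_tot = mixer(agent_qs, states)         # total mixed q_value, (B, ) per the Shapes above
>>> assert q_tot.shape[0] == B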
QMix¶
- class ding.model.QMix(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, mixer: bool = True, lstm_type: str = 'gru', activation: Module = ReLU(), dueling: bool = False)[source]¶
- Overview:
The neural network and computation graph of algorithms related to QMIX(https://arxiv.org/abs/1803.11485). The QMIX is composed of two parts: agent Q network and mixer(optional). The QMIX paper mentions that all agents share local Q network parameters, so only one Q network is initialized here. Then use summation or Mixer network to process the local Q according to the
mixer
settings to obtain the global Q.- Interface:
__init__
,forward
.
- __init__(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, mixer: bool = True, lstm_type: str = 'gru', activation: Module = ReLU(), dueling: bool = False) None [source]¶
- Overview:
Initialize QMIX neural network according to arguments, i.e. agent Q network and mixer.
- Arguments:
agent_num (
int
): The number of agent, such as 8.obs_shape (
int
): The dimension of each agent’s observation state, such as 8 or [4, 84, 84].global_obs_shape (
int
): The dimension of global observation state, such as 8 or [4, 84, 84].action_shape (
int
): The dimension of action shape, such as 6 or [2, 3, 3].hidden_size_list (
list
): The list of hidden size forq_network
, the last element must match mixer’smixing_embed_dim
.mixer (
bool
): Use mixer net or not, default to True. If it is false, the final local Q is added to obtain the global Q.lstm_type (
str
): The type of RNN module inq_network
, now support [‘normal’, ‘pytorch’, ‘gru’], default to gru.activation (
nn.Module
): The type of activation function to use inMLP
the afterlayer_fn
, ifNone
then default set tonn.ReLU()
.dueling (
bool
): Whether chooseDuelingHead
(True) orDiscreteHead (False)
, default to False.
- forward(data: dict, single_step: bool = True) dict [source]¶
- Overview:
QMIX forward computation graph, input dict including time series observation and related data to predict total q_value and each agent q_value.
- Arguments:
- data (
dict
): Input data dict with keys [‘obs’, ‘prev_state’, ‘action’]. agent_state (
torch.Tensor
): Time series local observation data of each agents.global_state (
torch.Tensor
): Time series global observation data.prev_state (
list
): Previous rnn state forq_network
.action (
torch.Tensor
or None): The actions of each agent given outside the function. If action is None, use argmax q_value index as action to calculateagent_q_act
.
- data (
single_step (
bool
): Whether single_step forward, if so, add timestep dim before forward and remove it after forward.
- Returns:
ret (
dict
): Output data dict with keys [total_q
,logit
,next_state
].
- ReturnsKeys:
total_q (
torch.Tensor
): Total q_value, which is the result of mixer network.agent_q (
torch.Tensor
): Each agent q_value.next_state (
list
): Next rnn state forq_network
.
- Shapes:
agent_state (
torch.Tensor
): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, N is obs_shape.global_state (
torch.Tensor
): \((T, B, M)\), where M is global_obs_shape.prev_state (
list
): \((B, A)\), a list of length B, and each element is a list of length A.action (
torch.Tensor
): \((T, B, A)\).total_q (
torch.Tensor
): \((T, B)\).agent_q (
torch.Tensor
): \((T, B, A, P)\), where P is action_shape.next_state (
list
): \((B, A)\), a list of length B, and each element is a list of length A.
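- Example (illustrative sketch):
A minimal usage sketch following the shape conventions above; the concrete sizes, the hidden_size_list and the all-None initial prev_state are assumptions for illustration.
>>> import torch
>>> T, B, A, N, M, P = 2, 3, 4, 8, 16, 6
>>> model = QMix(agent_num=A, obs_shape=N, global_obs_shape=M, action_shape=P,
...              hidden_size_list=[64, 32], mixer=True)
>>> data = {
...     'obs': {
...         'agent_state': torch.randn(T, B, A, N),
...         'global_state': torch.randn(T, B, M),
...     },
...     'prev_state': [[None for _ in range(A)] for _ in range(B)],
...     'action': torch.randint(0, P, size=(T, B, A)),
... }
>>> output = model(data, single_step=False)
>>> assert output['total_q'].shape == (T, B)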
COMA¶
- class ding.model.COMA(agent_num: int, obs_shape: Dict, action_shape: int | SequenceType, actor_hidden_size_list: SequenceType)[source]¶
- Overview:
The network of the COMA algorithm, which is a QAC-type actor-critic.
- Interface:
__init__
,forward
- Properties:
mode (
list
): The list of forward mode, includingcompute_actor
andcompute_critic
- __init__(agent_num: int, obs_shape: Dict, action_shape: int | SequenceType, actor_hidden_size_list: SequenceType) None [source]¶
- Overview:
initialize COMA network
- Arguments:
agent_num (
int
): the number of agentobs_shape (
Dict
): the observation information, including agent_state and global_stateaction_shape (
Union[int, SequenceType]
): the dimension of action shapeactor_hidden_size_list (
SequenceType
): the list of hidden size
- forward(inputs: Dict, mode: str) Dict [source]¶
- Overview:
forward computation graph of COMA network
- Arguments:
inputs (
dict
): input data dict with keys [‘obs’, ‘prev_state’, ‘action’]agent_state (
torch.Tensor
): each agent local state(obs)global_state (
torch.Tensor
): global state(obs)action (
torch.Tensor
): the masked action
- ArgumentsKeys:
necessary:
obs
{agent_state
,global_state
,action_mask
},action
,prev_state
- ReturnsKeys:
- necessary:
compute_critic:
q_value
compute_actor:
logit
,next_state
,action_mask
- Shapes:
obs (
dict
):agent_state
: \((T, B, A, N, D)\),action_mask
: \((T, B, A, N, A)\)prev_state (
list
): \([[[h, c] for _ in range(A)] for _ in range(B)]\)logit (
torch.Tensor
): \((T, B, A, N, A)\)next_state (
list
): \([[[h, c] for _ in range(A)] for _ in range(B)]\)action_mask (
torch.Tensor
): \((T, B, A, N, A)\)q_value (
torch.Tensor
): \((T, B, A, N, A)\)
- Examples:
>>> agent_num, bs, T = 4, 3, 8 >>> obs_dim, global_obs_dim, action_dim = 32, 32 * 4, 9 >>> coma_model = COMA( >>> agent_num=agent_num, >>> obs_shape=dict(agent_state=(obs_dim, ), global_state=(global_obs_dim, )), >>> action_shape=action_dim, >>> actor_hidden_size_list=[128, 64], >>> ) >>> prev_state = [[None for _ in range(agent_num)] for _ in range(bs)] >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(T, bs, agent_num, obs_dim), >>> 'action_mask': None, >>> }, >>> 'prev_state': prev_state, >>> } >>> output = coma_model(data, mode='compute_actor') >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(T, bs, agent_num, obs_dim), >>> 'global_state': torch.randn(T, bs, global_obs_dim), >>> }, >>> 'action': torch.randint(0, action_dim, size=(T, bs, agent_num)), >>> } >>> output = coma_model(data, mode='compute_critic')
QTran¶
- class ding.model.QTran(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, embedding_size: int, lstm_type: str = 'gru', dueling: bool = False)[source]¶
- Overview:
QTRAN network
- Interface:
__init__, forward
- __init__(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, embedding_size: int, lstm_type: str = 'gru', dueling: bool = False) None [source]¶
- Overview:
initialize QTRAN network
- Arguments:
agent_num (
int
): the number of agentobs_shape (
int
): the dimension of each agent’s observation stateglobal_obs_shape (
int
): the dimension of global observation stateaction_shape (
int
): the dimension of action shapehidden_size_list (
list
): the list of hidden sizeembedding_size (
int
): the dimension of embeddinglstm_type (
str
): use lstm or gru, default to grudueling (
bool
): use dueling head or not, default to False.
- forward(data: dict, single_step: bool = True) dict [source]¶
- Overview:
forward computation graph of qtran network
- Arguments:
- data (
dict
): input data dict with keys [‘obs’, ‘prev_state’, ‘action’] agent_state (
torch.Tensor
): each agent local state(obs)global_state (
torch.Tensor
): global state(obs)prev_state (
list
): previous rnn stateaction (
torch.Tensor
or None): if action is None, use argmax q_value index as action to calculateagent_q_act
- data (
single_step (
bool
): whether single_step forward, if so, add timestep dim before forward and remove it after forward
- Return:
- ret (
dict
): output data dict with keys [‘total_q’, ‘logit’, ‘next_state’] total_q (
torch.Tensor
): total q_value, which is the result of mixer networkagent_q (
torch.Tensor
): each agent q_valuenext_state (
list
): next rnn state
- ret (
- Shapes:
agent_state (
torch.Tensor
): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, N is obs_shape.global_state (
torch.Tensor
): \((T, B, M)\), where M is global_obs_shape.prev_state (
list
): \((B, A)\), a list of length B, and each element is a list of length A.action (
torch.Tensor
): \((T, B, A)\)total_q (
torch.Tensor
): \((T, B)\)agent_q (
torch.Tensor
): \((T, B, A, P)\), where P is action_shapenext_state (
list
): \((B, A)\), a list of length B, and each element is a list of length A.
WQMix¶
- class ding.model.WQMix(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, lstm_type: str = 'gru', dueling: bool = False)[source]¶
- Overview:
WQMIX (https://arxiv.org/abs/2006.10800) network. There are two components: 1) Q_tot, which is the same as the QMIX network and is composed of an agent Q network and a mixer network; 2) an unrestricted joint-action Q_star, which is composed of an agent Q network and a mixer_star network. The QMIX paper mentions that all agents share local Q network parameters, so only one Q network is initialized in Q_tot or Q_star.
- Interface:
__init__
,forward
.
- __init__(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, lstm_type: str = 'gru', dueling: bool = False) None [source]¶
- Overview:
Initialize WQMIX neural network according to arguments, i.e. agent Q network and mixer, Q_star network and mixer_star.
- Arguments:
agent_num (
int
): The number of agent, such as 8.obs_shape (
int
): The dimension of each agent’s observation state, such as 8.global_obs_shape (
int
): The dimension of global observation state, such as 8.action_shape (
int
): The dimension of action shape, such as 6.hidden_size_list (
list
): The list of hidden size forq_network
, the last element must match mixer’smixing_embed_dim
.lstm_type (
str
): The type of RNN module inq_network
, now support [‘normal’, ‘pytorch’, ‘gru’], default to gru.dueling (
bool
): Whether chooseDuelingHead
(True) orDiscreteHead (False)
, default to False.
- forward(data: dict, single_step: bool = True, q_star: bool = False) dict [source]¶
- Overview:
Forward computation graph of WQMIX network. Input dict including time series observation and related data to predict total q_value and each agent q_value. Determine whether to calculate Q_tot or Q_star based on the
q_star
parameter.- Arguments:
- data (
dict
): Input data dict with keys [‘obs’, ‘prev_state’, ‘action’]. agent_state (
torch.Tensor
): Time series local observation data of each agents.global_state (
torch.Tensor
): Time series global observation data.prev_state (
list
): Previous rnn state forq_network
or_q_network_star
.action (
torch.Tensor
or None): If action is None, use argmax q_value index as action to calculateagent_q_act
.
- data (
single_step (
bool
): Whether single_step forward, if so, add timestep dim before forward and remove it after forward.Q_star (
bool
): Whether Q_star network forward. If True, using the Q_star network, where the agent networks have the same architecture as Q network but do not share parameters and the mixing network is a feedforward network with 3 hidden layers of 256 dim; if False, using the Q network, same as the Q network in Qmix paper.
- Returns:
ret (
dict
): Output data dict with keys [total_q
,logit
,next_state
].total_q (
torch.Tensor
): Total q_value, which is the result of mixer network.agent_q (
torch.Tensor
): Each agent q_value.next_state (
list
): Next rnn state.
- Shapes:
agent_state (
torch.Tensor
): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, N is obs_shape.global_state (
torch.Tensor
): \((T, B, M)\), where M is global_obs_shape.prev_state (
list
): \((T, B, A)\), a list of length B, and each element is a list of length A.action (
torch.Tensor
): \((T, B, A)\).total_q (
torch.Tensor
): \((T, B)\).agent_q (
torch.Tensor
): \((T, B, A, P)\), where P is action_shape.next_state (
list
): \((T, B, A)\), a list of length B, and each element is a list of length A.
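- Example (illustrative sketch):
A minimal usage sketch mirroring the QMix example above (the sizes, hidden_size_list and all-None initial prev_state are assumptions for illustration); the q_star flag switches between the restricted Q_tot and the unrestricted Q_star estimate.
>>> import torch
>>> T, B, A, N, M, P = 2, 3, 4, 8, 16, 6
>>> model = WQMix(agent_num=A, obs_shape=N, global_obs_shape=M, action_shape=P,
...               hidden_size_list=[64, 32])
>>> data = {
...     'obs': {
...         'agent_state': torch.randn(T, B, A, N),
...         'global_state': torch.randn(T, B, M),
...     },
...     'prev_state': [[None for _ in range(A)] for _ in range(B)],
...     'action': torch.randint(0, P, size=(T, B, A)),
... }
>>> q_tot = model(data, single_step=False, q_star=False)['total_q']    # restricted Q_tot
>>> q_star = model(data, single_step=False, q_star=True)['total_q']    # unrestricted Q_star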
PPG¶
- class ding.model.PPG(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, impala_cnn_encoder: bool = False)[source]¶
- Overview:
Phasic Policy Gradient (PPG) model from the paper "Phasic Policy Gradient" (https://arxiv.org/abs/2009.04416). This module contains a VAC module and an auxiliary critic module.
- Interfaces:
forward
,compute_actor
,compute_critic
,compute_actor_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, impala_cnn_encoder: bool = False) None [source]¶
- Overview:
Initialize the PPG Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType]
): Action’s shape, such as 4, (3, ).action_space (
str
): The action space type, such as ‘discrete’, ‘continuous’.share_encoder (
bool
): Whether to share encoder.encoder_hidden_size_list (
SequenceType
): The hidden size list of encoder.actor_head_hidden_size (
int
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor head.critic_head_hidden_size (
int
): Thehidden_size
to pass to critic head.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic head.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.impala_cnn_encoder (
bool
): Whether to use impala cnn encoder.
- compute_actor(x: Tensor) Dict [source]¶
- Overview:
Use actor to compute action logits.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
output (
Dict
): The output data containing action logits.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict withaction_type
andaction_args
.
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N is the input feature size.output (
Dict
):logit
: \((B, A)\), where B is batch size and A is the action space size.
- compute_actor_critic(x: Tensor) Dict [source]¶
- Overview:
Use actor and critic to compute action logits and value.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of PPG’s forward computation graph for both actor and critic, includinglogit
andvalue
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict withaction_type
andaction_args
.value (
torch.Tensor
): The predicted state value tensor.
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N is the input feature size.output (
Dict
):value
: \((B, 1)\), where B is batch size.output (
Dict
):logit
: \((B, A)\), where B is batch size and A is the action space size.
Note
compute_actor_critic
interface aims to save computation when shares encoder.
- compute_critic(x: Tensor) Dict [source]¶
- Overview:
Use critic to compute value.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
output (
Dict
): The output dict of VAC’s forward computation graph for critic, includingvalue
.
- ReturnsKeys:
necessary:
value
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N is the input feature size.output (
Dict
):value
: \((B, 1)\), where B is batch size.
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Compute action logits or value according to mode being
compute_actor
,compute_critic
orcompute_actor_critic
.- Arguments:
x (
torch.Tensor
): The input observation tensor data.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
outputs (
Dict
): The output dict of PPG’s forward computation graph, whose key-values vary with differentmode
.
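- Example (illustrative sketch):
A minimal usage sketch for the three forward modes; the concrete obs_shape, action_shape and batch size are assumptions for illustration.
>>> import torch
>>> model = PPG(obs_shape=32, action_shape=6, action_space='discrete')
>>> x = torch.randn(4, 32)
>>> logit = model(x, mode='compute_actor')['logit']        # action logits
>>> value = model(x, mode='compute_critic')['value']       # state value from the critic head
>>> both = model(x, mode='compute_actor_critic')           # shared-encoder forward, returns both
>>> assert logit.shape == torch.Size([4, 6])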
ProcedureCloningBFS¶
- class ding.model.ProcedureCloningBFS(obs_shape: SequenceType, action_shape: int, encoder_hidden_size_list: SequenceType = [128, 128, 256, 256])[source]¶
- Overview:
The neural network introduced in procedure cloning (PC) to process 3-dim observations. Given an input, this model will perform several 3x3 convolutions and output a feature map with the same height and width as the input. The channel number of the output will be the
action_shape
.- Interfaces:
__init__
,forward
.
- __init__(obs_shape: SequenceType, action_shape: int, encoder_hidden_size_list: SequenceType = [128, 128, 256, 256])[source]¶
- Overview:
Init the
BFSConvolution Encoder
according to the provided arguments.- Arguments:
obs_shape (
SequenceType
): Sequence ofin_channel
, plus one or moreinput size
, such as [4, 84, 84].action_dim (
int
): Action space shape, such as 6.encoder_hidden_size_list (
SequenceType
): The cnn channel dims for each block, such as [128, 128, 256, 256].
- forward(x: Tensor) Dict [source]¶
- Overview:
The computation graph. Given a 3-dim observation, this function will return a tensor with the same height and width. The channel number of output will be the
action_shape
.- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of model’s forward computation graph, only contains a single keylogit
.
- Examples:
>>> model = ProcedureCloningBFS([3, 16, 16], 4) >>> inputs = torch.randn(16, 16, 3).unsqueeze(0) >>> outputs = model(inputs) >>> assert outputs['logit'].shape == torch.Size([16, 16, 4])
ProcedureCloningMCTS¶
- class ding.model.ProcedureCloningMCTS(obs_shape: SequenceType, action_dim: int, cnn_hidden_list: SequenceType = [128, 128, 256, 256, 256], cnn_activation: Module = ReLU(), cnn_kernel_size: SequenceType = [3, 3, 3, 3, 3], cnn_stride: SequenceType = [1, 1, 1, 1, 1], cnn_padding: SequenceType = [1, 1, 1, 1, 1], mlp_hidden_list: SequenceType = [256, 256], mlp_activation: Module = ReLU(), att_heads: int = 8, att_hidden: int = 128, n_att: int = 4, n_feedforward: int = 2, feedforward_hidden: int = 256, drop_p: float = 0.5, max_T: int = 17)[source]¶
- Overview:
The neural network of algorithms related to Procedure cloning (PC).
- Interfaces:
__init__
,forward
.
- __init__(obs_shape: SequenceType, action_dim: int, cnn_hidden_list: SequenceType = [128, 128, 256, 256, 256], cnn_activation: Module = ReLU(), cnn_kernel_size: SequenceType = [3, 3, 3, 3, 3], cnn_stride: SequenceType = [1, 1, 1, 1, 1], cnn_padding: SequenceType = [1, 1, 1, 1, 1], mlp_hidden_list: SequenceType = [256, 256], mlp_activation: Module = ReLU(), att_heads: int = 8, att_hidden: int = 128, n_att: int = 4, n_feedforward: int = 2, feedforward_hidden: int = 256, drop_p: float = 0.5, max_T: int = 17) None [source]¶
- Overview:
Initialize the MCTS procedure cloning model according to corresponding input arguments.
- Arguments:
obs_shape (
SequenceType
): Observation space shape, such as [4, 84, 84].action_dim (
int
): Action space shape, such as 6.cnn_hidden_list (
SequenceType
): The cnn channel dims for each block, such as [128, 128, 256, 256, 256].cnn_activation (
nn.Module
): The activation function for cnn blocks, such asnn.ReLU()
.cnn_kernel_size (
SequenceType
): The kernel size for each cnn block, such as [3, 3, 3, 3, 3].cnn_stride (
SequenceType
): The stride for each cnn block, such as [1, 1, 1, 1, 1].cnn_padding (
SequenceType
): The padding for each cnn block, such as [1, 1, 1, 1, 1].mlp_hidden_list (
SequenceType
): The last dim for this must match the last dim ofcnn_hidden_list
, such as [256, 256].mlp_activation (
nn.Module
): The activation function for mlp layers, such asnn.ReLU()
.att_heads (
int
): The number of attention heads in transformer, such as 8.att_hidden (
int
): The number of attention dimension in transformer, such as 128.n_att (
int
): The number of attention blocks in transformer, such as 4.n_feedforward (
int
): The number of feedforward layers in transformer, such as 2.drop_p (
float
): The drop out rate of attention, such as 0.5.max_T (
int
): The sequence length of procedure cloning, such as 17.
- forward(states: Tensor, goals: Tensor, actions: Tensor) Tuple[Tensor, Tensor] [source]¶
- Overview:
ProcedureCloningMCTS forward computation graph, input states tensor and goals tensor, calculate the predicted states and actions.
- Arguments:
states (
torch.Tensor
): The observation of current time.goals (
torch.Tensor
): The target observation after a period.actions (
torch.Tensor
): The actions executed during the period.
- Returns:
outputs (
Tuple[torch.Tensor, torch.Tensor]
): Predicted states and actions.
- Examples:
>>> inputs = { 'states': torch.randn(2, 3, 64, 64), 'goals': torch.randn(2, 3, 64, 64), 'actions': torch.randn(2, 15, 9) } >>> model = ProcedureCloningMCTS(obs_shape=(3, 64, 64), action_dim=9) >>> goal_preds, action_preds = model(inputs['states'], inputs['goals'], inputs['actions']) >>> assert goal_preds.shape == (2, 256) >>> assert action_preds.shape == (2, 16, 9)
ACER¶
- class ding.model.ACER(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The model of the ACER (Actor-Critic with Experience Replay) algorithm, proposed in "Sample Efficient Actor-Critic with Experience Replay" (https://arxiv.org/abs/1611.01224).
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Init the ACER Model according to arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s space.action_shape (
Union[int, SequenceType]
): Action’s space.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor-nn’sHead
.- actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor’s nn.
- actor_head_layer_num (
critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic-nn’sHead
.- critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.
- critic_head_layer_num (
- activation (
Optional[nn.Module]
): The type of activation function to use in
MLP
after layer_fn
, ifNone
then default set tonn.ReLU()
- activation (
- norm_type (
Optional[str]
): The type of normalization to use, see
ding.torch_utils.fc_block
for more details.
- norm_type (
- compute_actor(inputs: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to predict the actor output in
compute_actor
mode.- Arguments:
- inputs (
torch.Tensor
): The encoded embedding tensor, determined with given
hidden_size
, i.e.(B, N=hidden_size)
.hidden_size = actor_head_hidden_size
- inputs (
mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Outputs of forward pass encoder and head.
- ReturnsKeys (either):
logit (
torch.FloatTensor
): \((B, N1)\), where B is batch size and N1 isaction_shape
- Shapes:
inputs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds tohidden_size
logit (
torch.FloatTensor
): \((B, N1)\), where B is batch size and N1 isaction_shape
- Examples:
>>> # Regression mode >>> model = ACER(64, 64) >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
- compute_critic(inputs: Tensor) Dict [source]¶
- Overview:
Use encoded obs and action tensors to predict Q-value output in
compute_critic
mode.- Arguments:
obs
,action
encoded tensors.mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Q-value output.
- ReturnKeys:
q_value (
torch.Tensor
): Q value tensor with same size as batch size.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
q_value (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
.
- Examples:
>>> N = 32 >>> inputs = torch.randn(4, N) >>> model = ACER(obs_shape=(N, ), action_shape=5) >>> model(inputs, mode='compute_critic')['q_value']
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Use the observation to predict the output, routing through the actor or critic computation graph according to the given mode.
- Arguments:
mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Outputs of network forward.
- Shapes (Actor):
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
logit (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
- Shapes (Critic):
inputs (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toobs_shape
q_value (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
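The forward method has no example of its own in this docstring; the following sketch simply mirrors the compute_actor and compute_critic examples above and routes them through forward with an explicit mode string (shapes follow the Shapes blocks above).
- Examples:
>>> model = ACER(obs_shape=64, action_shape=64)
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs, mode='compute_actor')    # {'logit': tensor of shape (4, 64)}
>>> critic_outputs = model(obs, mode='compute_critic')  # {'q_value': tensor of shape (4, 64)}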
NGU¶
- class ding.model.NGU(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], collector_env_num: int | None = 1, dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, lstm_type: str | None = 'normal', activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The recurrent Q model for the NGU (https://arxiv.org/pdf/2002.06038.pdf) policy, modified from the class DRQN in q_learning.py. The implementation described in the original paper is to ‘adapt the R2D2 agent that uses the dueling network architecture with an LSTM layer after a convolutional neural network’. The NGU network includes an encoder, an LSTM core (RNN) and a head.
- Interface:
__init__
,forward
.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], collector_env_num: int | None = 1, dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, lstm_type: str | None = 'normal', activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Init the DRQN Model for NGU according to arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s space, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action’s space, such as 6 or [2, 3, 3].encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
.collector_env_num (
Optional[int]
): The number of environments used to collect data simultaneously.dueling (
bool
): Whether to choose DuelingHead
(True) or DiscreteHead (False)
, defaults to True.head_hidden_size (
Optional[int]
): Thehidden_size
to pass toHead
, should match the last element ofencoder_hidden_size_list
.head_layer_num (
int
): The number of layers in head network.lstm_type (
Optional[str]
): Version of rnn cell, now support [‘normal’, ‘pytorch’, ‘hpc’, ‘gru’], default is ‘normal’.- activation (
Optional[nn.Module]
): The type of activation function to use in the MLP after each layer_fn; if None, it defaults to nn.ReLU().
- norm_type (
Optional[str]
): The type of normalization to use, see
ding.torch_utils.fc_block
for more details.
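The constructor has no usage example here; the following is a minimal construction sketch based only on the signature above (the concrete shapes are illustrative, not values taken from the source).
- Examples:
>>> model = NGU(obs_shape=[4, 84, 84], action_shape=6, encoder_hidden_size_list=[128, 128, 64], collector_env_num=8)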
- forward(inputs: Dict, inference: bool = False, saved_state_timesteps: list | None = None) Dict [source]¶
- Overview:
Forward computation graph of the NGU R2D2 network. It takes the observation, the previous action and the previous extrinsic reward as input to predict the NGU Q output.
- Arguments:
- inputs (
Dict
): obs (
torch.Tensor
): Encoded observation.prev_state (
list
): Previous state’s tensor of size(B, N)
.
inference (bool): If inference is True, we unroll only one timestep transition; if inference is False, we unroll the whole sequence of transitions.
saved_state_timesteps (Optional[list]): When inference is False and we unroll the sequence of transitions, we save the RNN hidden states at the timesteps listed in saved_state_timesteps.
- Returns:
- outputs (
Dict
): Run
MLP
withDRQN
setups and return the result prediction dictionary.
- ReturnsKeys:
logit (
torch.Tensor
): Logit tensor with same size as inputobs
.next_state (
list
): Next state’s tensor of size(B, N)
.
- Shapes:
obs (
torch.Tensor
): \((B, N=obs_space)\), where B is batch size.prev_state(
torch.FloatTensor list
): \([(B, N)]\).logit (
torch.FloatTensor
): \((B, N)\).next_state(
torch.FloatTensor list
): \([(B, N)]\).
BCQ¶
- class ding.model.BCQ(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, actor_head_hidden_size: List = [400, 300], critic_head_hidden_size: List = [400, 300], activation: Module | None = ReLU(), vae_hidden_dims: List = [750, 750], phi: float = 0.05)[source]¶
- Overview:
Model of BCQ (Batch-Constrained deep Q-learning). Off-Policy Deep Reinforcement Learning without Exploration. https://arxiv.org/abs/1812.02900
- Interface:
forward
,compute_actor
,compute_critic
,compute_vae
,compute_eval
- Property:
mode
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, actor_head_hidden_size: List = [400, 300], critic_head_hidden_size: List = [400, 300], activation: Module | None = ReLU(), vae_hidden_dims: List = [750, 750], phi: float = 0.05) None [source]¶
- Overview:
Initialize neural network, i.e. agent Q network and actor.
- Arguments:
obs_shape (
int
): the dimension of observation stateaction_shape (
int
): the dimension of action shapeactor_hidden_size (
list
): the list of hidden size of actorcritic_hidden_size (:obj:’list’): the list of hidden size of critic
activation (
nn.Module
): Activation function in network, defaults to nn.ReLU().vae_hidden_dims (
list
): the list of hidden sizes of the VAE.
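A minimal construction sketch based on the signature above, consistent with the BCQ(32, 6) shorthand used in the method examples below; the explicit keyword values simply restate the documented defaults.
- Examples:
>>> model = BCQ(obs_shape=32, action_shape=6, actor_head_hidden_size=[400, 300], critic_head_hidden_size=[400, 300], vae_hidden_dims=[750, 750], phi=0.05)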
- compute_actor(inputs: Dict[str, Tensor]) Dict[str, Tensor | Dict[str, Tensor]] [source]¶
- Overview:
Use actor network to compute action.
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsaction
(torch.Tensor
).
- Shapes:
inputs (
Dict
): \((B, N, D)\), where B is batch size, N is sample number, D is input dimension.outputs (
Dict
): \((B, N)\).
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model.compute_actor(inputs)
- compute_critic(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
Use critic network to compute q value.
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsq_value
(torch.Tensor
).
- Shapes:
inputs (
Dict
): \((B, N, D)\), where B is batch size, N is sample number, D is input dimension.outputs (
Dict
): \((B, N)\).
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model.compute_critic(inputs)
- compute_eval(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
Use actor network to compute action.
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsaction
(torch.Tensor
).
- Shapes:
inputs (
Dict
): \((B, N, D)\), where B is batch size, N is sample number, D is input dimension.outputs (
Dict
): \((B, N)\).
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model.compute_eval(inputs)
- compute_vae(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
Use the VAE network to reconstruct the action.
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsrecons_action
(torch.Tensor
),prediction_residual
(torch.Tensor
),input
(torch.Tensor
),mu
(torch.Tensor
),log_var
(torch.Tensor
) andz
(torch.Tensor
).
- Shapes:
inputs (
Dict
): \((B, N, D)\), where B is batch size, N is sample number, D is input dimension.outputs (
Dict
): \((B, N)\).
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model.compute_vae(inputs)
- forward(inputs: Dict[str, Tensor], mode: str) Dict[str, Tensor] [source]¶
- Overview:
The unique execution (forward) method of the BCQ model. One can indicate different modes to implement different computation graphs, including
compute_actor
, compute_critic
, compute_vae and compute_eval in BCQ.- Mode compute_actor:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
output (
Dict
): Output dict data, including action tensor.
- Mode compute_critic:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
output (
Dict
): Output dict data, including q_value tensor.
- Mode compute_vae:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsrecons_action
(torch.Tensor
),prediction_residual
(torch.Tensor
),input
(torch.Tensor
),mu
(torch.Tensor
),log_var
(torch.Tensor
) andz
(torch.Tensor
).
- Mode compute_eval:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
output (
Dict
): Output dict data, including action tensor.
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model(inputs, mode='compute_actor') >>> outputs = model(inputs, mode='compute_critic') >>> outputs = model(inputs, mode='compute_vae') >>> outputs = model(inputs, mode='compute_eval')
Note
For specific examples, one can refer to API doc of
compute_actor
andcompute_critic
respectively.
EDAC¶
- class ding.model.EDAC(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, ensemble_num: int = 2, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, **kwargs)[source]¶
- Overview:
The Q-value Actor-Critic network with the ensemble mechanism, which is used in EDAC.
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, ensemble_num: int = 2, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, **kwargs) None [source]¶
- Overview:
Initialize the EDAC Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType, EasyDict]
): Action’s shape, such as 4, (3, ), EasyDict({‘action_type_shape’: 3, ‘action_args_shape’: 4}).ensemble_num (
int
): Q-net number.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor head.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic head.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic head.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.
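A minimal construction sketch based on the signature above, consistent with the EDAC(obs_shape=(8, ), action_shape=1) shorthand used in the compute_critic example below.
- Examples:
>>> model = EDAC(obs_shape=8, action_shape=1, ensemble_num=2)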
- compute_actor(obs: Tensor) Dict[str, Tensor | Dict[str, Tensor]] [source]¶
- Overview:
The forward computation graph of compute_actor mode, which uses the observation tensor to produce actor outputs such as
action
,logit
and so on.- Arguments:
obs (
torch.Tensor
): Observation tensor data, now supports a batch of 1-dim vector data, i.e.(B, obs_shape)
.
- Returns:
outputs (
Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]
): Actor output varying from action_space:reparameterization
.
- ReturnsKeys (either):
- logit (
Dict[str, torch.Tensor]
): Reparameterization logit, usually in SAC. mu (
torch.Tensor
): Mean of the parameterized Gaussian distribution.sigma (
torch.Tensor
): Standard deviation of the parameterized Gaussian distribution.
- Shapes:
obs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds toobs_shape
.action (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toaction_shape
.logit.mu (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toaction_shape
.logit.sigma (
torch.Tensor
): \((B, N1)\), B is batch size.logit (
torch.Tensor
): \((B, N2)\), B is batch size and N2 corresponds toaction_shape.action_type_shape
.action_args (
torch.Tensor
): \((B, N3)\), B is batch size and N3 corresponds toaction_shape.action_args_shape
.
- Examples:
>>> model = EDAC(64, 64,) >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'][0].shape == torch.Size([4, 64]) # mu >>> actor_outputs['logit'][1].shape == torch.Size([4, 64]) # sigma
- compute_critic(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
The forward computation graph of compute_critic mode, uses observation and action tensor to produce critic output, such as
q_value
.- Arguments:
inputs (
Dict[str, torch.Tensor]
): Dict strcture of input data, includingobs
andaction
tensor
- Returns:
outputs (
Dict[str, torch.Tensor]
): Critic output, such asq_value
.
- ArgumentsKeys:
obs: (
torch.Tensor
): Observation tensor data, now supports a batch of 1-dim vector data.action (
Union[torch.Tensor, Dict]
): Continuous action with same size asaction_shape
.
- ReturnKeys:
q_value (
torch.Tensor
): Q value tensor with same size as batch size.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\) or \((Ensemble_num, B, N1)\), where B is batch size and N1 is obs_shape
.action (
torch.Tensor
): \((B, N2)\) or \((Ensemble_num, B, N2)\), where B is batch size and N2 is action_shape
.q_value (
torch.Tensor
): \((Ensemble_num, B)\), where B is batch size.
- Examples:
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)} >>> model = EDAC(obs_shape=(8, ),action_shape=1) >>> model(inputs, mode='compute_critic')['q_value'] # q value ... tensor([0.0773, 0.1639, 0.0917, 0.0370], grad_fn=<SqueezeBackward1>)
- forward(inputs: Tensor | Dict[str, Tensor], mode: str) Dict[str, Tensor] [source]¶
- Overview:
The unique execution (forward) method of the EDAC model. One can indicate different modes to implement different computation graphs, including
compute_actor
andcompute_critic
in EDAC.- Mode compute_actor:
- Arguments:
inputs (
torch.Tensor
): Observation data, defaults to tensor.
- Returns:
output (
Dict
): Output dict data, including different key-values among distinct action_space.
- Mode compute_critic:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
output (
Dict
): Output dict data, including q_value tensor.
Note
For specific examples, one can refer to API doc of
compute_actor
andcompute_critic
respectively.
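In addition to the method-level examples referenced in the note above, the following sketch routes the same data through forward with an explicit mode string.
- Examples:
>>> model = EDAC(obs_shape=8, action_shape=1, ensemble_num=2)
>>> obs = torch.randn(4, 8)
>>> actor_outputs = model(obs, mode='compute_actor')              # logit is the (mu, sigma) pair
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)}
>>> q_value = model(inputs, mode='compute_critic')['q_value']     # shape (ensemble_num, 4)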
EBM¶
- class ding.model.EBM(obs_shape: int, action_shape: int, hidden_size: int = 512, hidden_layer_num: int = 4, **kwargs)[source]¶
- Overview:
Energy based model.
- Interface:
__init__
,forward
- __init__(obs_shape: int, action_shape: int, hidden_size: int = 512, hidden_layer_num: int = 4, **kwargs)[source]¶
- Overview:
Initialize the EBM.
- Arguments:
obs_shape (
int
): Observation shape.action_shape (
int
): Action shape.hidden_size (
int
): Hidden size.hidden_layer_num (
int
): Number of hidden layers.
- forward(obs, action)[source]¶
- Overview:
Forward computation graph of EBM.
- Arguments:
obs (
torch.Tensor
): Observation of shape (B, N, O).action (
torch.Tensor
): Action of shape (B, N, A).
- Returns:
pred (
torch.Tensor
): Energy of shape (B, N).
- Examples:
>>> obs = torch.randn(2, 3, 4) >>> action = torch.randn(2, 3, 5) >>> ebm = EBM(4, 5) >>> pred = ebm(obs, action)
AutoregressiveEBM¶
- class ding.model.AutoregressiveEBM(obs_shape: int, action_shape: int, hidden_size: int = 512, hidden_layer_num: int = 4)[source]¶
- Overview:
Autoregressive energy based model.
- Interface:
__init__
,forward
- __init__(obs_shape: int, action_shape: int, hidden_size: int = 512, hidden_layer_num: int = 4)[source]¶
- Overview:
Initialize the AutoregressiveEBM.
- Arguments:
obs_shape (
int
): Observation shape.action_shape (
int
): Action shape.hidden_size (
int
): Hidden size.hidden_layer_num (
int
): Number of hidden layers.
- forward(obs, action)[source]¶
- Overview:
Forward computation graph of AutoregressiveEBM.
- Arguments:
obs (
torch.Tensor
): Observation of shape (B, N, O).action (
torch.Tensor
): Action of shape (B, N, A).
- Returns:
pred (
torch.Tensor
): Energy of shape (B, N, A).
- Examples:
>>> obs = torch.randn(2, 3, 4) >>> action = torch.randn(2, 3, 5) >>> arebm = AutoregressiveEBM(4, 5) >>> pred = arebm(obs, action)
VAE¶
- class ding.model.VanillaVAE(action_shape: int, obs_shape: int, latent_size: int, hidden_dims: List = [256, 256], **kwargs)[source]¶
- Overview:
Implementation of Vanilla variational autoencoder for action reconstruction.
- Interfaces:
__init__
,encode
,decode
,decode_with_obs
,reparameterize
,forward
,loss_function
.
- __init__(action_shape: int, obs_shape: int, latent_size: int, hidden_dims: List = [256, 256], **kwargs) None [source]¶
Initialize internal Module state, shared by both nn.Module and ScriptModule.
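The constructor docstring above is the default torch.nn.Module text; the following is a minimal construction sketch based only on the class signature (the concrete sizes are illustrative).
- Examples:
>>> vae = VanillaVAE(action_shape=6, obs_shape=8, latent_size=12, hidden_dims=[256, 256])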
- decode(z: Tensor, obs_encoding: Tensor) Dict[str, Any] [source]¶
- Overview:
Maps the given latent action and obs_encoding onto the original action space.
- Arguments:
z (
torch.Tensor
): the sampled latent actionobs_encoding (
torch.Tensor
): observation encoding
- Returns:
outputs (
Dict
): VAE decoder outputs, such as reconstruction_action and prediction_residual.
- ReturnsKeys:
reconstruction_action (
torch.Tensor
): reconstruction_action.prediction_residual (
torch.Tensor
): prediction_residual.
- Shapes:
z (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent_size
obs_encoding (
torch.Tensor
): \((B, H)\), where B is batch size and H ishidden dim
- decode_with_obs(z: Tensor, obs: Tensor) Dict[str, Any] [source]¶
- Overview:
Maps the given latent action and obs onto the original action space. Using the method self.encode_obs_head(obs) to get the obs_encoding.
- Arguments:
z (
torch.Tensor
): the sampled latent actionobs (
torch.Tensor
): observation
- Returns:
outputs (
Dict
): VAE decoder outputs, such as reconstruction_action and prediction_residual.
- ReturnsKeys:
reconstruction_action (
torch.Tensor
): the action reconstructed by the VAE.prediction_residual (
torch.Tensor
): the observation predicted by the VAE.
- Shapes:
z (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent_size
obs (
torch.Tensor
): \((B, O)\), where B is batch size and O isobs_shape
- encode(input: Dict[str, Tensor]) Dict[str, Any] [source]¶
- Overview:
Encodes the input by passing through the encoder network and returns the latent codes.
- Arguments:
input (
Dict
): Dict containing keywords obs (torch.Tensor
) and action (torch.Tensor
), representing the observation and agent’s action respectively.
- Returns:
outputs (
Dict
): Dict containing keywordsmu
(torch.Tensor
),log_var
(torch.Tensor
) andobs_encoding
(torch.Tensor
) representing latent codes.
- Shapes:
obs (
torch.Tensor
): \((B, O)\), where B is batch size and O isobservation dim
.action (
torch.Tensor
): \((B, A)\), where B is batch size and A isaction dim
.mu (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.log_var (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.obs_encoding (
torch.Tensor
): \((B, H)\), where B is batch size and H ishidden dim
.
- forward(input: Dict[str, Tensor], **kwargs) dict [source]¶
- Overview:
Encode the input, reparameterize mu and log_var, decode obs_encoding.
- Arguments:
input (
Dict
): Dict containing keywords obs (torch.Tensor
) and action (torch.Tensor
), representing the observation and agent’s action respectively.
- Returns:
outputs (
Dict
): Dict containing keywordsrecons_action
(torch.Tensor
),prediction_residual
(torch.Tensor
),input
(torch.Tensor
),mu
(torch.Tensor
),log_var
(torch.Tensor
) andz
(torch.Tensor
).
- Shapes:
recons_action (
torch.Tensor
): \((B, A)\), where B is batch size and A isaction dim
.prediction_residual (
torch.Tensor
): \((B, O)\), where B is batch size and O isobservation dim
.mu (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.log_var (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.z (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent_size
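A minimal usage sketch for forward, assuming the input dict carries the obs and action keywords described in the Arguments above; sizes match the construction sketch after __init__.
- Examples:
>>> vae = VanillaVAE(action_shape=6, obs_shape=8, latent_size=12)
>>> outputs = vae({'obs': torch.randn(4, 8), 'action': torch.randn(4, 6)})
>>> outputs['recons_action'].shape  # torch.Size([4, 6])
>>> outputs['z'].shape              # torch.Size([4, 12])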
- loss_function(args: Dict[str, Tensor], **kwargs) Dict[str, Tensor] [source]¶
- Overview:
Computes the VAE loss function.
- Arguments:
args (
Dict[str, Tensor]
): Dict containing keywordsrecons_action
,prediction_residual
original_action
,mu
,log_var
andtrue_residual
.kwargs (
Dict
): Dict containing keywordskld_weight
andpredict_weight
.
- Returns:
outputs (
Dict[str, Tensor]
): Dict containing differentloss
results, includingloss
,reconstruction_loss
,kld_loss
,predict_loss
.
- Shapes:
recons_action (
torch.Tensor
): \((B, A)\), where B is batch size and A isaction dim
.prediction_residual (
torch.Tensor
): \((B, O)\), where B is batch size and O isobservation dim
.original_action (
torch.Tensor
): \((B, A)\), where B is batch size and A isaction dim
.mu (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.log_var (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.true_residual (
torch.Tensor
): \((B, O)\), where B is batch size and O isobservation dim
.
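A hedged sketch of how loss_function might be called, using only the dictionary keys and keyword arguments documented above; the weight values are illustrative, and filling original_action and true_residual by hand is an assumption about how the forward outputs are completed before computing the loss.
- Examples:
>>> obs, next_obs, action = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 6)
>>> out = vae({'obs': obs, 'action': action})
>>> out['original_action'], out['true_residual'] = action, next_obs - obs   # documented keys, filled by hand
>>> losses = vae.loss_function(out, kld_weight=0.01, predict_weight=0.01)   # illustrative weights
>>> losses['loss'], losses['reconstruction_loss'], losses['kld_loss'], losses['predict_loss']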
- reparameterize(mu: Tensor, logvar: Tensor) Tensor [source]¶
- Overview:
Reparameterization trick to sample from N(mu, var) using samples from N(0, 1).
- Arguments:
mu (
torch.Tensor
): Mean of the latent Gaussianlogvar (
torch.Tensor
): Log variance of the latent Gaussian
- Shapes:
mu (
torch.Tensor
): \((B, L)\), where B is batch size and L is latent_size
logvar (
torch.Tensor
): \((B, L)\), where B is batch size and L is latent_size
Wrapper¶
Please refer to ding/model/wrapper
for more details.
IModelWrapper¶
- class ding.model.IModelWrapper(model: Module)[source]¶
- Overview:
The basic interface class of model wrappers. A model wrapper wraps a torch.nn.Module model and adds extra operations to it, such as hidden state maintenance for RNN-based models, argmax action selection for discrete action spaces, etc.
- Interfaces:
__init__
,__getattr__
,info
,reset
,forward
.
- __getattr__(key: str) Any [source]¶
- Overview:
Get the original attributes of the torch.nn.Module model, such as variables and methods defined in the model.
- Arguments:
key (
str
): The string key to query.
- Returns:
ret (
Any
): The queried attribute.
- __init__(model: Module) None [source]¶
- Overview:
Initialize the model and other necessary member variables in the model wrapper.
- forward(*args, **kwargs) Any [source]¶
- Overview:
Basic interface, call the wrapped model’s forward method. Other derived model wrappers can override this method to add some extra operations.
- info(attr_name: str) str [source]¶
- Overview:
Get some string information of the indicated
attr_name
, which is used for debug wrappers. This method will recursively search for the indicatedattr_name
.- Arguments:
attr_name (
str
): The string key to query information.
- Returns:
info_string (
str
): The information string of the indicatedattr_name
.
- reset(data_id: List[int] | None = None, **kwargs) None [source]¶
- Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of an RNN. Here we do nothing and just implement this interface method. Other derived model wrappers can override this method to add some extra operations.
- Arguments:
data_id (
List[int]
): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id
argument to reset the stateful variables of the indicated data.
model_wrap¶
- ding.model.model_wrap(model: Module | IModelWrapper, wrapper_name: str | None = None, **kwargs)[source]¶
- Overview:
Wrap the model with the specified wrapper and return the wrapped model.
- Arguments:
model (
Any
): The model to be wrapped.wrapper_name (
str
): The name of the wrapper to be used.
Note
The arguments of the wrapper should be passed in as kwargs.
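A hedged usage sketch: wrap a DQN-style model with a sampling wrapper and pass the wrapper's own arguments as kwargs. The wrapper name 'argmax_sample' and the returned 'action' key are assumptions to verify against wrapper_name_map and the concrete wrapper implementation.
- Examples:
>>> from ding.model import DQN, model_wrap
>>> model = DQN(obs_shape=4, action_shape=2)
>>> wrapped_model = model_wrap(model, wrapper_name='argmax_sample')
>>> output = wrapped_model.forward(torch.randn(3, 4))  # assumed to add an 'action' entry with the argmax actions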
register_wrapper¶
- ding.model.register_wrapper(name: str, wrapper_type: type) None [source]¶
- Overview:
Register new wrapper to
wrapper_name_map
. When user implements a new wrapper, they must call this function to complete the registration. Then the wrapper can be called bymodel_wrap
.- Arguments:
name (
str
): The name of the new wrapper to be registered.wrapper_type (
type
): The wrapper class needs to be added inwrapper_name_map
. This argument should be the subclass ofIModelWrapper
.
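A sketch of registering a custom wrapper: subclass IModelWrapper, register it under a new name, and then obtain it through model_wrap. The wrapper class, the name 'my_wrapper' and the wrapped model variable are hypothetical.
- Examples:
>>> class MyWrapper(IModelWrapper):
>>>     def forward(self, *args, **kwargs):
>>>         # extra operations could be added here before/after the wrapped forward
>>>         return super().forward(*args, **kwargs)
>>> register_wrapper('my_wrapper', MyWrapper)
>>> wrapped_model = model_wrap(model, wrapper_name='my_wrapper')  # model is an existing torch.nn.Module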
BaseModelWrapper¶
- class ding.model.wrapper.model_wrappers.BaseModelWrapper(model: Module)[source]¶
- Overview:
Placeholder class for the model wrapper. This class is used to wrap the model without any extra operations, including a empty
reset
method and aforward
method which directly call the wrapped model’s forward. To keep the consistency of the model wrapper interface, we use this class to wrap the model without specific operations in the implementation of DI-engine’s policy.
- forward(*args, **kwargs) Any ¶
- Overview:
Basic interface, call the wrapped model’s forward method. Other derived model wrappers can override this method to add some extra operations.
- reset(data_id: List[int] | None = None, **kwargs) None ¶
- Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of an RNN. Here we do nothing and just implement this interface method. Other derived model wrappers can override this method to add some extra operations.
- Arguments:
data_id (
List[int]
): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id
argument to reset the stateful variables of the indicated data.
ArgmaxSampleWrapper¶
MultinomialSampleWrapper¶
EpsGreedySampleWrapper¶
- class ding.model.wrapper.model_wrappers.EpsGreedySampleWrapper(model: Module)[source]¶
- Overview:
Epsilon greedy sampler used in collector_model to help balance exploration and exploitation. The type of eps can vary across algorithms: a float (i.e. a Python native scalar) for the common case, or a Dict[str, float] for the NGU algorithm.
- Interfaces:
forward
.
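A hedged sketch of typical usage: obtain the wrapper through model_wrap and pass the current eps value at each forward call. The registered name 'eps_greedy_sample' and the eps keyword are assumptions to verify against the wrapper registration and implementation.
- Examples:
>>> collector_model = model_wrap(model, wrapper_name='eps_greedy_sample')  # model is an existing discrete-action model
>>> output = collector_model.forward(torch.randn(3, 4), eps=0.1)           # assumed to return an epsilon-greedily sampled 'action'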
EpsGreedyMultinomialSampleWrapper¶
DeterministicSampleWrapper¶
ReparamSampleWrapper¶
CombinationArgmaxSampleWrapper¶
CombinationMultinomialSampleWrapper¶
HybridArgmaxSampleWrapper¶
HybridEpsGreedySampleWrapper¶
HybridEpsGreedyMultinomialSampleWrapper¶
- class ding.model.wrapper.model_wrappers.HybridEpsGreedyMultinomialSampleWrapper(model: Module)[source]¶
- Overview:
Epsilon greedy sampler coupled with multinomial sampling, used in collector_model to help balance exploration and exploitation in a hybrid action space, i.e. {'action_type': discrete, 'action_args': continuous}.
- Interfaces:
forward
.
HybridReparamMultinomialSampleWrapper¶
- class ding.model.wrapper.model_wrappers.HybridReparamMultinomialSampleWrapper(model: Module)[source]¶
- Overview:
Reparameterization sampler coupled with multinomial sampling, used in collector_model to help balance exploration and exploitation in a hybrid action space, i.e. {'action_type': discrete, 'action_args': continuous}.
- Interfaces:
forward
HybridDeterministicArgmaxSampleWrapper¶
ActionNoiseWrapper¶
- class ding.model.wrapper.model_wrappers.ActionNoiseWrapper(model: Any, noise_type: str = 'gauss', noise_kwargs: dict = {}, noise_range: dict | None = None, action_range: dict | None = {'max': 1, 'min': -1})[source]¶
- Overview:
Add noise to the collector's action output, and clip both the generated noise and the action after the noise is added.
- Interfaces:
__init__
,forward
.- Arguments:
model (
Any
): Wrapped model class. Should containforward
method.noise_type (
str
): The type of noise that should be generated, support [‘gauss’, ‘ou’].noise_kwargs (
dict
): Keyword args that should be used in noise init. Depends onnoise_type
.noise_range (
Optional[dict]
): Range of noise, used for clipping.action_range (
Optional[dict]
): Range of action + noise, used for clip, default clip to [-1, 1].
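A hedged sketch of creating this wrapper through model_wrap with Gaussian noise; the registered name 'action_noise' and the noise_kwargs fields are assumptions to check against the actual wrapper and noise-generator registration.
- Examples:
>>> collector_model = model_wrap(
>>>     model,  # an existing continuous-action model
>>>     wrapper_name='action_noise',
>>>     noise_type='gauss',
>>>     noise_kwargs={'mu': 0.0, 'sigma': 0.1},
>>>     noise_range={'min': -0.5, 'max': 0.5},
>>>     action_range={'min': -1, 'max': 1},
>>> )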
TargetNetworkWrapper¶
- class ding.model.wrapper.model_wrappers.TargetNetworkWrapper(model: Any, update_type: str, update_kwargs: dict)[source]¶
- Overview:
Maintain and update the target network
- Interfaces:
update, reset
- __init__(model: Any, update_type: str, update_kwargs: dict)[source]¶
- Overview:
Initialize the model and other necessary member variables in the model wrapper.
- forward(*args, **kwargs) Any ¶
- Overview:
Basic interface, call the wrapped model’s forward method. Other derived model wrappers can override this method to add some extra operations.
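A hedged sketch of creating a target network through model_wrap; the registered name 'target', the 'momentum' update type and the theta keyword are assumptions drawn from common DI-engine usage and should be verified against the wrapper implementation.
- Examples:
>>> target_model = model_wrap(model, wrapper_name='target', update_type='momentum', update_kwargs={'theta': 0.005})
>>> target_model.update(model.state_dict())  # soft-update the target parameters from the online model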
TransformerInputWrapper¶
- class ding.model.wrapper.model_wrappers.TransformerInputWrapper(model: ~typing.Any, seq_len: int, init_fn: ~typing.Callable = <function TransformerInputWrapper.<lambda>>)[source]¶
- __init__(model: ~typing.Any, seq_len: int, init_fn: ~typing.Callable = <function TransformerInputWrapper.<lambda>>) None [source]¶
- Overview:
Given N, the length of the sequences received by a Transformer model, maintain the last N-1 input observations. In this way we can provide, at each step, all the observations the Transformer needs to compute its output. We need this because some methods such as ‘collect’ and ‘evaluate’ only provide the model one observation per step and keep no memory of past observations, while the Transformer needs a sequence of N observations. The wrapper method
forward
will save the input observation in a FIFO memory of length N and the methodreset
will reset the memory. The empty memory spaces will be initialized with ‘init_fn’ or zero by calling the methodreset_input
. Since different env can terminate at different steps, the methodreset_memory_entry
only initializes the memory of specific environments in the batch.- Arguments:
model (
Any
): Wrapped model class, should contain forward method.seq_len (
int
): Number of past observations to remember.init_fn (
Callable
): The function which is used to init every memory locations when init and reset.
- forward(input_obs: Tensor, only_last_logit: bool = True, data_id: List | None = None, **kwargs) Dict[str, Tensor] [source]¶
- Arguments:
input_obs (
torch.Tensor
): Input observation without sequence shape:(bs, *obs_shape)
.only_last_logit (
bool
): If True, ‘logit’ only contains the output corresponding to the current observation (shape: (bs, embedding_dim)); otherwise, ‘logit’ has shape (seq_len, bs, embedding_dim).data_id (
List
): IDs of the envs that are currently running. Memory updates and returned logits only take effect for those environments. If None, all envs are considered to be running.
- Returns:
Dictionary containing the input_sequence ‘input_seq’ stored in memory and the transformer output ‘logit’.
- reset(*args, **kwargs)[source]¶
- Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of an RNN. Here we do nothing and just implement this interface method. Other derived model wrappers can override this method to add some extra operations.
- Arguments:
data_id (
List[int]
): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id
argument to reset the stateful variables of the indicated data.
TransformerSegmentWrapper¶
- class ding.model.wrapper.model_wrappers.TransformerSegmentWrapper(model: Any, seq_len: int)[source]¶
- __init__(model: Any, seq_len: int) None [source]¶
- Overview:
Given T the length of a trajectory and N the length of the sequences received by a Transformer model, split T in sequences of N elements and forward each sequence one by one. If T % N != 0, the last sequence will be zero-padded. Usually used during Transformer training phase.
- Arguments:
model (
Any
): Wrapped model class, should contain forward method.seq_len (
int
): N, length of a sequence.
TransformerMemoryWrapper¶
- class ding.model.wrapper.model_wrappers.TransformerMemoryWrapper(model: Any, batch_size: int)[source]¶
- __init__(model: Any, batch_size: int) None [source]¶
- Overview:
- Stores a copy of the Transformer memory so that it can be reused across different phases. To make it
clearer, suppose the training pipeline is divided into 3 phases: evaluate, collect, learn. The goal of the wrapper is to keep the content of the memory at the end of each phase and reuse it when the same phase is executed again. In this way, it prevents different phases from interfering with each other's memory.
- Arguments:
model (
Any
): Wrapped model class, should contain forward method.batch_size (
int
): Memory batch size.
- forward(*args, **kwargs) Dict[str, Tensor] [source]¶
- Arguments:
data (
dict
): Dict type data, including at least [‘main_obs’, ‘target_obs’, ‘action’, ‘reward’, ‘done’, ‘weight’]
- Returns:
Output of the forward method.
- reset(*args, **kwargs)[source]¶
- Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of an RNN. Here we do nothing and just implement this interface method. Other derived model wrappers can override this method to add some extra operations.
- Arguments:
data_id (
List[int]
): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id
argument to reset the stateful variables of the indicated data.