ding.model¶
Common¶
Please refer to ding/model/common
for more details.
create_model¶
- ding.model.create_model(cfg: EasyDict) Module [source]¶
- Overview:
Create a neural network model according to the given EasyDict-type cfg.
- Arguments:
    - cfg (EasyDict): User's model config. The key import_names is used to import modules, and the key type is used to indicate which model to create.
- Returns:
    - (torch.nn.Module): The created neural network model.
- Examples:
>>> cfg = EasyDict({
>>>     'import_names': ['ding.model.template.q_learning'],
>>>     'type': 'dqn',
>>>     'obs_shape': 4,
>>>     'action_shape': 2,
>>> })
>>> model = create_model(cfg)
Tip
This method will not modify the passed-in cfg; it deepcopies cfg and then modifies the copy.
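A minimal sketch (not part of the original docs) illustrating this behavior, reusing the config keys from the example above:
>>> cfg = EasyDict({
>>>     'import_names': ['ding.model.template.q_learning'],
>>>     'type': 'dqn',
>>>     'obs_shape': 4,
>>>     'action_shape': 2,
>>> })
>>> model = create_model(cfg)
>>> # the user's cfg keeps its original keys because create_model works on a deepcopy
>>> assert 'type' in cfg and 'import_names' in cfg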
ConvEncoder¶
- class ding.model.ConvEncoder(obs_shape: SequenceType, hidden_size_list: SequenceType = [32, 64, 64, 128], activation: Module | None = ReLU(), kernel_size: SequenceType = [8, 4, 3], stride: SequenceType = [4, 2, 1], padding: SequenceType | None = None, layer_norm: bool | None = False, norm_type: str | None = None)[source]¶
- Overview:
The Convolution Encoder is used to encode 2-dim image observations.
- Interfaces:
    __init__, forward.
- __init__(obs_shape: SequenceType, hidden_size_list: SequenceType = [32, 64, 64, 128], activation: Module | None = ReLU(), kernel_size: SequenceType = [8, 4, 3], stride: SequenceType = [4, 2, 1], padding: SequenceType | None = None, layer_norm: bool | None = False, norm_type: str | None = None) None [source]¶
- Overview:
Initialize the Convolution Encoder according to the provided arguments.
- Arguments:
    - obs_shape (SequenceType): Sequence of in_channel, plus one or more input sizes.
    - hidden_size_list (SequenceType): Sequence of hidden_size of subsequent conv layers and the final dense layer.
    - activation (nn.Module): Type of activation to use in the conv layers and ResBlock. Default is nn.ReLU().
    - kernel_size (SequenceType): Sequence of kernel_size of subsequent conv layers.
    - stride (SequenceType): Sequence of stride of subsequent conv layers.
    - padding (SequenceType): Padding added to all four sides of the input for each conv layer. See nn.Conv2d for more details. Default is None.
    - layer_norm (bool): Whether to use DreamerLayerNorm, a special normalization trick proposed in DreamerV3.
    - norm_type (str): Type of normalization to use. See ding.torch_utils.network.ResBlock for more details. Default is None.
- forward(x: Tensor) Tensor [source]¶
- Overview:
Return output 1D embedding tensor of the env’s 2D image observation.
- Arguments:
    - x (torch.Tensor): Raw 2D observation of the environment.
- Returns:
    - outputs (torch.Tensor): Output embedding tensor.
- Shapes:
    - x: \((B, C, H, W)\), where B is batch size, C is channel, H is height, and W is width.
    - outputs: \((B, N)\), where N = hidden_size_list[-1].
- Examples:
>>> conv = ConvEncoder(
>>>     obs_shape=(4, 84, 84),
>>>     hidden_size_list=[32, 64, 64, 128],
>>>     activation=nn.ReLU(),
>>>     kernel_size=[8, 4, 3],
>>>     stride=[4, 2, 1],
>>>     padding=None,
>>>     layer_norm=False,
>>>     norm_type=None
>>> )
>>> x = torch.randn(1, 4, 84, 84)
>>> output = conv(x)
FCEncoder¶
- class ding.model.FCEncoder(obs_shape: int, hidden_size_list: SequenceType, res_block: bool = False, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None)[source]¶
- Overview:
The fully connected encoder is used to encode 1-dim input variables.
- Interfaces:
    __init__, forward.
- __init__(obs_shape: int, hidden_size_list: SequenceType, res_block: bool = False, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None) None [source]¶
- Overview:
Initialize the FC Encoder according to arguments.
- Arguments:
    - obs_shape (int): Observation shape.
    - hidden_size_list (SequenceType): Sequence of hidden_size of subsequent FC layers.
    - res_block (bool): Whether to use res_block. Default is False.
    - activation (nn.Module): Type of activation to use in ResFCBlock. Default is nn.ReLU().
    - norm_type (str): Type of normalization to use. See ding.torch_utils.network.ResFCBlock for more details. Default is None.
    - dropout (float): Dropout rate of the dropout layer. If None, no dropout layer is used.
- forward(x: Tensor) Tensor [source]¶
- Overview:
Return output embedding tensor of the env observation.
- Arguments:
    - x (torch.Tensor): Env raw observation.
- Returns:
    - outputs (torch.Tensor): Output embedding tensor.
- Shapes:
    - x: \((B, M)\), where M = obs_shape.
    - outputs: \((B, N)\), where N = hidden_size_list[-1].
- Examples:
>>> fc = FCEncoder(
>>>     obs_shape=4,
>>>     hidden_size_list=[32, 64, 64, 128],
>>>     activation=nn.ReLU(),
>>>     norm_type=None,
>>>     dropout=None
>>> )
>>> x = torch.randn(1, 4)
>>> output = fc(x)
IMPALAConvEncoder¶
- class ding.model.IMPALAConvEncoder(obs_shape: SequenceType, channels: SequenceType = (16, 32, 32), outsize: int = 256, scale_ob: float = 255.0, nblock: int = 2, final_relu: bool = True, **kwargs)[source]¶
- Overview:
IMPALA CNN encoder, which is used in the IMPALA algorithm. See IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, https://arxiv.org/pdf/1802.01561.pdf for more details.
- Interface:
    __init__, forward, output_shape.
- __init__(obs_shape: SequenceType, channels: SequenceType = (16, 32, 32), outsize: int = 256, scale_ob: float = 255.0, nblock: int = 2, final_relu: bool = True, **kwargs) None [source]¶
- Overview:
Initialize the IMPALA CNN encoder according to arguments.
- Arguments:
    - obs_shape (SequenceType): 2D image observation shape.
    - channels (SequenceType): The channel numbers of a series of IMPALA CNN blocks. Each element of the sequence is the output channel number of one IMPALA CNN block.
    - outsize (int): The output size of the final linear layer, i.e., the dimension of the 1D embedding vector.
    - scale_ob (float): The scale of the input observation, used to normalize it, such as dividing by 255.0 for raw image observations.
    - nblock (int): The number of residual blocks in each IMPALA CNN block.
    - final_relu (bool): Whether to use ReLU activation on the final output of the encoder.
    - kwargs (Dict[str, Any]): Other arguments for IMPALACnnDownStack.
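No example is given here in the original docs; the following is a minimal usage sketch under the assumption that the encoder takes a (B, C, H, W) observation tensor and returns a (B, outsize) embedding:
>>> encoder = IMPALAConvEncoder(obs_shape=(4, 84, 84))
>>> x = torch.randn(8, 4, 84, 84)
>>> output = encoder(x)
>>> # default outsize: int = 256
>>> assert output.shape == torch.Size([8, 256])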
DiscreteHead¶
- class ding.model.DiscreteHead(hidden_size: int, output_size: int, layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, noise: bool | None = False)[source]¶
- Overview:
The DiscreteHead is used to generate discrete action logit or Q-value logit, which is often used in Q-learning algorithms or actor-critic algorithms for discrete action spaces.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the DiscreteHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to DiscreteHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - dropout (float): The dropout rate. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with DiscreteHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword logit (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
- Examples:
>>> head = DiscreteHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 64])
DistributionHead¶
- class ding.model.DistributionHead(hidden_size: int, output_size: int, layer_num: int = 1, n_atom: int = 51, v_min: float = -10, v_max: float = 10, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False, eps: float | None = 1e-06)[source]¶
- Overview:
The DistributionHead is used to generate the distribution of Q-value. This module is used in the C51 algorithm.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, n_atom: int = 51, v_min: float = -10, v_max: float = 10, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False, eps: float | None = 1e-06) None [source]¶
- Overview:
Init the DistributionHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to DistributionHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value distribution.
    - n_atom (int): The number of atoms (discrete supports). Default is 51.
    - v_min (float): Min value of atoms. Default is -10.
    - v_max (float): Max value of atoms. Default is 10.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
    - eps (float): Small constant used for numerical stability.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with DistributionHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor) and distribution (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - distribution: \((B, M, n_atom)\).
- Examples:
>>> head = DistributionHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom is 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
RainbowHead¶
- class ding.model.RainbowHead(hidden_size: int, output_size: int, layer_num: int = 1, n_atom: int = 51, v_min: float = -10, v_max: float = 10, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = True, eps: float | None = 1e-06)[source]¶
- Overview:
The RainbowHead is used to generate the distribution of Q-value. This module is used in Rainbow DQN.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, n_atom: int = 51, v_min: float = -10, v_max: float = 10, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = True, eps: float | None = 1e-06) None [source]¶
- Overview:
Init the RainbowHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to RainbowHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - n_atom (int): The number of atoms (discrete supports). Default is 51.
    - v_min (float): Min value of atoms. Default is -10.
    - v_max (float): Max value of atoms. Default is 10.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default True.
    - eps (float): Small constant used for numerical stability.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with RainbowHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor) and distribution (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - distribution: \((B, M, n_atom)\).
- Examples:
>>> head = RainbowHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom is 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
QRDQNHead¶
- class ding.model.QRDQNHead(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False)[source]¶
- Overview:
The QRDQNHead (Quantile Regression DQN) is used to output action quantiles.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the QRDQNHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to QRDQNHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): The number of quantiles. Default is 32.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with QRDQNHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor), q (torch.Tensor), and tau (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - q: \((B, M, num_quantiles)\).
    - tau: \((B, num_quantiles, 1)\).
- Examples:
>>> head = QRDQNHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([4, 64, 32])
>>> assert outputs['tau'].shape == torch.Size([4, 32, 1])
QuantileHead¶
- class ding.model.QuantileHead(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, beta_function_type: str | None = 'uniform', activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False)[source]¶
- Overview:
The QuantileHead is used to output action quantiles. This module is used in IQN.
- Interfaces:
    __init__, forward, quantile_net.
Note
The difference between QuantileHead and QRDQNHead is that QuantileHead models the state-action quantile function as a mapping from state-actions and samples from some base distribution, while QRDQNHead approximates random returns by a uniform mixture of Dirac functions.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, beta_function_type: str | None = 'uniform', activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the QuantileHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to QuantileHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): The number of quantiles.
    - quantile_embedding_size (int): The embedding size of a quantile.
    - beta_function_type (str): Type of beta function. See ding.rl_utils.beta_function.py for more details. Default is uniform.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor, num_quantiles: int | None = None) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with QuantileHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor), q (torch.Tensor), and quantiles (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - q: \((num_quantiles, B, M)\).
    - quantiles: \((quantile_embedding_size, 1)\).
- Examples:
>>> head = QuantileHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([32, 4, 64])
>>> assert outputs['quantiles'].shape == torch.Size([128, 1])
- quantile_net(quantiles: Tensor) Tensor [source]¶
- Overview:
Deterministic parametric function trained to reparameterize samples from a base distribution. By repeated Bellman update iterations of Q-learning, the optimal action-value function is estimated.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor of parametric sample.
- Returns:
    - quantile_net (torch.Tensor): Quantile network output tensor after reparameterization.
- Shapes:
    - quantile_net: \((quantile_embedding_size, M)\), where M = output_size.
- Examples:
>>> head = QuantileHead(64, 64)
>>> quantiles = torch.randn(128, 1)
>>> qn_output = head.quantile_net(quantiles)
>>> assert isinstance(qn_output, torch.Tensor)
>>> # default quantile_embedding_size: int = 128
>>> assert qn_output.shape == torch.Size([128, 64])
FQFHead¶
- class ding.model.FQFHead(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False)[source]¶
- Overview:
The FQFHead is used to output action quantiles. This module is used in FQF.
- Interfaces:
    __init__, forward, quantile_net.
Note
The implementation of FQFHead is based on the paper https://arxiv.org/abs/1911.02140. The difference between FQFHead and QuantileHead is that, in FQF, N adjustable quantile values for N adjustable quantile fractions are estimated to approximate the quantile function, and the distribution of the return is approximated by a weighted mixture of N Dirac functions. In IQN, by contrast, the state-action quantile function is modeled as a mapping from state-actions and samples from some base distribution.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the FQFHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to FQFHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): The number of quantiles.
    - quantile_embedding_size (int): The embedding size of a quantile.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor, num_quantiles: int | None = None) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with FQFHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor), q (torch.Tensor), quantiles (torch.Tensor), quantiles_hats (torch.Tensor), q_tau_i (torch.Tensor), and entropies (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
    - q: \((B, num_quantiles, M)\).
    - quantiles: \((B, num_quantiles + 1)\).
    - quantiles_hats: \((B, num_quantiles)\).
    - q_tau_i: \((B, num_quantiles - 1, M)\).
    - entropies: \((B, 1)\).
- Examples:
>>> head = FQFHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([4, 32, 64])
>>> assert outputs['quantiles'].shape == torch.Size([4, 33])
>>> assert outputs['quantiles_hats'].shape == torch.Size([4, 32])
>>> assert outputs['q_tau_i'].shape == torch.Size([4, 31, 64])
>>> assert outputs['entropies'].shape == torch.Size([4, 1])
- quantile_net(quantiles: Tensor) Tensor [source]¶
- Overview:
Deterministic parametric function trained to reparameterize samples from the quantiles_proposal network. By repeated Bellman update iterations of Q-learning, the optimal action-value function is estimated.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor of parametric sample.
- Returns:
    - quantile_net (torch.Tensor): Quantile network output tensor after reparameterization.
- Examples:
>>> head = FQFHead(64, 64)
>>> quantiles = torch.randn(4, 32)
>>> qn_output = head.quantile_net(quantiles)
>>> assert isinstance(qn_output, torch.Tensor)
>>> # default quantile_embedding_size: int = 128
>>> assert qn_output.shape == torch.Size([4, 32, 64])
DuelingHead¶
- class ding.model.DuelingHead(hidden_size: int, output_size: int, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, noise: bool | None = False)[source]¶
- Overview:
The DuelingHead is used to output discrete action logit. This module is used in Dueling DQN.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, output_size: int, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, noise: bool | None = False) None [source]¶
- Overview:
Init the DuelingHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to DuelingHead.
    - output_size (int): The number of outputs.
    - a_layer_num (int): The number of layers used in the network to compute action output.
    - v_layer_num (int): The number of layers used in the network to compute value output.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - dropout (float): The dropout rate of the dropout layer. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with DuelingHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword logit (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
- Examples:
>>> head = DuelingHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
StochasticDuelingHead¶
- class ding.model.StochasticDuelingHead(hidden_size: int, action_shape: int, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False, last_tanh: bool | None = True)[source]¶
- Overview:
The Stochastic Dueling Network is proposed in the ACER paper (arXiv 1611.01224). It is a dueling network architecture for continuous action spaces.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, action_shape: int, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, activation: Module | None = ReLU(), norm_type: str | None = None, noise: bool | None = False, last_tanh: bool | None = True) None [source]¶
- Overview:
Init the StochasticDuelingHead layers according to the provided arguments.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to StochasticDuelingHead.
    - action_shape (int): The number of continuous action dimensions, usually an integer value.
    - layer_num (int): The default number of layers used in the network to compute action and value output.
    - a_layer_num (int): The number of layers used in the network to compute action output. Default is layer_num.
    - v_layer_num (int): The number of layers used in the network to compute value output. Default is layer_num.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
    - last_tanh (bool): If True, apply tanh to actions. Default True.
- forward(s: Tensor, a: Tensor, mu: Tensor, sigma: Tensor, sample_size: int = 10) Dict[str, Tensor] [source]¶
- Overview:
Use encoded embedding tensor to run MLP with StochasticDuelingHead and return the prediction dictionary.
- Arguments:
    - s (torch.Tensor): Tensor containing input embedding.
    - a (torch.Tensor): The original continuous behaviour action.
    - mu (torch.Tensor): The mu of the Gaussian reparameterization output of the actor head at the current timestep.
    - sigma (torch.Tensor): The sigma of the Gaussian reparameterization output of the actor head at the current timestep.
    - sample_size (int): The number of samples for continuous actions when computing the Q value.
- Returns:
    - outputs (Dict): Dict containing keywords q_value (torch.Tensor) and v_value (torch.Tensor).
- Shapes:
    - s: \((B, N)\), where B = batch_size and N = hidden_size.
    - a: \((B, A)\), where A = action_size.
    - mu: \((B, A)\).
    - sigma: \((B, A)\).
    - q_value: \((B, 1)\).
    - v_value: \((B, 1)\).
- Examples:
>>> head = StochasticDuelingHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> a = torch.randn(4, 64)
>>> mu = torch.randn(4, 64)
>>> sigma = torch.ones(4, 64)
>>> outputs = head(inputs, a, mu, sigma)
>>> assert isinstance(outputs, dict)
>>> assert outputs['q_value'].shape == torch.Size([4, 1])
>>> assert outputs['v_value'].shape == torch.Size([4, 1])
BranchingHead¶
- class ding.model.BranchingHead(hidden_size: int, num_branches: int = 0, action_bins_per_branch: int = 2, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, norm_type: str | None = None, activation: Module | None = ReLU(), noise: bool | None = False)[source]¶
- Overview:
The BranchingHead is used to generate Q-values with different branches. This module is used in Branching DQN.
- Interfaces:
    __init__, forward.
- __init__(hidden_size: int, num_branches: int = 0, action_bins_per_branch: int = 2, layer_num: int = 1, a_layer_num: int | None = None, v_layer_num: int | None = None, norm_type: str | None = None, activation: Module | None = ReLU(), noise: bool | None = False) None [source]¶
- Overview:
Init the BranchingHead layers according to the provided arguments. This head achieves a linear increase of the number of network outputs with the number of degrees of freedom by allowing a level of independence for each individual action. Therefore, this head is suitable for high-dimensional action spaces.
- Arguments:
    - hidden_size (int): The hidden_size of the MLP connected to BranchingHead.
    - num_branches (int): The number of branches, which is equivalent to the action dimension.
    - action_bins_per_branch (int): The number of action bins in each dimension.
    - layer_num (int): The number of layers used in the network to compute Advantage and Value output.
    - a_layer_num (int): The number of layers used in the network to compute Advantage output.
    - v_layer_num (int): The number of layers used in the network to compute Value output.
    - output_size (int): The number of outputs.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - noise (bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default False.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with BranchingHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword logit (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, M)\), where M = output_size.
- Examples:
>>> head = BranchingHead(64, 5, 2)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 5, 2])
RegressionHead¶
- class ding.model.RegressionHead(input_size: int, output_size: int, layer_num: int = 2, final_tanh: bool | None = False, activation: Module | None = ReLU(), norm_type: str | None = None, hidden_size: int | None = None)[source]¶
- Overview:
The RegressionHead is used to regress continuous variables. This module is used for generating the Q-value of continuous actions (DDPG critic), the state value (A2C/PPO), or directly predicting continuous actions (DDPG actor).
- Interfaces:
    __init__, forward.
- __init__(input_size: int, output_size: int, layer_num: int = 2, final_tanh: bool | None = False, activation: Module | None = ReLU(), norm_type: str | None = None, hidden_size: int | None = None) None [source]¶
- Overview:
Init the RegressionHead layers according to the provided arguments.
- Arguments:
    - input_size (int): The input size of the MLP connected to RegressionHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - final_tanh (bool): If True, apply tanh to the output. Default False.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with RegressionHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword pred (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - pred: \((B, M)\), where M = output_size.
- Examples:
>>> head = RegressionHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['pred'].shape == torch.Size([4, 64])
ReparameterizationHead¶
- class ding.model.ReparameterizationHead(input_size: int, output_size: int, layer_num: int = 2, sigma_type: str | None = None, fixed_sigma_value: float | None = 1.0, activation: Module | None = ReLU(), norm_type: str | None = None, bound_type: str | None = None, hidden_size: int | None = None)[source]¶
- Overview:
The ReparameterizationHead is used to generate a Gaussian distribution over continuous variables, parameterized by mu and sigma. This module is often used in stochastic policies, such as PPO and SAC.
- Interfaces:
    __init__, forward.
- __init__(input_size: int, output_size: int, layer_num: int = 2, sigma_type: str | None = None, fixed_sigma_value: float | None = 1.0, activation: Module | None = ReLU(), norm_type: str | None = None, bound_type: str | None = None, hidden_size: int | None = None) None [source]¶
- Overview:
Init the ReparameterizationHead layers according to the provided arguments.
- Arguments:
    - input_size (int): The input size of the MLP connected to ReparameterizationHead.
    - output_size (int): The number of outputs.
    - layer_num (int): The number of layers used in the network to compute Q value output.
    - sigma_type (str): Sigma type used. Choose among ['fixed', 'independent', 'conditioned']. Default is None.
    - fixed_sigma_value (float): When choosing the fixed type, the tensor output['sigma'] is filled with this input value. Default is 1.0.
    - activation (nn.Module): The type of activation function to use in the MLP. If None, activation defaults to nn.ReLU(). Default None.
    - norm_type (str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
    - bound_type (str): Bound type to apply to the output mu. Choose among ['tanh', None]. Default is None.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with ReparameterizationHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keywords mu (torch.Tensor) and sigma (torch.Tensor).
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - mu: \((B, M)\), where M = output_size.
    - sigma: \((B, M)\).
- Examples:
>>> head = ReparameterizationHead(64, 64, sigma_type='fixed')
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['mu'].shape == torch.Size([4, 64])
>>> assert outputs['sigma'].shape == torch.Size([4, 64])
AttentionPolicyHead¶
- class ding.model.AttentionPolicyHead[source]¶
- Overview:
Cross-attention-type discrete action policy head, which is often used in variable discrete action spaces.
- Interfaces:
    __init__, forward.
- __init__() None [source]¶
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(key: Tensor, query: Tensor) Tensor [source]¶
- Overview:
Use attention-like mechanism to combine key and query tensor to output discrete action logit.
- Arguments:
    - key (torch.Tensor): Tensor containing key embedding.
    - query (torch.Tensor): Tensor containing query embedding.
- Returns:
    - logit (torch.Tensor): Tensor containing output discrete action logit.
- Shapes:
    - key: \((B, N, K)\), where B = batch_size, N = number of possible discrete action choices, and K = hidden_size.
    - query: \((B, K)\).
    - logit: \((B, N)\).
- Examples:
>>> head = AttentionPolicyHead()
>>> key = torch.randn(4, 5, 64)
>>> query = torch.randn(4, 64)
>>> logit = head(key, query)
>>> assert logit.shape == torch.Size([4, 5])
Note
In this head, we assume that the key and query tensors are both normalized.
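A minimal sketch (not from the original docs) of one possible way to satisfy this assumption, normalizing key and query along the feature dimension with torch.nn.functional.normalize before the forward call:
>>> import torch
>>> import torch.nn.functional as F
>>> head = AttentionPolicyHead()
>>> key = F.normalize(torch.randn(4, 5, 64), dim=-1)    # unit-norm key embeddings
>>> query = F.normalize(torch.randn(4, 64), dim=-1)     # unit-norm query embedding
>>> logit = head(key, query)
>>> assert logit.shape == torch.Size([4, 5])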
MultiHead¶
- class ding.model.MultiHead(head_cls: type, hidden_size: int, output_size_list: SequenceType, **head_kwargs)[source]¶
- Overview:
The MultiHead is used to generate multiple similar results. For example, we can combine DistributionHead and MultiHead to generate multi-discrete action space logit.
- Interfaces:
    __init__, forward.
- __init__(head_cls: type, hidden_size: int, output_size_list: SequenceType, **head_kwargs) None [source]¶
- Overview:
Init the MultiHead layers according to the provided arguments.
- Arguments:
    - head_cls (type): The class of head, chosen among [DuelingHead, DistributionHead, QuantileHead, ...].
    - hidden_size (int): The hidden_size of the MLP connected to the Head.
    - output_size_list (SequenceType): Sequence of output_size for multi-discrete action, e.g. [2, 3, 5].
    - head_kwargs (dict): Dict containing class-specific arguments.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to run MLP with MultiHead and return the prediction dictionary.
- Arguments:
    - x (torch.Tensor): Tensor containing input embedding.
- Returns:
    - outputs (Dict): Dict containing keyword logit (torch.Tensor), where the logit of each output i can be accessed at ['logit'][i].
- Shapes:
    - x: \((B, N)\), where B = batch_size and N = hidden_size.
    - logit: \((B, Mi)\), where Mi = output_size corresponding to output i.
- Examples:
>>> head = MultiHead(DuelingHead, 64, [2, 3, 5], v_layer_num=2)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> # output_size_list is [2, 3, 5] as set
>>> # Therefore each dim of logit is as follows
>>> outputs['logit'][0].shape
>>> torch.Size([4, 2])
>>> outputs['logit'][1].shape
>>> torch.Size([4, 3])
>>> outputs['logit'][2].shape
>>> torch.Size([4, 5])
independent_normal_dist¶
- ding.model.independent_normal_dist(logits: List | Dict) Distribution [source]¶
- Overview:
Convert logits of different types into an independent normal distribution.
- Arguments:
    - logits (Union[List, Dict]): The logits to be converted.
- Returns:
    - dist (torch.distributions.Distribution): The converted normal distribution.
- Examples:
>>> logits = [torch.randn(4, 5), torch.ones(4, 5)]
>>> dist = independent_normal_dist(logits)
>>> assert isinstance(dist, torch.distributions.Independent)
>>> assert isinstance(dist.base_dist, torch.distributions.Normal)
>>> assert dist.base_dist.loc.shape == torch.Size([4, 5])
>>> assert dist.base_dist.scale.shape == torch.Size([4, 5])
- Raises:
    - TypeError: If the type of logits is not list or dict.
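The original example covers the list form only; the following sketch is an assumption, not from the original docs, and guesses that the dict form uses the keys mu and sigma:
>>> # assumed dict form with 'mu' and 'sigma' keys (hypothetical illustration)
>>> logits = {'mu': torch.randn(4, 5), 'sigma': torch.ones(4, 5)}
>>> dist = independent_normal_dist(logits)
>>> assert isinstance(dist, torch.distributions.Independent)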
Template¶
Please refer to ding/model/template
for more details.
DQN¶
- class ding.model.DQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, init_bias: float | None = None)[source]¶
- Overview:
The neural network structure and computation graph of the Deep Q Network (DQN) algorithm, which is the most classic value-based RL algorithm for discrete actions. DQN is composed of two parts: encoder and head. The encoder is used to extract features from various observations, and the head is used to compute the Q value of each action dimension.
- Interfaces:
    __init__, forward.
Note
Currently, DQN supports two types of encoder: FCEncoder and ConvEncoder, and two types of head: DiscreteHead and DuelingHead. You can customize your own encoder or head by inheriting this class.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, init_bias: float | None = None) None [source]¶
- Overview:
Initialize the DQN (encoder + head) model according to the corresponding input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
    - action_shape (Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder; the last element must match head_hidden_size.
    - dueling (Optional[bool]): Whether to choose DuelingHead or DiscreteHead (default).
    - head_hidden_size (Optional[int]): The hidden_size of the head network; defaults to None, in which case it is set to the last element of encoder_hidden_size_list.
    - head_layer_num (int): The number of layers used in the head network to compute Q value output.
    - activation (Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization in networks. See ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
    - dropout (Optional[float]): The dropout rate of the dropout layer; if None, the dropout layer is disabled.
    - init_bias (Optional[float]): The initial value of the last layer bias in the head network.
- forward(x: Tensor) Dict [source]¶
- Overview:
DQN forward computation graph, input observation tensor to predict q_value.
- Arguments:
    - x (torch.Tensor): The input observation tensor data.
- Returns:
    - outputs (Dict): The output of DQN's forward pass, including q_value.
- ReturnsKeys:
    - logit (torch.Tensor): Discrete Q-value output of each possible action dimension.
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.
    - logit (torch.Tensor): \((B, M)\), where B is batch size and M is action_shape.
- Examples:
>>> model = DQN(32, 6)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])
Note
For consistency and compatibility, we name all network outputs related to action selection as logit.
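As noted above, DQN also supports ConvEncoder for image observations; the following is a hedged sketch (not from the original docs), assuming that a 3-dim obs_shape makes DQN select the convolutional encoder:
>>> model = DQN(obs_shape=[4, 84, 84], action_shape=6)  # assumed to build a ConvEncoder internally
>>> inputs = torch.randn(4, 4, 84, 84)
>>> outputs = model(inputs)
>>> assert outputs['logit'].shape == torch.Size([4, 6])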
C51DQN¶
- class ding.model.C51DQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51)[source]¶
- Overview:
The neural network structure and computation graph of C51DQN, which combines distributional RL and DQN. You can refer to https://arxiv.org/pdf/1707.06887.pdf for more details. C51DQN is composed of encoder and head: the encoder is used to extract features from the observation, and the head is used to compute the distribution of Q-value.
- Interfaces:
    __init__, forward
Note
Currently, C51DQN supports two types of encoder: FCEncoder and ConvEncoder.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51) None [source]¶
- Overview:
Initialize the C51 model according to the corresponding input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
    - action_shape (Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder; the last element must match head_hidden_size.
    - head_hidden_size (Optional[int]): The hidden_size of the head network; defaults to None, in which case it is set to the last element of encoder_hidden_size_list.
    - head_layer_num (int): The number of layers used in the head network to compute Q value output.
    - activation (Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization in networks. See ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
    - v_min (Optional[float]): The minimum value of the support of the distribution, which is related to the value (discounted sum of reward) scale of the specific environment. Defaults to -10.
    - v_max (Optional[float]): The maximum value of the support of the distribution, which is related to the value (discounted sum of reward) scale of the specific environment. Defaults to 10.
    - n_atom (Optional[int]): The number of atoms in the prediction distribution; 51 is the default value in the paper, and you can also try other values such as 301.
- forward(x: Tensor) Dict [source]¶
- Overview:
C51DQN forward computation graph, input observation tensor to predict q_value and its distribution.
- Arguments:
    - x (torch.Tensor): The input observation tensor data.
- Returns:
    - outputs (Dict): The output of C51DQN's forward pass, including q_value and distribution.
- ReturnsKeys:
    - logit (torch.Tensor): Discrete Q-value output of each possible action dimension.
    - distribution (torch.Tensor): Q-value discretized distribution, i.e., probability of each uniformly spaced atom Q-value, such as dividing [-10, 10] into 51 uniform spaces.
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.
    - logit (torch.Tensor): \((B, M)\), where M is action_shape.
    - distribution (torch.Tensor): \((B, M, P)\), where P is n_atom.
- Examples:
>>> model = C51DQN(128, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 128)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> # default head_hidden_size: int = 64
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int = 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
Note
For consistency and compatibility, we name all network outputs related to action selection as logit.
Note
For convenience, we recommend that the number of atoms be odd, so that the middle atom sits exactly at the center of the value support.
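As a quick check of this recommendation (an illustrative calculation, not from the original docs), with the default v_min=-10, v_max=10 and an odd n_atom=51, the middle atom falls at the center of the support:
>>> support = torch.linspace(-10, 10, 51)   # 51 uniformly spaced atoms
>>> assert abs(support[25].item()) < 1e-6   # the middle atom sits at the center of [-10, 10]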
QRDQN¶
- class ding.model.QRDQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network structure and computation graph of QRDQN, which combines distributional RL and DQN. You can refer to Distributional Reinforcement Learning with Quantile Regression https://arxiv.org/pdf/1710.10044.pdf for more details.
- Interfaces:
    __init__, forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the QRDQN Model according to input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape.
    - action_shape (Union[int, SequenceType]): Action space shape.
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder.
    - head_hidden_size (Optional[int]): The hidden_size to pass to Head.
    - head_layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): Number of quantiles in the prediction distribution.
    - activation (Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization to use. See ding.torch_utils.fc_block for more details.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use observation tensor to predict QRDQN's output. Parameter updates with QRDQN's MLPs forward setup.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
- Returns:
    - outputs (Dict): Run with encoder and head. Return the resulting prediction dictionary.
- ReturnsKeys:
    - logit (torch.Tensor): Logit tensor with the same size as input x.
    - q (torch.Tensor): Q value tensor of size (B, N, num_quantiles).
    - tau (torch.Tensor): tau tensor of size (B, N, 1).
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is head_hidden_size.
    - logit (torch.FloatTensor): \((B, M)\), where M is action_shape.
    - tau (torch.Tensor): \((B, num_quantiles, 1)\).
- Examples:
>>> model = QRDQN(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([4, 64, 32])
>>> assert outputs['tau'].shape == torch.Size([4, 32, 1])
IQN¶
- class ding.model.IQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network structure and computation graph of IQN, which combines distributional RL and DQN. You can refer to paper Implicit Quantile Networks for Distributional Reinforcement Learning https://arxiv.org/pdf/1806.06923.pdf for more details.
- Interfaces:
    __init__, forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the IQN Model according to input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape.
    - action_shape (Union[int, SequenceType]): Action space shape.
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder.
    - head_hidden_size (Optional[int]): The hidden_size to pass to Head.
    - head_layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): Number of quantiles in the prediction distribution.
    - quantile_embedding_size (int): The embedding size of a quantile.
    - activation (Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization to use. See ding.torch_utils.fc_block for more details.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to predict IQN's output. Parameter updates with IQN's MLPs forward setup.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
- Returns:
    - outputs (Dict): Run with encoder and head. Return the resulting prediction dictionary.
- ReturnsKeys:
    - logit (torch.Tensor): Logit tensor with the same size as input x.
    - q (torch.Tensor): Q value tensor of size (num_quantiles, B, M).
    - quantiles (torch.Tensor): quantiles tensor of size (quantile_embedding_size, 1).
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is head_hidden_size.
    - logit (torch.FloatTensor): \((B, M)\), where M is action_shape.
    - quantiles (torch.Tensor): \((P, 1)\), where P is quantile_embedding_size.
- Examples:
>>> model = IQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([32, 4, 64])
>>> # default quantile_embedding_size: int = 128
>>> assert outputs['quantiles'].shape == torch.Size([128, 1])
FQF¶
- class ding.model.FQF(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network structure and computation graph of FQF, which combines distributional RL and DQN. You can refer to paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning https://arxiv.org/pdf/1911.02140.pdf for more details.
- Interface:
    __init__, forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, num_quantiles: int = 32, quantile_embedding_size: int = 128, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the FQF Model according to input arguments.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape.
    - action_shape (Union[int, SequenceType]): Action space shape.
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder.
    - head_hidden_size (Optional[int]): The hidden_size to pass to Head.
    - head_layer_num (int): The number of layers used in the network to compute Q value output.
    - num_quantiles (int): Number of quantiles in the prediction distribution.
    - quantile_embedding_size (int): The embedding size of a quantile.
    - activation (Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
    - norm_type (Optional[str]): The type of normalization to use. See ding.torch_utils.fc_block for more details.
- forward(x: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to predict FQF’s output. Parameter updates with FQF’s MLPs forward setup.
- Arguments:
    - x (torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
- Returns:
    - outputs (Dict): Dict containing keywords logit (torch.Tensor), q (torch.Tensor), quantiles (torch.Tensor), quantiles_hats (torch.Tensor), q_tau_i (torch.Tensor), and entropies (torch.Tensor).
- Shapes:
x: \((B, N)\), where B is batch size and N is head_hidden_size.
logit: \((B, M)\), where M is action_shape.
q: \((B, num_quantiles, M)\).
quantiles: \((B, num_quantiles + 1)\).
quantiles_hats: \((B, num_quantiles)\).
q_tau_i: \((B, num_quantiles - 1, M)\).
entropies: \((B, 1)\).
- Examples:
>>> model = FQF(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([4, 32, 64])
>>> assert outputs['quantiles'].shape == torch.Size([4, 33])
>>> assert outputs['quantiles_hats'].shape == torch.Size([4, 32])
>>> assert outputs['q_tau_i'].shape == torch.Size([4, 31, 64])
>>> assert outputs['entropies'].shape == torch.Size([4, 1])
BDQ¶
- class ding.model.BDQ(obs_shape: int | SequenceType, num_branches: int = 0, action_bins_per_branch: int = 2, layer_num: int = 3, a_layer_num: int | None = None, v_layer_num: int | None = None, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, norm_type: Module | None = None, activation: Module | None = ReLU())[source]¶
- __init__(obs_shape: int | SequenceType, num_branches: int = 0, action_bins_per_branch: int = 2, layer_num: int = 3, a_layer_num: int | None = None, v_layer_num: int | None = None, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, norm_type: Module | None = None, activation: Module | None = ReLU()) None [source]¶
- Overview:
Init the BDQ (encoder + head) model according to input arguments. Reference paper: Action Branching Architectures for Deep Reinforcement Learning, https://arxiv.org/pdf/1711.08946.
- Arguments:
    - obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
    - num_branches (int): The number of branches, which is equivalent to the action dimension, such as 6 in MuJoCo's HalfCheetah environment.
    - action_bins_per_branch (int): The number of actions in each dimension.
    - layer_num (int): The number of layers used in the network to compute Advantage and Value output.
    - a_layer_num (int): The number of layers used in the network to compute Advantage output.
    - v_layer_num (int): The number of layers used in the network to compute Value output.
    - encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder; the last element must match head_hidden_size.
    - head_hidden_size (Optional[int]): The hidden_size of the head network.
    - norm_type (Optional[str]): The type of normalization in networks. See ding.torch_utils.fc_block for more details.
    - activation (Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
- forward(x: Tensor) Dict [source]¶
- Overview:
BDQ forward computation graph, input observation tensor to predict q_value.
- Arguments:
    - x (torch.Tensor): Observation inputs.
- Returns:
    - outputs (Dict): BDQ forward outputs, such as q_value.
- ReturnsKeys:
    - logit (torch.Tensor): Discrete Q-value output of each action dimension.
- Shapes:
    - x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.
    - logit (torch.FloatTensor): \((B, M)\), where B is batch size and M is num_branches * action_bins_per_branch.
- Examples:
>>> model = BDQ(8, 5, 2) # arguments: 'obs_shape', 'num_branches' and 'action_bins_per_branch'. >>> inputs = torch.randn(4, 8) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 5, 2])
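- Examples (greedy action selection, sketch):
Continuing the example above, a minimal sketch of picking the greedy bin for every action branch from the per-branch Q-values; the variable names are illustrative:
>>> per_branch_action = outputs['logit'].argmax(dim=-1)  # greedy bin index for each branch
>>> assert per_branch_action.shape == torch.Size([4, 5])  # (batch_size, num_branches)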
RainbowDQN¶
- class ding.model.RainbowDQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51)[source]
- Overview:
The neural network structure and computation graph of RainbowDQN, which combines distributional RL and DQN. You can refer to paper Rainbow: Combining Improvements in Deep Reinforcement Learning https://arxiv.org/pdf/1710.02298.pdf for more details.
- Interfaces:
__init__
,forward
Note
RainbowDQN contains dueling architecture by default.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51) None [source]
- Overview:
Init the Rainbow Model according to arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape.action_shape (
Union[int, SequenceType]
): Action space shape.encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
head_hidden_size (
Optional[int]
): Thehidden_size
to pass toHead
.head_layer_num (
int
): The num of layers used in the network to compute Q value outputactivation (
Optional[nn.Module]
): The type of activation function to use inMLP
the afterlayer_fn
, ifNone
then default set tonn.ReLU()
norm_type (
Optional[str]
): The type of normalization to use, seeding.torch_utils.fc_block
for more details. n_atom (
Optional[int]
): Number of atoms in the prediction distribution.
- forward(x: Tensor) Dict [source]
- Overview:
Use the observation tensor to predict Rainbow's output, following Rainbow's MLP forward setup.
- Arguments:
- x (
torch.Tensor
): The input observation tensor data, with shape
(B, N=obs_shape)
.
- Returns:
- outputs (
Dict
): Run
MLP
withRainbowHead
setups and return the result prediction dictionary.
- ReturnsKeys:
logit (
torch.Tensor
): Logit tensor with same size as inputx
.distribution (
torch.Tensor
): Distribution tensor of size(B, N, n_atom)
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N is obs_shape. logit (
torch.FloatTensor
): \((B, M)\), where M is action_shape.distribution(
torch.FloatTensor
): \((B, M, P)\), where P is n_atom.
- Examples:
>>> model = RainbowDQN(64, 64) # arguments: 'obs_shape' and 'action_shape' >>> inputs = torch.randn(4, 64) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) >>> assert outputs['logit'].shape == torch.Size([4, 64]) >>> # default n_atom: int =51 >>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
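- Examples (expected Q-value, sketch):
Continuing the example above, a minimal sketch of collapsing the categorical return distribution into an expected Q-value with the default support (v_min=-10, v_max=10, n_atom=51); the variable names are illustrative:
>>> support = torch.linspace(-10, 10, 51)  # atoms of the value distribution
>>> q = (outputs['distribution'] * support).sum(-1)  # expectation over atoms
>>> assert q.shape == torch.Size([4, 64])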
DRQN¶
- class ding.model.DRQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, lstm_type: str | None = 'normal', activation: Module | None = ReLU(), norm_type: str | None = None, res_link: bool = False)[source]¶
- Overview:
The neural network structure and computation graph of the DRQN (DQN + RNN = DRQN) algorithm, which is the most common DQN variant for sequential data and partially observable environments. The DRQN is composed of three parts:
encoder
,head
andrnn
. Theencoder
is used to extract the feature from various observation, thernn
is used to process the sequential observation and other data, and thehead
is used to compute the Q value of each action dimension.- Interfaces:
__init__
,forward
.
Note
Current
DRQN
supports two types of encoder:FCEncoder
andConvEncoder
, two types of head:DiscreteHead
andDuelingHead
, three types of rnn:normal (LSTM with LayerNorm)
,pytorch
andgru
. You can customize your own encoder, rnn or head by inheriting this class.- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, lstm_type: str | None = 'normal', activation: Module | None = ReLU(), norm_type: str | None = None, res_link: bool = False) None [source]¶
- Overview:
Initialize the DRQN Model according to the corresponding input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
.dueling (
Optional[bool]
): Whether chooseDuelingHead
orDiscreteHead (default)
.head_hidden_size (
Optional[int]
): Thehidden_size
of head network, defaults to None, then it will be set to the last element ofencoder_hidden_size_list
.head_layer_num (
int
): The number of layers used in the head network to compute Q value output.lstm_type (
Optional[str]
): The type of RNN module, now support [‘normal’, ‘pytorch’, ‘gru’].activation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details. you can choose one of [‘BN’, ‘IN’, ‘SyncBN’, ‘LN’]res_link (
bool
): Whether to enable the residual link, which is the skip connection between single-frame data and the sequential data, defaults to False.
- forward(inputs: Dict, inference: bool = False, saved_state_timesteps: list | None = None) Dict [source]¶
- Overview:
DRQN forward computation graph, input observation tensor to predict q_value.
- Arguments:
inputs (
torch.Tensor
): The dict of input data, including observation and previous rnn state. inference (bool): Whether to enable inference forward mode; if True, we unroll only one timestep transition, otherwise we unroll the entire sequence of transitions.
saved_state_timesteps (Optional[list]): When inference is False, we unroll the sequence transitions, and use this list to indicate how to save and return the hidden states.
- ArgumentsKeys:
obs (
torch.Tensor
): The raw observation tensor.prev_state (
list
): The previous rnn state tensor, whose structure depends onlstm_type
.
- Returns:
outputs (
Dict
): The output of DRQN’s forward, including logit (q_value) and next state.
- ReturnsKeys:
logit (
torch.Tensor
): Discrete Q-value output of each possible action dimension.next_state (
list
): The next rnn state tensor, whose structure depends onlstm_type
.
- Shapes:
obs (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
logit (
torch.Tensor
): \((B, M)\), where B is batch size and M isaction_shape
- Examples:
>>> # Init input's Keys:
>>> prev_state = [[torch.randn(1, 1, 64) for __ in range(2)] for _ in range(4)]  # B=4
>>> obs = torch.randn(4, 64)
>>> model = DRQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> outputs = model({'obs': obs, 'prev_state': prev_state}, inference=True)
>>> # Check outputs's Keys
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == (4, 64)
>>> assert len(outputs['next_state']) == 4
>>> assert all([len(t) == 2 for t in outputs['next_state']])
>>> assert all([t[0].shape == (1, 1, 64) for t in outputs['next_state']])
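- Examples (recurrent rollout, sketch):
Continuing the example above, a minimal sketch of rolling the recurrent state forward across successive timesteps in inference mode by feeding next_state back in as prev_state; the variable names are illustrative:
>>> state = prev_state
>>> for _ in range(3):
...     obs_t = torch.randn(4, 64)
...     out = model({'obs': obs_t, 'prev_state': state}, inference=True)
...     state = out['next_state']  # reuse the returned rnn state for the next step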
GTrXLDQN¶
- class ding.model.GTrXLDQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, head_layer_num: int = 1, att_head_dim: int = 16, hidden_size: int = 16, att_head_num: int = 2, att_mlp_num: int = 2, att_layer_num: int = 3, memory_len: int = 64, activation: Module | None = ReLU(), head_norm_type: str | None = None, dropout: float = 0.0, gru_gating: bool = True, gru_bias: float = 2.0, dueling: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 256], encoder_norm_type: str | None = None)[source]¶
- Overview:
The neural network structure and computation graph of Gated Transformer-XL DQN algorithm, which is the enhanced version of DRQN, using Transformer-XL to improve long-term sequential modelling ability. The GTrXL-DQN is composed of three parts:
encoder
,head
andcore
. Theencoder
is used to extract the feature from various observation, thecore
is used to process the sequential observation and other data, and thehead
is used to compute the Q value of each action dimension.- Interfaces:
__init__
,forward
,reset_memory
,get_memory
.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, head_layer_num: int = 1, att_head_dim: int = 16, hidden_size: int = 16, att_head_num: int = 2, att_mlp_num: int = 2, att_layer_num: int = 3, memory_len: int = 64, activation: Module | None = ReLU(), head_norm_type: str | None = None, dropout: float = 0.0, gru_gating: bool = True, gru_bias: float = 2.0, dueling: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 256], encoder_norm_type: str | None = None) None [source]¶
- Overview:
Initialize the GTrXLDQN model according to the corresponding input arguments.
Tip
You can refer to GTrXl class in
ding.torch_utils.network.gtrxl
for more details about the input arguments.- Arguments:
obs_shape (
Union[int, SequenceType]
): Used by Transformer. Observation’s space.action_shape (:obj:Union[int, SequenceType]): Used by Head. Action’s space.
head_layer_num (
int
): Used by Head. Number of layers.att_head_dim (
int
): Used by Transformer.hidden_size (
int
): Used by Transformer and Head.att_head_num (
int
): Used by Transformer.att_mlp_num (
int
): Used by Transformer.att_layer_num (
int
): Used by Transformer.memory_len (
int
): Used by Transformer.activation (
Optional[nn.Module]
): Used by Transformer and Head. ifNone
then default set tonn.ReLU()
.head_norm_type (
Optional[str]
): Used by Head. The type of normalization to use, seeding.torch_utils.fc_block
for more details. dropout (
float
): Used by Transformer.gru_gating (
bool
): Used by Transformer.gru_bias (
float
): Used by Transformer.dueling (
bool
): Used by Head. Make the head dueling.encoder_hidden_size_list(
SequenceType
): Used by Encoder. The collection ofhidden_size
if using a custom convolutional encoder.encoder_norm_type (
Optional[str]
): Used by Encoder. The type of normalization to use, seeding.torch_utils.fc_block
for more details.
- forward(x: Tensor) Dict [source]¶
- Overview:
Let input tensor go through GTrXl and the Head sequentially.
- Arguments:
x (
torch.Tensor
): input tensor of shape (seq_len, bs, obs_shape).
- Returns:
out (
Dict
): runGTrXL
withDiscreteHead
setups and return the result prediction dictionary.
- ReturnKeys:
logit (
torch.Tensor
): discrete Q-value output of each action dimension, shape is (B, action_space).memory (
torch.Tensor
): memory tensor of size(bs x layer_num+1 x memory_len x embedding_dim)
.transformer_out (
torch.Tensor
): output tensor of transformer with same size as inputx
.
- Examples:
>>> # Init input's Keys: >>> obs_dim, seq_len, bs, action_dim = 128, 64, 32, 4 >>> obs = torch.rand(seq_len, bs, obs_dim) >>> model = GTrXLDQN(obs_dim, action_dim) >>> outputs = model(obs) >>> assert isinstance(outputs, dict)
- get_memory() Tensor | None [source]¶
- Overview:
Return the memory of GTrXL.
- Returns:
memory: (
Optional[torch.Tensor]
): output memory or None if memory has not been initialized, whose shape is (layer_num, memory_len, bs, embedding_dim).
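- Examples:
A minimal usage sketch; the memory may be None until it has been initialized (e.g. by a forward pass or reset_memory), and the concrete sizes follow the constructor arguments:
>>> obs_dim, seq_len, bs, action_dim = 128, 64, 32, 4
>>> model = GTrXLDQN(obs_dim, action_dim)
>>> outputs = model(torch.rand(seq_len, bs, obs_dim))
>>> memory = model.get_memory()  # (layer_num, memory_len, bs, embedding_dim) once initialized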
- reset_memory(batch_size: int | None = None, state: Tensor | None = None) None [source]¶
- Overview:
Clear or reset the memory of GTrXL.
- Arguments:
batch_size (
Optional[int]
): The number of samples in a training batch.state (
Optional[torch.Tensor]
): The input memory data, whose shape is (layer_num, memory_len, bs, embedding_dim).
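- Examples:
A minimal usage sketch that only passes batch_size; alternatively, an explicit state tensor of shape (layer_num, memory_len, bs, embedding_dim) can be provided:
>>> model = GTrXLDQN(128, 4)
>>> model.reset_memory(batch_size=32)  # clear the memory for a new batch of 32 sequences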
PG¶
- class ding.model.PG(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network and computation graph of algorithms related to Policy Gradient(PG) (https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf). The PG model is composed of two parts: encoder and head. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding action logit.
- Interface:
__init__
,forward
.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', encoder_hidden_size_list: SequenceType = [128, 128, 64], head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the PG model according to corresponding input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].action_space (
str
): The type of different action spaces, including [‘discrete’, ‘continuous’], then will instantiate corresponding head, includingDiscreteHead
andReparameterizationHead
.encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
.head_hidden_size (
Optional[int]
): Thehidden_size
ofhead
network, defaults to None, it must match the last element ofencoder_hidden_size_list
.head_layer_num (
int
): The num of layers used in thehead
network to compute action.activation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details. you can choose one of [‘BN’, ‘IN’, ‘SyncBN’, ‘LN’]
- Examples:
>>> model = PG((4, 84, 84), 5) >>> inputs = torch.randn(8, 4, 84, 84) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) >>> assert outputs['logit'].shape == (8, 5) >>> assert outputs['dist'].sample().shape == (8, )
- forward(x: Tensor) Dict [source]¶
- Overview:
PG forward computation graph, input observation tensor to predict policy distribution.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict containing the action logit and the policy distribution dist. If the action space is discrete, dist is a Categorical distribution; if the action space is continuous, dist is a Normal distribution.
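- Examples:
Mirroring the constructor example above, the forward call returns both the logit and the corresponding distribution:
>>> model = PG((4, 84, 84), 5)
>>> inputs = torch.randn(8, 4, 84, 84)
>>> outputs = model(inputs)
>>> assert outputs['logit'].shape == (8, 5)
>>> action = outputs['dist'].sample()  # shape (8, )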
VAC¶
- class ding.model.VAC(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, sigma_type: str | None = 'independent', fixed_sigma_value: int | None = 0.3, bound_type: str | None = None, encoder: Module | None = None, impala_cnn_encoder: bool = False)[source]¶
- Overview:
The neural network and computation graph of algorithms related to (state) Value Actor-Critic (VAC), such as A2C/PPO/IMPALA. This model now supports discrete, continuous and hybrid action space. The VAC is composed of four parts:
actor_encoder
,critic_encoder
,actor_head
andcritic_head
. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding value or action logit. In high-dimensional observation space like 2D image, we often use a shared encoder for bothactor_encoder
andcritic_encoder
. In low-dimensional observation space like 1D vector, we often use different encoders.- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
,compute_actor_critic
.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, sigma_type: str | None = 'independent', fixed_sigma_value: int | None = 0.3, bound_type: str | None = None, encoder: Module | None = None, impala_cnn_encoder: bool = False) None [source]¶
- Overview:
Initialize the VAC model according to corresponding input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].action_space (
str
): The type of different action spaces, including [‘discrete’, ‘continuous’, ‘hybrid’], then will instantiate corresponding head, includingDiscreteHead
,ReparameterizationHead
, and hybrid heads.share_encoder (
bool
): Whether to share the observation encoder between actor and critic. encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element is used as the input size ofactor_head
andcritic_head
.actor_head_hidden_size (
Optional[int]
): Thehidden_size
ofactor_head
network, defaults to 64, it is the hidden size of the last layer of theactor_head
network.actor_head_layer_num (
int
): The num of layers used in theactor_head
network to compute action.critic_head_hidden_size (
Optional[int]
): Thehidden_size
ofcritic_head
network, defaults to 64, it is the hidden size of the last layer of thecritic_head
network.critic_head_layer_num (
int
): The num of layers used in thecritic_head
network.activation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details. you can choose one of [‘BN’, ‘IN’, ‘SyncBN’, ‘LN’]sigma_type (
Optional[str]
): The type of sigma in continuous action space, seeding.torch_utils.network.dreamer.ReparameterizationHead
for more details, in A2C/PPO, it defaults toindependent
, which means state-independent sigma parameters.fixed_sigma_value (
Optional[int]
): Ifsigma_type
isfixed
, then use this value as sigma.bound_type (
Optional[str]
): The type of action bound methods in continuous action space, defaults toNone
, which means no bound.encoder (
Optional[torch.nn.Module]
): The encoder module, defaults toNone
, you can define your own encoder module and pass it into VAC to deal with different observation space.impala_cnn_encoder (
bool
): Whether to use IMPALA CNN encoder, defaults toFalse
.
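- Examples (continuous action space, sketch):
A hedged instantiation sketch for the continuous action space; the returned logit is assumed to follow the mu/sigma layout of ReparameterizationHead described above, and all concrete sizes are illustrative:
>>> model = VAC(obs_shape=8, action_shape=3, action_space='continuous')
>>> inputs = torch.randn(4, 8)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> mu, sigma = actor_outputs['logit']['mu'], actor_outputs['logit']['sigma']  # Gaussian parameters
>>> assert mu.shape == sigma.shape == torch.Size([4, 3])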
- compute_actor(x: Tensor) Dict [source]¶
- Overview:
VAC forward computation graph for actor part, input observation tensor to predict action logit.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of VAC’s forward computation graph for actor, includinglogit
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict withaction_type
andaction_args
.
- Shapes:
logit (
torch.Tensor
): \((B, N)\), where B is batch size and N isaction_shape
- Examples:
>>> model = VAC(64, 64) >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
- compute_actor_critic(x: Tensor) Dict [source]¶
- Overview:
VAC forward computation graph for both actor and critic part, input observation tensor to predict action logit and state value.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of VAC’s forward computation graph for both actor and critic, includinglogit
andvalue
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict withaction_type
andaction_args
.value (
torch.Tensor
): The predicted state value tensor.
- Shapes:
logit (
torch.Tensor
): \((B, N)\), where B is batch size and N isaction_shape
value (
torch.Tensor
): \((B, )\), where B is batch size, (B, 1) is squeezed to (B, ).
- Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([4])
>>> assert outputs['logit'].shape == torch.Size([4, 64])
Note
compute_actor_critic
interface aims to save computation when the encoder is shared, and returns the combined output dict.
- compute_critic(x: Tensor) Dict [source]¶
- Overview:
VAC forward computation graph for critic part, input observation tensor to predict state value.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of VAC’s forward computation graph for critic, includingvalue
.
- ReturnsKeys:
value (
torch.Tensor
): The predicted state value tensor.
- Shapes:
value (
torch.Tensor
): \((B, )\), where B is batch size, (B, 1) is squeezed to (B, ).
- Examples:
>>> model = VAC(64, 64) >>> inputs = torch.randn(4, 64) >>> critic_outputs = model(inputs,'compute_critic') >>> assert critic_outputs['value'].shape == torch.Size([4])
- forward(x: Tensor, mode: str) Dict [source]¶
- Overview:
VAC forward computation graph, input observation tensor to predict state value or action logit. Different
mode
will forward with different network modules to get different outputs and save computation.- Arguments:
x (
torch.Tensor
): The input observation tensor data.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
outputs (
Dict
): The output dict of VAC’s forward computation graph, whose key-values vary from differentmode
.
- Examples (Actor):
>>> model = VAC(64, 128) >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 128])
- Examples (Critic):
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs, 'compute_critic')
>>> assert critic_outputs['value'].shape == torch.Size([4])
- Examples (Actor-Critic):
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([4])
>>> assert outputs['logit'].shape == torch.Size([4, 64])
DREAMERVAC¶
- class ding.model.DREAMERVAC(action_shape: int | SequenceType | EasyDict, dyn_stoch=32, dyn_deter=512, dyn_discrete=32, actor_layers=2, value_layers=2, units=512, act='SiLU', norm='LayerNorm', actor_dist='normal', actor_init_std=1.0, actor_min_std=0.1, actor_max_std=1.0, actor_temp=0.1, action_unimix_ratio=0.01)[source]¶
- Overview:
The neural network and computation graph of DreamerV3 (state) Value Actor-Critic (VAC). This model now supports discrete and continuous action spaces.
- Interfaces:
__init__
,forward
.
- __init__(action_shape: int | SequenceType | EasyDict, dyn_stoch=32, dyn_deter=512, dyn_discrete=32, actor_layers=2, value_layers=2, units=512, act='SiLU', norm='LayerNorm', actor_dist='normal', actor_init_std=1.0, actor_min_std=0.1, actor_max_std=1.0, actor_temp=0.1, action_unimix_ratio=0.01) None [source]¶
- Overview:
Initialize the
DREAMERVAC
model according to arguments.- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].
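- Examples:
A minimal instantiation sketch with the default DreamerV3 hyper-parameters; the action shape is illustrative:
>>> model = DREAMERVAC(action_shape=6)
>>> # dyn_stoch, dyn_deter and dyn_discrete describe the DreamerV3 latent layout
>>> # that the actor and value heads are built on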
MAVAC¶
- class ding.model.MAVAC(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType, agent_num: int, actor_head_hidden_size: int = 256, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 512, critic_head_layer_num: int = 1, action_space: str = 'discrete', activation: Module | None = ReLU(), norm_type: str | None = None, sigma_type: str | None = 'independent', bound_type: str | None = None, encoder: Tuple[Module, Module] | None = None)[source]¶
- Overview:
The neural network and computation graph of algorithms related to (state) Value Actor-Critic (VAC) for multi-agent, such as MAPPO(https://arxiv.org/abs/2103.01955). This model now supports discrete and continuous action space. The MAVAC is composed of four parts:
actor_encoder
,critic_encoder
,actor_head
andcritic_head
. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding value or action logit.- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
,compute_actor_critic
.
- __init__(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType, agent_num: int, actor_head_hidden_size: int = 256, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 512, critic_head_layer_num: int = 1, action_space: str = 'discrete', activation: Module | None = ReLU(), norm_type: str | None = None, sigma_type: str | None = 'independent', bound_type: str | None = None, encoder: Tuple[Module, Module] | None = None) None [source]¶
- Overview:
Init the MAVAC Model according to arguments.
- Arguments:
agent_obs_shape (
Union[int, SequenceType]
): Observation’s space for single agent, such as 8 or [4, 84, 84].global_obs_shape (
Union[int, SequenceType]
): Global observation’s space, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape for single agent, such as 6 or [2, 3, 3].agent_num (
int
): The number of agents. This parameter is temporarily reserved and may be required by subsequent changes to the model. actor_head_hidden_size (
Optional[int]
): Thehidden_size
ofactor_head
network, defaults to 256, it must match the last element ofagent_obs_shape
.actor_head_layer_num (
int
): The num of layers used in theactor_head
network to compute action.critic_head_hidden_size (
Optional[int]
): Thehidden_size
ofcritic_head
network, defaults to 512, it must match the last element ofglobal_obs_shape
.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.action_space (
Union[int, SequenceType]
): The type of different action spaces, including [‘discrete’, ‘continuous’], then will instantiate corresponding head, includingDiscreteHead
andReparameterizationHead
.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
the afterlayer_fn
, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details. you can choose one of [‘BN’, ‘IN’, ‘SyncBN’, ‘LN’].sigma_type (
Optional[str]
): The type of sigma in continuous action space, seeding.torch_utils.network.dreamer.ReparameterizationHead
for more details, in MAPPO, it defaults toindependent
, which means state-independent sigma parameters.bound_type (
Optional[str]
): The type of action bound methods in continuous action space, defaults toNone
, which means no bound.encoder (
Optional[Tuple[torch.nn.Module, torch.nn.Module]]
): The encoder module list, defaults toNone
, you can define your own actor and critic encoder module and pass it into MAVAC to deal with different observation space.
- compute_actor(x: Dict) Dict [source]¶
- Overview:
MAVAC forward computation graph for actor part, predicting action logit with agent observation tensor in
x
.- Arguments:
- x (
Dict
): Input data dict with keys [‘agent_state’, ‘action_mask’(optional)]. agent_state: (
torch.Tensor
): Each agent local state(obs).action_mask(optional): (
torch.Tensor
): Whenaction_space
is discrete, action_mask needs to be provided to mask illegal actions.
- Returns:
outputs (
Dict
): The output dict of the forward computation graph for actor, includinglogit
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions.
- Shapes:
logit (
torch.FloatTensor
): \((B, M, N)\), where B is batch size and N isaction_shape
and M isagent_num
.
- Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14) >>> inputs = { 'agent_state': torch.randn(10, 8, 64), 'global_state': torch.randn(10, 8, 128), 'action_mask': torch.randint(0, 2, size=(10, 8, 14)) } >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([10, 8, 14])
- compute_actor_critic(x: Dict) Dict [source]¶
- Overview:
MAVAC forward computation graph for both actor and critic part, input observation to predict action logit and state value.
- Arguments:
x (
Dict
): The input dict containsagent_state
,global_state
and other related info.
- Returns:
outputs (
Dict
): The output dict of MAVAC’s forward computation graph for both actor and critic, includinglogit
andvalue
.
- ReturnsKeys:
logit (
torch.Tensor
): Logit encoding tensor, with same size as inputx
.value (
torch.Tensor
): Q value tensor with same size as batch size.
- Shapes:
logit (
torch.FloatTensor
): \((B, M, N)\), where B is batch size and N isaction_shape
and M isagent_num
.value (
torch.FloatTensor
): \((B, M)\), where B is batch size and M is
.
- Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([10, 8])
>>> assert outputs['logit'].shape == torch.Size([10, 8, 14])
- compute_critic(x: Dict) Dict [source]¶
- Overview:
MAVAC forward computation graph for critic part. Predict state value with global observation tensor in
x
.- Arguments:
- x (
Dict
): Input data dict with keys [‘global_state’]. global_state: (
torch.Tensor
): Global state(obs).
- Returns:
outputs (
Dict
): The output dict of MAVAC’s forward computation graph for critic, includingvalue
.
- ReturnsKeys:
value (
torch.Tensor
): The predicted state value tensor.
- Shapes:
value (
torch.FloatTensor
): \((B, M)\), where B is batch size and M isagent_num
.
- Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14) >>> inputs = { 'agent_state': torch.randn(10, 8, 64), 'global_state': torch.randn(10, 8, 128), 'action_mask': torch.randint(0, 2, size=(10, 8, 14)) } >>> critic_outputs = model(inputs,'compute_critic') >>> assert critic_outputs['value'].shape == torch.Size([10, 8])
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
MAVAC forward computation graph, input observation tensor to predict state value or action logit.
mode
includescompute_actor
,compute_critic
,compute_actor_critic
. Differentmode
will forward with different network modules to get different outputs and save computation.- Arguments:
inputs (
Dict
): The input dict including observation and related info, whose key-values vary from differentmode
.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
outputs (
Dict
): The output dict of MAVAC’s forward computation graph, whose key-values vary from differentmode
.
- Examples (Actor):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14) >>> inputs = { 'agent_state': torch.randn(10, 8, 64), 'global_state': torch.randn(10, 8, 128), 'action_mask': torch.randint(0, 2, size=(10, 8, 14)) } >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([10, 8, 14])
- Examples (Critic):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> critic_outputs = model(inputs, 'compute_critic')
>>> assert critic_outputs['value'].shape == torch.Size([10, 8])
- Examples (Actor-Critic):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([10, 8])
>>> assert outputs['logit'].shape == torch.Size([10, 8, 14])
ContinuousQAC¶
- class ding.model.ContinuousQAC(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, encoder_hidden_size_list: SequenceType | None = None, share_encoder: bool | None = False)[source]¶
- Overview:
The neural network and computation graph of algorithms related to Q-value Actor-Critic (QAC), such as DDPG/TD3/SAC. This model now supports continuous and hybrid action space. The ContinuousQAC is composed of four parts:
actor_encoder
,critic_encoder
,actor_head
andcritic_head
. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding Q-value or action logit. In high-dimensional observation space like 2D image, we often use a shared encoder for bothactor_encoder
andcritic_encoder
. In low-dimensional observation space like 1D vector, we often use different encoders.- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, encoder_hidden_size_list: SequenceType | None = None, share_encoder: bool | None = False) None [source]¶
- Overview:
Initialize the ContinuousQAC Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType, EasyDict]
): Action’s shape, such as 4, (3, ), EasyDict({‘action_type_shape’: 3, ‘action_args_shape’: 4}).action_space (
str
): The type of action space, including [regression
,reparameterization
,hybrid
],regression
is used for DDPG/TD3,reparameterization
is used for SAC andhybrid
for PADDPG.twin_critic (
bool
): Whether to use twin critic, one of tricks in TD3.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the actor network to compute action.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic head.critic_head_layer_num (
int
): The num of layers used in the critic network to compute Q-value.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
, this argument is only used in image observation.share_encoder (
Optional[bool]
): Whether to share encoder between actor and critic.
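- Examples (hybrid action space, sketch):
A hedged sketch of the hybrid configuration; the EasyDict layout follows the action_shape description above, the output keys follow the hybrid ReturnsKeys of compute_actor, and all concrete sizes are illustrative:
>>> from easydict import EasyDict
>>> action_shape = EasyDict({'action_type_shape': 3, 'action_args_shape': 4})
>>> model = ContinuousQAC(64, action_shape, 'hybrid')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs, 'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 3])        # discrete action type logit
>>> assert actor_outputs['action_args'].shape == torch.Size([4, 4])  # continuous action arguments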
- compute_actor(obs: Tensor) Dict[str, Tensor | Dict[str, Tensor]] [source]¶
- Overview:
QAC forward computation graph for actor part, input observation tensor to predict action or action logit.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]
): Actor output dict varying from action_space:regression
,reparameterization
,hybrid
.
- ReturnsKeys (regression):
action (
torch.Tensor
): Continuous action with same size asaction_shape
, usually in DDPG/TD3.
- ReturnsKeys (reparameterization):
logit (
Dict[str, torch.Tensor]
): The predicted reparameterization action logit, usually in SAC. It is a list containing two tensors: mu
andsigma
. The former is the mean of the gaussian distribution, the latter is the standard deviation of the gaussian distribution.
- ReturnsKeys (hybrid):
logit (
torch.Tensor
): The predicted discrete action type logit, it will be the same dimension asaction_type_shape
, i.e., all the possible discrete action types.action_args (
torch.Tensor
): Continuous action arguments with same size asaction_args_shape
.
- Shapes:
obs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds toobs_shape
.action (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toaction_shape
.logit.mu (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toaction_shape
.logit.sigma (
torch.Tensor
): \((B, N1)\), B is batch size.logit (
torch.Tensor
): \((B, N2)\), B is batch size and N2 corresponds toaction_shape.action_type_shape
.action_args (
torch.Tensor
): \((B, N3)\), B is batch size and N3 corresponds toaction_shape.action_args_shape
.
- Examples:
>>> # Regression mode >>> model = ContinuousQAC(64, 6, 'regression') >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['action'].shape == torch.Size([4, 6]) >>> # Reparameterization Mode >>> model = ContinuousQAC(64, 6, 'reparameterization') >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'][0].shape == torch.Size([4, 6]) # mu >>> actor_outputs['logit'][1].shape == torch.Size([4, 6]) # sigma
- compute_critic(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph for critic part, input observation and action tensor to predict Q-value.
- Arguments:
inputs (
Dict[str, torch.Tensor]
): The dict of input data, includingobs
andaction
tensor, also containslogit
andaction_args
tensor in hybrid action_space.
- ArgumentsKeys:
obs: (
torch.Tensor
): Observation tensor data, now supports a batch of 1-dim vector data.action (
Union[torch.Tensor, Dict]
): Continuous action with same size asaction_shape
.logit (
torch.Tensor
): Discrete action logit, only in hybrid action_space.action_args (
torch.Tensor
): Continuous action arguments, only in hybrid action_space.
- Returns:
outputs (
Dict[str, torch.Tensor]
): The output dict of QAC’s forward computation graph for critic, includingq_value
.
- ReturnKeys:
q_value (
torch.Tensor
): Q value tensor with same size as batch size.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
.logit (
torch.Tensor
): \((B, N2)\), B is batch size and N2 corresponds toaction_shape.action_type_shape
.action_args (
torch.Tensor
): \((B, N3)\), B is batch size and N3 corresponds toaction_shape.action_args_shape
.action (
torch.Tensor
): \((B, N4)\), where B is batch size and N4 isaction_shape
.q_value (
torch.Tensor
): \((B, )\), where B is batch size.
- Examples:
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)} >>> model = ContinuousQAC(obs_shape=(8, ),action_shape=1, action_space='regression') >>> assert model(inputs, mode='compute_critic')['q_value'].shape == (4, ) # q value
- forward(inputs: Tensor | Dict[str, Tensor], mode: str) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph, input observation tensor to predict Q-value or action logit. Different
mode
will forward with different network modules to get different outputs and save computation.- Arguments:
inputs (
Union[torch.Tensor, Dict[str, torch.Tensor]]
): The input data for forward computation graph, forcompute_actor
, it is the observation tensor, forcompute_critic
, it is the dict data including obs and action tensor.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
output (
Dict[str, torch.Tensor]
): The output dict of QAC forward computation graph, whose key-values vary in different forward modes.
- Examples (Actor):
>>> # Regression mode >>> model = ContinuousQAC(64, 6, 'regression') >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['action'].shape == torch.Size([4, 6]) >>> # Reparameterization Mode >>> model = ContinuousQAC(64, 6, 'reparameterization') >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'][0].shape == torch.Size([4, 6]) # mu >>> actor_outputs['logit'][1].shape == torch.Size([4, 6]) # sigma
- Examples (Critic):
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)} >>> model = ContinuousQAC(obs_shape=(8, ),action_shape=1, action_space='regression') >>> assert model(inputs, mode='compute_critic')['q_value'].shape == (4, ) # q value
DiscreteQAC¶
- class ding.model.DiscreteQAC(obs_shape: int | SequenceType, action_shape: int | SequenceType, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, encoder_hidden_size_list: SequenceType | None = None, share_encoder: bool | None = False)[source]¶
- Overview:
The neural network and computation graph of algorithms related to discrete action Q-value Actor-Critic (QAC), such as DiscreteSAC. This model now supports only discrete action space. The DiscreteQAC is composed of four parts:
actor_encoder
,critic_encoder
,actor_head
andcritic_head
. Encoders are used to extract the feature from various observation. Heads are used to predict corresponding Q-value or action logit. In high-dimensional observation space like 2D image, we often use a shared encoder for bothactor_encoder
andcritic_encoder
. In low-dimensional observation space like 1D vector, we often use different encoders.- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, encoder_hidden_size_list: SequenceType | None = None, share_encoder: bool | None = False) None [source]¶
- Overview:
Initialize the DiscreteQAC Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType, EasyDict]
): Action’s shape, such as 4, (3, ).twin_critic (
bool
): Whether to use twin critic.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the actor network to compute action.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic head.critic_head_layer_num (
int
): The num of layers used in the critic network to compute Q-value.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
, this argument is only used in image observation.share_encoder (
Optional[bool]
): Whether to share encoder between actor and critic.
- compute_actor(inputs: Tensor) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph for actor part, input observation tensor to predict action or action logit.
- Arguments:
inputs (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict[str, torch.Tensor]
): The output dict of QAC forward computation graph for actor, including discrete actionlogit
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted discrete action type logit, it will be the same dimension asaction_shape
, i.e., all the possible discrete action choices.
- Shapes:
inputs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds toobs_shape
.logit (
torch.Tensor
): \((B, N2)\), B is batch size and N2 corresponds toaction_shape
.
- Examples:
>>> model = DiscreteQAC(64, 6) >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 6])
- compute_critic(inputs: Tensor) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph for critic part, input observation to predict Q-value for each possible discrete action choices.
- Arguments:
inputs (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict[str, torch.Tensor]
): The output dict of QAC forward computation graph for critic, includingq_value
for each possible discrete action choices.
- ReturnKeys:
q_value (
torch.Tensor
): The predicted Q-value for each possible discrete action choices, it will be the same dimension asaction_shape
and used to calculate the loss.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
.q_value (
torch.Tensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
.
- Examples:
>>> model = DiscreteQAC(64, 6, twin_critic=False)
>>> obs = torch.randn(4, 64)
>>> critic_outputs = model(obs, 'compute_critic')
>>> assert critic_outputs['q_value'].shape == torch.Size([4, 6])
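- Examples (greedy action selection, sketch):
Continuing the critic example above, a minimal sketch of greedy action selection from the per-action Q-values; the variable names are illustrative:
>>> greedy_action = critic_outputs['q_value'].argmax(dim=-1)  # pick the best action per sample
>>> assert greedy_action.shape == torch.Size([4])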
- forward(inputs: Tensor, mode: str) Dict[str, Tensor] [source]¶
- Overview:
QAC forward computation graph, input observation tensor to predict Q-value or action logit. Different
mode
will forward with different network modules to get different outputs and save computation.- Arguments:
inputs (
torch.Tensor
): The input observation tensor data.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
output (
Dict[str, torch.Tensor]
): The output dict of QAC forward computation graph, whose key-values vary in different forward modes.
- Examples (Actor):
>>> model = DiscreteQAC(64, 6) >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 6])
- Examples(Critic):
>>> model = DiscreteQAC(64, 6, twin_critic=False)
>>> obs = torch.randn(4, 64)
>>> critic_outputs = model(obs, 'compute_critic')
>>> assert critic_outputs['q_value'].shape == torch.Size([4, 6])
ContinuousMAQAC¶
- class ding.model.ContinuousMAQAC(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network and computation graph of algorithms related to the continuous-action Multi-Agent Q-value Actor-CritiC (MAQAC) model. The model is composed of an actor and a critic, both of which are MLP networks. The actor network is used to predict the action probability distribution, and the critic network is used to predict the Q value of the state-action pair.
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the ContinuousMAQAC model according to the input arguments.
- Arguments:
agent_obs_shape (
Union[int, SequenceType]
): Agent's observation space. global_obs_shape (
Union[int, SequenceType]
): Global observation space. action_shape (
Union[int, SequenceType, EasyDict]
): Action’s space, such as 4, (3, )action_space (
str
): Whether chooseregression
orreparameterization
.twin_critic (
bool
): Whether include twin critic.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor-nn’sHead
.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor’s nn.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic-nn’sHead
.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
the afterlayer_fn
, ifNone
then default set tonn.ReLU()
norm_type (
Optional[str]
): The type of normalization to use, seeding.torch_utils.fc_block
for more details.
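- Examples (regression action space, sketch):
A hedged sketch of the regression action space (as used in DDPG-style algorithms), mirroring the shapes documented in compute_actor below; all concrete sizes are illustrative:
>>> model = ContinuousMAQAC(216, 264, 14, 'regression', twin_critic=False)
>>> data = {'agent_state': torch.randn(32, 8, 216)}  # (B, agent_num, agent_obs_shape)
>>> action = model.compute_actor(data)['action']
>>> assert action.shape == torch.Size([32, 8, 14])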
- compute_actor(inputs: Dict) Dict [source]¶
- Overview:
Use observation tensor to predict action logits.
- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.
- Returns:
outputs (
Dict
): Outputs of network forward.
- ReturnKeys (
action_space == 'regression'
): action (
torch.Tensor
): Action tensor with same size asaction_shape
.
- ReturnKeys (
action_space == 'reparameterization'
): logit (
list
): 2 elements, each is the shape of \((B, A, N3)\), where B is batch size and A is agent num. N3 corresponds toaction_shape
.
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> act_space = 'reparameterization' # 'regression' >>> data = { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> } >>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False) >>> if action_space == 'regression': >>> action = model.compute_actor(data)['action'] >>> elif action_space == 'reparameterization': >>> (mu, sigma) = model.compute_actor(data)['logit']
- compute_critic(inputs: Dict) Dict [source]¶
- Overview:
Use observation tensor and action tensor to predict Q value.
- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
action
(torch.Tensor
): The action tensor data, with shape \((B, A, N3)\), where B is batch size and A is agent num. N3 corresponds toaction_shape
.
- Returns:
outputs (
Dict
): Outputs of network forward.
- ReturnKeys (
twin_critic=True
): q_value (
list
): 2 elements, each is the shape of \((B, A)\), where B is batch size and A is agent num.
- ReturnKeys (
twin_critic=False
): q_value (
torch.Tensor
): \((B, A)\), where B is batch size and A is agent num.
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> act_space = 'reparameterization' # 'regression' >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> }, >>> 'action': torch.randn(B, agent_num, squeeze(action_shape)) >>> } >>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False) >>> value = model.compute_critic(data)['q_value']
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Use observation and action tensor to predict output in
compute_actor
orcompute_critic
mode.- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
action
(torch.Tensor
): The action tensor data, with shape \((B, A, N3)\), where B is batch size and A is agent num. N3 corresponds toaction_shape
.
mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Outputs of network forward, whose key-values will be different for differentmode
,twin_critic
,action_space
.
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> act_space = 'reparameterization' # regression >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> }, >>> 'action': torch.randn(B, agent_num, squeeze(action_shape)) >>> } >>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False) >>> if action_space == 'regression': >>> action = model(data['obs'], mode='compute_actor')['action'] >>> elif action_space == 'reparameterization': >>> (mu, sigma) = model(data['obs'], mode='compute_actor')['logit'] >>> value = model(data, mode='compute_critic')['q_value']
DiscreteMAQAC¶
- class ding.model.DiscreteMAQAC(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The neural network and computation graph of algorithms related to the discrete-action Multi-Agent Q-value Actor-CritiC (MAQAC) model. The model is composed of an actor and a critic, both of which are MLP networks. The actor network is used to predict the action probability distribution, and the critic network is used to predict the Q value of the state-action pair.
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(agent_obs_shape: int | SequenceType, global_obs_shape: int | SequenceType, action_shape: int | SequenceType, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the DiscreteMAQAC Model according to arguments.
- Arguments:
agent_obs_shape (
Union[int, SequenceType]
): Agent’s observation’s space.global_obs_shape (
Union[int, SequenceType]
): Global observation's space. action_shape (
Union[int, SequenceType]
): Action’s space.twin_critic (
bool
): Whether include twin critic.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor-nn’sHead
.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor’s nn.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic-nn’sHead
.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after layer_fn
, ifNone
then default set tonn.ReLU()
norm_type (
Optional[str]
): The type of normalization to use, seeding.torch_utils.fc_block
for more details.
- compute_actor(inputs: Dict) Dict [source]¶
- Overview:
Use observation tensor to predict action logits.
- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
- inputs (
- Returns:
- output (
Dict[str, torch.Tensor]
): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different forward modes. logit (
torch.Tensor
): Action’s output logit (real value range), whose shape is \((B, A, N2)\), where N2 corresponds toaction_shape
.action_mask (
torch.Tensor
): Action mask tensor with same size asaction_shape
.
- output (
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> } >>> } >>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True) >>> logit = model.compute_actor(data)['logit']
- compute_critic(inputs: Dict) Dict [source]¶
- Overview:
Use observation tensor to predict Q value.
- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
- inputs (
- Returns:
- output (
Dict[str, torch.Tensor]
): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different values oftwin_critic
. q_value (
list
): Iftwin_critic=True
, q_value should be 2 elements, each is the shape of \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
. Otherwise, q_value should betorch.Tensor
.
- output (
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> } >>> } >>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True) >>> value = model.compute_critic(data)['q_value']
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Use observation tensor to predict output, with
compute_actor
orcompute_critic
mode.- Arguments:
- inputs (
Dict[str, torch.Tensor]
): The input dict tensor data, has keys: obs
(Dict[str, torch.Tensor]
): The input dict tensor data, has keys:agent_state
(torch.Tensor
): The agent’s observation tensor data, with shape \((B, A, N0)\), where B is batch size and A is agent num. N0 corresponds toagent_obs_shape
.global_state
(torch.Tensor
): The global observation tensor data, with shape \((B, A, N1)\), where B is batch size and A is agent num. N1 corresponds toglobal_obs_shape
.action_mask
(torch.Tensor
): The action mask tensor data, with shape \((B, A, N2)\), where B is batch size and A is agent num. N2 corresponds toaction_shape
.
- inputs (
mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
output (
Dict[str, torch.Tensor]
): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different forward modes.
- Examples:
>>> B = 32 >>> agent_obs_shape = 216 >>> global_obs_shape = 264 >>> agent_num = 8 >>> action_shape = 14 >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape), >>> 'global_state': torch.randn(B, agent_num, global_obs_shape), >>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape)) >>> } >>> } >>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True) >>> logit = model(data, mode='compute_actor')['logit'] >>> value = model(data, mode='compute_critic')['q_value']
QACDIST¶
- class ding.model.QACDIST(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'regression', critic_head_type: str = 'categorical', actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51)[source]¶
- Overview:
The QAC model with distributional Q-value.
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'regression', critic_head_type: str = 'categorical', actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, v_min: float | None = -10, v_max: float | None = 10, n_atom: int | None = 51) None [source]¶
- Overview:
Init the QAC Distributional Model according to arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s space.action_shape (
Union[int, SequenceType]
): Action’s space.action_space (
str
): Whether chooseregression
orreparameterization
.critic_head_type (
str
): Onlycategorical
.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor-nn’sHead
.- actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor’s nn.
- actor_head_layer_num (
critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic-nn’sHead
.- critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.
- critic_head_layer_num (
- activation (
Optional[nn.Module]
): The type of activation function to use in
MLP
after layer_fn
, ifNone
then default set tonn.ReLU()
- activation (
- norm_type (
Optional[str]
): The type of normalization to use, see
ding.torch_utils.fc_block
for more details.
- norm_type (
v_min (
int
): Value of the smallest atomv_max (
int
): Value of the largest atomn_atom (
int
): Number of atoms in the support
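- Example (illustrative sketch):
A minimal sketch (not part of the DI-engine API; the names atoms, dist and q_value are assumptions for illustration) of how a categorical value distribution defined by v_min, v_max and n_atom is reduced to an expected Q value, which is what the categorical critic head represents.
>>> import torch
>>> v_min, v_max, n_atom = -10., 10., 51
>>> atoms = torch.linspace(v_min, v_max, n_atom)              # fixed support of the distribution
>>> dist = torch.softmax(torch.randn(4, 1, n_atom), dim=-1)   # (B, 1, n_atom) probabilities
>>> q_value = (dist * atoms).sum(-1)                          # expected Q value, shape (B, 1)
>>> assert q_value.shape == torch.Size([4, 1])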
- compute_actor(inputs: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to predict the actor output in
'compute_actor'
mode.- Arguments:
- inputs (
torch.Tensor
): The encoded embedding tensor, determined with given
hidden_size
, i.e.(B, N=hidden_size)
.hidden_size = actor_head_hidden_size
- inputs (
mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Outputs of forward pass encoder and head.
- ReturnsKeys (either):
action (
torch.Tensor
): Continuous action tensor with same size asaction_shape
.- logit (
torch.Tensor
): Logit tensor encoding
mu
andsigma
, both with same size as inputx
.
- logit (
- Shapes:
inputs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds tohidden_size
action (
torch.Tensor
): \((B, N0)\)logit (
list
): 2 elements, mu and sigma, each is the shape of \((B, N0)\).q_value (
torch.FloatTensor
): \((B, )\), B is batch size.
- Examples:
>>> # Regression mode >>> model = QACDIST(64, 64, 'regression') >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['action'].shape == torch.Size([4, 64]) >>> # Reparameterization Mode >>> model = QACDIST(64, 64, 'reparameterization') >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> actor_outputs['logit'][0].shape # mu >>> torch.Size([4, 64]) >>> actor_outputs['logit'][1].shape # sigma >>> torch.Size([4, 64])
- compute_critic(inputs: Dict) Dict [source]¶
- Overview:
Use encoded obs and action tensors to predict Q-value output in
'compute_critic'
mode.- Arguments:
obs
,action
encoded tensors.mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Q-value output and distribution.
- ReturnKeys:
q_value (
torch.Tensor
): Q value tensor with same size as batch size.distribution (
torch.Tensor
): Q value distribution tensor.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
action (
torch.Tensor
): \((B, N2)\), where B is batch size and N2 is``action_shape``q_value (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
distribution (
torch.FloatTensor
): \((B, 1, N3)\), where B is batch size and N3 isnum_atom
- Examples:
>>> # Categorical mode >>> N = 32 >>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)} >>> model = QACDIST(obs_shape=(N, ), action_shape=1, action_space='regression', ... critic_head_type='categorical', n_atom=51) >>> q_value = model(inputs, mode='compute_critic') # q value >>> assert q_value['q_value'].shape == torch.Size([4, 1]) >>> assert q_value['distribution'].shape == torch.Size([4, 1, 51])
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Use observation and action tensors to predict output according to the given forward mode of QACDIST.
- Arguments:
- Forward with
'compute_actor'
: - inputs (
torch.Tensor
): The encoded embedding tensor, determined with given
hidden_size
, i.e.(B, N=hidden_size)
. Whetheractor_head_hidden_size
orcritic_head_hidden_size
depend onmode
.
- inputs (
- Forward with
'compute_critic'
, inputs (Dict) Necessary Keys: obs
,action
encoded tensors.
mode (
str
): Name of the forward mode.
- Forward with
- Returns:
outputs (
Dict
): Outputs of network forward.- Forward with
'compute_actor'
, Necessary Keys (either): action (
torch.Tensor
): Action tensor with same size as inputx
.- logit (
torch.Tensor
): Logit tensor encoding
mu
andsigma
, both with same size as inputx
.
- logit (
- Forward with
'compute_critic'
, Necessary Keys: q_value (
torch.Tensor
): Q value tensor with same size as batch size.distribution (
torch.Tensor
): Q value distribution tensor.
- Forward with
- Actor Shapes:
inputs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds tohidden_size
action (
torch.Tensor
): \((B, N0)\)q_value (
torch.FloatTensor
): \((B, )\), where B is batch size.
- Critic Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
action (
torch.Tensor
): \((B, N2)\), where B is batch size and N2 is``action_shape``q_value (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
distribution (
torch.FloatTensor
): \((B, 1, N3)\), where B is batch size and N3 isnum_atom
- Actor Examples:
>>> # Regression mode >>> model = QACDIST(64, 64, 'regression') >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['action'].shape == torch.Size([4, 64]) >>> # Reparameterization Mode >>> model = QACDIST(64, 64, 'reparameterization') >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> actor_outputs['logit'][0].shape # mu >>> torch.Size([4, 64]) >>> actor_outputs['logit'][1].shape # sigma >>> torch.Size([4, 64])
- Critic Examples:
>>> # Categorical mode >>> N = 32 >>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)} >>> model = QACDIST(obs_shape=(N, ), action_shape=1, action_space='regression', ... critic_head_type='categorical', n_atom=51) >>> q_value = model(inputs, mode='compute_critic') # q value >>> assert q_value['q_value'].shape == torch.Size([4, 1]) >>> assert q_value['distribution'].shape == torch.Size([4, 1, 51])
DiscreteBC¶
- class ding.model.DiscreteBC(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, strides: list | None = None)[source]¶
- Overview:
The DiscreteBC network.
- Interfaces:
__init__
,forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, strides: list | None = None) None [source]¶
- Overview:
Init the DiscreteBC (encoder + head) Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action space shape, such as 6 or [2, 3, 3].encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
.dueling (
bool
): Whether chooseDuelingHead
orDiscreteHead(default)
.head_hidden_size (
Optional[int]
): Thehidden_size
of head network.head_layer_num (
int
): The number of layers used in the head network to compute Q value outputactivation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details.strides (
Optional[list]
): The strides for each convolution layers, such as [2, 2, 2]. The length of this argument should be the same asencoder_hidden_size_list
.
- forward(x: Tensor) Dict [source]¶
- Overview:
DiscreteBC forward computation graph, input observation tensor to predict q_value.
- Arguments:
x (
torch.Tensor
): Observation inputs
- Returns:
outputs (
Dict
): DiscreteBC forward outputs, such as q_value.
- ReturnsKeys:
logit (
torch.Tensor
): Discrete Q-value output of each action dimension.
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
logit (
torch.FloatTensor
): \((B, M)\), where B is batch size and M isaction_shape
- Examples:
>>> model = DiscreteBC(32, 6) # arguments: 'obs_shape' and 'action_shape' >>> inputs = torch.randn(4, 32) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])
ContinuousBC¶
- class ding.model.ContinuousBC(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The ContinuousBC network.
- Interfaces:
__init__
,forward
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, action_space: str, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Initialize the ContinuousBC Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType, EasyDict]
): Action’s shape, such as 4, (3, ), EasyDict({‘action_type_shape’: 3, ‘action_args_shape’: 4}).action_space (
str
): The type of action space, including [regression
,reparameterization
].actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor head.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.
- forward(inputs: Tensor | Dict[str, Tensor]) Dict [source]¶
- Overview:
The unique execution (forward) method of ContinuousBC.
- Arguments:
inputs (
torch.Tensor
): Observation data, defaults to tensor.
- Returns:
output (
Dict
): Output dict data, including different key-values among distinct action_space.
- ReturnsKeys:
action (
torch.Tensor
): action output of actor network, with shape \((B, action_shape)\).logit (
List[torch.Tensor]
): reparameterized action output of actor network, with shape \((B, action_shape)\).
- Shapes:
inputs (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
action (
torch.FloatTensor
): \((B, M)\), where B is batch size and M isaction_shape
logit (
List[torch.FloatTensor]
): \((B, M)\), where B is batch size and M isaction_shape
- Examples (Regression):
>>> model = ContinuousBC(32, 6, action_space='regression') >>> inputs = torch.randn(4, 32) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) and outputs['action'].shape == torch.Size([4, 6])
- Examples (Reparameterization):
>>> model = ContinuousBC(32, 6, action_space='reparameterization') >>> inputs = torch.randn(4, 32) >>> outputs = model(inputs) >>> assert isinstance(outputs, dict) and outputs['logit'][0].shape == torch.Size([4, 6]) >>> assert outputs['logit'][1].shape == torch.Size([4, 6])
PDQN¶
- class ding.model.PDQN(obs_shape: int | SequenceType, action_shape: EasyDict, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, multi_pass: bool | None = False, action_mask: list | None = None)[source]¶
- Overview:
The neural network and computation graph of PDQN(https://arxiv.org/abs/1810.06394v1) and MPDQN(https://arxiv.org/abs/1905.04388) algorithms for parameterized action space. This model supports parameterized action space with discrete
action_type
and continuousaction_arg
. In principle, PDQN consists of x network (continuous action parameter network) and Q network (discrete action type network). But for simplicity, the code is split intoencoder
andactor_head
, which contain the encoder and head of the above two networks respectively.- Interface:
__init__
,forward
,compute_discrete
,compute_continuous
.
- __init__(obs_shape: int | SequenceType, action_shape: EasyDict, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, multi_pass: bool | None = False, action_mask: list | None = None) None [source]¶
- Overview:
Init the PDQN (encoder + head) Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation space shape, such as 8 or [4, 84, 84].action_shape (
EasyDict
): Action space shape in dict type, such as EasyDict({‘action_type_shape’: 3, ‘action_args_shape’: 5}).encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
, the last element must matchhead_hidden_size
.dueling (
bool
): Whether chooseDuelingHead
orDiscreteHead(default)
.head_hidden_size (
Optional[int]
): Thehidden_size
of head network.head_layer_num (
int
): The number of layers used in the head network to compute Q value output.activation (
Optional[nn.Module]
): The type of activation function in networks ifNone
then default set it tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization in networks, seeding.torch_utils.fc_block
for more details.multi_pass (
Optional[bool]
): Whether to use multi pass version.action_mask: (
Optional[list]
): An action mask indicating how action args are associated with each discrete action. For example, if there are 3 discrete actions and 4 continuous action args, and the first discrete action is associated with the first continuous action arg, the second discrete action with the second continuous action arg, and the third discrete action with the remaining 2 action args, the action mask will be [[1,0,0,0],[0,1,0,0],[0,0,1,1]] with shape 3x4.
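- Example (illustrative sketch):
A minimal sketch (not DI-engine code; the variable names are assumptions for illustration) of how the action mask from the example above picks out the continuous args that belong to a chosen discrete action.
>>> import torch
>>> # 3 discrete action types, 4 continuous args, matching the mask in the docstring above
>>> action_mask = torch.tensor([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]], dtype=torch.bool)
>>> action_args = torch.randn(4)    # all continuous args predicted by the continuous network
>>> action_type = 2                 # suppose the Q network selects the third discrete action
>>> relevant_args = action_args[action_mask[action_type]]   # keep only its associated args
>>> assert relevant_args.shape == torch.Size([2])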
- compute_continuous(inputs: Tensor) Dict [source]¶
- Overview:
Use observation tensor to predict continuous action args.
- Arguments:
inputs (
torch.Tensor
): Observation inputs.
- Returns:
- outputs (
Dict
): A dict with key ‘action_args’. ‘action_args’ (
torch.Tensor
): The continuous action args.
- outputs (
- Shapes:
inputs (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
.action_args (
torch.Tensor
): \((B, M)\), where M isaction_args_shape
.
- Examples:
>>> act_shape = EasyDict({'action_type_shape': (3, ), 'action_args_shape': (5, )}) >>> model = PDQN(4, act_shape) >>> inputs = torch.randn(64, 4) >>> outputs = model.forward(inputs, mode='compute_continuous') >>> assert outputs['action_args'].shape == torch.Size([64, 5])
- compute_discrete(inputs: Dict | EasyDict) Dict [source]¶
- Overview:
Use observation tensor and continuous action args to predict discrete action types.
- Arguments:
- inputs (
Union[Dict, EasyDict]
): A dict with keys ‘state’, ‘action_args’. state (
torch.Tensor
): Observation inputs.action_args (
torch.Tensor
): Action parameters are used to concatenate with the observation and serve as input to the discrete action type network.
- inputs (
- Returns:
- outputs (
Dict
): A dict with keys ‘logit’, ‘action_args’. ‘logit’: The logit value for each discrete action.
‘action_args’: The continuous action args(same as the inputs[‘action_args’]) for later usage.
- outputs (
- Examples:
>>> act_shape = EasyDict({'action_type_shape': (3, ), 'action_args_shape': (5, )}) >>> model = PDQN(4, act_shape) >>> inputs = {'state': torch.randn(64, 4), 'action_args': torch.randn(64, 5)} >>> outputs = model.forward(inputs, mode='compute_discrete') >>> assert outputs['logit'].shape == torch.Size([64, 3]) >>> assert outputs['action_args'].shape == torch.Size([64, 5])
- forward(inputs: Tensor | Dict | EasyDict, mode: str) Dict [source]¶
- Overview:
PDQN forward computation graph, input observation tensor to predict q_value for discrete actions and values for continuous action_args.
- Arguments:
inputs (
Union[torch.Tensor, Dict, EasyDict]
): Inputs including observation and other info according to mode.mode (
str
): Name of the forward mode.
- Shapes:
inputs (
torch.Tensor
): \((B, N)\), where B is batch size and N isobs_shape
.
DecisionTransformer¶
- class ding.model.DecisionTransformer(state_dim: int | SequenceType, act_dim: int, n_blocks: int, h_dim: int, context_len: int, n_heads: int, drop_p: float, max_timestep: int = 4096, state_encoder: Module | None = None, continuous: bool = False)[source]¶
- Overview:
The implementation of decision transformer.
- Interfaces:
__init__
,forward
,configure_optimizers
- __init__(state_dim: int | SequenceType, act_dim: int, n_blocks: int, h_dim: int, context_len: int, n_heads: int, drop_p: float, max_timestep: int = 4096, state_encoder: Module | None = None, continuous: bool = False)[source]¶
- Overview:
Initialize the DecisionTransformer Model according to input arguments.
- Arguments:
state_dim (
Union[int, SequenceType]
): Dimension of state, such as 128 or (4, 84, 84).act_dim (
int
): The dimension of actions, such as 6.n_blocks (
int
): The number of transformer blocks in the decision transformer, such as 3.h_dim (
int
): The dimension of the hidden layers, such as 128.context_len (
int
): The max context length of the attention, such as 6.n_heads (
int
): The number of heads in calculating attention, such as 8.drop_p (
float
): The drop rate of the drop-out layer, such as 0.1.max_timestep (
int
): The max length of the total sequence, defaults to be 4096.state_encoder (
Optional[nn.Module]
): The encoder to pre-process the given input. If it is set to None, the raw state will be pushed into the transformer.continuous (
bool
): Whether the action space is continuous, defaults to beFalse
.
- forward(timesteps: Tensor, states: Tensor, actions: Tensor, returns_to_go: Tensor, tar: int | None = None) Tuple[Tensor, Tensor, Tensor] [source]¶
- Overview:
Forward computation graph of the decision transformer, input a sequence tensor and return a tensor with the same shape.
- Arguments:
timesteps (
torch.Tensor
): The timestep for input sequence.states (
torch.Tensor
): The sequence of states.actions (
torch.Tensor
): The sequence of actions.returns_to_go (
torch.Tensor
): The sequence of return-to-go.tar (
Optional[int]
): Whether to predict action, regardless of index.
- Returns:
output (
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
): Output contains three tensors, they are correspondingly the predicted states, predicted actions and predicted return-to-go.
- Examples:
>>> B, T = 4, 6 >>> state_dim = 3 >>> act_dim = 2 >>> DT_model = DecisionTransformer( state_dim=state_dim, act_dim=act_dim, n_blocks=3, h_dim=8, context_len=T, n_heads=2, drop_p=0.1, ) >>> timesteps = torch.randint(0, 100, [B, 3 * T - 1, 1], dtype=torch.long) # B x T >>> states = torch.randn([B, T, state_dim]) # B x T x state_dim >>> actions = torch.randint(0, act_dim, [B, T, 1]) >>> action_target = torch.randint(0, act_dim, [B, T, 1]) >>> returns_to_go = torch.tensor([1, 0.8, 0.6, 0.4, 0.2, 0.]).repeat([B, 1]).unsqueeze(-1).float() >>> traj_mask = torch.ones([B, T], dtype=torch.long) # B x T >>> actions = actions.squeeze(-1) >>> state_preds, action_preds, return_preds = DT_model.forward( timesteps=timesteps, states=states, actions=actions, returns_to_go=returns_to_go ) >>> assert state_preds.shape == torch.Size([B, T, state_dim]) >>> assert return_preds.shape == torch.Size([B, T, 1]) >>> assert action_preds.shape == torch.Size([B, T, act_dim])
LanguageTransformer¶
- class ding.model.LanguageTransformer(model_name: str = 'bert-base-uncased', add_linear: bool = False, embedding_size: int = 128, freeze_encoder: bool = True, hidden_dim: int = 768, norm_embedding: bool = False)[source]¶
- Overview:
The LanguageTransformer network. Download a pre-trained language model and add head on it. In the default case, we use BERT model as the text encoder, whose bi-directional character is good for obtaining the embedding of the whole sentence.
- Interfaces:
__init__
,forward
- __init__(model_name: str = 'bert-base-uncased', add_linear: bool = False, embedding_size: int = 128, freeze_encoder: bool = True, hidden_dim: int = 768, norm_embedding: bool = False) None [source]¶
- Overview:
Init the LanguageTransformer Model according to input arguments.
- Arguments:
model_name (
str
): The base language model name in huggingface, such as “bert-base-uncased”.add_linear (
bool
): Whether to add a linear layer on the top of language model, defaults to beFalse
.embedding_size (
int
): The embedding size of the added linear layer, such as 128.freeze_encoder (
bool
): Whether to freeze the encoder language model while training, defaults to beTrue
.hidden_dim (
int
): The embedding dimension of the encoding model (e.g. BERT). This value should correspond to the model you use. For bert-base-uncased, this value is 768.norm_embedding (
bool
): Whether to normalize the embedding vectors. Default to beFalse
.
- forward(train_samples: List[str], candidate_samples: List[str] | None = None, mode: str = 'compute_actor') Dict [source]¶
- Overview:
LanguageTransformer forward computation graph, input two lists of strings and predict their matching scores. Different
mode
will forward with different network modules to get different outputs.- Arguments:
train_samples (
List[str]
): One list of strings.candidate_samples (
Optional[List[str]]
): The other list of strings to calculate matching scores.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
output (
Dict
): Output dict data, including the logit of matching scores and the correspondingtorch.distributions.Categorical
object.
- Examples:
>>> test_pids = [1] >>> cand_pids = [0, 2, 4] >>> problems = [ "This is problem 0", "This is the first question", "Second problem is here", "Another problem", "This is the last problem" ] >>> ctxt_list = [problems[pid] for pid in test_pids] >>> cands_list = [problems[pid] for pid in cand_pids] >>> model = LanguageTransformer(model_name="bert-base-uncased", add_linear=True, embedding_size=256) >>> scores = model(ctxt_list, cands_list) >>> assert scores.shape == (1, 3)
Mixer¶
- class ding.model.Mixer(agent_num: int, state_dim: int, mixing_embed_dim: int, hypernet_embed: int = 64, activation: Module = ReLU())[source]¶
- Overview:
Mixer network in QMIX, which mixes up the independent q_value of each agent into a total q_value. The weights (but not the biases) of the Mixer network are restricted to be non-negative and produced by separate hypernetworks. Each hypernetwork takes the global state s as input and generates the weights of one layer of the Mixer network.
- Interface:
__init__
,forward
.
- __init__(agent_num: int, state_dim: int, mixing_embed_dim: int, hypernet_embed: int = 64, activation: Module = ReLU())[source]¶
- Overview:
Initialize mixer network proposed in QMIX according to arguments. Each hypernetwork consists of linear layers, followed by an absolute activation function, to ensure that the Mixer network weights are non-negative.
- Arguments:
agent_num (
int
): The number of agent, such as 8.state_dim(
int
): The dimension of global observation state, such as 16.mixing_embed_dim (
int
): The dimension of mixing state embedding, such as 128.hypernet_embed (
int
): The dimension of hypernet embedding, default to 64.activation (
nn.Module
): Activation function in network, defaults to nn.ReLU().
- forward(agent_qs, states)[source]¶
- Overview:
Forward computation graph of pymarl mixer network. Mix up the input independent q_value of each agent to a total q_value with weights generated by hypernetwork according to global
states
.- Arguments:
agent_qs (
torch.FloatTensor
): The independent q_value of each agent.states (
torch.FloatTensor
): The embedding vector of global state.
- Returns:
q_tot (
torch.FloatTensor
): The total mixed q_value.
- Shapes:
agent_qs (
torch.FloatTensor
): \((B, N)\), where B is batch size and N is agent_num.states (
torch.FloatTensor
): \((B, M)\), where M is embedding_size.q_tot (
torch.FloatTensor
): \((B, )\).
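- Example (illustrative sketch):
A minimal usage sketch, assuming the constructor and forward signatures documented above; the concrete sizes are arbitrary.
>>> import torch
>>> B, agent_num, state_dim, embed_dim = 4, 8, 16, 32
>>> mixer = Mixer(agent_num=agent_num, state_dim=state_dim, mixing_embed_dim=embed_dim)
>>> agent_qs = torch.randn(B, agent_num)    # independent q_value of each agent
>>> states = torch.randn(B, state_dim)      # global state fed to the hypernetworks
>>> q_tot = mixer(agent_qs, states)         # total mixed q_value, (B, ) per the Shapes above
>>> assert q_tot.shape[0] == B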
QMix¶
- class ding.model.QMix(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, mixer: bool = True, lstm_type: str = 'gru', activation: Module = ReLU(), dueling: bool = False)[source]¶
- Overview:
The neural network and computation graph of algorithms related to QMIX(https://arxiv.org/abs/1803.11485). The QMIX is composed of two parts: agent Q network and mixer(optional). The QMIX paper mentions that all agents share local Q network parameters, so only one Q network is initialized here. Then use summation or Mixer network to process the local Q according to the
mixer
settings to obtain the global Q.- Interface:
__init__
,forward
.
- __init__(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, mixer: bool = True, lstm_type: str = 'gru', activation: Module = ReLU(), dueling: bool = False) None [source]¶
- Overview:
Initialize QMIX neural network according to arguments, i.e. agent Q network and mixer.
- Arguments:
agent_num (
int
): The number of agent, such as 8.obs_shape (
int
): The dimension of each agent’s observation state, such as 8 or [4, 84, 84].global_obs_shape (
int
): The dimension of global observation state, such as 8 or [4, 84, 84].action_shape (
int
): The dimension of action shape, such as 6 or [2, 3, 3].hidden_size_list (
list
): The list of hidden size forq_network
, the last element must match mixer’smixing_embed_dim
.mixer (
bool
): Use mixer net or not, default to True. If it is false, the final local Q is added to obtain the global Q.lstm_type (
str
): The type of RNN module inq_network
, now support [‘normal’, ‘pytorch’, ‘gru’], default to gru.activation (
nn.Module
): The type of activation function to use inMLP
the afterlayer_fn
, ifNone
then default set tonn.ReLU()
.dueling (
bool
): Whether chooseDuelingHead
(True) orDiscreteHead (False)
, default to False.
- forward(data: dict, single_step: bool = True) dict [source]¶
- Overview:
QMIX forward computation graph, input dict including time series observation and related data to predict total q_value and each agent q_value.
- Arguments:
- data (
dict
): Input data dict with keys [‘obs’, ‘prev_state’, ‘action’]. agent_state (
torch.Tensor
): Time series local observation data of each agents.global_state (
torch.Tensor
): Time series global observation data.prev_state (
list
): Previous rnn state forq_network
.action (
torch.Tensor
or None): The actions of each agent given outside the function. If action is None, use argmax q_value index as action to calculateagent_q_act
.
- data (
single_step (
bool
): Whether single_step forward, if so, add timestep dim before forward and remove it after forward.
- Returns:
ret (
dict
): Output data dict with keys [total_q
,logit
,next_state
].
- ReturnsKeys:
total_q (
torch.Tensor
): Total q_value, which is the result of mixer network.agent_q (
torch.Tensor
): Each agent q_value.next_state (
list
): Next rnn state forq_network
.
- Shapes:
agent_state (
torch.Tensor
): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, N is obs_shape.global_state (
torch.Tensor
): \((T, B, M)\), where M is global_obs_shape.prev_state (
list
): \((B, A)\), a list of length B, and each element is a list of length A.action (
torch.Tensor
): \((T, B, A)\).total_q (
torch.Tensor
): \((T, B)\).agent_q (
torch.Tensor
): \((T, B, A, P)\), where P is action_shape.next_state (
list
): \((B, A)\), a list of length B, and each element is a list of length A.
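- Example (illustrative sketch):
A minimal usage sketch following the shape conventions above; the concrete sizes, the hidden_size_list and the all-None initial prev_state are assumptions for illustration.
>>> import torch
>>> T, B, A, N, M, P = 2, 3, 4, 8, 16, 6
>>> model = QMix(agent_num=A, obs_shape=N, global_obs_shape=M, action_shape=P,
...              hidden_size_list=[64, 32], mixer=True)
>>> data = {
...     'obs': {
...         'agent_state': torch.randn(T, B, A, N),
...         'global_state': torch.randn(T, B, M),
...     },
...     'prev_state': [[None for _ in range(A)] for _ in range(B)],
...     'action': torch.randint(0, P, size=(T, B, A)),
... }
>>> output = model(data, single_step=False)
>>> assert output['total_q'].shape == (T, B)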
COMA¶
- class ding.model.COMA(agent_num: int, obs_shape: Dict, action_shape: int | SequenceType, actor_hidden_size_list: SequenceType)[source]¶
- Overview:
The network of the COMA algorithm, which is a QAC-type actor-critic.
- Interface:
__init__
,forward
- Properties:
mode (
list
): The list of forward mode, includingcompute_actor
andcompute_critic
- __init__(agent_num: int, obs_shape: Dict, action_shape: int | SequenceType, actor_hidden_size_list: SequenceType) None [source]¶
- Overview:
initialize COMA network
- Arguments:
agent_num (
int
): the number of agentobs_shape (
Dict
): the observation information, including agent_state and global_stateaction_shape (
Union[int, SequenceType]
): the dimension of action shapeactor_hidden_size_list (
SequenceType
): the list of hidden size
- forward(inputs: Dict, mode: str) Dict [source]¶
- Overview:
forward computation graph of COMA network
- Arguments:
inputs (
dict
): input data dict with keys [‘obs’, ‘prev_state’, ‘action’]agent_state (
torch.Tensor
): each agent local state(obs)global_state (
torch.Tensor
): global state(obs)action (
torch.Tensor
): the masked action
- ArgumentsKeys:
necessary:
obs
{agent_state
,global_state
,action_mask
},action
,prev_state
- ReturnsKeys:
- necessary:
compute_critic:
q_value
compute_actor:
logit
,next_state
,action_mask
- Shapes:
obs (
dict
):agent_state
: \((T, B, A, N, D)\),action_mask
: \((T, B, A, N, A)\)prev_state (
list
): \([[[h, c] for _ in range(A)] for _ in range(B)]\)logit (
torch.Tensor
): \((T, B, A, N, A)\)next_state (
list
): \([[[h, c] for _ in range(A)] for _ in range(B)]\)action_mask (
torch.Tensor
): \((T, B, A, N, A)\)q_value (
torch.Tensor
): \((T, B, A, N, A)\)
- Examples:
>>> agent_num, bs, T = 4, 3, 8 >>> obs_dim, global_obs_dim, action_dim = 32, 32 * 4, 9 >>> coma_model = COMA( >>> agent_num=agent_num, >>> obs_shape=dict(agent_state=(obs_dim, ), global_state=(global_obs_dim, )), >>> action_shape=action_dim, >>> actor_hidden_size_list=[128, 64], >>> ) >>> prev_state = [[None for _ in range(agent_num)] for _ in range(bs)] >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(T, bs, agent_num, obs_dim), >>> 'action_mask': None, >>> }, >>> 'prev_state': prev_state, >>> } >>> output = coma_model(data, mode='compute_actor') >>> data = { >>> 'obs': { >>> 'agent_state': torch.randn(T, bs, agent_num, obs_dim), >>> 'global_state': torch.randn(T, bs, global_obs_dim), >>> }, >>> 'action': torch.randint(0, action_dim, size=(T, bs, agent_num)), >>> } >>> output = coma_model(data, mode='compute_critic')
QTran¶
- class ding.model.QTran(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, embedding_size: int, lstm_type: str = 'gru', dueling: bool = False)[source]¶
- Overview:
QTRAN network
- Interface:
__init__, forward
- __init__(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, embedding_size: int, lstm_type: str = 'gru', dueling: bool = False) None [source]¶
- Overview:
initialize QTRAN network
- Arguments:
agent_num (
int
): the number of agentobs_shape (
int
): the dimension of each agent’s observation stateglobal_obs_shape (
int
): the dimension of global observation stateaction_shape (
int
): the dimension of action shapehidden_size_list (
list
): the list of hidden sizeembedding_size (
int
): the dimension of embeddinglstm_type (
str
): use lstm or gru, default to grudueling (
bool
): use dueling head or not, default to False.
- forward(data: dict, single_step: bool = True) dict [source]¶
- Overview:
forward computation graph of qtran network
- Arguments:
- data (
dict
): input data dict with keys [‘obs’, ‘prev_state’, ‘action’] agent_state (
torch.Tensor
): each agent local state(obs)global_state (
torch.Tensor
): global state(obs)prev_state (
list
): previous rnn stateaction (
torch.Tensor
or None): if action is None, use argmax q_value index as action to calculateagent_q_act
- data (
single_step (
bool
): whether single_step forward, if so, add timestep dim before forward and remove it after forward
- Return:
- ret (
dict
): output data dict with keys [‘total_q’, ‘logit’, ‘next_state’] total_q (
torch.Tensor
): total q_value, which is the result of mixer networkagent_q (
torch.Tensor
): each agent q_valuenext_state (
list
): next rnn state
- ret (
- Shapes:
agent_state (
torch.Tensor
): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, N is obs_shape.global_state (
torch.Tensor
): \((T, B, M)\), where M is global_obs_shape.prev_state (
list
): \((B, A)\), a list of length B, and each element is a list of length A.action (
torch.Tensor
): \((T, B, A)\)total_q (
torch.Tensor
): \((T, B)\)agent_q (
torch.Tensor
): \((T, B, A, P)\), where P is action_shapenext_state (
list
): \((B, A)\), a list of length B, and each element is a list of length A.
WQMix¶
- class ding.model.WQMix(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, lstm_type: str = 'gru', dueling: bool = False)[source]¶
- Overview:
WQMIX (https://arxiv.org/abs/2006.10800) network. There are two components: 1) Q_tot, which is the same as the QMIX network and is composed of an agent Q network and a mixer network; 2) an unrestricted joint-action Q_star, which is composed of an agent Q network and a mixer_star network. The QMIX paper mentions that all agents share local Q network parameters, so only one Q network is initialized in Q_tot or Q_star.
- Interface:
__init__
,forward
.
- __init__(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, lstm_type: str = 'gru', dueling: bool = False) None [source]¶
- Overview:
Initialize WQMIX neural network according to arguments, i.e. agent Q network and mixer, Q_star network and mixer_star.
- Arguments:
agent_num (
int
): The number of agent, such as 8.obs_shape (
int
): The dimension of each agent’s observation state, such as 8.global_obs_shape (
int
): The dimension of global observation state, such as 8.action_shape (
int
): The dimension of action shape, such as 6.hidden_size_list (
list
): The list of hidden size forq_network
, the last element must match mixer’smixing_embed_dim
.lstm_type (
str
): The type of RNN module inq_network
, now support [‘normal’, ‘pytorch’, ‘gru’], default to gru.dueling (
bool
): Whether chooseDuelingHead
(True) orDiscreteHead (False)
, default to False.
- forward(data: dict, single_step: bool = True, q_star: bool = False) dict [source]¶
- Overview:
Forward computation graph of WQMIX network. Input dict including time series observation and related data to predict total q_value and each agent q_value. Determine whether to calculate Q_tot or Q_star based on the
q_star
parameter.- Arguments:
- data (
dict
): Input data dict with keys [‘obs’, ‘prev_state’, ‘action’]. agent_state (
torch.Tensor
): Time series local observation data of each agents.global_state (
torch.Tensor
): Time series global observation data.prev_state (
list
): Previous rnn state forq_network
or_q_network_star
.action (
torch.Tensor
or None): If action is None, use argmax q_value index as action to calculateagent_q_act
.
- data (
single_step (
bool
): Whether single_step forward, if so, add timestep dim before forward and remove it after forward.Q_star (
bool
): Whether Q_star network forward. If True, using the Q_star network, where the agent networks have the same architecture as Q network but do not share parameters and the mixing network is a feedforward network with 3 hidden layers of 256 dim; if False, using the Q network, same as the Q network in Qmix paper.
- Returns:
ret (
dict
): Output data dict with keys [total_q
,logit
,next_state
].total_q (
torch.Tensor
): Total q_value, which is the result of mixer network.agent_q (
torch.Tensor
): Each agent q_value.next_state (
list
): Next rnn state.
- Shapes:
agent_state (
torch.Tensor
): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, N is obs_shape.global_state (
torch.Tensor
): \((T, B, M)\), where M is global_obs_shape.prev_state (
list
): \((T, B, A)\), a list of length B, and each element is a list of length A.action (
torch.Tensor
): \((T, B, A)\).total_q (
torch.Tensor
): \((T, B)\).agent_q (
torch.Tensor
): \((T, B, A, P)\), where P is action_shape.next_state (
list
): \((T, B, A)\), a list of length B, and each element is a list of length A.
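- Example (illustrative sketch):
A minimal usage sketch mirroring the QMix example above (the sizes, hidden_size_list and all-None initial prev_state are assumptions for illustration); the q_star flag switches between the restricted Q_tot and the unrestricted Q_star estimate.
>>> import torch
>>> T, B, A, N, M, P = 2, 3, 4, 8, 16, 6
>>> model = WQMix(agent_num=A, obs_shape=N, global_obs_shape=M, action_shape=P,
...               hidden_size_list=[64, 32])
>>> data = {
...     'obs': {
...         'agent_state': torch.randn(T, B, A, N),
...         'global_state': torch.randn(T, B, M),
...     },
...     'prev_state': [[None for _ in range(A)] for _ in range(B)],
...     'action': torch.randint(0, P, size=(T, B, A)),
... }
>>> q_tot = model(data, single_step=False, q_star=False)['total_q']    # restricted Q_tot
>>> q_star = model(data, single_step=False, q_star=True)['total_q']    # unrestricted Q_star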
PPG¶
- class ding.model.PPG(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, impala_cnn_encoder: bool = False)[source]¶
- Overview:
Phasic Policy Gradient (PPG) model from the paper "Phasic Policy Gradient" (https://arxiv.org/abs/2009.04416). This module contains a VAC module and an auxiliary critic module.
- Interfaces:
forward
,compute_actor
,compute_critic
,compute_actor_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, action_space: str = 'discrete', share_encoder: bool = True, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 2, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, impala_cnn_encoder: bool = False) None [source]¶
- Overview:
Initialize the PPG Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType]
): Action’s shape, such as 4, (3, ).action_space (
str
): The action space type, such as ‘discrete’, ‘continuous’.share_encoder (
bool
): Whether to share encoder.encoder_hidden_size_list (
SequenceType
): The hidden size list of encoder.actor_head_hidden_size (
int
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor head.critic_head_hidden_size (
int
): Thehidden_size
to pass to critic head.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic head.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.impala_cnn_encoder (
bool
): Whether to use impala cnn encoder.
- compute_actor(x: Tensor) Dict [source]¶
- Overview:
Use actor to compute action logits.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
output (
Dict
): The output data containing action logits.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict withaction_type
andaction_args
.
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N is the input feature size.output (
Dict
):logit
: \((B, A)\), where B is batch size and A is the action space size.
- compute_actor_critic(x: Tensor) Dict [source]¶
- Overview:
Use actor and critic to compute action logits and value.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of PPG’s forward computation graph for both actor and critic, includinglogit
andvalue
.
- ReturnsKeys:
logit (
torch.Tensor
): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict withaction_type
andaction_args
.value (
torch.Tensor
): The predicted state value tensor.
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N is the input feature size.output (
Dict
):value
: \((B, 1)\), where B is batch size.output (
Dict
):logit
: \((B, A)\), where B is batch size and A is the action space size.
Note
compute_actor_critic
interface aims to save computation when shares encoder.
- compute_critic(x: Tensor) Dict [source]¶
- Overview:
Use critic to compute value.
- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
output (
Dict
): The output dict of VAC’s forward computation graph for critic, includingvalue
.
- ReturnsKeys:
necessary:
value
- Shapes:
x (
torch.Tensor
): \((B, N)\), where B is batch size and N is the input feature size.output (
Dict
):value
: \((B, 1)\), where B is batch size.
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Compute action logits or value according to mode being
compute_actor
,compute_critic
orcompute_actor_critic
.- Arguments:
x (
torch.Tensor
): The input observation tensor data.mode (
str
): The forward mode, all the modes are defined in the beginning of this class.
- Returns:
outputs (
Dict
): The output dict of PPG’s forward computation graph, whose key-values vary with differentmode
.
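- Example (illustrative sketch):
A minimal usage sketch for the three forward modes; the concrete obs_shape, action_shape and batch size are assumptions for illustration.
>>> import torch
>>> model = PPG(obs_shape=32, action_shape=6, action_space='discrete')
>>> x = torch.randn(4, 32)
>>> logit = model(x, mode='compute_actor')['logit']        # action logits
>>> value = model(x, mode='compute_critic')['value']       # state value from the critic head
>>> both = model(x, mode='compute_actor_critic')           # shared-encoder forward, returns both
>>> assert logit.shape == torch.Size([4, 6])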
ProcedureCloningBFS¶
- class ding.model.ProcedureCloningBFS(obs_shape: SequenceType, action_shape: int, encoder_hidden_size_list: SequenceType = [128, 128, 256, 256])[source]¶
- Overview:
The neural network introduced in procedure cloning (PC) to process 3-dim observations. Given an input, this model will perform several 3x3 convolutions and output a feature map with the same height and width as the input. The channel number of the output will be the
action_shape
.- Interfaces:
__init__
,forward
.
- __init__(obs_shape: SequenceType, action_shape: int, encoder_hidden_size_list: SequenceType = [128, 128, 256, 256])[source]¶
- Overview:
Init the
BFSConvolution Encoder
according to the provided arguments.- Arguments:
obs_shape (
SequenceType
): Sequence ofin_channel
, plus one or moreinput size
, such as [4, 84, 84].action_dim (
int
): Action space shape, such as 6.encoder_hidden_size_list (
SequenceType
): The cnn channel dims for each block, such as [128, 128, 256, 256].
- forward(x: Tensor) Dict [source]¶
- Overview:
The computation graph. Given a 3-dim observation, this function will return a tensor with the same height and width. The channel number of output will be the
action_shape
.- Arguments:
x (
torch.Tensor
): The input observation tensor data.
- Returns:
outputs (
Dict
): The output dict of model’s forward computation graph, only contains a single keylogit
.
- Examples:
>>> model = ProcedureCloningBFS([3, 16, 16], 4) >>> inputs = torch.randn(16, 16, 3).unsqueeze(0) >>> outputs = model(inputs) >>> assert outputs['logit'].shape == torch.Size([16, 16, 4])
ProcedureCloningMCTS¶
- class ding.model.ProcedureCloningMCTS(obs_shape: SequenceType, action_dim: int, cnn_hidden_list: SequenceType = [128, 128, 256, 256, 256], cnn_activation: Module = ReLU(), cnn_kernel_size: SequenceType = [3, 3, 3, 3, 3], cnn_stride: SequenceType = [1, 1, 1, 1, 1], cnn_padding: SequenceType = [1, 1, 1, 1, 1], mlp_hidden_list: SequenceType = [256, 256], mlp_activation: Module = ReLU(), att_heads: int = 8, att_hidden: int = 128, n_att: int = 4, n_feedforward: int = 2, feedforward_hidden: int = 256, drop_p: float = 0.5, max_T: int = 17)[source]¶
- Overview:
The neural network of algorithms related to Procedure cloning (PC).
- Interfaces:
__init__
,forward
.
- __init__(obs_shape: SequenceType, action_dim: int, cnn_hidden_list: SequenceType = [128, 128, 256, 256, 256], cnn_activation: Module = ReLU(), cnn_kernel_size: SequenceType = [3, 3, 3, 3, 3], cnn_stride: SequenceType = [1, 1, 1, 1, 1], cnn_padding: SequenceType = [1, 1, 1, 1, 1], mlp_hidden_list: SequenceType = [256, 256], mlp_activation: Module = ReLU(), att_heads: int = 8, att_hidden: int = 128, n_att: int = 4, n_feedforward: int = 2, feedforward_hidden: int = 256, drop_p: float = 0.5, max_T: int = 17) None [source]¶
- Overview:
Initialize the MCTS procedure cloning model according to corresponding input arguments.
- Arguments:
obs_shape (
SequenceType
): Observation space shape, such as [4, 84, 84].action_dim (
int
): Action space shape, such as 6.cnn_hidden_list (
SequenceType
): The cnn channel dims for each block, such as [128, 128, 256, 256, 256].cnn_activation (
nn.Module
): The activation function for cnn blocks, such asnn.ReLU()
.cnn_kernel_size (
SequenceType
): The kernel size for each cnn block, such as [3, 3, 3, 3, 3].cnn_stride (
SequenceType
): The stride for each cnn block, such as [1, 1, 1, 1, 1].cnn_padding (
SequenceType
): The padding for each cnn block, such as [1, 1, 1, 1, 1].mlp_hidden_list (
SequenceType
): The last dim for this must match the last dim ofcnn_hidden_list
, such as [256, 256].mlp_activation (
nn.Module
): The activation function for mlp layers, such asnn.ReLU()
.att_heads (
int
): The number of attention heads in transformer, such as 8.att_hidden (
int
): The number of attention dimension in transformer, such as 128.n_att (
int
): The number of attention blocks in transformer, such as 4.n_feedforward (
int
): The number of feedforward layers in transformer, such as 2.drop_p (
float
): The drop out rate of attention, such as 0.5.max_T (
int
): The sequence length of procedure cloning, such as 17.
- forward(states: Tensor, goals: Tensor, actions: Tensor) Tuple[Tensor, Tensor] [source]¶
- Overview:
ProcedureCloningMCTS forward computation graph, input states tensor and goals tensor, calculate the predicted states and actions.
- Arguments:
states (
torch.Tensor
): The observation of current time.goals (
torch.Tensor
): The target observation after a period.actions (
torch.Tensor
): The actions executed during the period.
- Returns:
outputs (
Tuple[torch.Tensor, torch.Tensor]
): Predicted states and actions.
- Examples:
>>> inputs = { 'states': torch.randn(2, 3, 64, 64), 'goals': torch.randn(2, 3, 64, 64), 'actions': torch.randn(2, 15, 9) } >>> model = ProcedureCloningMCTS(obs_shape=(3, 64, 64), action_dim=9) >>> goal_preds, action_preds = model(inputs['states'], inputs['goals'], inputs['actions']) >>> assert goal_preds.shape == (2, 256) >>> assert action_preds.shape == (2, 16, 9)
ACER¶
- class ding.model.ACER(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The model of the ACER (Actor-Critic with Experience Replay) algorithm, proposed in "Sample Efficient Actor-Critic with Experience Replay" (https://arxiv.org/abs/1611.01224).
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Init the ACER Model according to arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s space.action_shape (
Union[int, SequenceType]
): Action’s space.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor-nn’sHead
.- actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor’s nn.
- actor_head_layer_num (
critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic-nn’sHead
.- critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic’s nn.
- critic_head_layer_num (
- activation (
Optional[nn.Module]
): The type of activation function to use in
MLP
after layer_fn
, ifNone
then default set tonn.ReLU()
- activation (
- norm_type (
Optional[str]
): The type of normalization to use, see
ding.torch_utils.fc_block
for more details.
- norm_type (
- compute_actor(inputs: Tensor) Dict [source]¶
- Overview:
Use encoded embedding tensor to predict the actor output in
compute_actor
mode.- Arguments:
- inputs (
torch.Tensor
): The encoded embedding tensor, determined with given
hidden_size
, i.e.(B, N=hidden_size)
.hidden_size = actor_head_hidden_size
- inputs (
mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Outputs of forward pass encoder and head.
- ReturnsKeys (either):
logit (
torch.FloatTensor
): \((B, N1)\), where B is batch size and N1 isaction_shape
- Shapes:
inputs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds tohidden_size
logit (
torch.FloatTensor
): \((B, N1)\), where B is batch size and N1 isaction_shape
- Examples:
>>> # Regression mode >>> model = ACER(64, 64) >>> inputs = torch.randn(4, 64) >>> actor_outputs = model(inputs,'compute_actor') >>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
- compute_critic(inputs: Tensor) Dict [source]¶
- Overview:
Use encoded obs and action tensors to predict Q-value output in
compute_critic
mode.- Arguments:
obs
,action
encoded tensors.mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Q-value output.
- ReturnKeys:
q_value (
torch.Tensor
): Q value tensor with same size as batch size.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
q_value (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
.
- Examples:
>>> N = 32 >>> inputs = torch.randn(4, N) >>> model = ACER(obs_shape=(N, ), action_shape=5) >>> model(inputs, mode='compute_critic')['q_value']
- forward(inputs: Tensor | Dict, mode: str) Dict [source]¶
- Overview:
Use the observation to predict the output, routing through the actor or critic computation graph according to the given mode.
- Arguments:
mode (
str
): Name of the forward mode.
- Returns:
outputs (
Dict
): Outputs of network forward.
- Shapes (Actor):
obs (
torch.Tensor
): \((B, N1)\), where B is batch size and N1 isobs_shape
logit (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
- Shapes (Critic):
inputs (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toobs_shape
q_value (
torch.FloatTensor
): \((B, N2)\), where B is batch size and N2 isaction_shape
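The forward method has no example of its own in this docstring; the following sketch simply mirrors the compute_actor and compute_critic examples above and routes them through forward with an explicit mode string (shapes follow the Shapes blocks above).
- Examples:
>>> model = ACER(obs_shape=64, action_shape=64)
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs, mode='compute_actor')    # {'logit': tensor of shape (4, 64)}
>>> critic_outputs = model(obs, mode='compute_critic')  # {'q_value': tensor of shape (4, 64)}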
NGU¶
- class ding.model.NGU(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], collector_env_num: int | None = 1, dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, lstm_type: str | None = 'normal', activation: Module | None = ReLU(), norm_type: str | None = None)[source]¶
- Overview:
The recurrent Q model for the NGU (https://arxiv.org/pdf/2002.06038.pdf) policy, modified from the class DRQN in q_learning.py. The implementation described in the original paper is to ‘adapt the R2D2 agent that uses the dueling network architecture with an LSTM layer after a convolutional neural network’. The NGU network includes an encoder, an LSTM core (RNN) and a head.
- Interface:
__init__
,forward
.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], collector_env_num: int | None = 1, dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, lstm_type: str | None = 'normal', activation: Module | None = ReLU(), norm_type: str | None = None) None [source]¶
- Overview:
Init the DRQN Model for NGU according to arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s space, such as 8 or [4, 84, 84].action_shape (
Union[int, SequenceType]
): Action’s space, such as 6 or [2, 3, 3].encoder_hidden_size_list (
SequenceType
): Collection ofhidden_size
to pass toEncoder
.collector_env_num (
Optional[int]
): The number of environments used to collect data simultaneously.dueling (
bool
): Whether to choose DuelingHead
(True) or DiscreteHead (False)
, defaults to True.head_hidden_size (
Optional[int]
): Thehidden_size
to pass toHead
, should match the last element ofencoder_hidden_size_list
.head_layer_num (
int
): The number of layers in head network.lstm_type (
Optional[str]
): Version of rnn cell, now support [‘normal’, ‘pytorch’, ‘hpc’, ‘gru’], default is ‘normal’.- activation (
Optional[nn.Module]
): The type of activation function to use in the MLP after each layer_fn; if None, it defaults to nn.ReLU().
- norm_type (
Optional[str]
): The type of normalization to use, see
ding.torch_utils.fc_block
for more details.
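The constructor has no usage example here; the following is a minimal construction sketch based only on the signature above (the concrete shapes are illustrative, not values taken from the source).
- Examples:
>>> model = NGU(obs_shape=[4, 84, 84], action_shape=6, encoder_hidden_size_list=[128, 128, 64], collector_env_num=8)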
- forward(inputs: Dict, inference: bool = False, saved_state_timesteps: list | None = None) Dict [source]¶
- Overview:
Forward computation graph of the NGU R2D2 network. It takes the observation, the previous action and the previous extrinsic reward as input to predict the NGU Q output.
- Arguments:
- inputs (
Dict
): obs (
torch.Tensor
): Encoded observation.prev_state (
list
): Previous state’s tensor of size(B, N)
.
inference (bool): If inference is True, we unroll only one timestep transition; if inference is False, we unroll the whole sequence of transitions.
saved_state_timesteps (Optional[list]): When inference is False and we unroll the sequence of transitions, we save the RNN hidden states at the timesteps listed in saved_state_timesteps.
- Returns:
- outputs (
Dict
): Run
MLP
withDRQN
setups and return the result prediction dictionary.
- ReturnsKeys:
logit (
torch.Tensor
): Logit tensor with same size as inputobs
.next_state (
list
): Next state’s tensor of size(B, N)
.
- Shapes:
obs (
torch.Tensor
): \((B, N=obs_space)\), where B is batch size.prev_state(
torch.FloatTensor list
): \([(B, N)]\).logit (
torch.FloatTensor
): \((B, N)\).next_state(
torch.FloatTensor list
): \([(B, N)]\).
BCQ¶
- class ding.model.BCQ(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, actor_head_hidden_size: List = [400, 300], critic_head_hidden_size: List = [400, 300], activation: Module | None = ReLU(), vae_hidden_dims: List = [750, 750], phi: float = 0.05)[source]¶
- Overview:
Model of BCQ (Batch-Constrained deep Q-learning). Off-Policy Deep Reinforcement Learning without Exploration. https://arxiv.org/abs/1812.02900
- Interface:
forward
,compute_actor
,compute_critic
,compute_vae
,compute_eval
- Property:
mode
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, actor_head_hidden_size: List = [400, 300], critic_head_hidden_size: List = [400, 300], activation: Module | None = ReLU(), vae_hidden_dims: List = [750, 750], phi: float = 0.05) None [source]¶
- Overview:
Initialize neural network, i.e. agent Q network and actor.
- Arguments:
obs_shape (
int
): the dimension of observation stateaction_shape (
int
): the dimension of action shapeactor_hidden_size (
list
): the list of hidden size of actorcritic_hidden_size (:obj:’list’): the list of hidden size of critic
activation (
nn.Module
): Activation function in network, defaults to nn.ReLU().vae_hidden_dims (
list
): the list of hidden sizes of the VAE.
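A minimal construction sketch based on the signature above, consistent with the BCQ(32, 6) shorthand used in the method examples below; the explicit keyword values simply restate the documented defaults.
- Examples:
>>> model = BCQ(obs_shape=32, action_shape=6, actor_head_hidden_size=[400, 300], critic_head_hidden_size=[400, 300], vae_hidden_dims=[750, 750], phi=0.05)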
- compute_actor(inputs: Dict[str, Tensor]) Dict[str, Tensor | Dict[str, Tensor]] [source]¶
- Overview:
Use actor network to compute action.
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsaction
(torch.Tensor
).
- Shapes:
inputs (
Dict
): \((B, N, D)\), where B is batch size, N is sample number, D is input dimension.outputs (
Dict
): \((B, N)\).
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model.compute_actor(inputs)
- compute_critic(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
Use critic network to compute q value.
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsq_value
(torch.Tensor
).
- Shapes:
inputs (
Dict
): \((B, N, D)\), where B is batch size, N is sample number, D is input dimension.outputs (
Dict
): \((B, N)\).
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model.compute_critic(inputs)
- compute_eval(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
Use actor network to compute action.
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsaction
(torch.Tensor
).
- Shapes:
inputs (
Dict
): \((B, N, D)\), where B is batch size, N is sample number, D is input dimension.outputs (
Dict
): \((B, N)\).
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model.compute_eval(inputs)
- compute_vae(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
Use the VAE network to reconstruct the action.
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsrecons_action
(torch.Tensor
),prediction_residual
(torch.Tensor
),input
(torch.Tensor
),mu
(torch.Tensor
),log_var
(torch.Tensor
) andz
(torch.Tensor
).
- Shapes:
inputs (
Dict
): \((B, N, D)\), where B is batch size, N is sample number, D is input dimension.outputs (
Dict
): \((B, N)\).
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model.compute_vae(inputs)
- forward(inputs: Dict[str, Tensor], mode: str) Dict[str, Tensor] [source]¶
- Overview:
The unique execution (forward) method of the BCQ model. One can indicate different modes to implement different computation graphs, including
compute_actor
, compute_critic
, compute_vae and compute_eval in BCQ.- Mode compute_actor:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
output (
Dict
): Output dict data, including action tensor.
- Mode compute_critic:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
output (
Dict
): Output dict data, including q_value tensor.
- Mode compute_vae:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
outputs (
Dict
): Dict containing keywordsrecons_action
(torch.Tensor
),prediction_residual
(torch.Tensor
),input
(torch.Tensor
),mu
(torch.Tensor
),log_var
(torch.Tensor
) andz
(torch.Tensor
).
- Mode compute_eval:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
output (
Dict
): Output dict data, including action tensor.
- Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)} >>> model = BCQ(32, 6) >>> outputs = model(inputs, mode='compute_actor') >>> outputs = model(inputs, mode='compute_critic') >>> outputs = model(inputs, mode='compute_vae') >>> outputs = model(inputs, mode='compute_eval')
Note
For specific examples, one can refer to API doc of
compute_actor
andcompute_critic
respectively.
EDAC¶
- class ding.model.EDAC(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, ensemble_num: int = 2, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, **kwargs)[source]¶
- Overview:
The Q-value Actor-Critic network with the ensemble mechanism, which is used in EDAC.
- Interfaces:
__init__
,forward
,compute_actor
,compute_critic
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType | EasyDict, ensemble_num: int = 2, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, **kwargs) None [source]¶
- Overview:
Initialize the EDAC Model according to input arguments.
- Arguments:
obs_shape (
Union[int, SequenceType]
): Observation’s shape, such as 128, (156, ).action_shape (
Union[int, SequenceType, EasyDict]
): Action’s shape, such as 4, (3, ), EasyDict({‘action_type_shape’: 3, ‘action_args_shape’: 4}).ensemble_num (
int
): Q-net number.actor_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to actor head.actor_head_layer_num (
int
): The num of layers used in the network to compute Q value output for actor head.critic_head_hidden_size (
Optional[int]
): Thehidden_size
to pass to critic head.critic_head_layer_num (
int
): The num of layers used in the network to compute Q value output for critic head.activation (
Optional[nn.Module]
): The type of activation function to use inMLP
after each FC layer, ifNone
then default set tonn.ReLU()
.norm_type (
Optional[str]
): The type of normalization to after network layer (FC, Conv), seeding.torch_utils.network
for more details.
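A minimal construction sketch based on the signature above, consistent with the EDAC(obs_shape=(8, ), action_shape=1) shorthand used in the compute_critic example below.
- Examples:
>>> model = EDAC(obs_shape=8, action_shape=1, ensemble_num=2)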
- compute_actor(obs: Tensor) Dict[str, Tensor | Dict[str, Tensor]] [source]¶
- Overview:
The forward computation graph of compute_actor mode, which uses the observation tensor to produce actor outputs such as
action
,logit
and so on.- Arguments:
obs (
torch.Tensor
): Observation tensor data, now supports a batch of 1-dim vector data, i.e.(B, obs_shape)
.
- Returns:
outputs (
Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]
): Actor output varying from action_space:reparameterization
.
- ReturnsKeys (either):
- logit (
Dict[str, torch.Tensor]
): Reparameterization logit, usually in SAC. mu (
torch.Tensor
): Mean of the parameterized Gaussian distribution.sigma (
torch.Tensor
): Standard deviation of the parameterized Gaussian distribution.
- Shapes:
obs (
torch.Tensor
): \((B, N0)\), B is batch size and N0 corresponds toobs_shape
.action (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toaction_shape
.logit.mu (
torch.Tensor
): \((B, N1)\), B is batch size and N1 corresponds toaction_shape
.logit.sigma (
torch.Tensor
): \((B, N1)\), B is batch size.logit (
torch.Tensor
): \((B, N2)\), B is batch size and N2 corresponds toaction_shape.action_type_shape
.action_args (
torch.Tensor
): \((B, N3)\), B is batch size and N3 corresponds toaction_shape.action_args_shape
.
- Examples:
>>> model = EDAC(64, 64,) >>> obs = torch.randn(4, 64) >>> actor_outputs = model(obs,'compute_actor') >>> assert actor_outputs['logit'][0].shape == torch.Size([4, 64]) # mu >>> actor_outputs['logit'][1].shape == torch.Size([4, 64]) # sigma
- compute_critic(inputs: Dict[str, Tensor]) Dict[str, Tensor] [source]¶
- Overview:
The forward computation graph of compute_critic mode, uses observation and action tensor to produce critic output, such as
q_value
.- Arguments:
inputs (
Dict[str, torch.Tensor]
): Dict strcture of input data, includingobs
andaction
tensor
- Returns:
outputs (
Dict[str, torch.Tensor]
): Critic output, such asq_value
.
- ArgumentsKeys:
obs: (
torch.Tensor
): Observation tensor data, now supports a batch of 1-dim vector data.action (
Union[torch.Tensor, Dict]
): Continuous action with same size asaction_shape
.
- ReturnKeys:
q_value (
torch.Tensor
): Q value tensor with same size as batch size.
- Shapes:
obs (
torch.Tensor
): \((B, N1)\) or \((Ensemble_num, B, N1)\), where B is batch size and N1 is obs_shape
.action (
torch.Tensor
): \((B, N2)\) or \((Ensemble_num, B, N2)\), where B is batch size and N2 is action_shape
.q_value (
torch.Tensor
): \((Ensemble_num, B)\), where B is batch size.
- Examples:
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)} >>> model = EDAC(obs_shape=(8, ),action_shape=1) >>> model(inputs, mode='compute_critic')['q_value'] # q value ... tensor([0.0773, 0.1639, 0.0917, 0.0370], grad_fn=<SqueezeBackward1>)
- forward(inputs: Tensor | Dict[str, Tensor], mode: str) Dict[str, Tensor] [source]¶
- Overview:
The unique execution (forward) method of the EDAC model. One can indicate different modes to implement different computation graphs, including
compute_actor
andcompute_critic
in EDAC.- Mode compute_actor:
- Arguments:
inputs (
torch.Tensor
): Observation data, defaults to tensor.
- Returns:
output (
Dict
): Output dict data, including different key-values among distinct action_space.
- Mode compute_critic:
- Arguments:
inputs (
Dict
): Input dict data, including obs and action tensor.
- Returns:
output (
Dict
): Output dict data, including q_value tensor.
Note
For specific examples, one can refer to API doc of
compute_actor
andcompute_critic
respectively.
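In addition to the method-level examples referenced in the note above, the following sketch routes the same data through forward with an explicit mode string.
- Examples:
>>> model = EDAC(obs_shape=8, action_shape=1, ensemble_num=2)
>>> obs = torch.randn(4, 8)
>>> actor_outputs = model(obs, mode='compute_actor')              # logit is the (mu, sigma) pair
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)}
>>> q_value = model(inputs, mode='compute_critic')['q_value']     # shape (ensemble_num, 4)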
EBM¶
- class ding.model.EBM(obs_shape: int, action_shape: int, hidden_size: int = 512, hidden_layer_num: int = 4, **kwargs)[source]¶
- Overview:
Energy based model.
- Interface:
__init__
,forward
- __init__(obs_shape: int, action_shape: int, hidden_size: int = 512, hidden_layer_num: int = 4, **kwargs)[source]¶
- Overview:
Initialize the EBM.
- Arguments:
obs_shape (
int
): Observation shape.action_shape (
int
): Action shape.hidden_size (
int
): Hidden size.hidden_layer_num (
int
): Number of hidden layers.
- forward(obs, action)[source]¶
- Overview:
Forward computation graph of EBM.
- Arguments:
obs (
torch.Tensor
): Observation of shape (B, N, O).action (
torch.Tensor
): Action of shape (B, N, A).
- Returns:
pred (
torch.Tensor
): Energy of shape (B, N).
- Examples:
>>> obs = torch.randn(2, 3, 4) >>> action = torch.randn(2, 3, 5) >>> ebm = EBM(4, 5) >>> pred = ebm(obs, action)
AutoregressiveEBM¶
- class ding.model.AutoregressiveEBM(obs_shape: int, action_shape: int, hidden_size: int = 512, hidden_layer_num: int = 4)[source]¶
- Overview:
Autoregressive energy based model.
- Interface:
__init__
,forward
- __init__(obs_shape: int, action_shape: int, hidden_size: int = 512, hidden_layer_num: int = 4)[source]¶
- Overview:
Initialize the AutoregressiveEBM.
- Arguments:
obs_shape (
int
): Observation shape.action_shape (
int
): Action shape.hidden_size (
int
): Hidden size.hidden_layer_num (
int
): Number of hidden layers.
- forward(obs, action)[source]¶
- Overview:
Forward computation graph of AutoregressiveEBM.
- Arguments:
obs (
torch.Tensor
): Observation of shape (B, N, O).action (
torch.Tensor
): Action of shape (B, N, A).
- Returns:
pred (
torch.Tensor
): Energy of shape (B, N, A).
- Examples:
>>> obs = torch.randn(2, 3, 4) >>> action = torch.randn(2, 3, 5) >>> arebm = AutoregressiveEBM(4, 5) >>> pred = arebm(obs, action)
VAE¶
- class ding.model.VanillaVAE(action_shape: int, obs_shape: int, latent_size: int, hidden_dims: List = [256, 256], **kwargs)[source]¶
- Overview:
Implementation of Vanilla variational autoencoder for action reconstruction.
- Interfaces:
__init__
,encode
,decode
,decode_with_obs
,reparameterize
,forward
,loss_function
.
- __init__(action_shape: int, obs_shape: int, latent_size: int, hidden_dims: List = [256, 256], **kwargs) None [source]¶
Initialize internal Module state, shared by both nn.Module and ScriptModule.
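The constructor docstring above is the default torch.nn.Module text; the following is a minimal construction sketch based only on the class signature (the concrete sizes are illustrative).
- Examples:
>>> vae = VanillaVAE(action_shape=6, obs_shape=8, latent_size=12, hidden_dims=[256, 256])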
- decode(z: Tensor, obs_encoding: Tensor) Dict[str, Any] [source]¶
- Overview:
Maps the given latent action and obs_encoding onto the original action space.
- Arguments:
z (
torch.Tensor
): the sampled latent actionobs_encoding (
torch.Tensor
): observation encoding
- Returns:
outputs (
Dict
): VAE decoder outputs, such as reconstruction_action and prediction_residual.
- ReturnsKeys:
reconstruction_action (
torch.Tensor
): reconstruction_action.prediction_residual (
torch.Tensor
): prediction_residual.
- Shapes:
z (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent_size
obs_encoding (
torch.Tensor
): \((B, H)\), where B is batch size and H ishidden dim
- decode_with_obs(z: Tensor, obs: Tensor) Dict[str, Any] [source]¶
- Overview:
Maps the given latent action and obs onto the original action space. Using the method self.encode_obs_head(obs) to get the obs_encoding.
- Arguments:
z (
torch.Tensor
): the sampled latent actionobs (
torch.Tensor
): observation
- Returns:
outputs (
Dict
): VAE decoder outputs, such as reconstruction_action and prediction_residual.
- ReturnsKeys:
reconstruction_action (
torch.Tensor
): the action reconstructed by the VAE.prediction_residual (
torch.Tensor
): the observation predicted by the VAE.
- Shapes:
z (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent_size
obs (
torch.Tensor
): \((B, O)\), where B is batch size and O isobs_shape
- encode(input: Dict[str, Tensor]) Dict[str, Any] [source]¶
- Overview:
Encodes the input by passing through the encoder network and returns the latent codes.
- Arguments:
input (
Dict
): Dict containing keywords obs (torch.Tensor
) and action (torch.Tensor
), representing the observation and agent’s action respectively.
- Returns:
outputs (
Dict
): Dict containing keywordsmu
(torch.Tensor
),log_var
(torch.Tensor
) andobs_encoding
(torch.Tensor
) representing latent codes.
- Shapes:
obs (
torch.Tensor
): \((B, O)\), where B is batch size and O isobservation dim
.action (
torch.Tensor
): \((B, A)\), where B is batch size and A isaction dim
.mu (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.log_var (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.obs_encoding (
torch.Tensor
): \((B, H)\), where B is batch size and H ishidden dim
.
- forward(input: Dict[str, Tensor], **kwargs) dict [source]¶
- Overview:
Encode the input, reparameterize mu and log_var, decode obs_encoding.
- Arguments:
input (
Dict
): Dict containing keywords obs (torch.Tensor
) and action (torch.Tensor
), representing the observation and agent’s action respectively.
- Returns:
outputs (
Dict
): Dict containing keywordsrecons_action
(torch.Tensor
),prediction_residual
(torch.Tensor
),input
(torch.Tensor
),mu
(torch.Tensor
),log_var
(torch.Tensor
) andz
(torch.Tensor
).
- Shapes:
recons_action (
torch.Tensor
): \((B, A)\), where B is batch size and A isaction dim
.prediction_residual (
torch.Tensor
): \((B, O)\), where B is batch size and O isobservation dim
.mu (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.log_var (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.z (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent_size
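A minimal usage sketch for forward, assuming the input dict carries the obs and action keywords described in the Arguments above; sizes match the construction sketch after __init__.
- Examples:
>>> vae = VanillaVAE(action_shape=6, obs_shape=8, latent_size=12)
>>> outputs = vae({'obs': torch.randn(4, 8), 'action': torch.randn(4, 6)})
>>> outputs['recons_action'].shape  # torch.Size([4, 6])
>>> outputs['z'].shape              # torch.Size([4, 12])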
- loss_function(args: Dict[str, Tensor], **kwargs) Dict[str, Tensor] [source]¶
- Overview:
Computes the VAE loss function.
- Arguments:
args (
Dict[str, Tensor]
): Dict containing keywordsrecons_action
,prediction_residual
original_action
,mu
,log_var
andtrue_residual
.kwargs (
Dict
): Dict containing keywordskld_weight
andpredict_weight
.
- Returns:
outputs (
Dict[str, Tensor]
): Dict containing differentloss
results, includingloss
,reconstruction_loss
,kld_loss
,predict_loss
.
- Shapes:
recons_action (
torch.Tensor
): \((B, A)\), where B is batch size and A isaction dim
.prediction_residual (
torch.Tensor
): \((B, O)\), where B is batch size and O isobservation dim
.original_action (
torch.Tensor
): \((B, A)\), where B is batch size and A isaction dim
.mu (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.log_var (
torch.Tensor
): \((B, L)\), where B is batch size and L islatent size
.true_residual (
torch.Tensor
): \((B, O)\), where B is batch size and O isobservation dim
.
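A hedged sketch of how loss_function might be called, using only the dictionary keys and keyword arguments documented above; the weight values are illustrative, and filling original_action and true_residual by hand is an assumption about how the forward outputs are completed before computing the loss.
- Examples:
>>> obs, next_obs, action = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 6)
>>> out = vae({'obs': obs, 'action': action})
>>> out['original_action'], out['true_residual'] = action, next_obs - obs   # documented keys, filled by hand
>>> losses = vae.loss_function(out, kld_weight=0.01, predict_weight=0.01)   # illustrative weights
>>> losses['loss'], losses['reconstruction_loss'], losses['kld_loss'], losses['predict_loss']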
- reparameterize(mu: Tensor, logvar: Tensor) Tensor [source]¶
- Overview:
Reparameterization trick to sample from N(mu, var) using samples from N(0, 1).
- Arguments:
mu (
torch.Tensor
): Mean of the latent Gaussianlogvar (
torch.Tensor
): Log variance of the latent Gaussian
- Shapes:
mu (
torch.Tensor
): \((B, L)\), where B is batch size and L is latent_size
logvar (
torch.Tensor
): \((B, L)\), where B is batch size and L is latent_size
Wrapper¶
Please refer to ding/model/wrapper
for more details.
IModelWrapper¶
- class ding.model.IModelWrapper(model: Module)[source]¶
- Overview:
The basic interface class of model wrappers. A model wrapper wraps a torch.nn.Module model and adds extra operations to it, such as hidden state maintenance for RNN-based models, argmax action selection for discrete action spaces, etc.
- Interfaces:
__init__
,__getattr__
,info
,reset
,forward
.
- __getattr__(key: str) Any [source]¶
- Overview:
Get the original attributes of the torch.nn.Module model, such as variables and methods defined in the model.
- Arguments:
key (
str
): The string key to query.
- Returns:
ret (
Any
): The queried attribute.
- __init__(model: Module) None [source]¶
- Overview:
Initialize the model and other necessary member variables in the model wrapper.
- forward(*args, **kwargs) Any [source]¶
- Overview:
Basic interface, call the wrapped model’s forward method. Other derived model wrappers can override this method to add some extra operations.
- info(attr_name: str) str [source]¶
- Overview:
Get some string information of the indicated
attr_name
, which is used for debug wrappers. This method will recursively search for the indicatedattr_name
.- Arguments:
attr_name (
str
): The string key to query information.
- Returns:
info_string (
str
): The information string of the indicatedattr_name
.
- reset(data_id: List[int] | None = None, **kwargs) None [source]¶
- Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of an RNN. Here we do nothing and just implement this interface method. Other derived model wrappers can override this method to add some extra operations.
- Arguments:
data_id (
List[int]
): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id
argument to reset the stateful variables of the indicated data.
model_wrap¶
- ding.model.model_wrap(model: Module | IModelWrapper, wrapper_name: str | None = None, **kwargs)[source]¶
- Overview:
Wrap the model with the specified wrapper and return the wrapped model.
- Arguments:
model (
Any
): The model to be wrapped.wrapper_name (
str
): The name of the wrapper to be used.
Note
The arguments of the wrapper should be passed in as kwargs.
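A hedged usage sketch: wrap a DQN-style model with a sampling wrapper and pass the wrapper's own arguments as kwargs. The wrapper name 'argmax_sample' and the returned 'action' key are assumptions to verify against wrapper_name_map and the concrete wrapper implementation.
- Examples:
>>> from ding.model import DQN, model_wrap
>>> model = DQN(obs_shape=4, action_shape=2)
>>> wrapped_model = model_wrap(model, wrapper_name='argmax_sample')
>>> output = wrapped_model.forward(torch.randn(3, 4))  # assumed to add an 'action' entry with the argmax actions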
register_wrapper¶
- ding.model.register_wrapper(name: str, wrapper_type: type) None [source]¶
- Overview:
Register new wrapper to
wrapper_name_map
. When user implements a new wrapper, they must call this function to complete the registration. Then the wrapper can be called bymodel_wrap
.- Arguments:
name (
str
): The name of the new wrapper to be registered.wrapper_type (
type
): The wrapper class needs to be added inwrapper_name_map
. This argument should be the subclass ofIModelWrapper
.
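A sketch of registering a custom wrapper: subclass IModelWrapper, register it under a new name, and then obtain it through model_wrap. The wrapper class, the name 'my_wrapper' and the wrapped model variable are hypothetical.
- Examples:
>>> class MyWrapper(IModelWrapper):
>>>     def forward(self, *args, **kwargs):
>>>         # extra operations could be added here before/after the wrapped forward
>>>         return super().forward(*args, **kwargs)
>>> register_wrapper('my_wrapper', MyWrapper)
>>> wrapped_model = model_wrap(model, wrapper_name='my_wrapper')  # model is an existing torch.nn.Module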
BaseModelWrapper¶
- class ding.model.wrapper.model_wrappers.BaseModelWrapper(model: Module)[source]¶
- Overview:
Placeholder class for the model wrapper. This class is used to wrap the model without any extra operations, including a empty
reset
method and aforward
method which directly call the wrapped model’s forward. To keep the consistency of the model wrapper interface, we use this class to wrap the model without specific operations in the implementation of DI-engine’s policy.
- forward(*args, **kwargs) Any ¶
- Overview:
Basic interface, call the wrapped model’s forward method. Other derived model wrappers can override this method to add some extra operations.
- reset(data_id: List[int] | None = None, **kwargs) None ¶
- Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of an RNN. Here we do nothing and just implement this interface method. Other derived model wrappers can override this method to add some extra operations.
- Arguments:
data_id (
List[int]
): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id
argument to reset the stateful variables of the indicated data.
ArgmaxSampleWrapper¶
MultinomialSampleWrapper¶
EpsGreedySampleWrapper¶
- class ding.model.wrapper.model_wrappers.EpsGreedySampleWrapper(model: Module)[source]¶
- Overview:
Epsilon greedy sampler used in collector_model to help balance exploration and exploitation. The type of eps can vary across algorithms: a float (i.e. a Python native scalar) for the common case, or a Dict[str, float] for the NGU algorithm.
- Interfaces:
forward
.
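A hedged sketch of typical usage: obtain the wrapper through model_wrap and pass the current eps value at each forward call. The registered name 'eps_greedy_sample' and the eps keyword are assumptions to verify against the wrapper registration and implementation.
- Examples:
>>> collector_model = model_wrap(model, wrapper_name='eps_greedy_sample')  # model is an existing discrete-action model
>>> output = collector_model.forward(torch.randn(3, 4), eps=0.1)           # assumed to return an epsilon-greedily sampled 'action'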
EpsGreedyMultinomialSampleWrapper¶
DeterministicSampleWrapper¶
ReparamSampleWrapper¶
CombinationArgmaxSampleWrapper¶
CombinationMultinomialSampleWrapper¶
HybridArgmaxSampleWrapper¶
HybridEpsGreedySampleWrapper¶
HybridEpsGreedyMultinomialSampleWrapper¶
- class ding.model.wrapper.model_wrappers.HybridEpsGreedyMultinomialSampleWrapper(model: Module)[source]¶
- Overview:
Epsilon greedy sampler coupled with multinomial sampling, used in collector_model to help balance exploration and exploitation in a hybrid action space, i.e. {'action_type': discrete, 'action_args': continuous}.
- Interfaces:
forward
.
HybridReparamMultinomialSampleWrapper¶
- class ding.model.wrapper.model_wrappers.HybridReparamMultinomialSampleWrapper(model: Module)[source]¶
- Overview:
Reparameterization sampler coupled with multinomial sampling, used in collector_model to help balance exploration and exploitation in a hybrid action space, i.e. {'action_type': discrete, 'action_args': continuous}.
- Interfaces:
forward
HybridDeterministicArgmaxSampleWrapper¶
ActionNoiseWrapper¶
- class ding.model.wrapper.model_wrappers.ActionNoiseWrapper(model: Any, noise_type: str = 'gauss', noise_kwargs: dict = {}, noise_range: dict | None = None, action_range: dict | None = {'max': 1, 'min': -1})[source]¶
- Overview:
Add noise to the collector's action output, and clip both the generated noise and the action after the noise is added.
- Interfaces:
__init__
,forward
.- Arguments:
model (
Any
): Wrapped model class. Should containforward
method.noise_type (
str
): The type of noise that should be generated, support [‘gauss’, ‘ou’].noise_kwargs (
dict
): Keyword args that should be used in noise init. Depends onnoise_type
.noise_range (
Optional[dict]
): Range of noise, used for clipping.action_range (
Optional[dict]
): Range of action + noise, used for clip, default clip to [-1, 1].
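A hedged sketch of creating this wrapper through model_wrap with Gaussian noise; the registered name 'action_noise' and the noise_kwargs fields are assumptions to check against the actual wrapper and noise-generator registration.
- Examples:
>>> collector_model = model_wrap(
>>>     model,  # an existing continuous-action model
>>>     wrapper_name='action_noise',
>>>     noise_type='gauss',
>>>     noise_kwargs={'mu': 0.0, 'sigma': 0.1},
>>>     noise_range={'min': -0.5, 'max': 0.5},
>>>     action_range={'min': -1, 'max': 1},
>>> )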
TargetNetworkWrapper¶
- class ding.model.wrapper.model_wrappers.TargetNetworkWrapper(model: Any, update_type: str, update_kwargs: dict)[source]¶
- Overview:
Maintain and update the target network
- Interfaces:
update, reset
- __init__(model: Any, update_type: str, update_kwargs: dict)[source]¶
- Overview:
Initialize the model and other necessary member variables in the model wrapper.
- forward(*args, **kwargs) Any ¶
- Overview:
Basic interface, call the wrapped model’s forward method. Other derived model wrappers can override this method to add some extra operations.
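A hedged sketch of creating a target network through model_wrap; the registered name 'target', the 'momentum' update type and the theta keyword are assumptions drawn from common DI-engine usage and should be verified against the wrapper implementation.
- Examples:
>>> target_model = model_wrap(model, wrapper_name='target', update_type='momentum', update_kwargs={'theta': 0.005})
>>> target_model.update(model.state_dict())  # soft-update the target parameters from the online model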
TransformerInputWrapper¶
- class ding.model.wrapper.model_wrappers.TransformerInputWrapper(model: ~typing.Any, seq_len: int, init_fn: ~typing.Callable = <function TransformerInputWrapper.<lambda>>)[source]¶
- __init__(model: ~typing.Any, seq_len: int, init_fn: ~typing.Callable = <function TransformerInputWrapper.<lambda>>) None [source]¶
- Overview:
Given N, the length of the sequences received by a Transformer model, maintain the last N-1 input observations. In this way we can provide, at each step, all the observations the Transformer needs to compute its output. We need this because some methods such as ‘collect’ and ‘evaluate’ only provide the model one observation per step and keep no memory of past observations, while the Transformer needs a sequence of N observations. The wrapper method
forward
will save the input observation in a FIFO memory of length N and the methodreset
will reset the memory. The empty memory spaces will be initialized with ‘init_fn’ or zero by calling the methodreset_input
. Since different env can terminate at different steps, the methodreset_memory_entry
only initializes the memory of specific environments in the batch.- Arguments:
model (
Any
): Wrapped model class, should contain forward method.seq_len (
int
): Number of past observations to remember.init_fn (
Callable
): The function which is used to init every memory locations when init and reset.
- forward(input_obs: Tensor, only_last_logit: bool = True, data_id: List | None = None, **kwargs) Dict[str, Tensor] [source]¶
- Arguments:
input_obs (
torch.Tensor
): Input observation without sequence shape:(bs, *obs_shape)
.only_last_logit (
bool
): If True, ‘logit’ only contains the output corresponding to the current observation (shape: (bs, embedding_dim)); otherwise, ‘logit’ has shape (seq_len, bs, embedding_dim).data_id (
List
): IDs of the envs that are currently running. Memory updates and returned logits only take effect for those environments. If None, all envs are considered to be running.
- Returns:
Dictionary containing the input_sequence ‘input_seq’ stored in memory and the transformer output ‘logit’.
- reset(*args, **kwargs)[source]¶
- Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of an RNN. Here we do nothing and just implement this interface method. Other derived model wrappers can override this method to add some extra operations.
- Arguments:
data_id (
List[int]
): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id
argument to reset the stateful variables of the indicated data.
TransformerSegmentWrapper¶
- class ding.model.wrapper.model_wrappers.TransformerSegmentWrapper(model: Any, seq_len: int)[source]¶
- __init__(model: Any, seq_len: int) None [source]¶
- Overview:
Given T the length of a trajectory and N the length of the sequences received by a Transformer model, split T in sequences of N elements and forward each sequence one by one. If T % N != 0, the last sequence will be zero-padded. Usually used during Transformer training phase.
- Arguments:
model (
Any
): Wrapped model class, should contain forward method.seq_len (
int
): N, length of a sequence.
TransformerMemoryWrapper¶
- class ding.model.wrapper.model_wrappers.TransformerMemoryWrapper(model: Any, batch_size: int)[source]¶
- __init__(model: Any, batch_size: int) None [source]¶
- Overview:
- Stores a copy of the Transformer memory so that it can be reused across different phases. To make it
clearer, suppose the training pipeline is divided into 3 phases: evaluate, collect, learn. The goal of the wrapper is to keep the content of the memory at the end of each phase and reuse it when the same phase is executed again. In this way, it prevents different phases from interfering with each other's memory.
- Arguments:
model (
Any
): Wrapped model class, should contain forward method.batch_size (
int
): Memory batch size.
- forward(*args, **kwargs) Dict[str, Tensor] [source]¶
- Arguments:
data (
dict
): Dict type data, including at least [‘main_obs’, ‘target_obs’, ‘action’, ‘reward’, ‘done’, ‘weight’]
- Returns:
Output of the forward method.
- reset(*args, **kwargs)[source]¶
- Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of an RNN. Here we do nothing and just implement this interface method. Other derived model wrappers can override this method to add some extra operations.
- Arguments:
data_id (
List[int]
): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id
argument to reset the stateful variables of the indicated data.