ding.policy

Base Policy

Please refer to ding/policy/base_policy.py for more details.

Policy

class ding.policy.Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

The basic class of Reinforcement Learning (RL) and Imitation Learning (IL) policy in DI-engine.

Property:

cfg, learn_mode, collect_mode, eval_mode

__init__(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None) None[source]
Overview:

Initialize the policy instance according to the input configuration and model. This method will initialize different fields in the policy, including learn, collect and eval. The learn field is used to train the policy, the collect field is used to collect data for training, and the eval field is used to evaluate the policy. The enable_field argument specifies which fields to initialize; if it is None, all fields will be initialized.

Arguments:
  • cfg (EasyDict): The final merged config used to initialize policy. For the default config, see the config attribute and its comments of policy class.

  • model (torch.nn.Module): The neural network model used to initialize policy. If it is None, then the model will be created according to the default_model method and the cfg.model field. Otherwise, the model will be set to the model instance created by the outside caller.

  • enable_field (Optional[List[str]]): The field list to initialize. If it is None, then all fields will be initialized. Otherwise, only the fields in enable_field will be initialized, which is beneficial to save resources.

Note

For the derived policy class, it should implement the _init_learn, _init_collect and _init_eval methods to initialize the corresponding fields.
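
For instance, a concrete policy such as DQNPolicy (documented below) can be constructed with only a subset of fields enabled. A minimal sketch, assuming cfg is a complete merged config and my_model is a network built by the caller (both are placeholders):

>>> from ding.policy import DQNPolicy
>>> # Only the learn field is initialized; collect/eval related modules are skipped.
>>> policy = DQNPolicy(cfg, model=my_model, enable_field=['learn'])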

__repr__() str[source]
Overview:

Get the string representation of the policy.

Returns:
  • repr (str): The string representation of the policy.

_create_model(cfg: EasyDict, model: Module | None = None) Module[source]
Overview:

Create or validate the neural network model according to the input configuration and model. If the input model is None, the model will be created according to the default_model method and the cfg.model field. Otherwise, the model will be verified as an instance of torch.nn.Module and set to the model instance created by the outside caller.

Arguments:
  • cfg (EasyDict): The final merged config used to initialize policy.

  • model (torch.nn.Module): The neural network model used to initialize policy. User can refer to the default model defined in the corresponding policy to customize its own model.

Returns:
  • model (torch.nn.Module): The created neural network model. The different modes of policy will add distinct wrappers and plugins to the model, which is used to train, collect and evaluate.

Raises:
  • RuntimeError: If the input model is not None and is not an instance of torch.nn.Module.

abstract _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs, or the action logits to calculate the loss in learn mode. This method is left to be implemented by the subclass, and more arguments can be added in kwargs part if necessary.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

abstract _forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance, such as interacting with envs or computing metrics on a validation dataset). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. This method is left to be implemented by the subclass.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

abstract _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss value, policy entropy, q value, priority, and so on. This method is left to be implemented by the subclass, and more arguments can be added in data item if necessary.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, in the _forward_learn method, the data should be stacked in the batch dimension by some utility functions such as default_preprocess_learn.

Returns:
  • output (Dict[str, Any]): The training information of policy forward, including some metrics for monitoring training such as loss, priority, q value and policy entropy, and some data for the next training step such as priority. Note that the output data items should be Python native scalars rather than PyTorch tensors, which is convenient for external use.

_get_attribute(name: str) Any[source]
Overview:

In order to control access to the policy attributes, we expose different modes to the outside rather than the policy instance itself. We also provide this method to get the attribute of the policy in different modes.

Arguments:
  • name (str): The name of the attribute.

Returns:
  • value (Any): The value of the attribute.

Note

DI-engine’s policy will first try to access the _get_{name} method, and then try to access the _{name} attribute. If neither of them is found, it will raise a NotImplementedError.
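
A minimal sketch of the lookup order described above (the attribute name attr1 is purely illustrative):

>>> # Resolution order: policy._get_attr1() if it exists, otherwise policy._attr1;
>>> # if neither exists, a NotImplementedError is raised.
>>> value = policy._get_attribute('attr1')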

abstract _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (transitions, a list of transition) data, process it into a list of samples that can be used for training directly. A train sample can be a processed transition (DQN with nstep TD) or several multi-timestep transitions (DRQN). This method is usually used in collectors to execute the necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), where each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples, where each element is in a similar format to the input transitions, but may contain more data for training, such as nstep reward, advantage, etc.

Note

We will vectorize the process_transition and get_train_sample methods in a following release version. Users can customize this data processing procedure by overriding these two methods and the collector itself.
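
A minimal sketch of the identity variant mentioned above, which leaves all processing to self._forward_learn; real policies (e.g. DQN with nstep TD) usually compute extra fields such as the nstep reward here instead:

>>> def _get_train_sample(self, transitions):
...     # Identity implementation: return the raw transitions unchanged and do the
...     # n-step/advantage processing in _forward_learn instead.
...     return transitions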

abstract _init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. This method will be called in the __init__ method if the collect field is in enable_field. Almost every policy has its own collect mode, so this method must be overridden in the subclass.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_collect and _load_state_dict_collect methods.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

abstract _init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. This method will be called in the __init__ method if the eval field is in enable_field. Almost every policy has its own eval mode, so this method must be overridden in the subclass.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_eval and _load_state_dict_eval methods.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

abstract _init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. This method will be called in the __init__ method if the learn field is in enable_field. Almost every policy has its own learn mode, so this method must be overridden in the subclass.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_init_multi_gpu_setting(model: Module, bp_update_sync: bool) None[source]
Overview:

Initialize multi-gpu data parallel training setting, including broadcast model parameters at the beginning of the training, and prepare the hook function to allreduce the gradients of model parameters.

Arguments:
  • model (torch.nn.Module): The neural network model to be trained.

  • bp_update_sync (bool): Whether to update the model parameters synchronously after allreducing the gradients of the model parameters. Asynchronous update can be parallelized across different network layers, like a pipeline, so that it can save time.

_load_state_dict_collect(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy collect mode, such as loading a pretrained state_dict, auto-recovering from a checkpoint, or loading a model replica from the learner in distributed training scenarios.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy collect state saved before.

Tip

If you only want to load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_load_state_dict_eval(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy eval mode, such as loading an auto-recovered checkpoint, or a model replica from the learner in distributed training scenarios.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy eval state saved before.

Tip

If you only want to load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you only want to load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

Tip

The default implementation is ['cur_lr', 'total_loss']. Other derived classes can overwrite this method to add their own keys if necessary.
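
A hedged sketch of how a derived policy can extend the default keys (the extra variable names are illustrative and must match keys returned by its _forward_learn):

>>> def _monitor_vars_learn(self):
...     return super()._monitor_vars_learn() + ['q_value', 'target_q_value']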

abstract _process_transition(obs: Tensor | Dict[str, Tensor], policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, such as <s, a, r, s’, done>. Some policies need to do some special processing and pack their own necessary attributes (e.g. hidden state and logit), so this method is left to be implemented by the subclass.

Arguments:
  • obs (Union[torch.Tensor, Dict[str, torch.Tensor]]): The observation of the current timestep.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. Usually, it contains the action and the logit of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
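
A minimal DQN-style sketch of this method (the exact keys that a concrete policy packs may differ; see DQNPolicy._process_transition below):

>>> def _process_transition(self, obs, policy_output, timestep):
...     return {
...         'obs': obs,
...         'next_obs': timestep.obs,   # observation returned by the env step
...         'action': policy_output['action'],
...         'reward': timestep.reward,
...         'done': timestep.done,
...     }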

_reset_collect(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for collect mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different environments/episodes listed in data_id will have different RNN hidden states during collecting.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.

Note

This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.

_reset_eval(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for eval mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different environments/episodes listed in data_id will have different RNN hidden states during evaluation.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.

Note

This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.

_reset_learn(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for learn mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different trajectories listed in data_id will have different RNN hidden states.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.

Note

This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.

_set_attribute(name: str, value: Any) None[source]
Overview:

In order to control access to the policy attributes, we expose different modes to the outside rather than the policy instance itself. We also provide this method to set the attribute of the policy in different modes. The new attribute will be named _{name}.

Arguments:
  • name (str): The name of the attribute.

  • value (Any): The value of the attribute.

_state_dict_collect() Dict[str, Any][source]
Overview:

Return the state_dict of collect mode, usually only including the model, which is necessary for distributed training scenarios to auto-recover collectors.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy collect state, for saving and restoring.

Tip

Not all scenarios need to auto-recover collectors; sometimes we can directly shut down the crashed collector and start a new one.

_state_dict_eval() Dict[str, Any][source]
Overview:

Return the state_dict of eval mode, usually only including the model, which is necessary for distributed training scenarios to auto-recover evaluators.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy eval state, for saving and restoring.

Tip

Not all scenarios need to auto-recover evaluators; sometimes we can directly shut down the crashed evaluator and start a new one.

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model and optimizer.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

property collect_mode: collect_function
Overview:

Return the interfaces of collect mode of policy, which are used to collect training data. Here we use a namedtuple to define immutable interfaces and restrict the usage of the policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own collect mode.

Returns:
  • interfaces (Policy.collect_function): The interfaces of collect mode of policy, it is a namedtuple whose values of distinct fields are different internal methods.

Examples:
>>> policy = Policy(cfg, model)
>>> policy_collect = policy.collect_mode
>>> obs = env_manager.ready_obs
>>> inference_output = policy_collect.forward(obs)
>>> next_obs, rew, done, info = env_manager.step(inference_output.action)
classmethod default_config() EasyDict[source]
Overview:

Get the default config of policy. This method is used to create the default config of policy.

Returns:
  • cfg (EasyDict): The default config of corresponding policy. For the derived policy class, it will recursively merge the default config of base class and its own default config.

Tip

This method will deepcopy the config attribute of the class and return the result, so users don’t need to worry about modifying the returned config.
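
For example (using DQNPolicy, whose config is documented below):

>>> # Safe to modify: the returned config is a deepcopy of the class attribute.
>>> cfg = DQNPolicy.default_config()
>>> cfg.learn.batch_size = 32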

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm’s default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

Note

The user can define and use a customized network model but must obey the same interface definition indicated by the import_names path. For example, for DQN, its registered name is dqn and the import_names is ding.model.template.q_learning.DQN

property eval_mode: eval_function
Overview:

Return the interfaces of eval mode of policy, which are used to evaluate the policy. Here we use a namedtuple to define immutable interfaces and restrict the usage of the policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own eval mode.

Returns:
  • interfaces (Policy.eval_function): The interfaces of eval mode of policy, it is a namedtuple whose values of distinct fields are different internal methods.

Examples:
>>> policy = Policy(cfg, model)
>>> policy_eval = policy.eval_mode
>>> obs = env_manager.ready_obs
>>> inference_output = policy_eval.forward(obs)
>>> next_obs, rew, done, info = env_manager.step(inference_output.action)
property learn_mode: learn_function
Overview:

Return the interfaces of learn mode of policy, which are used to train the model. Here we use a namedtuple to define immutable interfaces and restrict the usage of the policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own learn mode.

Returns:
  • interfaces (Policy.learn_function): The interfaces of learn mode of policy, it is a namedtuple whose values of distinct fields are different internal methods.

Examples:
>>> policy = Policy(cfg, model)
>>> policy_learn = policy.learn_mode
>>> train_output = policy_learn.forward(data)
>>> state_dict = policy_learn.state_dict()
sync_gradients(model: Module) None[source]
Overview:

Synchronize (allreduce) gradients of model parameters in data-parallel multi-gpu training.

Arguments:
  • model (torch.nn.Module): The model to synchronize gradients.

Note

This method is only used in multi-gpu training, and it should be called after the backward method and before the step method of the optimizer. The user can also use the bp_update_sync config to control whether to synchronize the gradient allreduce and the optimizer update.
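
A hedged sketch of the call order in data-parallel multi-gpu training (the optimizer and model attribute names are illustrative):

>>> self._optimizer.zero_grad()
>>> loss.backward()
>>> self.sync_gradients(self._learn_model)  # allreduce gradients across processes
>>> self._optimizer.step()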

CommandModePolicy

class ding.policy.CommandModePolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy with command mode, which can be used in the old version of the DI-engine pipeline: serial_pipeline. CommandModePolicy uses the _get_setting_learn, _get_setting_collect and _get_setting_eval methods to exchange information between different workers.

Interface:

_init_command, _get_setting_learn, _get_setting_collect, _get_setting_eval

Property:

command_mode

abstract _get_setting_collect(command_info: Dict[str, Any]) Dict[str, Any][source]
Overview:

According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of collect mode, which contains dynamically changed hyperparameters for collect mode, such as eps, temperature, etc.

Arguments:
  • command_info (Dict[str, Any]): The global training information, which is defined in commander.

Returns:
  • setting (Dict[str, Any]): The latest setting of collect mode, which is usually used as extra arguments of the policy._forward_collect method.
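
A hedged sketch of a possible implementation for an epsilon-greedy policy; the schedule object and the command_info key are illustrative:

>>> def _get_setting_collect(self, command_info):
...     # Map the current collected env step to an exploration rate and pass it on
...     # to _forward_collect as the extra `eps` keyword argument.
...     return {'eps': self._eps_schedule(command_info['envstep'])}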

abstract _get_setting_eval(command_info: Dict[str, Any]) Dict[str, Any][source]
Overview:

According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of eval mode, which contains dynamically changed hyperparameters for eval mode, such as temperature, etc.

Arguments:
  • command_info (Dict[str, Any]): The global training information, which is defined in commander.

Returns:
  • setting (Dict[str, Any]): The latest setting of eval mode, which is usually used as extra arguments of the policy._forward_eval method.

abstract _get_setting_learn(command_info: Dict[str, Any]) Dict[str, Any][source]
Overview:

According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of learn mode, which contains dynamically changed hyperparameters for learn mode, such as batch_size, learning_rate, etc.

Arguments:
  • command_info (Dict[str, Any]): The global training information, which is defined in commander.

Returns:
  • setting (Dict[str, Any]): The latest setting of learn mode, which is usually used as extra arguments of the policy._forward_learn method.

abstract _init_command() None[source]
Overview:

Initialize the command mode of policy, including related attributes and modules. This method will be called in the __init__ method if the command field is in enable_field. Almost every policy has its own command mode, so this method must be overridden in the subclass.

Note

If you want to set some special member variables in the _init_command method, you’d better name them with the prefix _command_ to avoid conflicts with other modes, such as self._command_attr1.

property command_mode: Policy.command_function
Overview:

Return the interfaces of command mode of policy, which are used to adjust hyperparameters and exchange information between different workers. Here we use a namedtuple to define immutable interfaces and restrict the usage of the policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own command mode.

Returns:
  • interfaces (Policy.command_function): The interfaces of command mode, it is a namedtuple whose values of distinct fields are different internal methods.

Examples:
>>> policy = CommandModePolicy(cfg, model)
>>> policy_command = policy.command_mode
>>> settings = policy_command.get_setting_learn(command_info)

create_policy

ding.policy.create_policy(cfg: EasyDict, **kwargs) Policy[source]
Overview:

Create a policy instance according to cfg and other kwargs.

Arguments:
  • cfg (EasyDict): Final merged policy config.

ArgumentsKeys:
  • type (str): The policy type set in the POLICY_REGISTRY.register method, such as dqn.

  • import_names (List[str]): A list of module names (paths) to import before creating the policy, such as ding.policy.dqn.

Returns:
  • policy (Policy): The created policy instance.

Tip

kwargs contains other arguments that need to be passed to the policy constructor. You can refer to the __init__ method of the corresponding policy class for details.

Note

For more details about how to merge configs, please refer to the system document of DI-engine.
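
A minimal usage sketch, assuming cfg is the merged policy config containing the type and import_names keys described above:

>>> from ding.policy import create_policy
>>> policy = create_policy(cfg, enable_field=['learn', 'collect', 'eval'])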

get_policy_cls

ding.policy.get_policy_cls(cfg: EasyDict) type[source]
Overview:

Get policy class according to cfg, which is used to access related class variables/methods.

Arguments:
  • cfg (EasyDict): Final merged policy config.

ArgumentsKeys:
  • type (str): The policy type set in the POLICY_REGISTRY.register method, such as dqn.

  • import_names (List[str]): A list of module names (paths) to import before creating the policy, such as ding.policy.dqn.

Returns:
  • policy (type): The policy class.
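
A minimal usage sketch: fetch the class to access class-level attributes such as default_config without instantiating the policy:

>>> from ding.policy import get_policy_cls
>>> policy_cls = get_policy_cls(cfg)
>>> default_cfg = policy_cls.default_config()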

DQN

Please refer to ding/policy/dqn.py for more details.

DQNPolicy

class ding.policy.DQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of DQN algorithm, extended by Double DQN/Dueling DQN/PER/multi-step TD.

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | dqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different across modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True | |
| 6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
| 7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 8 | model.dueling | bool | True | Dueling head architecture | |
| 9 | model.encoder_hidden_size_list | list (int) | [32, 64, 64, 128] | Sequence of hidden_size of subsequent conv layers and the final dense layer | Default kernel_size is [8, 4, 3], default stride is [4, 2, 1] |
| 10 | model.dropout | float | None | Dropout rate for dropout layers, in [0, 1] | If set to None, no dropout is used |
| 11 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy |
| 12 | learn.batch_size | int | 64 | The number of samples of an iteration | |
| 13 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration | |
| 14 | learn.target_update_freq | int | 100 | Frequency of target network update | Hard (assign) update |
| 15 | learn.target_theta | float | 0.005 | Coefficient of soft target network update. Only one of [target_update_freq, target_theta] should be set | Soft (assign) update |
| 16 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake-termination envs |
| 17 | collect.n_sample | int | [8, 128] | The number of training samples of one call of the collector | It varies across envs |
| 18 | collect.n_episode | int | 8 | The number of training episodes of one call of the collector | Only one of [n_sample, n_episode] should be set |
| 19 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1 |
| 20 | other.eps.type | str | exp | Exploration rate decay type | Supports ['exp', 'linear'] |
| 21 | other.eps.start | float | 0.95 | Start value of exploration rate, in [0, 1] | |
| 22 | other.eps.end | float | 0.1 | End value of exploration rate, in [0, 1] | |
| 23 | other.eps.decay | int | 10000 | Decay length of exploration | Greater than 0. decay=10000 means the exploration rate decays from the start value to the end value over the decay length |

_forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs eps argument for exploration, i.e., classic epsilon-greedy exploration strategy.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

  • eps (float): The epsilon value for exploration.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.
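
A hedged usage sketch, following the collect_mode example above but with the extra eps argument required by DQN's epsilon-greedy exploration (the value 0.1 is illustrative and usually comes from an exploration schedule):

>>> policy_collect = policy.collect_mode
>>> obs = env_manager.ready_obs
>>> inference_output = policy_collect.forward(obs, eps=0.1)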

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.
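
A schematic sketch (notation only, not runnable code) of the n-step TD target that the DQN learn step optimizes, assuming the keys listed above and a separate target network:

target_q = sum_{i=0..n-1} gamma^i * r_{t+i} + gamma^n * max_a Q_target(s_{t+n}, a)
loss = TD error between Q(s_t, a_t) and target_q, optionally weighted by the PER importance-sampling weight when priority_IS_weight is True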

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (transitions, a list of transition) data, process it into a list of samples that can be used for training directly. In DQN with nstep TD, a train sample is a processed transition. This method is usually used in collectors to execute the necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), where each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples, each element is similar in format to input transitions, but may contain more data for training, such as nstep reward and target obs.

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For DQN, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sampling mechanism, and other algorithm-specific arguments such as unroll_len and nstep. This method will be called in the __init__ method if the collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

Tip

Some variables need to be initialized independently in different modes, such as gamma and nstep in DQN. This design is for the convenience of parallel execution of different policy modes.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For DQN, it contains the eval model to greedily select actions with the argmax q_value mechanism. This method will be called in the __init__ method if the eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For DQN, it mainly contains the optimizer, algorithm-specific arguments such as nstep and gamma, and the main and target models. This method will be called in the __init__ method if the learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you only want to load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DQN, it contains obs, next_obs, action, reward, done.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For DQN, it contains the action and the logit (q_value) of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model, target_model and optimizer.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm’s default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

Note

The user can define and use a customized network model but must obey the same interface definition indicated by the import_names path. For example, for DQN, its registered name is dqn and the import_names is ding.model.template.q_learning.

DQNSTDIMPolicy

class ding.policy.DQNSTDIMPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of DQN algorithm, extended by ST-DIM auxiliary objectives. ST-DIM paper link: https://arxiv.org/abs/1906.08226.

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | dqn_stdim | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different across modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True | |
| 6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
| 7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 8 | learn.update_per_collect_gpu | int | 3 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy |
| 10 | learn.batch_size | int | 64 | The number of samples of an iteration | |
| 11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration | |
| 12 | learn.target_update_freq | int | 100 | Frequency of target network update | Hard (assign) update |
| 13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake-termination envs |
| 14 | collect.n_sample | int | [8, 128] | The number of training samples of one call of the collector | It varies across envs |
| 15 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1 |
| 16 | other.eps.type | str | exp | Exploration rate decay type | Supports ['exp', 'linear'] |
| 17 | other.eps.start | float | 0.95 | Start value of exploration rate, in [0, 1] | |
| 18 | other.eps.end | float | 0.1 | End value of exploration rate, in [0, 1] | |
| 19 | other.eps.decay | int | 10000 | Decay length of exploration | Greater than 0. decay=10000 means the exploration rate decays from the start value to the end value over the decay length |
| 20 | aux_loss_weight | float | 0.001 | The ratio of the auxiliary loss to the TD loss | Any real value, typically in [-0.1, 0.1] |

_forward_learn(data: Dict[str, Any]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, priority, aux_loss.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DQNSTDIM, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For DQNSTDIM, it first calls the super class’s _init_learn method, then initializes the extra auxiliary model, its optimizer, and the loss weight. This method will be called in the __init__ method if the learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): the dict of policy learn state saved before.

Tip

If you only want to load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_model_encode(data: dict) Tuple[Tensor][source]
Overview:

Get the encoding of the main model as input for the auxiliary model.

Arguments:
  • data (dict): Dict type data, same as the _forward_learn input.

Returns:
  • (Tuple[torch.Tensor]): The tuple of two tensors used for contrastive embedding learning. In the ST-DIM algorithm, these two variables are the DQN encodings of obs and next_obs respectively.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model and optimizer.

Returns:
  • state_dict (Dict[str, Any]): the dict of current policy learn state, for saving and restoring.

PPO

Please refer to ding/policy/ppo.py for more details.

PPOPolicy

class ding.policy.PPOPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of the on-policy version of the PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347.

_forward_collect(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Tip

If you want to add more tricks on this policy, like temperature factor in multinomial sample, you can pass related data as extra keyword arguments of this method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs. _forward_eval in PPO often uses a deterministic sampling method to get actions, while _forward_collect usually uses a stochastic sampling method to balance exploration and exploitation.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.

_forward_learn(data: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac, approx_kl.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including the latest collected training samples for on-policy algorithms like PPO. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPO, each element in the list is a dict containing at least the following keys: obs, action, reward, logit, value, done. Sometimes, it also contains other keys such as weight.

Returns:
  • return_infos (List[Dict[str, Any]]): The information list that indicates the training result; each training iteration appends an information dict to the final list. The list will be processed and recorded in the text log and tensorboard. The values of the dict must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Tip

The training procedure of PPO consists of two for-loops: the outer loop trains on all the collected training samples for epoch_per_collect epochs, and the inner loop splits the data into mini-batches of size batch_size.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.
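
A schematic sketch of the two-loop structure mentioned in the Tip above (the mini-batch split helper and the training call are illustrative; the real logic lives inside _forward_learn):

>>> for epoch in range(epoch_per_collect):
...     for batch in minibatch_split(data, batch_size):
...         # forward the batch, compute the PPO clipped surrogate loss, value loss
...         # and entropy bonus, then backward and take one optimizer step
...         train_one_iteration(batch)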

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (transitions, a list of transition) data, process it into a list of samples that can be used for training directly. In PPO, a train sample is a processed transition with newly computed traj_flag and adv fields. This method is usually used in collectors to execute the necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), where each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples, where each element is in a similar format to the input transitions, but may contain more data for training, such as the GAE advantage.
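
A schematic sketch (notation only) of the GAE recursion typically used to fill the adv field, with gamma and gae_lambda as described in _init_collect below:

delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
adv_t = delta_t + gamma * gae_lambda * (1 - done_t) * adv_{t+1}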

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For PPO, it contains the collect_model to balance exploration and exploitation (e.g. the multinomial sampling mechanism in discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in the __init__ method if the collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

Tip

Some variables need to be initialized independently in different modes, such as gamma and gae_lambda in PPO. This design is for the convenience of parallel execution of different policy modes.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For PPO, it contains the eval model to select the optimal action (e.g. greedily select the action with the argmax mechanism in discrete action space). This method will be called in the __init__ method if the eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For PPO, it mainly contains the optimizer and algorithm-specific arguments such as loss weight, clip_ratio and recompute_adv. This method also executes some special network initializations and prepares a running mean/std monitor for the value. This method will be called in the __init__ method if the learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPO, it contains obs, next_obs, action, reward, done, logit, value.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For PPO, it contains the state value, action and the logit of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

Note

next_obs is used to calculate the nstep return when necessary, so we place it into the transition by default. You can delete this field to save memory if you do not need the nstep return.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm’s default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

Note

The user can define and use a customized network model but must obey the same interface definition indicated by the import_names path. For example, for PPO, its registered name is ppo and the import_names is ding.model.template.vac.

Note

Because PPO now supports both single-agent and multi-agent usages, we can implement these functions with the same policy and two different default models, which is controlled by self._cfg.multi_agent.

PPOPGPolicy

class ding.policy.PPOPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of the on-policy version of the PPO algorithm (pure policy gradient without value network). Paper link: https://arxiv.org/abs/1707.06347.

_forward_collect(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit) for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Tip

If you want to add more tricks on this policy, like temperature factor in multinomial sample, you can pass related data as extra keyword arguments of this method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs. _forward_eval in PPO often uses a deterministic sampling method to get actions, while _forward_collect usually uses a stochastic sampling method to balance exploration and exploitation.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PPOPGPolicy: ding.policy.tests.test_ppo.

_forward_learn(data: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac, approx_kl.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including the latest collected training samples for on-policy algorithms like PPO. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPOPG, each element in the list is a dict containing at least the following keys: obs, action, return, logit, done. Sometimes, it also contains other keys such as weight.

Returns:
  • return_infos (List[Dict[str, Any]]): The information list that indicates the training result; each training iteration appends an information dict to the final list. The list will be processed and recorded in text logs and TensorBoard. The values of the dict must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Tip

The training procedure of PPOPG consists of two nested for-loops. The outer loop trains on all the collected training samples for epoch_per_collect epochs. The inner loop splits the data into mini-batches of length batch_size.
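
A minimal sketch of that two-loop structure; epoch_per_collect and batch_size correspond to the config fields named above, and update_fn stands for one gradient step (a hypothetical callback):

    import random

    def ppo_style_update(train_data, epoch_per_collect, batch_size, update_fn):
        """Outer loop over epochs, inner loop over shuffled mini-batches."""
        for _ in range(epoch_per_collect):
            indices = list(range(len(train_data)))
            random.shuffle(indices)
            for start in range(0, len(indices), batch_size):
                batch = [train_data[i] for i in indices[start:start + batch_size]]
                update_fn(batch)   # one gradient step on this mini-batch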

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

_get_train_sample(data: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given entire episode (a list of transitions), process it into a list of samples that can be used for training directly. In PPOPG, a train sample is a processed transition with a newly computed return field. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.

Arguments:
  • data (List[Dict[str, Any]]): The episode data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the discounted episode return.
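
A hedged sketch of the kind of return computation this method performs, assuming each transition dict carries a scalar reward and gamma is the discount factor:

    def add_discounted_returns(episode, gamma=0.99):
        """Append a 'return' field to every transition of one finished episode."""
        ret = 0.0
        for transition in reversed(episode):
            ret = transition['reward'] + gamma * ret
            transition['return'] = ret
        return episode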

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For PPOPG, it contains the collect_model to balance the exploration and exploitation (e.g. the multinomial sample mechanism in discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

Tip

Some variables need to be initialized independently in different modes, such as gamma and gae_lambda in PPO. This design is for the convenience of parallel execution of different policy modes.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For PPOPG, it contains the eval model to select the optimal action (e.g. greedily selecting the action with the argmax mechanism in discrete action spaces). This method will be called in the __init__ method if the eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For PPOPG, it mainly contains optimizer, algorithm-specific arguments such as loss weight and clip_ratio. This method also executes some special network initializations. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.
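
As a rough illustration, an override of this method typically just returns the names of the scalars produced by _forward_learn. The key names below are illustrative (drawn from the training information mentioned elsewhere in this page); the authoritative list lives in the policy source code.

    def _monitor_vars_learn(self):
        # Illustrative keys only; match these to what _forward_learn actually returns.
        return ['cur_lr', 'total_loss', 'policy_loss', 'entropy_loss', 'approx_kl', 'clipfrac']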

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPOPG, it contains obs, action, reward, done, logit.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For PPOPG, it contains the action and the logit of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

PPOOffPolicy

class ding.policy.PPOOffPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of off-policy version PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347. This version is more suitable for large-scale distributed training.

_forward_collect(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Tip

If you want to add more tricks to this policy, like a temperature factor in the multinomial sampling, you can pass the related data as extra keyword arguments of this method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PPOOffPolicy: ding.policy.tests.test_ppo.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs. _forward_eval in PPO often uses a deterministic sampling method to get actions, while _forward_collect usually uses a stochastic sampling method to balance exploration and exploitation.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PPOOffPolicy: ding.policy.tests.test_ppo.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac and approx_kl.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPOOff, each element in the list is a dict containing at least the following keys: obs, adv, action, logit, value, done. Sometimes, it also contains other keys such as weight and value_gamma.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text logs and TensorBoard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (a list of transitions), process it into a list of samples that can be used for training directly. In PPO, a train sample is a processed transition with newly computed traj_flag and adv fields. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the GAE advantage.
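
For reference, a standalone sketch of the GAE computation behind the adv field, assuming per-step reward/value/done arrays plus a bootstrap value for the state after the last transition (illustrative, not the library's implementation):

    import numpy as np

    def gae_advantages(rewards, values, dones, next_value, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation over one trajectory."""
        adv = np.zeros(len(rewards), dtype=np.float32)
        last_gae = 0.0
        for t in reversed(range(len(rewards))):
            v_next = next_value if t == len(rewards) - 1 else values[t + 1]
            nonterminal = 1.0 - float(dones[t])
            delta = rewards[t] + gamma * v_next * nonterminal - values[t]
            last_gae = delta + gamma * lam * nonterminal * last_gae
            adv[t] = last_gae
        return adv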

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For PPOOff, it contains collect_model to balance the exploration and exploitation (e.g. the multinomial sample mechanism in discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

Tip

Some variables need to be initialized independently in different modes, such as gamma and gae_lambda in PPOOff. This design is for the convenience of parallel execution of different policy modes.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For PPOOff, it contains the eval model to select the optimal action (e.g. greedily selecting the action with the argmax mechanism in discrete action spaces). This method will be called in the __init__ method if the eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For PPOOff, it mainly contains optimizer, algorithm-specific arguments such as loss weight and clip_ratio. This method also executes some special network initializations and prepares running mean/std monitor for value. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPO, it contains obs, next_obs, action, reward, done, logit, value.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For PPO, it contains the state value, action and the logit of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

Note

next_obs is used to calculate the nstep return when necessary, so we place it into the transition by default. You can delete this field to save memory if you do not need the nstep return.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

PPOSTDIMPolicy

class ding.policy.PPOSTDIMPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of on policy version PPO algorithm with ST-DIM auxiliary model. PPO paper link: https://arxiv.org/abs/1707.06347. ST-DIM paper link: https://arxiv.org/abs/1906.08226.

_forward_learn(data: Dict[str, Any]) Dict[str, Any][source]
Overview:

Forward and backward function of learn mode.

Arguments:
  • data (dict): Dict-type data, the training batch used for the learn-mode forward pass.

Returns:
  • info_dict (Dict[str, Any]): Including current lr, total_loss, policy_loss, value_loss, entropy_loss, adv_abs_max, approx_kl, clipfrac

_init_learn() None[source]
Overview:

Learn mode init method. Called by self.__init__. Initialize the auxiliary model, its optimizer, and the weight of the auxiliary loss relative to the main loss.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_model_encode(data)[source]
Overview:

Get the encoding of the main model as input for the auxiliary model.

Arguments:
  • data (dict): Dict type data, same as the _forward_learn input.

Returns:
  • (Tuple[Tensor]): The tuple of two tensors used for contrastive embedding learning. In the ST-DIM algorithm, these two variables are the dqn encodings of obs and next_obs, respectively.
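
Conceptually, this boils down to running the main model's representation network on obs and next_obs and handing the two embeddings to the contrastive (ST-DIM) loss. A hedged sketch in which the encoder attribute name is an assumption:

    def model_encode_sketch(model, data):
        # Hypothetical encoder access; returns the (x, y) pair for the contrastive loss.
        x = model.encoder(data['obs'])
        y = model.encoder(data['next_obs'])
        return x, y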

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model, optimizer and aux_optimizer for representation learning.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

BC

Please refer to ding/policy/bc.py for more details.

BehaviourCloningPolicy

class ding.policy.BehaviourCloningPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Behaviour Cloning (BC) policy class, which supports both discrete and continuous action spaces. The policy is trained by supervised learning, and the data is an offline dataset collected by an expert.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss and time.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For BC, each element in the list is a dict containing at least the following keys: obs, action.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text logs and TensorBoard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
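
To make the supervised nature of this update concrete, here is a hedged sketch of one learn step for discrete-action BC; the model interface and batch layout are assumptions rather than the library's exact internals.

    import torch
    import torch.nn.functional as F

    def bc_learn_step(model, optimizer, batch):
        """One supervised step: fit predicted action logits to expert actions."""
        obs = torch.stack([b['obs'] for b in batch])         # (B, obs_dim), assuming tensor obs
        action = torch.stack([b['action'] for b in batch])   # (B,) expert action labels
        logit = model(obs)                                    # (B, num_actions)
        loss = F.cross_entropy(logit, action.long())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return {'total_loss': loss.item()}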

_init_collect() None[source]
Overview:

The BC policy uses an offline dataset, so it does not need to collect data. However, sometimes we need to use the trained BC policy to collect data for other purposes.

_init_eval()[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For BC, it contains the eval model to greedily select action with argmax q_value mechanism for discrete action space. This method will be called in __init__ method if eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For BC, it mainly contains optimizer, algorithm-specific arguments such as lr_scheduler, loss, etc. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

Note

The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for discrete BC, its registered name is discrete_bc and the import_names is ding.model.template.bc.

DDPG

Please refer to ding/policy/ddpg.py for more details.

DDPGPolicy

class ding.policy.DDPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of DDPG algorithm. Paper link: https://arxiv.org/abs/1509.02971.

Config:

Each entry is listed as: ID. Symbol (Type, Default Value): Description. Other(Shape).

  • 1. type (str, ddpg): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.

  • 2. cuda (bool, False): Whether to use cuda for network.

  • 3. random_collect_size (int, 25000): Number of randomly collected training samples in replay buffer when training starts. Default to 25000 for DDPG/TD3, 10000 for SAC.

  • 4. model.twin_critic (bool, False): Whether to use two critic networks or only one. Default False for DDPG; Clipped Double Q-learning method in TD3 paper.

  • 5. learn.learning_rate_actor (float, 1e-3): Learning rate for actor network (aka. policy).

  • 6. learn.learning_rate_critic (float, 1e-3): Learning rate for critic network (aka. Q-network).

  • 7. learn.actor_update_freq (int, 2): When the critic network updates once, how many times the actor network updates. Default 1 for DDPG, 2 for TD3; Delayed Policy Updates method in TD3 paper.

  • 8. learn.noise (bool, False): Whether to add noise on the target network's action. Default False for DDPG, True for TD3; Target Policy Smoothing Regularization in TD3 paper.

  • 9. learn.ignore_done (bool, False): Determine whether to ignore the done flag. Use ignore_done only in the halfcheetah env.

  • 10. learn.target_theta (float, 0.005): Used for soft update of the target network, aka. interpolation factor in polyak averaging for target networks.

  • 11. collect.noise_sigma (float, 0.1): Used to add noise during collection, by controlling the sigma of the distribution. Noise is sampled from a distribution: the Ornstein-Uhlenbeck process in the DDPG paper, a Gaussian process in ours.
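
As a usage sketch, a handful of the fields above can be overridden by a partial config dict that is merged into the policy's default config before construction; the merge utility itself varies by entry point, so only the field layout is illustrated here.

    from easydict import EasyDict

    # Hedged example: override some of the documented DDPG fields.
    ddpg_overrides = EasyDict(dict(
        cuda=True,
        random_collect_size=25000,
        model=dict(twin_critic=False),
        learn=dict(
            learning_rate_actor=1e-3,
            learning_rate_critic=1e-3,
            target_theta=0.005,
        ),
        collect=dict(noise_sigma=0.1),
    ))
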
_forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e., environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DDPGPolicy: ding.policy.tests.test_ddpg.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DDPGPolicy: ding.policy.tests.test_ddpg.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DDPG, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight, and logit, which is used in the hybrid action space.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text logs and TensorBoard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DDPGPolicy: ding.policy.tests.test_ddpg.

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (a list of transitions), process it into a list of samples that can be used for training directly. In DDPG, a train sample is a processed transition (unroll_len=1).

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For DDPG, it contains the collect_model to balance the exploration and exploitation with the perturbed noise mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For DDPG, it contains the eval model to greedily select action type with argmax q_value mechanism for hybrid action space. This method will be called in __init__ method if eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For DDPG, it mainly contains two optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
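
A hedged sketch of the partial loading mentioned in this tip, using the standard torch.nn.Module.load_state_dict interface; the checkpoint path is a placeholder and the 'model' key is an assumption based on the _state_dict_learn description below.

    import torch
    import torch.nn as nn

    def load_partial(model: nn.Module, ckpt_path: str) -> None:
        """Load only the matching parameters from a saved learn-mode state dict."""
        ckpt = torch.load(ckpt_path, map_location='cpu')
        # Fall back to the raw dict if there is no 'model' sub-key (layout assumed).
        result = model.load_state_dict(ckpt.get('model', ckpt), strict=False)
        print('missing keys:', result.missing_keys, 'unexpected keys:', result.unexpected_keys)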

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DDPG, it contains obs, next_obs, action, reward, done.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For DDPG, it contains the action and the logit of the action (in hybrid action space).

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model, target_model and optimizers.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

TD3

Please refer to ding/policy/td3.py for more details.

TD3Policy

class ding.policy.TD3Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of TD3 algorithm. Since DDPG and TD3 share many common things, we can easily derive this TD3 class from DDPG class by changing _actor_update_freq, _twin_critic and noise in model wrapper. Paper link: https://arxiv.org/pdf/1802.09477.pdf

Config:

Each entry is listed as: ID. Symbol (Type, Default Value): Description. Other(Shape).

  • 1. type (str, td3): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.

  • 2. cuda (bool, False): Whether to use cuda for network.

  • 3. random_collect_size (int, 25000): Number of randomly collected training samples in replay buffer when training starts. Default to 25000 for DDPG/TD3, 10000 for SAC.

  • 4. model.twin_critic (bool, True): Whether to use two critic networks or only one. Default True for TD3; Clipped Double Q-learning method in TD3 paper.

  • 5. learn.learning_rate_actor (float, 1e-3): Learning rate for actor network (aka. policy).

  • 6. learn.learning_rate_critic (float, 1e-3): Learning rate for critic network (aka. Q-network).

  • 7. learn.actor_update_freq (int, 2): When the critic network updates once, how many times the actor network updates. Default 2 for TD3, 1 for DDPG; Delayed Policy Updates method in TD3 paper.

  • 8. learn.noise (bool, True): Whether to add noise on the target network's action. Default True for TD3, False for DDPG; Target Policy Smoothing Regularization in TD3 paper.

  • 9. learn.noise_range (dict, dict(min=-0.5, max=0.5)): Limit for the range of target policy smoothing noise, aka. noise_clip.

  • 10. learn.ignore_done (bool, False): Determine whether to ignore the done flag. Use ignore_done only in the halfcheetah env.

  • 11. learn.target_theta (float, 0.005): Used for soft update of the target network, aka. interpolation factor in polyak averaging for target networks.

  • 12. collect.noise_sigma (float, 0.1): Used to add noise during collection, by controlling the sigma of the distribution. Noise is sampled from a distribution: the Ornstein-Uhlenbeck process in the DDPG paper, a Gaussian process in ours.

_forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e., environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DDPGPolicy: ding.policy.tests.test_ddpg.

_forward_eval(data: Dict[int, Any]) Dict[int, Any]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DDPGPolicy: ding.policy.tests.test_ddpg.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DDPG, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight, and logit, which is used in the hybrid action space.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text logs and TensorBoard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DDPGPolicy: ding.policy.tests.test_ddpg.

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]]
Overview:

For a given trajectory (a list of transitions), process it into a list of samples that can be used for training directly. In DDPG, a train sample is a processed transition (unroll_len=1).

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.

_init_collect() None
Overview:

Initialize the collect mode of policy, including related attributes and modules. For DDPG, it contains the collect_model to balance the exploration and exploitation with the perturbed noise mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

_init_eval() None
Overview:

Initialize the eval mode of policy, including related attributes and modules. For DDPG, it contains the eval model to greedily select action type with argmax q_value mechanism for hybrid action space. This method will be called in __init__ method if eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None
Overview:

Initialize the learn mode of policy, including related attributes and modules. For DDPG, it mainly contains two optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_load_state_dict_learn(state_dict: Dict[str, Any]) None
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DDPG, it contains obs, next_obs, action, reward, done.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For DDPG, it contains the action and the logit of the action (in hybrid action space).

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

_state_dict_learn() Dict[str, Any]
Overview:

Return the state_dict of learn mode, usually including model, target_model and optimizers.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

default_model() Tuple[str, List[str]]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

SAC

Please refer to ding/policy/sac.py for more details.

SACPolicy

class ding.policy.SACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of continuous SAC algorithm. Paper link: https://arxiv.org/pdf/1801.01290.pdf

Config:

Each entry is listed as: ID. Symbol (Type, Default Value): Description. Other.

  • 1. type (str, sac): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.

  • 2. cuda (bool, True): Whether to use cuda for network.

  • 3. on_policy (bool, False): SAC is an off-policy algorithm.

  • 4. priority (bool, False): Whether to use priority sampling in buffer.

  • 5. priority_IS_weight (bool, False): Whether to use Importance Sampling weight to correct the biased update.

  • 6. random_collect_size (int, 10000): Number of randomly collected training samples in replay buffer when training starts. Default to 10000 for SAC, 25000 for DDPG/TD3.

  • 7. learn.learning_rate_q (float, 3e-4): Learning rate for soft q network.

  • 8. learn.learning_rate_policy (float, 3e-4): Learning rate for policy network.

  • 9. learn.alpha (float, 0.2): Entropy regularization coefficient. alpha is the initialization for auto alpha, when auto_alpha is True.

  • 10. learn.auto_alpha (bool, False): Determine whether to use the auto temperature parameter alpha. The temperature parameter determines the relative importance of the entropy term against the reward.

  • 11. learn.ignore_done (bool, False): Determine whether to ignore the done flag. Use ignore_done only in envs like Pendulum.

  • 12. learn.target_theta (float, 0.005): Used for soft update of the target network, aka. interpolation factor in polyak averaging for target networks.

_forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

logit in SAC means the mu and sigma of the Gaussian distribution. Here we use this name for consistency.
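
For intuition, a hedged sketch of how an action can be drawn from such a (mu, sigma) output with the usual tanh squashing of continuous SAC; this is an illustration, not the library's exact sampling code.

    import torch

    def sample_squashed_gaussian(mu: torch.Tensor, sigma: torch.Tensor, deterministic: bool = False) -> torch.Tensor:
        """Stochastic sample for collect mode, mean action for eval mode."""
        if deterministic:
            pre_tanh = mu
        else:
            pre_tanh = torch.distributions.Normal(mu, sigma).rsample()
        return torch.tanh(pre_tanh)   # squash the action into [-1, 1]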

Note

For more detailed examples, please refer to our unittest for SACPolicy: ding.policy.tests.test_sac.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

logit in SAC means the mu and sigma of the Gaussian distribution. Here we use this name for consistency.

Note

For more detailed examples, please refer to our unittest for SACPolicy: ding.policy.tests.test_sac.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text logs and TensorBoard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for SACPolicy: ding.policy.tests.test_sac.

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (a list of transitions), process it into a list of samples that can be used for training directly. In continuous SAC, a train sample is a processed transition (unroll_len=1).

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For SAC, it contains the collect_model and other algorithm-specific arguments such as unroll_len. This method will be called in the __init__ method if the collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For SAC, it contains the eval model, which is equipped with the base model wrapper to ensure compatibility. This method will be called in the __init__ method if the eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For SAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. Especially, the auto_alpha mechanism for balancing max entropy target is also initialized here. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in the replay buffer. For continuous SAC, it contains obs, next_obs, action, reward, done. The logit will also be added when collector_logit is True.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For continuous SAC, it contains the action and the logit (mu and sigma) of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
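
For concreteness, a single transition of the shape described above for continuous SAC might look like the following sketch; the tensor sizes (17-dim observation, 6-dim action) are illustrative assumptions, not values fixed by the policy.

    import torch

    obs = torch.randn(17)        # current observation (size is an assumption)
    next_obs = torch.randn(17)   # observation returned by the env step
    policy_output = {'action': torch.tanh(torch.randn(6)),
                     'logit': (torch.zeros(6), torch.ones(6))}  # (mu, sigma)

    transition = {
        'obs': obs,
        'next_obs': next_obs,
        'action': policy_output['action'],
        'reward': torch.tensor(1.0),
        'done': False,
    }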

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model, target_model and optimizers.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

DiscreteSACPolicy

class ding.policy.DiscreteSACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of discrete SAC algorithm. Paper link: https://arxiv.org/abs/1910.07207.

_forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs eps argument for exploration, i.e., classic epsilon-greedy exploration strategy.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

  • eps (float): The epsilon value for exploration.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
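
To illustrate how the eps argument is typically used, here is a conceptual epsilon-greedy selection over discrete action logits; it is a sketch of the idea rather than the policy's internal sampling code.

    import torch

    def eps_greedy_action(logit: torch.Tensor, eps: float) -> torch.Tensor:
        # logit: (action_dim,) unnormalized scores for one environment.
        if torch.rand(1).item() < eps:
            # explore: uniformly random action
            return torch.randint(logit.shape[0], (1,))
        # exploit: greedy action
        return logit.argmax(dim=-1, keepdim=True)

    action = eps_greedy_action(torch.randn(4), eps=0.05)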

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, logit, reward, next_obs, done. Sometimes, it also contains other keys like weight.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (transitions, a list of transition) data, process it into a list of samples that can be used directly for training. In discrete SAC, a train sample is a processed transition (unroll_len=1).

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), where each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples, where each element is in a similar format to the input transitions but may contain more data for training.

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For SAC, it contains the collect_model to balance the exploration and exploitation with the epsilon and multinomial sample mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflict with other modes, such as self._collect_attr1.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For DiscreteSAC, it contains the eval model to greedily select action type with argmax q_value mechanism. This method will be called in __init__ method if eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflict with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For DiscreteSAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you want to only load some parts of model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operation.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For discrete SAC, it contains obs, next_obs, logit, action, reward, done.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For discrete SAC, it contains the action and the logit of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model, target_model and optimizers.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

SQILSACPolicy

class ding.policy.SQILSACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of continuous SAC algorithm with SQIL extension. SAC paper link: https://arxiv.org/pdf/1801.01290.pdf. SQIL paper link: https://arxiv.org/abs/1905.11108.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

For SQIL + SAC, the input data is composed of two parts of the same size: agent data and expert data. Both of them are relabelled with a new reward according to the SQIL algorithm.
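
A conceptual sketch of this relabelling, assuming each half of the batch is a dict of equally shaped tensors (names and shapes are illustrative): expert transitions receive a constant reward of 1 and agent transitions a constant reward of 0 before the two halves are concatenated.

    import torch

    def relabel_and_merge(agent_batch: dict, expert_batch: dict) -> dict:
        # SQIL-style reward relabelling: agent data -> 0, expert (demonstration) data -> 1.
        agent_batch = dict(agent_batch, reward=torch.zeros_like(agent_batch['reward']))
        expert_batch = dict(expert_batch, reward=torch.ones_like(expert_batch['reward']))
        # Concatenate the two halves key by key into one training batch.
        return {k: torch.cat([agent_batch[k], expert_batch[k]]) for k in agent_batch}

    agent = {'obs': torch.randn(4, 3), 'reward': torch.randn(4)}
    expert = {'obs': torch.randn(4, 3), 'reward': torch.randn(4)}
    batch = relabel_and_merge(agent, expert)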

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for SACPolicy: ding.policy.tests.test_sac.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For SAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

R2D2

Please refer to ding/policy/r2d2.py for more details.

R2D2Policy

class ding.policy.R2D2Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of R2D2, from the paper Recurrent Experience Replay in Distributed Reinforcement Learning. R2D2 proposes several tricks to improve upon DRQN, namely recurrent experience replay tricks and the burn-in mechanism for off-policy training.

Config:

ID | Symbol | Type | Default Value | Description | Other(Shape)
1 | type | str | r2d2 | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | This arg can be different across modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True. |
6 | discount_factor | float | 0.997, [0.95, 0.999] | Reward’s future discount factor, aka. gamma | May be 1 when sparse reward env
7 | nstep | int | 3, [3, 5] | N-step reward discount sum for target q_value estimation |
8 | burnin_step | int | 2 | The timestep of the burn-in operation, which is designed to mitigate the RNN hidden state difference caused by off-policy training |
9 | learn.update_per_collect | int | 1 | How many updates (iterations) to train after the collector’s one collection. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy
10 | learn.batch_size | int | 64 | The number of samples of an iteration |
11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration |
12 | learn.value_rescale | bool | True | Whether to use the value_rescale function for the predicted value |
13 | learn.target_update_freq | int | 100 | Frequency of target network update | Hard (assign) update
14 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake termination envs
15 | collect.n_sample | int | [8, 128] | The number of training samples of a call of collector | It varies across different envs
16 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1
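
As a rough illustration of how a few of the fields above might be set in a user config (an EasyDict sketch under the assumption of the usual DI-engine config-merging workflow; the exact defaults and merge helpers may differ):

    from easydict import EasyDict

    r2d2_overrides = EasyDict(dict(
        cuda=False,
        priority=True,
        priority_IS_weight=True,      # requires priority=True (row 5)
        discount_factor=0.997,
        nstep=5,
        burnin_step=2,
        learn=dict(
            update_per_collect=1,
            batch_size=64,
            learning_rate=1e-3,
            value_rescale=True,
            target_update_freq=100,
        ),
        collect=dict(n_sample=32, unroll_len=40),  # unroll_len > 1 for the RNN
    ))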
_forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs eps argument for exploration, i.e., classic epsilon-greedy exploration strategy.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

  • eps (float): The epsilon value for exploration.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (prev_state) for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Note

RNN’s hidden states are maintained in the policy, so we don’t need to pass them into the data, but we do need to reset the hidden states with the _reset_collect method when an episode ends. Besides, the previous hidden states are necessary for training, so we need to return them in the _process_transition method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for R2D2Policy: ding.policy.tests.test_r2d2.

_forward_learn(data: List[List[Dict[str, Any]]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data (trajectory for R2D2) from the replay buffer and then returns the output result, including various training information such as loss, q value, priority.

Arguments:
  • data (List[List[Dict[str, Any]]]): The input data used for policy forward, including a batch of training samples. For each dict element, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the time and batch dimensions first by the utility function self._data_preprocess_learn. For R2D2, each element in the list is a trajectory with the length of unroll_len, and each element in the trajectory list is a dict containing at least the following keys: obs, action, prev_state, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for R2D2Policy: ding.policy.tests.test_r2d2.

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (transitions, a list of transition) data, process it into a list of samples that can be used directly for training. In R2D2, a train sample is a processed trajectory with unroll_len length. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), where each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples; each sample is a fixed-length trajectory, and each element in a sample is in a similar format to the input transitions but may contain more data for training, such as the nstep reward and the value_gamma factor.
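
As a worked illustration of the nstep reward mentioned above (a conceptual sketch, not the library's preprocessing code), the n-step discounted return from a given timestep is simply the gamma-weighted sum of the next n rewards:

    # n-step return: r_t + gamma * r_{t+1} + ... + gamma^(n-1) * r_{t+n-1}
    def nstep_return(rewards, gamma: float, nstep: int) -> float:
        return sum((gamma ** i) * r for i, r in enumerate(rewards[:nstep]))

    # With gamma=0.997 and nstep=3, only the first three rewards contribute.
    print(nstep_return([1.0, 0.0, 1.0, 1.0], gamma=0.997, nstep=3))  # 1.0 + 0.997 ** 2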

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For R2D2, it contains the collect_model to balance the exploration and exploitation with epsilon-greedy sample mechanism and maintain the hidden state of rnn. Besides, there are some initialization operations about other algorithm-specific arguments such as burnin_step, unroll_len and nstep. This method will be called in __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflict with other modes, such as self._collect_attr1.

Tip

Some variables need to be initialized independently in different modes, such as gamma and nstep in R2D2. This design is for the convenience of parallel execution of different policy modes.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including some attributes and modules. For R2D2, it mainly contains the optimizer, algorithm-specific arguments such as burnin_step, value_rescale and gamma, and the main and target models. Because of the use of RNN, all the models should be wrapped with hidden_state, which needs to be initialized with the proper size. This method will be called in __init__ method if learn field is in enable_field.
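
For reference, the invertible value rescaling commonly used with R2D2 (and controlled by learn.value_rescale in the table above) is h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x; a small sketch of it and its closed-form inverse follows, shown here as background rather than as the library's exact implementation.

    import torch

    def value_rescale(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
        # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x
        return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1.0) - 1.0) + eps * x

    def value_inverse_rescale(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
        # h^{-1}(x), the closed-form inverse of the function above
        return torch.sign(x) * (
            ((torch.sqrt(1.0 + 4.0 * eps * (torch.abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)) ** 2 - 1.0
        )

    x = torch.tensor([-10.0, 0.0, 10.0])
    assert torch.allclose(value_inverse_rescale(value_rescale(x)), x, atol=1e-3)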

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you want to only load some parts of model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operation.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For R2D2, it contains obs, action, prev_state, reward, and done.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network given the observation as input. For R2D2, it contains the action and the prev_state of RNN.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

_reset_collect(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for collect mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes during collection in data_id will have different hidden states in RNN.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in R2D2) specified by data_id.

_reset_eval(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for eval mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in evaluation in data_id will have different hidden states in RNN.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in R2D2) specified by data_id.

_reset_learn(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for learn mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different trajectories in data_id will have different hidden states in RNN.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e. RNN hidden_state in R2D2) specified by data_id.

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model, target_model and optimizer.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

Note

The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for R2D2, its registered name is drqn and the import_names is ding.model.template.q_learning.
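
Based on the registered name and import path given in the note above, a customized policy that keeps the default DRQN model might express it roughly like this sketch (the method is shown out of its class context for brevity):

    from typing import List, Tuple

    def default_model(self) -> Tuple[str, List[str]]:
        # Registered model name and the module path it can be imported from,
        # matching the values stated in the note above for R2D2.
        return 'drqn', ['ding.model.template.q_learning']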

IMPALA

Please refer to ding/policy/impala.py for more details.

IMPALAPolicy

class ding.policy.IMPALAPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of IMPALA algorithm. Paper link: https://arxiv.org/abs/1802.01561.

Config:

ID | Symbol | Type | Default Value | Description | Other(Shape)
1 | type | str | impala | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | This arg can be different across modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight | If True, priority must be True
6 | unroll_len | int | 32 | Trajectory length to calculate the v-trace target |
7 | learn.update_per_collect | int | 4 | How many updates (iterations) to train after the collector’s one collection. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy
_forward_collect(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Tip

If you want to add more tricks on this policy, like temperature factor in multinomial sample, you can pass related data as extra keyword arguments of this method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to unittest for IMPALAPolicy: ding.policy.tests.test_impala.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs. _forward_eval in IMPALA often uses deterministic sampling to get actions, while _forward_collect usually uses a stochastic sampling method to balance exploration and exploitation.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
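
To make the contrast above concrete, here is a conceptual sketch of the two sampling modes on a single environment's action logits (not the wrapped model code itself):

    import torch

    logit = torch.randn(6)                                    # action logits for one env
    prob = torch.softmax(logit, dim=-1)
    collect_action = torch.multinomial(prob, num_samples=1)   # stochastic: exploration
    eval_action = prob.argmax(dim=-1, keepdim=True)           # deterministic: exploitation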

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to unittest for IMPALAPolicy: ding.policy.tests.test_impala.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss and current learning rate.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by some utility functions such as default_preprocess_learn. For IMPALA, each element in the list is a dict containing at least the following keys: obs, action, logit, reward, next_obs, done. Sometimes, it also contains other keys such as weight.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to unittest for IMPALAPolicy: ding.policy.tests.test_impala.

_get_train_sample(data: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (transitions, a list of transition) data, process it into a list of samples that can be used for training. In IMPALA, a train sample is a processed trajectory with unroll_len length.

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), where each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples, where each element is in a similar format to the input transitions but may contain more data for training.

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For IMPALA, it contains the collect_model to balance the exploration and exploitation (e.g. the multinomial sample mechanism in discrete action space), and other algorithm-specific arguments such as unroll_len. This method will be called in __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflict with other modes, such as self._collect_attr1.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For IMPALA, it contains the eval model to select the optimal action (e.g. greedily select the action with the argmax mechanism in discrete action space). This method will be called in __init__ method if eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflict with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For IMPALA, it mainly contains optimizer, algorithm-specific arguments such as loss weight and gamma, main (learn) model. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For IMPALA, it contains obs, next_obs, action, reward, done, logit.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For IMPALA, it contains the action and the logit of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

Note

The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for IMPALA, its registered name is vac and the import_names is ding.model.template.vac.

QMIX

Please refer to ding/policy/qmix.py for more details.

QMIXPolicy

class ding.policy.QMIXPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of QMIX algorithm. QMIX is a multi-agent reinforcement learning algorithm; you can view the paper at the following link: https://arxiv.org/abs/1803.11485.

Config:

ID | Symbol | Type | Default Value | Description | Other(Shape)
1 | type | str | qmix | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | True | Whether to use cuda for network | This arg can be different across modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update | IS weight
6 | learn.update_per_collect | int | 20 | How many updates (iterations) to train after the collector’s one collection. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy
7 | learn.target_update_theta | float | 0.001 | Target network update momentum parameter | Between [0, 1]
8 | learn.discount_factor | float | 0.99 | Reward’s future discount factor, aka. gamma | May be 1 when sparse reward env
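
The learn.target_update_theta entry above controls a momentum-style (soft) target network update. A conceptual sketch of such an update is shown below; it illustrates the idea rather than the library's target-model wrapper.

    import torch

    def soft_update(target: torch.nn.Module, online: torch.nn.Module, theta: float) -> None:
        # target <- theta * online + (1 - theta) * target, applied parameter-wise
        with torch.no_grad():
            for t_param, o_param in zip(target.parameters(), online.parameters()):
                t_param.mul_(1.0 - theta).add_(theta * o_param)

    online = torch.nn.Linear(4, 2)
    target = torch.nn.Linear(4, 2)
    soft_update(target, online, theta=0.001)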
_forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs eps argument for exploration, i.e., classic epsilon-greedy exploration strategy.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

  • eps (float): The epsilon value for exploration.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (prev_state) for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Note

RNN’s hidden states are maintained in the policy, so we don’t need to pass them into the data, but we do need to reset the hidden states with the _reset_collect method when an episode ends. Besides, the previous hidden states are necessary for training, so we need to return them in the _process_transition method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for QMIXPolicy: ding.policy.tests.test_qmix.

_forward_learn(data: List[List[Dict[str, Any]]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data (trajectory for QMIX) from the replay buffer and then returns the output result, including various training information such as loss, q value, grad_norm.

Arguments:
  • data (List[List[Dict[str, Any]]]): The input data used for policy forward, including a batch of training samples. For each dict element, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the time and batch dimensions first by the utility function self._data_preprocess_learn. For QMIX, each element in the list is a trajectory with the length of unroll_len, and each element in the trajectory list is a dict containing at least the following keys: obs, action, prev_state, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for QMIXPolicy: ding.policy.tests.test_qmix.

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (transitions, a list of transition) data, process it into a list of samples that can be used directly for training. In QMIX, a train sample is a processed trajectory with unroll_len length. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), where each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples; each sample is a fixed-length trajectory, and each element in a sample is in a similar format to the input transitions.

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For QMIX, it contains the collect_model to balance the exploration and exploitation with epsilon-greedy sample mechanism and maintain the hidden state of rnn. Besides, there are some initialization operations about other algorithm-specific arguments such as burnin_step, unroll_len and nstep. This method will be called in __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflict with other modes, such as self._collect_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including some attributes and modules. For QMIX, it mainly contains the optimizer, algorithm-specific arguments such as gamma, and the main and target models. Because of the use of RNN, all the models should be wrapped with hidden_state, which needs to be initialized with the proper size. This method will be called in __init__ method if learn field is in enable_field.

Tip

For multi-agent algorithm, we often need to use agent_num to initialize some necessary variables.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1. In addition, agent_num (int) needs to be provided, since this is a multi-agent algorithm.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): The dict of policy learn state saved before.

Tip

If you want to only load some parts of model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operation.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For QMIX, it contains obs, next_obs, action, prev_state, reward, done.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, usually including agent_obs and global_obs in multi-agent environment like MPE and SMAC.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For QMIX, it contains the action and the prev_state of RNN.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
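
For concreteness, one timestep of the transition described above might look like the sketch below for a toy 3-agent setting; the observation keys and tensor sizes are illustrative assumptions rather than a fixed environment interface.

    import torch

    n_agents = 3
    obs = {'agent_obs': torch.randn(n_agents, 10), 'global_obs': torch.randn(24)}
    next_obs = {'agent_obs': torch.randn(n_agents, 10), 'global_obs': torch.randn(24)}

    transition = {
        'obs': obs,
        'next_obs': next_obs,
        'action': torch.randint(5, (n_agents,)),  # one discrete action per agent
        'prev_state': None,                       # RNN hidden state before this step
        'reward': torch.tensor(1.0),
        'done': False,
    }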

_reset_collect(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for collect mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes during collection in data_id will have different hidden states in RNN.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in QMIX) specified by data_id.

_reset_eval(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for eval mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in evaluation in data_id will have different hidden states in RNN.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in QMIX) specified by data_id.

_reset_learn(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for learn mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different trajectories in data_id will have different hidden states in RNN.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e. RNN hidden_state in QMIX) specified by data_id.

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model, target_model and optimizer.

Returns:
  • state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

Note

The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For QMIX, it is ding.model.qmix.qmix.

CQL

Please refer to ding/policy/cql.py for more details.

CQLPolicy

class ding.policy.CQLPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of CQL algorithm for continuous control. Paper link: https://arxiv.org/abs/2006.04779.

Config:

ID | Symbol | Type | Default Value | Description | Other(Shape)
1 | type | str | cql | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | True | Whether to use cuda for network |
3 | random_collect_size | int | 10000 | Number of randomly collected training samples in the replay buffer when training starts | Default to 10000 for SAC, 25000 for DDPG/TD3
4 | model.policy_embedding_size | int | 256 | Linear layer size for policy network |
5 | model.soft_q_embedding_size | int | 256 | Linear layer size for soft q network |
6 | model.value_embedding_size | int | 256 | Linear layer size for value network | Default to None when model.value_network is False
7 | learn.learning_rate_q | float | 3e-4 | Learning rate for soft q network | Default to 1e-3 when model.value_network is True
8 | learn.learning_rate_policy | float | 3e-4 | Learning rate for policy network | Default to 1e-3 when model.value_network is True
9 | learn.learning_rate_value | float | 3e-4 | Learning rate for value network | Default to None when model.value_network is False
10 | learn.alpha | float | 0.2 | Entropy regularization coefficient | alpha is the initialization for auto alpha, when auto_alpha is True
11 | learn.reparameterization | bool | True | Determine whether to use the reparameterization trick |
12 | learn.auto_alpha | bool | False | Determine whether to use the auto temperature parameter alpha | The temperature parameter determines the relative importance of the entropy term against the reward
13 | learn.ignore_done | bool | False | Determine whether to ignore the done flag | Use ignore_done only in halfcheetah env
14 | learn.target_theta | float | 0.005 | Used for soft update of the target network | aka. interpolation factor in polyak averaging for target networks
_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, action, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by some utility functions such as default_preprocess_learn. For CQL, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
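
As background for the min_q_weight argument mentioned under _init_learn below, the conservative term that distinguishes CQL from plain SAC penalizes large Q-values on out-of-distribution actions while keeping Q-values on dataset actions high; a simplified sketch of that term (not the library's exact loss) is:

    import torch

    def cql_conservative_term(q_sampled: torch.Tensor, q_data: torch.Tensor,
                              min_q_weight: float) -> torch.Tensor:
        # q_sampled: (B, N) Q-values of N actions sampled per state (e.g. random/policy actions)
        # q_data:    (B,)   Q-values of the actions actually stored in the offline dataset
        return min_q_weight * (torch.logsumexp(q_sampled, dim=1) - q_data).mean()

    penalty = cql_conservative_term(torch.randn(8, 10), torch.randn(8), min_q_weight=5.0)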

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For CQL, it mainly contains three optimizers, algorithm-specific arguments such as gamma, min_q_weight, with_lagrange and with_q_entropy, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.

DiscreteCQLPolicy

class ding.policy.DiscreteCQLPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of discrete CQL algorithm in discrete action space environments. Paper link: https://arxiv.org/abs/2006.04779.

_forward_learn(data: List[Dict[str, Any]]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, action, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by some utility functions such as default_preprocess_learn. For DiscreteCQL, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys like weight and value_gamma for nstep return computation.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For DiscreteCQL, it mainly contains the optimizer, algorithm-specific arguments such as gamma, nstep and min_q_weight, main and target model. This method will be called in __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

DecisionTransformer

Please refer to ding/policy/dt.py for more details.

DTPolicy

class ding.policy.DTPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of Decision Transformer algorithm in discrete environments. Paper link: https://arxiv.org/abs/2106.01345.

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some input data (current obs/return-to-go and historical information) from the envs and then returns the output data, such as the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs and reward to calculate the running return-to-go. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

Decision Transformer will do different operations for different types of envs in evaluation.
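
For intuition, the expected call pattern looks roughly like the following sketch, assuming dt_policy is an initialized DTPolicy with the eval field enabled, environment id 0, a hypothetical 4-dimensional observation, and per-env dicts keyed by the field names mentioned above:
>>> import torch
>>> data = {0: {'obs': torch.randn(4), 'reward': torch.tensor(0.0)}}  # keyed by environment id
>>> output = dt_policy.eval_mode.forward(data)
>>> action = output[0]['action']  # action for environment 0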

_forward_learn(data: List[Tensor]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, current learning rate.

Arguments:
  • data (List[torch.Tensor]): The input data used for policy forward, including a series of processed torch.Tensor data, i.e., timesteps, states, actions, returns_to_go, traj_mask.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; values must be Python scalars or a list of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model instead of using the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
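
As a minimal sketch (tensor shapes are environment-specific and omitted here), a training step consumes the five tensors in the documented order, e.g. as produced by an offline dataloader, and returns the logging dict:
>>> timesteps, states, actions, returns_to_go, traj_mask = data  # order as documented above
>>> info = dt_policy.learn_mode.forward(data)
>>> print(info.keys())  # the keys listed by _monitor_vars_learn, e.g. the current loss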

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For DTPolicy, it contains the eval model and some algorithm-specific parameters such as context_len, max_eval_ep_len, etc. This method will be called in the __init__ method if eval field is in enable_field.

Tip

For the evaluation of complete episodes, we need to maintain some historical information for transformer inference. These variables need to be initialized in _init_eval and reset in _reset_eval when necessary.

Note

If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For Decision Transformer, it mainly contains the optimizer, algorithm-specific arguments such as rtg_scale, and the lr scheduler. This method will be called in the __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_reset_eval(data_id: List[int] | None = None) None[source]
Overview:

Reset some stateful variables for eval mode when necessary, such as the historical info of the transformer for Decision Transformer. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, the different environments/episodes specified by data_id will each maintain their own history during evaluation.

Arguments:
  • data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.
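
An illustrative call, assuming the eval_mode wrapper forwards reset to _reset_eval and that environments 0 and 2 have just finished their episodes:
>>> dt_policy.eval_mode.reset(data_id=[0, 2])  # clear the transformer history of env 0 and env 2 only
>>> dt_policy.eval_mode.reset()                # or clear the history of all envs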

PDQN

Please refer to ding/policy/pdqn.py for more details.

PDQNPolicy

class ding.policy.PDQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of PDQN algorithm, which extends the DQN algorithm on discrete-continuous hybrid action spaces. Paper link: https://arxiv.org/abs/1810.06394.

Config:

| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| 1 | type | str | pdqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different across modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | This value is always False for PDQN |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling weight to correct biased update. If True, priority must be True. | |
| 6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
| 7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 8 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after one collection of the collector. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy |
| 9 | learn.batch_size | int | 64 | The number of samples of an iteration | |
| 11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration | |
| 12 | learn.target_update_freq | int | 100 | Frequency of target network update | Hard (assign) update |
| 13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake termination envs |
| 14 | collect.n_sample | int | [8, 128] | The number of training samples of one call of the collector | It varies across envs |
| 15 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1 |
| 16 | collect.noise_sigma | float | 0.1 | Add noise to the continuous action args during collection | |
| 17 | other.eps.type | str | exp | Exploration rate decay type | Supports ['exp', 'linear'] |
| 18 | other.eps.start | float | 0.95 | Start value of exploration rate | [0, 1] |
| 19 | other.eps.end | float | 0.05 | End value of exploration rate | [0, 1] |
| 20 | other.eps.decay | int | 10000 | Decay length of exploration | Greater than 0. Setting decay=10000 means the exploration rate decays from the start value to the end value over the decay length |

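The defaults above can be inspected and overridden programmatically. The sketch below assumes the standard default_config classmethod of DI-engine policies and only touches fields listed in the table; the env-specific model config (observation/action shapes) is omitted:
>>> from ding.policy import PDQNPolicy
>>> cfg = PDQNPolicy.default_config()  # EasyDict pre-filled with the defaults listed above
>>> cfg.nstep = 3                      # use 3-step TD targets
>>> cfg.other.eps.decay = 10000        # epsilon decays from 0.95 to 0.05 over a decay length of 10000
>>> cfg.collect.noise_sigma = 0.2      # larger noise on the continuous action args during collection
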
_forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any][source]
Overview:

Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

  • eps (float): The epsilon value for exploration.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model instead of using the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PDQNPolicy: ding.policy.tests.test_pdqn.
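
For intuition, a single collect step looks roughly like the following sketch, assuming pdqn_policy is an initialized PDQNPolicy with the collect field enabled and a hypothetical 10-dimensional observation for environment id 0:
>>> import torch
>>> data = {0: torch.randn(10)}  # obs keyed by environment id
>>> output = pdqn_policy.collect_mode.forward(data, eps=0.5)
>>> hybrid_action = output[0]['action']  # discrete action type plus continuous action args for env 0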

_forward_eval(data: Dict[int, Any]) Dict[int, Any][source]
Overview:

Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.

Arguments:
  • data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.

Returns:
  • output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model instead of using the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PDQNPolicy: ding.policy.tests.test_pdqn.

_forward_learn(data: Dict[str, Any]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, target_q_value, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by some utility functions such as default_preprocess_learn. For PDQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; values must be Python scalars or a list of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model instead of using the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for PDQNPolicy: ding.policy.tests.test_pdqn.

_get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]][source]
Overview:

For a given trajectory (transitions, i.e., a list of transitions), process it into a list of samples that can be used for training directly. In PDQN, a train sample is a processed transition. This method is usually used in collectors to execute the necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.

Arguments:
  • transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.

Returns:
  • samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the nstep reward and target obs.

_init_collect() None[source]
Overview:

Initialize the collect mode of policy, including related attributes and modules. For PDQN, it contains the collect_model, which balances exploration and exploitation with an epsilon-greedy sampling mechanism and a continuous action mechanism. Besides, other algorithm-specific arguments such as unroll_len and nstep are also initialized here. This method will be called in the __init__ method if collect field is in enable_field.

Note

If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.

Tip

Some variables need to be initialized independently in different modes, such as gamma and nstep in PDQN. This design is for the convenience of the parallel execution of different policy modes.

_init_eval() None[source]
Overview:

Initialize the eval mode of policy, including related attributes and modules. For PDQN, it contains the eval model, which greedily selects actions with the argmax q_value mechanism. This method will be called in the __init__ method if eval field is in enable_field.

Note

If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For PDQN, it mainly contains two optimizers, algorithm-specific arguments such as nstep and gamma, and the main and target models. This method will be called in the __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_load_state_dict_learn(state_dict: Dict[str, Any]) None[source]
Overview:

Load the state_dict variable into policy learn mode.

Arguments:
  • state_dict (Dict[str, Any]): the dict of policy learn state saved before.

Tip

If you want to load only some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

_process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor][source]
Overview:

Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PDQN, it contains obs, next_obs, action, reward, done and logit.

Arguments:
  • obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.

  • policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For PDQN, it contains the hybrid action and the logit (discrete part q_value) of the action.

  • timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.

Returns:
  • transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
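
Conceptually (a sketch based only on the field names above, not on the exact implementation), the packed transition looks like:
>>> transition = {
...     'obs': obs,                         # observation of the current timestep
...     'next_obs': timestep.obs,           # observation returned by the env step
...     'action': policy_output['action'],  # hybrid (discrete + continuous) action
...     'logit': policy_output['logit'],    # q_value of the discrete part
...     'reward': timestep.reward,
...     'done': timestep.done,
... }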

_state_dict_learn() Dict[str, Any][source]
Overview:

Return the state_dict of learn mode, usually including model, target model, discrete part optimizer, and continuous part optimizer.

Returns:
  • state_dict (Dict[str, Any]): the dict of current policy learn state, for saving and restoring.

default_model() Tuple[str, List[str]][source]
Overview:

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns:
  • model_info (Tuple[str, List[str]]): The registered model name and model’s import_names.

Note

The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for PDQN, its registered name is pdqn and the import_names is ding.model.template.pdqn.

MDQN

Please refer to ding/policy/mdqn.py for more details.

MDQNPolicy

class ding.policy.MDQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of Munchausen DQN algorithm, extended by auxiliary objectives. Paper link: https://arxiv.org/abs/2007.14430.

Config:

| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| 1 | type | str | mdqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different across modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling weight to correct biased update. If True, priority must be True. | |
| 6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
| 7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 8 | learn.update_per_collect | int | 1 | How many updates (iterations) to train after one collection of the collector. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy |
| 10 | learn.batch_size | int | 32 | The number of samples of an iteration | |
| 11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration | |
| 12 | learn.target_update_freq | int | 2000 | Frequency of target network update | Hard (assign) update |
| 13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake termination envs |
| 14 | collect.n_sample | int | 4 | The number of training samples of one call of the collector | It varies across envs |
| 15 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1 |
| 16 | other.eps.type | str | exp | Exploration rate decay type | Supports ['exp', 'linear'] |
| 17 | other.eps.start | float | 0.01 | Start value of exploration rate | [0, 1] |
| 18 | other.eps.end | float | 0.001 | End value of exploration rate | [0, 1] |
| 19 | other.eps.decay | int | 250000 | Decay length of exploration | Greater than 0. Setting decay=250000 means the exploration rate decays from the start value to the end value over the decay length |
| 20 | entropy_tau | float | 0.003 | The ratio of entropy in the TD loss | |
| 21 | alpha | float | 0.9 | The ratio of the Munchausen term in the TD loss | |

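The two MDQN-specific entries can be tuned in the same way as the rest of the config; the sketch assumes the standard default_config classmethod of DI-engine policies:
>>> from ding.policy import MDQNPolicy
>>> cfg = MDQNPolicy.default_config()
>>> cfg.entropy_tau = 0.003  # weight of the entropy term in the TD target
>>> cfg.alpha = 0.9          # weight of the Munchausen term in the TD loss
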
_forward_learn(data: Dict[str, Any]) Dict[str, Any][source]
Overview:

Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action_gap, clip_frac, priority.

Arguments:
  • data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by some utility functions such as default_preprocess_learn. For MDQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.

Returns:
  • info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; values must be Python scalars or a list of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.

Note

The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model instead of using the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.

Note

For more detailed examples, please refer to our unittest for MDQNPolicy: ding.policy.tests.test_mdqn.

_init_learn() None[source]
Overview:

Initialize the learn mode of policy, including related attributes and modules. For MDQN, it mainly contains the optimizer, algorithm-specific arguments such as entropy_tau, m_alpha and nstep, and the main and target models. This method will be called in the __init__ method if learn field is in enable_field.

Note

For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.

Note

For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.

Note

If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.

_monitor_vars_learn() List[str][source]
Overview:

Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.

Returns:
  • necessary_keys (List[str]): The list of the necessary keys to be logged.

Policy Factory

Please refer to ding/policy/policy_factory.py for more details.

PolicyFactory

class ding.policy.PolicyFactory[source]
Overview:

Policy factory class, used to generate different policies for general purposes, such as a random action policy, which is used to collect initial samples for better exploration when random_collect_size > 0.

Interfaces:

get_random_policy

static get_random_policy(policy: Policy.collect_mode, action_space: gym.spaces.Space = None, forward_fn: Callable = None) Policy.collect_mode[source]
Overview:

According to the given action space, define the forward function of the random policy, then pack it with other interfaces of the given policy, and return the final collect mode interfaces of policy.

Arguments:
  • policy (Policy.collect_mode): The collect mode interfaces of the policy.

  • action_space (gym.spaces.Space): The action space of the environment, gym-style.

  • forward_fn (Callable): If the action space is too complex, you can define your own forward function and pass it to this function; note that you should set action_space to None in this case.

Returns:
  • policy (Policy.collect_mode): The final collect mode interfaces of the policy, whose forward function is replaced with the random one.

get_random_policy

ding.policy.get_random_policy(cfg: EasyDict, policy: Policy.collect_mode, env: BaseEnvManager) Policy.collect_mode[source]
Overview:

The entry function to get the corresponding random policy. If a policy needs special data items in a transition, it returns the policy itself; otherwise, PolicyFactory is used to return a general random policy.

Arguments:
  • cfg (EasyDict): The EasyDict-type dict configuration.

  • policy (Policy.collect_mode): The collect mode interfaces of the policy.

  • env (BaseEnvManager): The env manager instance, which is used to get the action space for random action generation.

Returns:
  • policy (Policy.collect_mode): The collect mode interfaces of the random policy, or the original policy itself if it needs special data items in a transition.
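
A typical use is random data collection at the beginning of training. The sketch below assumes cfg.policy, policy and collector_env already exist in the caller (e.g. a serial entry script):
>>> from ding.policy import get_random_policy
>>> random_collect_policy = get_random_policy(cfg.policy, policy.collect_mode, collector_env)
>>> # hand random_collect_policy to the collector until random_collect_size samples are gathered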

Common Utilities

Please refer to ding/policy/common_utils.py for more details.

default_preprocess_learn

ding.policy.default_preprocess_learn(data: List[Any], use_priority_IS_weight: bool = False, use_priority: bool = False, use_nstep: bool = False, ignore_done: bool = False) Dict[str, Tensor][source]
Overview:

Default data pre-processing in the policy's _forward_learn method, including stacking batch data and pre-processing for ignore_done, nstep reward, and priority IS weight.

Arguments:
  • data (List[Any]): The list of a training batch samples, each sample is a dict of PyTorch Tensor.

  • use_priority_IS_weight (bool): Whether to use priority IS weight correction, if True, this function will set the weight of each sample to the priority IS weight.

  • use_priority (bool): Whether to use priority, if True, this function will set the priority IS weight.

  • use_nstep (bool): Whether to use nstep TD error, if True, this function will reshape the reward.

  • ignore_done (bool): Whether to ignore done, if True, this function will set the done to 0.

Returns:
  • data (Dict[str, torch.Tensor]): The preprocessed dict data whose values can be directly used for the following model forward and loss computation.
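
A minimal sketch of how a custom _forward_learn might call it; the key names follow the usual transition convention and the flags mirror the arguments above:
>>> from ding.policy import default_preprocess_learn
>>> batch = default_preprocess_learn(data, use_priority=True, use_nstep=True, ignore_done=False)
>>> obs, action, reward = batch['obs'], batch['action'], batch['reward']  # stacked along the batch dim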

single_env_forward_wrapper

ding.policy.single_env_forward_wrapper(forward_fn: Callable) Callable[source]
Overview:

Wrap policy to support gym-style interaction between policy and single environment.

Arguments:
  • forward_fn (Callable): The original forward function of policy.

Returns:
  • wrapped_forward_fn (Callable): The wrapped forward function of policy.

Examples:
>>> env = gym.make('CartPole-v0')
>>> policy = DQNPolicy(...)
>>> forward_fn = single_env_forward_wrapper(policy.eval_mode.forward)
>>> obs = env.reset()
>>> action = forward_fn(obs)
>>> next_obs, rew, done, info = env.step(action)

single_env_forward_wrapper_ttorch

ding.policy.single_env_forward_wrapper_ttorch(forward_fn: Callable, cuda: bool = True) Callable[source]
Overview:

Wrap policy to support gym-style interaction between policy and single environment for treetensor (ttorch) data.

Arguments:
  • forward_fn (Callable): The original forward function of policy.

  • cuda (bool): Whether to use cuda in policy, if True, this function will move the input data to cuda.

Returns:
  • wrapped_forward_fn (Callable): The wrapped forward function of policy.

Examples:
>>> env = gym.make('CartPole-v0')
>>> policy = PPOFPolicy(...)
>>> forward_fn = single_env_forward_wrapper_ttorch(policy.eval)
>>> obs = env.reset()
>>> action = forward_fn(obs)
>>> next_obs, rew, done, info = env.step(action)