ding.policy¶
Base Policy¶
Please refer to ding/policy/base_policy.py
for more details.
Policy¶
- class ding.policy.Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
The basic class of Reinforcement Learning (RL) and Imitation Learning (IL) policy in DI-engine.
- Property:
cfg, learn_mode, collect_mode, eval_mode
- __init__(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None) None [source]¶
- Overview:
Initialize the policy instance according to the input configuration and model. This method will initialize different fields in the policy, including learn, collect and eval. The learn field is used to train the policy, the collect field is used to collect data for training, and the eval field is used to evaluate the policy. The enable_field argument is used to specify which fields to initialize; if it is None, all fields will be initialized.
- Arguments:
cfg (EasyDict): The final merged config used to initialize policy. For the default config, see the config attribute and its comments in the policy class.
model (torch.nn.Module): The neural network model used to initialize policy. If it is None, the model will be created according to the default_model method and the cfg.model field. Otherwise, the model will be set to the model instance created by the outside caller.
enable_field (Optional[List[str]]): The list of fields to initialize. If it is None, all fields will be initialized. Otherwise, only the fields in enable_field will be initialized, which is beneficial to save resources.
Note
A derived policy class should implement the _init_learn, _init_collect and _init_eval methods to initialize the corresponding fields.
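For illustration, a minimal sketch of how enable_field restricts initialization, assuming a discrete-action setup where cfg.model only needs obs_shape and action_shape (the concrete config contents are an assumption, not the full required config):
from easydict import EasyDict
from ding.policy import DQNPolicy

cfg = EasyDict(DQNPolicy.default_config())
cfg.model = EasyDict(obs_shape=4, action_shape=2)

# Only the collect field is initialized, so _init_collect is called while
# _init_learn and _init_eval are skipped, saving resources on a pure collector worker.
collector_policy = DQNPolicy(cfg, enable_field=['collect'])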
- __repr__() str [source]¶
- Overview:
Get the string representation of the policy.
- Returns:
repr (str): The string representation of the policy.
- _create_model(cfg: EasyDict, model: Module | None = None) Module [source]¶
- Overview:
Create or validate the neural network model according to the input configuration and model. If the input model is None, the model will be created according to the default_model method and the cfg.model field. Otherwise, the model will be verified as an instance of torch.nn.Module and set to the model instance created by the outside caller.
- Arguments:
cfg (EasyDict): The final merged config used to initialize policy.
model (torch.nn.Module): The neural network model used to initialize policy. Users can refer to the default model defined in the corresponding policy to customize their own model.
- Returns:
model (torch.nn.Module): The created neural network model. The different modes of policy will add distinct wrappers and plugins to the model, which are used to train, collect and evaluate.
- Raises:
RuntimeError: If the input model is not None and is not an instance of torch.nn.Module.
- abstract _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs, or the action logits to calculate the loss in learn mode. This method is left to be implemented by the subclass, and more arguments can be added in the kwargs part if necessary.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in the self._process_transition method. The keys of the dict are the same as the input data, i.e. environment ids.
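As a rough sketch of this input/output contract (not taken from any concrete DI-engine policy; the self._collect_model attribute and its output keys are assumptions):
import torch
from typing import Any, Dict

def _forward_collect(self, data: Dict[int, Any], **kwargs) -> Dict[int, Any]:
    env_ids = list(data.keys())
    obs = torch.stack([data[i] for i in env_ids], dim=0)  # batch the per-env observations
    with torch.no_grad():
        output = self._collect_model.forward(obs)  # assumed to return e.g. {'action': ..., 'logit': ...}
    # Split the batched output back into a per-env dict keyed by environment id.
    return {i: {k: v[idx] for k, v in output.items()} for idx, i in enumerate(env_ids)}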
- abstract _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance, such as interacting with envs or computing metrics on a validation dataset). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. This method is left to be implemented by the subclass.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The keys of the dict are the same as the input data, i.e. environment ids.
- abstract _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss value, policy entropy, q value, priority, and so on. This method is left to be implemented by the subclass, and more items can be added to data if necessary.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, in the _forward_learn method, data should be stacked in the batch dimension by some utility functions such as default_preprocess_learn.
- Returns:
output (Dict[str, Any]): The training information of policy forward, including some metrics for monitoring training such as loss, priority, q value, policy entropy, and some data for the next training step such as priority. Note that the output data items should be Python native scalars rather than PyTorch tensors, which is convenient for the outside to use.
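A schematic sketch of this return convention; the batching, the _compute_loss helper, and the self._optimizer attribute are placeholders/assumptions, not DI-engine APIs:
import torch

def _forward_learn(self, data):
    # Stack the list of sample dicts into batched tensors (a minimal stand-in for
    # utilities such as default_preprocess_learn mentioned above).
    batch = {k: torch.stack([d[k] for d in data]) for k in data[0]}
    loss = self._compute_loss(batch)  # placeholder for the algorithm-specific loss
    self._optimizer.zero_grad()
    loss.backward()
    self._optimizer.step()
    # Return Python scalars (e.g. via .item()), not tensors, so loggers can consume them directly.
    return {'total_loss': loss.item(), 'cur_lr': self._optimizer.defaults['lr']}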
- _get_attribute(name: str) Any [source]¶
- Overview:
In order to control access to the policy attributes, we expose different modes to the outside rather than the policy instance itself, and we also provide this method to get an attribute of the policy in different modes.
- Arguments:
name (str): The name of the attribute.
- Returns:
value (Any): The value of the attribute.
Note
DI-engine's policy will first try to access the _get_{name} method, and then try to access the _{name} attribute. If neither of them is found, it will raise a NotImplementedError.
- abstract _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions) data, process it into a list of samples that can be used for training directly. A train sample can be a processed transition (DQN with nstep TD) or some multi-timestep transitions (DRQN). This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples, each element is in a similar format to the input transitions, but may contain more data for training, such as nstep reward, advantage, etc.
Note
We will vectorize the process_transition and get_train_sample methods in a following release version, and users can customize this data processing procedure by overriding these two methods and the collector itself.
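As a hedged sketch of what such preprocessing can look like, the following hypothetical n-step variant replaces each reward with the discounted sum of the next nstep rewards (field names follow the transition format described above; this is not DI-engine's actual implementation):
def _get_train_sample(self, transitions, nstep=3, gamma=0.99):
    samples = []
    for t, transition in enumerate(transitions):
        sample = dict(transition)
        # Discounted sum of the next `nstep` rewards for n-step TD targets.
        rewards = [tr['reward'] for tr in transitions[t:t + nstep]]
        sample['reward'] = sum((gamma ** i) * r for i, r in enumerate(rewards))
        # Bootstrap from the observation reached after the n-step window.
        sample['next_obs'] = transitions[min(t + nstep, len(transitions)) - 1]['next_obs']
        samples.append(sample)
    return samples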
- abstract _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. This method will be called in the __init__ method if the collect field is in enable_field. Almost every policy has its own collect mode, so this method must be overridden in the subclass.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_collect and _load_state_dict_collect methods.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- abstract _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. This method will be called in the __init__ method if the eval field is in enable_field. Almost every policy has its own eval mode, so this method must be overridden in the subclass.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_eval and _load_state_dict_eval methods.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- abstract _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. This method will be called in the __init__ method if the learn field is in enable_field. Almost every policy has its own learn mode, so this method must be overridden in the subclass.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _init_multi_gpu_setting(model: Module, bp_update_sync: bool) None [source]¶
- Overview:
Initialize multi-gpu data parallel training setting, including broadcast model parameters at the beginning of the training, and prepare the hook function to allreduce the gradients of model parameters.
- Arguments:
model (torch.nn.Module): The neural network model to be trained.
bp_update_sync (bool): Whether to synchronously update the model parameters after allreducing their gradients. Asynchronous update can be parallelized across different network layers like a pipeline, which can save time.
- _load_state_dict_collect(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy collect mode, such as load pretrained state_dict, auto-recover checkpoint, or model replica from learner in distributed training scenarios.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy collect state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _load_state_dict_eval(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy eval mode, such as load auto-recover checkpoint, or model replica from learner in distributed training scenarios.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy eval state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
Tip
The default implementation is ['cur_lr', 'total_loss']. Other derived classes can overwrite this method to add their own keys if necessary.
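For example, a derived policy could extend the logged keys like this (the extra key names are illustrative and must match the keys returned by its own _forward_learn):
def _monitor_vars_learn(self):
    # Keep the base keys and append algorithm-specific metrics.
    return super()._monitor_vars_learn() + ['q_value', 'target_q_value', 'priority']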
- abstract _process_transition(obs: Tensor | Dict[str, Tensor], policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, such as <s, a, r, s’, done>. Some policies need to do some special process and pack its own necessary attributes (e.g. hidden state and logit), so this method is left to be implemented by the subclass.
- Arguments:
obs (Union[torch.Tensor, Dict[str, torch.Tensor]]): The observation of the current timestep.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. Usually, it contains the action and the logit of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except that all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
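A minimal DQN-style sketch of such packing, assuming timestep carries the next obs, reward and done fields as described above:
def _process_transition(self, obs, policy_output, timestep):
    # Pack <s, a, r, s', done> plus any extra policy outputs (e.g. logit) the learner needs.
    return {
        'obs': obs,
        'action': policy_output['action'],
        'logit': policy_output['logit'],
        'next_obs': timestep.obs,
        'reward': timestep.reward,
        'done': timestep.done,
    }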
- _reset_collect(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for collect mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means resetting all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different environments/episodes in collecting listed in data_id will have different RNN hidden states.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.
Note
This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.
- _reset_eval(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for eval mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means resetting all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different environments/episodes in evaluation listed in data_id will have different RNN hidden states.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.
Note
This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.
- _reset_learn(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for learn mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means resetting all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different trajectories listed in data_id will have different RNN hidden states.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.
Note
This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.
- _set_attribute(name: str, value: Any) None [source]¶
- Overview:
In order to control access to the policy attributes, we expose different modes to the outside rather than the policy instance itself, and we also provide this method to set an attribute of the policy in different modes. The new attribute will be named _{name}.
- Arguments:
name (str): The name of the attribute.
value (Any): The value of the attribute.
- _state_dict_collect() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of collect mode, usually only including the model, which is necessary for distributed training scenarios to auto-recover collectors.
- Returns:
state_dict (Dict[str, Any]): The dict of the current policy collect state, for saving and restoring.
Tip
Not all scenarios need to auto-recover collectors; sometimes, we can directly shut down the crashed collector and start a new one.
- _state_dict_eval() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of eval mode, usually only including the model, which is necessary for distributed training scenarios to auto-recover evaluators.
- Returns:
state_dict (Dict[str, Any]): The dict of the current policy eval state, for saving and restoring.
Tip
Not all scenarios need to auto-recover evaluators; sometimes, we can directly shut down the crashed evaluator and start a new one.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model and optimizer.
- Returns:
state_dict (Dict[str, Any]): The dict of the current policy learn state, for saving and restoring.
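A typical implementation pairs this method with _load_state_dict_learn; a hedged sketch, where the attribute names self._learn_model and self._optimizer are assumptions:
def _state_dict_learn(self):
    return {
        'model': self._learn_model.state_dict(),
        'optimizer': self._optimizer.state_dict(),
    }

def _load_state_dict_learn(self, state_dict):
    self._learn_model.load_state_dict(state_dict['model'])
    self._optimizer.load_state_dict(state_dict['optimizer'])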
- property collect_mode: collect_function¶
- Overview:
Return the interfaces of collect mode of policy, which are used to collect training data by interacting with envs. Here we use a namedtuple to define immutable interfaces and restrict the usage of policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own collect mode.
- Returns:
interfaces (Policy.collect_function): The interfaces of collect mode of policy; it is a namedtuple whose values of distinct fields are different internal methods.
- Examples:
>>> policy = Policy(cfg, model)
>>> policy_collect = policy.collect_mode
>>> obs = env_manager.ready_obs
>>> inference_output = policy_collect.forward(obs)
>>> next_obs, rew, done, info = env_manager.step(inference_output.action)
- classmethod default_config() EasyDict [source]¶
- Overview:
Get the default config of policy. This method is used to create the default config of policy.
- Returns:
cfg (EasyDict): The default config of the corresponding policy. For a derived policy class, it will recursively merge the default config of the base class and its own default config.
Tip
This method will deepcopy the config attribute of the class and return the result, so users don't need to worry about modifying the returned config.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
Users can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for DQN, its registered name is dqn and the import_names is ding.model.template.q_learning.DQN.
- property eval_mode: eval_function¶
- Overview:
Return the interfaces of eval mode of policy, which are used to evaluate the policy performance. Here we use a namedtuple to define immutable interfaces and restrict the usage of policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own eval mode.
- Returns:
interfaces (Policy.eval_function): The interfaces of eval mode of policy; it is a namedtuple whose values of distinct fields are different internal methods.
- Examples:
>>> policy = Policy(cfg, model)
>>> policy_eval = policy.eval_mode
>>> obs = env_manager.ready_obs
>>> inference_output = policy_eval.forward(obs)
>>> next_obs, rew, done, info = env_manager.step(inference_output.action)
- property learn_mode: learn_function¶
- Overview:
Return the interfaces of learn mode of policy, which is used to train the model. Here we use namedtuple to define immutable interfaces and restrict the usage of policy in different modes. Moreover, derived subclass can override the interfaces to customize its own learn mode.
- Returns:
interfaces (Policy.learn_function): The interfaces of learn mode of policy; it is a namedtuple whose values of distinct fields are different internal methods.
- Examples:
>>> policy = Policy(cfg, model)
>>> policy_learn = policy.learn_mode
>>> train_output = policy_learn.forward(data)
>>> state_dict = policy_learn.state_dict()
- sync_gradients(model: Module) None [source]¶
- Overview:
Synchronize (allreduce) gradients of model parameters in data-parallel multi-gpu training.
- Arguments:
model (torch.nn.Module): The model whose gradients are to be synchronized.
Note
This method is only used in multi-gpu training, and it should be called after the backward method and before the step method. Users can also use the bp_update_sync config to control whether to synchronize gradient allreduce and optimizer updates.
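In a data-parallel training step, the call order described in the note above would look roughly like this (a sketch; policy_loss, batch, model and optimizer are placeholders for the user's own objects):
loss = policy_loss(batch)      # placeholder for the algorithm-specific loss
optimizer.zero_grad()
loss.backward()
policy.sync_gradients(model)   # allreduce gradients across GPUs before the parameter update
optimizer.step()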
CommandModePolicy¶
- class ding.policy.CommandModePolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy with command mode, which can be used in the old version of the DI-engine pipeline: serial_pipeline. CommandModePolicy uses the _get_setting_learn, _get_setting_collect and _get_setting_eval methods to exchange information between different workers.
- Interface:
_init_command, _get_setting_learn, _get_setting_collect, _get_setting_eval
- Property:
command_mode
- abstract _get_setting_collect(command_info: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of collect mode, which contains dynamically changed hyperparameters for collect mode, such as eps, temperature, etc.
- Arguments:
command_info (Dict[str, Any]): The global training information, which is defined in commander.
- Returns:
setting (Dict[str, Any]): The latest setting of collect mode, which is usually used as extra arguments of the policy._forward_collect method.
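For instance, an epsilon-greedy policy could map the collected env step to an exploration rate; a sketch where the decay schedule and the 'envstep' key in command_info are assumptions:
def _get_setting_collect(self, command_info):
    # Linearly decay eps from 0.95 to 0.1 over the first 10000 collected env steps (illustrative).
    step = command_info.get('envstep', 0)
    eps = max(0.1, 0.95 - (0.95 - 0.1) * step / 10000)
    # The returned dict is passed as extra keyword arguments to policy._forward_collect.
    return {'eps': eps}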
- abstract _get_setting_eval(command_info: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of eval mode, which contains dynamically changed hyperparameters for eval mode, such as temperature, etc.
- Arguments:
command_info (Dict[str, Any]): The global training information, which is defined in commander.
- Returns:
setting (Dict[str, Any]): The latest setting of eval mode, which is usually used as extra arguments of the policy._forward_eval method.
- abstract _get_setting_learn(command_info: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of learn mode, which contains dynamically changed hyperparameters for learn mode, such as batch_size, learning_rate, etc.
- Arguments:
command_info (Dict[str, Any]): The global training information, which is defined in commander.
- Returns:
setting (Dict[str, Any]): The latest setting of learn mode, which is usually used as extra arguments of the policy._forward_learn method.
- abstract _init_command() None [source]¶
- Overview:
Initialize the command mode of policy, including related attributes and modules. This method will be called in the __init__ method if the command field is in enable_field. Almost every policy has its own command mode, so this method must be overridden in the subclass.
Note
If you want to set some special member variables in the _init_command method, you'd better name them with the prefix _command_ to avoid conflicts with other modes, such as self._command_attr1.
- property command_mode: Policy.command_function¶
- Overview:
Return the interfaces of command mode of policy. Here we use a namedtuple to define immutable interfaces and restrict the usage of policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own command mode.
- Returns:
interfaces (Policy.command_function): The interfaces of command mode; it is a namedtuple whose values of distinct fields are different internal methods.
- Examples:
>>> policy = CommandModePolicy(cfg, model)
>>> policy_command = policy.command_mode
>>> settings = policy_command.get_setting_learn(command_info)
create_policy¶
- ding.policy.create_policy(cfg: EasyDict, **kwargs) Policy [source]¶
- Overview:
Create a policy instance according to cfg and other kwargs.
- Arguments:
cfg (EasyDict): Final merged policy config.
- ArgumentsKeys:
type (str): Policy type set in the POLICY_REGISTRY.register method, such as dqn.
import_names (List[str]): A list of module names (paths) to import before creating the policy, such as ding.policy.dqn.
- Returns:
policy (Policy): The created policy instance.
Tip
kwargs contains other arguments that need to be passed to the policy constructor. You can refer to the __init__ method of the corresponding policy class for details.
Note
For more details about how to merge configs, please refer to the system document of DI-engine (en link).
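A hedged usage sketch: only type and import_names are required by the registry lookup described above, and the remaining policy config (omitted here) would normally come from the merged experiment config:
from easydict import EasyDict
from ding.policy import create_policy

cfg = EasyDict({
    'type': 'dqn',
    'import_names': ['ding.policy.dqn'],
    # the rest of the merged policy config goes here
})
policy = create_policy(cfg, enable_field=['learn', 'collect', 'eval'])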
get_policy_cls¶
- ding.policy.get_policy_cls(cfg: EasyDict) type [source]¶
- Overview:
Get the policy class according to cfg, which is used to access related class variables/methods.
- Arguments:
cfg (EasyDict): Final merged policy config.
- ArgumentsKeys:
type (str): Policy type set in the POLICY_REGISTRY.register method, such as dqn.
import_names (List[str]): A list of module names (paths) to import before creating the policy, such as ding.policy.dqn.
- Returns:
policy (type): The policy class.
DQN¶
Please refer to ding/policy/dqn.py
for more details.
DQNPolicy¶
- class ding.policy.DQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of DQN algorithm, extended by Double DQN/Dueling DQN/PER/multi-step TD.
- Config:
1. type (str; default: dqn): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool; default: False): Whether to use cuda for the network. This arg can be different between modes.
3. on_policy (bool; default: False): Whether the RL algorithm is on-policy or off-policy.
4. priority (bool; default: False): Whether to use priority (PER). Priority sample, update priority.
5. priority_IS_weight (bool; default: False): Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True.
6. discount_factor (float; default: 0.97, [0.95, 0.999]): Reward's future discount factor, aka. gamma. May be 1 for sparse-reward envs.
7. nstep (int; default: 1, [3, 5]): N-step reward discount sum for target q_value estimation.
8. model.dueling (bool; default: True): Dueling head architecture.
9. model.encoder_hidden_size_list (list of int; default: [32, 64, 64, 128]): Sequence of hidden_size of subsequent conv layers and the final dense layer. Default kernel_size is [8, 4, 3], default stride is [4, 2, 1].
10. model.dropout (float; default: None): Dropout rate for dropout layers, in [0, 1]. If set to None, no dropout is used.
11. learn.update_per_collect (int; default: 3): How many updates (iterations) to train after one collection of the collector. Only valid in serial training. This arg can vary between envs; a bigger value means more off-policy.
12. learn.batch_size (int; default: 64): The number of samples of an iteration.
13. learn.learning_rate (float; default: 0.001): Gradient step length of an iteration.
14. learn.target_update_freq (int; default: 100): Frequency of target network update. Hard (assign) update.
15. learn.target_theta (float; default: 0.005): Frequency of target network update; only one of [target_update_freq, target_theta] should be set. Soft (assign) update.
16. learn.ignore_done (bool; default: False): Whether to ignore done for target value calculation. Enable it for some fake-termination envs.
17. collect.n_sample (int; default: [8, 128]): The number of training samples of one call of the collector. It varies between envs.
18. collect.n_episode (int; default: 8): The number of training episodes of one call of the collector. Only one of [n_sample, n_episode] should be set.
19. collect.unroll_len (int; default: 1): Unroll length of an iteration. In RNN, unroll_len > 1.
20. other.eps.type (str; default: exp): Exploration rate decay type. Supports ['exp', 'linear'].
21. other.eps.start (float; default: 0.95): Start value of exploration rate, in [0, 1].
22. other.eps.end (float; default: 0.1): End value of exploration rate, in [0, 1].
23. other.eps.decay (int; default: 10000): Decay length of exploration, greater than 0. Setting decay=10000 means the exploration rate decays from the start value to the end value over the decay length.
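The keys above map onto a nested config dict; a hedged sketch of a typical DQN config fragment using the defaults listed above (the exact nesting is inferred from the dotted key names):
dqn_config = dict(
    type='dqn',
    cuda=False,
    on_policy=False,
    priority=False,
    discount_factor=0.97,
    nstep=1,
    model=dict(dueling=True, encoder_hidden_size_list=[32, 64, 64, 128]),
    learn=dict(update_per_collect=3, batch_size=64, learning_rate=0.001, target_update_freq=100),
    collect=dict(n_sample=8, unroll_len=1),
    other=dict(eps=dict(type='exp', start=0.95, end=0.1, decay=10000)),
)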
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
eps (float): The epsilon value for exploration.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in the self._process_transition method. The keys of the dict are the same as the input data, i.e. environment ids.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The keys of the dict are the same as the input data, i.e. environment ids.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions) data, process it into a list of samples that can be used for training directly. In DQN with nstep TD, a train sample is a processed transition. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples, each element is similar in format to the input transitions, but may contain more data for training, such as nstep reward and target obs.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For DQN, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sample mechanism, and other algorithm-specific arguments such as unroll_len and nstep. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to be initialized independently in different modes, such as gamma and nstep in DQN. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For DQN, it contains the eval model to greedily select actions with the argmax q_value mechanism. This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DQN, it mainly contains the optimizer, algorithm-specific arguments such as nstep and gamma, and the main and target models. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DQN, it contains obs, next_obs, action, reward, done.
- Arguments:
obs (torch.Tensor): The env observation of the current timestep, such as a stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For DQN, it contains the action and the logit (q_value) of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except that all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizer.
- Returns:
state_dict (Dict[str, Any]): The dict of the current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
Users can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for DQN, its registered name is dqn and the import_names is ding.model.template.q_learning.
DQNSTDIMPolicy¶
- class ding.policy.DQNSTDIMPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of DQN algorithm, extended by ST-DIM auxiliary objectives. ST-DIM paper link: https://arxiv.org/abs/1906.08226.
- Config:
1. type (str; default: dqn_stdim): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool; default: False): Whether to use cuda for the network. This arg can be different between modes.
3. on_policy (bool; default: False): Whether the RL algorithm is on-policy or off-policy.
4. priority (bool; default: False): Whether to use priority (PER). Priority sample, update priority.
5. priority_IS_weight (bool; default: False): Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True.
6. discount_factor (float; default: 0.97, [0.95, 0.999]): Reward's future discount factor, aka. gamma. May be 1 for sparse-reward envs.
7. nstep (int; default: 1, [3, 5]): N-step reward discount sum for target q_value estimation.
8. learn.update_per_collect_gpu (int; default: 3): How many updates (iterations) to train after one collection of the collector. Only valid in serial training. This arg can vary between envs; a bigger value means more off-policy.
9. learn.batch_size (int; default: 64): The number of samples of an iteration.
10. learn.learning_rate (float; default: 0.001): Gradient step length of an iteration.
11. learn.target_update_freq (int; default: 100): Frequency of target network update. Hard (assign) update.
12. learn.ignore_done (bool; default: False): Whether to ignore done for target value calculation. Enable it for some fake-termination envs.
13. collect.n_sample (int; default: [8, 128]): The number of training samples of one call of the collector. It varies between envs.
14. collect.unroll_len (int; default: 1): Unroll length of an iteration. In RNN, unroll_len > 1.
15. other.eps.type (str; default: exp): Exploration rate decay type. Supports ['exp', 'linear'].
16. other.eps.start (float; default: 0.95): Start value of exploration rate, in [0, 1].
17. other.eps.end (float; default: 0.1): End value of exploration rate, in [0, 1].
18. other.eps.decay (int; default: 10000): Decay length of exploration, greater than 0. Setting decay=10000 means the exploration rate decays from the start value to the end value over the decay length.
19. aux_loss_weight (float; default: 0.001): The ratio of the auxiliary loss to the TD loss. Any real value, typically in [-0.1, 0.1].
- _forward_learn(data: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, priority, aux_loss.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DQNSTDIM, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DQNSTDIM, it first calls the super class's _init_learn method, then initializes the extra auxiliary model, its optimizer, and the loss weight. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _model_encode(data: dict) Tuple[Tensor] [source]¶
- Overview:
Get the encoding of the main model as input for the auxiliary model.
- Arguments:
data (dict): Dict type data, the same as the _forward_learn input.
- Returns:
(Tuple[torch.Tensor]): The tuple of two tensors used for contrastive embedding learning. In the ST-DIM algorithm, these two variables are the DQN encodings of obs and next_obs respectively.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
PPO¶
Please refer to ding/policy/ppo.py
for more details.
PPOPolicy¶
- class ding.policy.PPOPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of on-policy version PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347.
- _forward_collect(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode defined in the self._process_transition method. The keys of the dict are the same as the input data, i.e. environment ids.
Tip
If you want to add more tricks to this policy, like a temperature factor for multinomial sampling, you can pass the related data as extra keyword arguments of this method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs. _forward_eval in PPO often uses a deterministic sample method to get actions, while _forward_collect usually uses a stochastic sample method to balance exploration and exploitation.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The keys of the dict are the same as the input data, i.e. environment ids.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.
- _forward_learn(data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac, approx_kl.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including the latest collected training samples for on-policy algorithms like PPO. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPO, each element in the list is a dict containing at least the following keys: obs, action, reward, logit, value, done. Sometimes, it also contains other keys such as weight.
- Returns:
return_infos (List[Dict[str, Any]]): The information list that indicates the training result; each training iteration appends an information dict to the final list. The list will be processed and recorded in the text log and tensorboard. The values of the dict must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Tip
The training procedure of PPO consists of two for loops. The outer loop trains all the collected training samples for epoch_per_collect epochs. The inner loop splits all the data into different mini-batches of length batch_size.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.
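The two-loop structure mentioned in the tip above can be sketched as follows (split_data and ppo_update are hypothetical helpers standing in for the shuffle-and-chunk step and one gradient update):
return_infos = []
for epoch in range(epoch_per_collect):
    for minibatch in split_data(train_data, batch_size):  # hypothetical shuffle-and-chunk helper
        info = ppo_update(minibatch)                       # one gradient step on this mini-batch
        return_infos.append(info)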
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions) data, process it into a list of samples that can be used for training directly. In PPO, a train sample is a processed transition with the newly computed traj_flag and adv fields. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples, each element is in a similar format to the input transitions, but may contain more data for training, such as GAE advantage.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For PPO, it contains the collect_model to balance exploration and exploitation (e.g. the multinomial sample mechanism in discrete action spaces), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to be initialized independently in different modes, such as gamma and gae_lambda in PPO. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For PPO, it contains the eval model to select the optimal action (e.g. greedily selecting the action with the argmax mechanism in discrete action spaces). This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For PPO, it mainly contains the optimizer and algorithm-specific arguments such as loss weight, clip_ratio and recompute_adv. This method also executes some special network initializations and prepares a running mean/std monitor for the value. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPO, it contains obs, next_obs, action, reward, done, logit, value.
- Arguments:
obs (torch.Tensor): The env observation of the current timestep, such as a stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For PPO, it contains the state value, the action and the logit of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except that all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
Note
next_obs is used to calculate the nstep return when necessary, so we place it into the transition by default. You can delete this field to save memory if you do not need the nstep return.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
Users can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for PPO, its registered name is ppo and the import_names is ding.model.template.vac.
Note
Because PPO now supports both single-agent and multi-agent usages, we implement these functions with the same policy and two different default models, which is controlled by self._cfg.multi_agent.
PPOPGPolicy¶
- class ding.policy.PPOPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of the on-policy version of the PPO algorithm (pure policy gradient without a value network). Paper link: https://arxiv.org/abs/1707.06347.
- _forward_collect(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit) for learn mode defined in the self._process_transition method. The keys of the dict are the same as the input data, i.e. environment ids.
Tip
If you want to add more tricks to this policy, like a temperature factor for multinomial sampling, you can pass the related data as extra keyword arguments of this method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
_forward_eval
in PPO often uses deterministic sample method to get actions while_forward_collect
usually uses stochastic sample method for balance exploration and exploitation.- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOPGPolicy:
ding.policy.tests.test_ppo
.
- _forward_learn(data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac, approx_kl.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including the latest collected training samples for on-policy algorithms like PPO. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPOPG, each element in the list is a dict containing at least the following keys: obs, action, return, logit, done. Sometimes, it also contains other keys such as weight.
- Returns:
return_infos (
List[Dict[str, Any]]
): The information list that indicates the training result; each training iteration appends an information dict to the final list. The list will be processed and recorded in the text log and tensorboard. The values of the dict must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Tip
The training procedure of PPOPG consists of two for loops. The outer loop trains on all the collected training samples for epoch_per_collect epochs. The inner loop splits all the data into mini-batches of length batch_size.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
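As a rough illustration of the tip above, the two-loop training structure could be sketched as follows (the names update_fn, samples, epoch_per_collect and batch_size are placeholders here, not DI-engine internals):

    import random

    def ppo_pg_train(update_fn, samples, epoch_per_collect, batch_size):
        # Outer loop: revisit the freshly collected samples for several epochs.
        for _ in range(epoch_per_collect):
            random.shuffle(samples)
            # Inner loop: split the data into mini-batches and update on each one.
            for i in range(0, len(samples), batch_size):
                update_fn(samples[i:i + batch_size])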
- _get_train_sample(data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For the given entire episode data (a list of transitions), process it into a list of samples that can be used for training directly. In PPOPG, a train sample is a processed transition with a newly computed return field. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
data (
List[Dict[str, Any]]
): The episode data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the discounted episode return.
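A minimal sketch of how such a discounted episode return could be computed and attached to each transition, assuming a scalar reward field and a gamma hyperparameter (DI-engine's own helper may differ in details):

    import copy

    def attach_discounted_return(episode, gamma=0.99):
        # episode: list of transition dicts, each containing at least a scalar 'reward'.
        samples = copy.deepcopy(episode)
        running_return = 0.0
        # Walk the episode backwards and accumulate the discounted return.
        for transition in reversed(samples):
            running_return = transition['reward'] + gamma * running_return
            transition['return'] = running_return
        return samples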
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For PPOPG, it contains the collect_model to balance the exploration and exploitation (e.g. the multinomial sample mechanism in discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in
__init__
method ifcollect
field is inenable_field
.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to initialize independently in different modes, such as gamma and gae_lambda in PPO. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For PPOPG, it contains the eval model to select the optimal action (e.g. greedily select the action with the argmax mechanism in discrete action space). This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For PPOPG, it mainly contains optimizer, algorithm-specific arguments such as loss weight and clip_ratio. This method also executes some special network initializations. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
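For intuition only, the returned key list acts as a whitelist over the scalars produced by _forward_learn; the key names below are illustrative, not the exact PPOPG set:

    monitor_keys = ['cur_lr', 'total_loss', 'policy_loss', 'entropy_loss', 'approx_kl', 'clipfrac']
    train_info = {
        'cur_lr': 3e-4, 'total_loss': 0.42, 'policy_loss': 0.40,
        'entropy_loss': 0.02, 'approx_kl': 0.01, 'clipfrac': 0.1,
    }
    # The logger only records the whitelisted scalars.
    logged = {k: v for k, v in train_info.items() if k in monitor_keys}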
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPOPG, it contains obs, action, reward, done, logit.
- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For PPOPG, it contains the action and the logit of the action.timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
PPOOffPolicy¶
- class ding.policy.PPOOffPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of the off-policy version of the PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347. This version is more suitable for large-scale distributed training.
- _forward_collect(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode, as defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Tip
If you want to add more tricks on this policy, like temperature factor in multinomial sample, you can pass related data as extra keyword arguments of this method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOOffPolicy:
ding.policy.tests.test_ppo
.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
_forward_eval in PPO often uses the deterministic sample method to get actions, while _forward_collect usually uses the stochastic sample method to balance exploration and exploitation.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOOffPolicy:
ding.policy.tests.test_ppo
.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac and approx_kl.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPOOff, each element in the list is a dict containing at least the following keys: obs, adv, action, logit, value, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In PPO, a train sample is a processed transition with newly computed traj_flag and adv fields. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the GAE advantage.
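A textbook GAE computation over one trajectory, shown only to clarify what the extra adv field contains; DI-engine uses its own gae utility, whose interface may differ:

    def compute_gae(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
        # rewards, values, next_values, dones: equal-length per-step lists for one trajectory.
        advantages = [0.0] * len(rewards)
        last_adv = 0.0
        for t in reversed(range(len(rewards))):
            not_done = 1.0 - float(dones[t])
            delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
            last_adv = delta + gamma * gae_lambda * not_done * last_adv
            advantages[t] = last_adv
        return advantages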
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For PPOOff, it contains collect_model to balance the exploration and exploitation (e.g. the multinomial sample mechanism in discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in
__init__
method ifcollect
field is inenable_field
.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to initialize independently in different modes, such as gamma and gae_lambda in PPOOff. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For PPOOff, it contains the eval model to select the optimal action (e.g. greedily select the action with the argmax mechanism in discrete action space). This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For PPOOff, it mainly contains optimizer, algorithm-specific arguments such as loss weight and clip_ratio. This method also executes some special network initializations and prepares running mean/std monitor for value. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPO, it contains obs, next_obs, action, reward, done, logit, value.
- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For PPO, it contains the state value, action and the logit of the action.timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
Note
next_obs is used to calculate the nstep return when necessary, so we place it into the transition by default. You can delete this field to save memory if you do not need the nstep return.
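To make the role of next_obs concrete, a textbook n-step return bootstraps from a value estimate of the observation n steps ahead, roughly as sketched below (gamma and the bootstrap value are assumptions, not DI-engine's exact nstep helper):

    def nstep_return(rewards, bootstrap_value, gamma=0.99):
        # rewards: the n rewards following the current step; bootstrap_value: V(next_obs) after n steps.
        ret = bootstrap_value
        for r in reversed(rewards):
            ret = r + gamma * ret
        return ret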
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
PPOSTDIMPolicy¶
- class ding.policy.PPOSTDIMPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of the on-policy version of the PPO algorithm with an ST-DIM auxiliary model. PPO paper link: https://arxiv.org/abs/1707.06347. ST-DIM paper link: https://arxiv.org/abs/1906.08226.
- _forward_learn(data: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
Forward and backward function of learn mode.
- Arguments:
data (
dict
): Dict type data
- Returns:
info_dict (
Dict[str, Any]
): Including current lr, total_loss, policy_loss, value_loss, entropy_loss, adv_abs_max, approx_kl, clipfrac
- _init_learn() None [source]¶
- Overview:
Learn mode init method, called by self.__init__. Initialize the auxiliary model, its optimizer, and the weight of the auxiliary loss relative to the main loss.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): The dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
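A minimal, self-contained sketch of the partial-loading trick mentioned in the tip, using plain PyTorch (the tiny model and checkpoint are placeholders):

    import torch

    model = torch.nn.Linear(4, 2)                    # placeholder for the policy's network
    partial_ckpt = {'weight': torch.zeros(2, 4)}     # checkpoint that is missing the 'bias' key
    # strict=False skips missing/unexpected keys instead of raising an error.
    result = model.load_state_dict(partial_ckpt, strict=False)
    print('missing keys:', result.missing_keys, 'unexpected keys:', result.unexpected_keys)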
- _model_encode(data)[source]¶
- Overview:
Get the encoding of the main model as input for the auxiliary model.
- Arguments:
data (
dict
): Dict type data, same as the _forward_learn input.
- Returns:
- (
Tuple[Tensor]
): The tuple of two tensors used for contrastive embedding learning. In the ST-DIM algorithm, these two variables are the dqn encodings of obs and next_obs respectively.
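The two returned encodings are meant for contrastive embedding learning; an InfoNCE-style objective over such a pair could look like the sketch below (illustrative only, not the exact ST-DIM loss used by this policy):

    import torch
    import torch.nn.functional as F

    def info_nce(obs_embed, next_obs_embed, temperature=0.1):
        # obs_embed, next_obs_embed: (B, D) encodings of obs and next_obs; matching rows are positives.
        x = F.normalize(obs_embed, dim=1)
        y = F.normalize(next_obs_embed, dim=1)
        logits = x @ y.t() / temperature              # (B, B) similarity matrix
        labels = torch.arange(x.shape[0], device=x.device)
        return F.cross_entropy(logits, labels)        # positives lie on the diagonal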
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
BC¶
Please refer to ding/policy/bc.py
for more details.
BehaviourCloningPolicy¶
- class ding.policy.BehaviourCloningPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Behaviour Cloning (BC) policy class, which supports both discrete and continuous action spaces. The policy is trained by supervised learning, and the data is an offline dataset collected by an expert.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss and time.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For BC, each element in the list is a dict containing at least the following keys: obs, action.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
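Conceptually, a learn-mode step of discrete-action BC is plain supervised classification of the expert action. A textbook sketch (model, optimizer and batch tensors are placeholders, not DI-engine internals):

    import torch
    import torch.nn.functional as F

    def bc_train_step(model, optimizer, batch):
        # batch['obs']: (B, obs_dim) float tensor; batch['action']: (B,) long tensor of expert actions.
        logit = model(batch['obs'])
        loss = F.cross_entropy(logit, batch['action'])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return {'total_loss': loss.item()}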
- _init_collect() None [source]¶
- Overview:
BC policy uses an offline dataset, so it does not need to collect data. However, sometimes we need to use the trained BC policy to collect data for other purposes.
- _init_eval()[source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For BC, it contains the eval model to greedily select action with argmax q_value mechanism for discrete action space. This method will be called in
__init__
method ifeval
field is inenable_field
.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For BC, it mainly contains optimizer, algorithm-specific arguments such as lr_scheduler, loss, etc. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
Note
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For the discrete BC example, its registered name is discrete_bc and the import_names is ding.model.template.bc.
DDPG¶
Please refer to ding/policy/ddpg.py
for more details.
DDPGPolicy¶
- class ding.policy.DDPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of DDPG algorithm. Paper link: https://arxiv.org/abs/1509.02971.
- Config:
(Each entry gives the config key, its type and default value, a description, and any additional notes from the original Other/Shape column.)
1. type (str, default: ddpg): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool, default: False): Whether to use cuda for the network.
3. random_collect_size (int, default: 25000): Number of randomly collected training samples in the replay buffer when training starts. Default is 25000 for DDPG/TD3, 10000 for SAC.
4. model.twin_critic (bool, default: False): Whether to use two critic networks or only one. Default is False for DDPG; the Clipped Double Q-learning method in the TD3 paper.
5. learn.learning_rate_actor (float, default: 1e-3): Learning rate for the actor network (aka. policy).
6. learn.learning_rate_critic (float, default: 1e-3): Learning rate for the critic network (aka. Q-network).
7. learn.actor_update_freq (int, default: 2): When the critic network updates once, how many times the actor network updates. Default is 1 for DDPG, 2 for TD3; the Delayed Policy Updates method in the TD3 paper.
8. learn.noise (bool, default: False): Whether to add noise on the target network's action. Default is False for DDPG, True for TD3; Target Policy Smoothing Regularization in the TD3 paper.
9. learn.ignore_done (bool, default: False): Determine whether to ignore the done flag. Use ignore_done only in the halfcheetah env.
10. learn.target_theta (float, default: 0.005): Used for the soft update of the target network, aka. the interpolation factor in polyak averaging for target networks.
11. collect.noise_sigma (float, default: 0.1): Used to add noise during collection by controlling the sigma of the distribution. Sample noise from a distribution: Ornstein-Uhlenbeck process in the DDPG paper, Gaussian process in ours.
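The entries above map directly onto nested config fields. A hedged sketch of overriding a few of them with EasyDict (key names are taken from the table; merging with the full default config is handled by DI-engine itself):

    from easydict import EasyDict

    ddpg_overrides = EasyDict(dict(
        cuda=False,
        random_collect_size=25000,
        model=dict(twin_critic=False),
        learn=dict(
            learning_rate_actor=1e-3,
            learning_rate_critic=1e-3,
            actor_update_freq=1,
            noise=False,
            target_theta=0.005,
        ),
        collect=dict(noise_sigma=0.1),
    ))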
- _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data for learn mode, as defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DDPG, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and logit, which is used for the hybrid action space.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In DDPG, a train sample is a processed transition (unroll_len=1).
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For DDPG, it contains the collect_model to balance the exploration and exploitation with the perturbed noise mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in
__init__
method ifcollect
field is inenable_field
.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For DDPG, it contains the eval model to greedily select action type with argmax q_value mechanism for hybrid action space. This method will be called in
__init__
method ifeval
field is inenable_field
.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DDPG, it mainly contains two optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): The dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DDPG, it contains obs, next_obs, action, reward, done.
- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For DDPG, it contains the action and the logit of the action (in hybrid action space).timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizers.
- Returns:
state_dict (
Dict[str, Any]
): The dict of current policy learn state, for saving and restoring.
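A minimal sketch of how this hook pairs with _load_state_dict_learn for checkpointing, assuming policy is an already-constructed DDPGPolicy; real training usually delegates this to DI-engine's learner and checkpoint hooks:

    import torch

    def save_learn_state(policy, path='ddpg_ckpt.pth.tar'):
        # Save model, target model and optimizer states of learn mode.
        torch.save(policy._state_dict_learn(), path)

    def restore_learn_state(policy, path='ddpg_ckpt.pth.tar'):
        policy._load_state_dict_learn(torch.load(path, map_location='cpu'))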
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
TD3¶
Please refer to ding/policy/td3.py
for more details.
TD3Policy¶
- class ding.policy.TD3Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of the TD3 algorithm. Since DDPG and TD3 share many common things, we can easily derive this TD3 class from the DDPG class by changing _actor_update_freq, _twin_critic and the noise in the model wrapper. Paper link: https://arxiv.org/pdf/1802.09477.pdf
- Config:
(Each entry gives the config key, its type and default value, a description, and any additional notes from the original Other/Shape column.)
1. type (str, default: td3): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool, default: False): Whether to use cuda for the network.
3. random_collect_size (int, default: 25000): Number of randomly collected training samples in the replay buffer when training starts. Default is 25000 for DDPG/TD3, 10000 for SAC.
4. model.twin_critic (bool, default: True): Whether to use two critic networks or only one. Default is True for TD3; the Clipped Double Q-learning method in the TD3 paper.
5. learn.learning_rate_actor (float, default: 1e-3): Learning rate for the actor network (aka. policy).
6. learn.learning_rate_critic (float, default: 1e-3): Learning rate for the critic network (aka. Q-network).
7. learn.actor_update_freq (int, default: 2): When the critic network updates once, how many times the actor network updates. Default is 2 for TD3, 1 for DDPG; the Delayed Policy Updates method in the TD3 paper.
8. learn.noise (bool, default: True): Whether to add noise on the target network's action. Default is True for TD3, False for DDPG; Target Policy Smoothing Regularization in the TD3 paper.
9. learn.noise_range (dict, default: dict(min=-0.5, max=0.5)): Limit for the range of target policy smoothing noise, aka. noise_clip.
10. learn.ignore_done (bool, default: False): Determine whether to ignore the done flag. Use ignore_done only in the halfcheetah env.
11. learn.target_theta (float, default: 0.005): Used for the soft update of the target network, aka. the interpolation factor in polyak averaging for target networks.
12. collect.noise_sigma (float, default: 0.1): Used to add noise during collection by controlling the sigma of the distribution. Sample noise from a distribution: Ornstein-Uhlenbeck process in the DDPG paper, Gaussian process in ours.
- _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any] ¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data for learn mode, as defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] ¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] ¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DDPG, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and logit, which is used for the hybrid action space.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] ¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In DDPG, a train sample is a processed transition (unroll_len=1).
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None ¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For DDPG, it contains the collect_model to balance the exploration and exploitation with the perturbed noise mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in
__init__
method ifcollect
field is inenable_field
.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None ¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For DDPG, it contains the eval model to greedily select action type with argmax q_value mechanism for hybrid action space. This method will be called in
__init__
method ifeval
field is inenable_field
.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None ¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DDPG, it mainly contains two optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None ¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): The dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] ¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DDPG, it contains obs, next_obs, action, reward, done.
- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For DDPG, it contains the action and the logit of the action (in hybrid action space).timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] ¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizers.
- Returns:
state_dict (
Dict[str, Any]
): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] ¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
SAC¶
Please refer to ding/policy/sac.py
for more details.
SACPolicy¶
- class ding.policy.SACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of continuous SAC algorithm. Paper link: https://arxiv.org/pdf/1801.01290.pdf
- Config:
(Each entry gives the config key, its type and default value, a description, and any additional notes from the original Other column.)
1. type (str, default: sac): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool, default: True): Whether to use cuda for the network.
3. on_policy (bool, default: False): SAC is an off-policy algorithm.
4. priority (bool, default: False): Whether to use priority sampling in the buffer.
5. priority_IS_weight (bool, default: False): Whether to use the Importance Sampling weight to correct the biased update.
6. random_collect_size (int, default: 10000): Number of randomly collected training samples in the replay buffer when training starts. Default is 10000 for SAC, 25000 for DDPG/TD3.
7. learn.learning_rate_q (float, default: 3e-4): Learning rate for the soft Q network. Default is 1e-3.
8. learn.learning_rate_policy (float, default: 3e-4): Learning rate for the policy network. Default is 1e-3.
9. learn.alpha (float, default: 0.2): Entropy regularization coefficient. alpha is the initialization for auto alpha when auto_alpha is True.
10. learn.auto_alpha (bool, default: False): Determine whether to use the auto temperature parameter alpha. The temperature parameter determines the relative importance of the entropy term against the reward.
11. learn.ignore_done (bool, default: False): Determine whether to ignore the done flag. Use ignore_done only in envs like Pendulum.
12. learn.target_theta (float, default: 0.005): Used for the soft update of the target network, aka. the interpolation factor in polyak averaging for target networks.
- _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data for learn mode, as defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
logit in SAC means the mu and sigma of the Gaussian distribution. Here we use this name for consistency.
Note
For more detailed examples, please refer to our unittest for SACPolicy:
ding.policy.tests.test_sac
.
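Since logit packs the Gaussian parameters, collect-mode sampling for continuous SAC is conventionally a squashed, reparameterized draw, roughly as below (textbook form, not necessarily the exact model wrapper used by this policy):

    import torch

    def sample_squashed_action(mu, sigma):
        # mu, sigma: (B, action_dim) Gaussian parameters predicted by the actor.
        dist = torch.distributions.Normal(mu, sigma)
        pre_tanh = dist.rsample()          # reparameterized sample keeps gradients
        return torch.tanh(pre_tanh)        # squash into [-1, 1]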
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
logit in SAC means the mu and sigma of the Gaussian distribution. Here we use this name for consistency.
Note
For more detailed examples, please refer to our unittest for SACPolicy:
ding.policy.tests.test_sac
.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for SACPolicy:
ding.policy.tests.test_sac
.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In continuous SAC, a train sample is a processed transition (unroll_len=1).
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For SAC, it contains the collect_model and other algorithm-specific arguments such as unroll_len. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For SAC, it contains the eval model, which is equipped with the base model wrapper to ensure compatibility. This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For SAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. Especially, the
auto_alpha
mechanism for balancing max entropy target is also initialized here. This method will be called in__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
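The auto_alpha mechanism keeps a learnable temperature. A textbook sketch of its initialization and update step (target_entropy, the learning rate and the variable names are placeholders, not the exact internals of this policy):

    import torch

    action_dim = 6
    target_entropy = -float(action_dim)                    # common heuristic for continuous SAC
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

    def update_alpha(log_prob):
        # log_prob: (B,) log-probabilities of freshly sampled actions.
        alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        return log_alpha.exp().item()                      # current alpha value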
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): The dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For continuous SAC, it contains obs, next_obs, action, reward, done. The logit will be also added when
collector_logit
is True.- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For continuous SAC, it contains the action and the logit (mu and sigma) of the action.timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizers.
- Returns:
state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
DiscreteSACPolicy¶
- class ding.policy.DiscreteSACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of discrete SAC algorithm. Paper link: https://arxiv.org/abs/1910.07207.
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
eps (float): The epsilon value for exploration.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.
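The eps argument drives the classic epsilon-greedy rule. The following standalone sketch (not DI-engine code) shows what that rule amounts to for a batch of discrete-action logits:

```python
import torch


def epsilon_greedy(logit: torch.Tensor, eps: float) -> torch.Tensor:
    # logit: (batch, action_dim) action preferences; with probability eps we
    # take a uniformly random action, otherwise the greedy argmax action.
    greedy_action = logit.argmax(dim=-1)
    random_action = torch.randint(0, logit.shape[-1], greedy_action.shape)
    use_random = torch.rand(greedy_action.shape) < eps
    return torch.where(use_random, random_action, greedy_action)


actions = epsilon_greedy(torch.randn(4, 6), eps=0.1)  # 4 envs, 6 discrete actions
```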
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, logit, reward, next_obs, done. Sometimes, it also contains other keys like weight.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In discrete SAC, a train sample is a processed transition (unroll_len=1).
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For SAC, it contains the collect_model to balance exploration and exploitation with the epsilon and multinomial sample mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For DiscreteSAC, it contains the eval model to greedily select the action type with the argmax q_value mechanism. This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DiscreteSAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved before.
Tip
If you want to only load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For discrete SAC, it contains obs, next_obs, logit, action, reward, done.
- Arguments:
obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For discrete SAC, it contains the action and the logit of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizers.
- Returns:
state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
SQILSACPolicy¶
- class ding.policy.SQILSACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of continuous SAC algorithm with SQIL extension. SAC paper link: https://arxiv.org/pdf/1801.01290.pdf; SQIL paper link: https://arxiv.org/abs/1905.11108
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
For SQIL + SAC, the input data is composed of two parts of the same size: agent data and expert data. Both of them are relabelled with a new reward according to the SQIL algorithm, as sketched below.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for SACPolicy: ding.policy.tests.test_sac.
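A hedged sketch of the relabelling step mentioned in the SQIL note above: expert transitions get a constant reward of 1 and agent transitions a reward of 0 before the usual SAC update. The batches and shapes are made up; this is not the library's actual implementation.

```python
import torch

# Hypothetical half-and-half training batch (64 agent + 64 expert transitions).
agent_batch = {'reward': torch.randn(64)}
expert_batch = {'reward': torch.randn(64)}

# SQIL relabelling: agent reward -> 0, expert (demonstration) reward -> 1.
agent_batch['reward'] = torch.zeros_like(agent_batch['reward'])
expert_batch['reward'] = torch.ones_like(expert_batch['reward'])
```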
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For SAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
R2D2¶
Please refer to ding/policy/r2d2.py
for more details.
R2D2Policy¶
- class ding.policy.R2D2Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of R2D2, from the paper Recurrent Experience Replay in Distributed Reinforcement Learning. R2D2 proposes that several tricks should be used to improve upon DRQN, namely some recurrent experience replay tricks and the burn-in mechanism for off-policy training.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | r2d2 | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True. | |
| 6 | discount_factor | float | 0.997, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
| 7 | nstep | int | 3, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 8 | burnin_step | int | 2 | The timestep of the burn-in operation, which is designed to mitigate the RNN hidden state difference caused by off-policy training | |
| 9 | learn.update_per_collect | int | 1 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | This arg can vary across envs. Bigger val means more off-policy |
| 10 | learn.batch_size | int | 64 | The number of samples of an iteration | |
| 11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration | |
| 12 | learn.value_rescale | bool | True | Whether to use the value_rescale function for the predicted value | |
| 13 | learn.target_update_freq | int | 100 | Frequency of target network update | Hard (assign) update |
| 14 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake termination envs |
| 15 | collect.n_sample | int | [8, 128] | The number of training samples of a call of collector | It varies across different envs |
| 16 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1 |
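A hedged sketch of how the defaults in the table above could be assembled into a config dict. Only keys documented in the table are listed; real experiment configs contain more fields, and collect.n_sample is chosen here arbitrarily from the documented [8, 128] range.

```python
from easydict import EasyDict

r2d2_cfg = EasyDict(dict(
    type='r2d2',
    cuda=False,
    on_policy=False,
    priority=False,
    priority_IS_weight=False,
    discount_factor=0.997,
    nstep=3,
    burnin_step=2,
    learn=dict(
        update_per_collect=1,
        batch_size=64,
        learning_rate=0.001,
        value_rescale=True,
        target_update_freq=100,
        ignore_done=False,
    ),
    collect=dict(
        n_sample=32,        # documented range: [8, 128], varies per env
        unroll_len=1,       # set > 1 when the RNN needs longer unrolls
    ),
))
```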
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
eps (float): The epsilon value for exploration.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (prev_state) for learn mode defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The RNN's hidden states are maintained in the policy, so we don't need to pass them in with the data; instead, we reset the hidden states with the _reset_collect method when an episode ends. Besides, the previous hidden states are necessary for training, so we need to return them in the _process_transition method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for R2D2Policy: ding.policy.tests.test_r2d2.
- _forward_learn(data: List[List[Dict[str, Any]]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data (trajectory for R2D2) from the replay buffer and then returns the output result, including various training information such as loss, q value, priority.
- Arguments:
data (List[List[Dict[str, Any]]]): The input data used for policy forward, including a batch of training samples. For each dict element, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the time and batch dimensions by the utility function self._data_preprocess_learn. For R2D2, each element in the list is a trajectory with the length of unroll_len, and each element in the trajectory list is a dict containing at least the following keys: obs, action, prev_state, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for R2D2Policy: ding.policy.tests.test_r2d2.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In R2D2, a train sample is a processed transition sequence of unroll_len length. This method is usually used in collectors to execute the necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples; each sample is a fixed-length trajectory, and each element in a sample is in a similar format to the input transitions, but may contain more data for training, such as the nstep reward and value_gamma factor.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For R2D2, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sample mechanism and to maintain the hidden state of the RNN. Besides, there are some initialization operations for other algorithm-specific arguments such as burnin_step, unroll_len and nstep. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to be initialized independently in different modes, such as gamma and nstep in R2D2. This design is for the convenience of parallel execution of different policy modes.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including some attributes and modules. For R2D2, it mainly contains the optimizer, algorithm-specific arguments such as burnin_step, value_rescale and gamma, and the main and target models. Because of the use of RNN, all the models should be wrapped with hidden_state, which needs to be initialized with the proper size. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved before.
Tip
If you want to only load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For R2D2, it contains obs, action, prev_state, reward, and done.
- Arguments:
obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network given the observation as input. For R2D2, it contains the action and the prev_state of the RNN.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- _reset_collect(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for collect mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in collecting in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in R2D2) specified by data_id.
- _reset_eval(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for eval mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in evaluation in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in R2D2) specified by data_id.
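A minimal, library-independent sketch of the bookkeeping these reset methods perform: one RNN hidden state is kept per environment id, and only the ids passed in data_id are cleared. The state shape is an illustrative assumption.

```python
import torch

# One hidden state per environment id (shape is made up).
hidden_states = {env_id: torch.zeros(1, 64) for env_id in range(4)}


def reset(data_id=None):
    ids = list(hidden_states) if data_id is None else data_id
    for env_id in ids:
        hidden_states[env_id] = torch.zeros(1, 64)


reset(data_id=[2])  # only env 2 finished its episode, so only its state is cleared
reset()             # reset every environment's hidden state
```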
- _reset_learn(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for learn mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different trajectories in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in R2D2) specified by data_id.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizer.
- Returns:
state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for R2D2, its registered name is drqn and the import_names is ding.model.template.q_learning.
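Given the note above, default_model for R2D2 is expected to return a tuple equivalent to the sketch below (shown here only to illustrate the (model name, import_names) convention):

```python
# Registered model name and where it can be imported from, per the note above.
model_info = ('drqn', ['ding.model.template.q_learning'])
model_name, import_names = model_info
```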
IMPALA¶
Please refer to ding/policy/impala.py
for more details.
IMPALAPolicy¶
- class ding.policy.IMPALAPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of IMPALA algorithm. Paper link: https://arxiv.org/abs/1802.01561.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | impala | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight | If True, priority must be True |
| 6 | unroll_len | int | 32 | Trajectory length to calculate v-trace target | |
| 7 | learn.update_per_collect | int | 4 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | This arg can vary across envs. Bigger val means more off-policy |
- _forward_collect(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Tip
If you want to add more tricks to this policy, like a temperature factor in multinomial sampling (see the sketch below), you can pass the related data as extra keyword arguments of this method.
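A small, standalone sketch of the kind of trick mentioned in the tip above: temperature-scaled multinomial sampling over action logits. The temperature value and shapes are illustrative.

```python
import torch


def sample_with_temperature(logit: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Lower temperature -> sharper distribution, higher -> closer to uniform.
    prob = torch.softmax(logit / temperature, dim=-1)
    return torch.multinomial(prob, num_samples=1).squeeze(-1)


actions = sample_with_temperature(torch.randn(4, 6), temperature=0.8)  # 4 envs, 6 actions
```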
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to the unittest for IMPALAPolicy: ding.policy.tests.test_impala.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs. _forward_eval in IMPALA often uses deterministic sampling to get actions, while _forward_collect usually uses stochastic sampling to balance exploration and exploitation.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to the unittest for IMPALAPolicy: ding.policy.tests.test_impala.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss and current learning rate.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For IMPALA, each element in the list is a dict containing at least the following keys: obs, action, logit, reward, next_obs, done. Sometimes, it also contains other keys such as weight.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to the unittest for IMPALAPolicy: ding.policy.tests.test_impala.
- _get_train_sample(data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training. In IMPALA, a train sample is a processed transition sequence of unroll_len length.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For IMPALA, it contains the collect_model to balance exploration and exploitation (e.g. the multinomial sample mechanism in discrete action spaces), and other algorithm-specific arguments such as unroll_len. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For IMPALA, it contains the eval model to select the optimal action (e.g. greedily select the action with the argmax mechanism in discrete action spaces). This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For IMPALA, it mainly contains the optimizer, algorithm-specific arguments such as loss weight and gamma, and the main (learn) model. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For IMPALA, it contains obs, next_obs, action, reward, done, logit.
- Arguments:
obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For IMPALA, it contains the action and the logit of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for IMPALA, its registered name is vac and the import_names is ding.model.template.vac.
QMIX¶
Please refer to ding/policy/qmix.py
for more details.
QMIXPolicy¶
- class ding.policy.QMIXPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of QMIX algorithm. QMIX is a multi-agent reinforcement learning algorithm; you can view the paper at the following link: https://arxiv.org/abs/1803.11485.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | qmix | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | True | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. | IS weight |
| 6 | learn.update_per_collect | int | 20 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | This arg can vary across envs. Bigger val means more off-policy |
| 7 | learn.target_update_theta | float | 0.001 | Target network update momentum parameter. | Between [0, 1] |
| 8 | learn.discount_factor | float | 0.99 | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
eps (float): The epsilon value for exploration.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (prev_state) for learn mode defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The RNN's hidden states are maintained in the policy, so we don't need to pass them in with the data; instead, we reset the hidden states with the _reset_collect method when an episode ends. Besides, the previous hidden states are necessary for training, so we need to return them in the _process_transition method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for QMIXPolicy: ding.policy.tests.test_qmix.
- _forward_learn(data: List[List[Dict[str, Any]]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data (trajectory for QMIX) from the replay buffer and then returns the output result, including various training information such as loss, q value, grad_norm.
- Arguments:
data (List[List[Dict[str, Any]]]): The input data used for policy forward, including a batch of training samples. For each dict element, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the time and batch dimensions by the utility function self._data_preprocess_learn. For QMIX, each element in the list is a trajectory with the length of unroll_len, and each element in the trajectory list is a dict containing at least the following keys: obs, action, prev_state, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for QMIXPolicy: ding.policy.tests.test_qmix.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In QMIX, a train sample is a processed transition sequence of unroll_len length. This method is usually used in collectors to execute the necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples; each sample is a fixed-length trajectory, and each element in a sample is in a similar format to the input transitions.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For QMIX, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sample mechanism and to maintain the hidden state of the RNN. Besides, there are some initialization operations for other algorithm-specific arguments such as burnin_step, unroll_len and nstep. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including some attributes and modules. For QMIX, it mainly contains the optimizer, algorithm-specific arguments such as gamma, and the main and target models. Because of the use of RNN, all the models should be wrapped with hidden_state, which needs to be initialized with the proper size. This method will be called in the __init__ method if the learn field is in enable_field.
Tip
For multi-agent algorithms, we often need to use agent_num to initialize some necessary variables.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- agent_num (int): Since this is a multi-agent algorithm, we need to input the agent num.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved before.
Tip
If you want to only load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For QMIX, it contains obs, next_obs, action, prev_state, reward, done.
- Arguments:
obs (torch.Tensor): The env observation of current timestep, usually including agent_obs and global_obs in multi-agent environments like MPE and SMAC.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For QMIX, it contains the action and the prev_state of the RNN.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- _reset_collect(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for collect mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in collecting in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in QMIX) specified by data_id.
- _reset_eval(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for eval mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in evaluation in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in QMIX) specified by data_id.
- _reset_learn(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for learn mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different trajectories in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in QMIX) specified by data_id.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizer.
- Returns:
state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default model setting for demonstration.
- Returns:
model_info (Tuple[str, List[str]]): The model name and the model's import_names.
Note
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For QMIX, it is ding.model.qmix.qmix.
CQL¶
Please refer to ding/policy/cql.py
for more details.
CQLPolicy¶
- class ding.policy.CQLPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of CQL algorithm for continuous control. Paper link: https://arxiv.org/abs/2006.04779.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | cql | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | True | Whether to use cuda for network | |
| 3 | random_collect_size | int | 10000 | Number of randomly collected training samples in replay buffer when training starts. | Default to 10000 for SAC, 25000 for DDPG/TD3. |
| 4 | model.policy_embedding_size | int | 256 | Linear layer size for policy network. | |
| 5 | model.soft_q_embedding_size | int | 256 | Linear layer size for soft q network. | |
| 6 | model.value_embedding_size | int | 256 | Linear layer size for value network. | Default to None when model.value_network is False. |
| 7 | learn.learning_rate_q | float | 3e-4 | Learning rate for soft q network. | Default to 1e-3 when model.value_network is True. |
| 8 | learn.learning_rate_policy | float | 3e-4 | Learning rate for policy network. | Default to 1e-3 when model.value_network is True. |
| 9 | learn.learning_rate_value | float | 3e-4 | Learning rate for value network. | Default to None when model.value_network is False. |
| 10 | learn.alpha | float | 0.2 | Entropy regularization coefficient. | alpha is the initialization for auto alpha, when auto_alpha is True |
| 11 | learn.reparameterization | bool | True | Determine whether to use the reparameterization trick. | |
| 12 | learn.auto_alpha | bool | False | Determine whether to use the auto temperature parameter alpha. | The temperature parameter determines the relative importance of the entropy term against the reward. |
| 13 | learn.ignore_done | bool | False | Determine whether to ignore the done flag. | Use ignore_done only in the halfcheetah env. |
| 14 | learn.target_theta | float | 0.005 | Used for soft update of the target network. | aka. interpolation factor in polyak averaging for target networks. |
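The target_theta entry above parameterizes a polyak (soft) target update. A minimal, library-independent sketch of that rule, target <- theta * online + (1 - theta) * target, using made-up networks:

```python
import torch
import torch.nn as nn

online = nn.Linear(4, 2)
target = nn.Linear(4, 2)
target.load_state_dict(online.state_dict())


def soft_update(target_net: nn.Module, online_net: nn.Module, theta: float = 0.005) -> None:
    # Move each target parameter a small step towards the online parameter.
    with torch.no_grad():
        for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
            t_param.mul_(1.0 - theta).add_(theta * o_param)


soft_update(target, online, theta=0.005)
```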
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For CQL, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For CQL, it mainly contains three optimizers, algorithm-specific arguments such as gamma, min_q_weight, with_lagrange and with_q_entropy, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
DiscreteCQLPolicy¶
- class ding.policy.DiscreteCQLPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of discrete CQL algorithm in discrete action space environments. Paper link: https://arxiv.org/abs/2006.04779.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DiscreteCQL, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys like weight and value_gamma for nstep return computation.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DiscreteCQL, it mainly contains the optimizer, algorithm-specific arguments such as gamma, nstep and min_q_weight, and the main and target models. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
DecisionTransformer¶
Please refer to ding/policy/dt.py
for more details.
DTPolicy¶
- class ding.policy.DTPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of Decision Transformer algorithm in discrete environments. Paper link: https://arxiv.org/abs/2106.01345.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance, such as by interacting with envs). Forward means that the policy gets some input data (current obs/return-to-go and historical information) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs and reward to calculate the running return-to-go. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
Decision Transformer will do different operations for different types of envs in evaluation.
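- Example:
A sketch of a single eval-mode call, assuming a vectorized env with ids 0 and 1 and a 4-dim observation; the per-env payload being a dict with 'obs' and 'reward' is an assumption based on the description above, and the policy construction is abbreviated in the document's own placeholder style.
>>> import torch
>>> policy = DTPolicy(...)
>>> data = {i: {'obs': torch.randn(4), 'reward': torch.zeros(1)} for i in range(2)}
>>> output = policy.eval_mode.forward(data)  # dispatches to _forward_eval
>>> action = output[0]['action']  # action for env id 0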
- _forward_learn(data: List[Tensor]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, current learning rate.
- Arguments:
data (
List[torch.Tensor]
): The input data used for policy forward, including a series of processed torch.Tensor data, i.e., timesteps, states, actions, returns_to_go, traj_mask.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn
method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
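- Example:
A hedged sketch of the documented tensor list (timesteps, states, actions, returns_to_go, traj_mask); the batch size, context length, state dimension, action count and exact tensor shapes are assumptions for illustration only.
>>> import torch
>>> policy = DTPolicy(...)
>>> B, T, obs_dim = 8, 20, 17
>>> timesteps = torch.arange(T).unsqueeze(0).repeat(B, 1)
>>> states = torch.randn(B, T, obs_dim)
>>> actions = torch.randint(0, 4, (B, T))  # discrete action indices
>>> returns_to_go = torch.randn(B, T, 1)
>>> traj_mask = torch.ones(B, T, dtype=torch.long)  # 1 marks valid steps
>>> info = policy.learn_mode.forward([timesteps, states, actions, returns_to_go, traj_mask])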
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For Decision Transformer, it contains the eval model and some algorithm-specific parameters such as context_len, max_eval_ep_len, etc. This method will be called in the __init__ method if the eval field is in enable_field.
Tip
For the evaluation of complete episodes, we need to maintain some historical information for transformer inference. These variables need to be initialized in _init_eval and reset in _reset_eval when necessary.
Note
If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflict with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For Decision Transformer, it mainly contains the optimizer, algorithm-specific arguments such as rtg_scale, and the lr scheduler. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _reset_eval(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for eval mode when necessary, such as the historical info of the transformer for Decision Transformer. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in data_id will have different histories during evaluation.
- Arguments:
data_id (
Optional[List[int]]
): The id of the data, which is used to reset the stateful variables specified by data_id.
PDQN¶
Please refer to ding/policy/pdqn.py
for more details.
PDQNPolicy¶
- class ding.policy.PDQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of PDQN algorithm, which extends the DQN algorithm on discrete-continuous hybrid action spaces. Paper link: https://arxiv.org/abs/1810.06394.
- Config:
ID | Symbol | Type | Default Value | Description | Other (Shape)
1 | type | str | pdqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | This value is always False for PDQN
4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True. |
6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward’s future discount factor, aka. gamma | May be 1 when sparse reward env
7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation |
8 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after collector’s one collection. Only valid in serial training | This arg can vary from env to env. A bigger value means more off-policy
9 | learn.batch_size | int | 64 | The number of samples of an iteration |
11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration. |
12 | learn.target_update_freq | int | 100 | Frequency of target network update. | Hard (assign) update
13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation. | Enable it for some fake termination envs
14 | collect.n_sample | int | [8, 128] | The number of training samples of a call of collector. | It varies from env to env
15 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1
16 | collect.noise_sigma | float | 0.1 | Add noise to continuous args during collection |
17 | other.eps.type | str | exp | Exploration rate decay type | Support ['exp', 'linear'].
18 | other.eps.start | float | 0.95 | Start value of exploration rate | [0, 1]
19 | other.eps.end | float | 0.05 | End value of exploration rate | [0, 1]
20 | other.eps.decay | int | 10000 | Decay length of exploration | Greater than 0. Setting decay=10000 means the exploration rate decays from the start value to the end value over the decay length.
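- Example:
A configuration sketch assembled only from the keys and default values in the table above; the exact nesting under learn/collect/other follows DI-engine's usual layout and is an assumption rather than a copy of pdqn.py.
>>> from easydict import EasyDict
>>> pdqn_config = EasyDict(dict(
...     type='pdqn', cuda=False, on_policy=False, priority=False, priority_IS_weight=False,
...     discount_factor=0.97, nstep=1,
...     learn=dict(update_per_collect=3, batch_size=64, learning_rate=0.001, target_update_freq=100, ignore_done=False),
...     collect=dict(n_sample=8, unroll_len=1, noise_sigma=0.1),
...     other=dict(eps=dict(type='exp', start=0.95, end=0.05, decay=10000)),
... ))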
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs
eps
argument for exploration, i.e., classic epsilon-greedy exploration strategy.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
eps (
float
): The epsilon value for exploration.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition
method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PDQNPolicy:
ding.policy.tests.test_pdqn
.
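- Example:
A sketch of a collect-mode call with epsilon-greedy exploration; the env ids, the 11-dim observation and the fixed eps value are hypothetical, and the policy construction is abbreviated in the document's own placeholder style.
>>> import torch
>>> policy = PDQNPolicy(...)
>>> data = {0: torch.randn(11), 1: torch.randn(11)}
>>> output = policy.collect_mode.forward(data, eps=0.5)  # dispatches to _forward_collect
>>> output[0]['action']  # hybrid action: discrete type plus its continuous args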
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PDQNPolicy:
ding.policy.tests.test_pdqn
.
- _forward_learn(data: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, target_q_value, priority.
- Arguments:
data (
List[Dict[int, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by utility functions such as default_preprocess_learn. For PDQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn
method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PDQNPolicy:
ding.policy.tests.test_pdqn
.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In PDQN, a train sample is a processed transition. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element has the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the nstep reward and target obs.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For PDQN, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sample mechanism and continuous action mechanism; besides, other algorithm-specific arguments such as unroll_len and nstep are also initialized here. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflict with other modes, such as self._collect_attr1.
Tip
Some variables need to be initialized independently in different modes, such as gamma and nstep in PDQN. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For PDQN, it contains the eval model to greedily select action with the argmax q_value mechanism. This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflict with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For PDQN, it mainly contains two optimizers, algorithm-specific arguments such as nstep and gamma, and the main and target model. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): the dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PDQN, it contains obs, next_obs, action, reward, done and logit.
- Arguments:
obs (
torch.Tensor
): The env observation of the current timestep, such as a stacked 2D image in Atari.
policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For PDQN, it contains the hybrid action and the logit (discrete part q_value) of the action.
timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except that all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
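- Example:
A sketch of the transition layout described above; the key names follow the Overview (obs, next_obs, action, logit, reward, done), and timestep is assumed to be the env's step-result namedtuple whose obs field holds the next observation.
>>> def pack_transition(obs, policy_output, timestep):
...     return {
...         'obs': obs,
...         'next_obs': timestep.obs,
...         'action': policy_output['action'],  # hybrid action: discrete type + continuous args
...         'logit': policy_output['logit'],    # q_value of the discrete part
...         'reward': timestep.reward,
...         'done': timestep.done,
...     }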
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target model, discrete part optimizer, and continuous part optimizer.
- Returns:
state_dict (
Dict[str, Any]
): the dict of current policy learn state, for saving and restoring.
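- Example:
A checkpoint round-trip sketch using the two methods documented above, assuming the learn_mode interface exposes the usual state_dict/load_state_dict wrappers; the file path is arbitrary and the policy construction is abbreviated.
>>> import torch
>>> policy = PDQNPolicy(...)
>>> state = policy.learn_mode.state_dict()  # calls _state_dict_learn
>>> torch.save(state, './pdqn_ckpt.pth.tar')
>>> policy.learn_mode.load_state_dict(torch.load('./pdqn_ckpt.pth.tar'))  # calls _load_state_dict_learn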
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return the default neural network model setting of this algorithm for demonstration. The
__init__
method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
Note
The user can define and use a customized network model but must obey the same interface definition indicated by the import_names path. For example, for PDQN, its registered name is pdqn and the import_names is ding.model.template.pdqn.
MDQN¶
Please refer to ding/policy/mdqn.py
for more details.
MDQNPolicy¶
- class ding.policy.MDQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of Munchausen DQN algorithm, extended by auxiliary objectives. Paper link: https://arxiv.org/abs/2007.14430.
- Config:
ID | Symbol | Type | Default Value | Description | Other (Shape)
1 | type | str | mdqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True. |
6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward’s future discount factor, aka. gamma | May be 1 when sparse reward env
7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation |
8 | learn.update_per_collect | int | 1 | How many updates (iterations) to train after collector’s one collection. Only valid in serial training | This arg can vary from env to env. A bigger value means more off-policy
10 | learn.batch_size | int | 32 | The number of samples of an iteration |
11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration. |
12 | learn.target_update_freq | int | 2000 | Frequency of target network update. | Hard (assign) update
13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation. | Enable it for some fake termination envs
14 | collect.n_sample | int | 4 | The number of training samples of a call of collector. | It varies from env to env
15 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1
16 | other.eps.type | str | exp | Exploration rate decay type | Support ['exp', 'linear'].
17 | other.eps.start | float | 0.01 | Start value of exploration rate | [0, 1]
18 | other.eps.end | float | 0.001 | End value of exploration rate | [0, 1]
19 | other.eps.decay | int | 250000 | Decay length of exploration | Greater than 0. Setting decay=250000 means the exploration rate decays from the start value to the end value over the decay length.
20 | entropy_tau | float | 0.003 | The ratio of the entropy term in the TD loss |
21 | alpha | float | 0.9 | The ratio of the Munchausen term in the TD loss |
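- Example:
A brief configuration sketch of the MDQN-specific entries in the table above (entropy_tau and alpha), combined with a few common fields; the top-level placement of these keys is an assumption rather than a copy of mdqn.py.
>>> from easydict import EasyDict
>>> mdqn_config = EasyDict(dict(
...     type='mdqn', discount_factor=0.97, nstep=1,
...     entropy_tau=0.003,  # weight of the entropy term in the TD loss
...     alpha=0.9,          # weight of the Munchausen term in the TD loss
...     learn=dict(batch_size=32, learning_rate=0.001, target_update_freq=2000),
... ))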
- _forward_learn(data: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action_gap, clip_frac, priority.
- Arguments:
data (
List[Dict[int, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by utility functions such as default_preprocess_learn. For MDQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn
method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for MDQNPolicy:
ding.policy.tests.test_mdqn
.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For MDQN, it contains the optimizer, algorithm-specific arguments such as entropy_tau, m_alpha and nstep, and the main and target model. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
Policy Factory¶
Please refer to ding/policy/policy_factory.py
for more details.
PolicyFactory¶
- class ding.policy.PolicyFactory[source]¶
- Overview:
Policy factory class, used to generate different policies for general purposes, such as the random action policy, which is used for initial sample collection for better exploration when
random_collect_size
> 0.
- Interfaces:
get_random_policy
- static get_random_policy(policy: Policy.collect_mode, action_space: gym.spaces.Space = None, forward_fn: Callable = None) Policy.collect_mode [source]¶
- Overview:
According to the given action space, define the forward function of the random policy, then pack it with other interfaces of the given policy, and return the final collect mode interfaces of policy.
- Arguments:
policy (
Policy.collect_mode
): The collect mode interfaces of the policy.
action_space (
gym.spaces.Space
): The action space of the environment, gym-style.
forward_fn (
Callable
): If the action space is too complex, you can define your own forward function and pass it to this function. Note that you should set action_space to None in this case.
- Returns:
random_policy (
Policy.collect_mode
): The collect mode interfaces of the random policy.
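- Example:
A usage sketch for initial random data collection; the Discrete(4) action space is an arbitrary example and policy is assumed to be an already-constructed DI-engine policy.
>>> import gym
>>> from ding.policy import PolicyFactory
>>> random_collect_policy = PolicyFactory.get_random_policy(policy.collect_mode, action_space=gym.spaces.Discrete(4))
>>> # random_collect_policy keeps the original transition processing but samples actions randomly from the given space.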
get_random_policy¶
- ding.policy.get_random_policy(cfg: EasyDict, policy: Policy.collect_mode, env: BaseEnvManager) Policy.collect_mode [source]¶
- Overview:
The entry function to get the corresponding random policy. If a policy needs special data items in a transition, it returns the policy itself; otherwise, we will use
PolicyFactory
to return a general random policy.
- Arguments:
cfg (
EasyDict
): The EasyDict-type dict configuration.
policy (
Policy.collect_mode
): The collect mode interfaces of the policy.
env (
BaseEnvManager
): The env manager instance, which is used to get the action space for random action generation.
- Returns:
random_policy (
Policy.collect_mode
): The collect mode interfaces of the random policy.
Common Utilities¶
Please refer to ding/policy/common_utils.py
for more details.
default_preprocess_learn¶
- ding.policy.default_preprocess_learn(data: List[Any], use_priority_IS_weight: bool = False, use_priority: bool = False, use_nstep: bool = False, ignore_done: bool = False) Dict[str, Tensor] [source]¶
- Overview:
Default data pre-processing in policy’s
_forward_learn
method, including stacking batch data and preprocessing ignore_done, nstep reward and priority IS weight.
- Arguments:
data (
List[Any]
): The list of training batch samples; each sample is a dict of PyTorch Tensors.
use_priority_IS_weight (
bool
): Whether to use priority IS weight correction; if True, this function will set the weight of each sample to the priority IS weight.
use_priority (
bool
): Whether to use priority; if True, this function will set the priority IS weight.
use_nstep (
bool
): Whether to use the nstep TD error; if True, this function will reshape the reward.
ignore_done (
bool
): Whether to ignore done, if True, this function will set the done to 0.
- Returns:
data (
Dict[str, torch.Tensor]
): The preprocessed dict data whose values can be directly used for the following model forward and loss computation.
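- Example:
A minimal sketch showing the documented effect of ignore_done; the 4-dim observation and the batch of 3 transitions are hypothetical.
>>> import torch
>>> from ding.policy import default_preprocess_learn
>>> sample = {'obs': torch.randn(4), 'action': torch.tensor(0), 'reward': torch.tensor(1.0), 'next_obs': torch.randn(4), 'done': torch.tensor(True)}
>>> out = default_preprocess_learn([sample for _ in range(3)], ignore_done=True)
>>> out['done']  # all zeros: termination is ignored for the target value computation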
single_env_forward_wrapper¶
- ding.policy.single_env_forward_wrapper(forward_fn: Callable) Callable [source]¶
- Overview:
Wrap policy to support gym-style interaction between policy and single environment.
- Arguments:
forward_fn (
Callable
): The original forward function of policy.
- Returns:
wrapped_forward_fn (
Callable
): The wrapped forward function of policy.
- Examples:
>>> env = gym.make('CartPole-v0')
>>> policy = DQNPolicy(...)
>>> forward_fn = single_env_forward_wrapper(policy.eval_mode.forward)
>>> obs = env.reset()
>>> action = forward_fn(obs)
>>> next_obs, rew, done, info = env.step(action)
single_env_forward_wrapper_ttorch¶
- ding.policy.single_env_forward_wrapper_ttorch(forward_fn: Callable, cuda: bool = True) Callable [source]¶
- Overview:
Wrap policy to support gym-style interaction between policy and single environment for treetensor (ttorch) data.
- Arguments:
forward_fn (
Callable
): The original forward function of policy.cuda (
bool
): Whether to use cuda in policy, if True, this function will move the input data to cuda.
- Returns:
wrapped_forward_fn (
Callable
): The wrapped forward function of policy.
- Examples:
>>> env = gym.make('CartPole-v0')
>>> policy = PPOFPolicy(...)
>>> forward_fn = single_env_forward_wrapper_ttorch(policy.eval)
>>> obs = env.reset()
>>> action = forward_fn(obs)
>>> next_obs, rew, done, info = env.step(action)