ding.policy¶
Base Policy¶
Please refer to ding/policy/base_policy.py
for more details.
Policy¶
- class ding.policy.Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
The basic class of Reinforcement Learning (RL) and Imitation Learning (IL) policy in DI-engine.
- Property:
cfg, learn_mode, collect_mode, eval_mode
- __init__(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None) None [source]¶
- Overview:
Initialize the policy instance according to the input configuration and model. This method will initialize different fields in the policy, including learn, collect and eval. The learn field is used to train the policy, the collect field is used to collect data for training, and the eval field is used to evaluate the policy. The enable_field argument is used to specify which fields to initialize; if it is None, all fields will be initialized.
- Arguments:
cfg (EasyDict): The final merged config used to initialize policy. For the default config, see the config attribute and its comments in the policy class.
model (torch.nn.Module): The neural network model used to initialize policy. If it is None, the model will be created according to the default_model method and the cfg.model field. Otherwise, the model will be set to the model instance created by the outside caller.
enable_field (Optional[List[str]]): The list of fields to initialize. If it is None, all fields will be initialized. Otherwise, only the fields in enable_field will be initialized, which is beneficial to save resources.
Note
A derived policy class should implement the _init_learn, _init_collect and _init_eval methods to initialize the corresponding fields.
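For illustration, a minimal sketch of how enable_field restricts initialization, assuming a discrete-action setup where cfg.model only needs obs_shape and action_shape (the concrete config contents are an assumption, not the full required config):
from easydict import EasyDict
from ding.policy import DQNPolicy

cfg = EasyDict(DQNPolicy.default_config())
cfg.model = EasyDict(obs_shape=4, action_shape=2)

# Only the collect field is initialized, so _init_collect is called while
# _init_learn and _init_eval are skipped, saving resources on a pure collector worker.
collector_policy = DQNPolicy(cfg, enable_field=['collect'])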
- __repr__() str [source]¶
- Overview:
Get the string representation of the policy.
- Returns:
repr (str): The string representation of the policy.
- _create_model(cfg: EasyDict, model: Module | None = None) Module [source]¶
- Overview:
Create or validate the neural network model according to the input configuration and model. If the input model is None, the model will be created according to the default_model method and the cfg.model field. Otherwise, the model will be verified as an instance of torch.nn.Module and set to the model instance created by the outside caller.
- Arguments:
cfg (EasyDict): The final merged config used to initialize policy.
model (torch.nn.Module): The neural network model used to initialize policy. Users can refer to the default model defined in the corresponding policy to customize their own model.
- Returns:
model (torch.nn.Module): The created neural network model. The different modes of policy will add distinct wrappers and plugins to the model, which are used to train, collect and evaluate.
- Raises:
RuntimeError: If the input model is not None and is not an instance of torch.nn.Module.
- abstract _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs, or the action logits to calculate the loss in learn mode. This method is left to be implemented by the subclass, and more arguments can be added in the kwargs part if necessary.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in the self._process_transition method. The keys of the dict are the same as the input data, i.e. environment ids.
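As a rough sketch of this input/output contract (not taken from any concrete DI-engine policy; the self._collect_model attribute and its output keys are assumptions):
import torch
from typing import Any, Dict

def _forward_collect(self, data: Dict[int, Any], **kwargs) -> Dict[int, Any]:
    env_ids = list(data.keys())
    obs = torch.stack([data[i] for i in env_ids], dim=0)  # batch the per-env observations
    with torch.no_grad():
        output = self._collect_model.forward(obs)  # assumed to return e.g. {'action': ..., 'logit': ...}
    # Split the batched output back into a per-env dict keyed by environment id.
    return {i: {k: v[idx] for k, v in output.items()} for idx, i in enumerate(env_ids)}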
- abstract _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance, such as interacting with envs or computing metrics on a validation dataset). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. This method is left to be implemented by the subclass.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The keys of the dict are the same as the input data, i.e. environment ids.
- abstract _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss value, policy entropy, q value, priority, and so on. This method is left to be implemented by the subclass, and more items can be added to data if necessary.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, in the _forward_learn method, data should be stacked in the batch dimension by some utility functions such as default_preprocess_learn.
- Returns:
output (Dict[str, Any]): The training information of policy forward, including some metrics for monitoring training such as loss, priority, q value, policy entropy, and some data for the next training step such as priority. Note that the output data items should be Python native scalars rather than PyTorch tensors, which is convenient for the outside to use.
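A schematic sketch of this return convention; the batching, the _compute_loss helper, and the self._optimizer attribute are placeholders/assumptions, not DI-engine APIs:
import torch

def _forward_learn(self, data):
    # Stack the list of sample dicts into batched tensors (a minimal stand-in for
    # utilities such as default_preprocess_learn mentioned above).
    batch = {k: torch.stack([d[k] for d in data]) for k in data[0]}
    loss = self._compute_loss(batch)  # placeholder for the algorithm-specific loss
    self._optimizer.zero_grad()
    loss.backward()
    self._optimizer.step()
    # Return Python scalars (e.g. via .item()), not tensors, so loggers can consume them directly.
    return {'total_loss': loss.item(), 'cur_lr': self._optimizer.defaults['lr']}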
- _get_attribute(name: str) Any [source]¶
- Overview:
In order to control access to the policy attributes, we expose different modes to the outside rather than the policy instance itself, and we also provide this method to get an attribute of the policy in different modes.
- Arguments:
name (str): The name of the attribute.
- Returns:
value (Any): The value of the attribute.
Note
DI-engine's policy will first try to access the _get_{name} method, and then try to access the _{name} attribute. If neither of them is found, it will raise a NotImplementedError.
- abstract _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions) data, process it into a list of samples that can be used for training directly. A train sample can be a processed transition (DQN with nstep TD) or some multi-timestep transitions (DRQN). This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples, each element is in a similar format to the input transitions, but may contain more data for training, such as nstep reward, advantage, etc.
Note
We will vectorize the process_transition and get_train_sample methods in a following release version, and users can customize this data processing procedure by overriding these two methods and the collector itself.
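As a hedged sketch of what such preprocessing can look like, the following hypothetical n-step variant replaces each reward with the discounted sum of the next nstep rewards (field names follow the transition format described above; this is not DI-engine's actual implementation):
def _get_train_sample(self, transitions, nstep=3, gamma=0.99):
    samples = []
    for t, transition in enumerate(transitions):
        sample = dict(transition)
        # Discounted sum of the next `nstep` rewards for n-step TD targets.
        rewards = [tr['reward'] for tr in transitions[t:t + nstep]]
        sample['reward'] = sum((gamma ** i) * r for i, r in enumerate(rewards))
        # Bootstrap from the observation reached after the n-step window.
        sample['next_obs'] = transitions[min(t + nstep, len(transitions)) - 1]['next_obs']
        samples.append(sample)
    return samples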
- abstract _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. This method will be called in the __init__ method if the collect field is in enable_field. Almost every policy has its own collect mode, so this method must be overridden in the subclass.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_collect and _load_state_dict_collect methods.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- abstract _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. This method will be called in the __init__ method if the eval field is in enable_field. Almost every policy has its own eval mode, so this method must be overridden in the subclass.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_eval and _load_state_dict_eval methods.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- abstract _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. This method will be called in the __init__ method if the learn field is in enable_field. Almost every policy has its own learn mode, so this method must be overridden in the subclass.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _init_multi_gpu_setting(model: Module, bp_update_sync: bool) None [source]¶
- Overview:
Initialize multi-gpu data parallel training setting, including broadcast model parameters at the beginning of the training, and prepare the hook function to allreduce the gradients of model parameters.
- Arguments:
model (torch.nn.Module): The neural network model to be trained.
bp_update_sync (bool): Whether to synchronously update the model parameters after allreducing their gradients. Asynchronous update can be parallelized across different network layers like a pipeline, which can save time.
- _load_state_dict_collect(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy collect mode, such as load pretrained state_dict, auto-recover checkpoint, or model replica from learner in distributed training scenarios.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy collect state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _load_state_dict_eval(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy eval mode, such as load auto-recover checkpoint, or model replica from learner in distributed training scenarios.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy eval state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
Tip
The default implementation is ['cur_lr', 'total_loss']. Other derived classes can overwrite this method to add their own keys if necessary.
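For example, a derived policy could extend the logged keys like this (the extra key names are illustrative and must match the keys returned by its own _forward_learn):
def _monitor_vars_learn(self):
    # Keep the base keys and append algorithm-specific metrics.
    return super()._monitor_vars_learn() + ['q_value', 'target_q_value', 'priority']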
- abstract _process_transition(obs: Tensor | Dict[str, Tensor], policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, such as <s, a, r, s’, done>. Some policies need to do some special process and pack its own necessary attributes (e.g. hidden state and logit), so this method is left to be implemented by the subclass.
- Arguments:
obs (Union[torch.Tensor, Dict[str, torch.Tensor]]): The observation of the current timestep.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. Usually, it contains the action and the logit of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except that all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
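A minimal DQN-style sketch of such packing, assuming timestep carries the next obs, reward and done fields as described above:
def _process_transition(self, obs, policy_output, timestep):
    # Pack <s, a, r, s', done> plus any extra policy outputs (e.g. logit) the learner needs.
    return {
        'obs': obs,
        'action': policy_output['action'],
        'logit': policy_output['logit'],
        'next_obs': timestep.obs,
        'reward': timestep.reward,
        'done': timestep.done,
    }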
- _reset_collect(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for collect mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means resetting all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different environments/episodes in collecting listed in data_id will have different RNN hidden states.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.
Note
This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.
- _reset_eval(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for eval mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means resetting all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different environments/episodes in evaluation listed in data_id will have different RNN hidden states.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.
Note
This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.
- _reset_learn(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for learn mode when necessary, such as the hidden state of an RNN or the memory bank of some special algorithms. If data_id is None, it means resetting all the stateful variables. Otherwise, it will reset the stateful variables according to data_id. For example, different trajectories listed in data_id will have different RNN hidden states.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables specified by data_id.
Note
This method is not mandatory to be implemented. The sub-class can overwrite this method if necessary.
- _set_attribute(name: str, value: Any) None [source]¶
- Overview:
In order to control access to the policy attributes, we expose different modes to the outside rather than the policy instance itself, and we also provide this method to set an attribute of the policy in different modes. The new attribute will be named _{name}.
- Arguments:
name (str): The name of the attribute.
value (Any): The value of the attribute.
- _state_dict_collect() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of collect mode, usually only including the model, which is necessary for distributed training scenarios to auto-recover collectors.
- Returns:
state_dict (Dict[str, Any]): The dict of the current policy collect state, for saving and restoring.
Tip
Not all scenarios need to auto-recover collectors; sometimes, we can directly shut down the crashed collector and start a new one.
- _state_dict_eval() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of eval mode, usually only including the model, which is necessary for distributed training scenarios to auto-recover evaluators.
- Returns:
state_dict (Dict[str, Any]): The dict of the current policy eval state, for saving and restoring.
Tip
Not all scenarios need to auto-recover evaluators; sometimes, we can directly shut down the crashed evaluator and start a new one.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model and optimizer.
- Returns:
state_dict (Dict[str, Any]): The dict of the current policy learn state, for saving and restoring.
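A typical implementation pairs this method with _load_state_dict_learn; a hedged sketch, where the attribute names self._learn_model and self._optimizer are assumptions:
def _state_dict_learn(self):
    return {
        'model': self._learn_model.state_dict(),
        'optimizer': self._optimizer.state_dict(),
    }

def _load_state_dict_learn(self, state_dict):
    self._learn_model.load_state_dict(state_dict['model'])
    self._optimizer.load_state_dict(state_dict['optimizer'])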
- property collect_mode: collect_function¶
- Overview:
Return the interfaces of collect mode of policy, which are used to collect training data by interacting with envs. Here we use a namedtuple to define immutable interfaces and restrict the usage of policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own collect mode.
- Returns:
interfaces (Policy.collect_function): The interfaces of collect mode of policy; it is a namedtuple whose values of distinct fields are different internal methods.
- Examples:
>>> policy = Policy(cfg, model)
>>> policy_collect = policy.collect_mode
>>> obs = env_manager.ready_obs
>>> inference_output = policy_collect.forward(obs)
>>> next_obs, rew, done, info = env_manager.step(inference_output.action)
- classmethod default_config() EasyDict [source]¶
- Overview:
Get the default config of policy. This method is used to create the default config of policy.
- Returns:
cfg (EasyDict): The default config of the corresponding policy. For a derived policy class, it will recursively merge the default config of the base class and its own default config.
Tip
This method will deepcopy the config attribute of the class and return the result, so users don't need to worry about modifying the returned config.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
Users can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for DQN, its registered name is dqn and the import_names is ding.model.template.q_learning.DQN.
- property eval_mode: eval_function¶
- Overview:
Return the interfaces of eval mode of policy, which are used to evaluate the policy performance. Here we use a namedtuple to define immutable interfaces and restrict the usage of policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own eval mode.
- Returns:
interfaces (Policy.eval_function): The interfaces of eval mode of policy; it is a namedtuple whose values of distinct fields are different internal methods.
- Examples:
>>> policy = Policy(cfg, model)
>>> policy_eval = policy.eval_mode
>>> obs = env_manager.ready_obs
>>> inference_output = policy_eval.forward(obs)
>>> next_obs, rew, done, info = env_manager.step(inference_output.action)
- property learn_mode: learn_function¶
- Overview:
Return the interfaces of learn mode of policy, which is used to train the model. Here we use namedtuple to define immutable interfaces and restrict the usage of policy in different modes. Moreover, derived subclass can override the interfaces to customize its own learn mode.
- Returns:
interfaces (Policy.learn_function): The interfaces of learn mode of policy; it is a namedtuple whose values of distinct fields are different internal methods.
- Examples:
>>> policy = Policy(cfg, model)
>>> policy_learn = policy.learn_mode
>>> train_output = policy_learn.forward(data)
>>> state_dict = policy_learn.state_dict()
- sync_gradients(model: Module) None [source]¶
- Overview:
Synchronize (allreduce) gradients of model parameters in data-parallel multi-gpu training.
- Arguments:
model (torch.nn.Module): The model whose gradients are to be synchronized.
Note
This method is only used in multi-gpu training, and it should be called after the backward method and before the step method. Users can also use the bp_update_sync config to control whether to synchronize gradient allreduce and optimizer updates.
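In a data-parallel training step, the call order described in the note above would look roughly like this (a sketch; policy_loss, batch, model and optimizer are placeholders for the user's own objects):
loss = policy_loss(batch)      # placeholder for the algorithm-specific loss
optimizer.zero_grad()
loss.backward()
policy.sync_gradients(model)   # allreduce gradients across GPUs before the parameter update
optimizer.step()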
CommandModePolicy¶
- class ding.policy.CommandModePolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy with command mode, which can be used in the old version of the DI-engine pipeline: serial_pipeline. CommandModePolicy uses the _get_setting_learn, _get_setting_collect and _get_setting_eval methods to exchange information between different workers.
- Interface:
_init_command, _get_setting_learn, _get_setting_collect, _get_setting_eval
- Property:
command_mode
- abstract _get_setting_collect(command_info: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of collect mode, which contains dynamically changed hyperparameters for collect mode, such as eps, temperature, etc.
- Arguments:
command_info (Dict[str, Any]): The global training information, which is defined in commander.
- Returns:
setting (Dict[str, Any]): The latest setting of collect mode, which is usually used as extra arguments of the policy._forward_collect method.
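For instance, an epsilon-greedy policy could map the collected env step to an exploration rate; a sketch where the decay schedule and the 'envstep' key in command_info are assumptions:
def _get_setting_collect(self, command_info):
    # Linearly decay eps from 0.95 to 0.1 over the first 10000 collected env steps (illustrative).
    step = command_info.get('envstep', 0)
    eps = max(0.1, 0.95 - (0.95 - 0.1) * step / 10000)
    # The returned dict is passed as extra keyword arguments to policy._forward_collect.
    return {'eps': eps}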
- abstract _get_setting_eval(command_info: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of eval mode, which contains dynamically changed hyperparameters for eval mode, such as temperature, etc.
- Arguments:
command_info (Dict[str, Any]): The global training information, which is defined in commander.
- Returns:
setting (Dict[str, Any]): The latest setting of eval mode, which is usually used as extra arguments of the policy._forward_eval method.
- abstract _get_setting_learn(command_info: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
According to command_info, i.e., global training information (e.g. training iteration, collected env step, evaluation results, etc.), return the setting of learn mode, which contains dynamically changed hyperparameters for learn mode, such as batch_size, learning_rate, etc.
- Arguments:
command_info (Dict[str, Any]): The global training information, which is defined in commander.
- Returns:
setting (Dict[str, Any]): The latest setting of learn mode, which is usually used as extra arguments of the policy._forward_learn method.
- abstract _init_command() None [source]¶
- Overview:
Initialize the command mode of policy, including related attributes and modules. This method will be called in the __init__ method if the command field is in enable_field. Almost every policy has its own command mode, so this method must be overridden in the subclass.
Note
If you want to set some special member variables in the _init_command method, you'd better name them with the prefix _command_ to avoid conflicts with other modes, such as self._command_attr1.
- property command_mode: Policy.command_function¶
- Overview:
Return the interfaces of command mode of policy. Here we use a namedtuple to define immutable interfaces and restrict the usage of policy in different modes. Moreover, a derived subclass can override the interfaces to customize its own command mode.
- Returns:
interfaces (Policy.command_function): The interfaces of command mode; it is a namedtuple whose values of distinct fields are different internal methods.
- Examples:
>>> policy = CommandModePolicy(cfg, model)
>>> policy_command = policy.command_mode
>>> settings = policy_command.get_setting_learn(command_info)
create_policy¶
- ding.policy.create_policy(cfg: EasyDict, **kwargs) Policy [source]¶
- Overview:
Create a policy instance according to cfg and other kwargs.
- Arguments:
cfg (EasyDict): Final merged policy config.
- ArgumentsKeys:
type (str): Policy type set in the POLICY_REGISTRY.register method, such as dqn.
import_names (List[str]): A list of module names (paths) to import before creating the policy, such as ding.policy.dqn.
- Returns:
policy (Policy): The created policy instance.
Tip
kwargs contains other arguments that need to be passed to the policy constructor. You can refer to the __init__ method of the corresponding policy class for details.
Note
For more details about how to merge configs, please refer to the system document of DI-engine (en link).
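A hedged usage sketch: only type and import_names are required by the registry lookup described above, and the remaining policy config (omitted here) would normally come from the merged experiment config:
from easydict import EasyDict
from ding.policy import create_policy

cfg = EasyDict({
    'type': 'dqn',
    'import_names': ['ding.policy.dqn'],
    # the rest of the merged policy config goes here
})
policy = create_policy(cfg, enable_field=['learn', 'collect', 'eval'])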
get_policy_cls¶
- ding.policy.get_policy_cls(cfg: EasyDict) type [source]¶
- Overview:
Get the policy class according to cfg, which is used to access related class variables/methods.
- Arguments:
cfg (EasyDict): Final merged policy config.
- ArgumentsKeys:
type (str): Policy type set in the POLICY_REGISTRY.register method, such as dqn.
import_names (List[str]): A list of module names (paths) to import before creating the policy, such as ding.policy.dqn.
- Returns:
policy (type): The policy class.
DQN¶
Please refer to ding/policy/dqn.py
for more details.
DQNPolicy¶
- class ding.policy.DQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of DQN algorithm, extended by Double DQN/Dueling DQN/PER/multi-step TD.
- Config:
1. type (str; default: dqn): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool; default: False): Whether to use cuda for the network. This arg can be different between modes.
3. on_policy (bool; default: False): Whether the RL algorithm is on-policy or off-policy.
4. priority (bool; default: False): Whether to use priority (PER). Priority sample, update priority.
5. priority_IS_weight (bool; default: False): Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True.
6. discount_factor (float; default: 0.97, [0.95, 0.999]): Reward's future discount factor, aka. gamma. May be 1 for sparse-reward envs.
7. nstep (int; default: 1, [3, 5]): N-step reward discount sum for target q_value estimation.
8. model.dueling (bool; default: True): Dueling head architecture.
9. model.encoder_hidden_size_list (list of int; default: [32, 64, 64, 128]): Sequence of hidden_size of subsequent conv layers and the final dense layer. Default kernel_size is [8, 4, 3], default stride is [4, 2, 1].
10. model.dropout (float; default: None): Dropout rate for dropout layers, in [0, 1]. If set to None, no dropout is used.
11. learn.update_per_collect (int; default: 3): How many updates (iterations) to train after one collection of the collector. Only valid in serial training. This arg can vary between envs; a bigger value means more off-policy.
12. learn.batch_size (int; default: 64): The number of samples of an iteration.
13. learn.learning_rate (float; default: 0.001): Gradient step length of an iteration.
14. learn.target_update_freq (int; default: 100): Frequency of target network update. Hard (assign) update.
15. learn.target_theta (float; default: 0.005): Frequency of target network update; only one of [target_update_freq, target_theta] should be set. Soft (assign) update.
16. learn.ignore_done (bool; default: False): Whether to ignore done for target value calculation. Enable it for some fake-termination envs.
17. collect.n_sample (int; default: [8, 128]): The number of training samples of one call of the collector. It varies between envs.
18. collect.n_episode (int; default: 8): The number of training episodes of one call of the collector. Only one of [n_sample, n_episode] should be set.
19. collect.unroll_len (int; default: 1): Unroll length of an iteration. In RNN, unroll_len > 1.
20. other.eps.type (str; default: exp): Exploration rate decay type. Supports ['exp', 'linear'].
21. other.eps.start (float; default: 0.95): Start value of exploration rate, in [0, 1].
22. other.eps.end (float; default: 0.1): End value of exploration rate, in [0, 1].
23. other.eps.decay (int; default: 10000): Decay length of exploration, greater than 0. Setting decay=10000 means the exploration rate decays from the start value to the end value over the decay length.
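The keys above map onto a nested config dict; a hedged sketch of a typical DQN config fragment using the defaults listed above (the exact nesting is inferred from the dotted key names):
dqn_config = dict(
    type='dqn',
    cuda=False,
    on_policy=False,
    priority=False,
    discount_factor=0.97,
    nstep=1,
    model=dict(dueling=True, encoder_hidden_size_list=[32, 64, 64, 128]),
    learn=dict(update_per_collect=3, batch_size=64, learning_rate=0.001, target_update_freq=100),
    collect=dict(n_sample=8, unroll_len=1),
    other=dict(eps=dict(type='exp', start=0.95, end=0.1, decay=10000)),
)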
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
eps (float): The epsilon value for exploration.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in the self._process_transition method. The keys of the dict are the same as the input data, i.e. environment ids.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The keys of the dict are the same as the input data, i.e. environment ids.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DQNPolicy: ding.policy.tests.test_dqn.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions) data, process it into a list of samples that can be used for training directly. In DQN with nstep TD, a train sample is a processed transition. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples, each element is similar in format to the input transitions, but may contain more data for training, such as nstep reward and target obs.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For DQN, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sample mechanism, and other algorithm-specific arguments such as unroll_len and nstep. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to be initialized independently in different modes, such as gamma and nstep in DQN. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For DQN, it contains the eval model to greedily select actions with the argmax q_value mechanism. This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DQN, it mainly contains the optimizer, algorithm-specific arguments such as nstep and gamma, and the main and target models. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DQN, it contains obs, next_obs, action, reward, done.
- Arguments:
obs (torch.Tensor): The env observation of the current timestep, such as a stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For DQN, it contains the action and the logit (q_value) of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except that all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizer.
- Returns:
state_dict (Dict[str, Any]): The dict of the current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
Users can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for DQN, its registered name is dqn and the import_names is ding.model.template.q_learning.
DQNSTDIMPolicy¶
- class ding.policy.DQNSTDIMPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of DQN algorithm, extended by ST-DIM auxiliary objectives. ST-DIM paper link: https://arxiv.org/abs/1906.08226.
- Config:
1. type (str; default: dqn_stdim): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool; default: False): Whether to use cuda for the network. This arg can be different between modes.
3. on_policy (bool; default: False): Whether the RL algorithm is on-policy or off-policy.
4. priority (bool; default: False): Whether to use priority (PER). Priority sample, update priority.
5. priority_IS_weight (bool; default: False): Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True.
6. discount_factor (float; default: 0.97, [0.95, 0.999]): Reward's future discount factor, aka. gamma. May be 1 for sparse-reward envs.
7. nstep (int; default: 1, [3, 5]): N-step reward discount sum for target q_value estimation.
8. learn.update_per_collect_gpu (int; default: 3): How many updates (iterations) to train after one collection of the collector. Only valid in serial training. This arg can vary between envs; a bigger value means more off-policy.
9. learn.batch_size (int; default: 64): The number of samples of an iteration.
10. learn.learning_rate (float; default: 0.001): Gradient step length of an iteration.
11. learn.target_update_freq (int; default: 100): Frequency of target network update. Hard (assign) update.
12. learn.ignore_done (bool; default: False): Whether to ignore done for target value calculation. Enable it for some fake-termination envs.
13. collect.n_sample (int; default: [8, 128]): The number of training samples of one call of the collector. It varies between envs.
14. collect.unroll_len (int; default: 1): Unroll length of an iteration. In RNN, unroll_len > 1.
15. other.eps.type (str; default: exp): Exploration rate decay type. Supports ['exp', 'linear'].
16. other.eps.start (float; default: 0.95): Start value of exploration rate, in [0, 1].
17. other.eps.end (float; default: 0.1): End value of exploration rate, in [0, 1].
18. other.eps.decay (int; default: 10000): Decay length of exploration, greater than 0. Setting decay=10000 means the exploration rate decays from the start value to the end value over the decay length.
19. aux_loss_weight (float; default: 0.001): The ratio of the auxiliary loss to the TD loss. Any real value, typically in [-0.1, 0.1].
- _forward_learn(data: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, priority, aux_loss.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DQNSTDIM, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. Values must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DQNSTDIM, it first calls the super class's _init_learn method, then initializes the extra auxiliary model, its optimizer, and the loss weight. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved previously.
Tip
If you want to only load some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _model_encode(data: dict) Tuple[Tensor] [source]¶
- Overview:
Get the encoding of the main model as input for the auxiliary model.
- Arguments:
data (dict): Dict type data, the same as the _forward_learn input.
- Returns:
(Tuple[torch.Tensor]): The tuple of two tensors used for contrastive embedding learning. In the ST-DIM algorithm, these two variables are the DQN encodings of obs and next_obs respectively.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
PPO¶
Please refer to ding/policy/ppo.py
for more details.
PPOPolicy¶
- class ding.policy.PPOPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of on-policy version PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347.
- _forward_collect(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode defined in the self._process_transition method. The keys of the dict are the same as the input data, i.e. environment ids.
Tip
If you want to add more tricks to this policy, like a temperature factor for multinomial sampling, you can pass the related data as extra keyword arguments of this method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs. _forward_eval in PPO often uses a deterministic sample method to get actions, while _forward_collect usually uses a stochastic sample method to balance exploration and exploitation.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The keys of the dict are the same as the input data, i.e. environment ids.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.
- _forward_learn(data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac, approx_kl.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including the latest collected training samples for on-policy algorithms like PPO. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPO, each element in the list is a dict containing at least the following keys: obs, action, reward, logit, value, done. Sometimes, it also contains other keys such as weight.
- Returns:
return_infos (List[Dict[str, Any]]): The information list that indicates the training result; each training iteration appends an information dict to the final list. The list will be processed and recorded in the text log and tensorboard. The values of the dict must be Python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Tip
The training procedure of PPO consists of two for loops. The outer loop trains all the collected training samples for epoch_per_collect epochs. The inner loop splits all the data into different mini-batches of length batch_size.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOPolicy: ding.policy.tests.test_ppo.
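The two-loop structure mentioned in the tip above can be sketched as follows (split_data and ppo_update are hypothetical helpers standing in for the shuffle-and-chunk step and one gradient update):
return_infos = []
for epoch in range(epoch_per_collect):
    for minibatch in split_data(train_data, batch_size):  # hypothetical shuffle-and-chunk helper
        info = ppo_update(minibatch)                       # one gradient step on this mini-batch
        return_infos.append(info)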
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions) data, process it into a list of samples that can be used for training directly. In PPO, a train sample is a processed transition with the newly computed traj_flag and adv fields. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions), each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples, each element is in a similar format to the input transitions, but may contain more data for training, such as GAE advantage.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For PPO, it contains the collect_model to balance exploration and exploitation (e.g. the multinomial sample mechanism in discrete action spaces), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to be initialized independently in different modes, such as gamma and gae_lambda in PPO. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For PPO, it contains the eval model to select the optimal action (e.g. greedily selecting the action with the argmax mechanism in discrete action spaces). This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For PPO, it mainly contains the optimizer and algorithm-specific arguments such as loss weight, clip_ratio and recompute_adv. This method also executes some special network initializations and prepares a running mean/std monitor for the value. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as the text logger or tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPO, it contains obs, next_obs, action, reward, done, logit, value.
- Arguments:
obs (torch.Tensor): The env observation of the current timestep, such as a stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For PPO, it contains the state value, the action and the logit of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except that all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
Note
next_obs is used to calculate the nstep return when necessary, so we place it into the transition by default. You can delete this field to save memory if you do not need the nstep return.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
Users can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for PPO, its registered name is ppo and the import_names is ding.model.template.vac.
Note
Because PPO now supports both single-agent and multi-agent usages, we implement these functions with the same policy and two different default models, which is controlled by self._cfg.multi_agent.
PPOPGPolicy¶
- class ding.policy.PPOPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of the on-policy version of the PPO algorithm (pure policy gradient without a value network). Paper link: https://arxiv.org/abs/1707.06347.
- _forward_collect(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is the environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit) for learn mode defined in the self._process_transition method. The keys of the dict are the same as the input data, i.e. environment ids.
Tip
If you want to add more tricks to this policy, like a temperature factor for multinomial sampling, you can pass the related data as extra keyword arguments of this method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
_forward_eval
in PPO often uses deterministic sample method to get actions while_forward_collect
usually uses stochastic sample method for balance exploration and exploitation.- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOPGPolicy:
ding.policy.tests.test_ppo
.
- _forward_learn(data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac, approx_kl.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including the latest collected training samples for on-policy algorithms like PPO. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPOPG, each element in the list is a dict containing at least the following keys: obs, action, return, logit, done. Sometimes, it also contains other keys such as weight.
- Returns:
return_infos (
List[Dict[str, Any]]
): The information list that indicates the training result; each training iteration appends an information dict to the final list. The list will be processed and recorded in the text log and tensorboard. The values of the dict must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Tip
The training procedure of PPOPG consists of two for loops. The outer loop trains on all the collected training samples for epoch_per_collect epochs. The inner loop splits all the data into mini-batches of length batch_size.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
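As a rough illustration of the tip above, the two-loop training structure could be sketched as follows (the names update_fn, samples, epoch_per_collect and batch_size are placeholders here, not DI-engine internals):

    import random

    def ppo_pg_train(update_fn, samples, epoch_per_collect, batch_size):
        # Outer loop: revisit the freshly collected samples for several epochs.
        for _ in range(epoch_per_collect):
            random.shuffle(samples)
            # Inner loop: split the data into mini-batches and update on each one.
            for i in range(0, len(samples), batch_size):
                update_fn(samples[i:i + batch_size])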
- _get_train_sample(data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For the given entire episode data (a list of transitions), process it into a list of samples that can be used for training directly. In PPOPG, a train sample is a processed transition with a newly computed return field. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
data (
List[Dict[str, Any]]
): The episode data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the discounted episode return.
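A minimal sketch of how such a discounted episode return could be computed and attached to each transition, assuming a scalar reward field and a gamma hyperparameter (DI-engine's own helper may differ in details):

    import copy

    def attach_discounted_return(episode, gamma=0.99):
        # episode: list of transition dicts, each containing at least a scalar 'reward'.
        samples = copy.deepcopy(episode)
        running_return = 0.0
        # Walk the episode backwards and accumulate the discounted return.
        for transition in reversed(samples):
            running_return = transition['reward'] + gamma * running_return
            transition['return'] = running_return
        return samples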
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For PPOPG, it contains the collect_model to balance the exploration and exploitation (e.g. the multinomial sample mechanism in discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in
__init__
method ifcollect
field is inenable_field
.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to initialize independently in different modes, such as gamma and gae_lambda in PPO. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For PPOPG, it contains the eval model to select the optimal action (e.g. greedily select the action with the argmax mechanism in discrete action space). This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For PPOPG, it mainly contains optimizer, algorithm-specific arguments such as loss weight and clip_ratio. This method also executes some special network initializations. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
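For intuition only, the returned key list acts as a whitelist over the scalars produced by _forward_learn; the key names below are illustrative, not the exact PPOPG set:

    monitor_keys = ['cur_lr', 'total_loss', 'policy_loss', 'entropy_loss', 'approx_kl', 'clipfrac']
    train_info = {
        'cur_lr': 3e-4, 'total_loss': 0.42, 'policy_loss': 0.40,
        'entropy_loss': 0.02, 'approx_kl': 0.01, 'clipfrac': 0.1,
    }
    # The logger only records the whitelisted scalars.
    logged = {k: v for k, v in train_info.items() if k in monitor_keys}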
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPOPG, it contains obs, action, reward, done, logit.
- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For PPOPG, it contains the action and the logit of the action.timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
PPOOffPolicy¶
- class ding.policy.PPOOffPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of the off-policy version of the PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347. This version is more suitable for large-scale distributed training.
- _forward_collect(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode, as defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Tip
If you want to add more tricks on this policy, like temperature factor in multinomial sample, you can pass related data as extra keyword arguments of this method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOOffPolicy:
ding.policy.tests.test_ppo
.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
_forward_eval in PPO often uses the deterministic sample method to get actions, while _forward_collect usually uses the stochastic sample method to balance exploration and exploitation.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PPOOffPolicy:
ding.policy.tests.test_ppo
.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, clipfrac and approx_kl.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For PPOOff, each element in the list is a dict containing at least the following keys: obs, adv, action, logit, value, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In PPO, a train sample is a processed transition with newly computed traj_flag and adv fields. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the GAE advantage.
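A textbook GAE computation over one trajectory, shown only to clarify what the extra adv field contains; DI-engine uses its own gae utility, whose interface may differ:

    def compute_gae(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
        # rewards, values, next_values, dones: equal-length per-step lists for one trajectory.
        advantages = [0.0] * len(rewards)
        last_adv = 0.0
        for t in reversed(range(len(rewards))):
            not_done = 1.0 - float(dones[t])
            delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
            last_adv = delta + gamma * gae_lambda * not_done * last_adv
            advantages[t] = last_adv
        return advantages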
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For PPOOff, it contains collect_model to balance the exploration and exploitation (e.g. the multinomial sample mechanism in discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda. This method will be called in
__init__
method ifcollect
field is inenable_field
.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to initialize independently in different modes, such as gamma and gae_lambda in PPOOff. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For PPOOff, it contains the eval model to select the optimal action (e.g. greedily select the action with the argmax mechanism in discrete action space). This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For PPOOff, it mainly contains optimizer, algorithm-specific arguments such as loss weight and clip_ratio. This method also executes some special network initializations and prepares running mean/std monitor for value. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PPO, it contains obs, next_obs, action, reward, done, logit, value.
- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For PPO, it contains the state value, action and the logit of the action.timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
Note
next_obs is used to calculate the nstep return when necessary, so we place it into the transition by default. You can delete this field to save memory if you do not need the nstep return.
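To make the role of next_obs concrete, a textbook n-step return bootstraps from a value estimate of the observation n steps ahead, roughly as sketched below (gamma and the bootstrap value are assumptions, not DI-engine's exact nstep helper):

    def nstep_return(rewards, bootstrap_value, gamma=0.99):
        # rewards: the n rewards following the current step; bootstrap_value: V(next_obs) after n steps.
        ret = bootstrap_value
        for r in reversed(rewards):
            ret = r + gamma * ret
        return ret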
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
PPOSTDIMPolicy¶
- class ding.policy.PPOSTDIMPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of the on-policy version of the PPO algorithm with an ST-DIM auxiliary model. PPO paper link: https://arxiv.org/abs/1707.06347. ST-DIM paper link: https://arxiv.org/abs/1906.08226.
- _forward_learn(data: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
Forward and backward function of learn mode.
- Arguments:
data (
dict
): Dict type data
- Returns:
info_dict (
Dict[str, Any]
): Including current lr, total_loss, policy_loss, value_loss, entropy_loss, adv_abs_max, approx_kl, clipfrac
- _init_learn() None [source]¶
- Overview:
Learn mode init method, called by self.__init__. Initialize the auxiliary model, its optimizer, and the weight of the auxiliary loss relative to the main loss.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): The dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
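A minimal, self-contained sketch of the partial-loading trick mentioned in the tip, using plain PyTorch (the tiny model and checkpoint are placeholders):

    import torch

    model = torch.nn.Linear(4, 2)                    # placeholder for the policy's network
    partial_ckpt = {'weight': torch.zeros(2, 4)}     # checkpoint that is missing the 'bias' key
    # strict=False skips missing/unexpected keys instead of raising an error.
    result = model.load_state_dict(partial_ckpt, strict=False)
    print('missing keys:', result.missing_keys, 'unexpected keys:', result.unexpected_keys)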
- _model_encode(data)[source]¶
- Overview:
Get the encoding of the main model as input for the auxiliary model.
- Arguments:
data (
dict
): Dict type data, same as the _forward_learn input.
- Returns:
- (
Tuple[Tensor]
): The tuple of two tensors used for contrastive embedding learning. In the ST-DIM algorithm, these two variables are the dqn encodings of obs and next_obs respectively.
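The two returned encodings are meant for contrastive embedding learning; an InfoNCE-style objective over such a pair could look like the sketch below (illustrative only, not the exact ST-DIM loss used by this policy):

    import torch
    import torch.nn.functional as F

    def info_nce(obs_embed, next_obs_embed, temperature=0.1):
        # obs_embed, next_obs_embed: (B, D) encodings of obs and next_obs; matching rows are positives.
        x = F.normalize(obs_embed, dim=1)
        y = F.normalize(next_obs_embed, dim=1)
        logits = x @ y.t() / temperature              # (B, B) similarity matrix
        labels = torch.arange(x.shape[0], device=x.device)
        return F.cross_entropy(logits, labels)        # positives lie on the diagonal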
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
BC¶
Please refer to ding/policy/bc.py
for more details.
BehaviourCloningPolicy¶
- class ding.policy.BehaviourCloningPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Behaviour Cloning (BC) policy class, which supports both discrete and continuous action spaces. The policy is trained by supervised learning, and the data is an offline dataset collected by an expert.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss and time.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For BC, each element in the list is a dict containing at least the following keys: obs, action.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
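Conceptually, a learn-mode step of discrete-action BC is plain supervised classification of the expert action. A textbook sketch (model, optimizer and batch tensors are placeholders, not DI-engine internals):

    import torch
    import torch.nn.functional as F

    def bc_train_step(model, optimizer, batch):
        # batch['obs']: (B, obs_dim) float tensor; batch['action']: (B,) long tensor of expert actions.
        logit = model(batch['obs'])
        loss = F.cross_entropy(logit, batch['action'])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return {'total_loss': loss.item()}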
- _init_collect() None [source]¶
- Overview:
BC policy uses an offline dataset, so it does not need to collect data. However, sometimes we need to use the trained BC policy to collect data for other purposes.
- _init_eval()[source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For BC, it contains the eval model to greedily select action with argmax q_value mechanism for discrete action space. This method will be called in
__init__
method ifeval
field is inenable_field
.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For BC, it mainly contains optimizer, algorithm-specific arguments such as lr_scheduler, loss, etc. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
Note
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For the discrete BC example, its registered name is discrete_bc and the import_names is ding.model.template.bc.
DDPG¶
Please refer to ding/policy/ddpg.py
for more details.
DDPGPolicy¶
- class ding.policy.DDPGPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of DDPG algorithm. Paper link: https://arxiv.org/abs/1509.02971.
- Config:
(Each entry gives the config key, its type and default value, a description, and any additional notes from the original Other/Shape column.)
1. type (str, default: ddpg): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool, default: False): Whether to use cuda for the network.
3. random_collect_size (int, default: 25000): Number of randomly collected training samples in the replay buffer when training starts. Default is 25000 for DDPG/TD3, 10000 for SAC.
4. model.twin_critic (bool, default: False): Whether to use two critic networks or only one. Default is False for DDPG; the Clipped Double Q-learning method in the TD3 paper.
5. learn.learning_rate_actor (float, default: 1e-3): Learning rate for the actor network (aka. policy).
6. learn.learning_rate_critic (float, default: 1e-3): Learning rate for the critic network (aka. Q-network).
7. learn.actor_update_freq (int, default: 2): When the critic network updates once, how many times the actor network updates. Default is 1 for DDPG, 2 for TD3; the Delayed Policy Updates method in the TD3 paper.
8. learn.noise (bool, default: False): Whether to add noise on the target network's action. Default is False for DDPG, True for TD3; Target Policy Smoothing Regularization in the TD3 paper.
9. learn.ignore_done (bool, default: False): Determine whether to ignore the done flag. Use ignore_done only in the halfcheetah env.
10. learn.target_theta (float, default: 0.005): Used for the soft update of the target network, aka. the interpolation factor in polyak averaging for target networks.
11. collect.noise_sigma (float, default: 0.1): Used to add noise during collection by controlling the sigma of the distribution. Sample noise from a distribution: Ornstein-Uhlenbeck process in the DDPG paper, Gaussian process in ours.
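The entries above map directly onto nested config fields. A hedged sketch of overriding a few of them with EasyDict (key names are taken from the table; merging with the full default config is handled by DI-engine itself):

    from easydict import EasyDict

    ddpg_overrides = EasyDict(dict(
        cuda=False,
        random_collect_size=25000,
        model=dict(twin_critic=False),
        learn=dict(
            learning_rate_actor=1e-3,
            learning_rate_critic=1e-3,
            actor_update_freq=1,
            noise=False,
            target_theta=0.005,
        ),
        collect=dict(noise_sigma=0.1),
    ))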
- _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data for learn mode, as defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DDPG, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and logit, which is used for the hybrid action space.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In DDPG, a train sample is a processed transition (unroll_len=1).
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For DDPG, it contains the collect_model to balance the exploration and exploitation with the perturbed noise mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in
__init__
method ifcollect
field is inenable_field
.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For DDPG, it contains the eval model to greedily select action type with argmax q_value mechanism for hybrid action space. This method will be called in
__init__
method ifeval
field is inenable_field
.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DDPG, it mainly contains two optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): The dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DDPG, it contains obs, next_obs, action, reward, done.
- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For DDPG, it contains the action and the logit of the action (in hybrid action space).timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizers.
- Returns:
state_dict (
Dict[str, Any]
): The dict of current policy learn state, for saving and restoring.
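A minimal sketch of how this hook pairs with _load_state_dict_learn for checkpointing, assuming policy is an already-constructed DDPGPolicy; real training usually delegates this to DI-engine's learner and checkpoint hooks:

    import torch

    def save_learn_state(policy, path='ddpg_ckpt.pth.tar'):
        # Save model, target model and optimizer states of learn mode.
        torch.save(policy._state_dict_learn(), path)

    def restore_learn_state(policy, path='ddpg_ckpt.pth.tar'):
        policy._load_state_dict_learn(torch.load(path, map_location='cpu'))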
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
TD3¶
Please refer to ding/policy/td3.py
for more details.
TD3Policy¶
- class ding.policy.TD3Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of the TD3 algorithm. Since DDPG and TD3 share many common things, we can easily derive this TD3 class from the DDPG class by changing _actor_update_freq, _twin_critic and the noise in the model wrapper. Paper link: https://arxiv.org/pdf/1802.09477.pdf
- Config:
(Each entry gives the config key, its type and default value, a description, and any additional notes from the original Other/Shape column.)
1. type (str, default: td3): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool, default: False): Whether to use cuda for the network.
3. random_collect_size (int, default: 25000): Number of randomly collected training samples in the replay buffer when training starts. Default is 25000 for DDPG/TD3, 10000 for SAC.
4. model.twin_critic (bool, default: True): Whether to use two critic networks or only one. Default is True for TD3; the Clipped Double Q-learning method in the TD3 paper.
5. learn.learning_rate_actor (float, default: 1e-3): Learning rate for the actor network (aka. policy).
6. learn.learning_rate_critic (float, default: 1e-3): Learning rate for the critic network (aka. Q-network).
7. learn.actor_update_freq (int, default: 2): When the critic network updates once, how many times the actor network updates. Default is 2 for TD3, 1 for DDPG; the Delayed Policy Updates method in the TD3 paper.
8. learn.noise (bool, default: True): Whether to add noise on the target network's action. Default is True for TD3, False for DDPG; Target Policy Smoothing Regularization in the TD3 paper.
9. learn.noise_range (dict, default: dict(min=-0.5, max=0.5)): Limit for the range of target policy smoothing noise, aka. noise_clip.
10. learn.ignore_done (bool, default: False): Determine whether to ignore the done flag. Use ignore_done only in the halfcheetah env.
11. learn.target_theta (float, default: 0.005): Used for the soft update of the target network, aka. the interpolation factor in polyak averaging for target networks.
12. collect.noise_sigma (float, default: 0.1): Used to add noise during collection by controlling the sigma of the distribution. Sample noise from a distribution: Ornstein-Uhlenbeck process in the DDPG paper, Gaussian process in ours.
- _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any] ¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data for learn mode, as defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] ¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] ¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DDPG, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight and logit, which is used for the hybrid action space.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DDPGPolicy:
ding.policy.tests.test_ddpg
.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] ¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In DDPG, a train sample is a processed transition (unroll_len=1).
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None ¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For DDPG, it contains the collect_model to balance the exploration and exploitation with the perturbed noise mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in
__init__
method ifcollect
field is inenable_field
.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None ¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For DDPG, it contains the eval model to greedily select action type with argmax q_value mechanism for hybrid action space. This method will be called in
__init__
method ifeval
field is inenable_field
.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None ¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DDPG, it mainly contains two optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. This method will be called in
__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None ¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): The dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] ¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For DDPG, it contains obs, next_obs, action, reward, done.
- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For DDPG, it contains the action and the logit of the action (in hybrid action space).timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] ¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizers.
- Returns:
state_dict (
Dict[str, Any]
): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] ¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
SAC¶
Please refer to ding/policy/sac.py
for more details.
SACPolicy¶
- class ding.policy.SACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of continuous SAC algorithm. Paper link: https://arxiv.org/pdf/1801.01290.pdf
- Config:
(Each entry gives the config key, its type and default value, a description, and any additional notes from the original Other column.)
1. type (str, default: sac): RL policy register name, refer to registry POLICY_REGISTRY. This arg is optional, a placeholder.
2. cuda (bool, default: True): Whether to use cuda for the network.
3. on_policy (bool, default: False): SAC is an off-policy algorithm.
4. priority (bool, default: False): Whether to use priority sampling in the buffer.
5. priority_IS_weight (bool, default: False): Whether to use the Importance Sampling weight to correct the biased update.
6. random_collect_size (int, default: 10000): Number of randomly collected training samples in the replay buffer when training starts. Default is 10000 for SAC, 25000 for DDPG/TD3.
7. learn.learning_rate_q (float, default: 3e-4): Learning rate for the soft Q network. Default is 1e-3.
8. learn.learning_rate_policy (float, default: 3e-4): Learning rate for the policy network. Default is 1e-3.
9. learn.alpha (float, default: 0.2): Entropy regularization coefficient. alpha is the initialization for auto alpha when auto_alpha is True.
10. learn.auto_alpha (bool, default: False): Determine whether to use the auto temperature parameter alpha. The temperature parameter determines the relative importance of the entropy term against the reward.
11. learn.ignore_done (bool, default: False): Determine whether to ignore the done flag. Use ignore_done only in envs like Pendulum.
12. learn.target_theta (float, default: 0.005): Used for the soft update of the target network, aka. the interpolation factor in polyak averaging for target networks.
- _forward_collect(data: Dict[int, Any], **kwargs) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data for learn mode, as defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
logit in SAC means the mu and sigma of the Gaussian distribution. Here we use this name for consistency.
Note
For more detailed examples, please refer to our unittest for SACPolicy:
ding.policy.tests.test_sac
.
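Since logit packs the Gaussian parameters, collect-mode sampling for continuous SAC is conventionally a squashed, reparameterized draw, roughly as below (textbook form, not necessarily the exact model wrapper used by this policy):

    import torch

    def sample_squashed_action(mu, sigma):
        # mu, sigma: (B, action_dim) Gaussian parameters predicted by the actor.
        dist = torch.distributions.Normal(mu, sigma)
        pre_tanh = dist.rsample()          # reparameterized sample keeps gradients
        return torch.tanh(pre_tanh)        # squash into [-1, 1]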
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
logit in SAC means the mu and sigma of the Gaussian distribution. Here we use this name for consistency.
Note
For more detailed examples, please refer to our unittest for SACPolicy:
ding.policy.tests.test_sac
.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (
List[Dict[str, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor, np.ndarray or their dict/list combinations. In the _forward_learn method, data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. If a data type is not supported, the main reason is that the corresponding model does not support it. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for SACPolicy:
ding.policy.tests.test_sac
.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In continuous SAC, a train sample is a processed transition (unroll_len=1).
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For SAC, it contains the collect_model and other algorithm-specific arguments such as unroll_len. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For SAC, it contains the eval model, which is equipped with the base model wrapper to ensure compatibility. This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For SAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, main and target model. Especially, the
auto_alpha
mechanism for balancing max entropy target is also initialized here. This method will be called in__init__
method iflearn
field is inenable_field
.
Note
For the member variables that need to be saved and loaded, please refer to the
_state_dict_learn
and_load_state_dict_learn
methods.Note
For the member variables that need to be monitored, please refer to the
_monitor_vars_learn
method.Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
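The auto_alpha mechanism keeps a learnable temperature. A textbook sketch of its initialization and update step (target_entropy, the learning rate and the variable names are placeholders, not the exact internals of this policy):

    import torch

    action_dim = 6
    target_entropy = -float(action_dim)                    # common heuristic for continuous SAC
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

    def update_alpha(log_prob):
        # log_prob: (B,) log-probabilities of freshly sampled actions.
        alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        return log_alpha.exp().item()                      # current alpha value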
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): The dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For continuous SAC, it contains obs, next_obs, action, reward, done. The logit will be also added when
collector_logit
is True.- Arguments:
obs (
torch.Tensor
): The env observation of current timestep, such as stacked 2D image in Atari.policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For continuous SAC, it contains the action and the logit (mu and sigma) of the action.timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizers.
- Returns:
state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
DiscreteSACPolicy¶
- class ding.policy.DiscreteSACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of discrete SAC algorithm. Paper link: https://arxiv.org/abs/1910.07207.
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
eps (float): The epsilon value for exploration.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data for learn mode defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.
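The eps argument drives the classic epsilon-greedy rule. The following standalone sketch (not DI-engine code) shows what that rule amounts to for a batch of discrete-action logits:

```python
import torch


def epsilon_greedy(logit: torch.Tensor, eps: float) -> torch.Tensor:
    # logit: (batch, action_dim) action preferences; with probability eps we
    # take a uniformly random action, otherwise the greedy argmax action.
    greedy_action = logit.argmax(dim=-1)
    random_action = torch.randint(0, logit.shape[-1], greedy_action.shape)
    use_random = torch.rand(greedy_action.shape) < eps
    return torch.where(use_random, random_action, greedy_action)


actions = epsilon_greedy(torch.randn(4, 6), eps=0.1)  # 4 envs, 6 discrete actions
```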
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, logit, reward, next_obs, done. Sometimes, it also contains other keys like weight.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for DiscreteSACPolicy: ding.policy.tests.test_discrete_sac.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In discrete SAC, a train sample is a processed transition (unroll_len=1).
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For SAC, it contains the collect_model to balance exploration and exploitation with the epsilon and multinomial sample mechanism, and other algorithm-specific arguments such as unroll_len. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For DiscreteSAC, it contains the eval model to greedily select the action type with the argmax q_value mechanism. This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DiscreteSAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved before.
Tip
If you want to only load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For discrete SAC, it contains obs, next_obs, logit, action, reward, done.
- Arguments:
obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For discrete SAC, it contains the action and the logit of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizers.
- Returns:
state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
SQILSACPolicy¶
- class ding.policy.SQILSACPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of continuous SAC algorithm with SQIL extension. SAC paper link: https://arxiv.org/pdf/1801.01290.pdf; SQIL paper link: https://arxiv.org/abs/1905.11108
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For SAC, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
For SQIL + SAC, the input data is composed of two parts of the same size: agent data and expert data. Both of them are relabelled with a new reward according to the SQIL algorithm, as sketched below.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for SACPolicy: ding.policy.tests.test_sac.
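A hedged sketch of the relabelling step mentioned in the SQIL note above: expert transitions get a constant reward of 1 and agent transitions a reward of 0 before the usual SAC update. The batches and shapes are made up; this is not the library's actual implementation.

```python
import torch

# Hypothetical half-and-half training batch (64 agent + 64 expert transitions).
agent_batch = {'reward': torch.randn(64)}
expert_batch = {'reward': torch.randn(64)}

# SQIL relabelling: agent reward -> 0, expert (demonstration) reward -> 1.
agent_batch['reward'] = torch.zeros_like(agent_batch['reward'])
expert_batch['reward'] = torch.ones_like(expert_batch['reward'])
```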
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For SAC, it mainly contains three optimizers, algorithm-specific arguments such as gamma and twin_critic, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
R2D2¶
Please refer to ding/policy/r2d2.py
for more details.
R2D2Policy¶
- class ding.policy.R2D2Policy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of R2D2, from the paper Recurrent Experience Replay in Distributed Reinforcement Learning. R2D2 proposes that several tricks should be used to improve upon DRQN, namely some recurrent experience replay tricks and the burn-in mechanism for off-policy training.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | r2d2 | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True. | |
| 6 | discount_factor | float | 0.997, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
| 7 | nstep | int | 3, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 8 | burnin_step | int | 2 | The timestep of the burn-in operation, which is designed to mitigate the RNN hidden state difference caused by off-policy training | |
| 9 | learn.update_per_collect | int | 1 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | This arg can vary across envs. Bigger val means more off-policy |
| 10 | learn.batch_size | int | 64 | The number of samples of an iteration | |
| 11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration | |
| 12 | learn.value_rescale | bool | True | Whether to use the value_rescale function for the predicted value | |
| 13 | learn.target_update_freq | int | 100 | Frequency of target network update | Hard (assign) update |
| 14 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake termination envs |
| 15 | collect.n_sample | int | [8, 128] | The number of training samples of a call of collector | It varies across different envs |
| 16 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1 |
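A hedged sketch of how the defaults in the table above could be assembled into a config dict. Only keys documented in the table are listed; real experiment configs contain more fields, and collect.n_sample is chosen here arbitrarily from the documented [8, 128] range.

```python
from easydict import EasyDict

r2d2_cfg = EasyDict(dict(
    type='r2d2',
    cuda=False,
    on_policy=False,
    priority=False,
    priority_IS_weight=False,
    discount_factor=0.997,
    nstep=3,
    burnin_step=2,
    learn=dict(
        update_per_collect=1,
        batch_size=64,
        learning_rate=0.001,
        value_rescale=True,
        target_update_freq=100,
        ignore_done=False,
    ),
    collect=dict(
        n_sample=32,        # documented range: [8, 128], varies per env
        unroll_len=1,       # set > 1 when the RNN needs longer unrolls
    ),
))
```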
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
eps (float): The epsilon value for exploration.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (prev_state) for learn mode defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The RNN's hidden states are maintained in the policy, so we don't need to pass them in with the data; instead, we reset the hidden states with the _reset_collect method when an episode ends. Besides, the previous hidden states are necessary for training, so we need to return them in the _process_transition method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for R2D2Policy: ding.policy.tests.test_r2d2.
- _forward_learn(data: List[List[Dict[str, Any]]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data (trajectory for R2D2) from the replay buffer and then returns the output result, including various training information such as loss, q value, priority.
- Arguments:
data (List[List[Dict[str, Any]]]): The input data used for policy forward, including a batch of training samples. For each dict element, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the time and batch dimensions by the utility function self._data_preprocess_learn. For R2D2, each element in the list is a trajectory with the length of unroll_len, and each element in the trajectory list is a dict containing at least the following keys: obs, action, prev_state, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for R2D2Policy: ding.policy.tests.test_r2d2.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In R2D2, a train sample is a processed transition sequence of unroll_len length. This method is usually used in collectors to execute the necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples; each sample is a fixed-length trajectory, and each element in a sample is in a similar format to the input transitions, but may contain more data for training, such as the nstep reward and value_gamma factor.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For R2D2, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sample mechanism and to maintain the hidden state of the RNN. Besides, there are some initialization operations for other algorithm-specific arguments such as burnin_step, unroll_len and nstep. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
Tip
Some variables need to be initialized independently in different modes, such as gamma and nstep in R2D2. This design is for the convenience of parallel execution of different policy modes.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including some attributes and modules. For R2D2, it mainly contains the optimizer, algorithm-specific arguments such as burnin_step, value_rescale and gamma, and the main and target models. Because of the use of RNN, all the models should be wrapped with hidden_state, which needs to be initialized with the proper size. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved before.
Tip
If you want to only load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For R2D2, it contains obs, action, prev_state, reward, and done.
- Arguments:
obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network given the observation as input. For R2D2, it contains the action and the prev_state of the RNN.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- _reset_collect(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for collect mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in collecting in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in R2D2) specified by data_id.
- _reset_eval(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for eval mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in evaluation in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in R2D2) specified by data_id.
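A minimal, library-independent sketch of the bookkeeping these reset methods perform: one RNN hidden state is kept per environment id, and only the ids passed in data_id are cleared. The state shape is an illustrative assumption.

```python
import torch

# One hidden state per environment id (shape is made up).
hidden_states = {env_id: torch.zeros(1, 64) for env_id in range(4)}


def reset(data_id=None):
    ids = list(hidden_states) if data_id is None else data_id
    for env_id in ids:
        hidden_states[env_id] = torch.zeros(1, 64)


reset(data_id=[2])  # only env 2 finished its episode, so only its state is cleared
reset()             # reset every environment's hidden state
```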
- _reset_learn(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for learn mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different trajectories in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in R2D2) specified by data_id.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizer.
- Returns:
state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for R2D2, its registered name is drqn and the import_names is ding.model.template.q_learning.
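Given the note above, default_model for R2D2 is expected to return a tuple equivalent to the sketch below (shown here only to illustrate the (model name, import_names) convention):

```python
# Registered model name and where it can be imported from, per the note above.
model_info = ('drqn', ['ding.model.template.q_learning'])
model_name, import_names = model_info
```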
IMPALA¶
Please refer to ding/policy/impala.py
for more details.
IMPALAPolicy¶
- class ding.policy.IMPALAPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of IMPALA algorithm. Paper link: https://arxiv.org/abs/1802.01561.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | impala | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight | If True, priority must be True |
| 6 | unroll_len | int | 32 | Trajectory length to calculate v-trace target | |
| 7 | learn.update_per_collect | int | 4 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | This arg can vary across envs. Bigger val means more off-policy |
- _forward_collect(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (action logit and value) for learn mode defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Tip
If you want to add more tricks to this policy, like a temperature factor in multinomial sampling (see the sketch below), you can pass the related data as extra keyword arguments of this method.
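A small, standalone sketch of the kind of trick mentioned in the tip above: temperature-scaled multinomial sampling over action logits. The temperature value and shapes are illustrative.

```python
import torch


def sample_with_temperature(logit: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Lower temperature -> sharper distribution, higher -> closer to uniform.
    prob = torch.softmax(logit / temperature, dim=-1)
    return torch.multinomial(prob, num_samples=1).squeeze(-1)


actions = sample_with_temperature(torch.randn(4, 6), temperature=0.8)  # 4 envs, 6 actions
```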
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to the unittest for IMPALAPolicy: ding.policy.tests.test_impala.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs. _forward_eval in IMPALA often uses deterministic sampling to get actions, while _forward_collect usually uses stochastic sampling to balance exploration and exploitation.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to the unittest for IMPALAPolicy: ding.policy.tests.test_impala.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss and current learning rate.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For IMPALA, each element in the list is a dict containing at least the following keys: obs, action, logit, reward, next_obs, done. Sometimes, it also contains other keys such as weight.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to the unittest for IMPALAPolicy: ding.policy.tests.test_impala.
- _get_train_sample(data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training. In IMPALA, a train sample is a processed transition sequence of unroll_len length.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For IMPALA, it contains the collect_model to balance exploration and exploitation (e.g. the multinomial sample mechanism in discrete action spaces), and other algorithm-specific arguments such as unroll_len. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For IMPALA, it contains the eval model to select the optimal action (e.g. greedily select the action with the argmax mechanism in discrete action spaces). This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you'd better name them with the prefix _eval_ to avoid conflicts with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For IMPALA, it mainly contains the optimizer, algorithm-specific arguments such as loss weight and gamma, and the main (learn) model. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For IMPALA, it contains obs, next_obs, action, reward, done, logit.
- Arguments:
obs (torch.Tensor): The env observation of current timestep, such as stacked 2D image in Atari.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For IMPALA, it contains the action and the logit of the action.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (Tuple[str, List[str]]): The registered model name and the model's import_names.
Note
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for IMPALA, its registered name is vac and the import_names is ding.model.template.vac.
QMIX¶
Please refer to ding/policy/qmix.py
for more details.
QMIXPolicy¶
- class ding.policy.QMIXPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of QMIX algorithm. QMIX is a multi-agent reinforcement learning algorithm; you can view the paper at the following link: https://arxiv.org/abs/1803.11485.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | qmix | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | True | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. | IS weight |
| 6 | learn.update_per_collect | int | 20 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | This arg can vary across envs. Bigger val means more off-policy |
| 7 | learn.target_update_theta | float | 0.001 | Target network update momentum parameter. | Between [0, 1] |
| 8 | learn.discount_factor | float | 0.99 | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs the eps argument for exploration, i.e., the classic epsilon-greedy exploration strategy.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
eps (float): The epsilon value for exploration.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action and other necessary data (prev_state) for learn mode defined in the self._process_transition method. The key of the dict is the same as the input data, i.e. environment id.
Note
The RNN's hidden states are maintained in the policy, so we don't need to pass them in with the data; instead, we reset the hidden states with the _reset_collect method when an episode ends. Besides, the previous hidden states are necessary for training, so we need to return them in the _process_transition method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for QMIXPolicy: ding.policy.tests.test_qmix.
- _forward_learn(data: List[List[Dict[str, Any]]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data (trajectory for QMIX) from the replay buffer and then returns the output result, including various training information such as loss, q value, grad_norm.
- Arguments:
data (List[List[Dict[str, Any]]]): The input data used for policy forward, including a batch of training samples. For each dict element, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the time and batch dimensions by the utility function self._data_preprocess_learn. For QMIX, each element in the list is a trajectory with the length of unroll_len, and each element in the trajectory list is a dict containing at least the following keys: obs, action, prev_state, reward, next_obs, done. Sometimes, it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for QMIXPolicy: ding.policy.tests.test_qmix.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In QMIX, a train sample is a processed transition sequence of unroll_len length. This method is usually used in collectors to execute the necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (List[Dict[str, Any]]): The trajectory data (a list of transitions); each element is in the same format as the return value of the self._process_transition method.
- Returns:
samples (List[Dict[str, Any]]): The processed train samples; each sample is a fixed-length trajectory, and each element in a sample is in a similar format to the input transitions.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For QMIX, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sample mechanism and to maintain the hidden state of the RNN. Besides, there are some initialization operations for other algorithm-specific arguments such as burnin_step, unroll_len and nstep. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you'd better name them with the prefix _collect_ to avoid conflicts with other modes, such as self._collect_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including some attributes and modules. For QMIX, it mainly contains the optimizer, algorithm-specific arguments such as gamma, and the main and target models. Because of the use of RNN, all the models should be wrapped with hidden_state, which needs to be initialized with the proper size. This method will be called in the __init__ method if the learn field is in enable_field.
Tip
For multi-agent algorithms, we often need to use agent_num to initialize some necessary variables.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- agent_num (int): Since this is a multi-agent algorithm, we need to input the agent num.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (Dict[str, Any]): The dict of policy learn state saved before.
Tip
If you want to only load some parts of the model, you can simply set the strict argument of load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For QMIX, it contains obs, next_obs, action, prev_state, reward, done.
- Arguments:
obs (torch.Tensor): The env observation of current timestep, usually including agent_obs and global_obs in multi-agent environments like MPE and SMAC.
policy_output (Dict[str, torch.Tensor]): The output of the policy network with the observation as input. For QMIX, it contains the action and the prev_state of the RNN.
timestep (namedtuple): The execution result namedtuple returned by the environment step method, except all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (Dict[str, torch.Tensor]): The processed transition data of the current timestep.
- _reset_collect(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for collect mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in collecting in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in QMIX) specified by data_id.
- _reset_eval(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for eval mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in evaluation in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in QMIX) specified by data_id.
- _reset_learn(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for learn mode when necessary, such as the hidden state of RNN or the memory bank of some special algorithms. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different trajectories in data_id will have different hidden states in RNN.
- Arguments:
data_id (Optional[List[int]]): The id of the data, which is used to reset the stateful variables (i.e., RNN hidden_state in QMIX) specified by data_id.
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target_model and optimizer.
- Returns:
state_dict (Dict[str, Any]): The dict of current policy learn state, for saving and restoring.
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return this algorithm's default model setting for demonstration.
- Returns:
model_info (Tuple[str, List[str]]): The model name and the model's import_names.
Note
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For QMIX, it is ding.model.qmix.qmix.
CQL¶
Please refer to ding/policy/cql.py
for more details.
CQLPolicy¶
- class ding.policy.CQLPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of CQL algorithm for continuous control. Paper link: https://arxiv.org/abs/2006.04779.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | cql | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | True | Whether to use cuda for network | |
| 3 | random_collect_size | int | 10000 | Number of randomly collected training samples in replay buffer when training starts. | Default to 10000 for SAC, 25000 for DDPG/TD3. |
| 4 | model.policy_embedding_size | int | 256 | Linear layer size for policy network. | |
| 5 | model.soft_q_embedding_size | int | 256 | Linear layer size for soft q network. | |
| 6 | model.value_embedding_size | int | 256 | Linear layer size for value network. | Default to None when model.value_network is False. |
| 7 | learn.learning_rate_q | float | 3e-4 | Learning rate for soft q network. | Default to 1e-3 when model.value_network is True. |
| 8 | learn.learning_rate_policy | float | 3e-4 | Learning rate for policy network. | Default to 1e-3 when model.value_network is True. |
| 9 | learn.learning_rate_value | float | 3e-4 | Learning rate for value network. | Default to None when model.value_network is False. |
| 10 | learn.alpha | float | 0.2 | Entropy regularization coefficient. | alpha is the initialization for auto alpha, when auto_alpha is True |
| 11 | learn.reparameterization | bool | True | Determine whether to use the reparameterization trick. | |
| 12 | learn.auto_alpha | bool | False | Determine whether to use the auto temperature parameter alpha. | The temperature parameter determines the relative importance of the entropy term against the reward. |
| 13 | learn.ignore_done | bool | False | Determine whether to ignore the done flag. | Use ignore_done only in the halfcheetah env. |
| 14 | learn.target_theta | float | 0.005 | Used for soft update of the target network. | aka. interpolation factor in polyak averaging for target networks. |
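The target_theta entry above parameterizes a polyak (soft) target update. A minimal, library-independent sketch of that rule, target <- theta * online + (1 - theta) * target, using made-up networks:

```python
import torch
import torch.nn as nn

online = nn.Linear(4, 2)
target = nn.Linear(4, 2)
target.load_state_dict(online.state_dict())


def soft_update(target_net: nn.Module, online_net: nn.Module, theta: float = 0.005) -> None:
    # Move each target parameter a small step towards the online parameter.
    with torch.no_grad():
        for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
            t_param.mul_(1.0 - theta).add_(theta * o_param)


soft_update(target, online, theta=0.005)
```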
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For CQL, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys such as weight.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For CQL, it mainly contains three optimizers, algorithm-specific arguments such as gamma, min_q_weight, with_lagrange and with_q_entropy, and the main and target models. In particular, the auto_alpha mechanism for balancing the max entropy target is also initialized here. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
DiscreteCQLPolicy¶
- class ding.policy.DiscreteCQLPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of discrete CQL algorithm in discrete action space environments. Paper link: https://arxiv.org/abs/2006.04779.
- _forward_learn(data: List[Dict[str, Any]]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, action, priority.
- Arguments:
data (List[Dict[str, Any]]): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or their dict/list combinations. In the _forward_learn method, the data often needs to first be stacked in the batch dimension by some utility functions such as default_preprocess_learn. For DiscreteCQL, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes, it also contains other keys like weight and value_gamma for nstep return computation.
- Returns:
info_dict (Dict[str, Any]): The information dict that indicates the training result, which will be recorded in text log and tensorboard. The values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For DiscreteCQL, it mainly contains the optimizer, algorithm-specific arguments such as gamma, nstep and min_q_weight, and the main and target models. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you'd better name them with the prefix _learn_ to avoid conflicts with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of self._forward_learn. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (List[str]): The list of the necessary keys to be logged.
DecisionTransformer¶
Please refer to ding/policy/dt.py
for more details.
DTPolicy¶
- class ding.policy.DTPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of Decision Transformer algorithm in discrete environments. Paper link: https://arxiv.org/abs/2106.01345.
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluating policy performance, such as by interacting with envs). Forward means that the policy gets some input data (current obs/return-to-go and historical information) from the envs and then returns the output data, such as the action to interact with the envs.
- Arguments:
data (Dict[int, Any]): The input data used for policy forward, including at least the obs and reward to calculate the running return-to-go. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (Dict[int, Any]): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
Decision Transformer will do different operations for different types of envs in evaluation.
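- Example:
A sketch of a single eval-mode call, assuming a vectorized env with ids 0 and 1 and a 4-dim observation; the per-env payload being a dict with 'obs' and 'reward' is an assumption based on the description above, and the policy construction is abbreviated in the document's own placeholder style.
>>> import torch
>>> policy = DTPolicy(...)
>>> data = {i: {'obs': torch.randn(4), 'reward': torch.zeros(1)} for i in range(2)}
>>> output = policy.eval_mode.forward(data)  # dispatches to _forward_eval
>>> action = output[0]['action']  # action for env id 0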
- _forward_learn(data: List[Tensor]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the offline dataset and then returns the output result, including various training information such as loss, current learning rate.
- Arguments:
data (
List[torch.Tensor]
): The input data used for policy forward, including a series of processed torch.Tensor data, i.e., timesteps, states, actions, returns_to_go, traj_mask.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn
method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
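- Example:
A hedged sketch of the documented tensor list (timesteps, states, actions, returns_to_go, traj_mask); the batch size, context length, state dimension, action count and exact tensor shapes are assumptions for illustration only.
>>> import torch
>>> policy = DTPolicy(...)
>>> B, T, obs_dim = 8, 20, 17
>>> timesteps = torch.arange(T).unsqueeze(0).repeat(B, 1)
>>> states = torch.randn(B, T, obs_dim)
>>> actions = torch.randint(0, 4, (B, T))  # discrete action indices
>>> returns_to_go = torch.randn(B, T, 1)
>>> traj_mask = torch.ones(B, T, dtype=torch.long)  # 1 marks valid steps
>>> info = policy.learn_mode.forward([timesteps, states, actions, returns_to_go, traj_mask])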
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For Decision Transformer, it contains the eval model and some algorithm-specific parameters such as context_len, max_eval_ep_len, etc. This method will be called in the __init__ method if the eval field is in enable_field.
Tip
For the evaluation of complete episodes, we need to maintain some historical information for transformer inference. These variables need to be initialized in _init_eval and reset in _reset_eval when necessary.
Note
If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflict with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For Decision Transformer, it mainly contains the optimizer, algorithm-specific arguments such as rtg_scale, and the lr scheduler. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _reset_eval(data_id: List[int] | None = None) None [source]¶
- Overview:
Reset some stateful variables for eval mode when necessary, such as the historical info of the transformer for Decision Transformer. If data_id is None, it means to reset all the stateful variables. Otherwise, it will reset the stateful variables according to the data_id. For example, different environments/episodes in data_id will have different histories during evaluation.
- Arguments:
data_id (
Optional[List[int]]
): The id of the data, which is used to reset the stateful variables specified by data_id.
PDQN¶
Please refer to ding/policy/pdqn.py
for more details.
PDQNPolicy¶
- class ding.policy.PDQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of PDQN algorithm, which extends the DQN algorithm on discrete-continuous hybrid action spaces. Paper link: https://arxiv.org/abs/1810.06394.
- Config:
ID | Symbol | Type | Default Value | Description | Other (Shape)
1 | type | str | pdqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | This value is always False for PDQN
4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True. |
6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward’s future discount factor, aka. gamma | May be 1 when sparse reward env
7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation |
8 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after collector’s one collection. Only valid in serial training | This arg can vary from env to env. A bigger value means more off-policy
9 | learn.batch_size | int | 64 | The number of samples of an iteration |
11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration. |
12 | learn.target_update_freq | int | 100 | Frequency of target network update. | Hard (assign) update
13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation. | Enable it for some fake termination envs
14 | collect.n_sample | int | [8, 128] | The number of training samples of a call of collector. | It varies from env to env
15 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1
16 | collect.noise_sigma | float | 0.1 | Add noise to continuous args during collection |
17 | other.eps.type | str | exp | Exploration rate decay type | Support ['exp', 'linear'].
18 | other.eps.start | float | 0.95 | Start value of exploration rate | [0, 1]
19 | other.eps.end | float | 0.05 | End value of exploration rate | [0, 1]
20 | other.eps.decay | int | 10000 | Decay length of exploration | Greater than 0. Setting decay=10000 means the exploration rate decays from the start value to the end value over the decay length.
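- Example:
A configuration sketch assembled only from the keys and default values in the table above; the exact nesting under learn/collect/other follows DI-engine's usual layout and is an assumption rather than a copy of pdqn.py.
>>> from easydict import EasyDict
>>> pdqn_config = EasyDict(dict(
...     type='pdqn', cuda=False, on_policy=False, priority=False, priority_IS_weight=False,
...     discount_factor=0.97, nstep=1,
...     learn=dict(update_per_collect=3, batch_size=64, learning_rate=0.001, target_update_freq=100, ignore_done=False),
...     collect=dict(n_sample=8, unroll_len=1, noise_sigma=0.1),
...     other=dict(eps=dict(type='exp', start=0.95, end=0.05, decay=10000)),
... ))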
- _forward_collect(data: Dict[int, Any], eps: float) Dict[int, Any] [source]¶
- Overview:
Policy forward function of collect mode (collecting training data by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the output data, such as the action to interact with the envs. Besides, this policy also needs
eps
argument for exploration, i.e., classic epsilon-greedy exploration strategy.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
eps (
float
): The epsilon value for exploration.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action and other necessary data for learn mode defined in self._process_transition
method. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PDQNPolicy:
ding.policy.tests.test_pdqn
.
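- Example:
A sketch of a collect-mode call with epsilon-greedy exploration; the env ids, the 11-dim observation and the fixed eps value are hypothetical, and the policy construction is abbreviated in the document's own placeholder style.
>>> import torch
>>> policy = PDQNPolicy(...)
>>> data = {0: torch.randn(11), 1: torch.randn(11)}
>>> output = policy.collect_mode.forward(data, eps=0.5)  # dispatches to _forward_collect
>>> output[0]['action']  # hybrid action: discrete type plus its continuous args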
- _forward_eval(data: Dict[int, Any]) Dict[int, Any] [source]¶
- Overview:
Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward means that the policy gets some necessary data (mainly observation) from the envs and then returns the action to interact with the envs.
- Arguments:
data (
Dict[int, Any]
): The input data used for policy forward, including at least the obs. The key of the dict is environment id and the value is the corresponding data of the env.
- Returns:
output (
Dict[int, Any]
): The output data of policy forward, including at least the action. The key of the dict is the same as the input data, i.e. environment id.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PDQNPolicy:
ding.policy.tests.test_pdqn
.
- _forward_learn(data: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, q value, target_q_value, priority.
- Arguments:
data (
List[Dict[int, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by utility functions such as default_preprocess_learn. For PDQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn
method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for PDQNPolicy:
ding.policy.tests.test_pdqn
.
- _get_train_sample(transitions: List[Dict[str, Any]]) List[Dict[str, Any]] [source]¶
- Overview:
For a given trajectory (transitions, a list of transitions), process it into a list of samples that can be used for training directly. In PDQN, a train sample is a processed transition. This method is usually used in collectors to execute necessary RL data preprocessing before training, which can help the learner amortize the relevant time consumption. In addition, you can also implement this method as an identity function and do the data processing in the self._forward_learn method.
- Arguments:
transitions (
List[Dict[str, Any]]
): The trajectory data (a list of transitions); each element has the same format as the return value of the self._process_transition method.
- Returns:
samples (
List[Dict[str, Any]]
): The processed train samples; each element is in a similar format to the input transitions, but may contain more data for training, such as the nstep reward and target obs.
- _init_collect() None [source]¶
- Overview:
Initialize the collect mode of policy, including related attributes and modules. For PDQN, it contains the collect_model to balance exploration and exploitation with the epsilon-greedy sample mechanism and continuous action mechanism; besides, other algorithm-specific arguments such as unroll_len and nstep are also initialized here. This method will be called in the __init__ method if the collect field is in enable_field.
Note
If you want to set some special member variables in the _init_collect method, you’d better name them with the prefix _collect_ to avoid conflict with other modes, such as self._collect_attr1.
Tip
Some variables need to be initialized independently in different modes, such as gamma and nstep in PDQN. This design is for the convenience of parallel execution of different policy modes.
- _init_eval() None [source]¶
- Overview:
Initialize the eval mode of policy, including related attributes and modules. For PDQN, it contains the eval model to greedily select action with the argmax q_value mechanism. This method will be called in the __init__ method if the eval field is in enable_field.
Note
If you want to set some special member variables in the _init_eval method, you’d better name them with the prefix _eval_ to avoid conflict with other modes, such as self._eval_attr1.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For PDQN, it mainly contains two optimizers, algorithm-specific arguments such as nstep and gamma, and the main and target model. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.
- _load_state_dict_learn(state_dict: Dict[str, Any]) None [source]¶
- Overview:
Load the state_dict variable into policy learn mode.
- Arguments:
state_dict (
Dict[str, Any]
): the dict of policy learn state saved before.
Tip
If you want to load only some parts of the model, you can simply set the strict argument in load_state_dict to False, or refer to ding.torch_utils.checkpoint_helper for more complicated operations.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
- _process_transition(obs: Tensor, policy_output: Dict[str, Tensor], timestep: namedtuple) Dict[str, Tensor] [source]¶
- Overview:
Process and pack one timestep transition data into a dict, which can be directly used for training and saved in replay buffer. For PDQN, it contains obs, next_obs, action, reward, done and logit.
- Arguments:
obs (
torch.Tensor
): The env observation of the current timestep, such as a stacked 2D image in Atari.
policy_output (
Dict[str, torch.Tensor]
): The output of the policy network with the observation as input. For PDQN, it contains the hybrid action and the logit (discrete part q_value) of the action.
timestep (
namedtuple
): The execution result namedtuple returned by the environment step method, except that all the elements have been transformed into tensor data. Usually, it contains the next obs, reward, done, info, etc.
- Returns:
transition (
Dict[str, torch.Tensor]
): The processed transition data of the current timestep.
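- Example:
A sketch of the transition layout described above; the key names follow the Overview (obs, next_obs, action, logit, reward, done), and timestep is assumed to be the env's step-result namedtuple whose obs field holds the next observation.
>>> def pack_transition(obs, policy_output, timestep):
...     return {
...         'obs': obs,
...         'next_obs': timestep.obs,
...         'action': policy_output['action'],  # hybrid action: discrete type + continuous args
...         'logit': policy_output['logit'],    # q_value of the discrete part
...         'reward': timestep.reward,
...         'done': timestep.done,
...     }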
- _state_dict_learn() Dict[str, Any] [source]¶
- Overview:
Return the state_dict of learn mode, usually including model, target model, discrete part optimizer, and continuous part optimizer.
- Returns:
state_dict (
Dict[str, Any]
): the dict of current policy learn state, for saving and restoring.
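- Example:
A checkpoint round-trip sketch using the two methods documented above, assuming the learn_mode interface exposes the usual state_dict/load_state_dict wrappers; the file path is arbitrary and the policy construction is abbreviated.
>>> import torch
>>> policy = PDQNPolicy(...)
>>> state = policy.learn_mode.state_dict()  # calls _state_dict_learn
>>> torch.save(state, './pdqn_ckpt.pth.tar')
>>> policy.learn_mode.load_state_dict(torch.load('./pdqn_ckpt.pth.tar'))  # calls _load_state_dict_learn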
- default_model() Tuple[str, List[str]] [source]¶
- Overview:
Return the default neural network model setting of this algorithm for demonstration. The
__init__
method will automatically call this method to get the default model setting and create the model.
- Returns:
model_info (
Tuple[str, List[str]]
): The registered model name and model’s import_names.
Note
The user can define and use a customized network model but must obey the same interface definition indicated by the import_names path. For example, for PDQN, its registered name is pdqn and the import_names is ding.model.template.pdqn.
MDQN¶
Please refer to ding/policy/mdqn.py
for more details.
MDQNPolicy¶
- class ding.policy.MDQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]¶
- Overview:
Policy class of Munchausen DQN algorithm, extended by auxiliary objectives. Paper link: https://arxiv.org/abs/2007.14430.
- Config:
ID | Symbol | Type | Default Value | Description | Other (Shape)
1 | type | str | mdqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update. If True, priority must be True. |
6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward’s future discount factor, aka. gamma | May be 1 when sparse reward env
7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation |
8 | learn.update_per_collect | int | 1 | How many updates (iterations) to train after collector’s one collection. Only valid in serial training | This arg can vary from env to env. A bigger value means more off-policy
10 | learn.batch_size | int | 32 | The number of samples of an iteration |
11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration. |
12 | learn.target_update_freq | int | 2000 | Frequency of target network update. | Hard (assign) update
13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation. | Enable it for some fake termination envs
14 | collect.n_sample | int | 4 | The number of training samples of a call of collector. | It varies from env to env
15 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1
16 | other.eps.type | str | exp | Exploration rate decay type | Support ['exp', 'linear'].
17 | other.eps.start | float | 0.01 | Start value of exploration rate | [0, 1]
18 | other.eps.end | float | 0.001 | End value of exploration rate | [0, 1]
19 | other.eps.decay | int | 250000 | Decay length of exploration | Greater than 0. Setting decay=250000 means the exploration rate decays from the start value to the end value over the decay length.
20 | entropy_tau | float | 0.003 | The ratio of the entropy term in the TD loss |
21 | alpha | float | 0.9 | The ratio of the Munchausen term in the TD loss |
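- Example:
A brief configuration sketch of the MDQN-specific entries in the table above (entropy_tau and alpha), combined with a few common fields; the top-level placement of these keys is an assumption rather than a copy of mdqn.py.
>>> from easydict import EasyDict
>>> mdqn_config = EasyDict(dict(
...     type='mdqn', discount_factor=0.97, nstep=1,
...     entropy_tau=0.003,  # weight of the entropy term in the TD loss
...     alpha=0.9,          # weight of the Munchausen term in the TD loss
...     learn=dict(batch_size=32, learning_rate=0.001, target_update_freq=2000),
... ))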
- _forward_learn(data: Dict[str, Any]) Dict[str, Any] [source]¶
- Overview:
Policy forward function of learn mode (training policy and updating parameters). Forward means that the policy inputs some training batch data from the replay buffer and then returns the output result, including various training information such as loss, action_gap, clip_frac, priority.
- Arguments:
data (
List[Dict[int, Any]]
): The input data used for policy forward, including a batch of training samples. For each element in the list, the key of the dict is the name of the data item and the value is the corresponding data. Usually, the value is a torch.Tensor, an np.ndarray, or their dict/list combinations. In the _forward_learn method, the data often needs to be stacked in the batch dimension first by utility functions such as default_preprocess_learn. For MDQN, each element in the list is a dict containing at least the following keys: obs, action, reward, next_obs, done. Sometimes it also contains other keys such as weight and value_gamma.
- Returns:
info_dict (
Dict[str, Any]
): The information dict that indicates the training result, which will be recorded in the text log and tensorboard; the values must be python scalars or lists of scalars. For the detailed definition of the dict, refer to the code of the _monitor_vars_learn
method.
Note
The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of them. For data types that are not supported, the main reason is usually that the corresponding model does not support them. You can implement your own model rather than use the default model. For more information, please raise an issue in the GitHub repo and we will continue to follow up.
Note
For more detailed examples, please refer to our unittest for MDQNPolicy:
ding.policy.tests.test_mdqn
.
- _init_learn() None [source]¶
- Overview:
Initialize the learn mode of policy, including related attributes and modules. For MDQN, it contains the optimizer, algorithm-specific arguments such as entropy_tau, m_alpha and nstep, and the main and target model. This method will be called in the __init__ method if the learn field is in enable_field.
Note
For the member variables that need to be saved and loaded, please refer to the _state_dict_learn and _load_state_dict_learn methods.
Note
For the member variables that need to be monitored, please refer to the _monitor_vars_learn method.
Note
If you want to set some special member variables in the _init_learn method, you’d better name them with the prefix _learn_ to avoid conflict with other modes, such as self._learn_attr1.
- _monitor_vars_learn() List[str] [source]¶
- Overview:
Return the necessary keys for logging the return dict of
self._forward_learn
. The logger module, such as text logger, tensorboard logger, will use these keys to save the corresponding data.
- Returns:
necessary_keys (
List[str]
): The list of the necessary keys to be logged.
Policy Factory¶
Please refer to ding/policy/policy_factory.py
for more details.
PolicyFactory¶
- class ding.policy.PolicyFactory[source]¶
- Overview:
Policy factory class, used to generate different policies for general purposes, such as the random action policy, which is used for initial sample collection for better exploration when
random_collect_size
> 0.
- Interfaces:
get_random_policy
- static get_random_policy(policy: Policy.collect_mode, action_space: gym.spaces.Space = None, forward_fn: Callable = None) Policy.collect_mode [source]¶
- Overview:
According to the given action space, define the forward function of the random policy, then pack it with other interfaces of the given policy, and return the final collect mode interfaces of policy.
- Arguments:
policy (
Policy.collect_mode
): The collect mode interfaces of the policy.
action_space (
gym.spaces.Space
): The action space of the environment, gym-style.
forward_fn (
Callable
): If the action space is too complex, you can define your own forward function and pass it to this function. Note that you should set action_space to None in this case.
- Returns:
random_policy (
Policy.collect_mode
): The collect mode interfaces of the random policy.
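- Example:
A usage sketch for initial random data collection; the Discrete(4) action space is an arbitrary example and policy is assumed to be an already-constructed DI-engine policy.
>>> import gym
>>> from ding.policy import PolicyFactory
>>> random_collect_policy = PolicyFactory.get_random_policy(policy.collect_mode, action_space=gym.spaces.Discrete(4))
>>> # random_collect_policy keeps the original transition processing but samples actions randomly from the given space.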
get_random_policy¶
- ding.policy.get_random_policy(cfg: EasyDict, policy: Policy.collect_mode, env: BaseEnvManager) Policy.collect_mode [source]¶
- Overview:
The entry function to get the corresponding random policy. If a policy needs special data items in a transition, it returns the policy itself; otherwise, we will use
PolicyFactory
to return a general random policy.
- Arguments:
cfg (
EasyDict
): The EasyDict-type dict configuration.
policy (
Policy.collect_mode
): The collect mode interfaces of the policy.
env (
BaseEnvManager
): The env manager instance, which is used to get the action space for random action generation.
- Returns:
random_policy (
Policy.collect_mode
): The collect mode interfaces of the random policy.
Common Utilities¶
Please refer to ding/policy/common_utils.py
for more details.
default_preprocess_learn¶
- ding.policy.default_preprocess_learn(data: List[Any], use_priority_IS_weight: bool = False, use_priority: bool = False, use_nstep: bool = False, ignore_done: bool = False) Dict[str, Tensor] [source]¶
- Overview:
Default data pre-processing in policy’s
_forward_learn
method, including stacking batch data and preprocessing ignore_done, nstep reward and priority IS weight.
- Arguments:
data (
List[Any]
): The list of training batch samples; each sample is a dict of PyTorch Tensors.
use_priority_IS_weight (
bool
): Whether to use priority IS weight correction; if True, this function will set the weight of each sample to the priority IS weight.
use_priority (
bool
): Whether to use priority; if True, this function will set the priority IS weight.
use_nstep (
bool
): Whether to use the nstep TD error; if True, this function will reshape the reward.
ignore_done (
bool
): Whether to ignore done, if True, this function will set the done to 0.
- Returns:
data (
Dict[str, torch.Tensor]
): The preprocessed dict data whose values can be directly used for the following model forward and loss computation.
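- Example:
A minimal sketch showing the documented effect of ignore_done; the 4-dim observation and the batch of 3 transitions are hypothetical.
>>> import torch
>>> from ding.policy import default_preprocess_learn
>>> sample = {'obs': torch.randn(4), 'action': torch.tensor(0), 'reward': torch.tensor(1.0), 'next_obs': torch.randn(4), 'done': torch.tensor(True)}
>>> out = default_preprocess_learn([sample for _ in range(3)], ignore_done=True)
>>> out['done']  # all zeros: termination is ignored for the target value computation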
single_env_forward_wrapper¶
- ding.policy.single_env_forward_wrapper(forward_fn: Callable) Callable [source]¶
- Overview:
Wrap policy to support gym-style interaction between policy and single environment.
- Arguments:
forward_fn (
Callable
): The original forward function of policy.
- Returns:
wrapped_forward_fn (
Callable
): The wrapped forward function of policy.
- Examples:
>>> env = gym.make('CartPole-v0')
>>> policy = DQNPolicy(...)
>>> forward_fn = single_env_forward_wrapper(policy.eval_mode.forward)
>>> obs = env.reset()
>>> action = forward_fn(obs)
>>> next_obs, rew, done, info = env.step(action)
single_env_forward_wrapper_ttorch¶
- ding.policy.single_env_forward_wrapper_ttorch(forward_fn: Callable, cuda: bool = True) Callable [source]¶
- Overview:
Wrap policy to support gym-style interaction between policy and single environment for treetensor (ttorch) data.
- Arguments:
forward_fn (
Callable
): The original forward function of policy.cuda (
bool
): Whether to use cuda in policy, if True, this function will move the input data to cuda.
- Returns:
wrapped_forward_fn (
Callable
): The wrapped forward function of policy.
- Examples:
>>> env = gym.make('CartPole-v0')
>>> policy = PPOFPolicy(...)
>>> forward_fn = single_env_forward_wrapper_ttorch(policy.eval)
>>> obs = env.reset()
>>> action = forward_fn(obs)
>>> next_obs, rew, done, info = env.step(action)