ding.rl_utils¶
a2c¶
Please refer to ding/rl_utils/a2c
for more details.
a2c_error¶
- ding.rl_utils.a2c_error(data: namedtuple) namedtuple [source]¶
- Overview:
Implementation of A2C(Advantage Actor-Critic) (arXiv:1602.01783) for discrete action space
- Arguments:
  - data (namedtuple): a2c input data with fields shown in a2c_data
- Returns:
  - a2c_loss (namedtuple): the a2c loss item, all of them are differentiable 0-dim tensors
- Shapes:
  - logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - action (torch.LongTensor): \((B, )\)
  - value (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> data = a2c_data(
>>>     logit=torch.randn(2, 3),
>>>     action=torch.randint(0, 3, (2, )),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error(data)
a2c_error_continuous¶
- ding.rl_utils.a2c_error_continuous(data: namedtuple) namedtuple [source]¶
- Overview:
Implementation of A2C(Advantage Actor-Critic) (arXiv:1602.01783) for continuous action space
- Arguments:
  - data (namedtuple): a2c input data with fields shown in a2c_data
- Returns:
  - a2c_loss (namedtuple): the a2c loss item, all of them are differentiable 0-dim tensors
- Shapes:
  - logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - action (torch.LongTensor): \((B, N)\)
  - value (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> data = a2c_data(
>>>     logit={'mu': torch.randn(2, 3), 'sigma': torch.sqrt(torch.randn(2, 3)**2)},
>>>     action=torch.randn(2, 3),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error_continuous(data)
acer¶
Please refer to ding/rl_utils/acer
for more details.
acer_policy_error¶
- ding.rl_utils.acer_policy_error(q_values: Tensor, q_retraces: Tensor, v_pred: Tensor, target_logit: Tensor, actions: Tensor, ratio: Tensor, c_clip_ratio: float = 10.0) Tuple[Tensor, Tensor] [source]¶
- Overview:
Get ACER policy loss.
- Arguments:
  - q_values (torch.Tensor): Q values
  - q_retraces (torch.Tensor): Q values calculated by the retrace method
  - v_pred (torch.Tensor): V values
  - target_pi (torch.Tensor): The new policy's probability
  - actions (torch.Tensor): The actions in the replay buffer
  - ratio (torch.Tensor): ratio of the new policy to the behavior policy
  - c_clip_ratio (float): clip value for ratio
- Returns:
  - actor_loss (torch.Tensor): policy loss from q_retrace
  - bc_loss (torch.Tensor): bias-correction policy loss
- Shapes:
  - q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim
  - q_retraces (torch.FloatTensor): \((T, B, 1)\)
  - v_pred (torch.FloatTensor): \((T, B, 1)\)
  - target_pi (torch.FloatTensor): \((T, B, N)\)
  - actions (torch.LongTensor): \((T, B)\)
  - ratio (torch.FloatTensor): \((T, B, N)\)
  - actor_loss (torch.FloatTensor): \((T, B, 1)\)
  - bc_loss (torch.FloatTensor): \((T, B, 1)\)
- Examples:
>>> q_values = torch.randn(2, 3, 4)
>>> q_retraces = torch.randn(2, 3, 1)
>>> v_pred = torch.randn(2, 3, 1)
>>> target_pi = torch.randn(2, 3, 4)
>>> actions = torch.randint(0, 4, (2, 3))
>>> ratio = torch.randn(2, 3, 4)
>>> loss = acer_policy_error(q_values, q_retraces, v_pred, target_pi, actions, ratio)
acer_value_error¶
- ding.rl_utils.acer_value_error(q_values, q_retraces, actions)[source]¶
- Overview:
Get ACER critic loss.
- Arguments:
  - q_values (torch.Tensor): Q values
  - q_retraces (torch.Tensor): Q values calculated by the retrace method
  - actions (torch.Tensor): The actions in the replay buffer
  - ratio (torch.Tensor): ratio of the new policy to the behavior policy
- Returns:
  - critic_loss (torch.Tensor): critic loss
- Shapes:
  - q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim
  - q_retraces (torch.FloatTensor): \((T, B, 1)\)
  - actions (torch.LongTensor): \((T, B)\)
  - critic_loss (torch.FloatTensor): \((T, B, 1)\)
- Examples:
>>> q_values = torch.randn(2, 3, 4)
>>> q_retraces = torch.randn(2, 3, 1)
>>> actions = torch.randint(0, 4, (2, 3))
>>> loss = acer_value_error(q_values, q_retraces, actions)
acer_trust_region_update¶
- ding.rl_utils.acer_trust_region_update(actor_gradients: List[Tensor], target_logit: Tensor, avg_logit: Tensor, trust_region_value: float) List[Tensor] [source]¶
- Overview:
Calculate gradients with the trust region constraint.
- Arguments:
  - actor_gradients (list(torch.Tensor)): gradient values for the different parts
  - target_pi (torch.Tensor): The new policy's probability
  - avg_pi (torch.Tensor): The average policy's probability
  - trust_region_value (float): the range of the trust region
- Returns:
  - update_gradients (list(torch.Tensor)): gradients with the trust region constraint
- Shapes:
  - target_pi (torch.FloatTensor): \((T, B, N)\)
  - avg_pi (torch.FloatTensor): \((T, B, N)\)
  - update_gradients (list(torch.FloatTensor)): \((T, B, N)\)
- Examples:
>>> actor_gradients = [torch.randn(2, 3, 4)]
>>> target_pi = torch.randn(2, 3, 4)
>>> avg_pi = torch.randn(2, 3, 4)
>>> update_gradients = acer_trust_region_update(actor_gradients, target_pi, avg_pi, 0.1)
adder¶
Please refer to ding/rl_utils/adder
for more details.
Adder¶
- class ding.rl_utils.adder.Adder[source]¶
- Overview:
Adder is a component that handles different transformations and calculations for transitions in the Collector Module (data generation and processing), such as GAE, n-step return, transition sampling, etc.
- Interface:
__init__, get_gae, get_gae_with_default_last_value, get_nstep_return_data, get_train_sample
- classmethod _get_null_transition(template: dict, null_transition: dict | None = None) dict [source]¶
- Overview:
Get null transition for padding. If cls._null_transition is None, return the input template instead.
- Arguments:
  - template (dict): The template for the null transition.
  - null_transition (Optional[dict]): Dict type null transition, used in null_padding
- Returns:
  - null_transition (dict): The deepcopied null transition.
- classmethod get_gae(data: List[Dict[str, Any]], last_value: Tensor, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]] [source]¶
- Overview:
Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for the calculation.
- Arguments:
  - data (list): Transitions list, each element is a transition dict with at least ['value', 'reward'].
  - last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)
  - gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.
  - gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
  - cuda (bool): Whether to use cuda in GAE computation
- Returns:
  - data (list): transitions list like the input one, but each element owns an extra advantage key 'adv'
- Examples:
>>> B, T = 2, 3  # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)
- classmethod get_gae_with_default_last_value(data: deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]] [source]¶
- Overview:
Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case where last_value is not passed. If the transition is not done yet, it would assign the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would make last_value equal to 0.
- Arguments:
  - data (deque): Transitions list, each element is a transition dict with at least ['value', 'reward']
  - done (bool): Whether the transition reaches the end of an episode (i.e. whether the env is done)
  - gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.
  - gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
  - cuda (bool): Whether to use cuda in GAE computation
- Returns:
  - data (List[Dict[str, Any]]): transitions list like the input one, but each element owns an extra advantage key 'adv'
- Examples:
>>> B, T = 2, 3  # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)
- classmethod get_nstep_return_data(data: deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) deque [source]¶
- Overview:
Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each data dict element.
- Arguments:
  - data (deque): Transitions list, each element is a transition dict
  - nstep (int): Number of steps. If it equals 1, return data directly; otherwise update with the nstep value.
- Returns:
  - data (deque): Transitions list like the input one, but each element updated with the nstep value.
- Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)
- classmethod get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: dict | None = None) List[Dict[str, Any]] [source]¶
- Overview:
Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each data dict element. If unroll_len equals 1, which means no processing is needed, data is returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data (see the usage sketch below).
- Arguments:
  - data (List[Dict[str, Any]]): Transitions list, each element is a transition dict
  - unroll_len (int): The unroll length used in learner training
  - last_fn_type (str): The method type name for dealing with the last residual data in a trajectory after splitting, should be in ['last', 'drop', 'null_padding']
  - null_transition (Optional[dict]): Dict type null transition, used in null_padding
- Returns:
  - data (List[Dict[str, Any]]): Transitions list processed after unrolling
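The following is a minimal usage sketch (not taken from the source docstring), assuming each transition is a plain dict; with last_fn_type='null_padding' the residual tail of the trajectory is padded with null transitions:
>>> import torch
>>> from ding.rl_utils.adder import Adder
>>> data = [dict(obs=torch.randn(4), reward=torch.randn(1), done=False) for _ in range(10)]
>>> samples = Adder.get_train_sample(data, unroll_len=4, last_fn_type='null_padding')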
get_gae¶
- ding.rl_utils.adder.get_gae(data: List[Dict[str, Any]], last_value: Tensor, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]] ¶
- Overview:
Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for the calculation.
- Arguments:
  - data (list): Transitions list, each element is a transition dict with at least ['value', 'reward'].
  - last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)
  - gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.
  - gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
  - cuda (bool): Whether to use cuda in GAE computation
- Returns:
  - data (list): transitions list like the input one, but each element owns an extra advantage key 'adv'
- Examples:
>>> B, T = 2, 3  # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)
get_gae_with_default_last_value¶
- ding.rl_utils.adder.get_gae_with_default_last_value(data: deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]] ¶
- Overview:
Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case where last_value is not passed. If the transition is not done yet, it would assign the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would make last_value equal to 0.
- Arguments:
  - data (deque): Transitions list, each element is a transition dict with at least ['value', 'reward']
  - done (bool): Whether the transition reaches the end of an episode (i.e. whether the env is done)
  - gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.
  - gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
  - cuda (bool): Whether to use cuda in GAE computation
- Returns:
  - data (List[Dict[str, Any]]): transitions list like the input one, but each element owns an extra advantage key 'adv'
- Examples:
>>> B, T = 2, 3  # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)
get_nstep_return_data¶
- ding.rl_utils.adder.get_nstep_return_data(data: deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) deque ¶
- Overview:
Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each data dict element.
- Arguments:
  - data (deque): Transitions list, each element is a transition dict
  - nstep (int): Number of steps. If it equals 1, return data directly; otherwise update with the nstep value.
- Returns:
  - data (deque): Transitions list like the input one, but each element updated with the nstep value.
- Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)
get_train_sample¶
- ding.rl_utils.adder.get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: dict | None = None) List[Dict[str, Any]] ¶
- Overview:
Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each data dict element. If unroll_len equals 1, which means no processing is needed, data is returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data (see the usage sketch below).
- Arguments:
  - data (List[Dict[str, Any]]): Transitions list, each element is a transition dict
  - unroll_len (int): The unroll length used in learner training
  - last_fn_type (str): The method type name for dealing with the last residual data in a trajectory after splitting, should be in ['last', 'drop', 'null_padding']
  - null_transition (Optional[dict]): Dict type null transition, used in null_padding
- Returns:
  - data (List[Dict[str, Any]]): Transitions list processed after unrolling
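As above, a minimal usage sketch for the module-level function (not taken from the source docstring), assuming each transition is a plain dict:
>>> import torch
>>> from ding.rl_utils.adder import get_train_sample
>>> data = [dict(obs=torch.randn(4), reward=torch.randn(1), done=False) for _ in range(10)]
>>> samples = get_train_sample(data, unroll_len=4, last_fn_type='null_padding')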
beta_function¶
Please refer to ding/rl_utils/beta_function
for more details.
cpw¶
- ding.rl_utils.beta_function.cpw(x: Tensor | float, eta: float = 0.71) Tensor | float [source]¶
- Overview:
The implementation of CPW function.
- Arguments:
  - x (Union[torch.Tensor, float]): The input value.
  - eta (float): The hyperparameter of the CPW function.
- Returns:
  - output (Union[torch.Tensor, float]): The output value.
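A minimal usage sketch (illustrative, not from the source docstring), applying cpw to a batch of values in [0, 1]:
>>> import torch
>>> from ding.rl_utils.beta_function import cpw
>>> x = torch.rand(8)      # e.g. sampled quantile fractions in [0, 1]
>>> y = cpw(x, eta=0.71)   # distorted values, same shape as x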
CVaR¶
- ding.rl_utils.beta_function.CVaR(x: Tensor | float, eta: float = 0.71) Tensor | float [source]¶
- Overview:
The implementation of CVaR function, which is a risk-averse function.
- Arguments:
  - x (Union[torch.Tensor, float]): The input value.
  - eta (float): The hyperparameter of the CVaR function.
- Returns:
  - output (Union[torch.Tensor, float]): The output value.
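A minimal usage sketch (illustrative, not from the source docstring), with eta controlling the degree of risk aversion:
>>> import torch
>>> from ding.rl_utils.beta_function import CVaR
>>> x = torch.rand(8)     # quantile fractions in [0, 1]
>>> y = CVaR(x, eta=0.3)  # risk-averse distortion of the quantile fractions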
beta_function_map¶
- rl_utils.beta_function_map = {'CPW': <function cpw>, 'CVaR': <function CVaR>, 'Pow': <function Pow>, 'uniform': <function <lambda>>}¶
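The map is a plain dict from a name to the corresponding beta function, so a configured choice can be looked up and called directly. A minimal sketch (assuming the map is importable from ding.rl_utils.beta_function):
>>> import torch
>>> from ding.rl_utils.beta_function import beta_function_map
>>> beta_fn = beta_function_map['CVaR']
>>> y = beta_fn(torch.rand(8), 0.3)  # same as calling CVaR directly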
coma¶
Please refer to ding/rl_utils/coma
for more details.
coma_error¶
- ding.rl_utils.coma_error(data: namedtuple, gamma: float, lambda_: float) namedtuple [source]¶
- Overview:
Implementation of COMA
- Arguments:
  - data (namedtuple): coma input data with fields shown in coma_data
- Returns:
  - coma_loss (namedtuple): the coma loss item, all of them are differentiable 0-dim tensors
- Shapes:
  - logit (torch.FloatTensor): \((T, B, A, N)\), where B is batch size, A is the agent num, and N is action dim
  - action (torch.LongTensor): \((T, B, A)\)
  - q_value (torch.FloatTensor): \((T, B, A, N)\)
  - target_q_value (torch.FloatTensor): \((T, B, A, N)\)
  - reward (torch.FloatTensor): \((T, B)\)
  - weight (torch.FloatTensor or None): \((T, B, A)\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> agent_num = 3
>>> data = coma_data(
>>>     logit=torch.randn(2, 3, agent_num, action_dim),
>>>     action=torch.randint(0, action_dim, (2, 3, agent_num)),
>>>     q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     target_q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     reward=torch.randn(2, 3),
>>>     weight=torch.ones(2, 3, agent_num),
>>> )
>>> loss = coma_error(data, 0.99, 0.99)
exploration¶
Please refer to ding/rl_utils/exploration
for more details.
get_epsilon_greedy_fn¶
- ding.rl_utils.exploration.get_epsilon_greedy_fn(start: float, end: float, decay: int, type_: str = 'exp') Callable [source]¶
- Overview:
Generate an epsilon_greedy function with decay, which inputs current timestep and outputs current epsilon.
- Arguments:
  - start (float): Epsilon start value. For linear, it should be 1.0.
  - end (float): Epsilon end value.
  - decay (int): Controls the speed at which epsilon decreases from start to end. We recommend decaying epsilon according to env step rather than iteration.
  - type_ (str): How epsilon decays, now supports ['linear', 'exp' (exponential)].
- Returns:
  - eps_fn (function): The epsilon greedy function with decay.
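A minimal usage sketch (not from the source docstring): build the decay function once, then query it with the current env step each time an action is collected:
>>> from ding.rl_utils.exploration import get_epsilon_greedy_fn
>>> eps_fn = get_epsilon_greedy_fn(start=0.95, end=0.05, decay=10000, type_='exp')
>>> eps = eps_fn(0)       # epsilon at env step 0
>>> eps = eps_fn(100000)  # epsilon after many env steps (close to end)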
BaseNoise¶
- class ding.rl_utils.exploration.BaseNoise[source]¶
- Overview:
Base class for action noise
- Interface:
__init__, __call__
- Examples:
>>> noise_generator = OUNoise()  # init one type of noise
>>> noise = noise_generator(action.shape, action.device)  # generate noise
- abstract __call__(shape: tuple, device: str) Tensor [source]¶
- Overview:
Generate noise according to action tensor’s shape, device.
- Arguments:
  - shape (tuple): size of the action tensor, output noise's size should be the same.
  - device (str): device of the action tensor, output noise's device should be the same as it.
- Returns:
  - noise (torch.Tensor): generated action noise, with the same shape and device as the input action tensor.
GaussianNoise¶
- class ding.rl_utils.exploration.GaussianNoise(mu: float = 0.0, sigma: float = 1.0)[source]¶
- Overview:
Derived class for generating gaussian noise, which satisfies \(X \sim N(\mu, \sigma^2)\)
- Interface:
__init__, __call__
- __call__(shape: tuple, device: str) Tensor [source]¶
- Overview:
Generate gaussian noise according to action tensor’s shape, device
- Arguments:
  - shape (tuple): size of the action tensor, output noise's size should be the same
  - device (str): device of the action tensor, output noise's device should be the same as it
- Returns:
  - noise (torch.Tensor): generated action noise, with the same shape and device as the input action tensor
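A minimal usage sketch (not from the source docstring), drawing a noise tensor with the same shape as a batch of actions:
>>> from ding.rl_utils.exploration import GaussianNoise
>>> noise_generator = GaussianNoise(mu=0.0, sigma=0.1)
>>> noise = noise_generator((4, 2), 'cpu')  # noise tensor of shape (4, 2) on CPU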
OUNoise¶
- class ding.rl_utils.exploration.OUNoise(mu: float = 0.0, sigma: float = 0.3, theta: float = 0.15, dt: float = 0.01, x0: float | Tensor | None = 0.0)[source]¶
- Overview:
Derived class for generating Ornstein-Uhlenbeck process noise. Satisfies \(dx_t=\theta(\mu-x_t)dt + \sigma dW_t\), where \(W_t\) denotes the Wiener process, acting as a random perturbation term.
- Interface:
__init__, reset, __call__
- __call__(shape: tuple, device: str, mu: float | None = None) Tensor [source]¶
- Overview:
Generate OU process noise according to the action tensor's shape and device.
- Arguments:
  - shape (tuple): The size of the action tensor, output noise's size should be the same.
  - device (str): The device of the action tensor, output noise's device should be the same as it.
  - mu (float): The new mean value \(\mu\), you can set it to None if you don't need it.
- Returns:
  - noise (torch.Tensor): generated action noise, with the same shape and device as the input action tensor.
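A minimal usage sketch (not from the source docstring); since the OU process is stateful across calls, reset is assumed here to clear that internal state between episodes:
>>> from ding.rl_utils.exploration import OUNoise
>>> noise_generator = OUNoise(mu=0.0, sigma=0.3, theta=0.15)
>>> noise = noise_generator((4, 2), 'cpu')  # temporally correlated exploration noise, shape (4, 2)
>>> noise_generator.reset()                 # assumed: clear the internal OU state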
- __init__(mu: float = 0.0, sigma: float = 0.3, theta: float = 0.15, dt: float = 0.01, x0: float | Tensor | None = 0.0) None [source]¶
- Overview:
Initialize _alpha \(= \theta \cdot dt\) and beta \(= \sigma \cdot \sqrt{dt}\) in the Ornstein-Uhlenbeck process.
- Arguments:
  - mu (float): \(\mu\), the mean value.
  - sigma (float): \(\sigma\), the standard deviation of the perturbation noise.
  - theta (float): How strongly the noise reacts to perturbations; a greater value means a stronger reaction.
  - dt (float): The derivative of time t.
  - x0 (Union[float, torch.Tensor]): The initial state of the noise, should be a scalar or a tensor with the same shape as the action tensor.
create_noise_generator¶
- ding.rl_utils.exploration.create_noise_generator(noise_type: str, noise_kwargs: dict) BaseNoise [source]¶
- Overview:
Given the key (noise_type), create a new noise generator instance if it is in noise_mapping's values, or raise a KeyError. In other words, a derived noise generator must first be registered, then create_noise_generator can be called to get the instance object.
- Arguments:
  - noise_type (str): the type of noise generator to be created.
- Returns:
  - noise (BaseNoise): the created new noise generator, should be an instance of one of noise_mapping's values.
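A minimal usage sketch (not from the source docstring), assuming the Gaussian generator is registered in noise_mapping under the key 'gauss' and that noise_kwargs is forwarded to its constructor:
>>> from ding.rl_utils.exploration import create_noise_generator
>>> noise_generator = create_noise_generator('gauss', {'mu': 0.0, 'sigma': 0.1})
>>> noise = noise_generator((4, 2), 'cpu')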
gae¶
Please refer to ding/rl_utils/gae
for more details.
gae_data¶
- class ding.rl_utils.gae.gae_data(value, next_value, reward, done, traj_flag)¶
shape_fn_gae¶
gae¶
- ding.rl_utils.gae.gae(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.97) FloatTensor [source]¶
- Overview:
Implementation of Generalized Advantage Estimator (arXiv:1506.02438)
- Arguments:
  - data (namedtuple): gae input data with fields ['value', 'reward'], which contains some episodes or trajectories data.
  - gamma (float): the future discount factor, should be in [0, 1], defaults to 0.99.
  - lambda (float): the gae parameter lambda, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
- Returns:
  - adv (torch.FloatTensor): the calculated advantage
- Shapes:
  - value (torch.FloatTensor): \((T, B)\), where T is trajectory length and B is batch size
  - next_value (torch.FloatTensor): \((T, B)\)
  - reward (torch.FloatTensor): \((T, B)\)
  - adv (torch.FloatTensor): \((T, B)\)
- Examples:
>>> value = torch.randn(2, 3)
>>> next_value = torch.randn(2, 3)
>>> reward = torch.randn(2, 3)
>>> data = gae_data(value, next_value, reward, None, None)
>>> adv = gae(data)
isw¶
Please refer to ding/rl_utils/isw
for more details.
compute_importance_weights¶
- ding.rl_utils.isw.compute_importance_weights(target_output: Tensor | dict, behaviour_output: Tensor | dict, action: Tensor, action_space_type: str = 'discrete', requires_grad: bool = False)[source]¶
- Overview:
Computing importance sampling weight with given output and action
- Arguments:
  - target_output (Union[torch.Tensor, dict]): the output taking the action by the current policy network; usually this output is the network output logit if the action space is discrete, or a dict containing parameters of the action distribution if the action space is continuous.
  - behaviour_output (Union[torch.Tensor, dict]): the output taking the action by the behaviour policy network; usually this output is the network output logit if the action space is discrete, or a dict containing parameters of the action distribution if the action space is continuous.
  - action (torch.Tensor): the chosen action (index for the discrete action space) in the trajectory, i.e. behaviour_action
  - action_space_type (str): action space type, in ['discrete', 'continuous']
  - requires_grad (bool): whether grad computation is required
- Returns:
  - rhos (torch.Tensor): Importance sampling weight
- Shapes:
  - target_output (Union[torch.FloatTensor, dict]): \((T, B, N)\), where T is timestep, B is batch size and N is action dim
  - behaviour_output (Union[torch.FloatTensor, dict]): \((T, B, N)\)
  - action (torch.LongTensor): \((T, B)\)
  - rhos (torch.FloatTensor): \((T, B)\)
- Examples:
>>> target_output = torch.randn(2, 3, 4)
>>> behaviour_output = torch.randn(2, 3, 4)
>>> action = torch.randint(0, 4, (2, 3))
>>> rhos = compute_importance_weights(target_output, behaviour_output, action)
ppg¶
Please refer to ding/rl_utils/ppg
for more details.
ppg_data¶
- class ding.rl_utils.ppg.ppg_data(logit_new, logit_old, action, value_new, value_old, return_, weight)¶
ppg_joint_loss¶
- class ding.rl_utils.ppg.ppg_joint_loss(auxiliary_loss, behavioral_cloning_loss)¶
ppg_joint_error¶
- ding.rl_utils.ppg.ppg_joint_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) Tuple[namedtuple, namedtuple] [source]¶
- Overview:
Get PPG joint loss
- Arguments:
  - data (namedtuple): ppg input data with fields shown in ppg_data
  - clip_ratio (float): clip value for ratio
  - use_value_clip (bool): whether to use value clip
- Returns:
  - ppg_joint_loss (namedtuple): the ppg loss item, all of them are differentiable 0-dim tensors
- Shapes:
  - logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - logit_old (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B,)\)
  - value_new (torch.FloatTensor): \((B, 1)\)
  - value_old (torch.FloatTensor): \((B, 1)\)
  - return (torch.FloatTensor): \((B, 1)\)
  - weight (torch.FloatTensor): \((B,)\)
  - auxiliary_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - behavioral_cloning_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppg_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3, 1),
>>>     value_old=torch.randn(3, 1),
>>>     return_=torch.randn(3, 1),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppg_joint_error(data, 0.99, 0.99)
ppo¶
Please refer to ding/rl_utils/ppo
for more details.
ppo_data¶
- class ding.rl_utils.ppo.ppo_data(logit_new, logit_old, action, value_new, value_old, adv, return_, weight)¶
ppo_policy_data¶
- class ding.rl_utils.ppo.ppo_policy_data(logit_new, logit_old, action, adv, weight)¶
ppo_value_data¶
- class ding.rl_utils.ppo.ppo_value_data(value_new, value_old, return_, weight)
ppo_loss¶
- class ding.rl_utils.ppo.ppo_loss(policy_loss, value_loss, entropy_loss)¶
ppo_policy_loss¶
- class ding.rl_utils.ppo.ppo_policy_loss(policy_loss, entropy_loss)¶
ppo_info¶
- class ding.rl_utils.ppo.ppo_info(approx_kl, clipfrac)¶
shape_fn_ppo¶
ppo_error¶
- ding.rl_utils.ppo.ppo_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: float | None = None) Tuple[namedtuple, namedtuple] [source]¶
- Overview:
Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip
- Arguments:
  - data (namedtuple): the ppo input data with fields shown in ppo_data
  - clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2
  - use_value_clip (bool): whether to use clip in the value loss with the same ratio as the policy
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don't want to use it, set this parameter to None
- Returns:
  - ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors
  - ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalars
- Shapes:
  - logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - logit_old (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - value_new (torch.FloatTensor): \((B, )\)
  - value_old (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppo_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error(data)
Note
adv is an already normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to calculate this mean and std, e.g. across the data buffer or the training batch, so we don't couple this part into ppo_error; you can refer to our examples for the different ways.
ppo_policy_error¶
- ding.rl_utils.ppo.ppo_policy_error(data: namedtuple, clip_ratio: float = 0.2, dual_clip: float | None = None) Tuple[namedtuple, namedtuple] [source]
- Overview:
Get PPO policy loss
- Arguments:
  - data (namedtuple): ppo input data with fields shown in ppo_policy_data
  - clip_ratio (float): clip value for ratio
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don't want to use it, set this parameter to None
- Returns:
  - ppo_policy_loss (namedtuple): the ppo policy loss item, all of them are differentiable 0-dim tensors
  - ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalars
- Shapes:
  - logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - logit_old (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppo_policy_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error(data)
ppo_value_error¶
- ding.rl_utils.ppo.ppo_value_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) Tensor [source]
- Overview:
Get PPO value loss
- Arguments:
  - data (namedtuple): ppo input data with fields shown in ppo_value_data
  - clip_ratio (float): clip value for ratio
  - use_value_clip (bool): whether to use value clip
- Returns:
  - value_loss (torch.FloatTensor): the ppo value loss item, which is a differentiable 0-dim tensor
- Shapes:
  - value_new (torch.FloatTensor): \((B, )\), where B is batch size
  - value_old (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - value_loss (torch.FloatTensor): \(()\), 0-dim tensor
- Examples:
>>> action_dim = 4
>>> data = ppo_value_data(
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppo_value_error(data)
ppo_error_continuous¶
- ding.rl_utils.ppo.ppo_error_continuous(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: float | None = None) Tuple[namedtuple, namedtuple] [source]¶
- Overview:
Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip
- Arguments:
  - data (namedtuple): the ppo input data with fields shown in ppo_data
  - clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2
  - use_value_clip (bool): whether to use clip in the value loss with the same ratio as the policy
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don't want to use it, set this parameter to None
- Returns:
  - ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors
  - ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalars
- Shapes:
  - mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim
  - mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim
  - action (torch.LongTensor): \((B, )\)
  - value_new (torch.FloatTensor): \((B, )\)
  - value_old (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppo_data_continuous(
>>>     mu_sigma_new=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error_continuous(data)
Note
adv is an already normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to calculate this mean and std, e.g. across the data buffer or the training batch, so we don't couple this part into ppo_error_continuous; you can refer to our examples for the different ways.
ppo_policy_error_continuous¶
- ding.rl_utils.ppo.ppo_policy_error_continuous(data: namedtuple, clip_ratio: float = 0.2, dual_clip: float | None = None) Tuple[namedtuple, namedtuple] [source]¶
- Overview:
Implementation of Proximal Policy Optimization (arXiv:1707.06347) with dual_clip
- Arguments:
  - data (namedtuple): the ppo input data with fields shown in ppo_data
  - clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don't want to use it, set this parameter to None
- Returns:
  - ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors
  - ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalars
- Shapes:
  - mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim
  - mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim
  - action (torch.LongTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppo_policy_data_continuous(
>>>     mu_sigma_new=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error_continuous(data)
retrace¶
Please refer to ding/rl_utils/retrace
for more details.
compute_q_retraces¶
- ding.rl_utils.retrace.compute_q_retraces(q_values: Tensor, v_pred: Tensor, rewards: Tensor, actions: Tensor, weights: Tensor, ratio: Tensor, gamma: float = 0.9) Tensor [source]¶
- Shapes:
  - q_values (torch.Tensor): \((T + 1, B, N)\), where T is unroll_len, B is batch size, N is discrete action dim.
  - v_pred (torch.Tensor): \((T + 1, B, 1)\)
  - rewards (torch.Tensor): \((T, B)\)
  - actions (torch.Tensor): \((T, B)\)
  - weights (torch.Tensor): \((T, B)\)
  - ratio (torch.Tensor): \((T, B, N)\)
  - q_retraces (torch.Tensor): \((T + 1, B, 1)\)
- Examples:
>>> T = 2
>>> B = 3
>>> N = 4
>>> q_values = torch.randn(T + 1, B, N)
>>> v_pred = torch.randn(T + 1, B, 1)
>>> rewards = torch.randn(T, B)
>>> actions = torch.randint(0, N, (T, B))
>>> weights = torch.ones(T, B)
>>> ratio = torch.randn(T, B, N)
>>> q_retraces = compute_q_retraces(q_values, v_pred, rewards, actions, weights, ratio)
Note
The q_retrace operation doesn't need to compute gradients; it just executes forward computation.
sampler¶
Please refer to ding/rl_utils/sampler
for more details.
ArgmaxSampler¶
MultinomialSampler¶
MuSampler¶
ReparameterizationSampler¶
HybridStochasticSampler¶
HybridDeterminsticSampler¶
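These sampler classes are only listed by name here. As a rough, hypothetical sketch (assuming each sampler is a callable object applied to the policy output logit), the argmax sampler would be used like this:
>>> import torch
>>> from ding.rl_utils.sampler import ArgmaxSampler
>>> logit = torch.randn(4, 6)        # (B, N) discrete policy logit
>>> action = ArgmaxSampler()(logit)  # greedy action index per sample, shape (B, )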
td¶
Please refer to ding/rl_utils/td
for more details.
q_1step_td_data¶
- class ding.rl_utils.td.q_1step_td_data(q, next_q, act, next_act, reward, done, weight)¶
q_1step_td_error¶
- ding.rl_utils.td.q_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
1 step td_error, supporting both the single agent case and the multi agent case.
- Arguments:
  - data (q_1step_td_data): The input data, q_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - criterion (torch.nn.modules): Loss function criterion
- Returns:
  - loss (torch.Tensor): 1step td error
- Shapes:
  - data (q_1step_td_data): the q_1step_td_data containing ['q', 'next_q', 'act', 'next_act', 'reward', 'done', 'weight']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - act (torch.LongTensor): \((B, )\)
  - next_act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> action_dim = 4
>>> data = q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     next_act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)).bool(),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_1step_td_error(data, 0.99)
m_q_1step_td_data¶
- class ding.rl_utils.td.m_q_1step_td_data(q, target_q, next_q, act, reward, done, weight)¶
m_q_1step_td_error¶
- ding.rl_utils.td.m_q_1step_td_error(data: namedtuple, gamma: float, tau: float, alpha: float, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Munchausen td_error for the DQN algorithm, supporting 1 step td error.
- Arguments:
  - data (m_q_1step_td_data): The input data, m_q_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - tau (float): Entropy factor for Munchausen DQN
  - alpha (float): Discount factor for the Munchausen term
  - criterion (torch.nn.modules): Loss function criterion
- Returns:
  - loss (torch.Tensor): 1step td error, 0-dim tensor
- Shapes:
  - data (m_q_1step_td_data): the m_q_1step_td_data containing ['q', 'target_q', 'next_q', 'act', 'reward', 'done', 'weight']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - target_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> action_dim = 4
>>> data = m_q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     target_q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = m_q_1step_td_error(data, 0.99, 0.01, 0.01)
q_v_1step_td_data¶
- class ding.rl_utils.td.q_v_1step_td_data(q, v, act, reward, done, weight)¶
q_v_1step_td_error¶
- ding.rl_utils.td.q_v_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
td_error between q and v values for the SAC algorithm, supporting 1 step td error.
- Arguments:
  - data (q_v_1step_td_data): The input data, q_v_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - criterion (torch.nn.modules): Loss function criterion
- Returns:
  - loss (torch.Tensor): 1step td error, 0-dim tensor
- Shapes:
  - data (q_v_1step_td_data): the q_v_1step_td_data containing ['q', 'v', 'act', 'reward', 'done', 'weight']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - v (torch.FloatTensor): \((B, )\)
  - act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> action_dim = 4
>>> data = q_v_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     v=torch.randn(3),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_v_1step_td_error(data, 0.99)
nstep_return_data¶
- class ding.rl_utils.td.nstep_return_data(reward, next_value, done)¶
nstep_return¶
- ding.rl_utils.td.nstep_return(data: namedtuple, gamma: float | list, nstep: int, value_gamma: Tensor | None = None)[source]¶
- Overview:
Calculate the nstep return for the DQN algorithm, supporting both the single agent case and the multi agent case.
- Arguments:
  - data (nstep_return_data): The input data, nstep_return_data to calculate loss
  - gamma (float): Discount factor
  - nstep (int): nstep num
  - value_gamma (torch.Tensor): Discount factor for value
- Returns:
  - return (torch.Tensor): nstep return
- Shapes:
  - data (nstep_return_data): the nstep_return_data containing ['reward', 'next_value', 'done']
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - next_value (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
- Examples:
>>> data = nstep_return_data(
>>>     reward=torch.randn(3, 3),
>>>     next_value=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>> )
>>> loss = nstep_return(data, 0.99, 3)
dist_1step_td_data¶
- class ding.rl_utils.td.dist_1step_td_data(dist, next_dist, act, next_act, reward, done, weight)¶
dist_1step_td_error¶
- ding.rl_utils.td.dist_1step_td_error(data: namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int) Tensor [source]¶
- Overview:
1 step td_error for distributional q-learning based algorithms
- Arguments:
  - data (dist_1step_td_data): The input data, dist_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - v_min (float): The min value of the support
  - v_max (float): The max value of the support
  - n_atom (int): The num of atoms
- Returns:
  - loss (torch.Tensor): 1step td error, 0-dim tensor
- Shapes:
  - data (dist_1step_td_data): the dist_1step_td_data containing ['dist', 'next_dist', 'act', 'next_act', 'reward', 'done', 'weight']
  - dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]
  - next_dist (torch.FloatTensor): \((B, N, n_atom)\)
  - act (torch.LongTensor): \((B, )\)
  - next_act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_dist = torch.randn(4, 3, 51).abs()
>>> act = torch.randint(0, 3, (4,))
>>> next_act = torch.randint(0, 3, (4,))
>>> reward = torch.randn(4)
>>> done = torch.randint(0, 2, (4,))
>>> data = dist_1step_td_data(dist, next_dist, act, next_act, reward, done, None)
>>> loss = dist_1step_td_error(data, 0.99, -10.0, 10.0, 51)
dist_nstep_td_data¶
- ding.rl_utils.td.dist_nstep_td_data¶
alias of dist_1step_td_data
shape_fn_dntd¶
dist_nstep_td_error¶
- ding.rl_utils.td.dist_nstep_td_error(data: namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int, nstep: int = 1, value_gamma: Tensor | None = None) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error for distributional q-learning based algorithms, supporting both the single agent case and the multi agent case.
- Arguments:
  - data (dist_nstep_td_data): The input data, dist_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - nstep (int): nstep num, default set to 1
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
- Shapes:
  - data (dist_nstep_td_data): the dist_nstep_td_data containing ['dist', 'next_n_dist', 'act', 'reward', 'done', 'weight']
  - dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]
  - next_n_dist (torch.FloatTensor): \((B, N, n_atom)\)
  - act (torch.LongTensor): \((B, )\)
  - next_n_act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
- Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_n_dist = torch.randn(4, 3, 51).abs()
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> reward = torch.randn(5, 4)
>>> data = dist_nstep_td_data(dist, next_n_dist, action, next_action, reward, done, None)
>>> loss, _ = dist_nstep_td_error(data, 0.95, -10.0, 10.0, 51, 5)
v_1step_td_data¶
- class ding.rl_utils.td.v_1step_td_data(v, next_v, reward, done, weight)¶
v_1step_td_error¶
- ding.rl_utils.td.v_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
1 step td_error for distributed value based algorithms
- Arguments:
  - data (v_1step_td_data): The input data, v_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - criterion (torch.nn.modules): Loss function criterion
- Returns:
  - loss (torch.Tensor): 1step td error, 0-dim tensor
- Shapes:
  - data (v_1step_td_data): the v_1step_td_data containing ['v', 'next_v', 'reward', 'done', 'weight']
  - v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]
  - next_v (torch.FloatTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5)
>>> done = torch.zeros(5)
>>> data = v_1step_td_data(v, next_v, reward, done, None)
>>> loss, td_error_per_sample = v_1step_td_error(data, 0.99)
v_nstep_td_data¶
- class ding.rl_utils.td.v_nstep_td_data(v, next_n_v, reward, done, weight, value_gamma)¶
v_nstep_td_error¶
- ding.rl_utils.td.v_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep (n step) td_error for distributed value based algorithms
- Arguments:
  - data (v_nstep_td_data): The input data, v_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - nstep (int): nstep num, default set to 1
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
- Shapes:
  - data (v_nstep_td_data): The v_nstep_td_data containing ['v', 'next_n_v', 'reward', 'done', 'weight', 'value_gamma']
  - v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]
  - next_v (torch.FloatTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
  - value_gamma (torch.Tensor): If the remaining data in the buffer is less than n_step, we use value_gamma as the gamma discount value for next_v rather than gamma**n_step
- Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5, 5)
>>> done = torch.zeros(5)
>>> data = v_nstep_td_data(v, next_v, reward, done, 0.9, 0.99)
>>> loss, td_error_per_sample = v_nstep_td_error(data, 0.99, 5)
q_nstep_td_data¶
- class ding.rl_utils.td.q_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, weight)¶
dqfd_nstep_td_data¶
- class ding.rl_utils.td.dqfd_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, done_one_step, weight, new_n_q_one_step, next_n_action_one_step, is_expert)¶
shape_fn_qntd¶
q_nstep_td_error¶
- ding.rl_utils.td.q_nstep_td_error(data: namedtuple, gamma: float | list, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error for q-learning based algorithms
- Arguments:
  - data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data
  - value_gamma (torch.Tensor): Gamma discount value for target q_value
  - criterion (torch.nn.modules): Loss function criterion
  - nstep (int): nstep num, default set to 1
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
  - td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor
- Shapes:
  - data (q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_n_q (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - next_n_action (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - td_error_per_sample (torch.FloatTensor): \((B, )\)
- Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = q_nstep_td_error(data, 0.95, nstep=nstep)
bdq_nstep_td_error¶
- ding.rl_utils.td.bdq_nstep_td_error(data: namedtuple, gamma: float | list, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error for the BDQ algorithm, from the paper "Action Branching Architectures for Deep Reinforcement Learning" (https://arxiv.org/pdf/1711.08946). The original paper only provides the 1-step TD-error calculation method, and here we extend it to the n-step TD-error.
- Arguments:
  - data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data
  - value_gamma (torch.Tensor): Gamma discount value for target q_value
  - criterion (torch.nn.modules): Loss function criterion
  - nstep (int): nstep num, default set to 1
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
  - td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor
- Shapes:
  - data (q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
  - q (torch.FloatTensor): \((B, D, N)\) i.e. [batch_size, branch_num, action_bins_per_branch]
  - next_n_q (torch.FloatTensor): \((B, D, N)\)
  - action (torch.LongTensor): \((B, D)\)
  - next_n_action (torch.LongTensor): \((B, D)\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - td_error_per_sample (torch.FloatTensor): \((B, )\)
- Examples:
>>> action_per_branch = 3
>>> next_q = torch.randn(8, 6, action_per_branch)
>>> done = torch.randn(8)
>>> action = torch.randint(0, action_per_branch, size=(8, 6))
>>> next_action = torch.randint(0, action_per_branch, size=(8, 6))
>>> nstep = 3
>>> q = torch.randn(8, 6, action_per_branch).requires_grad_(True)
>>> reward = torch.rand(nstep, 8)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = bdq_nstep_td_error(data, 0.95, nstep=nstep)
shape_fn_qntd_rescale¶
q_nstep_td_error_with_rescale¶
- ding.rl_utils.td.q_nstep_td_error_with_rescale(data: namedtuple, gamma: float | list, nstep: int = 1, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error with value rescaling
- Arguments:
  - data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - nstep (int): nstep num, default set to 1
  - criterion (torch.nn.modules): Loss function criterion
  - trans_fn (Callable): Value transform function, defaults to value_transform (refer to rl_utils/value_rescale.py)
  - inv_trans_fn (Callable): Value inverse transform function, defaults to value_inv_transform (refer to rl_utils/value_rescale.py)
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
- Shapes:
  - data (q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_n_q (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - next_n_action (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
- Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, _ = q_nstep_td_error_with_rescale(data, 0.95, nstep=nstep)
dqfd_nstep_td_error¶
- ding.rl_utils.td.dqfd_nstep_td_error(data: namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, margin_function: float, lambda_one_step_td: float = 1.0, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep n step td_error + 1 step td_error + supervised margin loss for DQfD
- Arguments:
  - data (dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data
  - value_gamma (torch.Tensor): Gamma discount value for target q_value
  - criterion (torch.nn.modules): Loss function criterion
  - nstep (int): nstep num, default set to 10
- Returns:
  - loss (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensor
  - td_error_per_sample (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor
- Shapes:
  - data (q_nstep_td_data): the q_nstep_td_data containing ['q', 'next_n_q', 'action', 'next_n_action', 'reward', 'done', 'weight', 'new_n_q_one_step', 'next_n_action_one_step', 'is_expert']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_n_q (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - next_n_action (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - td_error_per_sample (torch.FloatTensor): \((B, )\)
  - new_n_q_one_step (torch.FloatTensor): \((B, N)\)
  - next_n_action_one_step (torch.LongTensor): \((B, )\)
  - is_expert (int): 0 or 1
- Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> loss, td_error_per_sample, loss_statistics = dqfd_nstep_td_error(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     margin_function=0.8, nstep=nstep
>>> )
dqfd_nstep_td_error_with_rescale¶
- ding.rl_utils.td.dqfd_nstep_td_error_with_rescale(data: namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, lambda_one_step_td: float, margin_function: float, nstep: int = 1, cum_reward: bool = False, value_gamma: torch.Tensor | None = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) Tensor [source]¶
- Overview:
Multistep (n step) td_error + 1 step td_error + supervised margin loss for DQfD, with value rescale
- Arguments:
data (
dqfd_nstep_td_data
): The input data, dqfd_nstep_td_data to calculate lossgamma (
float
): Discount factorcum_reward (
bool
): Whether to use cumulative nstep reward, which is figured out when collecting datavalue_gamma (
torch.Tensor
): Gamma discount value for target q_valuecriterion (
torch.nn.modules
): Loss function criterionnstep (
int
): nstep num, default set to 1
- Returns:
loss (
torch.Tensor
): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensortd_error_per_sample (
torch.Tensor
): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor
- Shapes:
data (
q_nstep_td_data
): The dqfd_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘weight’, ‘new_n_q_one_step’, ‘next_n_action_one_step’, ‘is_expert’]q (
torch.FloatTensor
): \((B, N)\) i.e. [batch_size, action_dim]next_n_q (
torch.FloatTensor
): \((B, N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timesteptd_error_per_sample (
torch.FloatTensor
): \((B, )\)new_n_q_one_step (
torch.FloatTensor
): \((B, N)\)next_n_action_one_step (
torch.LongTensor
): \((B, )\)is_expert (
int
) : 0 or 1
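- Examples:
The source docstring gives no example for this rescaled variant; the sketch below is adapted from the dqfd_nstep_td_error example above, assuming the same dqfd_nstep_td_data layout (the structure of the return value is not asserted here and follows the Returns section above).
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> # outputs contain the loss and per-sample td error described in Returns above
>>> outputs = dqfd_nstep_td_error_with_rescale(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     lambda_one_step_td=1, margin_function=0.8, nstep=nstep
>>> )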
qrdqn_nstep_td_data¶
- class ding.rl_utils.td.qrdqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, tau, weight)¶
qrdqn_nstep_td_error¶
- ding.rl_utils.td.qrdqn_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, value_gamma: Tensor | None = None) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error used in QRDQN
- Arguments:
data (
qrdqn_nstep_td_data
): The input data, qrdqn_nstep_td_data to calculate lossgamma (
float
): Discount factornstep (
int
): nstep num, default set to 1
- Returns:
loss (
torch.Tensor
): nstep td error, 0-dim tensor
- Shapes:
data (
q_nstep_td_data
): The qrdqn_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’]q (
torch.FloatTensor
): \((tau, B, N)\) i.e. [num_quantiles, batch_size, action_dim]next_n_q (
torch.FloatTensor
): \((tau', B, N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timestep
- Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = qrdqn_nstep_td_data(q, next_q, action, next_action, reward, done, 3, None)
>>> loss, td_error_per_sample = qrdqn_nstep_td_error(data, 0.95, nstep=nstep)
q_nstep_sql_td_error¶
- ding.rl_utils.td.q_nstep_sql_td_error(data: namedtuple, gamma: float, alpha: float, nstep: int = 1, cum_reward: bool = False, value_gamma: torch.Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error for soft Q-learning (SQL) based algorithms
- Arguments:
data (
q_nstep_td_data
): The input data, q_nstep_sql_td_data to calculate lossgamma (
float
): Discount factor
alpha (
float
): A parameter to weight the entropy term in the policy equation
cum_reward (
bool
): Whether to use cumulative nstep reward, which is figured out when collecting datavalue_gamma (
torch.Tensor
): Gamma discount value for target soft_q_valuecriterion (
torch.nn.modules
): Loss function criterionnstep (
int
): nstep num, default set to 1
- Returns:
loss (
torch.Tensor
): nstep td error, 0-dim tensortd_error_per_sample (
torch.Tensor
): nstep td error, 1-dim tensor
- Shapes:
data (
q_nstep_td_data
): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’]q (
torch.FloatTensor
): \((B, N)\) i.e. [batch_size, action_dim]next_n_q (
torch.FloatTensor
): \((B, N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timesteptd_error_per_sample (
torch.FloatTensor
): \((B, )\)
- Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample, record_target_v = q_nstep_sql_td_error(data, 0.95, 1.0, nstep=nstep)
iqn_nstep_td_data¶
- class ding.rl_utils.td.iqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, replay_quantiles, weight)¶
iqn_nstep_td_error¶
- ding.rl_utils.td.iqn_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, kappa: float = 1.0, value_gamma: Tensor | None = None) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error used in IQN, referenced from the paper Implicit Quantile Networks for Distributional Reinforcement Learning <https://arxiv.org/pdf/1806.06923.pdf>
- Arguments:
data (
iqn_nstep_td_data
): The input data, iqn_nstep_td_data to calculate lossgamma (
float
): Discount factornstep (
int
): nstep num, default set to 1criterion (
torch.nn.modules
): Loss function criterionbeta_function (
Callable
): The risk function
- Returns:
loss (
torch.Tensor
): nstep td error, 0-dim tensor
- Shapes:
data (
q_nstep_td_data
): The iqn_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’]q (
torch.FloatTensor
): \((tau, B, N)\) i.e. [num_quantiles, batch_size, action_dim]next_n_q (
torch.FloatTensor
): \((tau', B, N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timestep
- Examples:
>>> next_q = torch.randn(3, 4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(3, 4, 3).requires_grad_(True)
>>> replay_quantile = torch.randn([3, 4, 1])
>>> reward = torch.rand(nstep, 4)
>>> data = iqn_nstep_td_data(q, next_q, action, next_action, reward, done, replay_quantile, None)
>>> loss, td_error_per_sample = iqn_nstep_td_error(data, 0.95, nstep=nstep)
fqf_nstep_td_data¶
- class ding.rl_utils.td.fqf_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, quantiles_hats, weight)¶
fqf_nstep_td_error¶
- ding.rl_utils.td.fqf_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, kappa: float = 1.0, value_gamma: Tensor | None = None) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error used in FQF, referenced from the paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning <https://arxiv.org/pdf/1911.02140.pdf>
- Arguments:
data (
fqf_nstep_td_data
): The input data, fqf_nstep_td_data to calculate lossgamma (
float
): Discount factornstep (
int
): nstep num, default set to 1criterion (
torch.nn.modules
): Loss function criterionbeta_function (
Callable
): The risk function
- Returns:
loss (
torch.Tensor
): nstep td error, 0-dim tensor
- Shapes:
data (
q_nstep_td_data
): The fqf_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘quantiles_hats’]q (
torch.FloatTensor
): \((B, tau, N)\) i.e. [batch_size, tau, action_dim]next_n_q (
torch.FloatTensor
): \((B, tau', N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timestepquantiles_hats (
torch.FloatTensor
): \((B, tau)\)
- Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> quantiles_hats = torch.randn([4, 3])
>>> reward = torch.rand(nstep, 4)
>>> data = fqf_nstep_td_data(q, next_q, action, next_action, reward, done, quantiles_hats, None)
>>> loss, td_error_per_sample = fqf_nstep_td_error(data, 0.95, nstep=nstep)
evaluate_quantile_at_action¶
fqf_calculate_fraction_loss¶
- ding.rl_utils.td.fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)[source]¶
- Overview:
Calculate the fraction loss in FQF, referenced paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning <https://arxiv.org/pdf/1911.02140.pdf>
- Arguments:
q_tau_i (
torch.FloatTensor
): \((batch_size, num_quantiles-1, action_dim)\)q_value (
torch.FloatTensor
): \((batch_size, num_quantiles, action_dim)\)quantiles (
torch.FloatTensor
): \((batch_size, num_quantiles+1)\)actions (
torch.LongTensor
): \((batch_size, )\)
- Returns:
fraction_loss (
torch.Tensor
): fraction loss, 0-dim tensor
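- Examples:
Not from the source docstring; a minimal sketch with randomly generated tensors of the documented shapes (the batch_size, num_quantiles and action_dim values are illustrative, and the sorted random quantiles only stand in for the monotone fractions produced by the fraction proposal network).
>>> batch_size, num_quantiles, action_dim = 4, 32, 3
>>> q_tau_i = torch.randn(batch_size, num_quantiles - 1, action_dim)
>>> q_value = torch.randn(batch_size, num_quantiles, action_dim)
>>> quantiles = torch.sort(torch.rand(batch_size, num_quantiles + 1), dim=-1)[0]
>>> actions = torch.randint(0, action_dim, size=(batch_size, ))
>>> fraction_loss = fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)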
td_lambda_data¶
- class ding.rl_utils.td.td_lambda_data(value, reward, weight)¶
shape_fn_td_lambda¶
td_lambda_error¶
- ding.rl_utils.td.td_lambda_error(data: namedtuple, gamma: float = 0.9, lambda_: float = 0.8) Tensor [source]¶
- Overview:
Computing TD(lambda) loss given constant gamma and lambda. There is no special handling for the terminal state value: if some state has reached the terminal, just fill in zeros for the values and rewards beyond the terminal (including the terminal state itself, i.e. values[terminal] should also be 0).
- Arguments:
data (
namedtuple
): td_lambda input data with fields [‘value’, ‘reward’, ‘weight’]gamma (
float
): Constant discount factor gamma, should be in [0, 1], defaults to 0.9lambda (
float
): Constant lambda, should be in [0, 1], defaults to 0.8
- Returns:
loss (
torch.Tensor
): Computed MSE loss, averaged over the batch
- Shapes:
value (
torch.FloatTensor
): \((T+1, B)\), where T is trajectory length and B is batch, which is the estimation of the state value at step 0 to Treward (
torch.FloatTensor
): \((T, B)\), the returns from time step 0 to T-1weight (
torch.FloatTensor
or None): \((B, )\), the training sample weightloss (
torch.FloatTensor
): \(()\), 0-dim tensor
- Examples:
>>> T, B = 8, 4
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> loss = td_lambda_error(td_lambda_data(value, reward, None))
generalized_lambda_returns¶
- ding.rl_utils.td.generalized_lambda_returns(bootstrap_values: Tensor, rewards: Tensor, gammas: float, lambda_: float, done: Tensor | None = None) Tensor [source]¶
- Overview:
Functional equivalent to trfl.value_ops.generalized_lambda_returns (https://github.com/deepmind/trfl/blob/2c07ac22512a16715cc759f0072be43a5d12ae45/trfl/value_ops.py#L74). Passing in a number instead of a tensor makes that value constant for all samples in the batch.
- Arguments:
bootstrap_values (
torch.Tensor
orfloat
): estimation of the value at step 0 to T, of size [T_traj+1, batchsize]rewards (
torch.Tensor
): The returns from 0 to T-1, of size [T_traj, batchsize]gammas (
torch.Tensor
orfloat
): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]lambda (
torch.Tensor
orfloat
): Determining the mix of bootstrapping vs further accumulation of multistep returns at each timestep, of size [T_traj, batchsize]done (
torch.Tensor
orfloat
): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]
- Returns:
return (
torch.Tensor
): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
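- Examples:
Not from the source docstring; a minimal sketch passing scalar gamma and lambda, the documented shortcut for making them constant across all samples in the batch.
>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T + 1, B)
>>> rewards = torch.randn(T, B)
>>> return_ = generalized_lambda_returns(bootstrap_values, rewards, 0.99, 0.95)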
multistep_forward_view¶
- ding.rl_utils.td.multistep_forward_view(bootstrap_values: Tensor, rewards: Tensor, gammas: float, lambda_: float, done: Tensor | None = None) Tensor [source]¶
- Overview:
Same as trfl.sequence_ops.multistep_forward_view, which implements (12.18) in Sutton & Barto. Assuming the first dim of the input tensors corresponds to the time (trajectory) index.
Note
result[T-1] = rewards[T-1] + gammas[T-1] * bootstrap_values[T]
for t in 0 ... T-2: result[t] = rewards[t] + gammas[t] * (lambdas[t] * result[t+1] + (1 - lambdas[t]) * bootstrap_values[t+1])
- Arguments:
bootstrap_values (
torch.Tensor
): Estimation of the value at step 1 to T, of size [T_traj, batchsize]rewards (
torch.Tensor
): The returns from 0 to T-1, of size [T_traj, batchsize]gammas (
torch.Tensor
): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]lambda (
torch.Tensor
): Determining the mix of bootstrapping vs further accumulation of multistep returns at each timestep of size [T_traj, batchsize], the element for T-1 is ignored and effectively set to 0, as there is no information about future rewards.done (
torch.Tensor
orfloat
): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]
- Returns:
ret (
torch.Tensor
): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
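- Examples:
Not from the source docstring; a minimal sketch using per-step gamma and lambda tensors of the documented [T_traj, batchsize] size (the constant values 0.99 and 0.95 are illustrative).
>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T, B)
>>> rewards = torch.randn(T, B)
>>> gammas = 0.99 * torch.ones(T, B)
>>> lambda_ = 0.95 * torch.ones(T, B)
>>> ret = multistep_forward_view(bootstrap_values, rewards, gammas, lambda_)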
upgo¶
Please refer to ding/rl_utils/upgo
for more details.
upgo_returns¶
- ding.rl_utils.upgo.upgo_returns(rewards: Tensor, bootstrap_values: Tensor) Tensor [source]¶
- Overview:
Computing UPGO return targets. Also notice there is no special handling for the terminal state.
- Arguments:
rewards (
torch.Tensor
): the returns from time step 0 to T-1, of size [T_traj, batchsize]bootstrap_values (
torch.Tensor
): estimation of the state value at step 0 to T, of size [T_traj+1, batchsize]
- Returns:
ret (
torch.Tensor
): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
- Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rewards = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B).requires_grad_(True)
>>> returns = upgo_returns(rewards, bootstrap_values)
upgo_loss¶
- ding.rl_utils.upgo.upgo_loss(target_output: Tensor, rhos: Tensor, action: Tensor, rewards: Tensor, bootstrap_values: Tensor, mask=None) Tensor [source]¶
- Overview:
Computing UPGO loss given constant gamma and lambda. There is no special handling for the terminal state value: if the last state in the trajectory is terminal, just pass 0 as bootstrap_terminal_value.
- Arguments:
target_output (
torch.Tensor
): the output computed by the target policy network, of size [T_traj, batchsize, n_output]rhos (
torch.Tensor
): the importance sampling ratio, of size [T_traj, batchsize]action (
torch.Tensor
): the action taken, of size [T_traj, batchsize]rewards (
torch.Tensor
): the returns from time step 0 to T-1, of size [T_traj, batchsize]bootstrap_values (
torch.Tensor
): estimation of the state value at step 0 to T, of size [T_traj+1, batchsize]
- Returns:
loss (
torch.Tensor
): Computed importance sampled UPGO loss, averaged over the samples, of size []
- Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rhos = torch.randn(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> action = torch.randint(0, N, size=(T, B))
>>> rewards = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B)
>>> loss = upgo_loss(target_output, rhos, action, rewards, bootstrap_values)
value_rescale¶
Please refer to ding/rl_utils/value_rescale
for more details.
value_transform¶
- ding.rl_utils.value_rescale.value_transform(x: Tensor, eps: float = 0.01) Tensor [source]¶
- Overview:
A function to reduce the scale of the action-value function: \(h(x) = \operatorname{sign}(x)(\sqrt{|x| + 1} - 1) + \epsilon x\).
- Arguments:
x: (
torch.Tensor
) The input tensor to be normalized.eps: (
float
) The coefficient of the additive regularization term to ensure the inverse function is Lipschitz continuous
- Returns:
(
torch.Tensor
) Normalized tensor.
Note
Observe and Look Further: Achieving Consistent Performance on Atari (https://arxiv.org/abs/1805.11593).
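- Examples:
Not from the source docstring; a minimal sketch showing how the transform compresses large action values (the input values are illustrative).
>>> x = torch.tensor([-100., -10., 0., 10., 100.])
>>> h_x = value_transform(x)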
value_inv_transform¶
- ding.rl_utils.value_rescale.value_inv_transform(x: Tensor, eps: float = 0.01) Tensor [source]¶
- Overview:
The inverse form of value rescale: \(h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1 + 4\epsilon(|x| + 1 + \epsilon)} - 1}{2\epsilon}\right)^{2} - 1\right)\).
- Arguments:
x: (
torch.Tensor
) The input tensor to be unnormalized.eps: (
float
) The coefficient of the additive regularization term to ensure the inverse function is Lipschitz continuous
- Returns:
(
torch.Tensor
) Unnormalized tensor.
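- Examples:
Not from the source docstring; a minimal sketch round-tripping a few illustrative values through value_transform and back.
>>> x = torch.tensor([-100., -10., 0., 10., 100.])
>>> x_rec = value_inv_transform(value_transform(x))  # approximately equal to x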
symlog¶
- ding.rl_utils.value_rescale.symlog(x: Tensor) Tensor [source]¶
- Overview:
A function to normalize the targets: \(\operatorname{symlog}(x) = \operatorname{sign}(x)\ln(|x| + 1)\).
- Arguments:
x: (
torch.Tensor
) The input tensor to be normalized.
- Returns:
(
torch.Tensor
) Normalized tensor.
Note
Mastering Diverse Domains through World Models (https://arxiv.org/abs/2301.04104)
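- Examples:
Not from the source docstring; a minimal sketch (the input values are illustrative).
>>> x = torch.tensor([-100., -1., 0., 1., 100.])
>>> y = symlog(x)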
inv_symlog¶
vtrace¶
Please refer to ding/rl_utils/vtrace
for more details.
vtrace_nstep_return¶
- ding.rl_utils.vtrace.vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values, gamma=0.99, lambda_=0.95)[source]¶
- Overview:
Computation of vtrace return.
- Returns:
vtrace_return (
torch.FloatTensor
): the computed v-trace n-step return, a differentiable tensor of size \((T, B)\)
- Shapes:
clipped_rhos (
torch.FloatTensor
): \((T, B)\), where T is timestep, B is batch sizeclipped_cs (
torch.FloatTensor
): \((T, B)\)reward (
torch.FloatTensor
): \((T, B)\)bootstrap_values (
torch.FloatTensor
): \((T+1, B)\)vtrace_return (
torch.FloatTensor
): \((T, B)\)
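- Examples:
Not from the source docstring; a minimal sketch with randomly generated tensors of the documented shapes (the clipped importance weights are drawn in [0, 1] for illustration).
>>> T, B = 4, 8
>>> clipped_rhos = torch.rand(T, B)
>>> clipped_cs = torch.rand(T, B)
>>> reward = torch.rand(T, B)
>>> bootstrap_values = torch.randn(T + 1, B)
>>> ret = vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values, gamma=0.99, lambda_=0.95)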
vtrace_advantage¶
- ding.rl_utils.vtrace.vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, gamma)[source]¶
- Overview:
Computation of vtrace advantage.
- Returns:
vtrace_advantage (
torch.FloatTensor
): the computed v-trace advantage, a differentiable tensor of size \((T, B)\)
- Shapes:
clipped_pg_rhos (
torch.FloatTensor
): \((T, B)\), where T is timestep, B is batch sizereward (
torch.FloatTensor
): \((T, B)\)return (
torch.FloatTensor
): \((T, B)\)bootstrap_values (
torch.FloatTensor
): \((T, B)\)vtrace_advantage (
torch.FloatTensor
): \((T, B)\)
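- Examples:
Not from the source docstring; a minimal sketch with randomly generated tensors of the documented \((T, B)\) shapes (the gamma value is illustrative).
>>> T, B = 4, 8
>>> clipped_pg_rhos = torch.rand(T, B)
>>> reward = torch.rand(T, B)
>>> return_ = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T, B)
>>> adv = vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, 0.99)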
vtrace_data¶
- class ding.rl_utils.vtrace.vtrace_data(target_output, behaviour_output, action, value, reward, weight)¶
vtrace_loss¶
- class ding.rl_utils.vtrace.vtrace_loss(policy_loss, value_loss, entropy_loss)¶
vtrace_error_discrete_action¶
- ding.rl_utils.vtrace.vtrace_error_discrete_action(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[source]¶
- Overview:
Implementation of v-trace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for the discrete action space
- Arguments:
- data (
namedtuple
): input data with fields shown invtrace_data
target_output (
torch.Tensor
): the output taking the action by the current policy network, usually this output is network output logitbehaviour_output (
torch.Tensor
): the output taking the action by the behaviour policy network, usually this output is network output logit, which is used to produce the trajectory(collector)action (
torch.Tensor
): the chosen action(index for the discrete action space) in trajectory, i.e.: behaviour_action
gamma: (
float
): the future discount factor, defaults to 0.99lambda: (
float
): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95rho_clip_ratio (
float
): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)c_clip_ratio (
float
): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)rho_pg_clip_ratio (
float
): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage
- Returns:
trace_loss (
namedtuple
): the vtrace loss item, all of them are the differentiable 0-dim tensor
- Shapes:
target_output (
torch.FloatTensor
): \((T, B, N)\), where T is timestep, B is batch size and N is action dimbehaviour_output (
torch.FloatTensor
): \((T, B, N)\)action (
torch.LongTensor
): \((T, B)\)value (
torch.FloatTensor
): \((T+1, B)\)reward (
torch.LongTensor
): \((T, B)\)weight (
torch.LongTensor
): \((T, B)\)
- Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> behaviour_output = torch.randn(T, B, N)
>>> action = torch.randint(0, N, size=(T, B))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_discrete_action(data, rho_clip_ratio=1.1)
vtrace_error_continuous_action¶
- ding.rl_utils.vtrace.vtrace_error_continuous_action(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[source]¶
- Overview:
Implementation of v-trace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for the continuous action space
- Arguments:
- data (
namedtuple
): input data with fields shown invtrace_data
target_output (
dict{key:torch.Tensor}
): the output taking the action by the current policy network, usually this output is network output, which represents the distribution by reparameterization trick.behaviour_output (
dict{key:torch.Tensor}
): the output taking the action by the behaviour policy network, usually this output is network output logit, which represents the distribution by reparameterization trick.action (
torch.Tensor
): the chosen action (continuous action value) in trajectory, i.e.: behaviour_action
gamma: (
float
): the future discount factor, defaults to 0.99lambda: (
float
): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95rho_clip_ratio (
float
): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)c_clip_ratio (
float
): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)rho_pg_clip_ratio (
float
): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage
- Returns:
trace_loss (
namedtuple
): the vtrace loss item, all of them are the differentiable 0-dim tensor
- Shapes:
target_output (
dict{key:torch.FloatTensor}
): \((T, B, N)\), where T is timestep, B is batch size and N is action dim. The keys are usually parameters of reparameterization trick.behaviour_output (
dict{key:torch.FloatTensor}
): \((T, B, N)\)action (
torch.FloatTensor
): \((T, B, N)\)value (
torch.FloatTensor
): \((T+1, B)\)reward (
torch.LongTensor
): \((T, B)\)weight (
torch.LongTensor
): \((T, B)\)
- Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = dict(
>>>     mu=torch.randn(T, B, N).requires_grad_(True),
>>>     sigma=torch.exp(torch.randn(T, B, N).requires_grad_(True)),
>>> )
>>> behaviour_output = dict(
>>>     mu=torch.randn(T, B, N),
>>>     sigma=torch.exp(torch.randn(T, B, N)),
>>> )
>>> action = torch.randn((T, B, N))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_continuous_action(data, rho_clip_ratio=1.1)