ding.rl_utils¶
a2c¶
Please refer to ding/rl_utils/a2c
for more details.
a2c_error¶
- ding.rl_utils.a2c_error(data: namedtuple) namedtuple [source]¶
- Overview:
Implementation of A2C(Advantage Actor-Critic) (arXiv:1602.01783) for discrete action space
- Arguments:
  - data (namedtuple): a2c input data with fields shown in a2c_data
- Returns:
  - a2c_loss (namedtuple): the a2c loss item, all of them are differentiable 0-dim tensors
- Shapes:
  - logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - action (torch.LongTensor): \((B, )\)
  - value (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> data = a2c_data(
>>>     logit=torch.randn(2, 3),
>>>     action=torch.randint(0, 3, (2, )),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error(data)
a2c_error_continuous¶
- ding.rl_utils.a2c_error_continuous(data: namedtuple) namedtuple [source]¶
- Overview:
Implementation of A2C(Advantage Actor-Critic) (arXiv:1602.01783) for continuous action space
- Arguments:
  - data (namedtuple): a2c input data with fields shown in a2c_data
- Returns:
  - a2c_loss (namedtuple): the a2c loss item, all of them are differentiable 0-dim tensors
- Shapes:
  - logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - action (torch.LongTensor): \((B, N)\)
  - value (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> data = a2c_data(
>>>     logit={'mu': torch.randn(2, 3), 'sigma': torch.sqrt(torch.randn(2, 3)**2)},
>>>     action=torch.randn(2, 3),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error_continuous(data)
acer¶
Please refer to ding/rl_utils/acer
for more details.
acer_policy_error¶
- ding.rl_utils.acer_policy_error(q_values: Tensor, q_retraces: Tensor, v_pred: Tensor, target_logit: Tensor, actions: Tensor, ratio: Tensor, c_clip_ratio: float = 10.0) Tuple[Tensor, Tensor] [source]¶
- Overview:
Get ACER policy loss.
- Arguments:
  - q_values (torch.Tensor): Q values
  - q_retraces (torch.Tensor): Q values calculated by the retrace method
  - v_pred (torch.Tensor): V values
  - target_pi (torch.Tensor): The new policy's probability
  - actions (torch.Tensor): The actions in the replay buffer
  - ratio (torch.Tensor): ratio of the new policy to the behavior policy
  - c_clip_ratio (float): clip value for ratio
- Returns:
  - actor_loss (torch.Tensor): policy loss from q_retrace
  - bc_loss (torch.Tensor): bias-correction policy loss
- Shapes:
  - q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim
  - q_retraces (torch.FloatTensor): \((T, B, 1)\)
  - v_pred (torch.FloatTensor): \((T, B, 1)\)
  - target_pi (torch.FloatTensor): \((T, B, N)\)
  - actions (torch.LongTensor): \((T, B)\)
  - ratio (torch.FloatTensor): \((T, B, N)\)
  - actor_loss (torch.FloatTensor): \((T, B, 1)\)
  - bc_loss (torch.FloatTensor): \((T, B, 1)\)
- Examples:
>>> q_values = torch.randn(2, 3, 4)
>>> q_retraces = torch.randn(2, 3, 1)
>>> v_pred = torch.randn(2, 3, 1)
>>> target_pi = torch.randn(2, 3, 4)
>>> actions = torch.randint(0, 4, (2, 3))
>>> ratio = torch.randn(2, 3, 4)
>>> loss = acer_policy_error(q_values, q_retraces, v_pred, target_pi, actions, ratio)
acer_value_error¶
- ding.rl_utils.acer_value_error(q_values, q_retraces, actions)[source]¶
- Overview:
Get ACER critic loss.
- Arguments:
  - q_values (torch.Tensor): Q values
  - q_retraces (torch.Tensor): Q values calculated by the retrace method
  - actions (torch.Tensor): The actions in the replay buffer
  - ratio (torch.Tensor): ratio of the new policy to the behavior policy
- Returns:
  - critic_loss (torch.Tensor): critic loss
- Shapes:
  - q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim
  - q_retraces (torch.FloatTensor): \((T, B, 1)\)
  - actions (torch.LongTensor): \((T, B)\)
  - critic_loss (torch.FloatTensor): \((T, B, 1)\)
- Examples:
>>> q_values = torch.randn(2, 3, 4)
>>> q_retraces = torch.randn(2, 3, 1)
>>> actions = torch.randint(0, 4, (2, 3))
>>> loss = acer_value_error(q_values, q_retraces, actions)
acer_trust_region_update¶
- ding.rl_utils.acer_trust_region_update(actor_gradients: List[Tensor], target_logit: Tensor, avg_logit: Tensor, trust_region_value: float) List[Tensor] [source]¶
- Overview:
Calculate gradients with the trust region constraint.
- Arguments:
  - actor_gradients (list(torch.Tensor)): gradient values for the different parts
  - target_pi (torch.Tensor): The new policy's probability
  - avg_pi (torch.Tensor): The average policy's probability
  - trust_region_value (float): the range of the trust region
- Returns:
  - update_gradients (list(torch.Tensor)): gradients with the trust region constraint
- Shapes:
  - target_pi (torch.FloatTensor): \((T, B, N)\)
  - avg_pi (torch.FloatTensor): \((T, B, N)\)
  - update_gradients (list(torch.FloatTensor)): \((T, B, N)\)
- Examples:
>>> actor_gradients = [torch.randn(2, 3, 4)]
>>> target_pi = torch.randn(2, 3, 4)
>>> avg_pi = torch.randn(2, 3, 4)
>>> update_gradients = acer_trust_region_update(actor_gradients, target_pi, avg_pi, 0.1)
adder¶
Please refer to ding/rl_utils/adder
for more details.
Adder¶
- class ding.rl_utils.adder.Adder[source]¶
- Overview:
Adder is a component that handles different transformations and calculations for transitions in the Collector Module (data generation and processing), such as GAE, n-step return, transition sampling, etc.
- Interface:
__init__, get_gae, get_gae_with_default_last_value, get_nstep_return_data, get_train_sample
- classmethod _get_null_transition(template: dict, null_transition: dict | None = None) dict [source]¶
- Overview:
Get null transition for padding. If cls._null_transition is None, return the input template instead.
- Arguments:
  - template (dict): The template for the null transition.
  - null_transition (Optional[dict]): Dict type null transition, used in null_padding
- Returns:
  - null_transition (dict): The deepcopied null transition.
- classmethod get_gae(data: List[Dict[str, Any]], last_value: Tensor, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]] [source]¶
- Overview:
Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for the calculation.
- Arguments:
  - data (list): Transitions list, each element is a transition dict with at least ['value', 'reward'].
  - last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)
  - gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.
  - gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
  - cuda (bool): Whether to use cuda in GAE computation
- Returns:
  - data (list): transitions list like the input one, but each element owns an extra advantage key 'adv'
- Examples:
>>> B, T = 2, 3  # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)
- classmethod get_gae_with_default_last_value(data: deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]] [source]¶
- Overview:
Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case where last_value is not passed. If the transition is not done yet, it would assign the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would make last_value equal to 0.
- Arguments:
  - data (deque): Transitions list, each element is a transition dict with at least ['value', 'reward']
  - done (bool): Whether the transition reaches the end of an episode (i.e. whether the env is done)
  - gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.
  - gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
  - cuda (bool): Whether to use cuda in GAE computation
- Returns:
  - data (List[Dict[str, Any]]): transitions list like the input one, but each element owns an extra advantage key 'adv'
- Examples:
>>> B, T = 2, 3  # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)
- classmethod get_nstep_return_data(data: deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) deque [source]¶
- Overview:
Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each data dict element.
- Arguments:
  - data (deque): Transitions list, each element is a transition dict
  - nstep (int): Number of steps. If it equals 1, return data directly; otherwise update with the nstep value.
- Returns:
  - data (deque): Transitions list like the input one, but each element updated with the nstep value.
- Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)
- classmethod get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: dict | None = None) List[Dict[str, Any]] [source]¶
- Overview:
Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each data dict element. If unroll_len equals 1, which means no processing is needed, data is returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data (see the usage sketch below).
- Arguments:
  - data (List[Dict[str, Any]]): Transitions list, each element is a transition dict
  - unroll_len (int): The unroll length used in learner training
  - last_fn_type (str): The method type name for dealing with the last residual data in a trajectory after splitting, should be in ['last', 'drop', 'null_padding']
  - null_transition (Optional[dict]): Dict type null transition, used in null_padding
- Returns:
  - data (List[Dict[str, Any]]): Transitions list processed after unrolling
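The following is a minimal usage sketch (not taken from the source docstring), assuming each transition is a plain dict; with last_fn_type='null_padding' the residual tail of the trajectory is padded with null transitions:
>>> import torch
>>> from ding.rl_utils.adder import Adder
>>> data = [dict(obs=torch.randn(4), reward=torch.randn(1), done=False) for _ in range(10)]
>>> samples = Adder.get_train_sample(data, unroll_len=4, last_fn_type='null_padding')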
get_gae¶
- ding.rl_utils.adder.get_gae(data: List[Dict[str, Any]], last_value: Tensor, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]] ¶
- Overview:
Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for the calculation.
- Arguments:
  - data (list): Transitions list, each element is a transition dict with at least ['value', 'reward'].
  - last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)
  - gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.
  - gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
  - cuda (bool): Whether to use cuda in GAE computation
- Returns:
  - data (list): transitions list like the input one, but each element owns an extra advantage key 'adv'
- Examples:
>>> B, T = 2, 3  # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)
get_gae_with_default_last_value¶
- ding.rl_utils.adder.get_gae_with_default_last_value(data: deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]] ¶
- Overview:
Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case where last_value is not passed. If the transition is not done yet, it would assign the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would make last_value equal to 0.
- Arguments:
  - data (deque): Transitions list, each element is a transition dict with at least ['value', 'reward']
  - done (bool): Whether the transition reaches the end of an episode (i.e. whether the env is done)
  - gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.
  - gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
  - cuda (bool): Whether to use cuda in GAE computation
- Returns:
  - data (List[Dict[str, Any]]): transitions list like the input one, but each element owns an extra advantage key 'adv'
- Examples:
>>> B, T = 2, 3  # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)
get_nstep_return_data¶
- ding.rl_utils.adder.get_nstep_return_data(data: deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) deque ¶
- Overview:
Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each data dict element.
- Arguments:
  - data (deque): Transitions list, each element is a transition dict
  - nstep (int): Number of steps. If it equals 1, return data directly; otherwise update with the nstep value.
- Returns:
  - data (deque): Transitions list like the input one, but each element updated with the nstep value.
- Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)
get_train_sample¶
- ding.rl_utils.adder.get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: dict | None = None) List[Dict[str, Any]] ¶
- Overview:
Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each data dict element. If unroll_len equals 1, which means no processing is needed, data is returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data (see the usage sketch below).
- Arguments:
  - data (List[Dict[str, Any]]): Transitions list, each element is a transition dict
  - unroll_len (int): The unroll length used in learner training
  - last_fn_type (str): The method type name for dealing with the last residual data in a trajectory after splitting, should be in ['last', 'drop', 'null_padding']
  - null_transition (Optional[dict]): Dict type null transition, used in null_padding
- Returns:
  - data (List[Dict[str, Any]]): Transitions list processed after unrolling
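As above, a minimal usage sketch for the module-level function (not taken from the source docstring), assuming each transition is a plain dict:
>>> import torch
>>> from ding.rl_utils.adder import get_train_sample
>>> data = [dict(obs=torch.randn(4), reward=torch.randn(1), done=False) for _ in range(10)]
>>> samples = get_train_sample(data, unroll_len=4, last_fn_type='null_padding')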
beta_function¶
Please refer to ding/rl_utils/beta_function
for more details.
cpw¶
- ding.rl_utils.beta_function.cpw(x: Tensor | float, eta: float = 0.71) Tensor | float [source]¶
- Overview:
The implementation of CPW function.
- Arguments:
  - x (Union[torch.Tensor, float]): The input value.
  - eta (float): The hyperparameter of the CPW function.
- Returns:
  - output (Union[torch.Tensor, float]): The output value.
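A minimal usage sketch (illustrative, not from the source docstring), applying cpw to a batch of values in [0, 1]:
>>> import torch
>>> from ding.rl_utils.beta_function import cpw
>>> x = torch.rand(8)      # e.g. sampled quantile fractions in [0, 1]
>>> y = cpw(x, eta=0.71)   # distorted values, same shape as x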
CVaR¶
- ding.rl_utils.beta_function.CVaR(x: Tensor | float, eta: float = 0.71) Tensor | float [source]¶
- Overview:
The implementation of CVaR function, which is a risk-averse function.
- Arguments:
  - x (Union[torch.Tensor, float]): The input value.
  - eta (float): The hyperparameter of the CVaR function.
- Returns:
  - output (Union[torch.Tensor, float]): The output value.
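A minimal usage sketch (illustrative, not from the source docstring), with eta controlling the degree of risk aversion:
>>> import torch
>>> from ding.rl_utils.beta_function import CVaR
>>> x = torch.rand(8)     # quantile fractions in [0, 1]
>>> y = CVaR(x, eta=0.3)  # risk-averse distortion of the quantile fractions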
beta_function_map¶
- rl_utils.beta_function_map = {'CPW': <function cpw>, 'CVaR': <function CVaR>, 'Pow': <function Pow>, 'uniform': <function <lambda>>}¶
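The map is a plain dict from a name to the corresponding beta function, so a configured choice can be looked up and called directly. A minimal sketch (assuming the map is importable from ding.rl_utils.beta_function):
>>> import torch
>>> from ding.rl_utils.beta_function import beta_function_map
>>> beta_fn = beta_function_map['CVaR']
>>> y = beta_fn(torch.rand(8), 0.3)  # same as calling CVaR directly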
coma¶
Please refer to ding/rl_utils/coma
for more details.
coma_error¶
- ding.rl_utils.coma_error(data: namedtuple, gamma: float, lambda_: float) namedtuple [source]¶
- Overview:
Implementation of COMA
- Arguments:
  - data (namedtuple): coma input data with fields shown in coma_data
- Returns:
  - coma_loss (namedtuple): the coma loss item, all of them are differentiable 0-dim tensors
- Shapes:
  - logit (torch.FloatTensor): \((T, B, A, N)\), where B is batch size, A is the agent num, and N is action dim
  - action (torch.LongTensor): \((T, B, A)\)
  - q_value (torch.FloatTensor): \((T, B, A, N)\)
  - target_q_value (torch.FloatTensor): \((T, B, A, N)\)
  - reward (torch.FloatTensor): \((T, B)\)
  - weight (torch.FloatTensor or None): \((T, B, A)\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> agent_num = 3
>>> data = coma_data(
>>>     logit=torch.randn(2, 3, agent_num, action_dim),
>>>     action=torch.randint(0, action_dim, (2, 3, agent_num)),
>>>     q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     target_q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     reward=torch.randn(2, 3),
>>>     weight=torch.ones(2, 3, agent_num),
>>> )
>>> loss = coma_error(data, 0.99, 0.99)
exploration¶
Please refer to ding/rl_utils/exploration
for more details.
get_epsilon_greedy_fn¶
- ding.rl_utils.exploration.get_epsilon_greedy_fn(start: float, end: float, decay: int, type_: str = 'exp') Callable [source]¶
- Overview:
Generate an epsilon_greedy function with decay, which inputs current timestep and outputs current epsilon.
- Arguments:
  - start (float): Epsilon start value. For linear, it should be 1.0.
  - end (float): Epsilon end value.
  - decay (int): Controls the speed at which epsilon decreases from start to end. We recommend decaying epsilon according to env step rather than iteration.
  - type_ (str): How epsilon decays, now supports ['linear', 'exp' (exponential)].
- Returns:
  - eps_fn (function): The epsilon greedy function with decay.
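A minimal usage sketch (not from the source docstring): build the decay function once, then query it with the current env step each time an action is collected:
>>> from ding.rl_utils.exploration import get_epsilon_greedy_fn
>>> eps_fn = get_epsilon_greedy_fn(start=0.95, end=0.05, decay=10000, type_='exp')
>>> eps = eps_fn(0)       # epsilon at env step 0
>>> eps = eps_fn(100000)  # epsilon after many env steps (close to end)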
BaseNoise¶
- class ding.rl_utils.exploration.BaseNoise[source]¶
- Overview:
Base class for action noise
- Interface:
__init__, __call__
- Examples:
>>> noise_generator = OUNoise()  # init one type of noise
>>> noise = noise_generator(action.shape, action.device)  # generate noise
- abstract __call__(shape: tuple, device: str) Tensor [source]¶
- Overview:
Generate noise according to action tensor’s shape, device.
- Arguments:
  - shape (tuple): size of the action tensor, output noise's size should be the same.
  - device (str): device of the action tensor, output noise's device should be the same as it.
- Returns:
  - noise (torch.Tensor): generated action noise, with the same shape and device as the input action tensor.
GaussianNoise¶
- class ding.rl_utils.exploration.GaussianNoise(mu: float = 0.0, sigma: float = 1.0)[source]¶
- Overview:
Derived class for generating gaussian noise, which satisfies \(X \sim N(\mu, \sigma^2)\)
- Interface:
__init__, __call__
- __call__(shape: tuple, device: str) Tensor [source]¶
- Overview:
Generate gaussian noise according to action tensor’s shape, device
- Arguments:
  - shape (tuple): size of the action tensor, output noise's size should be the same
  - device (str): device of the action tensor, output noise's device should be the same as it
- Returns:
  - noise (torch.Tensor): generated action noise, with the same shape and device as the input action tensor
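A minimal usage sketch (not from the source docstring), drawing a noise tensor with the same shape as a batch of actions:
>>> from ding.rl_utils.exploration import GaussianNoise
>>> noise_generator = GaussianNoise(mu=0.0, sigma=0.1)
>>> noise = noise_generator((4, 2), 'cpu')  # noise tensor of shape (4, 2) on CPU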
OUNoise¶
- class ding.rl_utils.exploration.OUNoise(mu: float = 0.0, sigma: float = 0.3, theta: float = 0.15, dt: float = 0.01, x0: float | Tensor | None = 0.0)[source]¶
- Overview:
Derived class for generating Ornstein-Uhlenbeck process noise. Satisfies \(dx_t=\theta(\mu-x_t)dt + \sigma dW_t\), where \(W_t\) denotes the Wiener process, acting as a random perturbation term.
- Interface:
__init__, reset, __call__
- __call__(shape: tuple, device: str, mu: float | None = None) Tensor [source]¶
- Overview:
Generate OU process noise according to the action tensor's shape and device.
- Arguments:
  - shape (tuple): The size of the action tensor, output noise's size should be the same.
  - device (str): The device of the action tensor, output noise's device should be the same as it.
  - mu (float): The new mean value \(\mu\), you can set it to None if you don't need it.
- Returns:
  - noise (torch.Tensor): generated action noise, with the same shape and device as the input action tensor.
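A minimal usage sketch (not from the source docstring); since the OU process is stateful across calls, reset is assumed here to clear that internal state between episodes:
>>> from ding.rl_utils.exploration import OUNoise
>>> noise_generator = OUNoise(mu=0.0, sigma=0.3, theta=0.15)
>>> noise = noise_generator((4, 2), 'cpu')  # temporally correlated exploration noise, shape (4, 2)
>>> noise_generator.reset()                 # assumed: clear the internal OU state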
- __init__(mu: float = 0.0, sigma: float = 0.3, theta: float = 0.15, dt: float = 0.01, x0: float | Tensor | None = 0.0) None [source]¶
- Overview:
Initialize _alpha \(= \theta \cdot dt\) and beta \(= \sigma \cdot \sqrt{dt}\) in the Ornstein-Uhlenbeck process.
- Arguments:
  - mu (float): \(\mu\), the mean value.
  - sigma (float): \(\sigma\), the standard deviation of the perturbation noise.
  - theta (float): How strongly the noise reacts to perturbations; a greater value means a stronger reaction.
  - dt (float): The derivative of time t.
  - x0 (Union[float, torch.Tensor]): The initial state of the noise, should be a scalar or a tensor with the same shape as the action tensor.
create_noise_generator¶
- ding.rl_utils.exploration.create_noise_generator(noise_type: str, noise_kwargs: dict) BaseNoise [source]¶
- Overview:
Given the key (noise_type), create a new noise generator instance if it is in noise_mapping's values, or raise a KeyError. In other words, a derived noise generator must first be registered, then create_noise_generator can be called to get the instance object.
- Arguments:
  - noise_type (str): the type of noise generator to be created.
- Returns:
  - noise (BaseNoise): the created new noise generator, should be an instance of one of noise_mapping's values.
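A minimal usage sketch (not from the source docstring), assuming the Gaussian generator is registered in noise_mapping under the key 'gauss' and that noise_kwargs is forwarded to its constructor:
>>> from ding.rl_utils.exploration import create_noise_generator
>>> noise_generator = create_noise_generator('gauss', {'mu': 0.0, 'sigma': 0.1})
>>> noise = noise_generator((4, 2), 'cpu')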
gae¶
Please refer to ding/rl_utils/gae
for more details.
gae_data¶
- class ding.rl_utils.gae.gae_data(value, next_value, reward, done, traj_flag)¶
shape_fn_gae¶
gae¶
- ding.rl_utils.gae.gae(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.97) FloatTensor [source]¶
- Overview:
Implementation of Generalized Advantage Estimator (arXiv:1506.02438)
- Arguments:
  - data (namedtuple): gae input data with fields ['value', 'reward'], which contains some episodes or trajectories data.
  - gamma (float): the future discount factor, should be in [0, 1], defaults to 0.99.
  - lambda (float): the gae parameter lambda, should be in [0, 1], defaults to 0.97; when lambda -> 0, it induces bias, but when lambda -> 1, it has high variance due to the sum of terms.
- Returns:
  - adv (torch.FloatTensor): the calculated advantage
- Shapes:
  - value (torch.FloatTensor): \((T, B)\), where T is trajectory length and B is batch size
  - next_value (torch.FloatTensor): \((T, B)\)
  - reward (torch.FloatTensor): \((T, B)\)
  - adv (torch.FloatTensor): \((T, B)\)
- Examples:
>>> value = torch.randn(2, 3)
>>> next_value = torch.randn(2, 3)
>>> reward = torch.randn(2, 3)
>>> data = gae_data(value, next_value, reward, None, None)
>>> adv = gae(data)
isw¶
Please refer to ding/rl_utils/isw
for more details.
compute_importance_weights¶
- ding.rl_utils.isw.compute_importance_weights(target_output: Tensor | dict, behaviour_output: Tensor | dict, action: Tensor, action_space_type: str = 'discrete', requires_grad: bool = False)[source]¶
- Overview:
Computing importance sampling weight with given output and action
- Arguments:
  - target_output (Union[torch.Tensor, dict]): the output taking the action by the current policy network; usually this output is the network output logit if the action space is discrete, or a dict containing parameters of the action distribution if the action space is continuous.
  - behaviour_output (Union[torch.Tensor, dict]): the output taking the action by the behaviour policy network; usually this output is the network output logit if the action space is discrete, or a dict containing parameters of the action distribution if the action space is continuous.
  - action (torch.Tensor): the chosen action (index for the discrete action space) in the trajectory, i.e. behaviour_action
  - action_space_type (str): action space type, in ['discrete', 'continuous']
  - requires_grad (bool): whether grad computation is required
- Returns:
  - rhos (torch.Tensor): Importance sampling weight
- Shapes:
  - target_output (Union[torch.FloatTensor, dict]): \((T, B, N)\), where T is timestep, B is batch size and N is action dim
  - behaviour_output (Union[torch.FloatTensor, dict]): \((T, B, N)\)
  - action (torch.LongTensor): \((T, B)\)
  - rhos (torch.FloatTensor): \((T, B)\)
- Examples:
>>> target_output = torch.randn(2, 3, 4)
>>> behaviour_output = torch.randn(2, 3, 4)
>>> action = torch.randint(0, 4, (2, 3))
>>> rhos = compute_importance_weights(target_output, behaviour_output, action)
ppg¶
Please refer to ding/rl_utils/ppg
for more details.
ppg_data¶
- class ding.rl_utils.ppg.ppg_data(logit_new, logit_old, action, value_new, value_old, return_, weight)¶
ppg_joint_loss¶
- class ding.rl_utils.ppg.ppg_joint_loss(auxiliary_loss, behavioral_cloning_loss)¶
ppg_joint_error¶
- ding.rl_utils.ppg.ppg_joint_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) Tuple[namedtuple, namedtuple] [source]¶
- Overview:
Get PPG joint loss
- Arguments:
  - data (namedtuple): ppg input data with fields shown in ppg_data
  - clip_ratio (float): clip value for ratio
  - use_value_clip (bool): whether to use value clip
- Returns:
  - ppg_joint_loss (namedtuple): the ppg loss item, all of them are differentiable 0-dim tensors
- Shapes:
  - logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - logit_old (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B,)\)
  - value_new (torch.FloatTensor): \((B, 1)\)
  - value_old (torch.FloatTensor): \((B, 1)\)
  - return (torch.FloatTensor): \((B, 1)\)
  - weight (torch.FloatTensor): \((B,)\)
  - auxiliary_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - behavioral_cloning_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppg_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3, 1),
>>>     value_old=torch.randn(3, 1),
>>>     return_=torch.randn(3, 1),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppg_joint_error(data, 0.99, 0.99)
ppo¶
Please refer to ding/rl_utils/ppo
for more details.
ppo_data¶
- class ding.rl_utils.ppo.ppo_data(logit_new, logit_old, action, value_new, value_old, adv, return_, weight)¶
ppo_policy_data¶
- class ding.rl_utils.ppo.ppo_policy_data(logit_new, logit_old, action, adv, weight)¶
ppo_value_data¶
- class ding.rl_utils.ppo.ppo_value_data(value_new, value_old, return_, weight)
ppo_loss¶
- class ding.rl_utils.ppo.ppo_loss(policy_loss, value_loss, entropy_loss)¶
ppo_policy_loss¶
- class ding.rl_utils.ppo.ppo_policy_loss(policy_loss, entropy_loss)¶
ppo_info¶
- class ding.rl_utils.ppo.ppo_info(approx_kl, clipfrac)¶
shape_fn_ppo¶
ppo_error¶
- ding.rl_utils.ppo.ppo_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: float | None = None) Tuple[namedtuple, namedtuple] [source]¶
- Overview:
Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip
- Arguments:
  - data (namedtuple): the ppo input data with fields shown in ppo_data
  - clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2
  - use_value_clip (bool): whether to use clip in the value loss with the same ratio as the policy
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don't want to use it, set this parameter to None
- Returns:
  - ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors
  - ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalars
- Shapes:
  - logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - logit_old (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - value_new (torch.FloatTensor): \((B, )\)
  - value_old (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppo_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error(data)
Note
adv is an already normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to calculate this mean and std, e.g. across the data buffer or the training batch, so we don't couple this part into ppo_error; you can refer to our examples for the different ways.
ppo_policy_error¶
- ding.rl_utils.ppo.ppo_policy_error(data: namedtuple, clip_ratio: float = 0.2, dual_clip: float | None = None) Tuple[namedtuple, namedtuple] [source]
- Overview:
Get PPO policy loss
- Arguments:
  - data (namedtuple): ppo input data with fields shown in ppo_policy_data
  - clip_ratio (float): clip value for ratio
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don't want to use it, set this parameter to None
- Returns:
  - ppo_policy_loss (namedtuple): the ppo policy loss item, all of them are differentiable 0-dim tensors
  - ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalars
- Shapes:
  - logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - logit_old (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppo_policy_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error(data)
ppo_value_error¶
- ding.rl_utils.ppo.ppo_value_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) Tensor [source]
- Overview:
Get PPO value loss
- Arguments:
  - data (namedtuple): ppo input data with fields shown in ppo_value_data
  - clip_ratio (float): clip value for ratio
  - use_value_clip (bool): whether to use value clip
- Returns:
  - value_loss (torch.FloatTensor): the ppo value loss item, which is a differentiable 0-dim tensor
- Shapes:
  - value_new (torch.FloatTensor): \((B, )\), where B is batch size
  - value_old (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - value_loss (torch.FloatTensor): \(()\), 0-dim tensor
- Examples:
>>> action_dim = 4
>>> data = ppo_value_data(
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppo_value_error(data)
ppo_error_continuous¶
- ding.rl_utils.ppo.ppo_error_continuous(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: float | None = None) Tuple[namedtuple, namedtuple] [source]¶
- Overview:
Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip
- Arguments:
  - data (namedtuple): the ppo input data with fields shown in ppo_data
  - clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2
  - use_value_clip (bool): whether to use clip in the value loss with the same ratio as the policy
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don't want to use it, set this parameter to None
- Returns:
  - ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors
  - ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalars
- Shapes:
  - mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim
  - mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim
  - action (torch.LongTensor): \((B, )\)
  - value_new (torch.FloatTensor): \((B, )\)
  - value_old (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppo_data_continuous(
>>>     mu_sigma_new=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error_continuous(data)
Note
adv is an already normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to calculate this mean and std, e.g. across the data buffer or the training batch, so we don't couple this part into ppo_error_continuous; you can refer to our examples for the different ways.
ppo_policy_error_continuous¶
- ding.rl_utils.ppo.ppo_policy_error_continuous(data: namedtuple, clip_ratio: float = 0.2, dual_clip: float | None = None) Tuple[namedtuple, namedtuple] [source]¶
- Overview:
Implementation of Proximal Policy Optimization (arXiv:1707.06347) with dual_clip
- Arguments:
  - data (namedtuple): the ppo input data with fields shown in ppo_data
  - clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don't want to use it, set this parameter to None
- Returns:
  - ppo_loss (namedtuple): the ppo loss item, all of them are differentiable 0-dim tensors
  - ppo_info (namedtuple): the ppo optim information for monitoring, all of them are Python scalars
- Shapes:
  - mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim
  - mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim
  - action (torch.LongTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - entropy_loss (torch.FloatTensor): \(()\)
- Examples:
>>> action_dim = 4
>>> data = ppo_policy_data_continuous(
>>>     mu_sigma_new=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error_continuous(data)
retrace¶
Please refer to ding/rl_utils/retrace
for more details.
compute_q_retraces¶
- ding.rl_utils.retrace.compute_q_retraces(q_values: Tensor, v_pred: Tensor, rewards: Tensor, actions: Tensor, weights: Tensor, ratio: Tensor, gamma: float = 0.9) Tensor [source]¶
- Shapes:
  - q_values (torch.Tensor): \((T + 1, B, N)\), where T is unroll_len, B is batch size, N is discrete action dim.
  - v_pred (torch.Tensor): \((T + 1, B, 1)\)
  - rewards (torch.Tensor): \((T, B)\)
  - actions (torch.Tensor): \((T, B)\)
  - weights (torch.Tensor): \((T, B)\)
  - ratio (torch.Tensor): \((T, B, N)\)
  - q_retraces (torch.Tensor): \((T + 1, B, 1)\)
- Examples:
>>> T = 2
>>> B = 3
>>> N = 4
>>> q_values = torch.randn(T + 1, B, N)
>>> v_pred = torch.randn(T + 1, B, 1)
>>> rewards = torch.randn(T, B)
>>> actions = torch.randint(0, N, (T, B))
>>> weights = torch.ones(T, B)
>>> ratio = torch.randn(T, B, N)
>>> q_retraces = compute_q_retraces(q_values, v_pred, rewards, actions, weights, ratio)
Note
The q_retrace operation doesn't need to compute gradients; it just executes forward computation.
sampler¶
Please refer to ding/rl_utils/sampler
for more details.
ArgmaxSampler¶
MultinomialSampler¶
MuSampler¶
ReparameterizationSampler¶
HybridStochasticSampler¶
HybridDeterminsticSampler¶
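These sampler classes are only listed by name here. As a rough, hypothetical sketch (assuming each sampler is a callable object applied to the policy output logit), the argmax sampler would be used like this:
>>> import torch
>>> from ding.rl_utils.sampler import ArgmaxSampler
>>> logit = torch.randn(4, 6)        # (B, N) discrete policy logit
>>> action = ArgmaxSampler()(logit)  # greedy action index per sample, shape (B, )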
td¶
Please refer to ding/rl_utils/td
for more details.
q_1step_td_data¶
- class ding.rl_utils.td.q_1step_td_data(q, next_q, act, next_act, reward, done, weight)¶
q_1step_td_error¶
- ding.rl_utils.td.q_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
1 step td_error, supporting both the single agent case and the multi agent case.
- Arguments:
  - data (q_1step_td_data): The input data, q_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - criterion (torch.nn.modules): Loss function criterion
- Returns:
  - loss (torch.Tensor): 1step td error
- Shapes:
  - data (q_1step_td_data): the q_1step_td_data containing ['q', 'next_q', 'act', 'next_act', 'reward', 'done', 'weight']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - act (torch.LongTensor): \((B, )\)
  - next_act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> action_dim = 4
>>> data = q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     next_act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)).bool(),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_1step_td_error(data, 0.99)
m_q_1step_td_data¶
- class ding.rl_utils.td.m_q_1step_td_data(q, target_q, next_q, act, reward, done, weight)¶
m_q_1step_td_error¶
- ding.rl_utils.td.m_q_1step_td_error(data: namedtuple, gamma: float, tau: float, alpha: float, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Munchausen td_error for the DQN algorithm, supporting 1 step td error.
- Arguments:
  - data (m_q_1step_td_data): The input data, m_q_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - tau (float): Entropy factor for Munchausen DQN
  - alpha (float): Discount factor for the Munchausen term
  - criterion (torch.nn.modules): Loss function criterion
- Returns:
  - loss (torch.Tensor): 1step td error, 0-dim tensor
- Shapes:
  - data (m_q_1step_td_data): the m_q_1step_td_data containing ['q', 'target_q', 'next_q', 'act', 'reward', 'done', 'weight']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - target_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> action_dim = 4
>>> data = m_q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     target_q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = m_q_1step_td_error(data, 0.99, 0.01, 0.01)
q_v_1step_td_data¶
- class ding.rl_utils.td.q_v_1step_td_data(q, v, act, reward, done, weight)¶
q_v_1step_td_error¶
- ding.rl_utils.td.q_v_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
td_error between q and v values for the SAC algorithm, supporting 1 step td error.
- Arguments:
  - data (q_v_1step_td_data): The input data, q_v_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - criterion (torch.nn.modules): Loss function criterion
- Returns:
  - loss (torch.Tensor): 1step td error, 0-dim tensor
- Shapes:
  - data (q_v_1step_td_data): the q_v_1step_td_data containing ['q', 'v', 'act', 'reward', 'done', 'weight']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - v (torch.FloatTensor): \((B, )\)
  - act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> action_dim = 4
>>> data = q_v_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     v=torch.randn(3),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_v_1step_td_error(data, 0.99)
nstep_return_data¶
- class ding.rl_utils.td.nstep_return_data(reward, next_value, done)¶
nstep_return¶
- ding.rl_utils.td.nstep_return(data: namedtuple, gamma: float | list, nstep: int, value_gamma: Tensor | None = None)[source]¶
- Overview:
Calculate the nstep return for the DQN algorithm, supporting both the single agent case and the multi agent case.
- Arguments:
  - data (nstep_return_data): The input data, nstep_return_data to calculate loss
  - gamma (float): Discount factor
  - nstep (int): nstep num
  - value_gamma (torch.Tensor): Discount factor for value
- Returns:
  - return (torch.Tensor): nstep return
- Shapes:
  - data (nstep_return_data): the nstep_return_data containing ['reward', 'next_value', 'done']
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - next_value (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
- Examples:
>>> data = nstep_return_data(
>>>     reward=torch.randn(3, 3),
>>>     next_value=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>> )
>>> loss = nstep_return(data, 0.99, 3)
dist_1step_td_data¶
- class ding.rl_utils.td.dist_1step_td_data(dist, next_dist, act, next_act, reward, done, weight)¶
dist_1step_td_error¶
- ding.rl_utils.td.dist_1step_td_error(data: namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int) Tensor [source]¶
- Overview:
1 step td_error for distributional q-learning based algorithms
- Arguments:
  - data (dist_1step_td_data): The input data, dist_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - v_min (float): The min value of the support
  - v_max (float): The max value of the support
  - n_atom (int): The num of atoms
- Returns:
  - loss (torch.Tensor): 1step td error, 0-dim tensor
- Shapes:
  - data (dist_1step_td_data): the dist_1step_td_data containing ['dist', 'next_dist', 'act', 'next_act', 'reward', 'done', 'weight']
  - dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]
  - next_dist (torch.FloatTensor): \((B, N, n_atom)\)
  - act (torch.LongTensor): \((B, )\)
  - next_act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_dist = torch.randn(4, 3, 51).abs()
>>> act = torch.randint(0, 3, (4,))
>>> next_act = torch.randint(0, 3, (4,))
>>> reward = torch.randn(4)
>>> done = torch.randint(0, 2, (4,))
>>> data = dist_1step_td_data(dist, next_dist, act, next_act, reward, done, None)
>>> loss = dist_1step_td_error(data, 0.99, -10.0, 10.0, 51)
dist_nstep_td_data¶
- ding.rl_utils.td.dist_nstep_td_data¶
alias of dist_1step_td_data
shape_fn_dntd¶
dist_nstep_td_error¶
- ding.rl_utils.td.dist_nstep_td_error(data: namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int, nstep: int = 1, value_gamma: Tensor | None = None) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error for distributional q-learning based algorithms, supporting both the single agent case and the multi agent case.
- Arguments:
  - data (dist_nstep_td_data): The input data, dist_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - nstep (int): nstep num, default set to 1
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
- Shapes:
  - data (dist_nstep_td_data): the dist_nstep_td_data containing ['dist', 'next_n_dist', 'act', 'reward', 'done', 'weight']
  - dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]
  - next_n_dist (torch.FloatTensor): \((B, N, n_atom)\)
  - act (torch.LongTensor): \((B, )\)
  - next_n_act (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
- Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_n_dist = torch.randn(4, 3, 51).abs()
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> reward = torch.randn(5, 4)
>>> data = dist_nstep_td_data(dist, next_n_dist, action, next_action, reward, done, None)
>>> loss, _ = dist_nstep_td_error(data, 0.95, -10.0, 10.0, 51, 5)
v_1step_td_data¶
- class ding.rl_utils.td.v_1step_td_data(v, next_v, reward, done, weight)¶
v_1step_td_error¶
- ding.rl_utils.td.v_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
1 step td_error for distributed value based algorithms
- Arguments:
  - data (v_1step_td_data): The input data, v_1step_td_data to calculate loss
  - gamma (float): Discount factor
  - criterion (torch.nn.modules): Loss function criterion
- Returns:
  - loss (torch.Tensor): 1step td error, 0-dim tensor
- Shapes:
  - data (v_1step_td_data): the v_1step_td_data containing ['v', 'next_v', 'reward', 'done', 'weight']
  - v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]
  - next_v (torch.FloatTensor): \((B, )\)
  - reward (torch.FloatTensor): \((, B)\)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
- Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5)
>>> done = torch.zeros(5)
>>> data = v_1step_td_data(v, next_v, reward, done, None)
>>> loss, td_error_per_sample = v_1step_td_error(data, 0.99)
v_nstep_td_data¶
- class ding.rl_utils.td.v_nstep_td_data(v, next_n_v, reward, done, weight, value_gamma)¶
v_nstep_td_error¶
- ding.rl_utils.td.v_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep (n step) td_error for distributed value based algorithms
- Arguments:
  - data (v_nstep_td_data): The input data, v_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - nstep (int): nstep num, default set to 1
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
- Shapes:
  - data (v_nstep_td_data): The v_nstep_td_data containing ['v', 'next_n_v', 'reward', 'done', 'weight', 'value_gamma']
  - v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]
  - next_v (torch.FloatTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - weight (torch.FloatTensor or None): \((B, )\), the training sample weight
  - value_gamma (torch.Tensor): If the remaining data in the buffer is less than n_step, we use value_gamma as the gamma discount value for next_v rather than gamma**n_step
- Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5, 5)
>>> done = torch.zeros(5)
>>> data = v_nstep_td_data(v, next_v, reward, done, 0.9, 0.99)
>>> loss, td_error_per_sample = v_nstep_td_error(data, 0.99, 5)
q_nstep_td_data¶
- class ding.rl_utils.td.q_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, weight)¶
dqfd_nstep_td_data¶
- class ding.rl_utils.td.dqfd_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, done_one_step, weight, new_n_q_one_step, next_n_action_one_step, is_expert)¶
shape_fn_qntd¶
q_nstep_td_error¶
- ding.rl_utils.td.q_nstep_td_error(data: namedtuple, gamma: float | list, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error for q-learning based algorithms
- Arguments:
  - data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data
  - value_gamma (torch.Tensor): Gamma discount value for target q_value
  - criterion (torch.nn.modules): Loss function criterion
  - nstep (int): nstep num, default set to 1
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
  - td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor
- Shapes:
  - data (q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_n_q (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - next_n_action (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - td_error_per_sample (torch.FloatTensor): \((B, )\)
- Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = q_nstep_td_error(data, 0.95, nstep=nstep)
bdq_nstep_td_error¶
- ding.rl_utils.td.bdq_nstep_td_error(data: namedtuple, gamma: float | list, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error for the BDQ algorithm, from the paper "Action Branching Architectures for Deep Reinforcement Learning" (https://arxiv.org/pdf/1711.08946). The original paper only provides the 1-step TD-error calculation method, and here we extend it to the n-step TD-error.
- Arguments:
  - data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data
  - value_gamma (torch.Tensor): Gamma discount value for target q_value
  - criterion (torch.nn.modules): Loss function criterion
  - nstep (int): nstep num, default set to 1
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
  - td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor
- Shapes:
  - data (q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
  - q (torch.FloatTensor): \((B, D, N)\) i.e. [batch_size, branch_num, action_bins_per_branch]
  - next_n_q (torch.FloatTensor): \((B, D, N)\)
  - action (torch.LongTensor): \((B, D)\)
  - next_n_action (torch.LongTensor): \((B, D)\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - td_error_per_sample (torch.FloatTensor): \((B, )\)
- Examples:
>>> action_per_branch = 3
>>> next_q = torch.randn(8, 6, action_per_branch)
>>> done = torch.randn(8)
>>> action = torch.randint(0, action_per_branch, size=(8, 6))
>>> next_action = torch.randint(0, action_per_branch, size=(8, 6))
>>> nstep = 3
>>> q = torch.randn(8, 6, action_per_branch).requires_grad_(True)
>>> reward = torch.rand(nstep, 8)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = bdq_nstep_td_error(data, 0.95, nstep=nstep)
shape_fn_qntd_rescale¶
q_nstep_td_error_with_rescale¶
- ding.rl_utils.td.q_nstep_td_error_with_rescale(data: namedtuple, gamma: float | list, nstep: int = 1, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error with value rescaling
- Arguments:
  - data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - nstep (int): nstep num, default set to 1
  - criterion (torch.nn.modules): Loss function criterion
  - trans_fn (Callable): Value transform function, defaults to value_transform (refer to rl_utils/value_rescale.py)
  - inv_trans_fn (Callable): Value inverse transform function, defaults to value_inv_transform (refer to rl_utils/value_rescale.py)
- Returns:
  - loss (torch.Tensor): nstep td error, 0-dim tensor
- Shapes:
  - data (q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_n_q (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - next_n_action (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
- Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, _ = q_nstep_td_error_with_rescale(data, 0.95, nstep=nstep)
dqfd_nstep_td_error¶
- ding.rl_utils.td.dqfd_nstep_td_error(data: namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, margin_function: float, lambda_one_step_td: float = 1.0, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep n step td_error + 1 step td_error + supervised margin loss for DQfD
- Arguments:
  - data (dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss
  - gamma (float): Discount factor
  - cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data
  - value_gamma (torch.Tensor): Gamma discount value for target q_value
  - criterion (torch.nn.modules): Loss function criterion
  - nstep (int): nstep num, default set to 10
- Returns:
  - loss (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensor
  - td_error_per_sample (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor
- Shapes:
  - data (q_nstep_td_data): the q_nstep_td_data containing ['q', 'next_n_q', 'action', 'next_n_action', 'reward', 'done', 'weight', 'new_n_q_one_step', 'next_n_action_one_step', 'is_expert']
  - q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]
  - next_n_q (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - next_n_action (torch.LongTensor): \((B, )\)
  - reward (torch.FloatTensor): \((T, B)\), where T is timestep (nstep)
  - done (torch.BoolTensor): \((B, )\), whether done in last timestep
  - td_error_per_sample (torch.FloatTensor): \((B, )\)
  - new_n_q_one_step (torch.FloatTensor): \((B, N)\)
  - next_n_action_one_step (torch.LongTensor): \((B, )\)
  - is_expert (int): 0 or 1
- Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> loss, td_error_per_sample, loss_statistics = dqfd_nstep_td_error(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     margin_function=0.8, nstep=nstep
>>> )
dqfd_nstep_td_error_with_rescale¶
- ding.rl_utils.td.dqfd_nstep_td_error_with_rescale(data: namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, lambda_one_step_td: float, margin_function: float, nstep: int = 1, cum_reward: bool = False, value_gamma: torch.Tensor | None = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) Tensor [source]¶
- Overview:
Multistep (n step) td_error + 1 step td_error + supervised margin loss for DQfD, with value rescale
- Arguments:
data (
dqfd_nstep_td_data
): The input data, dqfd_nstep_td_data to calculate lossgamma (
float
): Discount factorcum_reward (
bool
): Whether to use cumulative nstep reward, which is figured out when collecting datavalue_gamma (
torch.Tensor
): Gamma discount value for target q_valuecriterion (
torch.nn.modules
): Loss function criterionnstep (
int
): nstep num, default set to 1
- Returns:
loss (
torch.Tensor
): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensortd_error_per_sample (
torch.Tensor
): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor
- Shapes:
data (
q_nstep_td_data
): The dqfd_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘weight’, ‘new_n_q_one_step’, ‘next_n_action_one_step’, ‘is_expert’]q (
torch.FloatTensor
): \((B, N)\) i.e. [batch_size, action_dim]next_n_q (
torch.FloatTensor
): \((B, N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timesteptd_error_per_sample (
torch.FloatTensor
): \((B, )\)new_n_q_one_step (
torch.FloatTensor
): \((B, N)\)next_n_action_one_step (
torch.LongTensor
): \((B, )\)is_expert (
int
) : 0 or 1
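- Examples:
The source docstring gives no example for this rescaled variant; the sketch below is adapted from the dqfd_nstep_td_error example above, assuming the same dqfd_nstep_td_data layout (the structure of the return value is not asserted here and follows the Returns section above).
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> # outputs contain the loss and per-sample td error described in Returns above
>>> outputs = dqfd_nstep_td_error_with_rescale(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     lambda_one_step_td=1, margin_function=0.8, nstep=nstep
>>> )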
qrdqn_nstep_td_data¶
- class ding.rl_utils.td.qrdqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, tau, weight)¶
qrdqn_nstep_td_error¶
- ding.rl_utils.td.qrdqn_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, value_gamma: Tensor | None = None) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error used in QRDQN
- Arguments:
data (
qrdqn_nstep_td_data
): The input data, qrdqn_nstep_td_data to calculate lossgamma (
float
): Discount factornstep (
int
): nstep num, default set to 1
- Returns:
loss (
torch.Tensor
): nstep td error, 0-dim tensor
- Shapes:
data (
q_nstep_td_data
): The qrdqn_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’]q (
torch.FloatTensor
): \((tau, B, N)\) i.e. [num_quantiles, batch_size, action_dim]next_n_q (
torch.FloatTensor
): \((tau', B, N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timestep
- Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = qrdqn_nstep_td_data(q, next_q, action, next_action, reward, done, 3, None)
>>> loss, td_error_per_sample = qrdqn_nstep_td_error(data, 0.95, nstep=nstep)
q_nstep_sql_td_error¶
- ding.rl_utils.td.q_nstep_sql_td_error(data: namedtuple, gamma: float, alpha: float, nstep: int = 1, cum_reward: bool = False, value_gamma: torch.Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error for soft Q-learning (SQL) based algorithms
- Arguments:
data (
q_nstep_td_data
): The input data, q_nstep_sql_td_data to calculate lossgamma (
float
): Discount factor
alpha (
float
): A parameter to weight the entropy term in the policy equation
cum_reward (
bool
): Whether to use cumulative nstep reward, which is figured out when collecting datavalue_gamma (
torch.Tensor
): Gamma discount value for target soft_q_valuecriterion (
torch.nn.modules
): Loss function criterionnstep (
int
): nstep num, default set to 1
- Returns:
loss (
torch.Tensor
): nstep td error, 0-dim tensortd_error_per_sample (
torch.Tensor
): nstep td error, 1-dim tensor
- Shapes:
data (
q_nstep_td_data
): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’]q (
torch.FloatTensor
): \((B, N)\) i.e. [batch_size, action_dim]next_n_q (
torch.FloatTensor
): \((B, N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timesteptd_error_per_sample (
torch.FloatTensor
): \((B, )\)
- Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample, record_target_v = q_nstep_sql_td_error(data, 0.95, 1.0, nstep=nstep)
iqn_nstep_td_data¶
- class ding.rl_utils.td.iqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, replay_quantiles, weight)¶
iqn_nstep_td_error¶
- ding.rl_utils.td.iqn_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, kappa: float = 1.0, value_gamma: Tensor | None = None) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error used in IQN, referenced from the paper Implicit Quantile Networks for Distributional Reinforcement Learning <https://arxiv.org/pdf/1806.06923.pdf>
- Arguments:
data (
iqn_nstep_td_data
): The input data, iqn_nstep_td_data to calculate lossgamma (
float
): Discount factornstep (
int
): nstep num, default set to 1criterion (
torch.nn.modules
): Loss function criterionbeta_function (
Callable
): The risk function
- Returns:
loss (
torch.Tensor
): nstep td error, 0-dim tensor
- Shapes:
data (
q_nstep_td_data
): The iqn_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’]q (
torch.FloatTensor
): \((tau, B, N)\) i.e. [num_quantiles, batch_size, action_dim]next_n_q (
torch.FloatTensor
): \((tau', B, N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timestep
- Examples:
>>> next_q = torch.randn(3, 4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(3, 4, 3).requires_grad_(True)
>>> replay_quantile = torch.randn([3, 4, 1])
>>> reward = torch.rand(nstep, 4)
>>> data = iqn_nstep_td_data(q, next_q, action, next_action, reward, done, replay_quantile, None)
>>> loss, td_error_per_sample = iqn_nstep_td_error(data, 0.95, nstep=nstep)
fqf_nstep_td_data¶
- class ding.rl_utils.td.fqf_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, quantiles_hats, weight)¶
fqf_nstep_td_error¶
- ding.rl_utils.td.fqf_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, kappa: float = 1.0, value_gamma: Tensor | None = None) Tensor [source]¶
- Overview:
Multistep (1 step or n step) td_error used in FQF, referenced from the paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning <https://arxiv.org/pdf/1911.02140.pdf>
- Arguments:
data (
fqf_nstep_td_data
): The input data, fqf_nstep_td_data to calculate lossgamma (
float
): Discount factornstep (
int
): nstep num, default set to 1criterion (
torch.nn.modules
): Loss function criterionbeta_function (
Callable
): The risk function
- Returns:
loss (
torch.Tensor
): nstep td error, 0-dim tensor
- Shapes:
data (
q_nstep_td_data
): The fqf_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘quantiles_hats’]q (
torch.FloatTensor
): \((B, tau, N)\) i.e. [batch_size, tau, action_dim]next_n_q (
torch.FloatTensor
): \((B, tau', N)\)action (
torch.LongTensor
): \((B, )\)next_n_action (
torch.LongTensor
): \((B, )\)reward (
torch.FloatTensor
): \((T, B)\), where T is timestep(nstep)done (
torch.BoolTensor
) \((B, )\), whether done in last timestepquantiles_hats (
torch.FloatTensor
): \((B, tau)\)
- Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> quantiles_hats = torch.randn([4, 3])
>>> reward = torch.rand(nstep, 4)
>>> data = fqf_nstep_td_data(q, next_q, action, next_action, reward, done, quantiles_hats, None)
>>> loss, td_error_per_sample = fqf_nstep_td_error(data, 0.95, nstep=nstep)
evaluate_quantile_at_action¶
fqf_calculate_fraction_loss¶
- ding.rl_utils.td.fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)[source]¶
- Overview:
Calculate the fraction loss in FQF, referenced paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning <https://arxiv.org/pdf/1911.02140.pdf>
- Arguments:
q_tau_i (
torch.FloatTensor
): \((batch_size, num_quantiles-1, action_dim)\)q_value (
torch.FloatTensor
): \((batch_size, num_quantiles, action_dim)\)quantiles (
torch.FloatTensor
): \((batch_size, num_quantiles+1)\)actions (
torch.LongTensor
): \((batch_size, )\)
- Returns:
fraction_loss (
torch.Tensor
): fraction loss, 0-dim tensor
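- Examples:
Not from the source docstring; a minimal sketch with randomly generated tensors of the documented shapes (the batch_size, num_quantiles and action_dim values are illustrative, and the sorted random quantiles only stand in for the monotone fractions produced by the fraction proposal network).
>>> batch_size, num_quantiles, action_dim = 4, 32, 3
>>> q_tau_i = torch.randn(batch_size, num_quantiles - 1, action_dim)
>>> q_value = torch.randn(batch_size, num_quantiles, action_dim)
>>> quantiles = torch.sort(torch.rand(batch_size, num_quantiles + 1), dim=-1)[0]
>>> actions = torch.randint(0, action_dim, size=(batch_size, ))
>>> fraction_loss = fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)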
td_lambda_data¶
- class ding.rl_utils.td.td_lambda_data(value, reward, weight)¶
shape_fn_td_lambda¶
td_lambda_error¶
- ding.rl_utils.td.td_lambda_error(data: namedtuple, gamma: float = 0.9, lambda_: float = 0.8) Tensor [source]¶
- Overview:
Computing TD(lambda) loss given constant gamma and lambda. There is no special handling for the terminal state value: if some state has reached the terminal, just fill in zeros for the values and rewards beyond the terminal (including the terminal state itself, i.e. values[terminal] should also be 0).
- Arguments:
data (
namedtuple
): td_lambda input data with fields [‘value’, ‘reward’, ‘weight’]gamma (
float
): Constant discount factor gamma, should be in [0, 1], defaults to 0.9lambda (
float
): Constant lambda, should be in [0, 1], defaults to 0.8
- Returns:
loss (
torch.Tensor
): Computed MSE loss, averaged over the batch
- Shapes:
value (
torch.FloatTensor
): \((T+1, B)\), where T is trajectory length and B is batch, which is the estimation of the state value at step 0 to Treward (
torch.FloatTensor
): \((T, B)\), the returns from time step 0 to T-1weight (
torch.FloatTensor
or None): \((B, )\), the training sample weightloss (
torch.FloatTensor
): \(()\), 0-dim tensor
- Examples:
>>> T, B = 8, 4
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> loss = td_lambda_error(td_lambda_data(value, reward, None))
generalized_lambda_returns¶
- ding.rl_utils.td.generalized_lambda_returns(bootstrap_values: Tensor, rewards: Tensor, gammas: float, lambda_: float, done: Tensor | None = None) Tensor [source]¶
- Overview:
Functional equivalent to trfl.value_ops.generalized_lambda_returns (https://github.com/deepmind/trfl/blob/2c07ac22512a16715cc759f0072be43a5d12ae45/trfl/value_ops.py#L74). Passing in a number instead of a tensor makes that value constant for all samples in the batch.
- Arguments:
bootstrap_values (
torch.Tensor
orfloat
): estimation of the value at step 0 to T, of size [T_traj+1, batchsize]rewards (
torch.Tensor
): The returns from 0 to T-1, of size [T_traj, batchsize]gammas (
torch.Tensor
orfloat
): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]lambda (
torch.Tensor
orfloat
): Determining the mix of bootstrapping vs further accumulation of multistep returns at each timestep, of size [T_traj, batchsize]done (
torch.Tensor
orfloat
): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]
- Returns:
return (
torch.Tensor
): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
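- Examples:
Not from the source docstring; a minimal sketch passing scalar gamma and lambda, the documented shortcut for making them constant across all samples in the batch.
>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T + 1, B)
>>> rewards = torch.randn(T, B)
>>> return_ = generalized_lambda_returns(bootstrap_values, rewards, 0.99, 0.95)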
multistep_forward_view¶
- ding.rl_utils.td.multistep_forward_view(bootstrap_values: Tensor, rewards: Tensor, gammas: float, lambda_: float, done: Tensor | None = None) Tensor [source]¶
- Overview:
Same as trfl.sequence_ops.multistep_forward_view, which implements (12.18) in Sutton & Barto. Assuming the first dim of the input tensors corresponds to the time (trajectory) index.
Note
result[T-1] = rewards[T-1] + gammas[T-1] * bootstrap_values[T]
for t in 0 ... T-2: result[t] = rewards[t] + gammas[t] * (lambdas[t] * result[t+1] + (1 - lambdas[t]) * bootstrap_values[t+1])
- Arguments:
bootstrap_values (
torch.Tensor
): Estimation of the value at step 1 to T, of size [T_traj, batchsize]rewards (
torch.Tensor
): The returns from 0 to T-1, of size [T_traj, batchsize]gammas (
torch.Tensor
): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]lambda (
torch.Tensor
): Determining the mix of bootstrapping vs further accumulation of multistep returns at each timestep of size [T_traj, batchsize], the element for T-1 is ignored and effectively set to 0, as there is no information about future rewards.done (
torch.Tensor
orfloat
): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]
- Returns:
ret (
torch.Tensor
): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
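- Examples:
Not from the source docstring; a minimal sketch using per-step gamma and lambda tensors of the documented [T_traj, batchsize] size (the constant values 0.99 and 0.95 are illustrative).
>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T, B)
>>> rewards = torch.randn(T, B)
>>> gammas = 0.99 * torch.ones(T, B)
>>> lambda_ = 0.95 * torch.ones(T, B)
>>> ret = multistep_forward_view(bootstrap_values, rewards, gammas, lambda_)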
upgo¶
Please refer to ding/rl_utils/upgo
for more details.
upgo_returns¶
- ding.rl_utils.upgo.upgo_returns(rewards: Tensor, bootstrap_values: Tensor) Tensor [source]¶
- Overview:
Computing UPGO return targets. Also notice there is no special handling for the terminal state.
- Arguments:
rewards (
torch.Tensor
): the returns from time step 0 to T-1, of size [T_traj, batchsize]bootstrap_values (
torch.Tensor
): estimation of the state value at step 0 to T, of size [T_traj+1, batchsize]
- Returns:
ret (
torch.Tensor
): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
- Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rewards = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B).requires_grad_(True)
>>> returns = upgo_returns(rewards, bootstrap_values)
upgo_loss¶
- ding.rl_utils.upgo.upgo_loss(target_output: Tensor, rhos: Tensor, action: Tensor, rewards: Tensor, bootstrap_values: Tensor, mask=None) Tensor [source]¶
- Overview:
Computing UPGO loss given constant gamma and lambda. There is no special handling for the terminal state value: if the last state in the trajectory is terminal, just pass 0 as bootstrap_terminal_value.
- Arguments:
target_output (
torch.Tensor
): the output computed by the target policy network, of size [T_traj, batchsize, n_output]rhos (
torch.Tensor
): the importance sampling ratio, of size [T_traj, batchsize]action (
torch.Tensor
): the action taken, of size [T_traj, batchsize]rewards (
torch.Tensor
): the returns from time step 0 to T-1, of size [T_traj, batchsize]bootstrap_values (
torch.Tensor
): estimation of the state value at step 0 to T, of size [T_traj+1, batchsize]
- Returns:
loss (
torch.Tensor
): Computed importance sampled UPGO loss, averaged over the samples, of size []
- Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rhos = torch.randn(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> action = torch.randint(0, N, size=(T, B))
>>> rewards = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B)
>>> loss = upgo_loss(target_output, rhos, action, rewards, bootstrap_values)
value_rescale¶
Please refer to ding/rl_utils/value_rescale
for more details.
value_transform¶
- ding.rl_utils.value_rescale.value_transform(x: Tensor, eps: float = 0.01) Tensor [source]¶
- Overview:
A function to reduce the scale of the action-value function: \(h(x) = \operatorname{sign}(x)(\sqrt{|x| + 1} - 1) + \epsilon x\).
- Arguments:
x: (
torch.Tensor
) The input tensor to be normalized.eps: (
float
) The coefficient of the additive regularization term to ensure the inverse function is Lipschitz continuous
- Returns:
(
torch.Tensor
) Normalized tensor.
Note
Observe and Look Further: Achieving Consistent Performance on Atari (https://arxiv.org/abs/1805.11593).
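- Examples:
Not from the source docstring; a minimal sketch showing how the transform compresses large action values (the input values are illustrative).
>>> x = torch.tensor([-100., -10., 0., 10., 100.])
>>> h_x = value_transform(x)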
value_inv_transform¶
- ding.rl_utils.value_rescale.value_inv_transform(x: Tensor, eps: float = 0.01) Tensor [source]¶
- Overview:
The inverse form of value rescale: \(h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1 + 4\epsilon(|x| + 1 + \epsilon)} - 1}{2\epsilon}\right)^{2} - 1\right)\).
- Arguments:
x: (
torch.Tensor
) The input tensor to be unnormalized.eps: (
float
) The coefficient of the additive regularization term to ensure the inverse function is Lipschitz continuous
- Returns:
(
torch.Tensor
) Unnormalized tensor.
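- Examples:
Not from the source docstring; a minimal sketch round-tripping a few illustrative values through value_transform and back.
>>> x = torch.tensor([-100., -10., 0., 10., 100.])
>>> x_rec = value_inv_transform(value_transform(x))  # approximately equal to x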
symlog¶
- ding.rl_utils.value_rescale.symlog(x: Tensor) Tensor [source]¶
- Overview:
A function to normalize the targets: \(\operatorname{symlog}(x) = \operatorname{sign}(x)\ln(|x| + 1)\).
- Arguments:
x: (
torch.Tensor
) The input tensor to be normalized.
- Returns:
(
torch.Tensor
) Normalized tensor.
Note
Mastering Diverse Domains through World Models (https://arxiv.org/abs/2301.04104)
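- Examples:
Not from the source docstring; a minimal sketch (the input values are illustrative).
>>> x = torch.tensor([-100., -1., 0., 1., 100.])
>>> y = symlog(x)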
inv_symlog¶
vtrace¶
Please refer to ding/rl_utils/vtrace
for more details.
vtrace_nstep_return¶
- ding.rl_utils.vtrace.vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values, gamma=0.99, lambda_=0.95)[source]¶
- Overview:
Computation of vtrace return.
- Returns:
vtrace_return (
torch.FloatTensor
): the computed v-trace n-step return, a differentiable tensor of size \((T, B)\)
- Shapes:
clipped_rhos (
torch.FloatTensor
): \((T, B)\), where T is timestep, B is batch sizeclipped_cs (
torch.FloatTensor
): \((T, B)\)reward (
torch.FloatTensor
): \((T, B)\)bootstrap_values (
torch.FloatTensor
): \((T+1, B)\)vtrace_return (
torch.FloatTensor
): \((T, B)\)
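- Examples:
Not from the source docstring; a minimal sketch with randomly generated tensors of the documented shapes (the clipped importance weights are drawn in [0, 1] for illustration).
>>> T, B = 4, 8
>>> clipped_rhos = torch.rand(T, B)
>>> clipped_cs = torch.rand(T, B)
>>> reward = torch.rand(T, B)
>>> bootstrap_values = torch.randn(T + 1, B)
>>> ret = vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values, gamma=0.99, lambda_=0.95)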
vtrace_advantage¶
- ding.rl_utils.vtrace.vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, gamma)[source]¶
- Overview:
Computation of vtrace advantage.
- Returns:
vtrace_advantage (
torch.FloatTensor
): the computed v-trace advantage, a differentiable tensor of size \((T, B)\)
- Shapes:
clipped_pg_rhos (
torch.FloatTensor
): \((T, B)\), where T is timestep, B is batch sizereward (
torch.FloatTensor
): \((T, B)\)return (
torch.FloatTensor
): \((T, B)\)bootstrap_values (
torch.FloatTensor
): \((T, B)\)vtrace_advantage (
torch.FloatTensor
): \((T, B)\)
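- Examples:
Not from the source docstring; a minimal sketch with randomly generated tensors of the documented \((T, B)\) shapes (the gamma value is illustrative).
>>> T, B = 4, 8
>>> clipped_pg_rhos = torch.rand(T, B)
>>> reward = torch.rand(T, B)
>>> return_ = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T, B)
>>> adv = vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, 0.99)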
vtrace_data¶
- class ding.rl_utils.vtrace.vtrace_data(target_output, behaviour_output, action, value, reward, weight)¶
vtrace_loss¶
- class ding.rl_utils.vtrace.vtrace_loss(policy_loss, value_loss, entropy_loss)¶
vtrace_error_discrete_action¶
- ding.rl_utils.vtrace.vtrace_error_discrete_action(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[source]¶
- Overview:
Implementation of v-trace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for the discrete action space
- Arguments:
- data (
namedtuple
): input data with fields shown invtrace_data
target_output (
torch.Tensor
): the output taking the action by the current policy network, usually this output is network output logitbehaviour_output (
torch.Tensor
): the output taking the action by the behaviour policy network, usually this output is network output logit, which is used to produce the trajectory(collector)action (
torch.Tensor
): the chosen action(index for the discrete action space) in trajectory, i.e.: behaviour_action
gamma: (
float
): the future discount factor, defaults to 0.99lambda: (
float
): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95rho_clip_ratio (
float
): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)c_clip_ratio (
float
): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)rho_pg_clip_ratio (
float
): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage
- Returns:
trace_loss (
namedtuple
): the vtrace loss item, all of them are the differentiable 0-dim tensor
- Shapes:
target_output (
torch.FloatTensor
): \((T, B, N)\), where T is timestep, B is batch size and N is action dimbehaviour_output (
torch.FloatTensor
): \((T, B, N)\)action (
torch.LongTensor
): \((T, B)\)value (
torch.FloatTensor
): \((T+1, B)\)reward (
torch.LongTensor
): \((T, B)\)weight (
torch.LongTensor
): \((T, B)\)
- Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> behaviour_output = torch.randn(T, B, N)
>>> action = torch.randint(0, N, size=(T, B))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_discrete_action(data, rho_clip_ratio=1.1)
vtrace_error_continuous_action¶
- ding.rl_utils.vtrace.vtrace_error_continuous_action(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[source]¶
- Overview:
Implementation of v-trace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for the continuous action space
- Arguments:
- data (
namedtuple
): input data with fields shown invtrace_data
target_output (
dict{key:torch.Tensor}
): the output taking the action by the current policy network, usually this output is network output, which represents the distribution by reparameterization trick.behaviour_output (
dict{key:torch.Tensor}
): the output taking the action by the behaviour policy network, usually this output is network output logit, which represents the distribution by reparameterization trick.action (
torch.Tensor
): the chosen action (continuous action value) in trajectory, i.e.: behaviour_action
gamma: (
float
): the future discount factor, defaults to 0.99lambda: (
float
): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95rho_clip_ratio (
float
): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)c_clip_ratio (
float
): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)rho_pg_clip_ratio (
float
): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage
- Returns:
trace_loss (
namedtuple
): the vtrace loss item, all of them are the differentiable 0-dim tensor
- Shapes:
target_output (
dict{key:torch.FloatTensor}
): \((T, B, N)\), where T is timestep, B is batch size and N is action dim. The keys are usually parameters of reparameterization trick.behaviour_output (
dict{key:torch.FloatTensor}
): \((T, B, N)\)action (
torch.FloatTensor
): \((T, B, N)\)value (
torch.FloatTensor
): \((T+1, B)\)reward (
torch.LongTensor
): \((T, B)\)weight (
torch.LongTensor
): \((T, B)\)
- Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = dict(
>>>     mu=torch.randn(T, B, N).requires_grad_(True),
>>>     sigma=torch.exp(torch.randn(T, B, N).requires_grad_(True)),
>>> )
>>> behaviour_output = dict(
>>>     mu=torch.randn(T, B, N),
>>>     sigma=torch.exp(torch.randn(T, B, N)),
>>> )
>>> action = torch.randn((T, B, N))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_continuous_action(data, rho_clip_ratio=1.1)