Shortcuts

ding.rl_utils

a2c

Please refer to ding/rl_utils/a2c for more details.

a2c_error

ding.rl_utils.a2c_error(data: namedtuple) namedtuple[source]
Overview:

Implementation of A2C(Advantage Actor-Critic) (arXiv:1602.01783) for discrete action space

Arguments:
  • data (namedtuple): a2c input data with fields shown in a2c_data

Returns:
  • a2c_loss (namedtuple): the a2c loss item, all of which are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • value (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> data = a2c_data(
>>>     logit=torch.randn(2, 3),
>>>     action=torch.randint(0, 3, (2, )),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error(data)

a2c_error_continuous

ding.rl_utils.a2c_error_continuous(data: namedtuple) namedtuple[source]
Overview:

Implementation of A2C(Advantage Actor-Critic) (arXiv:1602.01783) for continuous action space

Arguments:
  • data (namedtuple): a2c input data with fields shown in a2c_data

Returns:
  • a2c_loss (namedtuple): the a2c loss item, all of which are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, N)\)

  • value (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> data = a2c_data(
>>>     logit={'mu': torch.randn(2, 3), 'sigma': torch.sqrt(torch.randn(2, 3)**2)},
>>>     action=torch.randn(2, 3),
>>>     value=torch.randn(2, ),
>>>     adv=torch.randn(2, ),
>>>     return_=torch.randn(2, ),
>>>     weight=torch.ones(2, ),
>>> )
>>> loss = a2c_error_continuous(data)

acer

Please refer to ding/rl_utils/acer for more details.

acer_policy_error

ding.rl_utils.acer_policy_error(q_values: Tensor, q_retraces: Tensor, v_pred: Tensor, target_logit: Tensor, actions: Tensor, ratio: Tensor, c_clip_ratio: float = 10.0) Tuple[Tensor, Tensor][source]
Overview:

Get ACER policy loss.

Arguments:
  • q_values (torch.Tensor): Q values

  • q_retraces (torch.Tensor): Q values calculated by the retrace method

  • v_pred (torch.Tensor): V values

  • target_pi (torch.Tensor): The new policy’s probability

  • actions (torch.Tensor): The actions in replay buffer

  • ratio (torch.Tensor): The ratio of the new policy to the behaviour policy

  • c_clip_ratio (float): clip value for ratio

Returns:
  • actor_loss (torch.Tensor): policy loss from q_retrace

  • bc_loss (torch.Tensor): the bias correction policy loss

Shapes:
  • q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim

  • q_retraces (torch.FloatTensor): \((T, B, 1)\)

  • v_pred (torch.FloatTensor): \((T, B, 1)\)

  • target_pi (torch.FloatTensor): \((T, B, N)\)

  • actions (torch.LongTensor): \((T, B)\)

  • ratio (torch.FloatTensor): \((T, B, N)\)

  • actor_loss (torch.FloatTensor): \((T, B, 1)\)

  • bc_loss (torch.FloatTensor): \((T, B, 1)\)

Examples:
>>> q_values=torch.randn(2, 3, 4)
>>> q_retraces=torch.randn(2, 3, 1)
>>> v_pred=torch.randn(2, 3, 1)
>>> target_pi=torch.randn(2, 3, 4)
>>> actions=torch.randint(0, 4, (2, 3))
>>> ratio=torch.randn(2, 3, 4)
>>> loss = acer_policy_error(q_values, q_retraces, v_pred, target_pi, actions, ratio)

acer_value_error

ding.rl_utils.acer_value_error(q_values, q_retraces, actions)[source]
Overview:

Get ACER critic loss.

Arguments:
  • q_values (torch.Tensor): Q values

  • q_retraces (torch.Tensor): Q values calculated by the retrace method

  • actions (torch.Tensor): The actions in replay buffer

Returns:
  • critic_loss (torch.Tensor): critic loss

Shapes:
  • q_values (torch.FloatTensor): \((T, B, N)\), where B is batch size and N is action dim

  • q_retraces (torch.FloatTensor): \((T, B, 1)\)

  • actions (torch.LongTensor): \((T, B)\)

  • critic_loss (torch.FloatTensor): \((T, B, 1)\)

Examples:
>>> q_values=torch.randn(2, 3, 4)
>>> q_retraces=torch.randn(2, 3, 1)
>>> actions=torch.randint(0, 4, (2, 3))
>>> loss = acer_value_error(q_values, q_retraces, actions)

acer_trust_region_update

ding.rl_utils.acer_trust_region_update(actor_gradients: List[Tensor], target_logit: Tensor, avg_logit: Tensor, trust_region_value: float) List[Tensor][source]
Overview:

Calculate the gradient with trust region constraint.

Arguments:
  • actor_gradients (list(torch.Tensor)): gradients value’s for different part

  • target_pi (torch.Tensor): The new policy’s probability

  • avg_pi (torch.Tensor): The average policy’s probability

  • trust_region_value (float): the range of trust region

Returns:
  • update_gradients (list(torch.Tensor)): gradients with trust region constraint

Shapes:
  • target_pi (torch.FloatTensor): \((T, B, N)\)

  • avg_pi (torch.FloatTensor): \((T, B, N)\)

  • update_gradients (list(torch.FloatTensor)): \((T, B, N)\)

Examples:
>>> actor_gradients=[torch.randn(2, 3, 4)]
>>> target_pi=torch.randn(2, 3, 4)
>>> avg_pi=torch.randn(2, 3, 4)
>>> loss = acer_trust_region_update(actor_gradients, target_pi, avg_pi, 0.1)

adder

Please refer to ding/rl_utils/adder for more details.

Adder

class ding.rl_utils.adder.Adder[source]
Overview:

Adder is a component that handles different transformations and calculations for transitions in the Collector Module (data generation and processing), such as GAE, n-step return, transition sampling, etc.

Interface:

__init__, get_gae, get_gae_with_default_last_value, get_nstep_return_data, get_train_sample

classmethod _get_null_transition(template: dict, null_transition: dict | None = None) dict[source]
Overview:

Get null transition for padding. If cls._null_transition is None, return input template instead.

Arguments:
  • template (dict): The template for null transition.

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • null_transition (dict): The deepcopied null transition.
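
Examples (illustrative sketch, not part of the source docstring; the template keys are hypothetical):
>>> template = dict(obs=torch.zeros(4), reward=torch.zeros(1), done=False)
>>> null = Adder._get_null_transition(template)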

classmethod get_gae(data: List[Dict[str, Any]], last_value: Tensor, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]][source]
Overview:

Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for calculation.

Arguments:
  • data (list): Transitions list, each element is a transition dict with at least ['value', 'reward'].

  • last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97. When lambda -> 0 the estimate is more biased; when lambda -> 1 it has higher variance due to the sum of terms.

  • cuda (bool): Whether to use cuda in GAE computation

Returns:
  • data (list): transitions list like input one, but each element owns extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)
classmethod get_gae_with_default_last_value(data: deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]][source]
Overview:

Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case when last_value is not passed. If the transition is not done yet, it would use the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would set last_value to 0.

Arguments:
  • data (deque): Transitions list, each element is a transition dict with at least ['value', 'reward']

  • done (bool): Whether the transition reaches the end of an episode(i.e. whether the env is done)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97. When lambda -> 0 the estimate is more biased; when lambda -> 1 it has higher variance due to the sum of terms.

  • cuda (bool): Whether to use cuda in GAE computation

Returns:
  • data (List[Dict[str, Any]]): transitions list like input one, but each element owns extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)
classmethod get_nstep_return_data(data: deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) deque[source]
Overview:

Process raw traj data by updating keys ['next_obs', 'reward', 'done'] in data’s dict element.

Arguments:
  • data (deque): Transitions list, each element is a transition dict

  • nstep (int): Number of steps. If it equals 1, return data directly; otherwise update each transition with the n-step value.

Returns:
  • data (deque): Transitions list like input one, but each element updated with nstep value.

Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)
classmethod get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: dict | None = None) List[Dict[str, Any]][source]
Overview:

Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each dict element of data. If unroll_len equals 1, no processing is needed and data can be returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data (see the usage example below).

Arguments:
  • data (List[Dict[str, Any]]): Transitions list, each element is a transition dict

  • unroll_len (int): Learn training unroll length

  • last_fn_type (str): The method type name for dealing with last residual data in a traj after splitting, should be in [‘last’, ‘drop’, ‘null_padding’]

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • data (List[Dict[str, Any]]): Transitions list processed after unrolling
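
Examples (illustrative usage sketch, not part of the source docstring; the transition fields are hypothetical):
>>> T, B = 6, 4  # trajectory length, batch size
>>> data = [dict(obs=torch.randn(B), reward=torch.randn(1), done=False) for _ in range(T)]
>>> unroll_len = 3
>>> samples = Adder.get_train_sample(data, unroll_len)  # 2 samples of unroll length 3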

get_gae

ding.rl_utils.adder.get_gae(data: List[Dict[str, Any]], last_value: Tensor, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]]
Overview:

Get GAE advantage for stacked transitions (T timesteps, 1 batch). Call gae for calculation.

Arguments:
  • data (list): Transitions list, each element is a transition dict with at least ['value', 'reward'].

  • last_value (torch.Tensor): The last value (i.e. the value at the T+1 timestep)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97. When lambda -> 0 the estimate is more biased; when lambda -> 1 it has higher variance due to the sum of terms.

  • cuda (bool): Whether to use cuda in GAE computation

Returns:
  • data (list): transitions list like input one, but each element owns extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> last_value = torch.randn(B)
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae(data, last_value, gamma, gae_lambda, cuda)

get_gae_with_default_last_value

ding.rl_utils.adder.get_gae_with_default_last_value(data: deque, done: bool, gamma: float, gae_lambda: float, cuda: bool) List[Dict[str, Any]]
Overview:

Like get_gae above, get GAE advantage for stacked transitions. However, this function is designed for the case when last_value is not passed. If the transition is not done yet, it would use the last value in data as last_value, discard the last element in data (i.e. len(data) would decrease by 1), and then call get_gae. Otherwise it would set last_value to 0.

Arguments:
  • data (deque): Transitions list, each element is a transition dict with at least ['value', 'reward']

  • done (bool): Whether the transition reaches the end of an episode(i.e. whether the env is done)

  • gamma (float): The future discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float): GAE lambda parameter, should be in [0, 1], defaults to 0.97. When lambda -> 0 the estimate is more biased; when lambda -> 1 it has higher variance due to the sum of terms.

  • cuda (bool): Whether to use cuda in GAE computation

Returns:
  • data (List[Dict[str, Any]]): transitions list like input one, but each element owns extra advantage key ‘adv’

Examples:
>>> B, T = 2, 3 # batch_size, timestep
>>> data = [dict(value=torch.randn(B), reward=torch.randn(B)) for _ in range(T)]
>>> done = False
>>> gamma = 0.99
>>> gae_lambda = 0.95
>>> cuda = False
>>> data = Adder.get_gae_with_default_last_value(data, done, gamma, gae_lambda, cuda)

get_nstep_return_data

ding.rl_utils.adder.get_nstep_return_data(data: deque, nstep: int, cum_reward=False, correct_terminate_gamma=True, gamma=0.99) deque
Overview:

Process raw traj data by updating keys ['next_obs', 'reward', 'done'] in data’s dict element.

Arguments:
  • data (deque): Transitions list, each element is a transition dict

  • nstep (int): Number of steps. If it equals 1, return data directly; otherwise update each transition with the n-step value.

Returns:
  • data (deque): Transitions list like input one, but each element updated with nstep value.

Examples:
>>> data = [dict(
>>>     obs=torch.randn(B),
>>>     reward=torch.randn(1),
>>>     next_obs=torch.randn(B),
>>>     done=False) for _ in range(T)]
>>> nstep = 2
>>> data = Adder.get_nstep_return_data(data, nstep)

get_train_sample

ding.rl_utils.adder.get_train_sample(data: List[Dict[str, Any]], unroll_len: int, last_fn_type: str = 'last', null_transition: dict | None = None) List[Dict[str, Any]]
Overview:

Process raw trajectory data by updating the keys ['next_obs', 'reward', 'done'] in each dict element of data. If unroll_len equals 1, no processing is needed and data can be returned directly. Otherwise, data will be split according to unroll_len, the residual part will be processed according to last_fn_type, and lists_to_dicts will be called to form the sampled training data.

Arguments:
  • data (List[Dict[str, Any]]): Transitions list, each element is a transition dict

  • unroll_len (int): Learn training unroll length

  • last_fn_type (str): The method type name for dealing with last residual data in a traj after splitting, should be in [‘last’, ‘drop’, ‘null_padding’]

  • null_transition (Optional[dict]): Dict type null transition, used in null_padding

Returns:
  • data (List[Dict[str, Any]]): Transitions list processed after unrolling

beta_function

Please refer to ding/rl_utils/beta_function for more details.

cpw

ding.rl_utils.beta_function.cpw(x: Tensor | float, eta: float = 0.71) Tensor | float[source]
Overview:

The implementation of CPW function.

Arguments:
  • x (Union[torch.Tensor, float]): The input value.

  • eta (float): The hyperparameter of CPW function.

Returns:
  • output (Union[torch.Tensor, float]): The output value.
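
Examples (illustrative sketch, not part of the source docstring):
>>> x = torch.rand(8)
>>> w = cpw(x, eta=0.71)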

CVaR

ding.rl_utils.beta_function.CVaR(x: Tensor | float, eta: float = 0.71) Tensor | float[source]
Overview:

The implementation of CVaR function, which is a risk-averse function.

Arguments:
  • x (Union[torch.Tensor, float]): The input value.

  • eta (float): The hyperparameter of CVaR function.

Returns:
  • output (Union[torch.Tensor, float]): The output value.
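
Examples (illustrative sketch, not part of the source docstring):
>>> x = torch.rand(8)
>>> w = CVaR(x, eta=0.71)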

beta_function_map

rl_utils.beta_function_map = {'CPW': <function cpw>, 'CVaR': <function CVaR>, 'Pow': <function Pow>, 'uniform': <function <lambda>>}
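
Usage (illustrative sketch, not part of the source docs; the keys follow the mapping shown above):
>>> fn = beta_function_map['CPW']
>>> w = fn(torch.rand(8))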

coma

Please refer to ding/rl_utils/coma for more details.

coma_error

ding.rl_utils.coma_error(data: namedtuple, gamma: float, lambda_: float) namedtuple[source]
Overview:

Implementation of COMA

Arguments:
  • data (namedtuple): coma input data with fields shown in coma_data

Returns:
  • coma_loss (namedtuple): the coma loss item, all of which are differentiable 0-dim tensors

Shapes:
  • logit (torch.FloatTensor): \((T, B, A, N)\), where B is batch size, A is the agent num, and N is action dim

  • action (torch.LongTensor): \((T, B, A)\)

  • q_value (torch.FloatTensor): \((T, B, A, N)\)

  • target_q_value (torch.FloatTensor): \((T, B, A, N)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.FloatTensor or None): \((T, B, A)\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> agent_num = 3
>>> data = coma_data(
>>>     logit=torch.randn(2, 3, agent_num, action_dim),
>>>     action=torch.randint(0, action_dim, (2, 3, agent_num)),
>>>     q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     target_q_value=torch.randn(2, 3, agent_num, action_dim),
>>>     reward=torch.randn(2, 3),
>>>     weight=torch.ones(2, 3, agent_num),
>>> )
>>> loss = coma_error(data, 0.99, 0.99)

exploration

Please refer to ding/rl_utils/exploration for more details.

get_epsilon_greedy_fn

ding.rl_utils.exploration.get_epsilon_greedy_fn(start: float, end: float, decay: int, type_: str = 'exp') Callable[source]
Overview:

Generate an epsilon_greedy function with decay, which inputs current timestep and outputs current epsilon.

Arguments:
  • start (float): Epsilon start value. For 'linear', it should be 1.0.

  • end (float): Epsilon end value.

  • decay (int): Controls the speed that epsilon decreases from start to end. We recommend epsilon decays according to env step rather than iteration.

  • type (str): How epsilon decays, now supports ['linear', 'exp'(exponential)] .

Returns:
  • eps_fn (function): The epsilon greedy function with decay.
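
Examples (illustrative usage sketch, not part of the source docstring; the decay value is hypothetical):
>>> eps_fn = get_epsilon_greedy_fn(start=0.95, end=0.05, decay=10000, type_='exp')
>>> eps = eps_fn(1000)  # epsilon value at env step 1000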

BaseNoise

class ding.rl_utils.exploration.BaseNoise[source]
Overview:

Base class for action noise

Interface:

__init__, __call__

Examples:
>>> noise_generator = OUNoise()  # init one type of noise
>>> noise = noise_generator(action.shape, action.device)  # generate noise
abstract __call__(shape: tuple, device: str) Tensor[source]
Overview:

Generate noise according to action tensor’s shape, device.

Arguments:
  • shape (tuple): size of the action tensor, output noise’s size should be the same.

  • device (str): device of the action tensor, output noise’s device should be the same as it.

Returns:
  • noise (torch.Tensor): generated action noise, have the same shape and device with the input action tensor.

__init__() None[source]
Overview:

Initialization method.

GaussianNoise

class ding.rl_utils.exploration.GaussianNoise(mu: float = 0.0, sigma: float = 1.0)[source]
Overview:

Derived class for generating gaussian noise, which satisfies \(X \sim N(\mu, \sigma^2)\)

Interface:

__init__, __call__
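
Examples (illustrative sketch mirroring the BaseNoise example above; the shape and device are hypothetical):
>>> noise_generator = GaussianNoise(mu=0.0, sigma=0.1)
>>> noise = noise_generator((3, 4), 'cpu')  # gaussian noise tensor of shape (3, 4)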

__call__(shape: tuple, device: str) Tensor[source]
Overview:

Generate gaussian noise according to action tensor’s shape, device

Arguments:
  • shape (tuple): size of the action tensor, output noise’s size should be the same

  • device (str): device of the action tensor, output noise’s device should be the same as it

Returns:
  • noise (torch.Tensor): generated action noise, have the same shape and device with the input action tensor

__init__(mu: float = 0.0, sigma: float = 1.0) None[source]
Overview:

Initialize \(\mu\) and \(\sigma\) in Gaussian Distribution.

Arguments:
  • mu (float): \(\mu\) , mean value.

  • sigma (float): \(\sigma\) , standard deviation, should be positive.

OUNoise

class ding.rl_utils.exploration.OUNoise(mu: float = 0.0, sigma: float = 0.3, theta: float = 0.15, dt: float = 0.01, x0: float | Tensor | None = 0.0)[source]
Overview:

Derived class for generating Ornstein-Uhlenbeck process noise. Satisfies \(dx_t=\theta(\mu-x_t)dt + \sigma dW_t\), where \(W_t\) denotes a Wiener process, acting as a random perturbation term.

Interface:

__init__, reset, __call__
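
Examples (illustrative sketch mirroring the BaseNoise example above; the shape and device are hypothetical):
>>> noise_generator = OUNoise(mu=0.0, sigma=0.3, theta=0.15)
>>> noise = noise_generator((3, 4), 'cpu')  # OU noise tensor of shape (3, 4)
>>> noise_generator.reset()  # reset the internal state to x0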

__call__(shape: tuple, device: str, mu: float | None = None) Tensor[source]
Overview:

Generate OU noise according to the action tensor’s shape and device.

Arguments:
  • shape (tuple): The size of the action tensor, output noise’s size should be the same.

  • device (str): The device of the action tensor, output noise’s device should be the same as it.

  • mu (float): The new mean value \(\mu\); you can set it to None if you don’t need it.

Returns:
  • noise (torch.Tensor): generated action noise, have the same shape and device with the input action tensor.

__init__(mu: float = 0.0, sigma: float = 0.3, theta: float = 0.15, dt: float = 0.01, x0: float | Tensor | None = 0.0) None[source]
Overview:

Initialize _alpha \(= \theta \cdot dt\) and _beta \(= \sigma \cdot \sqrt{dt}\) in the Ornstein-Uhlenbeck process.

Arguments:
  • mu (float): \(\mu\) , mean value.

  • sigma (float): \(\sigma\) , standard deviation of the perturbation noise.

  • theta (float): How strongly the noise reacts to perturbations, greater value means stronger reaction.

  • dt (float): The derivative of time t.

  • x0 (Union[float, torch.Tensor]): The initial state of the noise, should be a scalar or tensor with the same shape as the action tensor.

reset() None[source]
Overview:

Reset _x to the initial state _x0.

create_noise_generator

ding.rl_utils.exploration.create_noise_generator(noise_type: str, noise_kwargs: dict) BaseNoise[source]
Overview:

Given the key noise_type, create a new noise generator instance if the key is registered in noise_mapping, otherwise raise a KeyError. In other words, a derived noise generator must first be registered, then create_noise_generator can be called to get the instance object.

Arguments:
  • noise_type (str): the type of noise generator to be created.

Returns:
  • noise (BaseNoise): the created new noise generator, should be an instance of one of noise_mapping’s values.
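
Examples (illustrative sketch; the key name 'gauss' and its kwargs are assumptions, check noise_mapping in ding/rl_utils/exploration for the registered keys):
>>> noise_generator = create_noise_generator('gauss', {'mu': 0.0, 'sigma': 0.1})
>>> noise = noise_generator((3, 4), 'cpu')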

gae

Please refer to ding/rl_utils/gae for more details.

gae_data

class ding.rl_utils.gae.gae_data(value, next_value, reward, done, traj_flag)

shape_fn_gae

ding.rl_utils.gae.shape_fn_gae(args, kwargs)[source]
Overview:

Return shape of gae for hpc

Returns:

shape: [T, B]

gae

ding.rl_utils.gae.gae(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.97) FloatTensor[source]
Overview:

Implementation of Generalized Advantage Estimator (arXiv:1506.02438)

Arguments:
  • data (namedtuple): gae input data with fields [‘value’, ‘reward’], which contains some episodes or trajectories data.

  • gamma (float): the future discount factor, should be in [0, 1], defaults to 0.99.

  • lambda (float): the gae parameter lambda, should be in [0, 1], defaults to 0.97. When lambda -> 0 the estimate is more biased; when lambda -> 1 it has higher variance due to the sum of terms.

Returns:
  • adv (torch.FloatTensor): the calculated advantage

Shapes:
  • value (torch.FloatTensor): \((T, B)\), where T is trajectory length and B is batch size

  • next_value (torch.FloatTensor): \((T, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • adv (torch.FloatTensor): \((T, B)\)

Examples:
>>> value = torch.randn(2, 3)
>>> next_value = torch.randn(2, 3)
>>> reward = torch.randn(2, 3)
>>> data = gae_data(value, next_value, reward, None, None)
>>> adv = gae(data)

isw

Please refer to ding/rl_utils/isw for more details.

compute_importance_weights

ding.rl_utils.isw.compute_importance_weights(target_output: Tensor | dict, behaviour_output: Tensor | dict, action: Tensor, action_space_type: str = 'discrete', requires_grad: bool = False)[source]
Overview:

Computing importance sampling weight with given output and action

Arguments:
  • target_output (Union[torch.Tensor,dict]): the output taking the action by the current policy network, usually this output is network output logit if action space is discrete, or is a dict containing parameters of action distribution if action space is continuous.

  • behaviour_output (Union[torch.Tensor,dict]): the output taking the action by the behaviour policy network, usually this output is network output logit, if action space is discrete, or is a dict containing parameters of action distribution if action space is continuous.

  • action (torch.Tensor): the chosen action(index for the discrete action space) in trajectory, i.e.: behaviour_action

  • action_space_type (str): action space types in [‘discrete’, ‘continuous’]

  • requires_grad (bool): whether requires grad computation

Returns:
  • rhos (torch.Tensor): Importance sampling weight

Shapes:
  • target_output (Union[torch.FloatTensor,dict]): \((T, B, N)\), where T is timestep, B is batch size and N is action dim

  • behaviour_output (Union[torch.FloatTensor,dict]): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • rhos (torch.FloatTensor): \((T, B)\)

Examples:
>>> target_output = torch.randn(2, 3, 4)
>>> behaviour_output = torch.randn(2, 3, 4)
>>> action = torch.randint(0, 4, (2, 3))
>>> rhos = compute_importance_weights(target_output, behaviour_output, action)

ppg

Please refer to ding/rl_utils/ppg for more details.

ppg_data

class ding.rl_utils.ppg.ppg_data(logit_new, logit_old, action, value_new, value_old, return_, weight)

ppg_joint_loss

class ding.rl_utils.ppg.ppg_joint_loss(auxiliary_loss, behavioral_cloning_loss)

ppg_joint_error

ding.rl_utils.ppg.ppg_joint_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) Tuple[namedtuple, namedtuple][source]
Overview:

Get PPG joint loss

Arguments:
  • data (namedtuple): ppg input data with fields shown in ppg_data

  • clip_ratio (float): clip value for ratio

  • use_value_clip (bool): whether use value clip

Returns:
  • ppg_joint_loss (namedtuple): the ppg loss item, all of which are differentiable 0-dim tensors

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B,)\)

  • value_new (torch.FloatTensor): \((B, 1)\)

  • value_old (torch.FloatTensor): \((B, 1)\)

  • return (torch.FloatTensor): \((B, 1)\)

  • weight (torch.FloatTensor): \((B,)\)

  • auxiliary_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • behavioral_cloning_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppg_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3, 1),
>>>     value_old=torch.randn(3, 1),
>>>     return_=torch.randn(3, 1),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppg_joint_error(data, 0.99, 0.99)

ppo

Please refer to ding/rl_utils/ppo for more details.

ppo_data

class ding.rl_utils.ppo.ppo_data(logit_new, logit_old, action, value_new, value_old, adv, return_, weight)

ppo_policy_data

class ding.rl_utils.ppo.ppo_policy_data(logit_new, logit_old, action, adv, weight)

ppo_value_data

class ding.rl_utils.ppo.ppo_value_data(value_new, value_old, return_, weight)

ppo_loss

class ding.rl_utils.ppo.ppo_loss(policy_loss, value_loss, entropy_loss)

ppo_policy_loss

class ding.rl_utils.ppo.ppo_policy_loss(policy_loss, entropy_loss)

ppo_info

class ding.rl_utils.ppo.ppo_info(approx_kl, clipfrac)

shape_fn_ppo

ding.rl_utils.ppo.shape_fn_ppo(args, kwargs)[source]
Overview:

Return shape of ppo for hpc

Returns:

shape: [B, N]

ppo_error

ding.rl_utils.ppo.ppo_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: float | None = None) Tuple[namedtuple, namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • use_value_clip (bool): whether to use clip in value loss with the same ratio as policy

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of which are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of which are Python scalars

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • value_new (torch.FloatTensor): \((B, )\)

  • value_old (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error(data)

Note

adv is expected to already be a normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to calculate this mean and std, e.g. across the data buffer or the training batch, so we don’t couple this part into ppo_error; you can refer to our examples for different ways.

ppo_policy_error

ding.rl_utils.ppo.ppo_policy_error(data: namedtuple, clip_ratio: float = 0.2, dual_clip: float | None = None) Tuple[namedtuple, namedtuple][source]
Overview:

Get PPO policy loss

Arguments:
  • data (namedtuple): ppo input data with fields shown in ppo_policy_data

  • clip_ratio (float): clip value for ratio

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_policy_loss (namedtuple): the ppo policy loss item, all of which are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of which are Python scalars

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_policy_data(
>>>     logit_new=torch.randn(3, action_dim),
>>>     logit_old=torch.randn(3, action_dim),
>>>     action=torch.randint(0, action_dim, (3,)),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error(data)

ppo_value_error

ding.rl_utils.ppo.ppo_value_error(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True) Tensor[source]
Overview:

Get PPO value loss

Arguments:
  • data (namedtuple): ppo input data with fields shown in ppo_value_data

  • clip_ratio (float): clip value for ratio

  • use_value_clip (bool): whether use value clip

Returns:
  • value_loss (torch.FloatTensor): the ppo value loss, a differentiable 0-dim tensor

Shapes:
  • value_new (torch.FloatTensor): \((B, )\), where B is batch size

  • value_old (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • value_loss (torch.FloatTensor): \(()\), 0-dim tensor

Examples:
>>> action_dim = 4
>>> data = ppo_value_data(
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss = ppo_value_error(data)

ppo_error_continuous

ding.rl_utils.ppo.ppo_error_continuous(data: namedtuple, clip_ratio: float = 0.2, use_value_clip: bool = True, dual_clip: float | None = None) Tuple[namedtuple, namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • use_value_clip (bool): whether to use clip in value loss with the same ratio as policy

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of which are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of which are Python scalars

Shapes:
  • mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • value_new (torch.FloatTensor): \((B, )\)

  • value_old (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_data_continuous(
>>>     mu_sigma_new= dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old= dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     value_new=torch.randn(3),
>>>     value_old=torch.randn(3),
>>>     adv=torch.randn(3),
>>>     return_=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_error_continuous(data)

Note

adv is expected to already be a normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to calculate this mean and std, e.g. across the data buffer or the training batch, so we don’t couple this part into ppo_error; you can refer to our examples for different ways.

ppo_policy_error_continuous

ding.rl_utils.ppo.ppo_policy_error_continuous(data: namedtuple, clip_ratio: float = 0.2, dual_clip: float | None = None) Tuple[namedtuple, namedtuple][source]
Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with dual_clip

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio for the constraint of policy update, defaults to 0.2

  • dual_clip (float): a parameter c mentioned in arXiv:1912.09729 Equ. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None

Returns:
  • ppo_loss (namedtuple): the ppo loss item, all of which are differentiable 0-dim tensors

  • ppo_info (namedtuple): the ppo optim information for monitoring, all of which are Python scalars

Shapes:
  • mu_sigma_new (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • mu_sigma_old (tuple): \(((B, N), (B, N))\), where B is batch size and N is action dim

  • action (torch.LongTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • entropy_loss (torch.FloatTensor): \(()\)

Examples:
>>> action_dim = 4
>>> data = ppo_policy_data_continuous(
>>>     mu_sigma_new=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     mu_sigma_old=dict(mu=torch.randn(3, action_dim), sigma=torch.randn(3, action_dim)**2),
>>>     action=torch.randn(3, action_dim),
>>>     adv=torch.randn(3),
>>>     weight=torch.ones(3),
>>> )
>>> loss, info = ppo_policy_error_continuous(data)

retrace

Please refer to ding/rl_utils/retrace for more details.

compute_q_retraces

ding.rl_utils.retrace.compute_q_retraces(q_values: Tensor, v_pred: Tensor, rewards: Tensor, actions: Tensor, weights: Tensor, ratio: Tensor, gamma: float = 0.9) Tensor[source]
Shapes:
  • q_values (torch.Tensor): \((T + 1, B, N)\), where T is unroll_len, B is batch size, N is discrete action dim.

  • v_pred (torch.Tensor): \((T + 1, B, 1)\)

  • rewards (torch.Tensor): \((T, B)\)

  • actions (torch.Tensor): \((T, B)\)

  • weights (torch.Tensor): \((T, B)\)

  • ratio (torch.Tensor): \((T, B, N)\)

  • q_retraces (torch.Tensor): \((T + 1, B, 1)\)

Examples:
>>> T=2
>>> B=3
>>> N=4
>>> q_values=torch.randn(T+1, B, N)
>>> v_pred=torch.randn(T+1, B, 1)
>>> rewards=torch.randn(T, B)
>>> actions=torch.randint(0, N, (T, B))
>>> weights=torch.ones(T, B)
>>> ratio=torch.randn(T, B, N)
>>> q_retraces = compute_q_retraces(q_values, v_pred, rewards, actions, weights, ratio)

Note

The q_retrace operation doesn’t need to compute gradient; it only executes forward computation.

sampler

Please refer to ding/rl_utils/sampler for more details.

ArgmaxSampler

class ding.rl_utils.sampler.ArgmaxSampler[source]
Overview:

Argmax sampler, return the index of the maximum value

__call__(logit: Tensor) Tensor[source]
Overview:

Return the index of the maximum value

Arguments:
  • logit (torch.Tensor): The input tensor

Returns:
  • action (torch.Tensor): The index of the maximum value
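
Examples (illustrative sketch, not part of the source docstring):
>>> sampler = ArgmaxSampler()
>>> logit = torch.randn(4, 6)
>>> action = sampler(logit)  # index of the maximum value per row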

MultinomialSampler

class ding.rl_utils.sampler.MultinomialSampler[source]
Overview:

Multinomial sampler, return the index of the sampled value

__call__(logit: Tensor) Tensor[source]
Overview:

Return the index of the sampled value

Arguments:
  • logit (torch.Tensor): The input tensor

Returns:
  • action (torch.Tensor): The index of the sampled value
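
Examples (illustrative sketch, not part of the source docstring):
>>> sampler = MultinomialSampler()
>>> logit = torch.randn(4, 6)
>>> action = sampler(logit)  # sampled index per row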

MuSampler

class ding.rl_utils.sampler.MuSampler[source]
Overview:

Mu sampler, return the mu of the input tensor

__call__(logit: Tensor) Tensor[source]
Overview:

Return the mu of the input tensor

Arguments:
  • logit (ttorch.Tensor): The input tensor

Returns:
  • action (torch.Tensor): The mu of the input tensor

ReparameterizationSampler

class ding.rl_utils.sampler.ReparameterizationSampler[source]
Overview:

Reparameterization sampler, return the reparameterized value of the input tensor

__call__(logit: Tensor) Tensor[source]
Overview:

Return the reparameterized value of the input tensor

Arguments:
  • logit (ttorch.Tensor): The input tensor

Returns:
  • action (torch.Tensor): The reparameterized value of the input tensor

HybridStochasticSampler

class ding.rl_utils.sampler.HybridStochasticSampler[source]
Overview:

Hybrid stochastic sampler, return the sampled action type and the reparameterized action args

__call__(logit: Tensor) Tensor[source]
Overview:

Return the sampled action type and the reparameterized action args

Arguments:
  • logit (ttorch.Tensor): The input tensor

Returns:
  • action (ttorch.Tensor): The sampled action type and the reparameterized action args

HybridDeterminsticSampler

class ding.rl_utils.sampler.HybridDeterminsticSampler[source]
Overview:

Hybrid deterministic sampler, return the argmax action type and the mu action args

__call__(logit: Tensor) Tensor[source]
Overview:

Return the argmax action type and the mu action args

Arguments:
  • logit (ttorch.Tensor): The input tensor

Returns:
  • action (ttorch.Tensor): The argmax action type and the mu action args

td

Please refer to ding/rl_utils/td for more details.

q_1step_td_data

class ding.rl_utils.td.q_1step_td_data(q, next_q, act, next_act, reward, done, weight)

q_1step_td_error

ding.rl_utils.td.q_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

1 step td_error, support single agent case and multi agent case.

Arguments:
  • data (q_1step_td_data): The input data, q_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error

Shapes:
  • data (q_1step_td_data): the q_1step_td_data containing [‘q’, ‘next_q’, ‘act’, ‘next_act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • act (torch.LongTensor): \((B, )\)

  • next_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     next_act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)).bool(),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_1step_td_error(data, 0.99)

m_q_1step_td_data

class ding.rl_utils.td.m_q_1step_td_data(q, target_q, next_q, act, reward, done, weight)

m_q_1step_td_error

ding.rl_utils.td.m_q_1step_td_error(data: namedtuple, gamma: float, tau: float, alpha: float, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

Munchausen td_error for DQN algorithm, support 1 step td error.

Arguments:
  • data (m_q_1step_td_data): The input data, m_q_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • tau (float): Entropy factor for Munchausen DQN

  • alpha (float): Discount factor for Munchausen term

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (m_q_1step_td_data): the m_q_1step_td_data containing [‘q’, ‘target_q’, ‘next_q’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • target_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = m_q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     target_q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = m_q_1step_td_error(data, 0.99, 0.01, 0.01)

q_v_1step_td_data

class ding.rl_utils.td.q_v_1step_td_data(q, v, act, reward, done, weight)

q_v_1step_td_error

ding.rl_utils.td.q_v_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

td_error between q and v value for SAC algorithm, support 1 step td error.

Arguments:
  • data (q_v_1step_td_data): The input data, q_v_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (q_v_1step_td_data): the q_v_1step_td_data containing [‘q’, ‘v’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • v (torch.FloatTensor): \((B, )\)

  • act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> action_dim = 4
>>> data = q_v_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     v=torch.randn(3),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = q_v_1step_td_error(data, 0.99)

nstep_return_data

class ding.rl_utils.td.nstep_return_data(reward, next_value, done)

nstep_return

ding.rl_utils.td.nstep_return(data: namedtuple, gamma: float | list, nstep: int, value_gamma: Tensor | None = None)[source]
Overview:

Calculate nstep return for DQN algorithm, support single agent case and multi agent case.

Arguments:
  • data (nstep_return_data): The input data, nstep_return_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num

  • value_gamma (torch.Tensor): Discount factor for value

Returns:
  • return (torch.Tensor): nstep return

Shapes:
  • data (nstep_return_data): the nstep_return_data containing [‘reward’, ‘next_value’, ‘done’]

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • next_value (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> data = nstep_return_data(
>>>     reward=torch.randn(3, 3),
>>>     next_value=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>> )
>>> loss = nstep_return(data, 0.99, 3)

dist_1step_td_data

class ding.rl_utils.td.dist_1step_td_data(dist, next_dist, act, next_act, reward, done, weight)

dist_1step_td_error

ding.rl_utils.td.dist_1step_td_error(data: namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int) Tensor[source]
Overview:

1 step td_error for distributional q-learning based algorithm

Arguments:
  • data (dist_1step_td_data): The input data, dist_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • v_min (float): The min value of support

  • v_max (float): The max value of support

  • n_atom (int): The num of atom

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (dist_1step_td_data): the dist_1step_td_data containing [‘dist’, ‘next_dist’, ‘act’, ‘next_act’, ‘reward’, ‘done’, ‘weight’]

  • dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]

  • next_dist (torch.FloatTensor): \((B, N, n_atom)\)

  • act (torch.LongTensor): \((B, )\)

  • next_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_dist = torch.randn(4, 3, 51).abs()
>>> act = torch.randint(0, 3, (4,))
>>> next_act = torch.randint(0, 3, (4,))
>>> reward = torch.randn(4)
>>> done = torch.randint(0, 2, (4,))
>>> data = dist_1step_td_data(dist, next_dist, act, next_act, reward, done, None)
>>> loss = dist_1step_td_error(data, 0.99, -10.0, 10.0, 51)

dist_nstep_td_data

ding.rl_utils.td.dist_nstep_td_data

alias of dist_1step_td_data

shape_fn_dntd

ding.rl_utils.td.shape_fn_dntd(args, kwargs)[source]
Overview:

Return dntd shape for hpc

Returns:

shape: [T, B, N, n_atom]

dist_nstep_td_error

ding.rl_utils.td.dist_nstep_td_error(data: namedtuple, gamma: float, v_min: float, v_max: float, n_atom: int, nstep: int = 1, value_gamma: Tensor | None = None) Tensor[source]
Overview:

Multistep (1 step or n step) td_error for distributional q-learning based algorithm, support single agent case and multi agent case.

Arguments:
  • data (dist_nstep_td_data): The input data, dist_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (dist_nstep_td_data): the dist_nstep_td_data containing [‘dist’, ‘next_n_dist’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • dist (torch.FloatTensor): \((B, N, n_atom)\) i.e. [batch_size, action_dim, n_atom]

  • next_n_dist (torch.FloatTensor): \((B, N, n_atom)\)

  • act (torch.LongTensor): \((B, )\)

  • next_n_act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_n_dist = torch.randn(4, 3, 51).abs()
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> reward = torch.randn(5, 4)
>>> data = dist_nstep_td_data(dist, next_n_dist, action, next_action, reward, done, None)
>>> loss, _ = dist_nstep_td_error(data, 0.95, -10.0, 10.0, 51, 5)

v_1step_td_data

class ding.rl_utils.td.v_1step_td_data(v, next_v, reward, done, weight)

v_1step_td_error

ding.rl_utils.td.v_1step_td_error(data: namedtuple, gamma: float, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

1 step td_error for distributed value based algorithm

Arguments:
  • data (v_1step_td_data): The input data, v_1step_td_data to calculate loss

  • gamma (float): Discount factor

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (v_1step_td_data): the v_1step_td_data containing [‘v’, ‘next_v’, ‘reward’, ‘done’, ‘weight’]

  • v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]

  • next_v (torch.FloatTensor): \((B, )\)

  • reward (torch.FloatTensor): \((, B)\)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5)
>>> done = torch.zeros(5)
>>> data = v_1step_td_data(v, next_v, reward, done, None)
>>> loss, td_error_per_sample = v_1step_td_error(data, 0.99)

v_nstep_td_data

class ding.rl_utils.td.v_nstep_td_data(v, next_n_v, reward, done, weight, value_gamma)

v_nstep_td_error

ding.rl_utils.td.v_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

Multistep (n step) td_error for distributed value based algorithm

Arguments:
  • data (v_nstep_td_data): The input data, v_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (v_nstep_td_data): The v_nstep_td_data containing [‘v’, ‘next_n_v’, ‘reward’, ‘done’, ‘weight’, ‘value_gamma’]

  • v (torch.FloatTensor): \((B, )\) i.e. [batch_size, ]

  • next_n_v (torch.FloatTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

  • value_gamma (torch.Tensor): If the remaining data in the buffer is less than n_step we use value_gamma as the gamma discount value for next_v rather than gamma**n_step

Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5, 5)
>>> done = torch.zeros(5)
>>> data = v_nstep_td_data(v, next_v, reward, done, 0.9, 0.99)
>>> loss, td_error_per_sample = v_nstep_td_error(data, 0.99, 5)

q_nstep_td_data

class ding.rl_utils.td.q_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, weight)

dqfd_nstep_td_data

class ding.rl_utils.td.dqfd_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, done_one_step, weight, new_n_q_one_step, next_n_action_one_step, is_expert)

shape_fn_qntd

ding.rl_utils.td.shape_fn_qntd(args, kwargs)[source]
Overview:

Return qntd shape for hpc

Returns:

shape: [T, B, N]

q_nstep_td_error

ding.rl_utils.td.q_nstep_td_error(data: namedtuple, gamma: float | list, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

Multistep (1 step or n step) td_error for q-learning based algorithm

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep =3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = q_nstep_td_error(data, 0.95, nstep=nstep)

bdq_nstep_td_error

ding.rl_utils.td.bdq_nstep_td_error(data: namedtuple, gamma: float | list, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

Multistep (1 step or n step) td_error for the BDQ algorithm, referenced paper "Action Branching Architectures for Deep Reinforcement Learning", link: https://arxiv.org/pdf/1711.08946. The original paper only provides the 1-step TD-error calculation method; here we extend it to the n-step case.

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, D, N)\) i.e. [batch_size, branch_num, action_bins_per_branch]

  • next_n_q (torch.FloatTensor): \((B, D, N)\)

  • action (torch.LongTensor): \((B, D)\)

  • next_n_action (torch.LongTensor): \((B, D)\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> action_per_branch = 3
>>> next_q = torch.randn(8, 6, action_per_branch)
>>> done = torch.randn(8)
>>> action = torch.randint(0, action_per_branch, size=(8, 6))
>>> next_action = torch.randint(0, action_per_branch, size=(8, 6))
>>> nstep =3
>>> q = torch.randn(8, 6, action_per_branch).requires_grad_(True)
>>> reward = torch.rand(nstep, 8)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = bdq_nstep_td_error(data, 0.95, nstep=nstep)

shape_fn_qntd_rescale

ding.rl_utils.td.shape_fn_qntd_rescale(args, kwargs)[source]
Overview:

Return qntd_rescale shape for hpc

Returns:

shape: [T, B, N]

q_nstep_td_error_with_rescale

ding.rl_utils.td.q_nstep_td_error_with_rescale(data: namedtuple, gamma: float | list, nstep: int = 1, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) Tensor[source]
Overview:

Multistep (1 step or n step) td_error with value rescaling

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

  • criterion (torch.nn.modules): Loss function criterion

  • trans_fn (Callable): Value transform function, defaults to value_transform (refer to rl_utils/value_rescale.py)

  • inv_trans_fn (Callable): Value inverse transform function, defaults to value_inv_transform (refer to rl_utils/value_rescale.py)

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, _ = q_nstep_td_error_with_rescale(data, 0.95, nstep=nstep)

dqfd_nstep_td_error

ding.rl_utils.td.dqfd_nstep_td_error(data: namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, margin_function: float, lambda_one_step_td: float = 1.0, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

Multistep n step td_error + 1 step td_error + supervised margin loss for DQfD

Arguments:
  • data (dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss

  • gamma (float): discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 10

Returns:
  • loss (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensor

  • td_error_per_sample (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor

Shapes:
  • data (dqfd_nstep_td_data): The dqfd_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘weight’, ‘new_n_q_one_step’, ‘next_n_action_one_step’, ‘is_expert’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

  • new_n_q_one_step (torch.FloatTensor): \((B, N)\)

  • next_n_action_one_step (torch.LongTensor): \((B, )\)

  • is_expert (int): 0 or 1

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> loss, td_error_per_sample, loss_statistics = dqfd_nstep_td_error(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     margin_function=0.8, nstep=nstep
>>> )

dqfd_nstep_td_error_with_rescale

ding.rl_utils.td.dqfd_nstep_td_error_with_rescale(data: namedtuple, gamma: float, lambda_n_step_td: float, lambda_supervised_loss: float, lambda_one_step_td: float, margin_function: float, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss(), trans_fn: Callable = value_transform, inv_trans_fn: Callable = value_inv_transform) Tensor[source]
Overview:

Multistep n step td_error + 1 step td_error + supervised margin loss for DQfD, with value rescaling

Arguments:
  • data (dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 10

Returns:
  • loss (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 0-dim tensor

  • td_error_per_sample (torch.Tensor): Multistep n step td_error + 1 step td_error + supervised margin loss, 1-dim tensor

Shapes:
  • data (dqfd_nstep_td_data): The dqfd_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘next_n_action’, ‘reward’, ‘done’, ‘weight’, ‘new_n_q_one_step’, ‘next_n_action_one_step’, ‘is_expert’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

  • new_n_q_one_step (torch.FloatTensor): \((B, N)\)

  • next_n_action_one_step (torch.LongTensor): \((B, )\)

  • is_expert (int): 0 or 1
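
No example is documented for this variant; the following sketch mirrors the dqfd_nstep_td_error example above, and the returned tuple is left packed because only loss and td_error_per_sample are documented here (input construction is an assumption, not taken from the library's tests):
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> done_1 = torch.randn(4)
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>>     q, next_q, action, next_action, reward, done, done_1, None,
>>>     next_q_one_step, next_action_one_step, is_expert
>>> )
>>> outputs = dqfd_nstep_td_error_with_rescale(
>>>     data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>>     lambda_one_step_td=1, margin_function=0.8, nstep=nstep
>>> )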

qrdqn_nstep_td_data

class ding.rl_utils.td.qrdqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, tau, weight)

qrdqn_nstep_td_error

ding.rl_utils.td.qrdqn_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, value_gamma: Tensor | None = None) Tensor[source]
Overview:

Multistep (1 step or n step) td_error in QRDQN

Arguments:
  • data (qrdqn_nstep_td_data): The input data, qrdqn_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((tau, B, N)\) i.e. [tau, batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((tau', B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = qrdqn_nstep_td_data(q, next_q, action, next_action, reward, done, 3, None)
>>> loss, td_error_per_sample = qrdqn_nstep_td_error(data, 0.95, nstep=nstep)

q_nstep_sql_td_error

ding.rl_utils.td.q_nstep_sql_td_error(data: namedtuple, gamma: float, alpha: float, nstep: int = 1, cum_reward: bool = False, value_gamma: Tensor | None = None, criterion: torch.nn.modules = MSELoss()) Tensor[source]
Overview:

Multistep (1 step or n step) td_error for q-learning based algorithm

Arguments:
  • data (q_nstep_td_data): The input data, q_nstep_sql_td_data to calculate loss

  • gamma (float): Discount factor

  • alpha (float): A parameter to weight the entropy term in the soft Q-learning objective

  • cum_reward (bool): Whether to use cumulative nstep reward, which is figured out when collecting data

  • value_gamma (torch.Tensor): Gamma discount value for target soft_q_value

  • criterion (torch.nn.modules): Loss function criterion

  • nstep (int): nstep num, default set to 1

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

  • td_error_per_sample (torch.Tensor): nstep td error, 1-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • td_error_per_sample (torch.FloatTensor): \((B, )\)

Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample, record_target_v = q_nstep_sql_td_error(data, 0.95, 1.0, nstep=nstep)

iqn_nstep_td_data

class ding.rl_utils.td.iqn_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, replay_quantiles, weight)

iqn_nstep_td_error

ding.rl_utils.td.iqn_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, kappa: float = 1.0, value_gamma: Tensor | None = None) Tensor[source]
Overview:

Multistep (1 step or n step) td_error in IQN, referenced paper Implicit Quantile Networks for Distributional Reinforcement Learning, link: https://arxiv.org/pdf/1806.06923.pdf

Arguments:
  • data (iqn_nstep_td_data): The input data, iqn_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

  • criterion (torch.nn.modules): Loss function criterion

  • beta_function (Callable): The risk function

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((tau, B, N)\) i.e. [tau, batch_size, action_dim]

  • next_n_q (torch.FloatTensor): \((tau', B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

Examples:
>>> next_q = torch.randn(3, 4, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(3, 4, 3).requires_grad_(True)
>>> replay_quantile = torch.randn([3, 4, 1])
>>> reward = torch.rand(nstep, 4)
>>> data = iqn_nstep_td_data(q, next_q, action, next_action, reward, done, replay_quantile, None)
>>> loss, td_error_per_sample = iqn_nstep_td_error(data, 0.95, nstep=nstep)

fqf_nstep_td_data

class ding.rl_utils.td.fqf_nstep_td_data(q, next_n_q, action, next_n_action, reward, done, quantiles_hats, weight)

fqf_nstep_td_error

ding.rl_utils.td.fqf_nstep_td_error(data: namedtuple, gamma: float, nstep: int = 1, kappa: float = 1.0, value_gamma: Tensor | None = None) Tensor[source]
Overview:

Multistep (1 step or n step) td_error in FQF, referenced paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning, link: https://arxiv.org/pdf/1911.02140.pdf

Arguments:
  • data (fqf_nstep_td_data): The input data, fqf_nstep_td_data to calculate loss

  • gamma (float): Discount factor

  • nstep (int): nstep num, default set to 1

  • criterion (torch.nn.modules): Loss function criterion

  • beta_function (Callable): The risk function

Returns:
  • loss (torch.Tensor): nstep td error, 0-dim tensor

Shapes:
  • data (q_nstep_td_data): The q_nstep_td_data containing [‘q’, ‘next_n_q’, ‘action’, ‘reward’, ‘done’]

  • q (torch.FloatTensor): \((B, tau, N)\) i.e. [batch_size, tau, action_dim]

  • next_n_q (torch.FloatTensor): \((B, tau', N)\)

  • action (torch.LongTensor): \((B, )\)

  • next_n_action (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \((T, B)\), where T is timestep(nstep)

  • done (torch.BoolTensor) \((B, )\), whether done in last timestep

  • quantiles_hats (torch.FloatTensor): \((B, tau)\)

Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randn(4)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> quantiles_hats = torch.randn([4, 3])
>>> reward = torch.rand(nstep, 4)
>>> data = fqf_nstep_td_data(q, next_q, action, next_action, reward, done, quantiles_hats, None)
>>> loss, td_error_per_sample = fqf_nstep_td_error(data, 0.95, nstep=nstep)

evaluate_quantile_at_action

ding.rl_utils.td.evaluate_quantile_at_action(q_s, actions)[source]
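
This helper is undocumented here; as an assumption (not confirmed by the signature above), it gathers the quantile values of the chosen action, which would make a minimal sketch look like:
>>> q_s = torch.randn(4, 8, 3)                        # assumed layout: (batch_size, num_quantiles, action_dim)
>>> actions = torch.randint(0, 3, size=(4, ))
>>> q_sa = evaluate_quantile_at_action(q_s, actions)  # assumed result shape: (4, 8, 1)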

fqf_calculate_fraction_loss

ding.rl_utils.td.fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)[source]
Overview:

Calculate the fraction loss in FQF, referenced paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning, link: https://arxiv.org/pdf/1911.02140.pdf

Arguments:
  • q_tau_i (torch.FloatTensor): \((batch_size, num_quantiles-1, action_dim)\)

  • q_value (torch.FloatTensor): \((batch_size, num_quantiles, action_dim)\)

  • quantiles (torch.FloatTensor): \((batch_size, num_quantiles+1)\)

  • actions (torch.LongTensor): \((batch_size, )\)

Returns:
  • fraction_loss (torch.Tensor): fraction loss, 0-dim tensor

td_lambda_data

class ding.rl_utils.td.td_lambda_data(value, reward, weight)

shape_fn_td_lambda

ding.rl_utils.td.shape_fn_td_lambda(args, kwargs)[source]
Overview:

Return td_lambda shape for hpc

Returns:

shape: [T, B]

td_lambda_error

ding.rl_utils.td.td_lambda_error(data: namedtuple, gamma: float = 0.9, lambda_: float = 0.8) Tensor[source]
Overview:

Computing TD(lambda) loss given constant gamma and lambda. There is no special handling for the terminal state value: if some state has reached the terminal, just fill in zeros for values and rewards beyond the terminal (including the terminal state itself, i.e. values[terminal] should also be 0).

Arguments:
  • data (namedtuple): td_lambda input data with fields [‘value’, ‘reward’, ‘weight’]

  • gamma (float): Constant discount factor gamma, should be in [0, 1], defaults to 0.9

  • lambda (float): Constant lambda, should be in [0, 1], defaults to 0.8

Returns:
  • loss (torch.Tensor): Computed MSE loss, averaged over the batch

Shapes:
  • value (torch.FloatTensor): \((T+1, B)\), where T is trajectory length and B is batch, which is the estimation of the state value at step 0 to T

  • reward (torch.FloatTensor): \((T, B)\), the returns from time step 0 to T-1

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

  • loss (torch.FloatTensor): \(()\), 0-dim tensor

Examples:
>>> T, B = 8, 4
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> loss = td_lambda_error(td_lambda_data(value, reward, None))

generalized_lambda_returns

ding.rl_utils.td.generalized_lambda_returns(bootstrap_values: Tensor, rewards: Tensor, gammas: float, lambda_: float, done: Tensor | None = None) Tensor[source]
Overview:

Functional equivalent of trfl.value_ops.generalized_lambda_returns (https://github.com/deepmind/trfl/blob/2c07ac22512a16715cc759f0072be43a5d12ae45/trfl/value_ops.py#L74). Passing in a number instead of a tensor makes that value constant for all samples in the batch.

Arguments:
  • bootstrap_values (torch.Tensor or float): estimation of the value at step 0 to T, of size [T_traj+1, batchsize]

  • rewards (torch.Tensor): The returns from 0 to T-1, of size [T_traj, batchsize]

  • gammas (torch.Tensor or float): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]

  • lambda (torch.Tensor or float): Determining the mix of bootstrapping vs further accumulation of multistep returns at each timestep, of size [T_traj, batchsize]

  • done (torch.Tensor or float): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]

Returns:
  • return (torch.Tensor): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
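
A minimal usage sketch, passing scalar gamma and lambda so they are constant for all samples, as described above (input values are arbitrary):
>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T + 1, B)
>>> rewards = torch.randn(T, B)
>>> returns = generalized_lambda_returns(bootstrap_values, rewards, 0.99, 0.95)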

multistep_forward_view

ding.rl_utils.td.multistep_forward_view(bootstrap_values: Tensor, rewards: Tensor, gammas: float, lambda_: float, done: Tensor | None = None) Tensor[source]
Overview:

Same as trfl.sequence_ops.multistep_forward_view, which implements (12.18) in Sutton & Barto. The first dim of the input tensors is assumed to be the trajectory (time) index and the second the batch index.

Note

result[T-1] = rewards[T-1] + gammas[T-1] * bootstrap_values[T]

for t in 0 … T-2: result[t] = rewards[t] + gammas[t] * (lambdas[t] * result[t+1] + (1 - lambdas[t]) * bootstrap_values[t+1])

Arguments:
  • bootstrap_values (torch.Tensor): Estimation of the value at step 1 to T, of size [T_traj, batchsize]

  • rewards (torch.Tensor): The returns from 0 to T-1, of size [T_traj, batchsize]

  • gammas (torch.Tensor): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]

  • lambda (torch.Tensor): Determining the mix of bootstrapping vs further accumulation of multistep returns at each timestep of size [T_traj, batchsize], the element for T-1 is ignored and effectively set to 0, as there is no information about future rewards.

  • done (torch.Tensor or float): Whether the episode done at current step (from 0 to T-1), of size [T_traj, batchsize]

Returns:
  • ret (torch.Tensor): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
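
A minimal usage sketch with per-step tensors for gammas and lambda_, matching the documented [T_traj, batchsize] shapes (constant values are used only for illustration):
>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T, B)
>>> rewards = torch.randn(T, B)
>>> gammas = torch.full((T, B), 0.99)
>>> lambda_ = torch.full((T, B), 0.95)
>>> ret = multistep_forward_view(bootstrap_values, rewards, gammas, lambda_)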

upgo

Please refer to ding/rl_utils/upgo for more details.

upgo_returns

ding.rl_utils.upgo.upgo_returns(rewards: Tensor, bootstrap_values: Tensor) Tensor[source]
Overview:

Computing UPGO return targets. Also notice there is no special handling for the terminal state.

Arguments:
  • rewards (torch.Tensor): the returns from time step 0 to T-1, of size [T_traj, batchsize]

  • bootstrap_values (torch.Tensor): estimation of the state value at step 0 to T, of size [T_traj+1, batchsize]

Returns:
  • ret (torch.Tensor): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]

Examples:
>>> T, B, N, N2 = 4, 8, 5, 7
>>> rewards = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T + 1, B).requires_grad_(True)
>>> returns = upgo_returns(rewards, bootstrap_values)

upgo_loss

ding.rl_utils.upgo.upgo_loss(target_output: Tensor, rhos: Tensor, action: Tensor, rewards: Tensor, bootstrap_values: Tensor, mask=None) Tensor[source]
Overview:

Computing UPGO loss given constant gamma and lambda. There is no special handling for the terminal state value; if the last state in the trajectory is the terminal state, just pass 0 as the bootstrap terminal value.

Arguments:
  • target_output (torch.Tensor): the output computed by the target policy network, of size [T_traj, batchsize, n_output]

  • rhos (torch.Tensor): the importance sampling ratio, of size [T_traj, batchsize]

  • action (torch.Tensor): the action taken, of size [T_traj, batchsize]

  • rewards (torch.Tensor): the returns from time step 0 to T-1, of size [T_traj, batchsize]

  • bootstrap_values (torch.Tensor): estimation of the state value at step 0 to T, of size [T_traj+1, batchsize]

Returns:
  • loss (torch.Tensor): Computed importance sampled UPGO loss, averaged over the samples, of size []

Examples:
>>> T, B, N = 4, 8, 5
>>> rhos = torch.randn(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> action = torch.randint(0, N, size=(T, B))
>>> rewards, bootstrap_values = torch.randn(T, B), torch.randn(T + 1, B)
>>> loss = upgo_loss(target_output, rhos, action, rewards, bootstrap_values)

value_rescale

Please refer to ding/rl_utils/value_rescale for more details.

value_transform

ding.rl_utils.value_rescale.value_transform(x: Tensor, eps: float = 0.01) Tensor[source]
Overview:

A function to reduce the scale of the action-value function: \(h(x) = \operatorname{sign}(x)(\sqrt{|x|+1} - 1) + \epsilon x\).

Arguments:
  • x: (torch.Tensor) The input tensor to be normalized.

  • eps: (float) The coefficient of the additive regularization term to ensure inverse function is Lipschitz continuous

Returns:
  • (torch.Tensor) Normalized tensor.

Note

Observe and Look Further: Achieving Consistent Performance on Atari (https://arxiv.org/abs/1805.11593).

value_inv_transform

ding.rl_utils.value_rescale.value_inv_transform(x: Tensor, eps: float = 0.01) Tensor[source]
Overview:

The inverse form of value rescale: \(h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1+4\epsilon(|x|+1+\epsilon)}-1}{2\epsilon}\right)^{2}-1\right)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be unnormalized.

  • eps: (float) The coefficient of the additive regularization term to ensure inverse function is Lipschitz continuous

Returns:
  • (torch.Tensor) Unnormalized tensor.
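
A quick round-trip sketch (assuming the default eps=0.01) showing that value_inv_transform undoes value_transform up to small numerical error:
>>> x = torch.randn(4)
>>> assert torch.allclose(value_inv_transform(value_transform(x)), x, atol=1e-4)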

symlog

ding.rl_utils.value_rescale.symlog(x: Tensor) Tensor[source]
Overview:

A function to normalize the targets: \(\operatorname{symlog}(x) = \operatorname{sign}(x)\ln(|x|+1)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be normalized.

Returns:
  • (torch.Tensor) Normalized tensor.

Note

Mastering Diverse Domains through World Models (https://arxiv.org/abs/2301.04104)

inv_symlog

ding.rl_utils.value_rescale.inv_symlog(x: Tensor) Tensor[source]
Overview:

The inverse form of symlog: \(\operatorname{symexp}(x) = \operatorname{sign}(x)(\exp(|x|)-1)\).

Arguments:
  • x: (torch.Tensor) The input tensor to be unnormalized.

Returns:
  • (torch.Tensor) Unnormalized tensor.
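
A quick round-trip sketch showing that inv_symlog undoes symlog up to floating-point error:
>>> x = torch.randn(4)
>>> assert torch.allclose(inv_symlog(symlog(x)), x, atol=1e-5)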

vtrace

Please refer to ding/rl_utils/vtrace for more details.

vtrace_nstep_return

ding.rl_utils.vtrace.vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values, gamma=0.99, lambda_=0.95)[source]
Overview:

Computation of vtrace return.

Returns:
  • vtrace_return (torch.FloatTensor): the computed v-trace n-step return (the baseline target \(v_s\)), a differentiable tensor of shape \((T, B)\)

Shapes:
  • clipped_rhos (torch.FloatTensor): \((T, B)\), where T is timestep, B is batch size

  • clipped_cs (torch.FloatTensor): \((T, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • bootstrap_values (torch.FloatTensor): \((T+1, B)\)

  • vtrace_return (torch.FloatTensor): \((T, B)\)
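
A minimal usage sketch following the documented shapes; the clipped importance weights are assumed to be precomputed and random values are used only to show the call:
>>> T, B = 4, 8
>>> clipped_rhos = torch.rand(T, B)
>>> clipped_cs = torch.rand(T, B)
>>> reward = torch.rand(T, B)
>>> bootstrap_values = torch.randn(T + 1, B)
>>> ret = vtrace_nstep_return(clipped_rhos, clipped_cs, reward, bootstrap_values, gamma=0.99, lambda_=0.95)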

vtrace_advantage

ding.rl_utils.vtrace.vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, gamma)[source]
Overview:

Computation of vtrace advantage.

Returns:
  • vtrace_advantage (torch.FloatTensor): the computed v-trace advantage used in the policy gradient, a differentiable tensor of shape \((T, B)\)

Shapes:
  • clipped_pg_rhos (torch.FloatTensor): \((T, B)\), where T is timestep, B is batch size

  • reward (torch.FloatTensor): \((T, B)\)

  • return (torch.FloatTensor): \((T, B)\)

  • bootstrap_values (torch.FloatTensor): \((T, B)\)

  • vtrace_advantage (torch.FloatTensor): \((T, B)\)
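
A minimal usage sketch following the documented shapes; return_ is typically the v-trace return from vtrace_nstep_return, but random tensors are used here only to show the call signature:
>>> T, B = 4, 8
>>> clipped_pg_rhos = torch.rand(T, B)
>>> reward = torch.rand(T, B)
>>> return_ = torch.randn(T, B)
>>> bootstrap_values = torch.randn(T, B)
>>> adv = vtrace_advantage(clipped_pg_rhos, reward, return_, bootstrap_values, gamma=0.99)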

vtrace_data

class ding.rl_utils.vtrace.vtrace_data(target_output, behaviour_output, action, value, reward, weight)

vtrace_loss

class ding.rl_utils.vtrace.vtrace_loss(policy_loss, value_loss, entropy_loss)

vtrace_error_discrete_action

ding.rl_utils.vtrace.vtrace_error_discrete_action(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[source]
Overview:

Implementation of v-trace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for discrete action space

Arguments:
  • data (namedtuple): input data with fields shown in vtrace_data
    • target_output (torch.Tensor): the output of the current policy network for the taken action, usually the network output logit

    • behaviour_output (torch.Tensor): the output of the behaviour policy network for the taken action, usually the network output logit, which is used to produce the trajectory (collector)

    • action (torch.Tensor): the chosen action (index for the discrete action space) in the trajectory, i.e. behaviour_action

  • gamma (float): the future discount factor, defaults to 0.99

  • lambda_ (float): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95

  • rho_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)

  • c_clip_ratio (float): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)

  • rho_pg_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage

Returns:
  • trace_loss (namedtuple): the vtrace loss item, all of them are the differentiable 0-dim tensor

Shapes:
  • target_output (torch.FloatTensor): \((T, B, N)\), where T is timestep, B is batch size and N is action dim

  • behaviour_output (torch.FloatTensor): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • value (torch.FloatTensor): \((T+1, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.LongTensor): \((T, B)\)

Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = torch.randn(T, B, N).requires_grad_(True)
>>> behaviour_output = torch.randn(T, B, N)
>>> action = torch.randint(0, N, size=(T, B))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_discrete_action(data, rho_clip_ratio=1.1)

vtrace_error_continuous_action

ding.rl_utils.vtrace.vtrace_error_continuous_action(data: namedtuple, gamma: float = 0.99, lambda_: float = 0.95, rho_clip_ratio: float = 1.0, c_clip_ratio: float = 1.0, rho_pg_clip_ratio: float = 1.0)[source]
Overview:

Implementation of v-trace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, arXiv:1802.01561) for continuous action space

Arguments:
  • data (namedtuple): input data with fields shown in vtrace_data
    • target_output (dict{key: torch.Tensor}): the output of the current policy network for the taken action, usually the network outputs (e.g. mu and sigma) that represent the action distribution via the reparameterization trick

    • behaviour_output (dict{key: torch.Tensor}): the output of the behaviour policy network for the taken action, in the same distribution-parameter form, which is used to produce the trajectory (collector)

    • action (torch.Tensor): the chosen action (a continuous action vector) in the trajectory, i.e. behaviour_action

  • gamma (float): the future discount factor, defaults to 0.99

  • lambda_ (float): mix factor between 1-step (lambda_=0) and n-step, defaults to 0.95

  • rho_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the baseline targets (vs)

  • c_clip_ratio (float): the clipping threshold for importance weights (c) when calculating the baseline targets (vs)

  • rho_pg_clip_ratio (float): the clipping threshold for importance weights (rho) when calculating the policy gradient advantage

Returns:
  • trace_loss (namedtuple): the vtrace loss item, all of them are the differentiable 0-dim tensor

Shapes:
  • target_output (dict{key:torch.FloatTensor}): \((T, B, N)\), where T is timestep, B is batch size and N is action dim. The keys are usually parameters of reparameterization trick.

  • behaviour_output (dict{key:torch.FloatTensor}): \((T, B, N)\)

  • action (torch.LongTensor): \((T, B)\)

  • value (torch.FloatTensor): \((T+1, B)\)

  • reward (torch.FloatTensor): \((T, B)\)

  • weight (torch.LongTensor): \((T, B)\)

Examples:
>>> T, B, N = 4, 8, 16
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> target_output = dict(
>>>     mu=torch.randn(T, B, N).requires_grad_(True),
>>>     sigma=torch.exp(torch.randn(T, B, N).requires_grad_(True)),
>>> )
>>> behaviour_output = dict(
>>>     mu=torch.randn(T, B, N),
>>>     sigma=torch.exp(torch.randn(T, B, N)),
>>> )
>>> action = torch.randn((T, B, N))
>>> data = vtrace_data(target_output, behaviour_output, action, value, reward, None)
>>> loss = vtrace_error_continuous_action(data, rho_clip_ratio=1.1)