MDQN

Overview

MDQN was proposed in Munchausen Reinforcement Learning. The authors call this general approach “Munchausen Reinforcement Learning” (M-RL), as a reference to a famous passage of The Surprising Adventures of Baron Munchausen by Raspe, in which the Baron pulls himself out of a swamp by pulling on his own hair. From a practical point of view, the key difference between MDQN and DQN is that MDQN adds a scaled log-policy to the immediate reward of Soft-DQN, which is an extension of the traditional DQN algorithm with maximum entropy.
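
To make the relation explicit, the Soft-DQN target with maximum entropy is (this restates the M-DQN target given in the Key Equations section below with the Munchausen bonus removed):

\[\hat{q}_{\text{soft-dqn}}\left(r_t, s_{t+1}\right)=r_t+\gamma \sum_{a^{\prime} \in A} \pi_{\bar{\theta}}\left(a^{\prime} \mid s_{t+1}\right)\left(q_{\bar{\theta}}\left(s_{t+1}, a^{\prime}\right)-\tau \ln \pi_{\bar{\theta}}\left(a^{\prime} \mid s_{t+1}\right)\right)\]

MDQN simply adds the scaled log-policy bonus \(\alpha \tau \ln \pi_{\bar{\theta}}\left(a_t \mid s_t\right)\) to the reward in this target.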

Quick Facts

  1. MDQN is a model-free and value-based RL algorithm.

  2. MDQN only supports discrete action spaces.

  3. MDQN is an off-policy algorithm.

  4. MDQN uses eps-greedy for exploration.

  5. MDQN increases the action gap and performs implicit KL regularization.

Key Equations or Key Graphs

The target Q value used in MDQN is:

\[\hat{q}_{\text{m-dqn}}\left(r_t, s_{t+1}\right)=r_t+\alpha \tau \ln \pi_{\bar{\theta}}\left(a_t \mid s_t\right)+\gamma \sum_{a^{\prime} \in A} \pi_{\bar{\theta}}\left(a^{\prime} \mid s_{t+1}\right)\left(q_{\bar{\theta}}\left(s_{t+1}, a^{\prime}\right)-\tau \ln \pi_{\bar{\theta}}\left(a^{\prime} \mid s_{t+1}\right)\right)\]

For the log-policy term \(\alpha \tau \ln \pi_{\bar{\theta}}\left(a_t \mid s_t\right)\), we calculate \(\tau \ln \pi_k\) with the following formula:

\[\tau \ln \pi_{k}=q_k-v_k-\tau \ln \left\langle 1, \exp \frac{q_k-v_k}{\tau}\right\rangle\]

where \(q_k\) is target_q_current in our code. For the max-entropy part \(\tau \ln \pi_{\bar{\theta}}\left(a^{\prime} \mid s_{t+1}\right)\), we use the same formula, where \(q_{k+1}\) is target_q in our code.
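
As a quick sanity check (a sketch, not code from the repository; the choice \(v_k = \max_a q_k\) follows the standard numerically stable implementation), the formula above is simply a stable form of \(\tau \log \operatorname{softmax}(q_k / \tau)\):

    import torch
    import torch.nn.functional as F

    tau = 0.003
    q_k = torch.randn(2, 4)                 # (batch_size, action_dim)
    v_k = q_k.max(dim=-1, keepdim=True)[0]  # v_k = max_a q_k(s, a)
    # tau * ln pi_k = q_k - v_k - tau * ln <1, exp((q_k - v_k) / tau)>
    tau_log_pi = q_k - v_k - tau * torch.logsumexp((q_k - v_k) / tau, dim=-1, keepdim=True)
    assert torch.allclose(tau_log_pi, tau * F.log_softmax(q_k / tau, dim=-1), atol=1e-4)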

We replace \(\tau \ln \pi(a \mid s)\) with \([\tau \ln \pi(a \mid s)]_{l_0}^0\) because the log-policy term is not bounded and can cause numerical issues if the policy becomes too close to deterministic.

We also replace \(\pi_{\bar{\theta}}\left(a^{\prime} \mid s_{t+1}\right)\) with \(\operatorname{softmax}(q-v)\), which the official implementation uses but does not mention in the paper.
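
Putting these pieces together, the following PyTorch sketch shows how the clipped Munchausen target above can be assembled (it is an illustration, not the DI-engine source; the function name, argument defaults, and the clipping value l0 = -1 are assumptions):

    import torch
    import torch.nn.functional as F

    def munchausen_target(reward, target_q_current, target_q, act, done,
                          gamma=0.97, tau=0.003, alpha=0.9, l0=-1.0):
        # tau * ln pi_k = q_k - v_k - tau * ln <1, exp((q_k - v_k) / tau)>, with q_k = target_q_current
        v_current = target_q_current.max(dim=-1, keepdim=True)[0]
        adv_current = target_q_current - v_current
        tau_log_pi_current = adv_current - tau * torch.logsumexp(adv_current / tau, dim=-1, keepdim=True)
        # Munchausen bonus for the taken action, clipped to [l0, 0] for numerical stability
        munchausen = tau_log_pi_current.gather(-1, act.unsqueeze(-1)).squeeze(-1).clamp(min=l0, max=0)

        # soft value of the next state: sum_a' pi(a'|s') * (q(s', a') - tau * ln pi(a'|s')), with q_{k+1} = target_q
        v_next = target_q.max(dim=-1, keepdim=True)[0]
        adv_next = target_q - v_next
        tau_log_pi_next = adv_next - tau * torch.logsumexp(adv_next / tau, dim=-1, keepdim=True)
        pi_next = F.softmax(adv_next / tau, dim=-1)  # softmax(q - v) with temperature tau
        soft_v_next = (pi_next * (target_q - tau_log_pi_next)).sum(dim=-1)

        return reward + alpha * munchausen + gamma * (1 - done.float()) * soft_v_next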

We also tested the action gap on Asterix and obtained the same result as the paper: MDQN increases the action gap.

../_images/action_gap.png

Pseudo-code

../_images/mdqn.png

Extension

  • TBD

Implementations

The default config of MDQNPolicy is defined as follows:

class ding.policy.mdqn.MDQNPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of Munchausen DQN algorithm, extended by auxiliary objectives. Paper link: https://arxiv.org/abs/2007.14430.

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | mdqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling weight to correct biased update. If True, priority must be True. | |
| 6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 for sparse reward envs |
| 7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 8 | learn.update_per_collect | int | 1 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | This arg can vary between envs; a bigger value means more off-policy |
| 10 | learn.batch_size | int | 32 | The number of samples of an iteration | |
| 11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration | |
| 12 | learn.target_update_freq | int | 2000 | Frequency of target network update | Hard (assign) update |
| 13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake termination envs |
| 14 | collect.n_sample | int | 4 | The number of training samples of a call of collector | It varies between different envs |
| 15 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len > 1 |
| 16 | other.eps.type | str | exp | Exploration rate decay type | Supports ['exp', 'linear'] |
| 17 | other.eps.start | float | 0.01 | Start value of exploration rate | [0, 1] |
| 18 | other.eps.end | float | 0.001 | End value of exploration rate | [0, 1] |
| 19 | other.eps.decay | int | 250000 | Decay length of exploration | Greater than 0; decay=250000 means the exploration rate decays from the start value to the end value over the decay length |
| 20 | entropy_tau | float | 0.003 | The ratio of entropy in the TD loss | |
| 21 | alpha | float | 0.9 | The ratio of the Munchausen term in the TD loss | |
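
For orientation, the following sketch maps the defaults above onto an EasyDict config (the nesting of learn / collect / other follows the dotted key names in the table; it is an illustration, not the exact default_config in the source):

    from easydict import EasyDict

    # A minimal sketch of an MDQN config; values are the defaults listed in the table above,
    # while the exact nesting is an assumption based on the dotted key names.
    mdqn_config = EasyDict(dict(
        type='mdqn',
        cuda=False,
        on_policy=False,
        priority=False,
        priority_IS_weight=False,
        discount_factor=0.97,
        nstep=1,
        entropy_tau=0.003,
        alpha=0.9,
        learn=dict(
            update_per_collect=1,
            batch_size=32,
            learning_rate=0.001,
            target_update_freq=2000,
            ignore_done=False,
        ),
        collect=dict(n_sample=4, unroll_len=1),
        other=dict(eps=dict(type='exp', start=0.01, end=0.001, decay=250000)),
    ))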

The TD error interface used by MDQN is defined as follows:

ding.rl_utils.td.m_q_1step_td_error(data: namedtuple, gamma: float, tau: float, alpha: float, criterion: torch.nn.modules = MSELoss()) → Tensor[source]
Overview:

Munchausen td_error for the DQN algorithm, supporting 1-step td error.

Arguments:
  • data (m_q_1step_td_data): The input data (m_q_1step_td_data) used to calculate the loss

  • gamma (float): Discount factor

  • tau (float): Entropy factor for Munchausen DQN

  • alpha (float): Discount factor for Munchausen term

  • criterion (torch.nn.modules): Loss function criterion

Returns:
  • loss (torch.Tensor): 1step td error, 0-dim tensor

Shapes:
  • data (m_q_1step_td_data): the m_q_1step_td_data containing [‘q’, ‘target_q’, ‘next_q’, ‘act’, ‘reward’, ‘done’, ‘weight’]

  • q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • target_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • next_q (torch.FloatTensor): \((B, N)\) i.e. [batch_size, action_dim]

  • act (torch.LongTensor): \((B, )\)

  • reward (torch.FloatTensor): \(( , B)\)

  • done (torch.BoolTensor): \((B, )\), whether done in last timestep

  • weight (torch.FloatTensor or None): \((B, )\), the training sample weight

Examples:
>>> import torch
>>> from ding.rl_utils.td import m_q_1step_td_data, m_q_1step_td_error
>>> action_dim = 4
>>> data = m_q_1step_td_data(
>>>     q=torch.randn(3, action_dim),
>>>     target_q=torch.randn(3, action_dim),
>>>     next_q=torch.randn(3, action_dim),
>>>     act=torch.randint(0, action_dim, (3,)),
>>>     reward=torch.randn(3),
>>>     done=torch.randint(0, 2, (3,)),
>>>     weight=torch.ones(3),
>>> )
>>> loss = m_q_1step_td_error(data, 0.99, 0.01, 0.01)

Benchmark

Benchmark and comparison of mdqn algorithm

| environment | best mean reward | evaluation results | config link | comparison |
| --- | --- | --- | --- | --- |
| Asterix (Asterix-v0) | 8963 | ../_images/mdqn_asterix.png | config_link_asterix | sdqn(3513), paper(1718), dqn(3444) |
| SpaceInvaders (SpaceInvaders-v0) | 2211 | ../_images/mdqn_spaceinvaders.png | config_link_spaceinvaders | sdqn(1804), paper(2045), dqn(1228) |
| Enduro (Enduro-v4) | 1003 | ../_images/mdqn_enduro.png | config_link_enduro | sdqn(986.1), paper(1171), dqn(986.4) |

Key differences between our config and the paper's config are listed below; a config sketch follows the list:

  • We collect 100 samples and train 10 times per collection; the paper collects 4 samples and trains once.

  • We update the target network every 500 iterations; the paper updates it every 2000 iterations.

  • Our exploration epsilon decays from 1 to 0.05; the paper's epsilon decays from 0.01 to 0.001.
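
Expressed with the config keys from the table above, these differences amount to the following overrides (a sketch; the exact nesting is an assumption):

    # Our settings vs. the paper's settings, using the config keys from the table above.
    our_overrides = dict(
        collect=dict(n_sample=100),                  # paper: 4 samples per collection
        learn=dict(
            update_per_collect=10,                   # paper: 1 update per collection
            target_update_freq=500,                  # paper: 2000 iterations
        ),
        other=dict(eps=dict(start=1.0, end=0.05)),   # paper: 0.01 -> 0.001
    )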

P.S.:

  • The above results are obtained by running the same configuration on seed 0.

  • For discrete action space algorithms like DQN, the Atari environment set is generally used for testing, and Atari environments are generally evaluated by the highest mean reward within 10M env_steps of training. For more details about Atari, please refer to the Atari Env Tutorial.

Reference

  • Vieillard, Nino, Olivier Pietquin, and Matthieu Geist. “Munchausen reinforcement learning.” Advances in Neural Information Processing Systems 33 (2020): 4235-4246.