DQfD¶
Overview¶
DQfD was proposed in Deep Q-learning from Demonstrations by DeepMind, which appeared at AAAI 2018. It first pre-trains solely on demonstration data, using a combination of 1-step TD, n-step TD, supervised, and regularization losses, so that it starts with a reasonable policy that is a good starting point for learning in the task. Once it starts interacting with the task, it continues learning by sampling from both its self-generated data and the demonstration data. The ratio of the two types of data in each mini-batch is controlled automatically by a prioritized-replay mechanism.
DQfD leverages small sets of demonstration data to massively accelerate the learning process and performs better than PDD DQN, RBS, HER and ADET on Atari games.
Quick Facts¶
DQfD is an extension of DQN.
Demonstrations are stored in a dedicated expert replay buffer.
The network is pre-trained with expert demonstrations, which accelerates the subsequent RL training process.
The agent then gathers its own transitions into a new replay buffer (see the note under Pseudo-code below) and trains the network on a mixture of the new replay buffer and the expert replay buffer, as sketched after this list.
The network is trained with a loss function made up of four parts: a one-step TD loss, an n-step TD loss, an expert large-margin classification loss, and L2 regularization.
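As a concrete illustration of how the demo/agent ratio can be controlled by priorities, here is a minimal sketch of proportional prioritized sampling over a single buffer holding both kinds of transitions. The class, constants, and bonus values (EPS_DEMO, EPS_AGENT, ALPHA) are illustrative assumptions, not DI-engine's buffer implementation.

```python
import numpy as np

# Illustrative proportional prioritized sampling over a single buffer that
# stores both expert demonstrations and self-generated transitions.
EPS_AGENT = 0.001   # small priority bonus for agent transitions (assumed value)
EPS_DEMO = 1.0      # larger priority bonus for demonstrations (assumed value)
ALPHA = 0.6         # priority exponent

class MixedReplayBuffer:
    def __init__(self):
        self.transitions = []    # each item: (transition, is_expert)
        self.abs_td_errors = []  # |TD error| used as the base priority

    def push(self, transition, td_error, is_expert):
        self.transitions.append((transition, is_expert))
        self.abs_td_errors.append(abs(td_error))

    def sample(self, batch_size, rng=np.random):
        # Demonstrations receive a larger bonus, so the demo/agent ratio of
        # each mini-batch is controlled automatically by the priorities.
        bonus = np.array([EPS_DEMO if is_expert else EPS_AGENT
                          for _, is_expert in self.transitions])
        priorities = (np.array(self.abs_td_errors) + bonus) ** ALPHA
        probs = priorities / priorities.sum()
        indices = rng.choice(len(self.transitions), size=batch_size, p=probs)
        return [self.transitions[i] for i in indices], indices
```

Because demonstration transitions always carry the larger bonus, they continue to be sampled even after the agent has collected far more of its own data.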
Key Equations or Key Graphs¶
The DQfD overall loss used to update the network is a combination of all four losses (a code sketch of the combined loss follows the list below).
Overall Loss: \(J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2J_E(Q) + \lambda_3 J_{L2}(Q)\)
one-step loss: \(J_{DQ}(Q) = (R(s,a) + \gamma Q(s_{t+1}, a_{t+1}^{max}; \theta^{'}) - Q(s,a;\theta))^2\), where \(a_{t+1}^{max} = argmax_a Q(s_{t+1},a;\theta)\).
n-step loss: \(J_n(Q)\) is the TD error computed against the n-step return \(r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^n max_a Q(s_{t+n},a)\). It helps propagate the values of the expert's trajectory to earlier states.
large margin classification loss: \(J_E(Q) = max_{a \in A}[Q(s,a) + L(a_E,a)] - Q(s,a_E)\), \(L(a_E,a)\) is a margin function that is 0 when \(a = a_E\) and positive otherwise. This loss forces the values of the other actions to be at least a margin lower than the value of the demonstrator’s action.
L2 regularization loss: \(J_{L2}(Q)\) applies L2 regularization to the network weights and biases to help prevent over-fitting on the relatively small demonstration dataset.
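Putting the four terms together, a minimal PyTorch sketch of the overall loss might look like the following. The function name, tensor arguments, and constant margin value are assumptions for illustration and do not mirror DI-engine's internal implementation; in practice the L2 term is often realized as optimizer weight decay.

```python
import torch
import torch.nn.functional as F

def dqfd_loss(q, q_next, q_next_target, q_nstep_target, action, expert_action,
              reward, nstep_return, is_expert,
              gamma=0.99, n=10, margin=0.8, lambda1=1.0, lambda2=1.0):
    """Illustrative DQfD loss: J_DQ + lambda1 * J_n + lambda2 * J_E.

    q:              Q(s_t, .; theta), shape (B, A)
    q_next:         Q(s_{t+1}, .; theta), shape (B, A)
    q_next_target:  Q(s_{t+1}, .; theta'), shape (B, A)
    q_nstep_target: Q(s_{t+n}, .; theta'), shape (B, A)
    nstep_return:   r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1}, shape (B,)
    is_expert:      1.0 for demonstration transitions, 0.0 otherwise, shape (B,)
    Terminal-state masking is omitted for brevity.
    """
    q_a = q.gather(1, action.unsqueeze(1)).squeeze(1)

    # One-step double-DQN loss J_DQ: action chosen by the online network,
    # evaluated by the target network.
    a_max = q_next.argmax(dim=1, keepdim=True)
    one_step_target = reward + gamma * q_next_target.gather(1, a_max).squeeze(1)
    j_dq = F.mse_loss(q_a, one_step_target.detach())

    # n-step loss J_n: TD error against the n-step return.
    nstep_target = nstep_return + (gamma ** n) * q_nstep_target.max(dim=1)[0]
    j_n = F.mse_loss(q_a, nstep_target.detach())

    # Large-margin supervised loss J_E, masked so it only affects expert data
    # (for self-generated data, expert_action can simply be the taken action).
    margin_matrix = torch.full_like(q, margin)
    margin_matrix.scatter_(1, expert_action.unsqueeze(1), 0.0)
    q_e = q.gather(1, expert_action.unsqueeze(1)).squeeze(1)
    j_e = (((q + margin_matrix).max(dim=1)[0] - q_e) * is_expert).mean()

    # lambda3 * J_L2 is typically handled as weight decay in the optimizer.
    return j_dq + lambda1 * j_n + lambda2 * j_e
```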
Pseudo-code¶
Note
In Phase I, the agent just uses the demonstration data, and does not do any exploration. The goal of the pre-training phase is to learn to imitate the demonstrator with a value function that satisfies the Bellman equation. During this pre-training phase, the agent samples mini-batches from the demonstration data and updates the network by applying the total loss J(Q).
In Phase II, the agent starts acting on the system, collecting self-generated data and adding it to its replay buffer. Data is added to the replay buffer until it is full, and then the agent starts overwriting old data in that buffer. However, the agent never overwrites the demonstration data. All the losses are applied to the demonstration data in both phases, while the supervised loss is not applied to self-generated data. A minimal code sketch of this two-phase loop follows.
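The two phases can be summarized by the sketch below. The helpers (expert_buffer, replay_buffer, policy.collect, policy.learn, and the sample sizes) are hypothetical placeholders rather than DI-engine APIs, and a fixed half-expert/half-agent split stands in for the priority-controlled mixture described above.

```python
def train_dqfd(policy, env, expert_buffer, replay_buffer,
               per_train_iter_k=10, update_per_collect=3,
               batch_size=64, n_sample=8, max_iterations=100000):
    """Illustrative two-phase DQfD loop; buffers expose sample()/push(),
    the policy exposes collect(env, n) and learn(batch)."""
    # Phase I: pre-train on demonstrations only, no exploration.
    # The full loss J(Q), including the margin loss, is applied.
    for _ in range(per_train_iter_k):
        policy.learn(expert_buffer.sample(batch_size))

    # Phase II: interact with the environment and train on mixed data.
    for _ in range(max_iterations):
        transitions = policy.collect(env, n_sample)
        # Old self-generated data may be overwritten once the buffer is
        # full, but the demonstration data is never overwritten.
        replay_buffer.push(transitions)
        for _ in range(update_per_collect):
            # Fixed split for simplicity; the paper lets prioritized replay
            # control the ratio, and the supervised margin loss is masked
            # out for the self-generated half of the batch.
            batch = (expert_buffer.sample(batch_size // 2)
                     + replay_buffer.sample(batch_size // 2))
            policy.learn(batch)
```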
Extensions¶
DeepMind has extended DQfD in several ways. Two relevant follow-up works are:
Distributed Prioritized Experience Replay
The main idea of this paper is to scale up experience replay by having many actors collect experience. The framework is called Ape-X, and the authors claim that Ape-X DQN achieves new state-of-the-art performance on Atari games. The paper is not directly about DQfD, but we include it here because a follow-up paper (see below) used this technique with DQfD.
Observe and Look Further: Achieving Consistent Performance on Atari
This paper proposes the Ape-X DQfD algorithm, which, as one might expect, combines DQfD with the distributed prioritized experience replay algorithm.
Implementations¶
DI-engine implements DQfD.
The default config of DQfD Policy is defined as follows:
- class ding.policy.dqfd.DQFDPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
- Overview:
Policy class of DQFD algorithm, extended by Double DQN/Dueling DQN/PER/multi-step TD.
- Config:
| ID | Symbol | Type | Default Value | Description | Other(Shape) |
|---|---|---|---|---|---|
| 1 | type | str | dqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | True | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | True | Whether to use Importance Sampling weight to correct biased update. If True, priority must be True. | |
| 6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 when sparse reward env |
| 7 | nstep | int | 10, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 8 | lambda1 | float | 1 | Multiplicative factor for the n-step loss | |
| 9 | lambda2 | float | 1 | Multiplicative factor for the supervised margin loss | |
| 10 | lambda3 | float | 1e-5 | Multiplicative factor for the L2 loss | |
| 11 | margin_fn | float | 0.8 | Margin function in J_E, here set as a constant | |
| 12 | per_train_iter_k | int | 10 | Number of pre-training iterations | |
| 13 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after one collection of the collector. Only valid in serial training | This arg can vary from env to env. Bigger value means more off-policy |
| 14 | learn.batch_size | int | 64 | The number of samples of an iteration | |
| 15 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration | |
| 16 | learn.target_update_freq | int | 100 | Frequency of target network update | Hard (assign) update |
| 17 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for some fake termination envs |
| 18 | collect.n_sample | int | [8, 128] | The number of training samples of a call of collector | Varies from env to env |
| 19 | collect.unroll_len | int | 1 | Unroll length of an iteration | In RNN, unroll_len>1 |
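For illustration, a config fragment touching the fields listed above might look like the sketch below. Field names and nesting follow the table; the values are examples rather than a tuned configuration, so treat this as a starting point, not a reference config.

```python
from easydict import EasyDict

# Illustrative DQfD policy config fragment; field names follow the table
# above and the values are examples, not a tuned configuration.
dqfd_policy_config = EasyDict(dict(
    cuda=False,
    on_policy=False,
    priority=True,
    priority_IS_weight=True,
    discount_factor=0.97,
    nstep=10,
    lambda1=1.0,          # weight of the n-step TD loss
    lambda2=1.0,          # weight of the supervised large-margin loss
    lambda3=1e-5,         # weight of the L2 regularization loss
    margin_fn=0.8,        # constant margin used in J_E
    per_train_iter_k=10,  # number of pre-training iterations on demonstrations
    learn=dict(
        update_per_collect=3,
        batch_size=64,
        learning_rate=0.001,
        target_update_freq=100,
        ignore_done=False,
    ),
    collect=dict(n_sample=8, unroll_len=1),
))
```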
The network interface used by DQfD is defined as follows:
- class ding.model.template.q_learning.DQN(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, init_bias: float | None = None)[source]
- Overview:
The neural network structure and computation graph of the Deep Q Network (DQN) algorithm, which is the most classic value-based RL algorithm for discrete actions. The DQN is composed of two parts: encoder and head. The encoder is used to extract features from various observations, and the head is used to compute the Q value of each action dimension.
- Interfaces:
__init__, forward.
Note
Current DQN supports two types of encoder: FCEncoder and ConvEncoder, and two types of head: DiscreteHead and DuelingHead. You can customize your own encoder or head by inheriting this class.
- __init__(obs_shape: int | SequenceType, action_shape: int | SequenceType, encoder_hidden_size_list: SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: int | None = None, head_layer_num: int = 1, activation: Module | None = ReLU(), norm_type: str | None = None, dropout: float | None = None, init_bias: float | None = None) -> None [source]
- Overview:
Initialize the DQN (encoder + head) model according to the corresponding input arguments.
- Arguments:
obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
action_shape (Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder; the last element must match head_hidden_size.
dueling (Optional[bool]): Whether to choose DuelingHead or DiscreteHead (default).
head_hidden_size (Optional[int]): The hidden_size of the head network; defaults to None, in which case it is set to the last element of encoder_hidden_size_list.
head_layer_num (int): The number of layers used in the head network to compute the Q value output.
activation (Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
norm_type (Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
dropout (Optional[float]): The dropout rate of the dropout layer; if None, the dropout layer is disabled.
init_bias (Optional[float]): The initial value of the last layer bias in the head network.
- forward(x: Tensor) -> Dict [source]
- Overview:
DQN forward computation graph, input observation tensor to predict q_value.
- Arguments:
x (torch.Tensor): The input observation tensor data.
- Returns:
outputs (Dict): The output of DQN's forward, including q_value.
- ReturnsKeys:
logit (torch.Tensor): Discrete Q-value output of each possible action dimension.
- Shapes:
x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.
logit (torch.Tensor): \((B, M)\), where B is batch size and M is action_shape.
- Examples:
>>> model = DQN(32, 6)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])
Note
For consistency and compatibility, we name all the outputs of the network which are related to action selections as logit.
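As a further usage sketch, the same interface can be fed image observations, in which case the convolutional encoder mentioned in the note above is typically selected based on the 3-dimensional obs_shape. The shapes follow the docstring above; this is an illustrative sketch, not part of the official examples.

```python
import torch
from ding.model.template.q_learning import DQN

# Image observations: 4 stacked 84x84 frames, 6 discrete actions.
model = DQN(obs_shape=[4, 84, 84], action_shape=6)
obs = torch.randn(8, 4, 84, 84)   # a batch of 8 observations
outputs = model(obs)
assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([8, 6])
```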
Benchmark¶
| environment | best mean reward |
|---|---|
| Pong (PongNoFrameskip-v4) | 20 |
| Qbert (QbertNoFrameskip-v4) | 4976 |
| SpaceInvaders (SpaceInvadersNoFrameskip-v4) | 1969 |
Reference¶
Hester T, Vecerik M, Pietquin O, et al. Deep q-learning from demonstrations[C]//Thirty-second AAAI conference on artificial intelligence. 2018.
Blog: Combining Imitation Learning and Reinforcement Learning Using DQfD