FQF

Overview

FQF was proposed in Fully Parameterized Quantile Function for Distributional Reinforcement Learning. The key difference between FQF and IQN is that FQF additionally introduces a fraction proposal network, a parametric function trained to generate tau in [0, 1], whereas IQN samples tau from a base distribution, e.g. U([0, 1]).
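As a rough sketch of the fraction proposal network idea (names and sizes here are illustrative, not DI-engine's actual implementation), a linear layer can map a state embedding to N positive weights, whose cumulative sum yields monotonically increasing fractions in [0, 1]:

```python
import torch
import torch.nn as nn


class FractionProposalNet(nn.Module):
    """Map a state embedding to monotonically increasing fractions in [0, 1].

    Illustrative sketch: a softmax makes the outputs positive and sum to 1,
    so their cumulative sum is an increasing sequence ending at 1.
    """

    def __init__(self, embedding_dim: int = 64, num_quantiles: int = 32):
        super().__init__()
        self.fc = nn.Linear(embedding_dim, num_quantiles)

    def forward(self, state_embedding: torch.Tensor) -> torch.Tensor:
        logits = self.fc(state_embedding)
        probs = torch.softmax(logits, dim=-1)   # positive, sums to 1
        taus = torch.cumsum(probs, dim=-1)      # increasing, ends at 1
        # prepend tau_0 = 0 so the fractions span [0, 1]
        zeros = torch.zeros_like(taus[..., :1])
        return torch.cat([zeros, taus], dim=-1)  # shape: (..., N + 1)


net = FractionProposalNet()
taus = net(torch.randn(4, 64))  # 4 state embeddings of dim 64
assert taus.shape == (4, 33)
assert torch.all(taus[:, 1:] >= taus[:, :-1])  # monotonic
```

The softmax-plus-cumsum construction guarantees the monotonicity constraint on the fractions by design, rather than enforcing it with a penalty.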

Quick Facts

  1. FQF is a model-free and value-based distributional RL algorithm.

  2. FQF only supports discrete action spaces.

  3. FQF is an off-policy algorithm.

  4. Usually, FQF uses eps-greedy or multinomial sampling for exploration.

  5. FQF can be equipped with RNN.
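For fact 4, eps-greedy exploration over the Q-values (e.g. the mean of each action's predicted quantile values) can be sketched as follows; the function name is illustrative, not DI-engine's API:

```python
import random

import torch


def eps_greedy_action(q_values: torch.Tensor, eps: float) -> int:
    """Pick a uniformly random action with probability eps, else the greedy one.

    q_values: 1-D tensor of per-action Q estimates, e.g. the mean of the
    predicted quantile values for each action.
    """
    if random.random() < eps:
        return random.randrange(q_values.shape[0])
    return int(torch.argmax(q_values).item())


q = torch.tensor([0.1, 0.9, 0.3])
assert eps_greedy_action(q, eps=0.0) == 1  # eps = 0 is purely greedy
```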

Key Equations or Key Graphs

For any continuous quantile function \(F_{Z}^{-1}\) that is non-decreasing, define the 1-Wasserstein loss of \(F_{Z}^{-1}\) and \(F_{Z}^{-1, \tau}\) by

\[W_{1}(Z, \tau)=\sum_{i=0}^{N-1} \int_{\tau_{i}}^{\tau_{i+1}}\left|F_{Z}^{-1}(\omega)-F_{Z}^{-1}\left(\hat{\tau}_{i}\right)\right| d \omega\]

Note that \(W_{1}\) cannot be computed exactly, so we can't directly perform gradient descent on it for the fraction proposal network. Instead, we assign \(\frac{\partial W_{1}}{\partial \tau_{i}}\) to the optimizer as the gradient of each fraction.

\(\frac{\partial W_{1}}{\partial \tau_{i}}\) is given by

\[\frac{\partial W_{1}}{\partial \tau_{i}}=2 F_{Z}^{-1}\left(\tau_{i}\right)-F_{Z}^{-1}\left(\hat{\tau}_{i}\right)-F_{Z}^{-1}\left(\hat{\tau}_{i-1}\right), \forall i \in(0, N).\]
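Given the quantile network's outputs \(F_{Z}^{-1}(\tau_i)\) at the interior fractions and \(F_{Z}^{-1}(\hat{\tau}_i)\) at the midpoints \(\hat{\tau}_i = (\tau_i + \tau_{i+1}) / 2\), this gradient is a simple tensor expression. The helper below is an illustrative sketch, not DI-engine's implementation:

```python
import torch


def fraction_grad(q_tau: torch.Tensor, q_tau_hat: torch.Tensor) -> torch.Tensor:
    """dW1/dtau_i = 2 F^{-1}(tau_i) - F^{-1}(tau_hat_i) - F^{-1}(tau_hat_{i-1}).

    q_tau:     F^{-1}(tau_i) at the N-1 interior fractions, shape (N - 1,)
    q_tau_hat: F^{-1}(tau_hat_i) at the N midpoints, shape (N,)
    """
    return 2 * q_tau - q_tau_hat[1:] - q_tau_hat[:-1]


# Sanity check: if F^{-1} is the identity and the fractions are uniform,
# every tau_i already sits at its optimum, so the gradient vanishes.
q_tau = torch.tensor([0.25, 0.50, 0.75])            # interior taus for N = 4
q_tau_hat = torch.tensor([0.125, 0.375, 0.625, 0.875])  # midpoints
assert torch.allclose(fraction_grad(q_tau, q_tau_hat), torch.zeros(3))
```

In practice this gradient is handed to the optimizer for the fraction proposal network (e.g. via `taus.backward(gradient=...)` in PyTorch) rather than being derived by autograd from \(W_{1}\) itself.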

Like implicit quantile networks, each learned quantile fraction \(\tau\) is encoded into an embedding vector via:

\[\phi_{j}(\tau):=\operatorname{ReLU}\left(\sum_{i=0}^{n-1} \cos (\pi i \tau) w_{i j}+b_{j}\right)\]

Then the quantile embedding is element-wise multiplied by the embedding of the environment observation, and subsequent fully-connected layers map the resulting product vector to the corresponding quantile value.
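The cosine embedding and the element-wise merge with the state embedding can be sketched as follows (dimension names and sizes are illustrative assumptions, not DI-engine's defaults):

```python
import math

import torch
import torch.nn as nn


class QuantileEmbedding(nn.Module):
    """phi_j(tau) = ReLU(sum_i cos(pi * i * tau) * w_ij + b_j)."""

    def __init__(self, num_cosines: int = 64, embedding_dim: int = 64):
        super().__init__()
        self.fc = nn.Linear(num_cosines, embedding_dim)
        # fixed cosine frequencies i = 0 .. n-1
        self.register_buffer("freqs", torch.arange(num_cosines, dtype=torch.float32))

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (batch, N) -> cosine features: (batch, N, num_cosines)
        cos_feats = torch.cos(math.pi * self.freqs * tau.unsqueeze(-1))
        return torch.relu(self.fc(cos_feats))  # (batch, N, embedding_dim)


embed = QuantileEmbedding()
state = torch.randn(4, 64)   # observation embedding, dim 64
tau = torch.rand(4, 32)      # 32 fractions per sample
merged = embed(tau) * state.unsqueeze(1)  # element-wise product
assert merged.shape == (4, 32, 64)
```

The `merged` tensor would then be fed through the fully-connected quantile-value head, one output per action.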

The advantage of FQF over IQN is illustrated in the following figure:

../_images/fqf_iqn_compare.png

Pseudo-code

../_images/FQF.png

Extensions

FQF can be combined with:

  • PER (Prioritized Experience Replay)

    Tip

    Whether PER improves FQF depends on the task and the training strategy.

  • Multi-step TD-loss

  • Double (target) Network

  • RNN

Implementation

Tip

Our benchmark result of FQF uses the same hyper-parameters as DQN except FQF's exclusive hyper-parameter, the number of quantiles, which is empirically set to 32. Intuitively, the advantage of trained quantile fractions over random ones is more observable at smaller N. At larger N, when both trained and random quantile fractions are densely distributed over [0, 1], the difference between FQF and IQN becomes negligible.

The default config of FQF is defined as follows:

The network interface FQF used is defined as follows:

The Bellman update used by FQF is defined in the function fqf_nstep_td_error of ding/rl_utils/td.py.

Benchmark

| environment | best mean reward | evaluation results | config link | comparison |
| --- | --- | --- | --- | --- |
| Pong (PongNoFrameskip-v4) | 21 | ../_images/FQF_pong.png | config_link_p | Tianshou(20.7) |
| Qbert (QbertNoFrameskip-v4) | 23416 | ../_images/FQF_qbert.png | config_link_q | Tianshou(16172.5) |
| SpaceInvaders (SpaceInvadersNoFrameskip-v4) | 2727.5 | ../_images/FQF_spaceinvaders.png | config_link_s | Tianshou(2482) |

P.S.:
  1. The above results are obtained by running the same configuration on three different random seeds (0, 1, 2).

References

(FQF) Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, Tie-Yan Liu: "Fully Parameterized Quantile Function for Distributional Reinforcement Learning", 2019; arXiv:1911.02140. https://arxiv.org/pdf/1911.02140

Other Public Implementations