FQF
^^^^^^^

Overview
---------

FQF was proposed in `Fully Parameterized Quantile Function for Distributional Reinforcement Learning <https://arxiv.org/abs/1911.02140>`_. The key difference between FQF and IQN is that FQF additionally introduces a fraction proposal network, a parametric function trained to generate the quantile fractions :math:`\tau` in [0, 1], while IQN samples :math:`\tau` from a base distribution, e.g. U([0, 1]).

Quick Facts
-----------

1. FQF is a **model-free** and **value-based** distributional RL algorithm.

2. FQF only supports **discrete action spaces**.

3. FQF is an **off-policy** algorithm.

4. Usually, FQF uses **eps-greedy** or **multinomial sampling** for exploration.

5. FQF can be equipped with RNN.

Key Equations or Key Graphs
----------------------------

For any continuous, non-decreasing quantile function :math:`F_{Z}^{-1}`, define the 1-Wasserstein loss between :math:`F_{Z}^{-1}` and its staircase approximation :math:`F_{Z}^{-1, \tau}` by

.. math::

    W_{1}(Z, \tau)=\sum_{i=0}^{N-1} \int_{\tau_{i}}^{\tau_{i+1}}\left|F_{Z}^{-1}(\omega)-F_{Z}^{-1}\left(\hat{\tau}_{i}\right)\right| d \omega

Note that since :math:`W_{1}` itself is not computed, we cannot directly perform gradient descent on it to train the fraction proposal network. Instead, the analytic gradient :math:`\frac{\partial W_{1}}{\partial \tau_{i}}` is handed to the optimizer. It is given by

.. math::

    \frac{\partial W_{1}}{\partial \tau_{i}}=2 F_{Z}^{-1}\left(\tau_{i}\right)-F_{Z}^{-1}\left(\hat{\tau}_{i}\right)-F_{Z}^{-1}\left(\hat{\tau}_{i-1}\right), \quad \forall i \in(0, N).

As in implicit quantile networks (IQN), a learned quantile fraction :math:`\tau` is encoded into an embedding vector via

.. math::

    \phi_{j}(\tau):=\operatorname{ReLU}\left(\sum_{i=0}^{n-1} \cos (\pi i \tau) w_{i j}+b_{j}\right)

The quantile embedding is then element-wise multiplied by the embedding of the environment observation, and the subsequent fully-connected layers map the resulting product vector to the respective quantile value.

The advantage of FQF over IQN is illustrated in the following figure:

.. image:: images/fqf_iqn_compare.png
    :align: center
    :scale: 100%

Pseudo-code
-------------

.. image:: images/FQF.png
    :align: center
    :scale: 100%

Extensions
-----------

FQF can be combined with:

- PER (Prioritized Experience Replay)

  .. tip::
      Whether PER improves FQF depends on the task and the training strategy.

- Multi-step TD-loss
- Double (target) Network
- RNN

Implementation
------------------

.. tip::
    Our benchmark result of FQF uses the same hyper-parameters as DQN, except for FQF's exclusive hyper-parameter, ``the number of quantiles``, which is empirically set to 32. Intuitively, the advantage of trained quantile fractions over random ones is more noticeable at small N. At larger N, when both trained and random quantile fractions are densely distributed over [0, 1], the difference between FQF and IQN becomes negligible.

The default config of FQF is defined as follows:

.. autoclass:: ding.policy.fqf.FQFPolicy
    :noindex:

The network interface used by FQF is defined as follows:

.. autoclass:: ding.model.template.q_learning.FQF
    :members: forward
    :noindex:

The Bellman update of FQF is defined in the function ``fqf_nstep_td_error`` of ``ding/rl_utils/td.py``.
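The snippet below is a minimal PyTorch sketch of the two FQF-specific pieces described above: the cosine quantile embedding :math:`\phi_{j}(\tau)` and the analytic fraction gradient :math:`\frac{\partial W_{1}}{\partial \tau_{i}}` used to train the fraction proposal network. It is not DI-engine's implementation; the function names, tensor shapes, and the surrogate-loss trick in the comments are illustrative assumptions.

.. code-block:: python

    import math

    import torch
    import torch.nn.functional as F


    def cosine_embedding(tau: torch.Tensor, linear: torch.nn.Linear) -> torch.Tensor:
        """phi_j(tau) = ReLU(sum_{i=0}^{n-1} cos(pi * i * tau) * w_ij + b_j).

        tau: quantile fractions, shape (batch, N).
        linear: a Linear(n, hidden_dim) layer supplying the weights w_ij and biases b_j.
        Returns an embedding of shape (batch, N, hidden_dim), which is then
        multiplied element-wise with the observation embedding.
        """
        n = linear.in_features
        i = torch.arange(n, device=tau.device, dtype=tau.dtype)      # (n,)
        cos_basis = torch.cos(math.pi * i * tau.unsqueeze(-1))       # (batch, N, n)
        return F.relu(linear(cos_basis))                              # (batch, N, hidden_dim)


    def fraction_gradient(q_tau: torch.Tensor, q_tau_hat: torch.Tensor) -> torch.Tensor:
        """dW1/dtau_i = 2 F^{-1}(tau_i) - F^{-1}(tau_hat_i) - F^{-1}(tau_hat_{i-1}).

        q_tau: quantile values at the inner fractions tau_1, ..., tau_{N-1},
            shape (batch, N - 1).
        q_tau_hat: quantile values at the midpoints tau_hat_0, ..., tau_hat_{N-1},
            shape (batch, N).
        Returns the gradient w.r.t. each inner fraction, shape (batch, N - 1).
        """
        return 2 * q_tau - q_tau_hat[:, 1:] - q_tau_hat[:, :-1]


    # Usage sketch: detach the quantile values so this term only updates the
    # fraction proposal network, e.g. through a surrogate loss
    #     (fraction_gradient(q_tau.detach(), q_tau_hat.detach()) * tau_inner).sum()
    # whose gradient with respect to tau_inner equals dW1/dtau_i.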
Benchmark
-----------

+---------------------+-----------------+-----------------------------------------------------+--------------------------+----------------------+
| environment         |best mean reward | evaluation results                                  | config link              | comparison           |
+=====================+=================+=====================================================+==========================+======================+
|                     |                 |                                                     |`config_link_p `_         |                      |
+---------------------+-----------------+-----------------------------------------------------+--------------------------+----------------------+
|                     |                 |                                                     |`config_link_q `_         |                      |
+---------------------+-----------------+-----------------------------------------------------+--------------------------+----------------------+
|                     |                 |                                                     |`config_link_s `_         |                      |
+---------------------+-----------------+-----------------------------------------------------+--------------------------+----------------------+

P.S.:

1. The above results are obtained by running the same configuration on three different random seeds (0, 1, 2).

References
------------

(FQF) Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, Tieyan Liu: "Fully Parameterized Quantile Function for Distributional Reinforcement Learning", 2019; arXiv:1911.02140. https://arxiv.org/pdf/1911.02140

Other Public Implementations
---------------------------------

- `Tianshou `_