QMIX
^^^^^^^

Overview
---------
QMIX was proposed by `Rashid et al. (2018) <https://arxiv.org/abs/1803.11485>`_ to learn a joint action-value function conditioned on global state information in centralized multi-agent training, and to extract decentralized policies from this centralized end-to-end framework. QMIX employs a centralized neural network to estimate joint action values as a complex non-linear combination of per-agent action values based on local observations. This provides a novel representation of centralized action-value functions and guarantees consistency between the centralized and decentralized policies.

QMIX is a non-linear extension of `VDN (Sunehag et al. 2017) <https://arxiv.org/abs/1706.05296>`_. Compared to VDN, QMIX can incorporate extra state information during training and represent a much richer class of action-value functions.

Quick Facts
-------------
1. QMIX uses the paradigm of **centralized training with decentralized execution**.

2. QMIX is a **model-free**, **value-based**, **off-policy**, **multi-agent** RL method.

3. QMIX only supports **discrete** action spaces.

4. QMIX considers a **partially observable** scenario in which each agent only obtains individual observations.

5. QMIX uses **DRQN** as each agent's individual value network to tackle the **partially observable** issue.

6. QMIX represents the joint value function using an architecture consisting of **agent networks, a mixing network, and a hypernetwork**. The mixing network is a feed-forward neural network that takes the agent network outputs as input and mixes them monotonically, producing joint action values. The weights of the mixing network are produced by a separate hypernetwork.

Key Equations or Key Graphs
---------------------------
VDN and QMIX are representative methods that factorize the joint action-value function :math:`Q_{tot}` into individual action-value functions :math:`Q_a` for decentralized execution. In order to achieve centralized training with decentralized execution (CTDE), we need to ensure that a global :math:`argmax` performed on :math:`Q_{tot}` yields the same result as a set of individual :math:`argmax` operations performed on each :math:`Q_a`:

.. math::
   \arg \max _{\boldsymbol{u}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \boldsymbol{u})=\left(\begin{array}{c}\arg \max _{u_{1}} Q_{1}\left(\tau_{1}, u_{1}\right) \\ \vdots \\ \arg \max _{u_{N}} Q_{N}\left(\tau_{N}, u_{N}\right)\end{array}\right)

VDN factorizes the joint action-value function into a sum of individual action-value functions:

.. math::
   Q_{\mathrm{tot}}(\boldsymbol{\tau}, \boldsymbol{u})=\sum_{i=1}^{N} Q_{i}\left(\tau_{i}, u_{i}\right)

QMIX extends this additive value factorization and represents the joint action-value function as a monotonic function. QMIX is based on monotonicity, a constraint on the relationship between the joint action value :math:`Q_{tot}` and the individual action values :math:`Q_a`:

.. math::
   \frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0, \forall a \in A

The overall QMIX architecture, including the individual agent networks, the mixing network, and the hypernetwork:

.. image:: images/marl/qmix.png

QMIX trains the mixing network by minimizing the following loss:

.. math::
   y^{tot} = r + \gamma \max_{\mathbf{u}^{\prime}} Q_{tot}(\boldsymbol{\tau}^{\prime}, \mathbf{u}^{\prime}, s^{\prime}; \theta^{-})

.. math::
   \mathcal{L}(\theta) = \sum_{i=1}^{b} \left[ \left( y_{i}^{tot} - Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s; \theta) \right)^{2} \right]

Each weight of the mixing network is produced by an independent hypernetwork, which takes the global state as input and outputs the weights of one layer of the mixing network, as sketched below.
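In practice, the monotonicity constraint is enforced by restricting the mixing network to non-negative weights. The following is a minimal sketch of such a mixer in PyTorch (it is not DI-engine's actual ``QMix`` class; the class name, layer sizes, and activation choices here are illustrative assumptions), where hypernetworks map the global state to the mixing weights and ``torch.abs`` keeps those weights non-negative:

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class MonotonicMixer(nn.Module):
        """Sketch of a QMIX-style mixing network (illustrative, not ding's API)."""

        def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
            super().__init__()
            self.n_agents = n_agents
            self.embed_dim = embed_dim
            # Hypernetworks: map the global state to mixing-network weights/biases.
            self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
            self.hyper_b1 = nn.Linear(state_dim, embed_dim)
            self.hyper_w2 = nn.Linear(state_dim, embed_dim)
            # The final bias is produced by a small state-conditioned MLP.
            self.hyper_b2 = nn.Sequential(
                nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
            )

        def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
            # agent_qs: (batch, n_agents), state: (batch, state_dim)
            bs = agent_qs.size(0)
            agent_qs = agent_qs.view(bs, 1, self.n_agents)
            # abs() makes the mixing weights non-negative, so dQ_tot/dQ_a >= 0.
            w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
            b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
            hidden = F.elu(torch.bmm(agent_qs, w1) + b1)
            w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
            b2 = self.hyper_b2(state).view(bs, 1, 1)
            # q_tot: (batch,) -- a monotonic combination of the per-agent Q-values.
            return (torch.bmm(hidden, w2) + b2).view(bs)

Note that only the weights are passed through ``abs``; the biases may be negative, which is sufficient for the monotonicity constraint since :math:`\partial Q_{tot} / \partial Q_a` does not depend on the biases.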
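Given such a mixer and per-agent Q-networks, the TD target :math:`y^{tot}` and loss :math:`\mathcal{L}(\theta)` above translate into a few lines of code. This is a simplified sketch (single-step targets, no DRQN recurrence; ``mixer``, ``target_mixer``, and the Q-value tensors are assumptions carried over from the sketch above):

.. code-block:: python

    import torch


    def qmix_loss(mixer, target_mixer, chosen_qs, target_max_qs,
                  state, next_state, reward, done, gamma=0.99):
        # chosen_qs: (batch, n_agents), Q_a(tau_a, u_a) of the executed actions.
        # target_max_qs: (batch, n_agents), max_u' Q_a(tau_a', u') from the target
        # networks; under monotonic mixing, these per-agent maxima realize the
        # joint max over u' of Q_tot (the argmax-consistency property above).
        q_tot = mixer(chosen_qs, state)                        # (batch,)
        with torch.no_grad():                                  # theta^- is frozen
            target_q_tot = target_mixer(target_max_qs, next_state)
            y_tot = reward + gamma * (1.0 - done) * target_q_tot
        return ((y_tot - q_tot) ** 2).sum()                    # sum over the batch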
More details can be found in the original paper `Rashid et al. (2018) <https://arxiv.org/abs/1803.11485>`_.

VDN and QMIX are methods that attempt to factorize :math:`Q_{tot}` assuming additivity and monotonicity, respectively. Joint action-value functions satisfying those conditions are well factorized by VDN and QMIX, but there exist tasks whose joint action-value functions do not meet them. `QTRAN (Son et al. 2019) <https://arxiv.org/abs/1905.05408>`_ proposes a factorization method that is free from such structural constraints by transforming the original joint action-value function into an easily factorizable one. QTRAN guarantees a more general factorization than VDN or QMIX.

Implementations
----------------
The default config is defined as follows:

.. autoclass:: ding.policy.qmix.QMIXPolicy
    :noindex:

The network interface used by QMIX is defined as follows:

.. autoclass:: ding.model.template.QMix
    :members: forward
    :noindex:

Benchmark
-----------

.. list-table:: Benchmark and comparison of the QMIX algorithm
   :widths: 25 15 30 15 15
   :header-rows: 1

   * - environment
     - best mean reward
     - evaluation results
     - config link
     - comparison
   * - | MMM
     - 1
     - .. image:: images/benchmark/QMIX_MMM.png
     - `config_link_MMM `_
     - | Pymarl(1)
   * - | 3s5z
     - 1
     - .. image:: images/benchmark/QMIX_3s5z.png
     - `config_link_3s5z `_
     - | Pymarl(1)
   * - | MMM2
     - 0.8
     - .. image:: images/benchmark/QMIX_MMM2.png
     - `config_link_MMM2 `_
     - | Pymarl(0.7)
   * - | 5m6m
     - 0.6
     - .. image:: images/benchmark/QMIX_5m6m.png
     - `config_link_5m6m `_
     - | Pymarl(0.76)
   * - | 2c_vs_64zg
     - 1
     - .. image:: images/benchmark/QMIX_2c_vs_64zg.png
     - `config_link_2c_vs_64zg `_
     - | Pymarl(1)

..
   * - | 3s5z_vs_3s6z
     - 0.03
     - .. image:: images/benchmark/QMIX_3s5z_vs_3s6z.png
     - `config_link_3s5z_vs_3s6z `_
     - | Pymarl(0.03)

P.S.:

1. The above results are obtained by running the same configuration on three different random seeds (0, 1, 2).

2. For a multi-agent, discrete-action-space algorithm like QMIX, the SMAC environment suite is generally used for testing, and performance is generally evaluated by the highest mean reward reached within 10M ``env_step`` of training. For more details about SMAC, please refer to the `SMAC Env Tutorial <../env_tutorial/smac_zh.html>`_.
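To reproduce results like those above, the typical entry point is DI-engine's serial pipeline together with one of the config files linked in the table. A minimal sketch (the exact config module path is an assumption and may differ between DI-engine versions; consult the config links above for the file matching your installation):

.. code-block:: python

    from ding.entry import serial_pipeline
    # Hypothetical config path; see the config links in the table above
    # for the file that matches your DI-engine version.
    from dizoo.smac.config.smac_MMM_qmix_config import main_config, create_config

    if __name__ == "__main__":
        # Train QMIX on the SMAC MMM map with random seed 0.
        serial_pipeline([main_config, create_config], seed=0)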
References
----------------
- Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. International Conference on Machine Learning. PMLR, 2018.

- Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, Thore Graepel. Value-Decomposition Networks for Cooperative Multi-Agent Learning. arXiv preprint arXiv:1706.05296, 2017.

- Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, Yung Yi. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. International Conference on Machine Learning. PMLR, 2019.

- Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.

Other Public Implementations
----------------------------
- pymarl_

.. _pymarl: https://github.com/oxwhirl/pymarl/blob/master/src/config/algs/qmix.yaml