QMIX
Overview
QMIX was proposed by Rashid et al. (2018) to learn a joint action-value function conditioned on global state information during centralized multi-agent training, and to extract decentralized execution policies from this centralized end-to-end framework. QMIX uses a centralized neural network to estimate the joint action value as a complex non-linear combination of per-agent action values that are conditioned only on local observations. This provides a novel representation of the centralized action-value function and guarantees consistency between the centralized and decentralized policies.
QMIX is a non-linear extension of VDN (Sunehag et al. 2017). Compared with VDN (Value-Decomposition Networks for Cooperative Multi-Agent Learning), QMIX can exploit extra state information outside the agents' observation ranges during training, fed into the hyper-networks as global information, and it can represent a much richer class of action-value functions.
Quick Facts
1. QMIX uses the paradigm of centralized training with decentralized execution.
2. QMIX is a model-free, value-based, off-policy, multi-agent reinforcement learning method.
3. QMIX only supports discrete action spaces.
4. QMIX considers a partially observable scenario in which each agent only obtains individual observations.
5. QMIX adopts DRQN as the individual value network to handle partial observability (a minimal agent-network sketch follows this list).
6. QMIX uses an architecture composed of agent networks, a mixing network, and hyper-networks to represent the joint value function. The mixing network is a feed-forward neural network that takes the agent networks' outputs as input and mixes them monotonically, producing the joint action value. The weights of the mixing network are produced by separate hyper-networks.
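The snippet below is a minimal, illustrative sketch of such a DRQN-style agent network, not DI-engine's actual module: the local observation is encoded by an MLP, a GRU cell carries information across timesteps to handle partial observability, and a final layer outputs per-action Q values. The class name DRQNAgent and all sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Per-agent recurrent Q network (illustrative sketch): obs -> MLP -> GRU -> Q(a)."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        # obs: (B, obs_dim), h: (B, hidden_dim) hidden state from the previous timestep
        x = torch.relu(self.fc1(obs))
        h_next = self.rnn(x, h)      # the GRU carries information across timesteps
        q = self.fc2(h_next)         # per-action Q values: (B, action_dim)
        return q, h_next

# Usage: keep the hidden state across the timesteps of an episode.
agent = DRQNAgent(obs_dim=16, action_dim=5)
h = torch.zeros(4, 64)                       # batch of 4 parallel environments
q, h = agent(torch.randn(4, 16), h)
greedy_action = q.argmax(dim=-1)             # decentralized greedy action selection
```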
Key Equations or Key Graphs
VDN and QMIX are representative methods built on the idea of factorizing the joint action-value function \(Q_{tot}\) into individual action-value functions \(Q_a\) for decentralized execution.
To achieve centralized training with decentralized execution (CTDE), we need to ensure that a global \(argmax\) performed on \(Q_{tot}\) yields the same result as a set of individual \(argmax\) operations performed on each \(Q_a\):

\[
\underset{\boldsymbol{u}}{\operatorname{argmax}}\, Q_{tot}(\boldsymbol{\tau}, \boldsymbol{u})=\left(\begin{array}{c}\underset{u_{1}}{\operatorname{argmax}}\, Q_{1}\left(\tau_{1}, u_{1}\right) \\ \vdots \\ \underset{u_{N}}{\operatorname{argmax}}\, Q_{N}\left(\tau_{N}, u_{N}\right)\end{array}\right)
\]
VDN factorizes the joint action-value function into a sum of individual action-value functions:

\[
Q_{tot}(\boldsymbol{\tau}, \boldsymbol{u})=\sum_{i=1}^{N} Q_{i}\left(\tau_{i}, u_{i}\right)
\]
QMIX extends this additive value factorization by representing the joint action-value function as a monotonic function of the individual action values. QMIX relies on monotonicity, a constraint on the relationship between the joint action value \(Q_{tot}\) and the individual action values \(Q_a\):

\[
\frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0, \quad \forall a
\]
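As a self-contained illustration of why this monotonicity constraint is sufficient for the \(argmax\) consistency above (the toy check below is not from the paper, and the positive mixing weights are arbitrary), a brute-force joint \(argmax\) under a monotonically increasing mix of two random per-agent Q tables coincides with the per-agent \(argmax\):

```python
import itertools
import torch

torch.manual_seed(0)
n_actions = 4
q1 = torch.randn(n_actions)   # Q_1(tau_1, u_1) for each action of agent 1
q2 = torch.randn(n_actions)   # Q_2(tau_2, u_2) for each action of agent 2

def q_tot(u1, u2, w1=0.7, w2=1.3, b=0.1):
    # Any mix that is increasing in each Q_a (here: a positive-weighted sum)
    # satisfies the monotonicity constraint.
    return float(w1 * q1[u1] + w2 * q2[u2] + b)

# Brute-force joint argmax over all action combinations ...
joint_best = max(itertools.product(range(n_actions), repeat=2),
                 key=lambda u: q_tot(*u))
# ... equals the per-agent (decentralized) argmax.
decentralized = (int(q1.argmax()), int(q2.argmax()))
assert joint_best == decentralized
print(joint_best, decentralized)
```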
The overall architecture of QMIX consists of individual agent networks, a mixing network, and hyper-networks; see Rashid et al. (2018) for the architecture figure.
QMIX trains the mixing network by minimizing the following loss:

\[
\mathcal{L}(\theta)=\sum_{i=1}^{b}\left[\left(y_{i}^{tot}-Q_{tot}(\boldsymbol{\tau}, \boldsymbol{u}, s ; \theta)\right)^{2}\right]
\]

where \(b\) is the batch size of transitions sampled from the replay buffer, \(y^{tot}=r+\gamma \max_{\boldsymbol{u}^{\prime}} Q_{tot}\left(\boldsymbol{\tau}^{\prime}, \boldsymbol{u}^{\prime}, s^{\prime}; \theta^{-}\right)\), and \(\theta^{-}\) are the parameters of the target network.
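A minimal PyTorch sketch of this objective is shown below; mixer and target_mixer stand for any mixing network that maps per-agent Q values and the global state to \(Q_{tot}\) (for example, the mixer sketched after the next paragraph), and all argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def qmix_td_loss(mixer, target_mixer, chosen_agent_qs, target_max_agent_qs,
                 state, next_state, reward, done, gamma=0.99):
    """One TD step of the QMIX objective (illustrative sketch).

    chosen_agent_qs:     (B, A) Q_a(tau_a, u_a) of the actions actually taken
    target_max_agent_qs: (B, A) max_u' Q_a(tau_a', u_a') from the target agent networks
    state, next_state:   (B, S) global states fed to the mixing networks
    reward, done:        (B,)   float tensors
    """
    q_tot = mixer(chosen_agent_qs, state).squeeze(-1)                    # (B,)
    with torch.no_grad():
        target_q_tot = target_mixer(target_max_agent_qs, next_state).squeeze(-1)
        y_tot = reward + gamma * (1.0 - done) * target_q_tot             # TD target
    return F.mse_loss(q_tot, y_tot)
```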
Each weight of the mixing network is produced by an independent hyper-network, which takes the global state as input and outputs the weights of one layer of the mixing network. More details can be found in the original paper, Rashid et al. (2018).
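The sketch below shows one way to implement such a hyper-network-conditioned monotonic mixer; it is a simplified, illustrative version rather than DI-engine's exact module. The hyper-networks map the global state to the mixing weights, and taking their absolute value enforces the monotonicity constraint above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """Mixes per-agent Q values into Q_tot with state-conditioned, non-negative weights."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hyper-networks: global state -> weights/biases of the two mixing layers.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (B, A), state: (B, S)
        bs = agent_qs.size(0)
        # abs() keeps every mixing weight non-negative, which enforces dQ_tot/dQ_a >= 0.
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)        # (B, 1, E)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)                  # Q_tot: (B, 1)

mixer = MonotonicMixer(n_agents=3, state_dim=10)
print(mixer(torch.randn(4, 3), torch.randn(4, 10)).shape)               # torch.Size([4, 1])
```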
VDN and QMIX are methods that try to factorize \(Q_{tot}\), assuming additivity and monotonicity, respectively. Joint action-value functions that satisfy these conditions are therefore factorized well by VDN and QMIX. However, there exist tasks whose joint action-value functions do not satisfy these conditions. QTRAN (Son et al. 2019) proposes a factorization method that is free from such structural constraints by transforming the original joint action-value function into an easily factorizable one. QTRAN (QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning) guarantees a more general factorization than VDN or QMIX.
Implementation
The default settings of the algorithm are defined as follows:
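For orientation, the sketch below assembles the default values from the config table further down into an EasyDict; it is only an illustrative subset, not DI-engine's complete default config, which contains additional fields for the model, collection, and evaluation.

```python
from easydict import EasyDict

# Illustrative subset of the QMIX defaults listed in the config table below;
# the real default config in DI-engine contains more fields than shown here.
qmix_default_config = EasyDict(dict(
    type='qmix',                    # RL policy register name
    cuda=True,                      # whether to use cuda for the networks
    on_policy=False,                # QMIX is trained off-policy from a replay buffer
    priority=False,                 # prioritized experience replay (PER)
    priority_IS_weight=False,       # importance-sampling correction for PER
    learn=dict(
        update_per_collect=20,      # gradient updates after each collection
        target_update_theta=0.001,  # soft target-network update momentum
        discount_factor=0.99,       # gamma
    ),
))
```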
- class ding.policy.qmix.QMIXPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
- Overview:
Policy class of the QMIX algorithm. QMIX is a multi-agent reinforcement learning algorithm; the paper is available at https://arxiv.org/abs/1803.11485.
- Config:
| ID | Symbol | Type | Default Value | Description | Other (Shape) |
| --- | --- | --- | --- | --- | --- |
| 1 | type | str | qmix | RL policy register name, refer to registry POLICY_REGISTRY | this arg is optional, a placeholder |
| 2 | cuda | bool | True | Whether to use cuda for network | this arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update | IS weight |
| 6 | learn.update_per_collect | int | 20 | How many updates (iterations) to train after the collector's one collection. Only valid in serial training | this arg can vary between envs; a bigger value means more off-policy |
| 7 | learn.target_update_theta | float | 0.001 | Target network update momentum parameter | between [0, 1] |
| 8 | learn.discount_factor | float | 0.99 | Reward's future discount factor, aka. gamma | may be 1 in sparse reward envs |
The network interface used by QMIX is defined as follows:
- class ding.model.template.QMix(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, mixer: bool = True, lstm_type: str = 'gru', activation: Module = ReLU(), dueling: bool = False)[source]
- Overview:
  The neural network and computation graph of algorithms related to QMIX (https://arxiv.org/abs/1803.11485). QMIX is composed of two parts: the agent Q network and the mixer (optional). The QMIX paper mentions that all agents share the local Q network parameters, so only one Q network is initialized here. Summation or the mixer network is then used to process the local Q values, according to the mixer setting, to obtain the global Q.
- Interface:
  __init__, forward.
- forward(data: dict, single_step: bool = True) -> dict [source]
- Overview:
  QMIX forward computation graph. The input dict contains time-series observations and related data used to predict the total q_value and each agent's q_value.
- Arguments:
  - data (dict): Input data dict with keys ['obs', 'prev_state', 'action'].
    - agent_state (torch.Tensor): Time-series local observation data of each agent.
    - global_state (torch.Tensor): Time-series global observation data.
    - prev_state (list): Previous RNN state for the q_network.
    - action (torch.Tensor or None): The actions of each agent, given from outside the function. If action is None, the argmax q_value index is used as the action to calculate agent_q_act.
  - single_step (bool): Whether to forward a single step; if so, a timestep dim is added before the forward pass and removed afterwards.
- Returns:
  - ret (dict): Output data dict with keys [total_q, logit, next_state].
- ReturnsKeys:
  - total_q (torch.Tensor): Total q_value, which is the result of the mixer network.
  - agent_q (torch.Tensor): Each agent's q_value.
  - next_state (list): Next RNN state for the q_network.
- Shapes:
  - agent_state (torch.Tensor): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, and N is obs_shape.
  - global_state (torch.Tensor): \((T, B, M)\), where M is global_obs_shape.
  - prev_state (list): \((B, A)\), a list of length B, each element of which is a list of length A.
  - action (torch.Tensor): \((T, B, A)\).
  - total_q (torch.Tensor): \((T, B)\).
  - agent_q (torch.Tensor): \((T, B, A, P)\), where P is action_shape.
  - next_state (list): \((B, A)\), a list of length B, each element of which is a list of length A.
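Based only on the signature and shapes documented above, a hypothetical usage sketch might look like the following; the chosen sizes, the hidden_size_list value, and the use of None entries as the initial RNN state are assumptions for illustration rather than guaranteed behavior of the library.

```python
import torch
from ding.model.template import QMix

T, B, A = 8, 4, 3        # timestep, batch_size, agent_num
N, M, P = 16, 32, 6      # obs_shape, global_obs_shape, action_shape

# hidden_size_list is assumed to hold the hidden sizes of the agent Q network.
model = QMix(agent_num=A, obs_shape=N, global_obs_shape=M, action_shape=P,
             hidden_size_list=[64, 64], mixer=True)

data = {
    'obs': {
        'agent_state': torch.randn(T, B, A, N),
        'global_state': torch.randn(T, B, M),
    },
    # Assumed: None entries let the model initialize the RNN state of each sample.
    'prev_state': [[None for _ in range(A)] for _ in range(B)],
    'action': None,      # None -> use each agent's argmax Q value as its action
}
out = model(data, single_step=False)
print(out['total_q'].shape)   # expected (T, B)
print(out['logit'].shape)     # per-agent Q values (agent_q), expected (T, B, A, P)
```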
Benchmark
| environment | best mean reward | evaluation results | config link | comparison |
| --- | --- | --- | --- | --- |
| MMM | 1 | | | Pymarl(1) |
| 3s5z | 1 | | | Pymarl(1) |
| MMM2 | 0.8 | | | Pymarl(0.7) |
| 5m6m | 0.6 | | | Pymarl(0.76) |
| 2c_vs_64zg | 1 | | | Pymarl(1) |
P.S.:
1. The above results are obtained by running the same configuration on five different random seeds (0, 1, 2, 3, 4).
2. For multi-agent algorithms with discrete action spaces such as QMIX, the SMAC environment suite is usually used for testing, and evaluation is usually based on the highest mean reward within 10M env_step of training.

For more details about SMAC, please refer to the SMAC Env Tutorial.
References
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. International Conference on Machine Learning. PMLR, 2018.
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, Thore Graepel. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, Yung Yi. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. International Conference on Machine Learning. PMLR, 2019.
Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.