QTRAN¶
Overview¶
QTRAN was proposed by Son et al. (2019). It is a value factorization method for MARL that is free from the structural constraints imposed by earlier factorization methods such as VDN and QMIX: it transforms the original joint action-value function into an easily factorizable one that shares the same optimal actions.
Compared with VDN (Sunehag et al. 2017) and QMIX (Rashid et al. 2018), QTRAN guarantees a more general factorization and thus covers a much wider class of MARL tasks than previous methods do; it also performs better than QMIX on the 5m_vs_6m and MMM2 maps.
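Concretely, "with the same optimal actions" refers to the IGM (Individual-Global-Max) consistency between the per-agent utilities and the joint action-value, stated here in the paper's notation (with the joint observation history denoted by bold tau and the joint action by bold u):

$$
\arg\max_{\mathbf{u}} Q_{\mathrm{jt}}(\boldsymbol{\tau}, \mathbf{u}) =
\begin{pmatrix}
\arg\max_{u_1} Q_1(\tau_1, u_1) \\
\vdots \\
\arg\max_{u_N} Q_N(\tau_N, u_N)
\end{pmatrix}
$$

That is, each agent greedily maximizing its own $Q_i$ recovers the optimal joint action of $Q_{\mathrm{jt}}$, which is what makes decentralized execution possible.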
Quick Facts¶
QTRAN uses the paradigm of centralized training with decentralized execution.
QTRAN is a model-free and value-based method.
QTRAN only supports discrete action spaces.
QTRAN is an off-policy multi-agent RL algorithm.
QTRAN considers a partially observable scenario in which each agent only obtains individual observations.
QTRAN accepts DRQN as the individual value network.
QTRAN learns the joint value function through an individual action-value network, a joint action-value network, and a state-value network.
Key Equations or Key Graphs¶
The overall QTRAN architecture including individual agent networks and the mixing network structure:
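As a rough illustration of these components (a minimal PyTorch-style sketch under assumed names and sizes, not DI-engine's actual qtran model), the code below defines a per-agent DRQN utility network $Q_i$, plus joint action-value ($Q_{\mathrm{jt}}$) and state-value ($V_{\mathrm{jt}}$) heads built on summed per-agent features:

```python
# Minimal sketch (illustrative assumptions, not DI-engine's implementation).
import torch
import torch.nn as nn


class IndividualQ(nn.Module):
    """DRQN-style per-agent network: (obs_t, h_{t-1}) -> Q_i(tau_i, .) and h_t."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.fc = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        h_next = self.rnn(torch.relu(self.fc(obs)), h)
        return self.head(h_next), h_next


class JointQAndV(nn.Module):
    """Joint action-value Q_jt(tau, u) and state-value V_jt(tau) heads."""

    def __init__(self, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.q_jt = nn.Sequential(nn.Linear(hidden_dim + action_dim, hidden_dim),
                                  nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.v_jt = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                  nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, agent_hidden: torch.Tensor, action_onehot: torch.Tensor):
        # Summing over the agent axis keeps the input size independent of agent count.
        feat = agent_hidden.sum(dim=1)      # (batch, hidden_dim)
        act = action_onehot.sum(dim=1)      # (batch, action_dim)
        return self.q_jt(torch.cat([feat, act], dim=-1)), self.v_jt(feat)
```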
QTRAN trains the mixing network by minimizing the following loss:
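Spelled out (a reconstruction of the QTRAN-base objective from Son et al. (2019); $\lambda_{\mathrm{opt}}$ and $\lambda_{\mathrm{nopt}}$ are scalar loss weights):

$$
L(\boldsymbol{\tau}, \mathbf{u}, r, \boldsymbol{\tau}'; \theta)
= L_{\mathrm{td}} + \lambda_{\mathrm{opt}} L_{\mathrm{opt}} + \lambda_{\mathrm{nopt}} L_{\mathrm{nopt}}
$$

$$
\begin{aligned}
L_{\mathrm{td}}   &= \bigl( Q_{\mathrm{jt}}(\boldsymbol{\tau}, \mathbf{u}) - y^{\mathrm{dqn}}(r, \boldsymbol{\tau}'; \theta^{-}) \bigr)^2 \\
L_{\mathrm{opt}}  &= \bigl( Q'_{\mathrm{jt}}(\boldsymbol{\tau}, \bar{\mathbf{u}}) - \hat{Q}_{\mathrm{jt}}(\boldsymbol{\tau}, \bar{\mathbf{u}}) + V_{\mathrm{jt}}(\boldsymbol{\tau}) \bigr)^2 \\
L_{\mathrm{nopt}} &= \Bigl( \min\bigl[ Q'_{\mathrm{jt}}(\boldsymbol{\tau}, \mathbf{u}) - \hat{Q}_{\mathrm{jt}}(\boldsymbol{\tau}, \mathbf{u}) + V_{\mathrm{jt}}(\boldsymbol{\tau}),\, 0 \bigr] \Bigr)^2
\end{aligned}
$$

where $Q'_{\mathrm{jt}} = \sum_{i} Q_i$ is the factorized joint value, $\hat{Q}_{\mathrm{jt}}$ is the joint network's estimate treated as a constant, $\bar{\mathbf{u}} = \bigl(\arg\max_{u_i} Q_i(\tau_i, u_i)\bigr)_i$ is the greedy joint action, and $y^{\mathrm{dqn}}$ is the one-step TD target computed with target networks.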
Pseudo-code¶
The following flow chart shows how QTRAN is trained.
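In outline, a training iteration can be summarized by the pseudocode below (a sketch of the standard QTRAN-base loop under the loss above, not DI-engine's exact implementation; helper names such as `collect_episodes` and `factorization_losses` are hypothetical):

```python
# Pseudocode sketch of a QTRAN-base training iteration (helper names are hypothetical).
for iteration in range(max_iterations):
    # 1. Collect: each agent acts eps-greedily on its own utility Q_i(tau_i, .),
    #    conditioned on its local observation history (DRQN hidden state).
    replay_buffer.extend(collect_episodes(env, individual_q_nets, eps))

    for _ in range(update_per_collect):
        batch = replay_buffer.sample(batch_size)

        # 2. Forward the three networks on the sampled transitions.
        q_i = [net(batch.obs[i]) for i, net in enumerate(individual_q_nets)]   # Q_i(tau_i, .)
        q_jt = joint_q_net(batch.state, batch.actions)                         # Q_jt(tau, u)
        v_jt = state_value_net(batch.state)                                    # V_jt(tau)

        # 3. TD loss on the joint action-value, using a target network.
        y = batch.reward + gamma * (1 - batch.done) * target_joint_q(
            batch.next_state, greedy_joint_action(batch.next_obs))
        l_td = ((q_jt - y.detach()) ** 2).mean()

        # 4. Opt / nopt losses push sum_i Q_i (+ V_jt) to match Q_jt at the greedy
        #    joint action and to upper-bound it elsewhere (factorization condition).
        l_opt, l_nopt = factorization_losses(q_i, q_jt.detach(), v_jt, batch.actions)

        loss = l_td + lambda_opt * l_opt + lambda_nopt * l_nopt
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # 5. Soft-update the target network (momentum = learn.target_update_theta).
        soft_update(target_joint_q, joint_q_net, theta=0.001)
```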
Extensions¶
QTRAN++ (Son et al. 2020), an extension of QTRAN, successfully bridges the gap between empirical performance and theoretical guarantees, and achieves state-of-the-art performance in the SMAC environment.
Implementations¶
The default config is defined as follows:
- class ding.policy.qtran.QTRANPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
- Overview:
Policy class of the QTRAN algorithm. QTRAN is a multi-agent reinforcement learning algorithm; the paper is available at https://arxiv.org/abs/1905.05408
- Config:

ID | Symbol | Type | Default Value | Description | Other (Shape)
---|---|---|---|---|---
1 | type | str | qtran | RL policy register name, refer to registry POLICY_REGISTRY | this arg is optional, a placeholder
2 | cuda | bool | True | Whether to use cuda for network | this arg can be different from modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling weight to correct biased update | IS weight
6 | learn.update_per_collect | int | 20 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | this arg can vary from env to env; a bigger value means more off-policy
7 | learn.target_update_theta | float | 0.001 | Target network update momentum parameter | between [0, 1]
8 | learn.discount_factor | float | 0.99 | Reward's future discount factor, aka. gamma | may be 1 in sparse-reward envs
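As an illustration only, a hypothetical user-side override dict mirroring the defaults in the table above (key names follow the Symbol column, nested under `learn` where indicated; this is not copied from DI-engine's source):

```python
from easydict import EasyDict

# Hypothetical policy config mirroring the table's defaults.
qtran_policy_cfg = EasyDict(dict(
    type='qtran',
    cuda=True,
    on_policy=False,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=20,       # gradient updates per collection
        target_update_theta=0.001,   # soft target-network update momentum
        discount_factor=0.99,        # gamma
    ),
))
```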
- The network interface used by QTRAN is defined as follows:
- ding.model.template.qtran
The benchmark results of QTRAN implemented in DI-engine on SMAC (Samvelyan et al. 2019), a benchmark for StarCraft II micromanagement problems, are shown below.
smac map | best mean reward | evaluation results | config link | comparison
---|---|---|---|---
MMM | 1.00 | | | Pymarl(1.0)
3s5z | 0.95 | | | Pymarl(0.1)
5m6m | 0.55 | | | Pymarl(0.7)
References¶
Son, Kyunghwan, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. "QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning." ICML, 2019.