
CollaQ

Overview

CollaQ (Collaborative Q-learning; Zhang et al. 2020) is a Q-learning based multi-agent collaboration approach that formulates multi-agent collaboration as a joint optimization problem over reward assignments. CollaQ decomposes the decentralized Q-function of each agent into two terms: a self term that depends only on the agent's own state, and an interactive term that depends on the states of nearby agents. CollaQ is trained with regular DQN, regularized by a Multi-Agent Reward Attribution (MARA) loss.

Quick Facts

  1. CollaQ is a model-free and value-based multi-agent RL approach.

  2. CollaQ only supports discrete action spaces.

  3. CollaQ is an off-policy algorithm.

  4. CollaQ considers a partially observable scenario in which each agent only obtains individual observations.

  5. CollaQ uses the DRQN architecture for individual Q-learning.

  6. Compared to QMIX and VDN, CollaQ does not need a centralized Q-function; instead, it expands each agent's individual Q-function with a reward assignment that depends on the joint state.

Key Equations or Key Graphs

The overall architecture of the attention-based Q-function model in CollaQ:

../_images/collaq.png

The Q-function for agent i:

\[Q_{i}(s_{i},a_{i};\hat{\textbf{r}}_{i}) = \underbrace{Q_{i}(s_{i}, a_{i};\textbf{r}_{0i})}_{Q^{alone}(s_{i},a_{i})} + \underbrace{\nabla_{\textbf{r}}Q_{i}(s_{i},a_{i};\textbf{r}_{0i})\cdot(\hat{\textbf{r}}_{i} - \textbf{r}_{0i}) + \mathcal{O}(||\hat{\textbf{r}}_{i} - \textbf{r}_{0i}||^{2})}_{Q^{collab}(s^{local}_{i}, a_{i})}\]
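
The same decomposition can be mirrored in code. Below is a minimal sketch (not the DI-engine implementation) that assumes two hypothetical networks: q_alone_net, which sees only the agent's own observation, and q_collab_net, which sees the full local observation including nearby allies; their outputs are summed to form the per-agent Q-value.

import torch
import torch.nn as nn

class DecomposedQ(nn.Module):
    """Illustrative sketch of the CollaQ Q-value decomposition (module names are assumptions)."""

    def __init__(self, alone_obs_dim: int, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        # Q^{alone}: depends only on the agent's own observation.
        self.q_alone_net = nn.Sequential(
            nn.Linear(alone_obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim)
        )
        # Q^{collab}: depends on the full local observation (self + nearby allies).
        self.q_collab_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim)
        )

    def forward(self, obs_alone: torch.Tensor, obs: torch.Tensor) -> dict:
        q_alone = self.q_alone_net(obs_alone)    # Q^{alone}(s_i, a_i)
        q_collab = self.q_collab_net(obs)        # Q^{collab}(s_i^{local}, a_i)
        return {'q': q_alone + q_collab, 'q_alone': q_alone, 'q_collab': q_collab}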

The overall training objective of standard DQN training with MARA loss:

\[L = \mathbb{E}_{s_{i},a_{i}\sim\rho(\cdot)}\left[\underbrace{(y-Q_{i}(o_{i},a_{i}))^{2}}_{\text{DQN objective}} +\underbrace{\alpha\,(Q_{i}^{collab}(o_{i}^{alone}, a_{i}))^{2}}_{\text{MARA objective}}\right]\]
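
Concretely, this objective is a standard TD loss plus a squared penalty on Q^{collab} evaluated at the "alone" observation (ally features padded out). The snippet below is a minimal example, not the DI-engine implementation; the tensors q, target_q and q_collab_alone are assumed to be produced by the surrounding training loop, and alpha corresponds to learn.collaq_loss_weight.

import torch
import torch.nn.functional as F

def collaq_loss(q: torch.Tensor,                # Q_i(o_i, a_i) for the taken actions
                target_q: torch.Tensor,         # y = r + gamma * max_a' Q_i^target(o_i', a')
                q_collab_alone: torch.Tensor,   # Q_i^collab(o_i^alone, a_i)
                alpha: float = 1.0) -> torch.Tensor:
    # DQN objective: squared TD error against the (detached) target.
    dqn_loss = F.mse_loss(q, target_q.detach())
    # MARA objective: drive the collaborative term to zero when allies are absent.
    mara_loss = (q_collab_alone ** 2).mean()
    return dqn_loss + alpha * mara_loss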

Extensions

  • CollaQ can choose whether to use an attention-based architecture, because the observation can be spatially large and cover agents whose states contribute little to a given agent's policy. Specifically, CollaQ uses a transformer architecture (stacking multiple layers of attention modules), which empirically improves performance in multi-agent tasks; a minimal sketch is given below.
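
One such attention layer can be sketched as follows; it treats the agent's own features as the query and ally features as keys/values. The module name, feature splits, and shapes here are illustrative assumptions, not the exact DI-engine architecture.

import torch
import torch.nn as nn

class AllyAttention(nn.Module):
    """One attention layer over ally features (illustrative sketch)."""

    def __init__(self, self_dim: int, ally_dim: int, attention_size: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(self_dim, attention_size)
        self.k_proj = nn.Linear(ally_dim, attention_size)
        self.v_proj = nn.Linear(ally_dim, attention_size)

    def forward(self, self_feat: torch.Tensor, ally_feat: torch.Tensor) -> torch.Tensor:
        # self_feat: (B, self_dim); ally_feat: (B, num_allies, ally_dim)
        q = self.q_proj(self_feat).unsqueeze(1)                       # (B, 1, attention_size)
        k = self.k_proj(ally_feat)                                    # (B, num_allies, attention_size)
        v = self.v_proj(ally_feat)                                    # (B, num_allies, attention_size)
        scores = torch.matmul(q, k.transpose(1, 2)) / k.shape[-1] ** 0.5
        weights = torch.softmax(scores, dim=-1)                       # attention over allies
        return torch.matmul(weights, v).squeeze(1)                    # (B, attention_size)

Stacking several such layers (with residual connections and feed-forward blocks) yields the transformer-style encoder that CollaQ uses to weight nearby agents.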

Implementations

The default config is defined as follows:

class ding.policy.collaq.CollaQPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of the CollaQ algorithm. CollaQ is a multi-agent reinforcement learning algorithm.

Interface:
_init_learn, _data_preprocess_learn, _forward_learn, _reset_learn, _state_dict_learn, _load_state_dict_learn

_init_collect, _forward_collect, _reset_collect, _process_transition, _init_eval, _forward_eval, _reset_eval, _get_train_sample, default_model

Config:

ID | Symbol                    | Type  | Default Value | Description                                                                                                   | Other (Shape)
---|---------------------------|-------|---------------|---------------------------------------------------------------------------------------------------------------|----------------------------------------------------------
1  | type                      | str   | collaq        | RL policy register name, refer to registry POLICY_REGISTRY                                                   | this arg is optional, a placeholder
2  | cuda                      | bool  | True          | Whether to use cuda for network                                                                               | this arg can differ between modes
3  | on_policy                 | bool  | False         | Whether the RL algorithm is on-policy or off-policy                                                           |
4  | priority                  | bool  | False         | Whether to use priority (PER)                                                                                 | priority sample, update priority
5  | priority_IS_weight        | bool  | False         | Whether to use Importance Sampling weight to correct the biased update                                        | IS weight
6  | learn.update_per_collect  | int   | 20            | How many updates (iterations) to train after one collection by the collector; only valid in serial training  | this arg can vary across envs; a bigger value means more off-policy
7  | learn.target_update_theta | float | 0.001         | Target network update momentum parameter                                                                      | between [0, 1]
8  | learn.discount_factor     | float | 0.99          | Reward's future discount factor, aka. gamma                                                                   | may be 1 in sparse reward envs
9  | learn.collaq_loss_weight  | float | 1.0           | The weight of the CollaQ MARA loss                                                                            |
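
As a usage sketch, the fields above can be assembled into a nested config dict and merged into a policy configuration. The snippet below only mirrors the table and is not an official example, so the full schema should be taken from ding.policy.collaq rather than from here.

from easydict import EasyDict

# Hedged sketch of a CollaQ policy config override; field names follow the table above.
collaq_policy_cfg = EasyDict(dict(
    type='collaq',
    cuda=True,
    on_policy=False,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=20,
        target_update_theta=0.001,
        discount_factor=0.99,
        collaq_loss_weight=1.0,
    ),
))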

The network interface used by CollaQ is defined as follows:

class ding.model.template.collaq.CollaQ(agent_num: int, obs_shape: int, alone_obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, attention: bool = False, self_feature_range: List[int] | None = None, ally_feature_range: List[int] | None = None, attention_size: int = 32, mixer: bool = True, lstm_type: str = 'gru', activation: Module = ReLU(), dueling: bool = False)[source]
Overview:

The network of the CollaQ (Collaborative Q-learning) algorithm. It consists of two parts: q_network and q_alone_network. The q_network computes the Q-value from the agent's observation, including the observed features of the allies the agent attends to; the q_alone_network computes the Q-value from the agent's observation with those ally features removed. Multi-Agent Collaboration via Reward Attribution Decomposition: https://arxiv.org/abs/2010.08531

Interface:

__init__, forward, _setup_global_encoder

forward(data: dict, single_step: bool = True) → dict[source]
Overview:

The forward method calculates the q_value of each agent and the total q_value of all agents. The q_value of each agent is calculated by the q_network, and the total q_value is calculated by the mixer.

Arguments:
  • data (dict): input data dict with keys [‘obs’, ‘prev_state’, ‘action’]
    • agent_state (torch.Tensor): each agent local state(obs)

    • agent_alone_state (torch.Tensor): each agent's local state alone; in the SMAC setting, this is the observation without ally features (obs_alone)

    • global_state (torch.Tensor): global state(obs)

    • prev_state (list): previous rnn state, which should include 3 parts: one hidden state of q_network, and two hidden states of q_alone_network for the obs and obs_alone inputs

    • action (torch.Tensor or None): if action is None, use argmax q_value index as action to calculate agent_q_act

  • single_step (bool): whether single_step forward, if so, add timestep dim before forward and remove it after forward

Return:
  • ret (dict): output data dict with keys [‘total_q’, ‘logit’, ‘next_state’]
    • total_q (torch.Tensor): total q_value, which is the result of mixer network

    • agent_q (torch.Tensor): each agent's q_value (returned under the 'logit' key)

    • next_state (list): next rnn state

Shapes:
  • agent_state (torch.Tensor): \((T, B, A, N)\), where T is timestep, B is batch_size, A is agent_num, N is obs_shape

  • global_state (torch.Tensor): \((T, B, M)\), where M is global_obs_shape

  • prev_state (list): \((B, A)\), a list of length B, and each element is a list of length A

  • action (torch.Tensor): \((T, B, A)\)

  • total_q (torch.Tensor): \((T, B)\)

  • agent_q (torch.Tensor): \((T, B, A, P)\), where P is action_shape

  • next_state (list): \((B, A)\), a list of length B, and each element is a list of length A

Examples:
>>> collaQ_model = CollaQ(
>>>     agent_num=4,
>>>     obs_shape=32,
>>>     alone_obs_shape=24,
>>>     global_obs_shape=32 * 4,
>>>     action_shape=9,
>>>     hidden_size_list=[128, 64],
>>>     self_feature_range=[8, 10],
>>>     ally_feature_range=[10, 16],
>>>     attention_size=64,
>>>     mixer=True,
>>>     activation=torch.nn.Tanh()
>>> )
>>> data={
>>>     'obs': {
>>>         'agent_state': torch.randn(8, 4, 4, 32),                # (T, B, A, obs_shape)
>>>         'agent_alone_state': torch.randn(8, 4, 4, 24),          # (T, B, A, alone_obs_shape)
>>>         'agent_alone_padding_state': torch.randn(8, 4, 4, 32),  # alone obs padded to obs_shape
>>>         'global_state': torch.randn(8, 4, 32 * 4),              # (T, B, global_obs_shape)
>>>         'action_mask': torch.randint(0, 2, size=(8, 4, 4, 9))   # (T, B, A, action_shape)
>>>     },
>>>     'prev_state': [[[None for _ in range(4)] for _ in range(3)] for _ in range(4)],  # B x 3 x A rnn states
>>>     'action': torch.randint(0, 9, size=(8, 4, 4))               # (T, B, A)
>>> }
>>> output = collaQ_model(data, single_step=False)
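
The returned dict can then be consumed as described above; for instance (continuing the example, where the 'logit' key holds the per-agent Q-values):

>>> total_q = output['total_q']        # (T, B) mixed Q-value from the mixer
>>> agent_q = output['logit']          # (T, B, A, P) per-agent Q-values
>>> next_state = output['next_state']  # rnn hidden states for the next forward call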

The benchmark results of CollaQ implemented in DI-engine on SMAC (Samvelyan et al. 2019), a benchmark for StarCraft II micromanagement problems, are shown below.

Benchmark

Environment | Best mean reward | Evaluation results              | Config link   | Comparison
------------|------------------|---------------------------------|---------------|-------------
5m6m        | 1                | ../_images/smac_5m6m_collaq.png | config_link_p | Pymarl (0.8)
MMM         | 0.7              | ../_images/smac_MMM_collaq.png  | config_link_q | Pymarl (1)
3s5z        | 1                | ../_images/smac_3s5z_collaq.png | config_link_s | Pymarl (1)

P.S.:

The above results are obtained by running the same configuration on three different random seeds (0, 1, 2).

References

  • Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian. Multi-Agent Collaboration via Reward Attribution Decomposition. arXiv preprint arXiv:2010.08531, 2020.

  • Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.

Other Public Implementations