
COMA

Overview

COMA (counterfactual multi-agent policy gradients; Foerster et al. 2018) is a multi-agent actor-critic approach that learns a fully centralized state-action value function and uses it to guide the optimization of decentralized policies. COMA uses a centralized critic to train decentralized actors for the individual agents, estimating a counterfactual advantage function for each agent in order to address multi-agent credit assignment. The counterfactual baseline marginalizes out a single agent’s action while keeping the other agents’ actions fixed, and the centralized critic representation allows this baseline to be computed efficiently in a single forward pass.

Quick Facts

  1. COMA uses the paradigm of centralized training with decentralized execution.

  2. COMA is a model-free actor-critic method.

  3. COMA focuses on settings with discrete actions. It can be extended to continuous action spaces with other estimation methods.

  4. COMA uses on-policy policy gradient learning to train its critic.

  5. COMA has poor sample efficiency and is prone to getting stuck in sub-optimal local minima.

  6. COMA considers a partially observable setting in which each agent only obtains individual observations. Agents must rely on local action-observation histories during execution.

  7. COMA adopts Independent Actor-Critic (IAC) style networks for the individual agents and speeds learning by sharing parameters among the agents.

  8. Since learning is centralized, the centralized critic in COMA estimates Q-values for the joint action conditioned on the central state (a minimal sketch of such a critic is given after this list).
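
As a concrete illustration of item 8, the following PyTorch sketch shows what such a centralized critic can look like. It is not DI-engine's implementation; the class name, layer sizes and the simplified input set (central state, the agent's observation, a one-hot agent ID and the other agents' actions, omitting for example the previous joint action that the original COMA critic also conditions on) are assumptions for illustration.

import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Illustrative COMA-style centralized critic (names and sizes are assumptions).

    For agent a it takes the central state, agent a's observation, a one-hot
    agent ID and the other agents' actions u^{-a}, and returns one Q-value per
    candidate action u'^a of agent a in a single forward pass.
    """

    def __init__(self, state_dim: int, obs_dim: int, n_agents: int,
                 n_actions: int, hidden: int = 128):
        super().__init__()
        in_dim = state_dim + obs_dim + n_agents + (n_agents - 1) * n_actions
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, obs, agent_id_onehot, other_actions_onehot):
        # other_actions_onehot: (B, n_agents - 1, n_actions), one-hot of u^{-a}
        x = torch.cat(
            [state, obs, agent_id_onehot, other_actions_onehot.flatten(start_dim=1)],
            dim=1,
        )
        return self.net(x)  # (B, n_actions): Q(s, (u^{-a}, u'^a)) for every u'^a

Producing Q-values for every candidate action of agent \(a\) in one pass is what makes the counterfactual baseline in the next section cheap to compute.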

Key Equations or Key Graphs

The overall information flow between the decentralized actors, the environment and the centralized critic in COMA:

../_images/coma.png

COMA computes an advantage function that compares the Q-value for the current action \(u^{a}\) to a counterfactual baseline that marginalizes out \(u^{a}\), while keeping the other agents’ actions \(u^{-a}\) fixed.

\[A^{a}(s, \textbf{u}) = Q(s, \textbf{u}) - \sum_{u^{'a}}\pi^{a}(u^{'a}|\tau^{a})Q(s, (\textbf{u}^{-a}, u^{'a}))\]

The advantage \(A^{a}(s, \textbf{u})\) gives each agent a separate baseline that uses the centralized critic to reason about counterfactuals in which only agent \(a\)’s action changes, and it is learned directly from the agents’ experiences.

The first term in the equation is the global Q-value of the currently selected joint action, as estimated by the centralized critic. The second term is the expected global Q-value over all possible actions of agent \(a\), taken under agent \(a\)’s own policy while the other agents’ actions are held fixed. The difference between the two reflects the advantage of the action selected by the current agent over the policy’s average result.
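
A minimal PyTorch sketch of this computation, assuming the critic returns per-action Q-values for agent \(a\) as in the illustrative critic above (tensor names are illustrative, not DI-engine’s):

import torch

def counterfactual_advantage(q_values: torch.Tensor,
                             pi_probs: torch.Tensor,
                             action: torch.Tensor) -> torch.Tensor:
    # q_values: (B, n_actions) Q(s, (u^{-a}, u'^a)) from the centralized critic,
    #           one entry per candidate action u'^a, other agents' actions fixed.
    # pi_probs: (B, n_actions) the actor's policy pi^a(u'^a | tau^a).
    # action:   (B,) the action u^a that agent a actually executed (long dtype).
    q_taken = q_values.gather(1, action.unsqueeze(1)).squeeze(1)  # Q(s, u)
    baseline = (pi_probs * q_values).sum(dim=1)  # E_{u'^a ~ pi^a}[Q(s, (u^{-a}, u'^a))]
    return q_taken - baseline  # A^a(s, u)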

Compared with the original single-agent actor-critic algorithm, COMA computes the policy gradient for all agent policies using the above counterfactual advantage:

\[g_{k} = \mathbb{E}_{\pi}[\sum_{a}\nabla_{\theta_{k}} \log \pi^{a}(u^{a}|\tau^{a})A^{a}(s, \textbf{u})]\]
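
In practice this estimator is usually implemented as a loss that weights the log-probabilities of the executed actions by the detached counterfactual advantage. A minimal sketch under the same illustrative naming as above (not DI-engine’s code):

import torch

def coma_policy_loss(log_pi_taken: torch.Tensor, advantage: torch.Tensor) -> torch.Tensor:
    # log_pi_taken: (B, A) log pi^a(u^a | tau^a) of the executed actions, one column per agent.
    # advantage:    (B, A) the counterfactual advantage A^a(s, u) for each agent.
    # The advantage is detached so that actor updates do not backpropagate into
    # the critic, matching the policy-gradient estimator g_k above.
    return -(log_pi_taken * advantage.detach()).sum(dim=1).mean()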

Note

COMA’s counterfactual baseline is inspired by difference rewards, in which each agent learns from a shaped reward that compares the global reward to the reward received when that agent’s action is replaced with a default action. Unlike difference rewards, COMA computes its baseline from the centralized critic by marginalizing over the agent’s own policy, so it requires no default action and no extra environment simulations.

Extensions

  • COMA takes advantage of learning a centralized critic to train decentralized actors. Similarly, Gupta et al. (2017) present a centralized actor-critic algorithm that learns per-agent critics, which opts for better scalability at the cost of diluting the benefits of centralization.

  • MADDPG (Lowe et al. 2017) extends the DDPG framework to multi-agent settings and learns a centralized critic for each agent. Unlike COMA’s on-policy policy gradient learning, MADDPG is off-policy and trains from a replay buffer.

  • COMA-CC (Vasilev et al. 2021) improves COMA by changing its training scheme to use the entire batch of data rather than mini-batches, together with a consistent critic. COMA-CC is an off-policy variant of COMA with an alternative critic. For each counterfactual Q-value computation, the COMA critic requires \(n\) inputs, one for each agent, while the COMA-CC critic requires \(nm\) inputs, one for each agent and counterfactual joint action. To reduce computation, the concatenated observations \((z^1_t, ..., z^n_t)\) are compressed via an encoding network before being used as inputs to the critic.

\[Q(s_{t},\textbf{u}_{t}) \approx Q_{\phi}(s_{t}, z^{1}_{t}, \cdots, z^{n}_{t}, \textbf{u}_{t-1}, \textbf{u}_{t})\]
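
The following PyTorch sketch illustrates only this input construction; the module, its layer sizes and input conventions are assumptions for illustration, not the code of Vasilev et al. (2021):

import torch
import torch.nn as nn

class EncodedCritic(nn.Module):
    """Illustrative COMA-CC-style critic input construction (an assumption)."""

    def __init__(self, state_dim: int, obs_dim: int, n_agents: int, n_actions: int,
                 enc_dim: int = 64, hidden: int = 128):
        super().__init__()
        # Encoding network that compresses the concatenated observations (z^1_t, ..., z^n_t).
        self.encoder = nn.Sequential(nn.Linear(n_agents * obs_dim, enc_dim), nn.ReLU())
        # Critic conditioned on the state, the encoded observations, and the
        # previous and current joint actions, as in the approximation above.
        in_dim = state_dim + enc_dim + 2 * n_agents * n_actions
        self.critic = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, state, obs_all, prev_joint_action, joint_action):
        # obs_all: (B, n_agents, obs_dim); joint actions: (B, n_agents, n_actions), one-hot
        z = self.encoder(obs_all.flatten(start_dim=1))
        x = torch.cat([state, z, prev_joint_action.flatten(start_dim=1),
                       joint_action.flatten(start_dim=1)], dim=1)
        return self.critic(x)  # (B, 1) ~ Q_phi(s_t, z^1_t, ..., z^n_t, u_{t-1}, u_t)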

Implementations

The default config is defined as follows:

class ding.policy.coma.COMAPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]
Overview:

Policy class of the COMA algorithm. COMA is a multi-agent reinforcement learning algorithm.

Interface:
_init_learn, _data_preprocess_learn, _forward_learn, _reset_learn, _state_dict_learn, _load_state_dict_learn

_init_collect, _forward_collect, _reset_collect, _process_transition, _init_eval, _forward_eval, _reset_eval, _get_train_sample, default_model, _monitor_vars_learn

Config:

ID | Symbol                    | Type  | Default Value | Description                                                                                                   | Other (Shape)
1  | type                      | str   | coma          | RL policy register name, refer to registry POLICY_REGISTRY                                                   | this arg is optional, a placeholder
2  | cuda                      | bool  | False         | Whether to use cuda for the network                                                                           | this arg can differ across modes
3  | on_policy                 | bool  | True          | Whether the RL algorithm is on-policy or off-policy                                                           |
4  | priority                  | bool  | False         | Whether to use priority (PER)                                                                                 | priority sample, update priority
5  | priority_IS_weight        | bool  | False         | Whether to use Importance Sampling weight to correct biased updates                                           | IS weight
6  | learn.update_per_collect  | int   | 1             | How many updates (iterations) to train after one collection by the collector; only valid in serial training  | can vary across envs; a bigger value means more off-policy
7  | learn.target_update_theta | float | 0.001         | Target network update momentum parameter                                                                      | between [0, 1]
8  | learn.discount_factor     | float | 0.99          | Reward’s future discount factor, aka. gamma                                                                   | may be 1 in sparse-reward envs
9  | learn.td_lambda           | float | 0.8           | The trade-off factor of TD(lambda), which balances 1-step TD and MC returns                                   |
10 | learn.value_weight        | float | 1.0           | The loss weight of the value network                                                                          | policy network weight is set to 1
11 | learn.entropy_weight      | float | 0.01          | The loss weight of entropy regularization                                                                     | policy network weight is set to 1
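
For reference, the defaults in the table correspond to nested keys of the policy config. A minimal override sketch, assuming the usual DI-engine nested-dict config style (the surrounding experiment config is not shown):

# Hypothetical override of the COMA policy defaults listed above
# (the nested keys mirror the learn.* entries in the table).
coma_policy_config = dict(
    type='coma',
    cuda=False,
    on_policy=True,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=1,
        target_update_theta=0.001,
        discount_factor=0.99,
        td_lambda=0.8,
        value_weight=1.0,
        entropy_weight=0.01,
    ),
)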

The network interface used by COMA is defined as follows:

class ding.model.template.coma.COMA(agent_num: int, obs_shape: Dict, action_shape: int | SequenceType, actor_hidden_size_list: SequenceType)[source]
Overview:

The network of the COMA algorithm, which is a QAC-type actor-critic.

Interface:

__init__, forward

Properties:
  • mode (list): The list of forward mode, including compute_actor and compute_critic

forward(inputs: Dict, mode: str) -> Dict[source]
Overview:

forward computation graph of COMA network

Arguments:
  • inputs (dict): input data dict with keys [‘obs’, ‘prev_state’, ‘action’]

  • agent_state (torch.Tensor): each agent’s local state (obs)

  • global_state (torch.Tensor): the global state (obs)

  • action (torch.Tensor): the masked action

ArgumentsKeys:
  • necessary: obs { agent_state, global_state, action_mask }, action, prev_state

ReturnsKeys:
  • necessary:
    • compute_critic: q_value

    • compute_actor: logit, next_state, action_mask

Shapes:
  • obs (dict): agent_state: \((T, B, A, N, D)\), action_mask: \((T, B, A, N, A)\)

  • prev_state (list): \([[[h, c] for _ in range(A)] for _ in range(B)]\)

  • logit (torch.Tensor): \((T, B, A, N, A)\)

  • next_state (list): \([[[h, c] for _ in range(A)] for _ in range(B)]\)

  • action_mask (torch.Tensor): \((T, B, A, N, A)\)

  • q_value (torch.Tensor): \((T, B, A, N, A)\)

Examples:
>>> agent_num, bs, T = 4, 3, 8
>>> obs_dim, global_obs_dim, action_dim = 32, 32 * 4, 9
>>> coma_model = COMA(
>>>     agent_num=agent_num,
>>>     obs_shape=dict(agent_state=(obs_dim, ), global_state=(global_obs_dim, )),
>>>     action_shape=action_dim,
>>>     actor_hidden_size_list=[128, 64],
>>> )
>>> prev_state = [[None for _ in range(agent_num)] for _ in range(bs)]
>>> data = {
>>>     'obs': {
>>>         'agent_state': torch.randn(T, bs, agent_num, obs_dim),
>>>         'action_mask': None,
>>>     },
>>>     'prev_state': prev_state,
>>> }
>>> output = coma_model(data, mode='compute_actor')
>>> data= {
>>>     'obs': {
>>>         'agent_state': torch.randn(T, bs, agent_num, obs_dim),
>>>         'global_state': torch.randn(T, bs, global_obs_dim),
>>>     },
>>>     'action': torch.randint(0, action_dim, size=(T, bs, agent_num)),
>>> }
>>> output = coma_model(data, mode='compute_critic')

The benchmark results of COMA implemented in DI-engine on SMAC (Samvelyan et al. 2019), a benchmark of StarCraft II micromanagement problems, are shown below.

smac map | best mean reward | evaluation results       | config link   | comparison
MMM      | 1.00             | ../_images/COMA_MMM.png  | config_link_1 | Pymarl (0.1)
3s5z     | 1.00             | ../_images/COMA_3s5z.png | config_link_2 | Pymarl (0.0)

We do not show the performance curve of COMA on the 5m_vs_6m map because COMA does not converge on this map; the original authors’ COMA implementation also fails to converge on it.

References

  • Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson. Counterfactual Multi-Agent Policy Gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

  • Jayesh K. Gupta, Maxim Egorov, Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. International Conference on Autonomous Agents and Multiagent Systems, 2017.

  • Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.

  • Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.

  • Bozhidar Vasilev, Tarun Gupta, Bei Peng, Shimon Whiteson. Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients. arXiv preprint arXiv:2104.13446, 2021.

Other Public Implementations