A2C
^^^^^^^

Overview
---------

A3C (Asynchronous Advantage Actor-Critic) is a simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent to optimize deep neural network controllers. A2C (Advantage Actor-Critic), on the other hand, is the synchronous version of A3C, where the policy gradient algorithm is combined with an advantage function to reduce variance.

Quick Facts
-----------

1. A2C is a **model-free** and **policy-based** RL algorithm.

2. A2C is an **on-policy** algorithm.

3. A2C supports both **discrete** and **continuous** action spaces.

4. A2C can be equipped with a Recurrent Neural Network (RNN).

Key Equations or Key Graphs
----------------------------

A2C uses advantage estimation in the policy gradient. We implement the advantage with Generalized Advantage Estimation (GAE):

.. math::

    \nabla_{\theta} \log \pi\left(a_{t} \mid {s}_{t} ; \theta\right) \hat{A}^{\pi}\left(s_{t}, {a}_{t} ; \phi \right)

where the k-step advantage function is defined as:

.. math::

    \sum_{i=0}^{k-1} \gamma^{i} r_{t+i}+\gamma^{k} \hat{V}^{\pi}\left(s_{t+k} ; \phi\right)-\hat{V}^{\pi}\left(s_{t} ; \phi\right)

Pseudo-code
-----------

.. image:: images/A2C.png
    :align: center
    :height: 150

.. note::
    Different from Q-learning, A2C (and other actor-critic methods) alternates between policy evaluation and policy improvement.

Extensions
-----------

A2C can be combined with:

- Multi-step learning
- RNN
- Generalized Advantage Estimation (GAE)

GAE was proposed in `High-Dimensional Continuous Control Using Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_. It uses an exponentially-weighted average of k-step advantage estimators to trade off the bias and variance of the advantage estimate:

.. math::

    \hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)}:=(1-\lambda)\left(\hat{A}_{t}^{(1)}+\lambda \hat{A}_{t}^{(2)}+\lambda^{2} \hat{A}_{t}^{(3)}+\ldots\right)

where the k-step advantage estimator :math:`\hat{A}_t^{(k)}` is defined as:

.. math::

    \hat{A}_{t}^{(k)}:=\sum_{l=0}^{k-1} \gamma^{l} \delta_{t+l}^{V}=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\cdots+\gamma^{k-1} r_{t+k-1}+\gamma^{k} V\left(s_{t+k}\right)

When k=1, the estimator :math:`\hat{A}_t^{(1)}` is the naive advantage estimator:

.. math::

    \hat{A}_{t}^{(1)}:=\delta_{t}^{V} \quad=-V\left(s_{t}\right)+r_{t}+\gamma V\left(s_{t+1}\right)

When GAE is used, common values of :math:`\lambda` lie in the range [0.8, 1.0].

Implementation
------------------

The default config is defined as follows:

.. autoclass:: ding.policy.a2c.A2CPolicy
    :noindex:

The network interface used by A2C is defined as follows:

.. autoclass:: ding.model.template.vac.VAC
    :members: forward, compute_actor, compute_critic, compute_actor_critic
    :noindex:

The policy gradient and value update of A2C are implemented as follows:

.. code:: python

    from collections import namedtuple

    import torch
    import torch.nn.functional as F

    # namedtuple holding the three loss terms returned below
    a2c_loss = namedtuple('a2c_loss', ['policy_loss', 'value_loss', 'entropy_loss'])

    def a2c_error(data: namedtuple) -> namedtuple:
        logit, action, value, adv, return_, weight = data
        if weight is None:
            weight = torch.ones_like(value)
        # policy loss: log-probability of the taken action, weighted by the advantage
        dist = torch.distributions.categorical.Categorical(logits=logit)
        logp = dist.log_prob(action)
        entropy_loss = (dist.entropy() * weight).mean()
        policy_loss = -(logp * adv * weight).mean()
        # value loss: MSE between the predicted value and the target return
        value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
        return a2c_loss(policy_loss, value_loss, entropy_loss)

.. note::
    We apply GAE to calculate the advantage when updating the actor network, with the default GAE parameter ``gae_lambda=0.95``.
    The target for the value network update is obtained by adding the advantage computed in the collector to the value estimate at the current time step.
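To make the note above concrete, here is a minimal sketch of how a GAE advantage and the corresponding value target can be derived from a short trajectory segment. It is not DI-engine's collector code: the helper ``gae_advantage``, the hyperparameter values, and the toy tensors are hypothetical names chosen for illustration, and episode termination (done flags) is ignored for brevity.

.. code:: python

    import torch


    def gae_advantage(rewards: torch.Tensor, values: torch.Tensor, next_value: torch.Tensor,
                      gamma: float = 0.99, gae_lambda: float = 0.95) -> torch.Tensor:
        """Compute GAE advantages for one trajectory segment (no done-flag handling)."""
        values = torch.cat([values, next_value.view(1)])  # append bootstrap value V(s_T)
        adv = torch.zeros_like(rewards)
        last_gae = torch.zeros(())
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
            last_gae = delta + gamma * gae_lambda * last_gae        # exponentially-weighted sum
            adv[t] = last_gae
        return adv


    rewards = torch.tensor([1.0, 0.0, 1.0])           # r_0, r_1, r_2
    values = torch.tensor([0.5, 0.4, 0.6])            # V(s_0), V(s_1), V(s_2)
    next_value = torch.tensor(0.3)                    # V(s_3), used for bootstrapping
    adv = gae_advantage(rewards, values, next_value)
    return_ = values + adv                            # value-network target: V(s_t) + advantage

The resulting ``adv`` and ``return_`` play the same roles as the ``adv`` and ``return_`` fields consumed by ``a2c_error`` above.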
Benchmark
-----------

+-----------------+--------------------+----------------------+----------------------+----------------+
| environment     | best mean reward   | evaluation results   | config link          | comparison     |
+=================+====================+======================+======================+================+
|                 |                    |                      | `config_link_p `_    |                |
+-----------------+--------------------+----------------------+----------------------+----------------+
|                 |                    |                      | `config_link_q `_    |                |
+-----------------+--------------------+----------------------+----------------------+----------------+
|                 |                    |                      | `config_link_s `_    |                |
+-----------------+--------------------+----------------------+----------------------+----------------+

P.S.:

1. The above results are obtained by running the same configuration on five different random seeds (0, 1, 2, 3, 4).

References
-----------

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu: “Asynchronous Methods for Deep Reinforcement Learning”, 2016, ICML 2016; arXiv:1602.01783. https://arxiv.org/abs/1602.01783

Other Public Implementations
----------------------------

- Baselines_
- `sb3`_
- `rllib (Ray)`_
- tianshou_

.. _Baselines: https://github.com/openai/baselines/tree/master/baselines/a2c
.. _sb3: https://github.com/DLR-RM/stable-baselines3/tree/master/stable_baselines3/a2c
.. _`rllib (Ray)`: https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a2c.py
.. _tianshou: https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/a2c.py