
PyTorch implementation of Advantage Actor-Critic (A2C)
The REINFORCE method usually suffers from high variance in its gradient estimate, while the Actor-Critic method can only obtain a biased gradient estimate.
To combine the strengths of the two methods, A2C introduces a baseline function. Subtracting the baseline from the total return reduces the variance of the gradient estimate.
In practice, the baseline is chosen to be the value function, so the advantage is $A^{\pi}(s, a) = G_t - V(s)$. The final objective function is formulated as:
$$- \frac{1}{N} \sum_{n=1}^{N} \log\left(\pi(a^n|s^n)\right) A^{\pi}(s^n, a^n)$$
Because the baseline depends only on the state (not the action), the gradient estimate also remains unbiased.
Supplementary material explaining why the baseline function can reduce variance: Related Link
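
As a quick sanity check (a standard derivation, not taken from the linked material), the baseline term contributes nothing to the expected policy gradient, which is why subtracting it leaves the estimate unbiased:

```latex
\mathbb{E}_{a \sim \pi(\cdot|s)}\left[\nabla_\theta \log \pi(a|s)\, b(s)\right]
  = b(s) \sum_{a} \pi(a|s)\, \nabla_\theta \log \pi(a|s)
  = b(s) \sum_{a} \nabla_\theta \pi(a|s)
  = b(s)\, \nabla_\theta \sum_{a} \pi(a|s)
  = b(s)\, \nabla_\theta 1
  = 0
```
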
This document mainly includes:
- Implementation of the A2C error function.
- Main (test) function.

Overview
Implementation of A2C (Advantage Actor-Critic) Related Link

Unpack data: $\langle \pi(a|s), a, V(s), A^{\pi}(s, a), G_t, w \rangle$
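
The following is a minimal sketch of how such an input tuple might be packed and unpacked; the container name `a2c_data` and its field names are illustrative assumptions rather than the library's actual definitions.

```python
# Sketch (assumption): a simple container for the A2C inputs listed above.
from collections import namedtuple

import torch

a2c_data = namedtuple('a2c_data', ['logit', 'action', 'value', 'adv', 'return_', 'weight'])

# Illustrative placeholder tensors: batch of 4, 32 discrete actions.
data = a2c_data(
    logit=torch.randn(4, 32),
    action=torch.randint(0, 32, size=(4, )),
    value=torch.randn(4),
    adv=torch.randn(4),
    return_=torch.randn(4),
    weight=None,
)
logit, action, value, adv, return_, weight = data  # unpack the fields
```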

Prepare the weight for the default case.
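
A minimal sketch of the default-weight handling, assuming that a missing weight means every sample counts equally:

```python
# Sketch (assumption): if no per-sample weight is given, fall back to all-ones
# weights so that every sample contributes equally to the averaged losses.
import torch

adv = torch.randn(4)   # any per-sample tensor, used only for its shape here
weight = None          # stands for "no weight provided"

if weight is None:
    weight = torch.ones_like(adv)
```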

Prepare the policy distribution from the logits and get the log-probability of the taken actions.
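
A short sketch of this step, assuming a discrete action space so that the logits parameterize a categorical policy:

```python
# Sketch (assumption): build a categorical policy from the logits and evaluate
# the log-probability of the actions taken in the batch.
import torch

logit = torch.randn(4, 32)                  # (batch, action_dim), placeholder values
action = torch.randint(0, 32, size=(4, ))   # placeholder actions

dist = torch.distributions.Categorical(logits=logit)
logp = dist.log_prob(action)                # shape: (batch,)
```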

Policy loss: $- \frac{1}{N} \sum_{n=1}^{N} \log\left(\pi(a^n|s^n)\right) A^{\pi}(s^n, a^n)$
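
A minimal sketch of the policy loss under the formula above; the advantage is treated as a constant here, so this term only updates the actor:

```python
# Sketch (assumption): negative weighted mean of log pi(a|s) * A(s, a).
import torch

logp = torch.randn(4, requires_grad=True)   # placeholder log-probabilities
adv = torch.randn(4)                        # placeholder advantages (no gradient)
weight = torch.ones(4)

policy_loss = -(logp * adv * weight).mean()
```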

Value loss: $\frac{1}{N} \sum_{n=1}^{N} \left(G_t^n - V(s^n)\right)^2$
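
A minimal sketch of the value loss as a (weighted) mean squared error between the sampled return and the predicted value:

```python
# Sketch (assumption): regress V(s) toward the sampled return G_t.
import torch
import torch.nn.functional as F

value = torch.randn(4, requires_grad=True)   # placeholder value predictions
return_ = torch.randn(4)                     # placeholder returns
weight = torch.ones(4)

value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
```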

Entropy bonus: $- \frac{1}{N} \sum_{n=1}^{N} \sum_{a^n} \pi(a^n|s^n) \log\left(\pi(a^n|s^n)\right)$
P.S. The final loss is `policy_loss + value_weight * value_loss - entropy_weight * entropy_loss`.
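
A minimal sketch of the entropy term and of combining the three losses; `value_weight` and `entropy_weight` below are illustrative names and values, not prescribed by this document:

```python
# Sketch (assumption): the entropy bonus is the (weighted) mean policy entropy;
# it is subtracted from the total loss to encourage exploration.
import torch

logit = torch.randn(4, 32, requires_grad=True)
weight = torch.ones(4)

dist = torch.distributions.Categorical(logits=logit)
entropy_loss = (dist.entropy() * weight).mean()

policy_loss = torch.tensor(0.1)            # placeholders for the terms computed above
value_loss = torch.tensor(0.2)
value_weight, entropy_weight = 0.5, 0.01   # illustrative coefficients

total_loss = policy_loss + value_weight * value_loss - entropy_weight * entropy_loss
```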

Return the concrete loss items.
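
One way to return the concrete loss items is a named tuple; the name `a2c_loss` below is an assumption for illustration:

```python
# Sketch (assumption): return the three loss terms separately so the caller can
# weight and combine them as described above.
from collections import namedtuple

import torch

a2c_loss = namedtuple('a2c_loss', ['policy_loss', 'value_loss', 'entropy_loss'])

loss = a2c_loss(policy_loss=torch.tensor(0.1),
                value_loss=torch.tensor(0.2),
                entropy_loss=torch.tensor(3.0))
```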

Overview
Test function for A2C, covering both the forward and backward passes.

Batch size = 4, action dimension = 32.

Generate logit, action, value, adv, return_.

Compute A2C error.

Assert the loss is differentiable.
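
The following is a minimal, self-contained sketch of such a test, assuming the loss terms are computed as in the steps above (the function name `test_a2c` is illustrative):

```python
# Sketch (assumption): end-to-end check that the combined A2C loss is differentiable.
import torch
import torch.nn.functional as F


def test_a2c(batch_size: int = 4, action_dim: int = 32) -> None:
    # Randomly generate logit, action, value, adv, return_.
    logit = torch.randn(batch_size, action_dim, requires_grad=True)
    action = torch.randint(0, action_dim, size=(batch_size, ))
    value = torch.randn(batch_size, requires_grad=True)
    adv = torch.randn(batch_size)
    return_ = torch.randn(batch_size)
    weight = torch.ones(batch_size)

    # Compute the A2C error terms (mirroring the implementation steps above).
    dist = torch.distributions.Categorical(logits=logit)
    policy_loss = -(dist.log_prob(action) * adv * weight).mean()
    value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
    entropy_loss = (dist.entropy() * weight).mean()
    total_loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_loss

    # Assert the loss is differentiable: backward should populate gradients.
    total_loss.backward()
    assert logit.grad is not None
    assert value.grad is not None


if __name__ == "__main__":
    test_a2c()
```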

If you have any questions or advice about this documentation, you can raise an issue on GitHub (https://github.com/opendilab/PPOxFamily) or email us (opendilab@pjlab.org.cn).