
PyTorch implementation of Policy Gradient (PG)
Policy gradient (also known as REINFORCE) is a classical method for learning a policy.
Each transition $(s_t, a_t)$ is used to compute the corresponding log probability $\log\pi(a_t|s_t)$. This log probability is then back-propagated to compute the gradient, and the gradient is weighted by the accumulated return of the episode.
The final loss function is formulated as:
$$-\frac{1}{N}\sum_{n=1}^{N}\log\pi(a^n|s^n)\,G_t^n$$
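This loss is the negated Monte Carlo estimate of the REINFORCE objective: by the score-function (log-derivative) identity, the gradient of the expected return can be written as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]$$

so performing gradient descent on the loss above is equivalent to gradient ascent on the expected return.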
This document mainly includes:
- Implementation of PG error.
- Main function (test function)

Overview
Implementation of PG (Policy Gradient)

Unpack data: $\langle \pi(a|s), a, G_t \rangle$.

Prepare the policy distribution from the logits and get the log probability.

Policy loss: $-\frac{1}{N}\sum_{n=1}^{N}\log\pi(a^n|s^n)\,G_t^n$

Entropy bonus: $-\frac{1}{N}\sum_{n=1}^{N}\sum_{a}\pi(a|s^n)\log\pi(a|s^n)$
P.S. The final loss is policy_loss - entropy_weight * entropy_loss.

Return the concrete loss items (combined in the sketch below).
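The following is a minimal PyTorch sketch that combines the steps above, assuming a discrete action space and a data tuple of (logit, action, return_). The names pg_error and pg_loss are illustrative and may differ from the actual source code on GitHub.

```python
from collections import namedtuple

import torch

# Named container for the two loss items (illustrative name).
pg_loss = namedtuple('pg_loss', ['policy_loss', 'entropy_loss'])


def pg_error(data):
    # Unpack data: <logit, action, return_>, i.e. <pi(a|s), a, G_t>.
    logit, action, return_ = data
    # Prepare the policy distribution from the logits and get the
    # log probability of the executed actions.
    dist = torch.distributions.Categorical(logits=logit)
    log_prob = dist.log_prob(action)
    # Policy loss: -1/N * sum_n log(pi(a^n|s^n)) * G_t^n.
    policy_loss = -(log_prob * return_).mean()
    # Entropy bonus: mean entropy of the policy distribution over the batch.
    entropy_loss = dist.entropy().mean()
    # Return the concrete loss items; the caller combines them as
    # policy_loss - entropy_weight * entropy_loss.
    return pg_loss(policy_loss, entropy_loss)
```

Returning the two items separately lets the caller choose entropy_weight (and log each term) without modifying the loss function itself.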

Overview
Test function of PG, covering both the forward and backward passes.

Set batch size = 4 and action dimension = 32.

Generate logit, action, return_.

Compute PG error.

Assert the loss is differentiable, i.e., the gradient can flow back to the input logit (see the sketch below).
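A minimal sketch of such a test, reusing the pg_error sketch above; the shapes and the entropy weight are illustrative assumptions.

```python
import torch


def test_pg():
    # batch size = 4, action dimension = 32.
    B, N = 4, 32
    # Generate random logit, action, return_ data.
    logit = torch.randn(B, N).requires_grad_(True)
    action = torch.randint(0, N, size=(B, ))
    return_ = torch.randn(B) * 2
    # Compute PG error (forward pass) and combine the loss items.
    loss = pg_error((logit, action, return_))
    total_loss = loss.policy_loss - 0.01 * loss.entropy_loss
    # Assert the loss is differentiable: after backward, the gradient
    # should flow back to the input logit.
    assert logit.grad is None
    total_loss.backward()
    assert isinstance(logit.grad, torch.Tensor)


if __name__ == "__main__":
    test_pg()
```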

If you have any questions or advice about this documentation, you can raise an issue on GitHub (https://github.com/opendilab/PPOxFamily) or email us (opendilab@pjlab.org.cn).