GAIL

Overview

GAIL (Generative Adversarial Imitation Learning), first proposed in the paper Generative Adversarial Imitation Learning, is a general framework for extracting a policy directly from data, as if it had been obtained by reinforcement learning following inverse reinforcement learning. The authors derived the optimization objective of GAIL from the perspective of occupancy measure. Compared to related methods, GAIL neither suffers from the compounding-error problem of imitation learning, nor needs to expensively learn an intermediate reward function as in inverse reinforcement learning. Like other methods, however, GAIL is still exposed to "the curse of dimensionality", which makes scalability especially valuable in high-dimensional problems.

Quick Facts

  1. GAIL consists of a generator and a discriminator, trained in an adversarial manner.

  2. The generator is optimized for a surrogate reward provided by the discriminator, usually with policy-gradient reinforcement learning methods such as TRPO, due to their sampling-based nature.

  3. The discriminator can be simply optimized by typical gradient descent methods, like Adam, to distinguish expert and generated data.

Key Equations or Key Graphs

The objective function of GAIL's adversarial training is:

\[
\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}[\log D(s, a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s, a))] - \lambda H(\pi)
\]

where \(\pi\) is the generator policy, \(D\) is the discriminator, \(\pi_E\) is the expert policy, and \(H(\pi)\) is the causal entropy of \(\pi\). This is a min-max optimization, solved iteratively in an adversarial manner: during training, \(D\) maximizes the objective, while \(\pi\) counters \(D\) by minimizing it.
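To make the min-max concrete, here is a minimal NumPy sketch, not DI-engine code: the toy (state, action) features and the logistic-regression discriminator are illustrative assumptions. It performs gradient ascent on the discriminator's side of the objective above, then computes the surrogate reward \(-\log D(s, a)\) that the generator would be trained on:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (state, action) features: expert pairs cluster around +1, generated around -1.
expert_sa = rng.normal(loc=1.0, scale=0.5, size=(256, 4))
policy_sa = rng.normal(loc=-1.0, scale=0.5, size=(256, 4))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Logistic-regression discriminator D(s, a) = sigmoid(w . (s, a) + b),
# trained toward D -> 1 on generated data and D -> 0 on expert data.
w, b = np.zeros(4), 0.0
lr = 0.1
for _ in range(200):
    d_pi = sigmoid(policy_sa @ w + b)    # D on generated samples
    d_exp = sigmoid(expert_sa @ w + b)   # D on expert samples
    # Gradient ascent on E_pi[log D] + E_expert[log(1 - D)]
    grad_w = policy_sa.T @ (1 - d_pi) / len(d_pi) - expert_sa.T @ d_exp / len(d_exp)
    grad_b = np.mean(1 - d_pi) - np.mean(d_exp)
    w += lr * grad_w
    b += lr * grad_b

# Surrogate reward for the generator: pi minimizes E_pi[log D],
# so each generated (s, a) pair receives reward -log D(s, a).
reward = -np.log(sigmoid(policy_sa @ w + b) + 1e-8)
```

A pair that fools the discriminator (D close to 0, i.e. "looks expert") receives a high surrogate reward, which is exactly the signal the policy-gradient step consumes.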

Pseudo-Code
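The algorithm figure is not reproduced here; as a sketch, Algorithm 1 of the GAIL paper proceeds roughly as follows:

```text
Input: expert trajectories tau_E ~ pi_E, initial policy and discriminator
       parameters theta_0, w_0
for i = 0, 1, 2, ... do
    Sample trajectories tau_i ~ pi_{theta_i}
    Update the discriminator parameters from w_i to w_{i+1} with the
    gradient of E_{tau_i}[log D_w(s, a)] + E_{tau_E}[log(1 - D_w(s, a))]
    Take a policy step from theta_i to theta_{i+1} using the TRPO rule
    with cost function log D_{w_{i+1}}(s, a), i.e. a KL-constrained step
    on E_{tau_i}[log D_{w_{i+1}}(s, a)] - lambda * H(pi_theta)
end for
```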

Extensions

Implementation

The default config is defined as follows:

class ding.reward_model.gail_irl_model.GailRewardModel(config: EasyDict, device: str, tb_logger: SummaryWriter)[source]
Overview:

The Gail reward model class (https://arxiv.org/abs/1606.03476)

Interface:

estimate, train, load_expert_data, collect_data, clear_data, __init__, state_dict, load_state_dict, learn

Config:

| ID | Symbol                 | Type  | Default Value   | Description                                                   | Other (Shape)                                                              |
|----|------------------------|-------|-----------------|---------------------------------------------------------------|----------------------------------------------------------------------------|
| 1  | type                   | str   | gail            | RL policy register name, refer to registry POLICY_REGISTRY    | This arg is optional, a placeholder                                        |
| 2  | expert_data_path       | str   | expert_data.pkl | Path to the expert dataset                                    | Should be a '.pkl' file                                                    |
| 3  | learning_rate          | float | 0.001           | The step size of gradient descent                             |                                                                            |
| 4  | update_per_collect     | int   | 100             | Number of updates per collect                                 |                                                                            |
| 5  | batch_size             | int   | 64              | Training batch size                                           |                                                                            |
| 6  | input_size             | int   |                 | Size of the input: obs_dim + act_dim                          |                                                                            |
| 7  | target_new_data_count  | int   | 64              | Collect steps per iteration                                   |                                                                            |
| 8  | hidden_size            | int   | 128             | Linear model hidden size                                      |                                                                            |
| 9  | collect_count          | int   | 100000          | Expert dataset size                                           | One entry is a (s, a) tuple                                                |
| 10 | clear_buffer_per_iters | int   | 1               | Clear buffer per fixed iters                                  | Make sure the replay buffer's data count isn't too few (code work in entry) |
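Putting the table together, a config for this reward model might look like the sketch below. This is an illustrative fragment, not copied from a shipped config: the field names and defaults mirror the table above, `input_size=12` is an arbitrary env-specific assumption (obs_dim + act_dim), and the expert-data path is a placeholder that must point at a real '.pkl' file.

```python
# Config sketch mirroring the defaults in the table above.
gail_reward_config = dict(
    type='gail',
    input_size=12,                  # obs_dim + act_dim; no default, env-specific
    hidden_size=128,
    batch_size=64,
    learning_rate=0.001,
    update_per_collect=100,
    target_new_data_count=64,
    collect_count=100000,
    clear_buffer_per_iters=1,
    expert_data_path='expert_data.pkl',  # placeholder path
)
```

In DI-engine this dict would typically be merged into the experiment's EasyDict config under the reward model section before the entry point builds the model.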

Benchmark

| environment                      | best mean reward | evaluation results               | config link   | expert |
|----------------------------------|------------------|----------------------------------|---------------|--------|
| LunarLander (LunarLander-v2)     | 200              | ../_images/lunarlander_gail.png  | config_link_l | DQN    |
| BipedalWalker (BipedalWalker-v3) | 300              | ../_images/bipedalwalker_gail.png | config_link_b | SAC    |
| Hopper (Hopper-v3)               | 3500             | ../_images/hopper_gail.png       | config_link_h | SAC    |

Reference

  • Ho, Jonathan, and Stefano Ermon. Generative adversarial imitation learning. [https://arxiv.org/abs/1606.03476 arXiv:1606.03476], 2016.

  • Song, Jiaming, et al. Multi-agent generative adversarial imitation learning. [https://arxiv.org/abs/1807.09936 arXiv:1807.09936], 2018.

  • Finn, Chelsea, et al. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. [https://arxiv.org/abs/1611.03852 arXiv:1611.03852], 2016.

Other Public Implementations