Imitation Learning¶
Problem Definition and Research Motivation¶
Imitation Learning (IL) broadly refers to a class of learning methods in which an agent extracts knowledge from expert data and then imitates the behavior contained in that data. IL has two main practical requirements: it usually needs a large amount of training data, and the quality of that data generally must be sufficiently high. IL can roughly be divided into three directions: inverse reinforcement learning (IRL), behavioral cloning (BC), and adversarial structured IL. Below we briefly analyze each of these research directions.
Research Directions¶
Behavioral Cloning (BC)¶
BC was first proposed in [1]. It is a supervised learning method that directly establishes a mapping from states to actions by fitting the expert data.
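As a rough illustration, behavioral cloning reduces to ordinary supervised learning over expert state-action pairs. The sketch below is a minimal PyTorch-style example assuming continuous actions; the network architecture, the placeholder tensors `expert_states` and `expert_actions`, and the hyperparameters are illustrative assumptions, not an implementation from the papers cited here.

```python
import torch
import torch.nn as nn

# Minimal behavioral cloning sketch. `expert_states` [N, state_dim] and
# `expert_actions` [N, action_dim] are illustrative placeholders for a
# pre-collected expert dataset.
state_dim, action_dim = 8, 2
policy = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

expert_states = torch.randn(1024, state_dim)    # placeholder expert data
expert_actions = torch.randn(1024, action_dim)  # placeholder expert data

for epoch in range(100):
    pred_actions = policy(expert_states)
    # BC objective: regress the policy's actions onto the expert actions.
    loss = nn.functional.mse_loss(pred_actions, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```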
The biggest advantage of BC is that it is simple and efficient, but once the agent encounters a state it has never seen before, it may make fatal mistakes, a problem known as state distribution drift (covariate shift). To address this problem, DAgger [2] proposed dynamically updating the dataset: the states actually visited by the policy currently being trained are collected, labeled with expert actions, and these new expert state-action pairs are added to the dataset for subsequent policy updates (see the sketch below). In a more recent study, IBC [3] proposed implicit behavioral cloning, whose key idea is that the neural network takes both observations and actions as input and outputs an energy value that is low for expert actions and high for non-expert actions, thereby turning behavioral cloning into an energy-based modeling problem.
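To make the dataset-aggregation idea concrete, the following sketch outlines the DAgger loop under a Gym-like environment interface; `env`, `expert_policy`, and `train_bc` are hypothetical stand-ins for an environment, an expert labeler, and a supervised BC update such as the one above, not the exact procedure of [2].

```python
def dagger(env, expert_policy, train_bc, n_iters=10, horizon=200):
    """DAgger sketch: roll out the current policy, ask the expert to label
    the visited states, aggregate the data, and retrain by supervised learning."""
    dataset = []   # aggregated (state, expert_action) pairs
    policy = None  # fit after the first (expert-driven) iteration
    for it in range(n_iters):
        state = env.reset()
        for t in range(horizon):
            # Act with the current learner policy (expert on the first iteration).
            action = expert_policy(state) if policy is None else policy(state)
            # Regardless of who acted, record the expert's label for this state.
            dataset.append((state, expert_policy(state)))
            state, _, done, _ = env.step(action)  # assumes a Gym-like step()
            if done:
                break
        # Retrain on the aggregated dataset (e.g. with the BC loop above).
        policy = train_bc(dataset)
    return policy
```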
Current research hotspots in BC mainly focus on two aspects: meta-learning and behavioral cloning with demonstrations collected through VR devices.
Inverse Reinforcement Learning (IRL)¶
Inverse reinforcement learning (IRL) is the problem of inferring an agent's reward function from its policy or observed behavior. Specifically, IRL first learns a reward function from expert data, and then uses this reward function for subsequent RL training. In theory, a policy trained this way can outperform the expert demonstrations.
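Schematically, this is a two-stage pipeline, as in the hedged sketch below; `fit_reward`, `rl_train`, and `expert_trajectories` are hypothetical placeholders rather than any concrete algorithm.

```python
def irl_pipeline(expert_trajectories, env, fit_reward, rl_train):
    """Two-stage IRL sketch: (1) infer a reward function that explains the
    expert behavior, (2) run standard RL against that learned reward."""
    # Stage 1: reward learning (algorithm-specific, e.g. maximum-entropy IRL).
    reward_fn = fit_reward(expert_trajectories)
    # Stage 2: ordinary RL training, with the learned reward substituted
    # for the environment reward.
    policy = rl_train(env, reward_fn)
    return policy
```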
Among representative works, Ziebart et al. [4] first proposed maximum entropy IRL, which uses the maximum entropy distribution to better characterize multimodal behavior and enable more efficient optimization. In 2016, Finn et al. [5] proposed a model-based approach to IRL called guided cost learning, which can learn arbitrary nonlinear cost functions, such as neural networks, without meticulous feature engineering, and which formulates an efficient sample-based approximation for MaxEnt IOC. Subsequently, Hester et al. proposed DQfD [6], which requires only a small amount of expert data and significantly accelerates training through pre-training and a specially designed loss function. T-REX [7] proposes a novel reward-learning-from-observation algorithm that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations.
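For reference, maximum entropy IRL models expert trajectories as exponentially more likely the more reward they accumulate, and fits the reward parameters by maximum likelihood; the notation below is a standard formulation of this idea rather than one taken verbatim from [4].

```latex
% Trajectory distribution assumed by maximum entropy IRL:
p_\theta(\tau) = \frac{1}{Z(\theta)} \exp\big(R_\theta(\tau)\big),
\qquad
Z(\theta) = \int \exp\big(R_\theta(\tau)\big)\,\mathrm{d}\tau

% The reward parameters \theta are fit by maximizing the likelihood
% of the expert demonstrations \mathcal{D}:
\max_\theta \;
\mathbb{E}_{\tau \sim \mathcal{D}}\big[R_\theta(\tau)\big] - \log Z(\theta)
```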
Adversarial Structured IL¶
The main goal of adversarial structured IL approaches is to improve the efficiency of IRL. Even if an IRL algorithm learns a very good reward function, a full reinforcement learning stage is still needed to obtain the final near-optimal policy. If a policy can instead be learned directly from the expert data, efficiency can be greatly improved. Based on this idea, GAIL [8] combines generative adversarial networks (GANs) with maximum entropy IRL to learn a near-optimal policy directly from expert demonstrations, without requiring a manually designed reward signal.
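The adversarial setup can be summarized in a short training-loop sketch: a discriminator learns to tell expert state-action pairs from policy state-action pairs, and its output is turned into a reward for the policy update. The PyTorch-style code below is a hedged illustration; the network sizes, the `policy_update` helper, and the surrogate reward `-log(1 - D(s, a))` follow common GAIL practice but are assumptions here, not the exact procedure of [8].

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

# Discriminator D(s, a): trained toward 1 on expert pairs, 0 on policy pairs.
disc = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCELoss()

def gail_step(expert_sa, policy_sa, policy_update):
    """One adversarial step: update the discriminator, then reward the policy
    for fooling it.

    `expert_sa` and `policy_sa` are [N, state_dim + action_dim] tensors of
    concatenated state-action pairs; `policy_update` is a stand-in for any
    on-policy RL update (e.g. TRPO/PPO), as assumed in the lead-in.
    """
    # Discriminator update: expert pairs -> label 1, policy pairs -> label 0.
    d_expert, d_policy = disc(expert_sa), disc(policy_sa)
    d_loss = bce(d_expert, torch.ones_like(d_expert)) + \
             bce(d_policy, torch.zeros_like(d_policy))
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # Policy update: use -log(1 - D(s, a)) as a surrogate reward, so the
    # policy is rewarded for producing expert-like transitions.
    with torch.no_grad():
        rewards = -torch.log(1.0 - disc(policy_sa) + 1e-8)
    policy_update(policy_sa, rewards)
```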
Building on this, many works have proposed improvements to GAIL. For example, InfoGAIL [9] replaced the GAN objective with a WGAN-style objective and achieved better performance. There are also more recent works such as GoalGAIL [10], TRGAIL [11], and DGAIL [12], which combine other techniques such as hindsight relabeling and DDPG to achieve faster convergence and better final performance.
Future Study¶
There are still many challenges in imitation learning, mainly including the following:
Most methods are designed for a single specific task; imitation learning methods that can be applied across multiple tasks are still lacking;
When the demonstration data is suboptimal, it is difficult for the learned policy to surpass the data and reach optimal performance;
Research focuses mainly on imitation from observations, without incorporating multi-modal signals such as speech and natural language;
Methods can usually find a local optimum, but often fail to find the global optimum.