AI’s mastery of complex games like Go and StarCraft has boosted research interest in reinforcement learning (RL), where agents provided only with the game rules engage in self-play to elevate their performance to human level and beyond. But how can reward functions be built for real-world tasks that lack a clearly defined win condition? Enter Adversarial Imitation Learning (AIL), a framework for continuous control that has gained popularity in recent years for solving such complex tasks.
A number of AIL algorithm improvements have been proposed and implemented, such as changing the discriminator’s loss function or switching from on-policy to off-policy agents, with the aim of improving both the performance of learned policies and the sample efficiency of the algorithm. The robustness and reliability of these improved AIL algorithms, however, remain uncertain: their performance-boosting components have rarely been tested in rigorous empirical studies, and the relative importance of high-level algorithmic options versus low-level implementation details is poorly understood.
To tackle these issues, a team from Google Brain recently conducted a comprehensive empirical study of more than fifty choices in a generic AIL framework. They explored the impact of these choices on large-scale (>500k trained agents) continuous-control tasks to provide practical insights and recommendations for designing novel and effective AIL algorithms.
Although the design of RL-style reward functions can be difficult or impossible for many real-world applications, simply demonstrating a correct behaviour for an agent to copy is easy and cheap — suggesting imitation learning may be the key that unlocks the next stage of complex task solving.
In recent years, AIL has become one of the most popular frameworks for imitation learning in continuous control. Drawing inspiration from Inverse RL and Generative Adversarial Networks (GANs), AIL models can learn behaviours similar to those of an expert teacher while also maintaining the ability to freely interact with their environment.
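To make the adversarial setup concrete, here is a minimal sketch of the AIL idea, not the paper’s implementation: a discriminator is trained to tell expert data from agent data, and the agent is rewarded for fooling it. The toy data, the logistic-regression discriminator, and the GAIL-style reward −log(1 − D) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: expert state-action features cluster around +1, agent around -1.
expert = rng.normal(loc=1.0, scale=0.5, size=(256, 2))
agent = rng.normal(loc=-1.0, scale=0.5, size=(256, 2))

# Logistic-regression discriminator D(x) = sigmoid(w.x + b), trained to
# output 1 on expert data and 0 on agent data.
w, b = np.zeros(2), 0.0
lr = 0.1
for _ in range(200):
    x = np.vstack([expert, agent])
    y = np.concatenate([np.ones(len(expert)), np.zeros(len(agent))])
    p = sigmoid(x @ w + b)
    grad_logits = p - y                      # d(BCE)/d(logits)
    w -= lr * x.T @ grad_logits / len(x)
    b -= lr * grad_logits.mean()

def gail_reward(x):
    # GAIL-style reward -log(1 - D(x)): high when the discriminator
    # mistakes a sample for expert data.
    d = sigmoid(x @ w + b)
    return -np.log(1.0 - d + 1e-8)

# The agent's RL step would maximize this reward; expert-like samples
# receive much higher reward than agent-like ones.
print(gail_reward(expert).mean() > gail_reward(agent).mean())  # True
```

In a full AIL loop, the agent’s policy is then updated with any RL algorithm using this learned reward, and the discriminator is retrained as the agent’s behaviour shifts.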
Various options have been proposed to enhance the performance of the original AIL algorithm, but until now no thorough examination of their relative effects in a controlled setting, nor ablation analyses of these choices, had been undertaken. In the paper What Matters for Adversarial Imitation Learning?, the Google Brain team investigates these high- and low-level choices in depth and conducts a comprehensive study of their impact on AIL algorithm performance.
The team summarizes their contributions as:
- Implement a highly configurable generic AIL algorithm with various axes of variation (>50 hyperparameters (HPs)), including 4 different RL algorithms and 7 regularization schemes for the discriminator.
- Conduct a large-scale (>500k trained agents) experimental study on 10 continuous-control tasks.
- Analyze the experimental results to provide practical insights and recommendations for designing novel AIL algorithms and using existing ones.
The researchers focus on continuous-control tasks and run their experiments on five widely used environments from the OpenAI Gym: HalfCheetah-v2, Hopper-v2, Walker2d-v2, Ant-v2 and Humanoid-v2; and three manipulation environments from Adroit: pen-v0, door-v0, and hammer-v0. For each choice, they report the conditional 95th percentile of performance and the distribution of that choice within the top-five-percent of configurations.
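These two analysis metrics can be sketched on made-up sweep data; the choice values and scores below are hypothetical, not the paper’s results.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical HP sweep: each entry is one sampled configuration, with a
# categorical choice (e.g. the discriminator regularizer) and a final score.
choices = rng.choice(["none", "dropout", "spectral_norm"], size=3000)
scores = rng.normal(size=3000) + 0.5 * (choices == "spectral_norm")

# Conditional 95th percentile: for each value of the choice, the 95th
# percentile of scores over all configurations that used that value.
cond_p95 = {c: np.percentile(scores[choices == c], 95)
            for c in np.unique(choices)}

# Distribution of the choice within the overall top-5% of configurations.
cutoff = np.percentile(scores, 95)
top = choices[scores >= cutoff]
top_dist = {c: float(np.mean(top == c)) for c in np.unique(choices)}

print(cond_p95)   # "spectral_norm" has the highest conditional p95 here
print(top_dist)   # and is over-represented among the top-5% runs
```

The first metric shows how well a choice can perform when its other HPs are well tuned; the second shows how often it appears in the best configurations overall.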
The team identifies the key findings from their experiments as:
- What matters for agent training? The adversarial inverse reinforcement learning (AIRL) reward function performs best for synthetic demonstrations, and using an explicit absorbing state is crucial in environments with variable length episodes. Observation normalization also strongly affects performance. Using an off-policy RL algorithm is necessary for good sample complexity while replaying expert data, and pretraining with behaviour cloning (BC) improves the performance only slightly.
- What matters for the discriminator training? MLP discriminators perform on par or better than AIL-specific architectures, and explicit discriminator regularization is only important in more complicated environments. Spectral norm is overall the best regularizer, but standard regularizers from supervised learning can often perform on par. The optimal learning rate for the discriminator may be 2–2.5 orders of magnitude lower than that of an RL agent.
- Are synthetic demonstrations a good proxy for human data? Human demonstrations differ significantly from synthetic demos, and learning from human demonstrations benefits more from discriminator regularization and may work better with different discriminator inputs and reward functions than RL-generated demonstrations.
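Spectral norm, the regularizer the study finds most reliable, constrains the discriminator’s Lipschitz constant by rescaling each weight matrix by its largest singular value. Below is a numpy sketch of the core operation on a single matrix, using power iteration; real implementations such as `torch.nn.utils.spectral_norm` apply this per layer during training.

```python
import numpy as np

rng = np.random.default_rng(2)

def spectral_normalize(W, n_iter=50):
    """Divide W by its largest singular value, estimated by power iteration.

    Spectral normalization bounds how much the layer can amplify its
    input, constraining the discriminator's Lipschitz constant.
    """
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimated top singular value
    return W / sigma

W = rng.normal(size=(64, 32))
W_sn = spectral_normalize(W)

# After normalization, the largest singular value of the weight matrix
# is approximately 1.
print(np.linalg.svd(W_sn, compute_uv=False)[0])  # ~1.0
```

Because the normalized layer cannot amplify inputs by more than a factor of roughly one, the discriminator’s outputs change smoothly with its inputs, which in turn keeps the learned reward signal stable for the RL agent.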
Overall, the team’s in-depth analysis of AIL framework aspects such as discriminator architecture, training and regularization, and choices related to agent training reveals valuable insights into how best to design novel AIL algorithms and use existing ones.
The paper What Matters for Adversarial Imitation Learning? is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang