Deep learning has delivered huge performance leaps in areas such as language modelling, image understanding and speech recognition over the last decade, but some of AI's most dramatic triumphs have come from Reinforcement Learning (RL), whose agents have dispatched human professionals in Go, Dota 2 and StarCraft II. These RL successes depend heavily on hyperparameter choice, and one of the most effective ways to select hyperparameters is to evaluate candidate policies online, as the agent interacts with its environment.
It can however be prohibitively costly, risky or time-consuming to have an arbitrary policy interact with real-world environments, and thus RL agents are often constrained to learn from a fixed batch of previously collected data, with no opportunity for further data collection. In these offline reinforcement learning (ORL) scenarios, hyperparameter selection is an especially important challenge.
Researchers from DeepMind and Google recently conducted a thorough empirical study of hyperparameter selection for offline RL, aiming to identify and develop more reliable and effective approaches.
The researchers’ workflow for applying offline hyperparameter selection can be summarized as follows:
- Train offline RL policies with several different hyperparameter settings.
- For each policy, compute a scalar summary statistic of its expected performance using only the offline data.
- Pick the k policies with the highest summary statistics and execute them in the real environment.
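The workflow above can be sketched in a few lines. Everything here is an illustrative stand-in: `train_policy` would be an offline RL training run and `summary_statistic` an offline value estimate (e.g. an FQE score); neither name comes from the paper.

```python
def train_policy(hyperparams, dataset):
    # Hypothetical stand-in for an offline RL training run.
    return {"hyperparams": hyperparams}

def summary_statistic(policy, dataset):
    # Hypothetical stand-in for an offline value estimate of the policy.
    return sum(policy["hyperparams"].values())

def select_top_k(hyperparam_grid, dataset, k):
    # Train one policy per hyperparameter setting, score each one
    # offline, and keep the k policies with the highest statistics.
    policies = [train_policy(h, dataset) for h in hyperparam_grid]
    ranked = sorted(policies,
                    key=lambda p: summary_statistic(p, dataset),
                    reverse=True)
    return ranked[:k]

grid = [{"lr": lr} for lr in (1e-4, 3e-4, 1e-3)]
best = select_top_k(grid, dataset=None, k=2)
```

Only the k selected policies ever touch the real environment; all training and scoring happens on the fixed batch of data.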
The team used simple and scalable evaluation metrics to assess the different policies:
- Spearman’s rank correlation: The Pearson correlation between the ranks of the policies’ summary statistics and the ranks of their actual values.
- Regret @ k: The difference between the actual value of the best policy overall and the actual value of the best policy among the k policies with the highest summary statistic values.
- Absolute error: The absolute difference between a policy’s summary statistic and its actual value.
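The three metrics can be computed directly from a list of summary statistics and the corresponding actual policy values. A minimal pure-Python sketch (the example numbers are illustrative, not from the paper; the rank helper assumes no ties):

```python
def ranks(xs):
    # Rank each value from 1 (smallest) to n; assumes no ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(stats, actual):
    # Pearson correlation between the two sets of rank values.
    return pearson(ranks(stats), ranks(actual))

def regret_at_k(stats, actual, k):
    # Gap between the best policy overall and the best policy among
    # the k policies with the highest summary-statistic values.
    top_k = sorted(range(len(stats)), key=lambda i: stats[i], reverse=True)[:k]
    return max(actual) - max(actual[i] for i in top_k)

def absolute_error(stats, actual):
    # Mean absolute difference between estimated and actual values.
    return sum(abs(s, ) if False else abs(s - v) for s, v in zip(stats, actual)) / len(stats)

stats = [0.9, 0.2, 0.5]   # offline summary statistics for three policies
actual = [1.0, 0.0, 2.0]  # actual online values (illustrative only)
```

Here `regret_at_k(stats, actual, 1)` is 1.0: the statistic picks the first policy (actual value 1.0) while the best policy is worth 2.0.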
For tasks, they chose challenging domains featuring high-dimensional action spaces, high-dimensional observation spaces and long time horizons:
- DM Control Suite: A set of continuous control tasks implemented in MuJoCo, with relatively low-dimensional action spaces and observations derived from features of the MDP state.
- Manipulation tasks: Tasks requiring continuous control of a Kinova Jaco robotic arm with 9 degrees of freedom.
- DM Locomotion: A set of continuous control tasks which involve controlling a 56-degree-of-freedom humanoid avatar, resulting in a large action space.
The offline RL algorithms include:
- Behaviour Cloning: The policy objective attempts to match the actions from the behaviour data.
- Critic Regularized Regression: The policy objective attempts to match the actions from the behaviour data, while also preferring actions with high value estimates.
- Distributed Distributional Deep Deterministic Policy Gradient (D4PG): The policy objective directly maximizes the critic’s value estimate.
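The three policy objectives can be contrasted on toy data. Everything below — the random batch, the linear critic, the clipped exponentiated-advantage weighting — is an illustrative sketch of the loss shapes, not the paper's implementation:

```python
import math
import random

random.seed(0)

# Toy batch of behaviour actions, policy predictions and critic advantages
# (all scalar and randomly generated purely for illustration).
n = 32
actions    = [random.gauss(0, 1) for _ in range(n)]   # behaviour action a
pred       = [random.gauss(0, 1) for _ in range(n)]   # policy action pi(s)
advantages = [random.gauss(0, 1) for _ in range(n)]   # critic estimate A(s, a)
critic     = lambda a: 2.0 * a                        # toy linear critic Q(s, a)

# Behaviour Cloning: match the behaviour actions.
bc_loss = sum((p - a) ** 2 for p, a in zip(pred, actions)) / n

# Critic Regularized Regression: match the behaviour actions, but
# up-weight transitions the critic judges valuable (clipped exp weights).
crr_loss = sum(math.exp(min(adv, 5.0)) * (p - a) ** 2
               for p, a, adv in zip(pred, actions, advantages)) / n

# D4PG-style objective: directly maximize the critic's value estimate,
# i.e. minimize its negation.
d4pg_loss = -sum(critic(p) for p in pred) / n
```

The key distinction: BC ignores the critic entirely, CRR uses it only to reweight a cloning loss, and the D4PG-style objective follows the critic alone.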
For offline policy evaluation, they chose Fitted Q Evaluation (FQE) for its simplicity and scalability.
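FQE estimates a policy's value by repeatedly regressing Q-values onto Bellman targets computed from the fixed batch alone. A tabular sketch for a tiny discrete MDP follows (the paper uses neural-network function approximation; the two-state batch is invented for illustration):

```python
def fitted_q_evaluation(transitions, policy, n_states, n_actions,
                        gamma=0.9, iterations=200):
    # Tabular sketch of Fitted Q Evaluation: repeatedly fit Q(s, a)
    # to the Bellman target r + gamma * Q(s', policy(s')) using only
    # the fixed batch of transitions (no environment interaction).
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iterations):
        new_q = [row[:] for row in q]
        for s, a, r, s_next in transitions:
            new_q[s][a] = r + gamma * q[s_next][policy(s_next)]
        q = new_q
    return q

# Toy two-state batch of (s, a, r, s') tuples; the evaluated policy
# always picks action 0.
batch = [(0, 0, 1.0, 1), (1, 0, 0.0, 0)]
q = fitted_q_evaluation(batch, lambda s: 0, n_states=2, n_actions=1)
```

For this batch the fixed point solves Q(0,0) = 1 + 0.9·Q(1,0) and Q(1,0) = 0.9·Q(0,0), giving Q(0,0) = 1/0.19 ≈ 5.26; the resulting Q-values serve as the summary statistics in the selection workflow above.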
The team’s experiments demonstrated that offline RL algorithms are not robust to hyperparameter choices, but that selection quality can be improved by carefully choosing the offline RL algorithm, the Q estimator and the summary statistics across tasks. This held even for the challenging DM Locomotion tasks, which require controlling a 56-degree-of-freedom humanoid avatar from the visuals of an egocentric camera.
The paper Hyperparameter Selection for Offline Reinforcement Learning is on arXiv.
Analyst: Hecate He | Editor: Michael Sarazen; Yuan Yuan