Researchers from the University of Tokyo and Google Research have proposed a new metric for reinforcement learning (RL) performance and a novel Behaviour-Regularized Model-ENsemble (BREMEN) algorithm designed to manage the costs and risks of deploying new policies.
Most current RL algorithms require online access to the environment, interleaving the collection of experience under the current policy with updates to that policy. In real-world applications such as healthcare, education, dialogue agents and robotics, however, deploying new data-collection policies carries various costs and risks.
One way to substantially reduce these costs and risks is to learn tasks with a smaller number of data-collection policies. To this end, the researchers propose "deployment efficiency," a criterion that measures the number of distinct data-collection policies used during policy learning.
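The distinction between deployment efficiency and the more familiar sample efficiency can be made concrete with a toy bookkeeping sketch. The `DeploymentTracker` class and the step counts below are illustrative assumptions, not anything from the paper: the point is only that two training runs can consume the same number of environment steps while deploying very different numbers of distinct policies.

```python
class DeploymentTracker:
    """Toy tracker for the two efficiency criteria (illustrative, not from the paper)."""

    def __init__(self):
        self.env_steps = 0    # sample-efficiency cost: total environment interactions
        self.deployments = 0  # deployment-efficiency cost: distinct policies deployed

    def deploy(self, policy_id, steps_collected):
        """Record one deployment of a (possibly new) data-collection policy."""
        self.deployments += 1
        self.env_steps += steps_collected

# A typical online algorithm deploys an updated policy for every small batch:
online = DeploymentTracker()
for update in range(100):
    online.deploy(policy_id=update, steps_collected=1_000)
print(online.deployments, online.env_steps)   # prints: 100 100000

# A deployment-efficient method collects large batches under only a few policies:
efficient = DeploymentTracker()
for update in range(5):
    efficient.deploy(policy_id=update, steps_collected=20_000)
print(efficient.deployments, efficient.env_steps)  # prints: 5 100000
```

Both runs cost 100,000 environment steps, so they are equally sample efficient, but the second uses 20 times fewer deployments, which is the quantity deployment efficiency tracks.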
The researchers explain that simply applying existing model-free offline RL algorithms recursively does not yield a practical deployment-efficient method. The proposed BREMEN model-based offline algorithm is designed to handle the problems that arise in limited-deployment settings and to improve deployment efficiency.
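BREMEN's overall recipe (fit an ensemble of dynamics models to the offline batch, behaviour-clone a policy from the same data, then conservatively improve the policy on imagined rollouts while staying close to the cloned behaviour) can be caricatured in a few lines of numpy. Everything below is an illustrative assumption rather than the paper's implementation: linear least-squares models stand in for the neural-network ensemble, a one-step quadratic cost stands in for rollout returns, and an explicit proximity penalty stands in for BREMEN's behaviour regularization, which in the paper is implicit via BC initialization and trust-region updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline dataset of (state, action, next_state) transitions from one deployment.
S = rng.normal(size=(256, 3))        # states
A = rng.normal(size=(256, 1))        # actions from a random behaviour policy
S_next = S + 0.1 * A                 # assumed toy dynamics (A broadcasts over columns)

# 1) Fit an ensemble of dynamics models on bootstrap resamples of the batch.
def fit_model(idx):
    X = np.hstack([S[idx], A[idx]])  # (n, 4) inputs: state then action
    W, *_ = np.linalg.lstsq(X, S_next[idx], rcond=None)
    return W                         # (4, 3): rows 0-2 state part, row 3 action part

ensemble = [fit_model(rng.integers(0, len(S), len(S))) for _ in range(4)]

# 2) Behaviour-clone a linear policy a = s @ K from the same offline data.
K_bc, *_ = np.linalg.lstsq(S, A, rcond=None)  # (3, 1)

# Imagined one-step cost ||s'||^2 averaged over the ensemble, for a policy K.
def model_cost(K):
    costs = [np.mean(np.sum((S @ (W[:3] + K @ W[3:])) ** 2, axis=1)) for W in ensemble]
    return float(np.mean(costs))

# 3) Improve the policy against the learned models while staying near the cloned
#    policy (the lam penalty is our stand-in for BREMEN's implicit regularization).
def improve(K, lam=0.1, lr=0.05, steps=50):
    K = K.copy()
    for _ in range(steps):
        grads = []
        for W in ensemble:
            M = W[:3] + K @ W[3:]                # effective closed-loop dynamics
            g_M = 2.0 * S.T @ (S @ M) / len(S)   # grad of mean ||s @ M||^2 w.r.t. M
            grads.append(g_M @ W[3:].T)          # chain rule through M = Ws + K @ Wa
        g = np.mean(grads, axis=0) + 2.0 * lam * (K - K_bc)
        K = K - lr * g
    return K

K_new = improve(K_bc)
print(model_cost(K_bc), model_cost(K_new))  # imagined cost drops after improvement
```

All of the improvement happens against the learned ensemble, with no further environment interaction, which is what lets a method of this shape spend so few deployments.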
BREMEN can effectively optimize a policy offline using 10 to 20 times less data than previous approaches. It also shows impressive results in limited deployment settings, obtaining successful policies from scratch in only 5 to 10 deployments. This can not only help alleviate costs and risks in real-world applications but also reduce the communication required in distributed learning, enabling more communication-efficient large-scale RL.
Evaluations show that BREMEN can achieve performance competitive with state-of-the-art model-free offline RL algorithms when using a standard dataset size of 1M, and can also learn appropriately from smaller datasets, which methods such as behaviour cloning (BC) and Behavior Regularized Actor Critic (BRAC) struggle with.
The researchers also compared BREMEN with existing methods on sample efficiency, a popular RL criterion that measures the number of environment interactions incurred during training. The results show that recursive application of BREMEN achieves impressive deployment efficiency while maintaining the same or better sample efficiency.
The research team found that under deployment efficiency constraints, most prior algorithms — model-free or model-based, online or offline — fail to learn successfully. They hope their work will encourage the research community to treat deployment efficiency as an important criterion for RL algorithms.
The goal is to eventually achieve similar sample efficiency and asymptotic performance as state-of-the-art algorithms like Soft Actor-Critic, an off-policy deep RL algorithm proposed by Berkeley researchers in 2018, while maintaining a deployment efficiency well-suited for safe and practical real-world RL.
The paper Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization is on arXiv.
Journalist: Yuan Yuan | Editor: Michael Sarazen