Large Language Models (LLMs) have demonstrated strong language understanding capabilities, raising the possibility of applying them to complex embodied visual tasks. An open question is whether these models can serve as the basis for adaptable, generalizable decision-making policies that transfer to novel scenarios.
In the new paper “Large Language Models as Generalizable Policies for Embodied Tasks,” an Apple research team presents the Large LAnguage model Reinforcement Learning Policy (LLaRP). LLaRP repurposes LLMs for Reinforcement Learning (RL) problems in Embodied Artificial Intelligence (AI), achieving a success rate 1.7 times higher than that of established learned baselines and zero-shot LLM applications.
LLaRP adapts pre-trained LLMs to the multi-modal decision-making settings inherent to embodied tasks. The problem is cast as a Partially-Observable Markov Decision Process (POMDP), in which the policy’s inputs are the task instruction and egocentric RGB frames from the current time step. The instruction is encoded with the LLM’s embeddings and the frames with a vision encoder. The resulting embeddings are fed to a pre-trained LLM, whose hidden outputs are projected to action and value predictions. The entire system learns through online RL, with the action output module and the observation encoder MLP as the only trainable components; all other components remain frozen.
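The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration only, not the authors' implementation: the layer sizes, the names `frozen`, `trainable`, `frozen_llm`, and `policy_step`, and the mean-pooling stand-in for the LLM are all assumptions made for the example. What it aims to show is the split between frozen pre-trained parts and the small trainable adapters.

```python
import math
import random

random.seed(0)

HIDDEN, VISUAL, FRAME, NUM_ACTIONS = 8, 6, 12, 4  # hypothetical toy sizes

def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Matrix-vector product for list-based matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Frozen parts: stand-ins for the pre-trained models, never updated by RL.
frozen = {
    "token_embedding": rand_matrix(16, HIDDEN),  # toy vocabulary of 16 tokens
    "vision_encoder": rand_matrix(VISUAL, FRAME),
    "llm_mixer": rand_matrix(HIDDEN, HIDDEN),
}
# Trainable parts: the observation adapter plus the action and value heads.
trainable = {
    "obs_adapter": rand_matrix(HIDDEN, VISUAL),
    "action_head": rand_matrix(NUM_ACTIONS, HIDDEN),
    "value_head": rand_matrix(1, HIDDEN),
}

def frozen_llm(sequence):
    """Toy stand-in for the frozen LLM: mean-pool the inputs, mix, squash."""
    pooled = [sum(column) / len(sequence) for column in zip(*sequence)]
    return [math.tanh(h) for h in matvec(frozen["llm_mixer"], pooled)]

def policy_step(instruction_tokens, frame):
    """Map an instruction and one frame to action probabilities and a value."""
    text = [frozen["token_embedding"][t] for t in instruction_tokens]
    visual = matvec(frozen["vision_encoder"], frame)   # frozen encoder
    obs = matvec(trainable["obs_adapter"], visual)     # trainable adapter
    hidden = frozen_llm(text + [obs])                  # frozen LLM
    logits = matvec(trainable["action_head"], hidden)  # trainable head
    value = matvec(trainable["value_head"], hidden)[0]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps], value

probs, value = policy_step([3, 7, 1], [0.5] * FRAME)
print(len(probs), round(sum(probs), 6))
```

In this sketch an RL update would touch only the entries of `trainable`, which mirrors the article's statement that the action output module and observation encoder MLP are the only components that learn.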
The research team demonstrates that using a pre-trained and frozen LLM as a Vision-Language Model (VLM) policy with learned input and output adapter layers results in a policy showcasing robust generalization capabilities. This policy is trained using online RL, and its generalization is assessed along two axes: Paraphrastic Robustness (PR) and Behavior Generalization (BG).
LLaRP is evaluated on more than 1,000 unseen tasks spanning the PR and BG axes and achieves a 42% success rate, compared with 25% for alternative LSTM-based policies and 22% for zero-shot LLM applications. LLaRP outperforms all baselines both when given novel instructions and when assigned previously unseen tasks. The researchers also show that the LLM-based LLaRP policy delivers a significant performance boost over a Transformer baseline in a distinct domain, Atari.
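As a quick sanity check, the success rates quoted here can be related to the 1.7 times figure cited earlier. The article does not say which comparison produced that figure; the short calculation below assumes it refers to LLaRP versus the LSTM-based policies, and the helper name `relative_gain` is made up for illustration.

```python
# Success rates on the unseen tasks, as quoted in the article (percent).
success_rate = {"LLaRP": 42.0, "LSTM baseline": 25.0, "zero-shot LLM": 22.0}

def relative_gain(ours, baseline):
    """Ratio of two success rates."""
    return ours / baseline

gain_vs_lstm = relative_gain(success_rate["LLaRP"], success_rate["LSTM baseline"])
gain_vs_zero_shot = relative_gain(success_rate["LLaRP"], success_rate["zero-shot LLM"])

print(f"vs LSTM baseline: {gain_vs_lstm:.2f}x")       # 1.68x
print(f"vs zero-shot LLM: {gain_vs_zero_shot:.2f}x")  # 1.91x
```

Under that assumption, 42% against 25% rounds to the reported 1.7 times, while the margin over the zero-shot LLM applications is somewhat larger.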
The research team further uncovers the benefits of infusing LLM-encoded world knowledge into RL. LLM-based models exhibit superior sample efficiency compared to other conventional architectures in both basic Proximal Policy Optimization (PPO) RL and continual learning settings. Furthermore, LLaRP proves to be more efficient in terms of required supervision when contrasted with commonly used imitation learning techniques.
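For readers unfamiliar with PPO, the clipped surrogate objective at its core can be written as a short function. This is a generic sketch of the standard PPO objective, not the training code used for LLaRP; the function name `ppo_clip_objective` and the clip range of 0.2 (a common default) are assumptions, since the article gives no hyperparameters.

```python
import math

def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions.

    new_logps / old_logps: log-probabilities of the taken actions under the
    current policy and the policy that collected the data.
    advantages: advantage estimates for those actions.
    """
    terms = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# A policy that doubled an action's probability gets credit only up to 1 + eps.
print(round(ppo_clip_objective([math.log(2.0)], [0.0], [1.0]), 6))  # 1.2
```

The objective is maximized with respect to the trainable parameters; the clipping limits how far a single update can move the policy away from the one that gathered the experience.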
To facilitate further study of generalization in Embodied AI, the researchers introduce the Language Rearrangement task. It comprises 150,000 distinct language instructions, each with automatically generated rewards, providing a framework for ongoing research in the field.
In conclusion, this work demonstrates the potential of integrating LLMs into embodied tasks. LLaRP achieves higher success rates than the baselines and improves sample and supervision efficiency, opening up possibilities for future Embodied AI research and development.
Author: Hecate He | Editor: Chain Zhang