
Meet LEO: An Embodied Generalist Agent Excelling in 3D World Tasks

The quest to develop a single, versatile model capable of performing diverse tasks as humans do has been a longstanding pursuit in artificial intelligence and neuroscience. Recent advances in large language models (LLMs) offer a promising route to such generalist models: trained on expansive datasets with scalable Transformer architectures, they have shown immense potential.

However, a significant challenge persists: the limited capacity of these models to comprehend and engage with the three-dimensional environment that encompasses humans and other intelligent entities. This constraint acts as a bottleneck, impeding the successful execution of real-world tasks and the achievement of true general intelligence.

In a new paper An Embodied Generalist Agent in 3D World, a research team from the Beijing Institute for General Artificial Intelligence (BIGAI), Peking University, Carnegie Mellon University and Tsinghua University introduces LEO, an embodied multi-modal and multi-task generalist agent that excels at perception, grounding, reasoning, planning, and action in the 3D world.

The team summarizes their main contributions as follows:

  1. Introduction of LEO, the first generalist agent endowed with the capacity to perceive, ground, reason, plan, and act proficiently in the 3D environment.
  2. Demonstration that a generalist agent can be fashioned through fine-tuning the LLM with object-centric multi-modal representations and integrating training data with embodied action sequences, enabling excellence in embodied tasks.
  3. Curation of an extensive dataset and proposal of techniques to enhance the quality of prompted data from LLMs, crucial for training such an agent.
  4. Extensive evaluation of LEO, showcasing its proficiency in diverse tasks, including embodied navigation and robotic manipulation. Notably, consistent performance gains are observed with the scaling up of training data.
  5. Commitment to advancing research by releasing the data, code, and model weights for the benefit of future work on generalist agents.

LEO undergoes training in two stages, utilizing shared LLM-based model architectures, objectives, and weights: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. LEO’s perceptual abilities stem from an egocentric 2D image encoder for embodied views and an object-centric 3D point cloud encoder for a global, third-person perspective. The output tokens from the 3D encoder, representing observed entities, are interleaved with text tokens to form a scene-grounded instructional task sequence. This sequence serves as input to a decoder-only LLM, framing all tasks as sequence prediction problems. Autoregressive training objectives allow LEO to be trained with task-agnostic inputs and outputs.
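The sequence construction and training objective described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names Token, build_sequence and autoregressive_loss are hypothetical, the encoder outputs are stand-in integer ids, and the ordering of the segments (instruction text, then 2D image tokens, then 3D object tokens, then response) is an assumption made for clarity.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Token:
    kind: str                # "text", "image", or "object"
    value: object            # a word for text; a placeholder id for encoder outputs
    is_target: bool = False  # True only for response tokens that receive loss


def build_sequence(instruction: Sequence[str],
                   image_tokens: Sequence[int],
                   object_tokens: Sequence[int],
                   response: Sequence[str]) -> List[Token]:
    """Assemble one scene-grounded sequence (segment ordering is an assumption)."""
    seq = [Token("text", w) for w in instruction]
    seq += [Token("image", i) for i in image_tokens]    # egocentric 2D view
    seq += [Token("object", o) for o in object_tokens]  # object-centric 3D scene
    seq += [Token("text", w, is_target=True) for w in response]
    return seq


def autoregressive_loss(seq: Sequence[Token],
                        token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood over response positions only.

    token_probs[t] stands for the probability a decoder-only model
    assigns to seq[t] given everything before it.
    """
    losses = [-math.log(p) for tok, p in zip(seq, token_probs) if tok.is_target]
    if not losses:
        raise ValueError("sequence has no target tokens")
    return sum(losses) / len(losses)


# Hypothetical example: a navigation-style instruction with an action-like response.
seq = build_sequence(
    instruction=["find", "the", "chair"],
    image_tokens=[0, 1],
    object_tokens=[0, 1, 2],
    response=["turn", "left"],
)
loss = autoregressive_loss(seq, [0.5] * len(seq))
```

Because captions, answers and actions all become response tokens in the same sequence format, a single next-token objective like this can cover every task, which is the sense in which the inputs and outputs are task-agnostic.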

The team conducts a comprehensive empirical study, quantitatively evaluating and ablating LEO on diverse 3D tasks, including object-level and scene-level captioning, 3D question answering, and robotic manipulation. Results indicate that LEO achieves state-of-the-art performance on most tasks. Task-agnostic instruction tuning, enabled by a unified model, surpasses previous task-specific models across various domains. Moreover, pretraining for 3D vision-language alignment significantly improves the subsequent vision-language-action (VLA) instruction tuning. The study also highlights the positive impact of scaling up training data on the generalist agent’s performance.

In conclusion, LEO stands as a pioneering embodiment of a generalist agent, showcasing remarkable capabilities in navigating and interacting within the 3D world. The insights and methodologies introduced by the research team open new avenues for the development of artificial intelligence with enhanced perceptual and action-oriented competencies.

The paper An Embodied Generalist Agent in 3D World is on arXiv.


Author: Hecate He | Editor: Chain Zhang

