
3D-LLM: Integrate 3D World Into Language Models

In a new paper, 3D-LLM: Injecting the 3D World into Large Language Models, a research team injects the 3D world into large language models and presents 3D-LLMs, a new family of models that can capture 3D spatial information to perform 3D-related tasks.

In recent years, the successes of Large Language Models (LLMs) and Vision-Language Models (VLMs) have been noteworthy. However, these powerful models have primarily operated within the confines of the 2D domain, lacking the capacity to handle real-world 3D-related tasks that demand processing of more complex concepts like spatial relationships, affordances, physics, and layout.

To address this shortcoming, a research team from the University of California, Los Angeles; Shanghai Jiao Tong University; South China University of Technology; University of Illinois Urbana-Champaign; Massachusetts Institute of Technology; UMass Amherst; and MIT-IBM Watson AI Lab introduces 3D-based Large Language Models (3D-LLMs) in the new paper 3D-LLM: Injecting the 3D World into Large Language Models. These models integrate the 3D world into large language models and are designed to capture 3D spatial information, enabling them to perform 3D-related tasks effectively.

The team summarizes their main contributions as follows:

  1. We introduce a new family of 3D-based Large Language models (3D-LLMs) that can take 3D points with features and language prompts as input, and perform a variety of 3D-related tasks.
  2. We devise novel data collection pipelines that could generate large-scale 3D-language data.
  3. We use a 3D feature extractor that extracts meaningful 3D features from rendered multi-view images.
  4. We introduce a 3D localization mechanism for training the 3D-LLMs to better capture 3D spatial information.
  5. Experiments on the held-out evaluation dataset, ScanQA, show that our model outperforms state-of-the-art baselines.
  6. We plan to release our 3D-LLMs, the 3D-language dataset, and language-aligned 3D features of the dataset for future research development.

The team starts by addressing the two main challenges of training 3D-language models: the scarcity of 3D data and the difficulty of obtaining meaningful 3D features that align with language features. For the former, they introduce data generation pipelines that produce large-scale pairs of 3D data and language; for the latter, they construct 3D features from 2D multi-view images, using a 3D feature extractor to obtain 3D features from the pretrained 2D features.
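The idea of building 3D features from 2D multi-view images can be illustrated with a small sketch. This is not the authors' implementation: it assumes a pinhole camera, translation-only camera poses, and tiny hand-made feature maps, and the names `project` and `lift_features` are hypothetical. Each 3D point is projected into every view, the per-pixel feature is looked up, and the features from all views that see the point are averaged.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Feature = List[float]
FeatureMap = List[List[Feature]]  # feature_map[row][col] is one feature vector


def project(point_cam: Vec3, focal: float, cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Project a point given in camera coordinates to an integer (row, col) pixel."""
    x, y, z = point_cam
    if z <= 0:  # point is behind the camera, so it is not visible
        return None
    col = int(round(focal * x / z + cx))
    row = int(round(focal * y / z + cy))
    return row, col


def lift_features(
    points: Sequence[Vec3],
    views: Sequence[Tuple[Vec3, FeatureMap]],
    focal: float,
    cx: float,
    cy: float,
) -> List[Optional[Feature]]:
    """Average per-pixel 2D features over all views in which each 3D point is visible.

    Assumption for illustration: each view is (camera_position, feature_map) and
    the camera has no rotation, so camera coordinates are world minus position.
    """
    lifted: List[Optional[Feature]] = []
    for p in points:
        total: Optional[Feature] = None
        count = 0
        for cam_pos, fmap in views:
            p_cam = (p[0] - cam_pos[0], p[1] - cam_pos[1], p[2] - cam_pos[2])
            pixel = project(p_cam, focal, cx, cy)
            if pixel is None:
                continue
            row, col = pixel
            if not (0 <= row < len(fmap) and 0 <= col < len(fmap[0])):
                continue  # projects outside this view's image
            feat = fmap[row][col]
            if total is None:
                total = [0.0] * len(feat)
            total = [t + f for t, f in zip(total, feat)]
            count += 1
        # Points that no view sees get no feature.
        lifted.append([t / count for t in total] if total is not None else None)
    return lifted
```

A real pipeline would use full camera matrices, depth-based visibility checks, and features from a pretrained 2D encoder; the averaging step here only stands in for the general notion of mapping 2D features onto 3D points.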

Specifically, the 3D-LLM training procedure has three steps. First, the researchers use three methods (direct reconstruction, feature fusion, and neural fields) to construct 3D features from rendered image features. Next, they use 2D VLMs as backbones and input the aligned 3D features to train 3D-LLMs from scratch on the constructed 3D-language dataset. Finally, they propose a 3D localization mechanism that helps 3D-LLMs better capture spatial information; it augments the 3D features with position embeddings and the LLM vocabularies with location tokens.
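To make the notion of location tokens more concrete, here is a minimal sketch of one way continuous 3D box coordinates could be turned into discrete vocabulary entries and back. The uniform binning, the `<locK>` token format, the bin count, and both function names are assumptions chosen for illustration; the article does not specify how the paper encodes locations.

```python
from typing import List, Sequence


def box_to_location_tokens(
    box: Sequence[float],
    scene_min: Sequence[float],
    scene_max: Sequence[float],
    num_bins: int = 256,
) -> List[str]:
    """Discretize an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax) into tokens."""
    tokens = []
    for i, value in enumerate(box):
        axis = i % 3  # entries 0..2 and 3..5 both map to the x, y, z axes
        lo, hi = scene_min[axis], scene_max[axis]
        frac = (value - lo) / (hi - lo)
        frac = min(max(frac, 0.0), 1.0)  # clamp coordinates outside the scene bounds
        idx = min(int(frac * num_bins), num_bins - 1)
        tokens.append(f"<loc{idx}>")
    return tokens


def location_tokens_to_box(
    tokens: Sequence[str],
    scene_min: Sequence[float],
    scene_max: Sequence[float],
    num_bins: int = 256,
) -> List[float]:
    """Map tokens back to coordinates, using the center of each bin."""
    box = []
    for i, tok in enumerate(tokens):
        axis = i % 3
        idx = int(tok[4:-1])  # strip the leading "<loc" and trailing ">"
        lo, hi = scene_min[axis], scene_max[axis]
        box.append(lo + (idx + 0.5) / num_bins * (hi - lo))
    return box
```

Under this scheme the language model's vocabulary would gain `num_bins` extra entries, and a box is emitted as six tokens. The round trip loses at most half a bin width per coordinate, so the bin count trades vocabulary size against localization precision.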

In their empirical study, the team compared 3D-LLMs with baseline models including ScanQA, ScanRefer+MCAN, VoteNet+MCAN, LLaVA, flamingo-SingleImage, flamingo-MultiView, BLIP2-flant5-SingleImage and BLIP2-flant5-MultiView. The results show that the proposed model surpasses all of these baselines on most evaluation metrics.

The paper 3D-LLM: Injecting the 3D World into Large Language Models is on arXiv.


Author: Hecate He | Editor: Chain Zhang


