In recent years, the successes of Large Language Models (LLMs) and Vision-Language Models (VLMs) have been noteworthy. However, these powerful models have primarily operated within the confines of the 2D domain, lacking the capacity to handle real-world 3D-related tasks that demand processing of more complex concepts like spatial relationships, affordances, physics, and layout.
Addressing this shortcoming, in their new paper, 3D-LLM: Injecting the 3D World into Large Language Models, a research team from the University of California, Los Angeles; Shanghai Jiao Tong University; South China University of Technology; University of Illinois Urbana-Champaign; Massachusetts Institute of Technology; UMass Amherst; and MIT-IBM Watson AI Lab introduces 3D-based Large Language Models (3D-LLMs). These novel models integrate the 3D world into large language models and are designed to capture 3D spatial information, enabling them to perform 3D-related tasks effectively.
The team summarizes their main contributions as follows:
- We introduce a new family of 3D-based Large Language models (3D-LLMs) that can take 3D points with features and language prompts as input, and perform a variety of 3D-related tasks.
- We devise novel data collection pipelines that could generate large-scale 3D-language data.
- We use a 3D feature extractor that extracts meaningful 3D features from rendered multi-view images.
- We introduce a 3D localization mechanism for training the 3D-LLMs to better capture 3D spatial information.
- Experiments on the held-out evaluation dataset, ScanQA, show that our model outperforms state-of-the-art baselines.
- We plan to release our 3D-LLMs, the 3D-language dataset, and language-aligned 3D features of the dataset for future research development.
The team starts by addressing the two main challenges of training a 3D-language model: the scarcity of 3D data and the difficulty of obtaining meaningful 3D features to align with language features. For the former, they introduce data generation pipelines that produce large-scale paired 3D-language data; for the latter, they construct 3D features from 2D multi-view images, using a 3D feature extractor to obtain 3D features from the pretrained 2D features.
Specifically, in the 3D-LLM training procedure, the researchers first leverage one of three methods (direct reconstruction, feature fusion, or neural fields) to construct 3D features from rendered image features. Next, they use 2D VLMs as backbones and input the aligned 3D features to train 3D-LLMs from scratch on the constructed 3D-language dataset. Finally, they propose a 3D localization mechanism that helps the 3D-LLMs better capture spatial information, which includes augmenting the 3D features with position embeddings and the LLM vocabulary with location tokens.
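The two parts of the localization mechanism can be sketched as follows: sinusoidal position embeddings concatenated onto the 3D point features, and bounding-box coordinates discretized into special location tokens added to the vocabulary. This is a hypothetical illustration; the function names, frequency scheme, and bin count are assumptions, not the paper's exact design.

```python
import numpy as np

def add_position_embeddings(feats, xyz, num_freqs=4):
    """Concatenate sinusoidal embeddings of point coordinates to 3D features.

    feats: (N, C) point features; xyz: (N, 3) coordinates normalized to [0, 1].
    Returns (N, C + 3 * 2 * num_freqs) position-augmented features.
    """
    freqs = 2.0 ** np.arange(num_freqs) * np.pi       # geometric frequency ladder
    angles = xyz[:, :, None] * freqs                  # (N, 3, num_freqs)
    pe = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return np.concatenate([feats, pe.reshape(len(xyz), -1)], axis=1)

def location_tokens(bbox, num_bins=256):
    """Discretize a 3D box (x1, y1, z1, x2, y2, z2), normalized to [0, 1],
    into vocabulary-style location tokens such as '<loc127>'."""
    ids = np.clip((np.asarray(bbox) * num_bins).astype(int), 0, num_bins - 1)
    return [f"<loc{i}>" for i in ids]
```

With location tokens in the vocabulary, the model can emit 3D box coordinates as ordinary text, so grounding becomes part of language generation.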
In their empirical study, the team compared 3D-LLMs with baseline models, including ScanQA, ScanRefer+MCAN, VoteNet+MCAN, LLaVA, flamingo-SingleImage, flamingo-MultiView, BLIP2-flant5-SingleImage, and BLIP2-flant5-MultiView. The results show that the proposed model surpasses all baseline models on most evaluation metrics.
The paper 3D-LLM: Injecting the 3D World into Large Language Models is available on arXiv.
Author: Hecate He | Editor: Chain Zhang