The idea of instantly generating a 3D representation of any object from a single image is captivating. Such a capability could significantly advance applications in industrial design, animation, gaming, Augmented Reality (AR) and Virtual Reality (VR). Moreover, the remarkable achievements in natural language processing and image processing have inspired researchers to explore learning a general-purpose 3D prior for reconstructing objects from single images.
In a new paper, LRM: Large Reconstruction Model for Single Image to 3D, a research team from Adobe Research and Australian National University introduces the Large Reconstruction Model (LRM), which predicts a 3D model of an object from a single input image in just 5 seconds.
LRM adopts a transformer-based encoder-decoder architecture that learns 3D object representations from a single image in a data-driven fashion. The model takes an image as input and regresses a Neural Radiance Field (NeRF) in the form of a triplane representation. To achieve this, LRM uses the pre-trained vision transformer DINO (Caron et al., 2021) as the image encoder to generate image features. It then learns an image-to-triplane transformer decoder that projects the 2D image features onto the 3D triplane through cross-attention and models the relationships among the spatially structured triplane tokens through self-attention.
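The cross-attention and self-attention steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the token counts and the feature width are hypothetical, the image features are random stand-ins for DINO outputs, and the learned projections, multiple heads, normalization and feed-forward blocks of a real transformer layer are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T * scale, axis=-1)
    return weights @ values, weights

def decoder_layer(triplane_tokens, image_features):
    # Cross-attention: each triplane token queries the 2D image features.
    cross, _ = attention(triplane_tokens, image_features, image_features)
    x = triplane_tokens + cross          # residual connection
    # Self-attention: triplane tokens attend to one another.
    self_out, _ = attention(x, x, x)
    return x + self_out                  # residual connection

rng = np.random.default_rng(0)
num_patches, dim, res = 16, 8, 4         # hypothetical sizes
image_features = rng.normal(size=(num_patches, dim))     # stand-in for encoder output
triplane_tokens = rng.normal(size=(3 * res * res, dim))  # stand-in for learned tokens

out = decoder_layer(triplane_tokens, image_features)
planes = out.reshape(3, res, res, dim)   # three feature planes of res x res tokens
```

The final reshape mirrors the step in which the decoder's output tokens are arranged back into three spatial planes; the upsampling that follows in LRM is left out of this sketch.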
The output tokens from the decoder are then reshaped and upsampled to create the final triplane feature maps. To render an image from any viewpoint, LRM queries the triplane feature at each 3D point and decodes it with an additional shared multi-layer perceptron (MLP) into color and density, which are then composited through volume rendering.
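As a rough illustration of this rendering path, the sketch below samples three feature planes at points along one ray, decodes the features with a tiny shared MLP, and composites the result with the standard NeRF volume-rendering weights. Everything specific here is an assumption made for the example: the helper names, the plane ordering (XY, XZ, YZ), the choice to concatenate the three plane features, the random MLP weights and the sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_plane(plane, u, v):
    """Bilinearly sample an (H, W, C) feature map at coordinates u, v in [-1, 1]."""
    h, w, _ = plane.shape
    x = (u + 1) * 0.5 * (w - 1)
    y = (v + 1) * 0.5 * (h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    fx = (x - x0)[:, None]
    fy = (y - y0)[:, None]
    top = plane[y0, x0] * (1 - fx) + plane[y0, x0 + 1] * fx
    bottom = plane[y0 + 1, x0] * (1 - fx) + plane[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def query_triplane(planes, points):
    """Project (N, 3) points onto three planes and concatenate the sampled features."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    feats = [sample_plane(planes[0], x, y),   # XY plane (assumed ordering)
             sample_plane(planes[1], x, z),   # XZ plane
             sample_plane(planes[2], y, z)]   # YZ plane
    return np.concatenate(feats, axis=-1)

def tiny_mlp(features, w1, w2):
    """Shared MLP: per-point features -> RGB in (0, 1) and non-negative density."""
    hidden = np.maximum(features @ w1, 0.0)   # ReLU
    out = hidden @ w2
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))   # sigmoid
    sigma = np.log1p(np.exp(out[:, 3]))       # softplus
    return rgb, sigma

def render_ray(rgb, sigma, deltas):
    """Composite samples along one ray with standard volume-rendering weights."""
    alpha = 1.0 - np.exp(-sigma * deltas)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = transmittance * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights

rng = np.random.default_rng(0)
channels, res, n_samples = 4, 8, 32                  # hypothetical sizes
planes = rng.normal(size=(3, res, res, channels))    # stand-in for decoder output
w1 = rng.normal(size=(3 * channels, 16))             # random, untrained weights
w2 = rng.normal(size=(16, 4))

# One ray travelling along +z through the unit cube [-1, 1]^3.
t = np.linspace(0.0, 2.0, n_samples)
points = np.array([0.0, 0.0, -1.0]) + t[:, None] * np.array([0.0, 0.0, 1.0])
deltas = np.full(n_samples, t[1] - t[0])

rgb, sigma = tiny_mlp(query_triplane(planes, points), w1, w2)
pixel, weights = render_ray(rgb, sigma, deltas)
```

Because the same MLP is applied to every point, the per-object information lives entirely in the triplane features, which is what allows one network to render any object the decoder produces.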
What sets LRM apart is its highly scalable and efficient design. Beyond its fully transformer-based pipeline, the triplane NeRF it employs is a compact and scalable 3D representation: it is computationally cheaper than alternatives such as volumes and point clouds, and it offers better locality with respect to the input image.
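A quick count of stored feature values hints at why a triplane is cheaper than a dense volume. The sketch below compares three 2D planes with a voxel grid at the same resolution and channel width; the function names and the sizes are hypothetical, chosen only for illustration and not taken from the paper.

```python
def triplane_values(resolution, channels):
    # Three 2D feature planes of size resolution x resolution.
    return 3 * resolution * resolution * channels

def voxel_grid_values(resolution, channels):
    # One dense 3D grid of size resolution^3.
    return resolution ** 3 * channels

resolution, channels = 64, 32   # hypothetical sizes
tri = triplane_values(resolution, channels)
vox = voxel_grid_values(resolution, channels)
ratio = vox / tri               # equals resolution / 3
```

The gap grows linearly with resolution, since the dense grid scales cubically while the triplane scales quadratically.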
Another notable aspect of LRM is its training process, which simply minimizes the difference between rendered images and ground-truth images at novel views. No excessive 3D-aware regularization or delicate hyper-parameter tuning is needed, which makes training efficient and lets the model adapt to a wide range of multi-view image datasets.
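A minimal sketch of such an image-reconstruction objective is shown below. The article does not say how the difference is measured, so the mean squared error used here is an assumption, as are the function name and the array shapes; the rendered views are simulated with perturbed copies of the targets rather than produced by a real renderer.

```python
import numpy as np

def reconstruction_loss(rendered_views, target_views):
    """Mean squared error between rendered and ground-truth views.

    Both inputs have shape (num_views, height, width, 3) with values in [0, 1].
    """
    rendered = np.asarray(rendered_views, dtype=np.float64)
    target = np.asarray(target_views, dtype=np.float64)
    if rendered.shape != target.shape:
        raise ValueError("rendered and target views must have the same shape")
    per_view = ((rendered - target) ** 2).mean(axis=(1, 2, 3))
    return per_view.mean()   # average over the supervised views

rng = np.random.default_rng(0)
target_views = rng.uniform(size=(4, 16, 16, 3))          # stand-in ground truth
perfect = reconstruction_loss(target_views, target_views)
shifted = reconstruction_loss(np.clip(target_views + 0.1, 0.0, 1.0), target_views)
```

Since the supervision comes only from 2D images at known viewpoints, any dataset that provides multi-view images of an object can in principle feed an objective of this shape.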
Empirical results show that LRM reconstructs with high fidelity across a variety of inputs, including real-world photographs, synthetic creations, and rendered images featuring diverse subjects with distinct textures. In comparisons with One-2-3-45, it stands out as a state-of-the-art solution for single-image-to-3D reconstruction.
In summary, this work demonstrates the potential of LRM to swiftly predict a 3D model of an object from a single, arbitrary image found in the wild, opening up a broad array of real-world applications that can benefit from rapid and accurate 3D reconstruction.
Author: Hecate He | Editor: Chain Zhang