
Adobe’s DMV3D Achieves SOTA Performance for High-Fidelity 3D Object Generation Within Seconds

A research team proposes an innovative single-stage category-agnostic diffusion model that generates 3D Neural Radiance Fields (NeRFs) from either text or a single-image input condition through direct model inference, enabling the creation of diverse high-fidelity 3D objects in just 30 seconds per asset.

The recent surge in the popularity of 3D diffusion models is transforming the landscape of 3D asset generation, particularly in applications such as Augmented Reality (AR), Virtual Reality (VR), robotics, and gaming. These models excel in simplifying the complex 3D asset creation process, significantly reducing the manual workload involved.

However, a common limitation of these models is their reliance on ground-truth 3D models or point clouds for training, which are difficult to obtain for real images. Additionally, the latent 3D diffusion approach often yields a latent space that is intricate and hard to denoise on highly diverse, category-free 3D datasets, posing a hurdle to high-quality rendering.

In a new paper DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model, a research team from Adobe Research, Stanford University, HKU, TTIC and HKUST proposes DMV3D, an innovative single-stage category-agnostic diffusion model. This model can generate 3D Neural Radiance Fields (NeRFs) from either text or a single-image input condition through direct model inference, enabling the creation of diverse high-fidelity 3D objects in just 30 seconds per asset.

The team summarizes their main contributions as follows:

  1. A pioneering single-stage diffusion framework that employs a multi-view 2D image diffusion model to achieve 3D generation.
  2. A Large Reconstruction Model (LRM)-based multi-view denoiser capable of reconstructing noise-free triplane NeRFs from noisy multi-view images.
  3. A general probabilistic approach for high-quality text-to-3D generation and single-image reconstruction that utilizes fast direct model inference (approximately 30 seconds on a single A100 GPU).

The primary objective of this research is to realize rapid, realistic, and generic 3D generation. The proposed DMV3D integrates 3D NeRF reconstruction and rendering into its denoiser, creating a 2D multi-view image diffusion model trained without direct 3D supervision. This approach avoids the need for separately training 3D NeRF encoders for latent-space diffusion and eliminates the laborious per-asset optimization process.
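To make this concrete, here is a minimal sketch of what one such reconstruct-and-render denoising step could look like in PyTorch. The names (`denoise_step`, `reconstructor`, `renderer`) and the deterministic DDIM-style update are illustrative assumptions for exposition, not the authors' released code:

```python
import torch

# Hypothetical sketch of one DMV3D-style denoising step (illustrative names,
# not the authors' API). Instead of predicting noise directly, the denoiser
# reconstructs a clean 3D representation from the noisy multi-view images and
# re-renders it, so the denoised views are 3D-consistent by construction.
def denoise_step(noisy_views, cameras, t, reconstructor, renderer, alphas_cumprod):
    # Reconstruct a noise-free triplane NeRF from the noisy views (LRM-style),
    # conditioned on the integer timestep t so the model knows the noise level.
    triplane = reconstructor(noisy_views, cameras, t)
    clean_views = renderer(triplane, cameras)  # predicted x_0 via volume rendering

    # Deterministic DDIM-style update (eta = 0) using the predicted x_0.
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    eps = (noisy_views - a_t.sqrt() * clean_views) / (1.0 - a_t).sqrt()
    x_prev = a_prev.sqrt() * clean_views + (1.0 - a_prev).sqrt() * eps
    return x_prev, triplane  # the final step's triplane is the generated 3D asset
```

Because each denoised view is a rendering of a single underlying NeRF, multi-view consistency holds at every sampling step, and the triplane from the final step is itself the generated asset.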

Essentially, DMV3D incorporates a 3D reconstruction model as the 2D multi-view denoiser within a multi-view diffusion framework. The team strategically uses a sparse set of four multi-view images surrounding an object, which together describe it without significant self-occlusions.
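As a rough illustration of such a sparse four-view setup, the following snippet places four cameras at 90-degree azimuth intervals around the object; the radius and elevation values are placeholder choices, not the paper's exact configuration:

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world pose looking from `eye` toward `target` (-z forward)."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, true_up, -forward, eye
    return c2w

def four_view_cameras(radius=2.0, elevation_deg=20.0):
    """Four cameras at 90-degree azimuth steps: a sparse ring covering the object."""
    el = np.deg2rad(elevation_deg)
    eyes = [radius * np.array([np.cos(el) * np.cos(az),
                               np.cos(el) * np.sin(az),
                               np.sin(el)])
            for az in np.deg2rad([0.0, 90.0, 180.0, 270.0])]
    return [look_at(eye) for eye in eyes]
```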

Leveraging large transformer models, the researchers address the challenging task of sparse-view 3D reconstruction. Building upon the recent 3D Large Reconstruction Model (LRM), they introduce a novel model for joint reconstruction and denoising that can handle the various noise levels arising in the diffusion process and can be seamlessly integrated as the multi-view image denoiser in a multi-view image diffusion framework.
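A bare-bones PyTorch skeleton of what such a joint reconstruction-and-denoising transformer might look like is sketched below. The tokenization, dimensions, and time conditioning are simplified assumptions, and the camera-pose conditioning used by the LRM family is omitted for brevity:

```python
import torch
import torch.nn as nn

class MultiViewDenoiser(nn.Module):
    """Illustrative LRM-style denoiser (hypothetical, simplified): noisy view
    tokens plus a noise-level embedding are mapped to triplane NeRF features."""
    def __init__(self, dim=512, depth=12, heads=8, patch=16, plane_res=32, plane_ch=32):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # ViT-style
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.triplane_tokens = nn.Parameter(torch.randn(3 * plane_res**2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.to_plane = nn.Linear(dim, plane_ch)
        self.plane_res, self.plane_ch = plane_res, plane_ch

    def forward(self, views, t):
        # views: (B, V, 3, H, W) noisy multi-view images; t: (B,) timesteps.
        B, V = views.shape[:2]
        tok = self.patchify(views.flatten(0, 1)).flatten(2).transpose(1, 2)
        tok = tok.reshape(B, -1, tok.shape[-1])                   # (B, V*N, dim)
        cond = self.time_embed(t.float().view(B, 1, 1) / 1000.0)  # noise-level embedding
        query = self.triplane_tokens.expand(B, -1, -1)
        out = self.backbone(torch.cat([query, tok + cond], dim=1))
        planes = self.to_plane(out[:, : query.shape[1]])
        # (B, 3, res, res, ch): decoded into a NeRF and volume-rendered downstream.
        return planes.reshape(B, 3, self.plane_res, self.plane_res, self.plane_ch)
```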

The team trained their model on large-scale datasets comprising synthetic renderings from Objaverse and real captures from MVImgNet, using only image-space supervision. DMV3D not only performs single-stage 3D generation in approximately 30 seconds on a single A100 GPU but also achieves state-of-the-art results in single-image 3D reconstruction, surpassing prior methods based on Score Distillation Sampling (SDS) and other 3D diffusion models.
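Training with image-space supervision only could look roughly like the following step, where `model` and `renderer` stand in for the denoiser and a differentiable volume renderer as sketched above; the plain MSE loss is a simplification (perceptual losses are common in this line of work):

```python
import torch
import torch.nn.functional as F

def training_step(model, renderer, views, cameras, alphas_cumprod):
    """Hypothetical training step: noise the input views, reconstruct a triplane,
    and supervise its renderings against the clean images (no 3D ground truth)."""
    B = views.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=views.device)
    a = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    noisy = a.sqrt() * views + (1.0 - a).sqrt() * torch.randn_like(views)
    planes = model(noisy, t)              # reconstruct a noise-free triplane NeRF
    rendered = renderer(planes, cameras)  # re-render at the supervising viewpoints
    return F.mse_loss(rendered, views)    # image-space loss only; no 3D supervision
```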

In summary, this work provides a fresh perspective on addressing 3D generation tasks by bridging the realms of 2D and 3D generative models, unifying 3D reconstruction and generation. The implications extend beyond the immediate applications, opening doors for the development of foundational models to tackle a diverse array of challenges in 3D vision and graphics.

The project website is at https://justimyhxu.github.io/projects/dmv3d/. The paper DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model is on arXiv.


Author: Hecate He | Editor: Chain Zhang


