The phenomenal performance of today’s state-of-the-art image generation models has spurred research on text-conditional 3D object generation. However, unlike 2D models, which can generate outputs in a few minutes or even seconds, 3D object generation models typically require multiple GPU-hours to produce a single sample.
In the new paper Point-E: A System for Generating 3D Point Clouds from Complex Prompts, an OpenAI research team presents Point·E, a system for text-conditional synthesis of 3D point clouds. The novel approach leverages diffusion models to generate diverse and complex 3D shapes from complex text prompts in just one to two minutes on a single GPU.
The team focuses on the task of text-to-3D generation, which is crucial for the democratization of 3D content creation across real-world applications ranging from virtual reality and gaming to industrial design. Current text-to-3D generation approaches fall into two categories, each with its drawbacks: 1) Generative models can efficiently produce samples but cannot scale effectively to diverse and complex text prompts; 2) Pretrained text-image models can be leveraged to process complex and diverse text prompts, but this approach is computationally expensive, and the models can easily fall into local minima that do not correspond to meaningful or coherent 3D objects.
The team thus explores an alternative approach designed to combine the strengths of both approaches by employing a text-to-image diffusion model trained on a large corpus of text-image pairs (which enables it to handle diverse and complex prompts) together with an image-to-3D diffusion model trained on a smaller dataset of image-3D pairs. The text-to-image model first generates a single synthetic view from the text prompt, and the image-to-3D model then generates a 3D point cloud conditioned on that sampled image.
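The two-stage pipeline described above can be sketched as follows. This is a minimal illustration only: the `TextToImageModel` and `ImageTo3DModel` classes are hypothetical stand-ins, not the actual Point·E API, and real sampling would run iterative diffusion denoising rather than returning random arrays.

```python
import numpy as np

class TextToImageModel:
    """Stand-in for a text-conditional image diffusion model (e.g. GLIDE)."""
    def sample(self, prompt: str) -> np.ndarray:
        # A real model would run iterative denoising; we return a dummy RGB view.
        rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
        return rng.random((64, 64, 3))

class ImageTo3DModel:
    """Stand-in for an image-conditional point-cloud diffusion model."""
    def sample(self, image: np.ndarray, num_points: int = 1024) -> np.ndarray:
        # Returns a num_points x 6 array: XYZ coordinates plus RGB colors.
        rng = np.random.default_rng(int(image.sum() * 1e6) % (2**32))
        return rng.random((num_points, 6))

def text_to_point_cloud(prompt: str) -> np.ndarray:
    view = TextToImageModel().sample(prompt)   # stage 1: text -> single synthetic view
    return ImageTo3DModel().sample(view)       # stage 2: image -> RGB point cloud

cloud = text_to_point_cloud("a red traffic cone")
print(cloud.shape)  # (1024, 6)
```

The key design point is that only the first stage needs to understand language, so it can be trained on abundant text-image data, while the second stage only needs paired image-3D data, which is far scarcer.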
The team bases its generative stack on the diffusion-based generative framework (Sohl-Dickstein et al., 2015; Song & Ermon, 2020b; Ho et al., 2020) for text-conditional image generation. They use a three-billion-parameter GLIDE model (Nichol et al., 2021), fine-tuned on rendered 3D models, as their text-to-image model, and a stack of diffusion models that generate RGB point clouds as their image-to-3D model.
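The diffusion framework cited above generates samples by starting from pure Gaussian noise and iteratively denoising it. The toy loop below illustrates only that sampling structure; `toy_denoiser` is a hypothetical stand-in for the learned network, and the update rule is simplified relative to any real DDPM schedule.

```python
import numpy as np

def toy_denoiser(x: np.ndarray, t: int, steps: int) -> np.ndarray:
    # Stand-in for a learned noise-prediction network; here it simply
    # predicts a fraction of the current sample, shrinking toward zero.
    return x * (t / steps)

def sample(shape=(1024, 3), steps=100, seed=0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)              # pure noise at t = steps
    for t in range(steps, 0, -1):
        x = x - toy_denoiser(x, t, steps) / steps  # one denoising step
    return x

cloud = sample()
print(cloud.shape)  # (1024, 3)
```

In Point·E the same loop shape applies, but the state being denoised is a point cloud (coordinates plus colors) and the denoiser is conditioned on the synthetic image from the first stage.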
While prior works use 3D-specific architectures to process point clouds, the researchers employ a simple transformer-based model (Vaswani et al., 2017) to improve efficiency. In their point cloud diffusion model architecture, images are first fed into a pretrained ViT-L/14 CLIP model, and the resulting output grid is then fed into the transformer as tokens.
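The conditioning step above amounts to flattening the CLIP feature grid into a token sequence and concatenating it with the noisy point tokens before the transformer. A minimal sketch, assuming a hypothetical 16×16×1024 feature grid (the exact grid shape depends on the CLIP variant and input resolution):

```python
import numpy as np

def build_transformer_input(feature_grid: np.ndarray,
                            point_tokens: np.ndarray) -> np.ndarray:
    # Flatten an H x W x D image-feature grid into H*W tokens of width D,
    # then prepend them to the point tokens as conditioning context.
    h, w, d = feature_grid.shape
    image_tokens = feature_grid.reshape(h * w, d)
    assert point_tokens.shape[1] == d, "token widths must match"
    return np.concatenate([image_tokens, point_tokens], axis=0)

grid = np.zeros((16, 16, 1024))      # hypothetical CLIP output grid
points = np.zeros((1024, 1024))      # hypothetical projected point tokens
tokens = build_transformer_input(grid, points)
print(tokens.shape)  # (1280, 1024)
```

Because the transformer treats image features as ordinary tokens, no 3D-specific inductive bias is needed, which is what keeps the architecture simple and efficient.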
In their empirical study, the team compared the proposed Point·E approach to other 3D generative models on evaluation prompts from the COCO object detection, segmentation, and captioning dataset. The results confirm Point·E’s ability to produce diverse and complex 3D shapes conditioned on complex text prompts while reducing inference time by one to two orders of magnitude. The team hopes their work can inspire further research on text-to-3D synthesis.
Author: Hecate He | Editor: Michael Sarazen