Generative image models are gaining tremendous popularity, wowing the general public with their stunning text-to-image abilities. The machine learning research community, meanwhile, has been exploring ways to extend these models to other modalities such as audio, video and 3D assets. How to represent 3D assets in a manner that is both efficient and suitable for downstream applications, however, remains an open question.
In the new paper Shap·E: Generating Conditional 3D Implicit Functions, an OpenAI research team proposes Shap·E, a conditional generative model that leverages implicit neural representations (INRs) to produce complex and diverse 3D assets. Shap·E has a faster convergence speed and achieves performance competitive with baselines while modelling a higher-dimensional, multi-representation output space.
INRs, which typically map 3D coordinates to location-specific information such as colour and density, have emerged as a flexible and expressive approach for representing 3D assets. This method has two main drawbacks: a high compute cost, since an INR must be acquired separately for each sample in the dataset; and the large number of numerical parameters in each INR, which makes training downstream generative models more difficult.
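To make the idea concrete, an INR can be pictured as a small MLP that takes a 3D coordinate and returns colour and density at that point. The sketch below is a minimal illustration of this mapping; the layer sizes, activations and random weights are assumptions for demonstration, not Shap·E's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small fully connected MLP (illustrative only)."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def inr_forward(params, coords):
    """Map (N, 3) coordinates to (N, 4) outputs: RGB colour + density."""
    h = coords
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)        # ReLU hidden layers
    w, b = params[-1]
    out = h @ w + b
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))   # colours squashed into [0, 1]
    density = np.log1p(np.exp(out[:, 3:]))    # softplus keeps density non-negative
    return np.concatenate([rgb, density], axis=1)

# One INR = one asset: querying it at any 3D point yields local appearance.
params = init_mlp([3, 64, 64, 4])
points = rng.uniform(-1, 1, size=(5, 3))      # five random query points
print(inr_forward(params, points).shape)      # (5, 4)
```

Note that every asset in a dataset needs its own set of `params`, which is exactly the per-sample fitting cost the paper seeks to avoid.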
In this paper, the researchers aim to improve on current INR approaches for diverse and complex 3D implicit representations. Following recent work by Chen & Wang and Dupont et al., they eschew gradient-based meta-learning methods and instead train a transformer-based encoder to generate INR parameters for 3D assets, then train a conditional diffusion model on the encoder's outputs to produce INRs representing both neural radiance fields (NeRFs) and meshes. This enables multiple rendering approaches and facilitates the incorporation of INRs into downstream 3D applications.
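The second stage, diffusion over encoder latents, follows the standard denoising recipe: noise a latent vector according to a schedule and train a network to predict the added noise. The toy step below illustrates that recipe under stated assumptions; the cosine schedule and the linear stand-in denoiser are simplifications, as the paper uses a transformer-based diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
latent = rng.standard_normal(d)   # stand-in for one encoder-produced latent

def diffusion_training_step(x0, t, denoiser_w):
    """One denoising step: corrupt x0 at time t, score the noise prediction."""
    alpha_bar = np.cos(t * np.pi / 2) ** 2                     # toy cosine schedule
    eps = rng.standard_normal(x0.shape)                        # Gaussian noise
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    eps_pred = denoiser_w @ x_t          # linear stand-in for the real denoiser
    return np.mean((eps_pred - eps) ** 2)                      # MSE on the noise

w = rng.standard_normal((d, d)) * 0.1
print(diffusion_training_step(latent, t=0.5, denoiser_w=w))
```

In practice this loss would be minimized over many assets and noise levels, and conditioning (e.g. on text) would be fed to the denoiser alongside the noised latent.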
In the proposed approach, the encoder consumes point clouds and rendered views of a 3D asset as inputs and outputs the parameters of a multi-layer perceptron (MLP) that represents the asset as an implicit function. Cross-attention is used to process the point cloud and input views and to generate latent representations as a sequence of vectors; each vector in this sequence is then passed through a latent bottleneck and a projection layer to output the MLP parameters.
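The encoder's flow described above can be sketched as follows. All dimensions, the single attention head, and the linear bottleneck and projection are illustrative assumptions; the actual model is a learned transformer with many heads and layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_in, d_model = 16, 32
n_latents, n_tokens = 8, 128
n_mlp_params = 256            # MLP-parameter chunk per latent (assumed size)

# Inputs: point-cloud and rendered-view features, fused into one token sequence.
context = rng.standard_normal((n_tokens, d_in))
latents = rng.standard_normal((n_latents, d_model))  # learned latent queries

# Projection matrices (learned in the real model; random here).
Wq = rng.standard_normal((d_model, d_model)) * 0.1
Wk = rng.standard_normal((d_in, d_model)) * 0.1
Wv = rng.standard_normal((d_in, d_model)) * 0.1

# Cross-attention: each latent vector gathers information from the input tokens.
q, k, v = latents @ Wq, context @ Wk, context @ Wv
attn = softmax(q @ k.T / np.sqrt(d_model), axis=-1)  # (n_latents, n_tokens)
latent_seq = attn @ v                                # sequence of latent vectors

# Latent bottleneck + projection layer -> one chunk of MLP parameters per vector.
W_bottleneck = rng.standard_normal((d_model, 8)) * 0.1
W_proj = rng.standard_normal((8, n_mlp_params)) * 0.1
mlp_params = (latent_seq @ W_bottleneck) @ W_proj    # (n_latents, n_mlp_params)
print(mlp_params.shape)
```

Concatenating the per-latent chunks would yield the full weight vector of the implicit-function MLP for one asset.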
In their empirical study, the team compared Shap·E to Point·E, an explicit generative model over point clouds, as a baseline. In the experiments, Shap·E converged faster than Point·E while achieving comparable or superior performance. Moreover, Shap·E demonstrated its ability to generate diverse 3D objects without relying on images as an intermediate representation.
Author: Hecate He | Editor: Michael Sarazen