Introduced in the award-winning 2020 ECCV paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Neural Radiance Fields (NeRFs) leverage a fully connected (non-convolutional) deep neural network to synthesize novel views of complex 3D scenes from partial 2D views. Although NeRFs have achieved state-of-the-art performance in generating photo-realistic, high-resolution and view-consistent scenes, their wider deployment remains restricted by extremely high computational demands and poor generalization ability.
In the new paper Is Attention All NeRF Needs?, a research team from the Indian Institute of Technology Madras and the University of Texas at Austin proposes Generalizable NeRF Transformer (GNT), a pure and universal transformer-based architecture for efficient on-the-fly NeRF reconstruction from source views. The work demonstrates that a pure attention mechanism can suffice for learning a physically-grounded rendering process.

The team summarizes their main contributions as follows:
- We propose a purely transformer-based NeRF architecture, dubbed GNT, that achieves more expressive and universal scene representation and rendering by unifying coordinate networks and volumetric renderer into a two-stage transformer capable of generalizing to unseen scenes when trained across instances.
- For a more expressive volumetric representation, GNT employs the view transformer to aggregate multi-view image features complying with epipolar geometry to infer coordinate-aligned features. To learn a more universal ray-based rendering, GNT utilizes the ray transformer to predict the ray colour. Together these two parts assemble a transformer that renders novel views by completely relying on the attention mechanism, and inherently learns to be depth- and occlusion-aware (a minimal sketch of this two-stage design follows the list).
- We empirically demonstrate that GNT significantly improves the PSNR of existing NeRFs (single scene) by up to ~1.3 dB in complex scenes. In the cross-scene generalization scenario, GNT achieves state-of-the-art perceptual metric scores by outperforming other baselines by up to ~20% ↓ LPIPS and ~12% ↑ SSIM.
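To make the two-stage design concrete, here is a minimal PyTorch sketch of the idea. This is an illustrative reconstruction, not the authors' implementation: the module names, tensor shapes, the learned-query aggregation, and the mean-pooled RGB readout are all assumptions made for clarity; GNT's actual blocks are deeper and more elaborate.

```python
import torch
import torch.nn as nn

class ViewTransformer(nn.Module):
    """Stage 1 (sketch): aggregate a 3D point's multi-view image features
    (sampled along epipolar lines) into one coordinate-aligned feature."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned query (assumption)

    def forward(self, epipolar_feats: torch.Tensor) -> torch.Tensor:
        # epipolar_feats: (num_points, num_views, dim) -- features sampled
        # where each 3D point projects onto each source view.
        q = self.query.expand(epipolar_feats.size(0), -1, -1)
        out, _ = self.attn(q, epipolar_feats, epipolar_feats)
        return out.squeeze(1)                      # (num_points, dim)

class RayTransformer(nn.Module):
    """Stage 2 (sketch): attend over the sequence of point features sampled
    along a ray and decode them directly into an RGB colour."""
    def __init__(self, dim: int, heads: int = 4, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (num_rays, samples_per_ray, dim)
        tokens = self.encoder(point_feats)
        # Mean-pooled readout is a simplification of the paper's scheme.
        return self.to_rgb(tokens.mean(dim=1))     # (num_rays, 3)

# Usage sketch: per-point multi-view tokens -> view transformer ->
# reshape to (num_rays, samples_per_ray, dim) -> ray transformer -> RGB.
```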
NeRF optimization has typically been treated as an inverse imaging problem that overfits a neural network to reproduce the observed views — but this training strategy imposes a huge computational burden. More recent works have proposed that the coordinate-based network is unnecessary and have recast NeRF optimization as a cross-view image-based interpolation problem that synthesizes a generalizable 3D representation from the seen views. However, radiance fields combined with volume rendering do not constitute a universal imaging model, and this limits NeRFs' generalization capability.
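For reference, a standard NeRF composites the colour of a ray $\mathbf{r}$ from $N$ samples with the fixed volume rendering quadrature

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),$$

where $\sigma_i$ and $\mathbf{c}_i$ are the predicted density and colour of the $i$-th sample and $\delta_i$ is the distance between adjacent samples. It is this hand-crafted integrator that GNT replaces with the ray transformer's learned attention.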


The proposed GNT takes a new approach, treating the transferable novel view synthesis task as a two-stage information aggregation process. It employs 1) a view transformer that leverages multi-view geometry as an inductive bias for attention-based scene representation and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighbouring views; and 2) a ray transformer that renders novel views by ray marching and decoding the sequence of sampled point features via the attention mechanism. The novel view rendering process thus relies only on attention, making it less time-consuming and allowing the model to learn depth and occlusion awareness.
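As an illustration of the first stage, the snippet below shows one plausible way to obtain the per-point tokens the view transformer attends over: each sampled 3D point is projected into every source view, and image features are bilinearly sampled at the projected pixels. The function name, shapes, and single-batch layout are assumptions for clarity, not the paper's code.

```python
import torch
import torch.nn.functional as F

def gather_epipolar_features(points, feats, K, w2c):
    """Project 3D points into each source view and sample image features
    at the projected pixels (hypothetical helper, illustrative only).

    points: (P, 3) world-space samples along the target rays
    feats:  (V, C, H, W) per-view feature maps from a 2D image encoder
    K:      (V, 3, 3) camera intrinsics
    w2c:    (V, 4, 4) world-to-camera extrinsics
    returns (P, V, C) coordinate-aligned feature tokens
    """
    V, C, H, W = feats.shape
    homog = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # (P, 4)
    cam = torch.einsum('vij,pj->vpi', w2c, homog)[..., :3]               # (V, P, 3)
    pix = torch.einsum('vij,vpj->vpi', K, cam)                           # (V, P, 3)
    # Perspective divide; clamp avoids division by zero (points behind
    # the camera would need masking in a real implementation).
    uv = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)                     # (V, P, 2)
    # Normalise pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)           # (V, P, 2)
    sampled = F.grid_sample(feats, grid.unsqueeze(2),
                            align_corners=True)                          # (V, C, P, 1)
    return sampled.squeeze(-1).permute(2, 0, 1)                          # (P, V, C)
```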



In their empirical evaluations, the team compared GNT with state-of-the-art novel view synthesis baselines such as NeRF, MipNeRF, and NLF. The results show that GNT significantly improves PSNR (peak signal-to-noise ratio) by up to ~1.3 dB in complex scenes and consistently achieves top results in cross-scene generalization scenarios, outperforming the baselines while demonstrating impressive generalization capabilities.
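For readers unfamiliar with the metric, PSNR measures reconstruction fidelity on a logarithmic scale. The hypothetical helper below shows the standard computation and what a ~1.3 dB gain implies for the underlying error.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# A ~1.3 dB PSNR gain corresponds to dividing the mean squared error
# by 10 ** (1.3 / 10) ≈ 1.35, i.e. roughly 26% lower MSE.
```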
Overall, the study confirms that a pure attention mechanism is sufficient for learning a physically-grounded rendering process and advances transformers as a potential “universal modelling tool” suitable even for graphical rendering.
The GNT code is available on the project’s GitHub. The paper Is Attention All NeRF Needs? is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
Pingback: IITM & UT Austin’s Generalizable NeRF Transformer Demonstrates Transformers’ Capabilities for Graphical Rendering • eSOFTNEWS