Google & Rutgers’ Aggregating Nested Transformers Yield Better Accuracy, Data Efficiency and Convergence

A research team from Google Cloud AI, Google Research and Rutgers University simplifies vision transformers’ complex design, proposing nested transformers (NesT) that simply stack basic transformer layers to process non-overlapping image blocks individually. The approach achieves superior ImageNet classification accuracy and improves model training efficiency.

The strong performance of vision transformers (ViTs) has attracted increasing research attention in recent years. These achievements, however, rely on sophisticated designs and massive datasets, which lead to high computation costs and low training efficiency.

In the paper Aggregating Nested Transformers, a research team from Google Cloud AI, Google Research and Rutgers University proposes simplifying ViTs’ complex design by incorporating nested transformers (NesT) that simply stack basic transformer layers to process non-overlapping image blocks individually. The novel approach achieves superior ImageNet classification accuracy and improves training efficiency.

Well-designed ViTs can outperform state-of-the-art convolutional neural networks on computer vision tasks when hundreds of millions of labelled training images are available. Such massive datasets are needed because they compensate for the inductive biases, such as locality and translation equivariance, that ViTs lack by design.

Previous studies have shown that ViTs capture locality by having the bottom layers attend locally to surrounding pixels while the top layers handle long-range dependencies. However, global self-attention compares every pair of positions, so its cost grows quadratically with the number of pixels, which imposes a heavy computation burden on high-resolution images.
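To make the computation burden concrete, here is a minimal counting sketch that compares the number of query-key pairs under global attention with the number under attention restricted to non-overlapping blocks. The function names, the 56 × 56 token grid and the 14 × 14 block size are illustrative assumptions, not figures from the paper.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every position attends to every position."""
    n = height * width
    return n * n


def block_attention_pairs(height: int, width: int, block: int) -> int:
    """Query-key pairs when attention stays inside non-overlapping
    block x block windows (block must divide height and width)."""
    if height % block or width % block:
        raise ValueError("block size must divide height and width")
    num_blocks = (height // block) * (width // block)
    pairs_per_block = (block * block) ** 2
    return num_blocks * pairs_per_block


# Hypothetical setting: a 56 x 56 grid of tokens split into 14 x 14 blocks.
global_pairs = global_attention_pairs(56, 56)
local_pairs = block_attention_pairs(56, 56, 14)
print(global_pairs, local_pairs, global_pairs // local_pairs)
```

In this setting the saving equals the number of blocks, since each of the 16 blocks only pays for its own pairs.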

Recent attempts to address this issue have proposed replacing holistic global self-attention with methods such as local self-attention and using hierarchical transformer structures to perform attention in local image patches. A downside to these approaches is that they require specialized and complex designs to promote information communication across patches and are difficult to implement.

The researchers summarize their contributions as:

  1. They demonstrate that integrating hierarchically nested transformers with the proposed block aggregation function can outperform previous sophisticated (local) self-attention methods, leading to a substantially simplified architecture and improved data efficiency.
  2. NesT achieves superior ImageNet classification accuracy.
  3. With proper block de-aggregation, NesT can also be repurposed into a strong decoder that achieves better performance than convnets with comparable speed. This is demonstrated by 64 × 64 ImageNet generation, a critical milestone towards adopting transformers for efficient generative modelling.
  4. A novel method for interpreting the NesT reasoning process by traversing its tree-like structure provides a unique type of visual interpretability that can explain how aggregated local transformers selectively process local visual cues from semantic image patches.

NesT is designed to conduct local self-attention on every image block independently and then nest these blocks hierarchically. Processed information is coupled between spatially adjacent blocks by the proposed block aggregation, which is applied between every two hierarchies. The team notes that NesT only communicates and mixes global information during the block aggregation step, via simple spatial operations. These design features enable NesT to leverage local attention to improve data efficiency.
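The block structure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`blockify`, `unblockify`, `max_pool_2x2`, `aggregate`) are hypothetical, the per-block transformer layers are omitted entirely, and a 2 × 2 max pool stands in for the "simple spatial operations" of the aggregation step, which may differ from what the paper uses.

```python
from typing import List

Grid = List[List[float]]


def blockify(grid: Grid, block: int) -> List[Grid]:
    """Split an H x W grid into non-overlapping block x block tiles, row-major."""
    h, w = len(grid), len(grid[0])
    if h % block or w % block:
        raise ValueError("block size must divide grid height and width")
    return [
        [row[c:c + block] for row in grid[r:r + block]]
        for r in range(0, h, block)
        for c in range(0, w, block)
    ]


def unblockify(blocks: List[Grid], blocks_per_row: int) -> Grid:
    """Inverse of blockify: stitch the tiles back into one full grid."""
    block = len(blocks[0])
    grid: Grid = []
    for start in range(0, len(blocks), blocks_per_row):
        row_of_blocks = blocks[start:start + blocks_per_row]
        for i in range(block):
            grid.append([v for b in row_of_blocks for v in b[i]])
    return grid


def max_pool_2x2(grid: Grid) -> Grid:
    """Assumed stand-in spatial operation: 2x2 max pooling with stride 2."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]


def aggregate(blocks: List[Grid], blocks_per_row: int, block: int) -> List[Grid]:
    """One hierarchy step: merge the blocks into a full map, downsample it,
    then re-split it into fewer blocks of the same size."""
    full = unblockify(blocks, blocks_per_row)
    pooled = max_pool_2x2(full)
    return blockify(pooled, block)


# Toy 8 x 8 "feature map" whose entries are their own flat indices.
feature_map = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
level_0 = blockify(feature_map, 4)   # local attention would run per block here
level_1 = aggregate(level_0, 2, 4)   # adjacent blocks are mixed only at this step
print(len(level_0), len(level_1))
```

The point of the sketch is that information crosses block boundaries only inside `aggregate`, where the blocks are stitched back into one map before downsampling; everything else operates on each block in isolation.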

The researchers conducted experiments on the CIFAR and ImageNet datasets to compare the accuracy and data efficiency of their proposed NesT against convolutional baselines (ResNet, EffNet) and transformer baselines (DeiT, Swin).

The results show that NesT models with 38M and 68M parameters attain 83.3 and 83.8 percent ImageNet accuracy, respectively, on 224 × 224 images, outperforming previous methods with up to 57 percent fewer parameters. The favourable data efficiency of NesT is reflected in its fast convergence: it reaches 75.9 percent accuracy with 30 total training epochs and 82.3 percent with 100. Finally, the study shows that training a NesT with 6M parameters on a single GPU yields 96 percent accuracy on CIFAR10, setting a new state-of-the-art for vision transformers.

The research shows that simply aggregating nested transformers can lead to better accuracy, data efficiency, and convergence of vision transformers.

The paper Aggregating Nested Transformers is on arXiv.


Author: Hecate He | Editor: Michael Sarazen, Chain Zhang