Meet Transformer in Transformer: A Visual Transformer That Captures Structural Information From Images

A new paper from Huawei, ISCAS and UCAS researchers proposes a novel Transformer-iN-Transformer (TNT) network architecture that outperforms conventional vision transformers on local information preservation and modelling for visual recognition.

Transformer architectures were introduced in 2017, and their computational efficiency and scalability quickly made them the de-facto standard for natural language processing (NLP) tasks. Recently, transformers have also begun to show their potential in computer vision (CV) tasks such as image recognition, object detection, and image processing.

Most of today’s visual transformers view an input image as a sequence of image patches while ignoring the intrinsic structural information within each patch, a deficiency that negatively impacts their overall visual recognition ability. The TNT model addresses this by modelling both patch-level and pixel-level representations.

While convolutional neural networks (CNNs) remain dominant in CV, transformer-based models have achieved promising performance on visual tasks without an image-specific inductive bias. A pioneering work in the application of transformers to image recognition tasks is Vision Transformer (ViT), which splits an image into a sequence of patches and transforms each patch into an embedding. ViT can thus process images using a standard transformer with few modifications, but it still does not take the images’ structural information into account.
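The patch-splitting step described above can be illustrated with a minimal sketch in plain Python. This is not the ViT implementation: the function names `split_into_patches` and `embed`, the tiny 4×4 single-channel image, and the hand-picked projection weights are all assumptions made for illustration (a real model learns the weights and works on multi-channel images).

```python
def split_into_patches(image, patch_size):
    """Split a 2-D image (a list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, which gives the sequence
    that a transformer would consume.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


def embed(patch, weights):
    """Linear projection of one flattened patch: one output per weight row."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


# Hypothetical 4x4 image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)

# Hypothetical projection from 4 pixel values to a 2-dimensional embedding.
weights = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]]
embeddings = [embed(p, weights) for p in patches]
print(patches[0], embeddings[0])
```

Each patch is flattened into a single vector before projection, so the model sees every patch only as one token and the arrangement of pixels inside it is not modelled explicitly.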

Illustration of the proposed Transformer-iN-Transformer (TNT) framework. T-Block denotes transformer block.

Like the ViT approach that inspired it, TNT splits an image into a sequence of patches. The difference is that TNT also reshapes each patch into a (super) pixel sequence. Linear transformations on the patches and pixels provide patch embeddings and pixel embeddings, which are then fed into a stack of TNT blocks for representation learning. Each TNT block contains an inner transformer block that extracts local structure information from the pixel embeddings, and an outer transformer block that models the global relation among patch embeddings. After the inner block, the pixel embeddings of each patch are linearly projected into the patch embedding space and added to the corresponding patch embedding, so that local information such as spatial detail is carried into the patch-level representation. Finally, the class token is used for classification via a Multi-Layer Perceptron (MLP) head.
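The data flow through one TNT block can be sketched roughly as follows, using NumPy. This is a simplified illustration under stated assumptions, not the authors' code: it uses single-head attention, a ReLU MLP, random untrained weights, no biases and no position encodings, and every name (`init_block`, `transformer_block`, `tnt_block`, `w_proj`) as well as the sizes `n`, `m`, `c`, `d` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def layer_norm(x, eps=1e-6):
    """Normalise the last axis to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def init_block(dim, hidden, scale=0.02):
    """Random weights for one simplified transformer block (assumed helper)."""
    shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
              "wo": (dim, dim), "w1": (dim, hidden), "w2": (hidden, dim)}
    return {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}


def transformer_block(x, p):
    """Pre-norm block with residuals. x has shape (..., seq_len, dim)."""
    h = layer_norm(x)
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(x.shape[-1])
    x = x + softmax(scores) @ v @ p["wo"]          # self-attention + residual
    h = layer_norm(x)
    return x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]  # MLP + residual


def tnt_block(pixels, patches, inner, outer, w_proj):
    """pixels: (n, m, c) pixel embeddings; patches: (n + 1, d), class token at row 0."""
    # Inner block: attention among the m pixel embeddings inside each patch.
    pixels = transformer_block(pixels, inner)
    # Flatten each patch's pixel embeddings, project to dimension d,
    # and add them to the matching patch embedding (class token is skipped).
    patches = patches.copy()
    patches[1:] += pixels.reshape(pixels.shape[0], -1) @ w_proj
    # Outer block: attention among the class token and the n patch embeddings.
    patches = transformer_block(patches, outer)
    return pixels, patches


# Hypothetical sizes: 4 patches, 16 pixels per patch, inner dim 6, outer dim 24.
n, m, c, d = 4, 16, 6, 24
pixels = rng.normal(size=(n, m, c))
patches = rng.normal(size=(n + 1, d))
inner, outer = init_block(c, 4 * c), init_block(d, 4 * d)
w_proj = rng.normal(0.0, 0.02, (m * c, d))

out_pixels, out_patches = tnt_block(pixels, patches, inner, outer, w_proj)
print(out_pixels.shape, out_patches.shape)  # (4, 16, 6) (5, 24)
```

In a full model, several such blocks would be stacked, and the first row of the final patch embeddings (the class token) would be passed to the MLP head.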

The researchers conducted extensive experiments on visual benchmarks to evaluate how well TNT models both global and local structure information in images and how this affects its feature representation learning. They chose the ImageNet ILSVRC 2012 dataset for image classification, and also tested on downstream tasks with transfer learning to evaluate TNT’s generalization ability. TNT was compared to recent transformer-based models such as ViT and DeiT, as well as CNN-based models including ResNet, RegNet and EfficientNet.

Results of TNT and other networks on ImageNet

In the evaluations, TNT-S achieved 81.3 percent top-1 accuracy, 1.5 percentage points higher than the baseline model DeiT-S. TNT outperformed all the other visual transformer models and the popular CNN-based models ResNet and RegNet, but was inferior to EfficientNet. The results show that while the proposed TNT architecture can outperform visual transformer baselines, it falls short of current SOTA CNN-based methods.

The paper Transformer in Transformer is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
