A team of researchers from Facebook and UC Berkeley has proposed a new paradigm for computer vision. Unlike conventional approaches that use pixel arrays and convolutions as image representations, the new paradigm leverages token-based image representation and visual transformers.
In the paper Visual Transformers: Token-based Image Representation and Processing for Computer Vision, the team demonstrates how the paradigm can overcome the limitations of existing approaches and achieve better performance with lower computational costs on vision tasks.
Visual information can be captured and processed as arrays of pixels. The computer vision field has seen significant advancements thanks to convolutions, the field's de facto deep learning operators, which accept and output pixel arrays with great success. However, the limitations of such approaches have challenged researchers to improve on the current paradigm. Researchers have noticed, for example, that not all pixels are equally important for a given target task. Convolutions nonetheless distribute computation uniformly over all pixels regardless of their relative importance, which leads to redundancy in both computation and representation.
The deep learning architecture Transformer has been widely used in natural language processing (NLP) and is now gaining popularity in computer vision. In this study, the researchers replace pixel arrays with language-like descriptors, or tokens. Tokenization is a prevalent task in NLP that chops up given character sequences and defined document units into pieces called tokens — an instance of a sequence of characters that are grouped together as a useful semantic unit for processing.
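The idea of tokenization can be illustrated with a toy example. The sketch below is only for intuition: real NLP tokenizers (byte-pair encoding, WordPiece, etc.) are far more sophisticated than a whitespace split, and the `tokenize` helper here is a hypothetical stand-in, not code from the paper.

```python
# Toy illustration of tokenization: chopping a character sequence into
# pieces (tokens) that serve as semantic units for downstream processing.
# A whitespace split is the simplest possible tokenizer.
def tokenize(text):
    return text.lower().split()

tokens = tokenize("Visual Transformers process tokens not pixels")
# A transformer then operates on this short sequence of units rather
# than on the raw character (or, by analogy, pixel) array.
```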
The team first uses convolutions to extract a feature map representation from a given image, as convolutions are very effective at processing low-level visual features. Once an image is represented as a set of visual tokens, the researchers apply visual transformers to model relationships between visual semantic concepts.
There are three major components in each stacked visual transformer — a tokenizer, a transformer, and a projector. The tokenizer is responsible for extracting a small number of visual tokens from the feature map. The transformer then captures the interactions between the visual tokens and computes output tokens. Finally, the projector fuses the output tokens back into the feature map. The researchers explain that they fuse the transformer's output back into the feature map to augment the pixel-level representation, because many vision tasks require pixel-level details that are not preserved in the visual tokens. Each visual transformer layer therefore outputs both visual tokens and a feature map: the former capture high-level semantics in the image, while the latter preserves the pixel-level details.
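The three-component flow above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the paper's implementation: the weight matrices (`W_a`, `Wq`, `Wk`, `Wv`), the single-head attention, and the shared projections are simplifications, and the real model includes learned parameters, multiple heads, and normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
HW, C, L = 196, 64, 8        # 14x14 feature map, 64 channels, 8 visual tokens
X = rng.standard_normal((HW, C))              # flattened conv feature map

# -- Tokenizer: spatial attention pools all pixels into L visual tokens --
W_a = rng.standard_normal((C, L))
A = softmax(X @ W_a, axis=0)                  # (HW, L) weights over pixels
T = A.T @ X                                   # (L, C) visual tokens

# -- Transformer: self-attention among the L tokens only --
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
attn = softmax((T @ Wq) @ (T @ Wk).T / np.sqrt(C), axis=-1)
T_out = T + attn @ (T @ Wv)                   # (L, C) output tokens

# -- Projector: fuse output tokens back into the pixel-level map --
# (a real implementation would use separate projection weights here)
P = softmax((X @ Wq) @ (T_out @ Wk).T / np.sqrt(C), axis=-1)  # (HW, L)
X_out = X + P @ T_out                         # (HW, C) augmented feature map
```

Note that the expensive attention step runs over only L = 8 tokens, while the feature map retains its full spatial resolution for pixel-level tasks.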
In experiments, the researchers observed that the computational cost of the visual transformer is very low, since the new paradigm only requires computing on a small number of visual tokens. For classification on ImageNet, the team used ResNet as a baseline and replaced the last stage of the network with visual transformer blocks, reducing MACs (multiply-accumulate operations, among the most time-consuming operations in a deep neural network) for that stage by up to 6.9x while achieving up to 4.53 points higher top-1 accuracy. The proposed paradigm showed similar advantages on semantic segmentation tasks on the COCO-Stuff and Look-Into-Person datasets.
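Some back-of-the-envelope arithmetic shows why operating on a handful of tokens is so much cheaper than convolving over every pixel. The dimensions below are illustrative assumptions, not figures from the paper.

```python
# Rough MAC counts (illustrative dimensions, not the paper's).
H = W = 14          # spatial size of a late-stage feature map
C = 256             # channels
L = 16              # number of visual tokens
K = 3               # conv kernel size

# A 3x3 convolution touches every pixel position:
conv_macs = H * W * C * C * K * K             # ~115.6M MACs

# Self-attention over L tokens: Q/K/V projections plus attention mixing:
token_macs = 3 * L * C * C + 2 * L * L * C    # ~3.3M MACs

ratio = conv_macs / token_macs                # conv is ~35x more expensive here
```

The exact ratio depends on the chosen dimensions, but the scaling argument is the same one the paper makes: per-token computation is decoupled from the number of pixels.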
The paper Visual Transformers: Token-based Image Representation and Processing for Computer Vision and pseudocode for the visual transformer are on arXiv, and the model will be open-sourced after the paper's publication.
Journalist: Fangyu Cai | Editor: Michael Sarazen