Facebook and UC Berkeley Boost CV Performance and Lower Compute Cost With Visual Transformers

A team of researchers from Facebook and UC Berkeley has proposed a new paradigm for computer vision. Unlike conventional approaches that use pixel arrays and convolutions as image representations, the new paradigm leverages token-based image representation and visual transformers.

In the paper Visual Transformers: Token-based Image Representation and Processing for Computer Vision, the team demonstrates how the paradigm can overcome the limitations of existing approaches and achieve better performance with lower computational costs on vision tasks.

Visual information is conventionally captured and processed as arrays of pixels. The computer vision field has advanced significantly thanks to convolutions, the de facto deep learning operators for vision, which take pixel arrays as input and produce pixel arrays as output. These approaches have limitations, however. Researchers have noticed, for example, that not all pixels are equally important for the target task. Convolutions distribute computation uniformly over all pixels regardless of their importance, which causes redundancy in both computation and representation.

The Transformer deep learning architecture has been widely used in natural language processing (NLP) and is now gaining popularity in computer vision. In this study, the researchers replace pixel arrays with language-like descriptors, or tokens. Tokenization is a common step in NLP that splits a character sequence into pieces called tokens, where each token is a sequence of characters grouped together as a useful semantic unit for processing.

The team first uses convolutions to extract a feature map representation from a given image, as convolutions are very effective at processing low-level visual features. Once the image is represented as a set of visual tokens, the researchers apply visual transformers to find relationships between visual semantic concepts.

There are three major components in each stacked visual transformer: a tokenizer, a transformer, and a projector. The tokenizer extracts a small number of visual tokens from the feature map. The transformer then captures the interactions between the visual tokens and computes output tokens. Finally, the projector fuses the output tokens back into the feature map. The researchers explain that this fusion augments the pixel-level representation, because many vision tasks require pixel-level details that are not preserved in the visual tokens. Each visual transformer layer therefore outputs both visual tokens and a feature map: the former capture high-level semantics in the image, while the latter preserves pixel-level details.
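The data flow through the three components can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation: the attention formulas, the single attention head, the random weights, and all function and parameter names (`tokenizer`, `transformer`, `projector`, `w_a`, `p_q`, and so on) are assumptions chosen for illustration. The feature map is treated as a flattened list of N pixels with C channels each.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Softmax over each row, shifted by the row max for numerical stability."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def tokenizer(x, w_a):
    """Pool N pixels into L tokens using spatial attention weights (assumed form)."""
    # Scores are (N, L); transposing first makes the softmax run over pixels per token.
    attn = softmax_rows(transpose(matmul(x, w_a)))  # (L, N)
    return matmul(attn, x)  # (L, C)

def transformer(t, w_q, w_k, w_v):
    """Single-head self-attention among tokens with a residual connection."""
    scale = math.sqrt(len(t[0]))
    scores = matmul(matmul(t, w_q), transpose(matmul(t, w_k)))  # (L, L)
    attn = softmax_rows([[s / scale for s in row] for row in scores])
    return add(t, matmul(attn, matmul(t, w_v)))

def projector(x, t, p_q, p_k):
    """Let every pixel attend over the tokens and add the result to the feature map."""
    scores = matmul(matmul(x, p_q), transpose(matmul(t, p_k)))  # (N, L)
    return add(x, matmul(softmax_rows(scores), t))  # (N, C)

def visual_transformer_layer(x, params):
    """Return both outputs: the augmented feature map and the visual tokens."""
    tokens = tokenizer(x, params["w_a"])
    tokens = transformer(tokens, params["w_q"], params["w_k"], params["w_v"])
    return projector(x, tokens, params["p_q"], params["p_k"]), tokens

def demo(num_pixels=49, channels=8, num_tokens=4, seed=0):
    """Run one layer on a random 7x7 feature map (hypothetical sizes)."""
    rng = random.Random(seed)
    x = rand_matrix(num_pixels, channels, rng)
    params = {"w_a": rand_matrix(channels, num_tokens, rng)}
    for name in ("w_q", "w_k", "w_v", "p_q", "p_k"):
        params[name] = rand_matrix(channels, channels, rng)
    return visual_transformer_layer(x, params)
```

Calling `demo()` returns a feature map with the same shape as the input (49 pixels by 8 channels) together with 4 tokens of 8 channels each, which mirrors the two outputs described above.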

In experiments, the researchers observed that the computational cost of the visual transformer is very low, because the new paradigm only requires computation on a small number of visual tokens. For classification on ImageNet, the team used ResNet as a baseline and replaced the last stage of the network with visual transformer blocks. This reduced the stage's MACs (multiply-accumulate operations, which dominate the compute cost of a deep neural network) by up to 6.9x while achieving up to 4.53 points higher top-1 accuracy. The proposed paradigm showed similar advantages on semantic segmentation with the COCO-Stuff and Look-Into-Person datasets.
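To give a rough sense of why operating on a few tokens is cheap, the sketch below counts MACs for one 3x3 convolution and for a simplified token-based layer. The counting formulas, the layer structure, and the sizes (a 14x14 map, 256 channels, 16 tokens) are assumptions for illustration only; they do not reproduce the paper's measurements or its 6.9x figure.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """MACs for one stride-1, same-padded convolution over a height x width map."""
    return height * width * kernel * kernel * c_in * c_out

def token_layer_macs(height, width, channels, num_tokens):
    """MACs for a simplified tokenizer + single-head self-attention + projector."""
    n, c, t = height * width, channels, num_tokens
    tokenizer = n * c * t + t * n * c                   # attention scores + pooling
    transformer = 3 * t * c * c + 2 * t * t * c         # q/k/v projections + attention
    projector = n * c * c + t * c * c + 2 * n * t * c   # pixel queries, token keys, attention
    return tokenizer + transformer + projector

if __name__ == "__main__":
    conv = conv_macs(14, 14, 256, 256)
    tokens = token_layer_macs(14, 14, 256, 16)
    print(f"3x3 convolution: {conv:,} MACs")
    print(f"token-based layer: {tokens:,} MACs")
```

In this toy configuration the token-based layer needs fewer MACs than a single 3x3 convolution, mainly because the attention terms scale with the number of tokens and not with a 3x3 neighborhood around every pixel.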

The paper Visual Transformers: Token-based Image Representation and Processing for Computer Vision, which includes pseudocode for the visual transformer, is on arXiv. The model will be open-sourced after the paper is published.

Journalist: Fangyu Cai | Editor: Michael Sarazen
