AI Machine Learning & Data Science Research

Can ViT Layers Express Convolutions? Peking U, UCLA & Microsoft Researchers Say ‘Yes’

Studies have shown that vision transformer (ViT) architectures can outperform traditional convolutional neural networks (CNNs) on a variety of challenging vision tasks, mainly because convolution captures only local features while transformers' self-attention mechanism enables context-dependent, global weighting. This has led researchers to ask whether transformer layers are strictly more powerful than convolutions, and in particular whether ViT self-attention layers can express convolution operations.

In the new paper Can Vision Transformers Perform Convolution?, a research team from Peking University, UCLA and Microsoft Research proves that a single ViT layer with image patches as the input can perform any convolution operation constructively, and shows that ViT performance in low data regimes — where they still trail CNNs — can be significantly improved using the team’s proposed ViT training pipeline.

The team summarises their study’s contributions as:

  1. We provide a constructive proof to show that a 9-head self-attention layer in Vision Transformers with image patches as input can perform any convolution operation, where the key insight is to leverage the multi-head attention mechanism and relative positional encoding to aggregate features for computing convolution.
  2. We prove lower bounds on the number of heads required for self-attention layers to express the convolution operation, for both the patch-input and the pixel-input setting. This result shows that the construction in the above-mentioned constructive proof is optimal in terms of the number of heads. Specifically, we show that 9 heads are both necessary and sufficient for a self-attention layer with patch input to express convolution with a K × K kernel, while a self-attention layer with pixel input needs K² heads to do so. Therefore, Vision Transformers with patch input are more head-efficient than those with pixel input when expressing convolution.
  3. We propose a two-phase training pipeline for Vision Transformers. The key component in this pipeline is to initialize ViT from a well-trained CNN using the construction in our theoretical proof. We empirically show that with the proposed training pipeline that explicitly injects the convolutional bias, ViT can achieve much better performance compared with models trained with random initialization in low data regimes.
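The head-count bounds in the second contribution can be summarized in a tiny, hypothetical helper (the function and its name are illustrative, not from the paper); per the article's statement, the patch-input requirement is a constant 9 heads, while the pixel-input requirement grows quadratically with the kernel size K:

```python
def heads_needed(kernel_size: int, patch_input: bool) -> int:
    """Minimum attention heads to express a kernel_size x kernel_size
    convolution, per the bounds stated in the paper."""
    if patch_input:
        return 9                # constant: 9 heads for patch input
    return kernel_size ** 2     # pixel input needs K^2 heads

# For a 5x5 kernel: 9 heads with patch input vs. 25 with pixel input.
for k in (3, 5, 7):
    print(k, heads_needed(k, patch_input=True), heads_needed(k, patch_input=False))
```

For K = 3 the two settings coincide (9 heads either way); the efficiency gap appears for larger kernels.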

A ViT takes sequences of image patches as input, which are processed by transformer layers comprising a multi-head self-attention (MHSA) sub-layer and a feed-forward network (FFN) sub-layer. In this work, the researchers focus on whether an MHSA layer can express a convolutional layer.

The study provides a number of theoretical insights, showing that a ViT MHSA layer can express a convolutional layer in the patch-input setting, and proving lower bounds on the number of heads for self-attention layers to express the convolution operation for both patch and pixel input settings. The researchers conclude that MHSA layers with patch inputs are more head-efficient in expressing convolutions.

Inspired by these theoretical findings, the team proposes a two-phase training pipeline for ViTs in low data regimes. In the first, convolution phase, the team trains a “convolutional” ViT variant where the MHSA layer is replaced by a convolutional layer. In the second, self-attention phase, the pretrained model’s weights are transferred to a transformer model and training continues on the same dataset.
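The two-phase pipeline can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the "models" are plain dicts, `train` is a stand-in for an actual optimization loop, and `conv_to_mhsa_weights` is a hypothetical placeholder for the paper's constructive conv-to-9-head-MHSA mapping:

```python
def train(model: dict, steps: int) -> dict:
    # Stand-in for a real training loop; just records progress.
    model["trained_steps"] = model.get("trained_steps", 0) + steps
    return model

def conv_to_mhsa_weights(conv_kernel) -> dict:
    # Placeholder for the constructive mapping from a trained conv
    # kernel to the weights of a 9-head self-attention layer.
    return {f"head_{i}": conv_kernel for i in range(9)}

# Phase 1 (convolution phase): train the "convolutional" ViT variant,
# where the MHSA sub-layer is replaced by a convolutional layer.
conv_vit = {"conv_kernel": [[0.1] * 3] * 3}
conv_vit = train(conv_vit, steps=100)

# Phase 2 (self-attention phase): transfer the weights into the
# transformer and continue training on the same dataset.
vit = {"mhsa": conv_to_mhsa_weights(conv_vit["conv_kernel"])}
vit = train(vit, steps=100)

print(len(vit["mhsa"]))  # prints 9: heads initialized from the conv kernel
```

The design point the sketch captures is that phase 2 does not start from random initialization; the convolutional inductive bias learned in phase 1 is injected explicitly via the weight construction.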

The team conducted experiments with 6-layer ViTs in the low-data regime, training on CIFAR-100. The proposed training schema achieved a Top-1 accuracy of 78.74% and a Top-5 accuracy of 94.40%, at a comparable training time of 0.98× the baseline.

Overall, the study proves that ViTs with image patches as input can perform any convolution operation, and demonstrates that the proposed training schema can improve ViT test accuracy, training efficiency and optimization stability in low data regimes.

The paper Can Vision Transformers Perform Convolution? is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

