Vision transformers (ViTs) are the current go-to architecture for computer vision tasks. They split input images into non-overlapping patches and perform computations on tokens derived from these patches. The patch size governs a speed/accuracy tradeoff (smaller patches yield higher accuracy at higher compute cost), but a model performs well only at the patch size it was trained on. Changing the patch size typically requires retraining the model to maintain its performance level.
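To make the tradeoff concrete, here is a small illustrative sketch (the 224×224 resolution and patch sizes 8/16/32 are standard ViT settings, not figures from this article): halving the patch size quadruples the token count, and self-attention cost grows quadratically in the number of tokens.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches (tokens) a ViT computes on."""
    return (image_size // patch_size) ** 2

# For a 224x224 input:
for p in (32, 16, 8):
    print(f"patch {p:2d} -> {num_tokens(224, p)} tokens")
# patch 32 ->  49 tokens
# patch 16 -> 196 tokens
# patch  8 -> 784 tokens
```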
A Google Research team addresses this limitation in the new paper FlexiViT: One Model for All Patch Sizes. The team leverages ViTs’ unique patch embedding strategy to develop a simple and efficient way of trading off compute and predictive performance with a single model. Their proposed FlexiViT is a flexible ViT that performs well across a wide range of patch sizes, matching or outperforming standard fixed-patch ViT performance with no additional cost.
FlexiViT is built upon the standard ViT architecture. Like ViTs, FlexiViT employs a “patchification” process that tokenizes the input image into a sequence of patches. But instead of training on a fixed patch size, FlexiViT randomizes the patch size during training, resizing the position and patch embedding parameters adaptively to each sampled size. This yields a single set of weights that performs well across a wide range of patch sizes. FlexiViT uses an optimized resizing strategy and applies a knowledge distillation approach to further improve performance.
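The resizing strategy can be sketched as follows. A naive bilinear resize of the patch-embedding weights changes the magnitude of the resulting tokens; the paper instead derives the resized weights via a pseudo-inverse so that a resized patch produces (approximately) the same token as the original. The minimal NumPy sketch below reflects our reading of that idea; all function names and dimensions are illustrative, not from the paper's code.

```python
import numpy as np
from scipy.ndimage import zoom

def resize_matrix(p_old: int, p_new: int) -> np.ndarray:
    """Materialize bilinear patch resizing as a linear map B of shape
    (p_new^2, p_old^2) by resizing each basis image."""
    cols = []
    for i in range(p_old * p_old):
        e = np.zeros(p_old * p_old)
        e[i] = 1.0
        resized = zoom(e.reshape(p_old, p_old), p_new / p_old, order=1)
        cols.append(resized.flatten())
    return np.stack(cols, axis=1)

def pi_resize(w: np.ndarray, p_new: int) -> np.ndarray:
    """Pseudo-inverse resize of patch-embedding weights w (p_old, p_old, d):
    pick w_hat so that <resize(x), w_hat> ~= <x, w> for any patch x."""
    p_old, _, d = w.shape
    B = resize_matrix(p_old, p_new)
    # Solve B^T w_hat = w in the least-squares sense via the pseudo-inverse.
    w_hat = np.linalg.pinv(B.T) @ w.reshape(p_old * p_old, d)
    return w_hat.reshape(p_new, p_new, d)

# Upsample 8x8 embedding weights to 16x16; tokens are preserved.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8, 4))
w_big = pi_resize(w, 16)
x = rng.normal(size=(8, 8))                     # a random input patch
tok_old = x.flatten() @ w.reshape(-1, 4)
tok_new = zoom(x, 2, order=1).flatten() @ w_big.reshape(-1, 4)
print(np.allclose(tok_old, tok_new, atol=1e-6))  # True
```

When upsampling, the resize map is injective, so the pseudo-inverse solution matches the original token exactly up to numerical error; for downsampling it is a least-squares approximation.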
To evaluate the flexibility and efficiency of their proposed approach, the team applied FlexiViT models to downstream tasks such as image classification, transfer learning, panoptic and semantic segmentation, image-text retrieval, and open-world recognition.
On image classification tasks, FlexiViT achieved results comparable to conventional ViT architectures when evaluated on their training patch sizes but significantly surpassed such models when dealing with other patch sizes.
The team also shows it is possible to initialize a student FlexiViT with the weights of a well-performing ViT teacher to boost model performance. After initializing and distilling from a powerful ViT-B/8 model, the resulting FlexiViT-B reaches the teacher’s performance at small patch sizes and exhibits significant improvements at larger patch sizes. Finally, they demonstrate FlexiViT’s transfer learning ability by finetuning the model cheaply with a large patch size, then deploying it on downstream tasks with a smaller patch size, where it again achieves strong performance.
The proposed FlexiViT introduces a simple and efficient approach for significantly reducing ViT pretraining costs by training a single model for all patch sizes. The team hopes their results will encourage additional research on the creative application of ViT patchification procedures.
The paper FlexiViT: One Model for All Patch Sizes is on arXiv.
Author: Hecate He | Editor: Michael Sarazen