The Vision Transformer (ViT) has come to dominate the field of computer vision, demonstrating superior performance and flexibility in handling various input sequence lengths. Its strong performance has positioned it as a formidable contender to displace conventional convolutional neural networks (CNNs).
In a new paper Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, a Google DeepMind research team introduces an advanced version of ViT called NaViT (Native Resolution ViT). This enhanced model is designed to handle input sequences of arbitrary resolutions and aspect ratios, further broadening ViT's potential application to diverse tasks within computer vision.
The team summarizes their main findings in this work as follows:
- Randomly sampling resolutions at training time significantly reduces training cost.
- NaViT results in high performance across a wide range of resolutions, enabling smooth cost-performance trade-off at inference time, and can be adapted with less cost to new tasks.
- Fixed batch shapes enabled by example packing lead to new research ideas, such as aspect-ratio preserving resolution-sampling, variable token dropping rates, and adaptive computation.
NaViT extends ViT with the capability to pack patches from multiple different images into a single sequence, which the researchers term Patch n’ Pack. To enable this capability, the team makes two modifications to the original ViT: 1) masked self-attention and masked pooling, which prevent examples from attending to each other; and 2) factorized and fractional positional embeddings, which enable variable aspect ratios and readily extrapolate to unseen resolutions.
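The masking idea can be illustrated with a minimal sketch: when patches from several images share one sequence, a block-diagonal attention mask restricts each token to attend only to tokens from its own image. The function name and the use of per-token example IDs below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def pack_mask(example_ids):
    """Boolean attention mask for a packed sequence: token i may attend to
    token j only if both tokens come from the same example (illustrative)."""
    ids = np.asarray(example_ids)
    # Broadcasting compares every pair of token IDs, yielding a
    # block-diagonal True/False matrix.
    return ids[:, None] == ids[None, :]

# Two images packed into one sequence: 3 patches from image 0, 2 from image 1.
mask = pack_mask([0, 0, 0, 1, 1])
# Entries that pair patches from different images are False, so masked
# self-attention keeps the packed examples independent.
```

The same per-example mask serves masked pooling: a pooled representation for each image is computed only over that image's own tokens.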
Moreover, Patch n’ Pack makes new and effective training techniques applicable. It enables continuous token dropping, whereby the token dropping rate can be varied per image, thereby accelerating training and inference. The model can also be trained on mixed-resolution images by sampling from a distribution of image sizes while preserving each image’s original aspect ratio. As such, it allows higher throughput and exposure to large images, yielding substantial improvement over conventional ViTs.
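Per-image token dropping composes naturally with packing, since packed sequences need not have uniform per-image lengths. The sketch below, a rough assumption of how such dropping might look rather than the paper's code, keeps a random subset of each image's patch tokens at its own rate before concatenating them into one sequence.

```python
import numpy as np

def drop_tokens(patches, drop_rate, rng):
    """Randomly keep a subset of patch tokens; drop_rate can differ per image
    (illustrative sketch, not the paper's implementation)."""
    n = patches.shape[0]
    keep = max(1, int(round(n * (1.0 - drop_rate))))
    idx = rng.choice(n, size=keep, replace=False)
    return patches[np.sort(idx)]  # preserve original patch order

rng = np.random.default_rng(0)
img_a = np.zeros((196, 768))  # e.g. a 14x14 grid of patch embeddings
img_b = np.zeros((64, 768))   # a smaller image contributes fewer patches
# Different drop rates per image, then pack the survivors into one sequence.
packed = np.concatenate([drop_tokens(img_a, 0.5, rng),
                         drop_tokens(img_b, 0.2, rng)])
```

Because the two images are dropped at different rates, the packed sequence is shorter than the raw patch count, which is where the training-speed gain comes from.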
In their empirical study, the team evaluated the JFT pretraining performance of NaViT against ViT baselines. The results show that NaViT consistently outperforms ViT while significantly improving training efficiency. Moreover, thanks to its flexibility across resolutions at inference time, NaViT can be cheaply adapted to new tasks.
The paper Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution is on arXiv.
Author: Hecate He | Editor: Chain Zhang
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.