The Vision Transformer (ViT) has come to dominate the field of computer vision, demonstrating superior performance and flexibility in handling various input sequence lengths. Its strong performance has positioned it as a formidable contender to displace conventional convolutional neural networks (CNNs).
In a new paper Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, a Google DeepMind research team introduces an advanced version of ViT called NaViT (Native Resolution ViT). This enhanced model is designed to handle input sequences of arbitrary resolutions and aspect ratios, further broadening ViT's potential application to diverse tasks within computer vision.
The team summarizes their main findings in this work as follows:
- Randomly sampling resolutions at training time significantly reduces training cost.
- NaViT results in high performance across a wide range of resolutions, enabling smooth cost-performance trade-off at inference time, and can be adapted with less cost to new tasks.
- Fixed batch shapes enabled by example packing lead to new research ideas, such as aspect-ratio preserving resolution-sampling, variable token dropping rates, and adaptive computation.
NaViT extends ViT with the capability to pack patches from multiple different images into a single sequence, which the researchers term Patch n’ Pack. To enable this capability, the team makes two modifications to the original ViT: 1) masked self-attention and masked pooling, which prevent examples from attending to each other; and 2) factorized and fractional positional embeddings, which enable variable aspect ratios and readily extrapolate to unseen resolutions.
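The masking idea can be illustrated with a minimal sketch: when patches from several images share one sequence, a block-diagonal attention mask restricts each token to attend only to tokens from its own image. The function name and the use of per-token example IDs below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def pack_mask(example_ids):
    """Boolean attention mask for a packed sequence: token i may attend to
    token j only if both tokens come from the same example (illustrative)."""
    ids = np.asarray(example_ids)
    # Broadcasting compares every pair of token IDs, yielding a
    # block-diagonal True/False matrix.
    return ids[:, None] == ids[None, :]

# Two images packed into one sequence: 3 patches from image 0, 2 from image 1.
mask = pack_mask([0, 0, 0, 1, 1])
# Entries that pair patches from different images are False, so masked
# self-attention keeps the packed examples independent.
```

The same per-example mask serves masked pooling: a pooled representation for each image is computed only over that image's own tokens.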
Moreover, Patch n’ Pack makes new and effective training techniques applicable. It enables continuous token dropping, whereby the token dropping rate can be varied per image, thereby accelerating training and inference. The model can also be trained on mixed-resolution images by sampling from a distribution of image sizes while preserving each image’s original aspect ratio. As such, it allows higher throughput and exposure to large images, yielding substantial improvement over conventional ViTs.
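Per-image token dropping composes naturally with packing, since packed sequences need not have uniform per-image lengths. The sketch below, a rough assumption of how such dropping might look rather than the paper's code, keeps a random subset of each image's patch tokens at its own rate before concatenating them into one sequence.

```python
import numpy as np

def drop_tokens(patches, drop_rate, rng):
    """Randomly keep a subset of patch tokens; drop_rate can differ per image
    (illustrative sketch, not the paper's implementation)."""
    n = patches.shape[0]
    keep = max(1, int(round(n * (1.0 - drop_rate))))
    idx = rng.choice(n, size=keep, replace=False)
    return patches[np.sort(idx)]  # preserve original patch order

rng = np.random.default_rng(0)
img_a = np.zeros((196, 768))  # e.g. a 14x14 grid of patch embeddings
img_b = np.zeros((64, 768))   # a smaller image contributes fewer patches
# Different drop rates per image, then pack the survivors into one sequence.
packed = np.concatenate([drop_tokens(img_a, 0.5, rng),
                         drop_tokens(img_b, 0.2, rng)])
```

Because the two images are dropped at different rates, the packed sequence is shorter than the raw patch count, which is where the training-speed gain comes from.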
In their empirical study, the team evaluated the JFT pretraining performance of NaViT against ViT baselines. The results show that NaViT consistently outperforms ViT while significantly improving training efficiency. Moreover, thanks to its flexibility across resolutions at inference time, NaViT can be cheaply adapted to new tasks.
The paper Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution is on arXiv.
Author: Hecate He | Editor: Chain Zhang
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.