First proposed in 2020, vision transformers (ViTs) have demonstrated promising performance across a variety of computer vision tasks. These breakthroughs, however, have come at the cost of speed: ViTs run much slower than comparable convolutional neural networks (CNNs). Their high latency and computational costs have made it challenging to deploy ViTs on resource-constrained hardware such as mobile devices, limiting their real-world application.
A research team from Snap Inc. and Northeastern University addresses this issue in the new paper EfficientFormer: Vision Transformers at MobileNet Speed, which identifies inefficient operators in ViT architectures and proposes a new ViT design paradigm. The team’s resulting EfficientFormer models run as fast as lightweight MobileNet CNNs while maintaining the high performance of transformer architectures.
The researchers summarize their study’s main contributions as:
- We revisit the design principles of ViT and its variants through latency analysis. We utilize iPhone 12 as the testbed and publicly available CoreML as the compiler, since the mobile device is widely used and the results can be easily reproduced.
- Based on our analysis, we identify inefficient designs and operators in ViT and propose a new dimension-consistent design paradigm for vision transformers.
- Starting from a supernet with the new design paradigm, we propose a simple yet effective latency-driven slimming method to obtain a new family of models, namely, EfficientFormers. We directly optimize for inference speed instead of MACs or the number of parameters.
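The latency-driven slimming idea can be sketched as a simple budgeted search: given a per-block latency lookup table and a learned importance score for each supernet block, keep the most important blocks that fit an on-device latency budget. The sketch below is illustrative only; the field names, scores, and greedy strategy are assumptions for exposition, not the paper's exact gradient-based algorithm.

```python
# Illustrative latency-driven slimming sketch (not the paper's exact method).
# Assumes each supernet block carries an importance score (e.g. from a learned
# gating parameter) and a measured on-device latency from a lookup table.

def slim_supernet(blocks, latency_budget_ms):
    """Greedily drop the least important blocks until the summed latency
    fits the budget. `blocks` is a list of dicts with hypothetical
    'name', 'importance', and 'latency_ms' fields."""
    kept = sorted(blocks, key=lambda b: b["importance"], reverse=True)
    total = sum(b["latency_ms"] for b in kept)
    while total > latency_budget_ms and kept:
        dropped = kept.pop()  # remove the least important remaining block
        total -= dropped["latency_ms"]
    return kept, total

# Toy supernet with made-up scores and latencies:
blocks = [
    {"name": "stage1_block0", "importance": 0.9, "latency_ms": 0.3},
    {"name": "stage2_block0", "importance": 0.7, "latency_ms": 0.4},
    {"name": "stage3_block0", "importance": 0.2, "latency_ms": 0.5},
    {"name": "stage4_attn0",  "importance": 0.8, "latency_ms": 0.6},
]
kept, total = slim_supernet(blocks, latency_budget_ms=1.4)
```

The point of optimizing against measured latency rather than MACs or parameter counts is that two blocks with identical MACs can run at very different speeds on real hardware (e.g. reshapes and LayerNorm are disproportionately slow on mobile).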
The proposed EfficientFormer comprises a patch embedding and a stack of meta transformer blocks, where each block contains an unspecified token mixer followed by a multilayer perceptron (MLP) block. The network has four stages, with an embedding operation between stages that projects the embedding dimension and downsamples the token length. EfficientFormer thus remains a fully transformer-based model that does not use MobileNet structures. The team also introduces a simple yet effective gradient-based search algorithm that obtains candidate networks to optimize EfficientFormer’s inference speed.
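A minimal NumPy sketch of one such meta transformer block is shown below. The global-mean token mixer, layer shapes, and expansion ratio here are illustrative assumptions chosen for brevity, not the paper's actual operators; the structure it demonstrates is the generic pattern of a token-mixing sub-block and a channel MLP, each wrapped in a residual connection.

```python
import numpy as np

def pooling_mixer(x):
    """Trivial token mixer: pooling minus identity, so that the residual
    x + pooling_mixer(x) yields the pooled tokens (a PoolFormer-style
    formulation; a real model would use local pooling or attention)."""
    return x.mean(axis=0, keepdims=True) - x

def mlp(x, w1, w2):
    """Two-layer MLP applied independently to each token, with ReLU."""
    return np.maximum(x @ w1, 0.0) @ w2

def meta_block(x, w1, w2):
    """One meta transformer block: token mixing, then a channel MLP,
    each with a residual connection."""
    x = x + pooling_mixer(x)   # token-mixing sub-block
    x = x + mlp(x, w1, w2)     # channel MLP sub-block
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 32))     # 16 tokens, embedding dim 32
w1 = rng.standard_normal((32, 64)) * 0.02  # expansion ratio 2 (illustrative)
w2 = rng.standard_normal((64, 32)) * 0.02
out = meta_block(tokens, w1, w2)
# Token count and embedding dimension are preserved within a stage;
# only the between-stage embedding operations change them.
```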
In their empirical study, the team compared EfficientFormer with widely used CNN-based models and existing ViTs on image classification, object detection, and segmentation tasks. EfficientFormer outperformed existing transformer models and most competitive CNNs in the experiments, with the fastest variant, EfficientFormer-L1, achieving 79.2 percent top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on an iPhone 12; and the largest variant, EfficientFormer-L7, reaching 83.3 percent accuracy with only 7.0 ms latency.
The study shows that ViTs can reach MobileNet speeds on mobile devices while maintaining transformers’ high performance. The team’s future research will explore EfficientFormer’s potential on other resource-constrained hardware.
The EfficientFormer code and models are available on the project’s GitHub. The paper EfficientFormer: Vision Transformers at MobileNet Speed is on arXiv.
Author: Hecate He | Editor: Michael Sarazen