In recent years, sparsely-gated Mixture-of-Experts models (sparse MoEs) have garnered substantial attention for their ability to decouple model size from inference cost. This enables unprecedented scalability, leading to significant successes across various domains, including natural language processing, computer vision, and speech recognition.
Sparse MoEs offer the tantalizing prospect of augmenting model capabilities while mitigating computational costs. This makes them an enticing match for Transformers, the prevailing architecture for large-scale visual modeling, whose resource-intensive nature constrains deployment.
Pursuing this goal, an Apple research team has introduced sparse Mobile Vision MoEs (V-MoEs) in their paper “Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts.” These V-MoEs represent a streamlined, mobile-friendly Mixture-of-Experts architecture that efficiently downscales Vision Transformers (ViTs) while preserving strong model performance.
The team summarizes their main contributions as follows:
- We propose a simplified, mobile-friendly sparse MoE design in which a single router assigns entire images (rather than image patches) to the experts.
- We develop a simple yet robust training procedure in which expert imbalance is avoided by leveraging semantic super-classes to guide the router training.
- We empirically show that our proposed sparse MoE approach allows us to scale down ViT models by improving their performance vs. efficiency trade-off.
The core innovation of the proposed sparse Mobile V-MoE is its single per-image router, in contrast to the conventional per-patch routing. Per-patch routing typically ends up activating a larger total number of experts per image; the per-image router instead routes each entire image, reducing the number of experts activated per image. The overall architecture comprises ViT layers followed by MoE-ViT layers. Unlike standard ViT layers, each MoE-ViT layer contains a separate Multi-Layer Perceptron (MLP) per expert, while the remaining components of the layer are shared across all experts.
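The per-image routing idea can be illustrated with a minimal NumPy sketch. All names, shapes, and the mean-pooling choice below are illustrative assumptions, not the paper's implementation: a single router scores the pooled image representation, and every patch token of that image is then processed by the same top-k expert MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU; one weight pair (w1, w2) per expert.
    return np.maximum(x @ w1, 0.0) @ w2

def per_image_moe(tokens, router_w, experts, k=1):
    """Route an entire image (all of its patch tokens) to k experts.

    tokens: (num_patches, d) patch embeddings of ONE image.
    router_w: (d, num_experts) router weights applied to the pooled image.
    experts: list of (w1, w2) MLP weight pairs, one per expert.
    """
    pooled = tokens.mean(axis=0)        # one summary vector per image
    logits = pooled @ router_w          # one routing decision per image
    top = np.argsort(logits)[-k:]      # indices of the k selected experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    # Every patch token goes through the SAME k experts (per-image routing).
    out = sum(w * mlp(tokens, *experts[e]) for w, e in zip(weights, top))
    return out, top

# Toy sizes: 5 patch tokens, embedding dim 8, 4 experts.
d, hidden, num_experts = 8, 16, 4
tokens = rng.normal(size=(5, d))
router_w = rng.normal(size=(d, num_experts))
experts = [(rng.normal(size=(d, hidden)), rng.normal(size=(hidden, d)))
           for _ in range(num_experts)]

out, chosen = per_image_moe(tokens, router_w, experts, k=1)
```

With per-patch routing, each of the 5 tokens could pick a different expert, so up to 5 experts might be activated for this image; here only `k` experts run regardless of the number of patches.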
During training, the researchers first trained a dense baseline model and computed its confusion matrix on a held-out validation set split from the training data. The confusion matrix was turned into a confusion graph, to which a graph clustering algorithm was applied; the resulting clusters define a super-class division that guides router training. This strategy is designed to improve performance on frequently confused classes, since different MoE experts specialize in distinct semantic clusters of the data.
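The pipeline from confusion matrix to super-classes can be sketched as follows. This is a toy illustration under stated assumptions: the paper applies a graph clustering algorithm, whereas here thresholded connected components stand in as a minimal substitute, and the 4-class confusion matrix is invented for the example.

```python
import numpy as np

def super_classes(confusion, num_classes, threshold):
    """Group often-confused classes into super-classes (one per expert).

    confusion: (C, C) counts; entry [i, j] = images of class i predicted as j.
    Off-diagonal confusions are symmetrized into an undirected confusion
    graph; connected components above `threshold` form the super-classes.
    """
    off = confusion.astype(float).copy()
    np.fill_diagonal(off, 0.0)                  # ignore correct predictions
    adj = (off + off.T) >= threshold            # edges of the confusion graph
    # Union-find over classes to extract connected components.
    parent = list(range(num_classes))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(num_classes):
        for j in range(num_classes):
            if adj[i, j]:
                parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(num_classes)})
    return {r: [i for i in range(num_classes) if find(i) == r] for r in roots}

# Toy confusion matrix: classes 0/1 and 2/3 are mutually confused.
cm = np.array([[50, 9, 0, 1],
               [8, 50, 1, 0],
               [0, 1, 50, 7],
               [1, 0, 9, 50]])
groups = super_classes(cm, num_classes=4, threshold=5)  # {0,1} and {2,3}
```

Each resulting group is then assigned to one expert, and the router is trained to send an image to the expert owning its super-class, which sidesteps the expert-imbalance problem of learned load balancing.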
The research team applied this training approach in their experiments on ImageNet-1k. The results demonstrated that MoE-ViT offers an improved trade-off between performance and efficiency when compared to dense ViT. This underscores its potential in resource-constrained applications.
In summary, sparse Mobile Vision MoEs mark a notable advance for Vision Transformers, enabling them to be scaled down efficiently without sacrificing performance. The simple training strategy and the results on ImageNet-1k highlight the potential of MoE-ViT in resource-constrained scenarios.
The paper Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts is available on arXiv.
Author: Hecate He | Editor: Chain Zhang