In the drive to develop human-level AI systems, a common research goal is improving model generalization. In the visual domain, human perception involves processing modalities such as images, videos, and depth. Most computer vision models, however, approach these visual modalities in isolation; while such models have achieved impressive performance on their defined tasks, they lack the flexibility and generalization ability of the human vision system.
To address this issue, a Meta AI research team has proposed OMNIVORE, a single vision model that can operate across various visual modalities and perform cross-modal generalization. In evaluations, OMNIVORE achieves performance on par with or better than that of traditional modality-specific models of the same size.
The Meta AI researchers set out to train a universal model that could operate on three major visual modalities: images, videos, and single-view 3D (depth). To do this, they leveraged the powerful self-attention mechanism of transformer architectures, which can capably handle variable-sized inputs.
The proposed OMNIVORE approach first converts all visual modalities into a common format by representing them via patch embeddings, then employs a series of spatio-temporal attention operations to construct a unified representation of these different visual modalities. Although OMNIVORE can use any vision transformer (ViT) architecture to process the patch embeddings, the researchers selected Swin Transformer as their base model due to its impressive performance on image and video tasks and the suitability of its self-attention mechanism for spatio-temporal modelling across the patch embeddings.
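The conversion into a common format can be illustrated with a minimal sketch: images are treated as single-frame videos, each modality is cut into fixed-size spatial patches, and modality-specific linear projections map the patches into one shared token space. The patch size, embedding width, and projection matrices below are illustrative stand-ins, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, DIM = 4, 96  # 4x4 spatial patches and embedding width (illustrative)

def to_patches(x):
    """Flatten a (T, H, W, C) array into (num_patches, PATCH*PATCH*C).

    Images are handled as single-frame videos (T = 1), so every
    modality passes through the same code path.
    """
    t, h, w, c = x.shape
    x = x.reshape(t, h // PATCH, PATCH, w // PATCH, PATCH, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)        # (T, H/P, W/P, P, P, C)
    return x.reshape(-1, PATCH * PATCH * c)  # one row per patch

# Hypothetical modality-specific projections: one for RGB patches, one
# for depth patches, both feeding the same shared transformer trunk.
W_rgb = rng.normal(size=(PATCH * PATCH * 3, DIM))
W_depth = rng.normal(size=(PATCH * PATCH * 1, DIM))

def embed(x):
    """Map an image, video clip, or RGB-D frame to (num_tokens, DIM)."""
    if x.ndim == 3:                 # image (H, W, C): add a time axis
        x = x[None]
    rgb, depth = x[..., :3], x[..., 3:]
    tokens = to_patches(rgb) @ W_rgb
    if depth.shape[-1]:             # RGB-D input: add a depth-patch term
        tokens = tokens + to_patches(depth) @ W_depth
    return tokens

image = rng.normal(size=(8, 8, 3))     # single image
video = rng.normal(size=(2, 8, 8, 3))  # 2-frame clip
rgbd = rng.normal(size=(8, 8, 4))      # single-view 3D (RGB + depth)

print(embed(image).shape)  # (4, 96): a 2x2 grid of patch tokens
print(embed(video).shape)  # (8, 96): 4 tokens per frame
print(embed(rgbd).shape)   # (4, 96): same token space as images
```

Because all three modalities end up as sequences of tokens in one embedding space, a single transformer with spatio-temporal self-attention can process them without modality-specific branches downstream.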
The team trained OMNIVORE to minimize cross-entropy loss on the training datasets with minibatch stochastic gradient descent (SGD) via two related strategies: constructing mini-batches from each dataset (modality) separately, and constructing mini-batches that mix samples from all datasets.
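The two mini-batch strategies can be sketched as simple samplers: one yields batches drawn from a single modality at a time, the other pools all datasets and mixes modalities within each batch. The dataset contents and helper names here are hypothetical, and the SGD update and cross-entropy loss themselves are omitted.

```python
import random

random.seed(0)

# Toy datasets standing in for the image / video / depth training sets.
datasets = {
    "images": [f"img_{i}" for i in range(100)],
    "videos": [f"vid_{i}" for i in range(100)],
    "depth": [f"dep_{i}" for i in range(100)],
}

def separate_batches(datasets, batch_size):
    """Strategy 1: every mini-batch comes from a single modality;
    modalities are visited in round-robin order."""
    while True:
        for name, data in datasets.items():
            yield name, random.sample(data, batch_size)

def mixed_batches(datasets, batch_size):
    """Strategy 2: every mini-batch mixes samples from all datasets."""
    pooled = [(name, x) for name, data in datasets.items() for x in data]
    while True:
        yield random.sample(pooled, batch_size)

sep = separate_batches(datasets, 4)
name, batch = next(sep)
print(name, batch)  # all four samples share one modality

mix = mixed_batches(datasets, 4)
print(next(mix))    # samples may come from any modality
```

Separate batches keep each forward pass uniform in shape (useful when modalities differ in input size), while mixed batches expose every gradient step to all modalities at once; either way, the same shared model weights receive the updates.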
In their empirical evaluations, the team compared OMNIVORE to its modality-specific counterparts and state-of-the-art models across a wide range of vision tasks, such as fine-grained object recognition, action recognition, and single-view 3D scene classification and segmentation.
OMNIVORE demonstrated performance competitive with traditional modality-specific models, achieving top-1 accuracies of 86.0 percent on ImageNet, 84.1 percent on Kinetics, and 67.1 percent on SUN RGB-D; and after fine-tuning, it surpassed prior work on a variety of vision tasks. A key advantage of the OMNIVORE model is that its representations generalize well across visual modalities even though it was trained neither with corresponding data across modalities nor with any cross-modal consistency losses. The team hopes their study can motivate future research on jointly modelling visual modalities.
The paper Omnivore: A Single Model for Many Visual Modalities is on arXiv.
Author: Hecate He | Editor: Michael Sarazen