While humans naturally combine various sensory inputs to form a more complete understanding of their environments, even today’s most powerful pretrained AI models can only handle one or two input modalities. In the push to endow their models with more humanlike intelligence, machine learning researchers are increasingly interested in the development of effective multimodal learning techniques.
In the new paper i-Code: An Integrative and Composable Multimodal Learning Framework, a Microsoft Azure Cognitive Services Research team proposes i-Code, a self-supervised pretraining framework that enables the flexible integration of vision, speech, and language modalities. i-Code (the “i” stands for “integrative multimodal learning”) leverages novel attention mechanisms and loss functions to combine information from these disparate modalities and learn vector representations in a unified manner, outperforming state-of-the-art techniques on video understanding tasks and the GLUE NLP benchmark.
The team introduces two methods to alleviate the heavy training data requirements of a unified vision, language, and speech/audio pretraining procedure. First, they leverage large-scale dual-modality data as supplementary data sources for three-modality video tasks. Second, instead of training their model from scratch, they propose a fusing architecture that uses the contextual outputs of existing state-of-the-art single-modality encoders as building blocks.
To build a fusing module that effectively incorporates the outputs of the single-modality encoders and performs cross-modality understanding for final prediction, the researchers pretrain i-Code on dual- or triple-modality data with a variety of self-supervised objectives. Masked unit modeling converts all input signals to discrete tokens and trains the model to predict the correct tokens at the masked positions for each modality; contrastive learning trains the model to predict whether given signals come from the same triple (or pair) in the training data.
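The two objectives can be sketched roughly as below. This is a minimal illustration, not the paper's exact losses: the function names, tensor shapes, and the InfoNCE-style formulation of the contrastive term are assumptions for the sake of the example.

```python
import numpy as np

def masked_unit_modeling_loss(token_logits, target_tokens, mask):
    """Illustrative masked unit modeling: cross-entropy over the
    discrete-unit vocabulary, averaged over masked positions only.
    token_logits: (seq_len, vocab), target_tokens: (seq_len,), mask: (seq_len,) bool."""
    exp = np.exp(token_logits - token_logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(len(target_tokens)), target_tokens] + 1e-9)
    return nll[mask].mean()

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Illustrative InfoNCE-style contrastive term: matched rows (i, i)
    come from the same training triple/pair; rows (i, j != i) are negatives."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # pairwise similarities
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    log_probs = np.log(exp / exp.sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()                 # pull matched pairs together
```

Embeddings from the same triple yield a low contrastive loss, while mismatched ones yield a high loss, which is the signal the fusing module learns from.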
i-Code consists of four modules: the first three are single-modality encoders for vision, language, and speech, respectively, and the last is a modality fusion network that passes the encoded inputs from each modality through a linear projection layer and integrates them. This setup enables i-Code to process various input combinations: a single modality, any two modalities, or all three.
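The data flow through those four modules might look roughly like the following sketch. The attention-based fusion, the feature widths, and the sequence lengths are illustrative assumptions; the real model uses pretrained encoders and a far richer fusion network.

```python
import numpy as np

def project(features, w):
    """Linear projection of one encoder's output into the shared fusion dimension."""
    return features @ w

def fuse(seqs):
    """Toy single-head self-attention over the concatenated projected
    sequences, standing in for the modality fusion network (illustrative)."""
    x = np.concatenate(seqs, axis=0)                   # (total_len, d)
    scores = x @ x.T / np.sqrt(x.shape[1])             # cross-modality attention scores
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = exp / exp.sum(axis=1, keepdims=True)
    return attn @ x                                    # fused per-position representations

rng = np.random.default_rng(0)
d = 16  # shared fusion dimension (illustrative)
# Stand-ins for the outputs of the three single-modality encoders,
# each with its own feature width:
vision = rng.normal(size=(4, 32))
language = rng.normal(size=(6, 24))
speech = rng.normal(size=(5, 40))
seqs = [project(vision, rng.normal(size=(32, d))),
        project(language, rng.normal(size=(24, d))),
        project(speech, rng.normal(size=(40, d)))]
# Any subset of modalities can be fed in -- here all three:
fused = fuse(seqs)
```

Because `fuse` simply operates on whatever list of projected sequences it receives, dropping a modality (e.g. `fuse(seqs[:2])`) works without any architectural change, mirroring i-Code's flexibility with respect to input combinations.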
In their empirical study, the team compared i-Code to various baselines (e.g., MISA, MulT, and CLIP) on downstream tasks such as multimodal sentiment and emotion analysis, multimodal inference, and video question answering.
In the evaluations, i-Code set a new state-of-the-art on five video understanding tasks and the GLUE NLP benchmark, improving on previous multimodal models' performance by 11 percent and validating the potential of the proposed multimodal pretraining scheme.
The paper i-Code: An Integrative and Composable Multimodal Learning Framework is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.