Toward Multi-Modal Understanding and Multi-Modal Intelligence


Recently, three papers on arXiv drew my attention. All of them share a similar and related topic: learning multi-modal information. The aims of this article are

  1. To present the novelties of these three papers
  2. To compare their usage
  3. To foresee the next possible research direction for learning multi-modal information

1 Techniques and Novelties

The first paper, by Aytar et al. [1], presents an architecture incorporating three sub-networks: a sound, a text, and an image network. On top of these three networks sits another network that is shared across the three modalities (Fig 1). The authors claim that the network can perform alignment and learn a higher-level representation across these three modalities. After training, transfer of “concepts” between untrained modality pairs occurs as well.

Fig 1: The common shared representation layers are fully connected (blue), trained either by a transfer loss or by a ranking pair loss. The modality-specific representations (gray) are convolutional.

All three sub-networks use convolutional networks to extract the multi-modal information. In the sound sub-network, the sound is encoded as a spectrogram: the input is a 500*257 signal, i.e. 257 channels (frequency bins) over 500 time steps. Three one-dimensional convolutions with kernel sizes 11, 5, and 3 extract the sound information. In the text sub-network, the text is first pre-processed with a pre-trained word2vec embedding layer of dimension 300; a sentence is then mapped to a fixed-length 300*16 matrix, either by padding with zeros or by cropping longer sentences, before three one-dimensional convolution layers process it. The image sub-network follows the standard AlexNet [2], but the layers are only stacked up to the pool5 layer.
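To illustrate how the temporal dimension of the spectrogram shrinks through such a 1-D convolution stack, here is a minimal sketch of the output-length arithmetic. The strides are assumed values for illustration only; the article does not state them.

```python
def conv1d_out_len(length, kernel, stride=1, pad=0):
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * pad - kernel) // stride + 1

# Input spectrogram: 257 frequency bins over 500 time steps.
# Stride 2 per layer is a hypothetical choice, not from the paper.
length = 500
for kernel in (11, 5, 3):
    length = conv1d_out_len(length, kernel, stride=2)
print(length)  # temporal length left after the three convolutions
```

With these assumed strides the 500-step signal shrinks to 245, then 121, then 60 steps.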

Although the three individual CNNs extract sufficient statistical information at the higher levels, as CNNs usually do, the next target is to construct a representation that is readily shared across all three modalities. One important technique for extracting such a synchronous representation between the modalities is to obtain the representation alignment via optimization methods.

Specifically, the alignment is done as follows: assume that the target of learning is to align one modality x_i with a second modality y_i, and denote by f(x_i) the representation of modality x. The metric used to examine the alignment is related to cosine similarity. The paper proposes two approaches to representation alignment.

The first approach is called model transfer, which was originally used in transfer learning. It uses the KL-divergence:

L_transfer = Σ_i Delta_{KL}( g(x_i) ‖ f(y_i) )

Delta_{KL} is the KL-divergence, which is usually used to measure the distance between two probability distributions. Here, the objective is to optimize the student signal f(y), the representation of the second modality, so that it approaches the teacher signal g(x) of the first modality. The model transfer approach trains the upper representations of the student models for sound, vision, and text to predict class probabilities that approach the teacher signal. The authors used an ImageNet model as the teacher model.
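A minimal NumPy sketch of this transfer loss, assuming the teacher g and student f both output class-probability vectors (the function names here are illustrative, not from the paper):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Delta_KL(p || q) between two discrete probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def transfer_loss(teacher_probs, student_probs):
    """Sum of KL terms: each student f(y_i) is pulled toward its teacher g(x_i)."""
    return sum(kl_divergence(g, f) for g, f in zip(teacher_probs, student_probs))

# Toy check: a student close to its teacher incurs a smaller loss.
teacher = [[0.7, 0.2, 0.1]]
close   = [[0.6, 0.3, 0.1]]
far     = [[0.1, 0.2, 0.7]]
print(transfer_loss(teacher, close) < transfer_loss(teacher, far))  # True
```

Minimizing this quantity over the student's parameters (with the teacher frozen) is the standard model-transfer setup.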

The second approach is called ranking. Unlike the model transfer loss, which only aligns the upper representation (class probabilities), the ranking loss aligns the hidden representations of the different sub-nets and makes them discriminative.

L_rank = Σ_i Σ_j max(0, Delta − Psi(f(x_i), g(y_i)) + Psi(f(x_i), g(y_j)))

where Delta is a margin hyper-parameter, Psi is a similarity function (e.g. cosine similarity), and j iterates over all negative samples. The loss function therefore pushes paired samples together while moving mismatched pairs apart. It is applied to the last three hidden representations of the network to pair modalities one by one:
vision → text, text → vision, vision → sound, and sound → vision
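The ranking objective above can be sketched as a generic margin ranking loss over cosine similarity (variable names are illustrative, not from the paper):

```python
import numpy as np

def cosine(a, b):
    """Psi: cosine similarity between two representation vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranking_loss(anchor, positive, negatives, margin=0.1):
    """Sum over negatives j of max(0, Delta - Psi(x_i, y_i) + Psi(x_i, y_j))."""
    pos = cosine(anchor, positive)
    return sum(max(0.0, margin - pos + cosine(anchor, neg)) for neg in negatives)

# A well-aligned pair with a dissimilar negative incurs zero loss:
print(ranking_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))  # 0.0
```

When the positive pair is less similar than a negative, the hinge activates and gradients push the paired representations back together.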

In the alignment, the paper examines pairs of images and sound (from videos) and pairs of images and text (from caption datasets). Since there are not enough training samples of sound/text pairs, the authors do not train on those pairs. Nevertheless, the experiments show interesting alignment, using vision as a bridge to enable transfer between sound and text. The experimental details can be found in the original paper.

The most interesting part is the visualization. As shown in Fig. 2, there are two interesting properties. Firstly, although the semantics are not explicitly trained on the hidden layers, they emerge as pre-symbolic concepts in an unsupervised way. Secondly, many of the hidden units are able to detect objects that appear independently in different modalities, so the alignment probably happens at the object level.

Fig 2. Hidden unit visualization.

The second paper, by Kaiser et al. [3], builds a unified model for multi-modal information processing. We have witnessed many different deep-learning architectures accomplish different tasks, such as speech recognition, image classification, and translation. In contrast to previous deep multi-modal extraction work, which extracts a common shared representation hierarchically, this work (thanks to Google's computational resources) attempts to solve several problems simultaneously, some of which involve very large datasets. Like the first paper, it learns from different modalities, but its main focus is to solve multiple machine learning tasks with a single architecture; a common feature representation is not its focus.

Fig 3. The MultiModel architecture by Kaiser et al. [3].

The multi-model architecture incorporates a few basic structures:

  1. Convolution layers: allow the model to detect local patterns and generalize across space.
  2. Attention layers: allow the model to focus on specific elements to improve its performance.
  3. Sparsely-gated mixture-of-experts: gives the model capacity without excessive computation cost.
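The third building block can be sketched in a few lines. This is a toy NumPy version of sparse top-k gating (the real layer in [5] adds gating noise and load-balancing terms; all shapes and names below are invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sparse_moe(x, gate_w, experts, k=2):
    """Evaluate only the top-k experts chosen by the gating network."""
    logits = gate_w @ x
    top = np.argsort(logits)[-k:]      # indices of the k highest gate scores
    gates = softmax(logits[top])       # renormalise over the chosen experts
    # Unchosen experts are never evaluated: extra capacity without full cost.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
x = rng.normal(size=4)
gate_w = rng.normal(size=(3, 4))                    # gate scoring 3 experts
experts = [rng.normal(size=(4, 4)) for _ in range(3)]
y = sparse_moe(x, gate_w, experts, k=2)
print(y.shape)  # (4,)
```

Because only k of the experts run per input, total parameter count can grow with the number of experts while per-example compute stays roughly constant.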

These basic structures combine to form the overall architecture, appearing in either the encoder or the decoder block as shown in Fig 3. Different color blocks represent different modality sub-nets, all of which share the same encoder-decoder architecture. We will not introduce these three building blocks in detail, because there are numerous reviews of them. The figure below depicts the architecture of each building block.
Fig 4. Technical architecture of the different blocks [4].

Regarding the modality sub-nets, the four modality nets use slightly different architectures, mainly based on word embeddings (text) and convolutional networks (audio and image) to extract the input features.

In the first experiment, a few benchmark tests (ImageNet and translation) were conducted, showing results comparable to the state of the art, although the authors note that the hyper-parameters were not carefully tuned. The second experiment compares joint training on 8 problems with training on a single problem. As shown in Tab. 2, not surprisingly, joint training performs better than single-problem training. Although the reason is not analyzed in detail in the paper, I suspect it is similar to the first article we introduced: higher-level concepts are somehow captured in the shared representation.


Tab. 1. Comparison of MultiModel with the state of the art on three tasks [4][5].
Tab. 2. Comparison of MultiModel trained jointly on 8 problems vs. on a single problem.

The third paper, by Hristov et al. [6], copes with raw multi-modal inputs in a robot manipulation task; the motor action can therefore be regarded as another modality. Furthermore, unlike in the previous two papers, the multi-modal inputs, as well as the output, are used as grounding sources for natural-language symbols. Specifically, as shown in Fig. 5, the symbolic features are grounded with color and size features, which are represented as normal distributions in the feature space (e).

Fig. 5. Overview of the full system pipeline of [6]. (a) Inputs to the system are language instructions, together with eye-tracking fixations and a top view from the camera. (b) Natural language is deterministically parsed into an abstract plan language. (c) Using the abstract plan, a set of labelled image patches is produced from the multi-modal data inputs. (d) Features are extracted from the image patches. (e) The parsed symbols are grounded with the features.

The action, on the other hand, is grounded via semantic parsing. Therefore, during end-to-end training, the actions are abstracted into the format (action target location) (b). One example of parsing can be found in Fig. 6.
The (action target location) triples are then grounded via probabilistic learning based on the model in Fig. 7, after the objects are represented by multi-modal data from the image as well as eye tracking (Algorithm 1). As for the previous two papers, we do not introduce the learning algorithm in detail; interested readers can refer to [6].
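The grounding of a symbol as a normal distribution over continuous features (step (e)) can be sketched as fitting a diagonal Gaussian to the observed feature vectors. The feature values and the "red" symbol below are invented for illustration; this is not the paper's actual algorithm, only the underlying idea:

```python
import numpy as np

def ground_symbol(samples):
    """Fit a diagonal Gaussian (mean, variance) over observed feature vectors."""
    f = np.asarray(samples, float)
    return f.mean(axis=0), f.var(axis=0) + 1e-6   # small floor keeps variance > 0

def log_likelihood(x, mean, var):
    """Log-density of a new observation under the grounded symbol."""
    x = np.asarray(x, float)
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

# Hypothetical "red" symbol grounded from (hue, size) observations:
mean, var = ground_symbol([[0.90, 0.3], [0.95, 0.4], [0.92, 0.35]])
# A red-ish patch scores higher than a blue-ish one:
print(log_likelihood([0.93, 0.35], mean, var) >
      log_likelihood([0.20, 0.35], mean, var))  # True
```

Classifying a new patch then amounts to asking which grounded symbol assigns it the highest likelihood.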

Fig 6. Example of parsing using dependency graphs.

Fig 7. The probabilistic model used for symbol grounding based on image and eye-tracking data [7].

2 Comparisons and Comments

As we discussed briefly in the last section, the three papers concentrate on different perspectives of multi-modal learning. The first paper provides novel algorithms and architectures to align different modalities. Its main finding is that the hierarchical network, which is somehow similar to our cognitive functions, extracts concepts, or more specifically the meanings of objects, at a higher level. This is further supported by its trans-modality test (sound/text). This is the most interesting and novel part, which somehow mimics the brain's cross-modality function. The work, to some extent, opens the possibility that such concepts can further serve as a form of meta-cognition.

Thanks to Google's massive computational power, the second paper invested a lot of effort in building and testing a unified network model that accomplishes various goals. It also reveals a few key neural structures which may be crucial in a universal AI machine: they are useful either for extracting features from different modalities, or for adaptively selecting the best internal strategy for the problem encountered. In the end, I would not be surprised if Google used the same architecture to power most of its AI-related services, and this architecture may become a milestone toward an AGI (Artificial General Intelligence) machine.

The third paper does not employ any fashionable deep learning algorithms. But I think it adds another important modality, motor action, which matters if our goal is to build a machine or robot that can move. Moreover, the symbol grounding problem will also be necessary to connect state-of-the-art image/sound recognition with GOFAI ("Good Old-Fashioned Artificial Intelligence"). Although grounding with hierarchical multi-modal networks is not new (e.g. [8]), state-of-the-art multi-modal architectures and new algorithms (e.g. [1]) will help us (and the machines) discover the meaning of the multi-modal world, and further compute with symbolic representations at a higher cognitive level.


[1] Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. “See, Hear, and Read: Deep Aligned Representations.” arXiv preprint arXiv:1706.00932 (2017).

[2] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

[3] Kaiser, Lukasz, et al. “One Model To Learn Them All.” arXiv preprint arXiv:1706.05137 (2017).

[4] Szegedy, Christian, et al. “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.” AAAI. 2017.

[5] Shazeer, Noam, et al. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.” arXiv preprint arXiv:1701.06538 (2017).

[6] Hristov, Yordan, et al. “Grounding Symbols in Multi-Modal Instructions.” arXiv preprint arXiv:1706.00355 (2017).

[7] Rothkopf, Constantin A., Dana H. Ballard, and Mary M. Hayhoe. “Task and context determine where you look.” Journal of vision 7.14 (2007): 16-16.

[8] Zhong, Junpei, et al. “Sensorimotor Input as a Language Generalisation Tool: A Neurorobotics Model for Generation and Generalisation of Noun-Verb Combinations with Sensorimotor Inputs.” arXiv preprint arXiv:1605.03261 (2016).

Author: Joni Chung | Reviewer: Haojin Yang
