The development of efficient multimodal models has become a hot research area in the deep learning community. Although the skills required for tackling vision or language tasks may seem disparate, a high degree of overlap exists in the semantic vector spaces of contrastively trained vision and language encoders. Visual question answering, for example, requires processing and understanding both visual and textual entailments to capture semantic meanings. Might it be possible to leverage such joint embeddings to enable models to learn shared high-level semantic representations for both text and image inputs?
In the new paper I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data, a research team from the Allen Institute for Artificial Intelligence proposes Cross Modal Transfer On Semantic Embeddings (CLOSE). The novel approach learns high-level skills from textual data alone, then uses these skills to complete vision tasks without additional visual training data.
The CLOSE pipeline first encodes inputs via a contrastive model’s image/text encoder. A fine-tuned pretrained language model then processes these vectors and any additional input text to generate the final output text.
To reduce the differences between the image and text vectors in practice, the team employs linear and structured noise adapters for text modification. Gaussian noise is also added to boost model performance.
In their empirical studies, the team used only data generated by a language model to train models on the tasks of image captioning, visual entailment, and visual question answering. The resulting models only slightly underperformed models trained directly on images, confirming the proposed approach’s high transfer capability between the two modalities.
Overall, this work validates the potential of leveraging the multimodal semantic vector space learned by contrastive models for efficient cross-modal generalization. The team believes the development of more powerful contrastive models spanning more modalities will enable CLOSE to yield models with even better generalization abilities.
The paper I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.