Transformer architectures have achieved impressive performance in vision-language (VL) representation learning when trained on text-annotated images or videos. It remains challenging, however, for transformers to learn VL representations without relying on text, i.e. using only low-level visual and acoustic inputs.
In the new paper TVLT: Textless Vision-Language Transformer, researchers from UNC Chapel Hill present the Textless Vision-Language Transformer (TVLT) for vision-and-language representation learning. TVLT uses only raw visual and audio inputs and performs comparably to its text-based counterparts, while requiring only one-third of the parameters and running inference 28x faster.
TVLT’s main architecture is a transformer comprising a 12-layer encoder and an 8-layer decoder. It takes as input a list of embeddings obtained directly from perception-level video and audio, and it includes no text-specific modules for automatic speech recognition (ASR) or tokenization.
The input embeddings are a combination of 1) modality embedding, 2) temporal/spatial embeddings for video, 3) temporal/frequency embeddings for audio, and 4) vision/audio patch embeddings.
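As a rough illustration of how these four components might fit together, the NumPy sketch below assumes they are simply summed element-wise for each patch. That is an assumption, as the text above says only that they are combined. The hidden size, the table shapes, the random values standing in for learned parameters, and the helper names `video_embeddings` and `audio_embeddings` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical hidden size

# Random tables standing in for learned parameters (assumption).
modality_emb = {"video": rng.normal(size=D), "audio": rng.normal(size=D)}
temporal_emb_v = rng.normal(size=(4, D))  # up to 4 video frames
spatial_emb_v = rng.normal(size=(9, D))   # 9 patches per frame
temporal_emb_a = rng.normal(size=(8, D))  # up to 8 spectrogram time steps
freq_emb_a = rng.normal(size=(4, D))      # 4 frequency bands


def video_embeddings(patches):
    """patches: (frames, patches_per_frame, D) vision patch embeddings."""
    t, s, _ = patches.shape
    out = (patches
           + modality_emb["video"]
           + temporal_emb_v[:t, None, :]   # same offset for every patch in a frame
           + spatial_emb_v[None, :s, :])   # same offset for a position in every frame
    return out.reshape(t * s, D)


def audio_embeddings(patches):
    """patches: (time_steps, freq_bands, D) audio patch embeddings."""
    t, f, _ = patches.shape
    out = (patches
           + modality_emb["audio"]
           + temporal_emb_a[:t, None, :]
           + freq_emb_a[None, :f, :])
    return out.reshape(t * f, D)


video = video_embeddings(rng.normal(size=(2, 9, D)))
audio = audio_embeddings(rng.normal(size=(8, 4, D)))

# The encoder consumes one flat list of tokens from both modalities.
encoder_input = np.concatenate([video, audio], axis=0)
print(encoder_input.shape)  # (50, 16)
```

The modality embedding lets the encoder tell video tokens from audio tokens once both have been flattened into a single sequence.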
TVLT is pretrained with two objectives: vision-audio matching (VAM) and masked autoencoding (MAE). VAM is used to learn global cross-modal representations: a linear layer with sigmoid activation is applied to the encoder output to obtain a matching probability, and the binary cross-entropy loss is then computed on that probability.
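The matching head described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the encoder output has already been reduced to one vector per video-audio pair, and the function name `vam_loss`, the shapes, and the random inputs are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def vam_loss(pooled, weight, bias, labels):
    """Matching probability and binary cross-entropy for video-audio pairs.

    pooled: (batch, D) one encoder output vector per pair (assumed pooling)
    weight: (D,) and bias: scalar, together forming the linear layer
    labels: (batch,) 1.0 for matched pairs, 0.0 for mismatched pairs
    """
    probs = sigmoid(pooled @ weight + bias)
    eps = 1e-7  # keeps log() finite when a probability saturates
    clipped = np.clip(probs, eps, 1.0 - eps)
    bce = -(labels * np.log(clipped) + (1.0 - labels) * np.log(1.0 - clipped))
    return probs, bce.mean()


rng = np.random.default_rng(0)
pooled = rng.normal(size=(4, 8))        # stand-in encoder outputs
weight, bias = rng.normal(size=8), 0.0  # stand-in head parameters
labels = np.array([1.0, 0.0, 1.0, 0.0])

probs, loss = vam_loss(pooled, weight, bias, labels)
```

With an untrained head that outputs a logit of zero for every pair, each probability is 0.5 and the loss equals ln 2, which is a convenient sanity check for a binary matching objective.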
MAE is used to improve unimodal representations by masking random patches of the visual frames and the audio spectrogram and reconstructing the missing inputs. Rather than decoding jointly, TVLT slices the encoder output into its audio and video parts and feeds them to the decoder independently, which reduces compute cost and boosts finetuning performance.
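The masking and the separate decoding step can be sketched as follows. This is a toy illustration under stated assumptions: the 0.75 masking ratio, the token counts, the zero mask token, and the identity functions `toy_encoder` and `toy_decoder` are all hypothetical placeholders, and the token vectors stand in for the raw frame and spectrogram patches that would be the real reconstruction targets.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16             # hypothetical hidden size
MASK_RATIO = 0.75  # assumed masking ratio, for illustration only


def visible_indices(num_tokens, mask_ratio):
    """Sorted indices of the tokens that stay visible to the encoder."""
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    return np.sort(rng.permutation(num_tokens)[:num_keep])


def toy_encoder(tokens):
    return tokens  # identity placeholder for the joint encoder


def toy_decoder(tokens):
    return tokens  # identity placeholder for the decoder


def with_mask_tokens(encoded, keep_idx, num_tokens, mask_token):
    """Restore the full sequence length, filling masked slots with mask_token."""
    full = np.tile(mask_token, (num_tokens, 1))
    full[keep_idx] = encoded
    return full


def masked_mse(pred, target, keep_idx):
    """Mean squared error over the masked positions only."""
    masked = np.ones(len(target), dtype=bool)
    masked[keep_idx] = False
    return np.mean((pred[masked] - target[masked]) ** 2)


video_tokens = rng.normal(size=(20, D))
audio_tokens = rng.normal(size=(32, D))
mask_token = np.zeros(D)

keep_v = visible_indices(len(video_tokens), MASK_RATIO)
keep_a = visible_indices(len(audio_tokens), MASK_RATIO)

# The encoder sees the visible tokens of both modalities jointly.
encoded = toy_encoder(np.concatenate([video_tokens[keep_v], audio_tokens[keep_a]]))

# Slice the encoder output per modality and decode each part on its own.
enc_v, enc_a = encoded[:len(keep_v)], encoded[len(keep_v):]
dec_v = toy_decoder(with_mask_tokens(enc_v, keep_v, len(video_tokens), mask_token))
dec_a = toy_decoder(with_mask_tokens(enc_a, keep_a, len(audio_tokens), mask_token))

loss = masked_mse(dec_v, video_tokens, keep_v) + masked_mse(dec_a, audio_tokens, keep_a)
```

One plausible reason separate decoding is cheaper: self-attention cost grows roughly quadratically with sequence length, so two passes over shorter sequences cost less than one pass over their concatenation.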
In their empirical study, the team compared TVLT with text-based counterparts on audio-to-video retrieval, video-based multimodal sentiment analysis, and visual question-answering benchmarks.
In the experiments, TVLT achieved performance competitive with state-of-the-art audio-based vision-and-language models on visual question answering, image retrieval, video retrieval and multimodal sentiment analysis. Moreover, it required only one-third of the parameters of the text-based methods, and its inference was 28x faster.
Overall, this paper demonstrates TVLT's strong performance and supports the possibility of learning compact and efficient vision-and-language representations from low-level visual and audio signals, without the need for traditional but computationally expensive text modelling.
Author: Hecate He | Editor: Michael Sarazen