UNC Chapel Hill’s Textless Vision-Language Transformer: Comparable Performance to Text-Based Approaches but 28x Faster

In the new paper TVLT: Textless Vision-Language Transformer, researchers from UNC Chapel Hill present the Textless Vision-Language Transformer (TVLT) for vision-and-language representation learning. TVLT uses only raw visual and audio inputs and performs comparably to its text-based counterparts but requires only 1/3 the parameters and achieves 28x faster inference speeds.

Transformer architectures have achieved impressive performance in vision-language (VL) representation learning when trained on text-annotated images or videos. It remains challenging, however, for transformers to learn VL representations without relying on text, i.e. using only low-level visual and acoustic inputs.

TVLT’s main architecture is a transformer with a 12-layer encoder and an 8-layer decoder. Its input is a list of embeddings obtained directly from perception-level video and audio, and it includes no text-specific modules such as automatic speech recognition (ASR) or tokenization.

The input embeddings are a combination of 1) modality embedding, 2) temporal/spatial embeddings for video, 3) temporal/frequency embeddings for audio, and 4) vision/audio patch embeddings.
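As a rough illustration of how these four components could be combined, the sketch below assumes the combination is an element-wise sum and uses random vectors in place of learned embedding tables. The names `DIM`, `make_table`, `embed_video_patch` and `embed_audio_patch`, along with all table sizes, are hypothetical and do not come from the TVLT codebase.

```python
import random

DIM = 4  # hypothetical embedding width, kept tiny for readability


def make_table(rows, rng):
    """Random vectors standing in for a learned embedding table."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(rows)]


def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


rng = random.Random(0)
tables = {
    "modality": make_table(2, rng),     # row 0 = video, row 1 = audio
    "video_time": make_table(4, rng),   # frame index (assumed 4 frames)
    "video_space": make_table(9, rng),  # patch position within a frame
    "audio_time": make_table(8, rng),   # spectrogram time index
    "audio_freq": make_table(8, rng),   # spectrogram frequency index
}


def embed_video_patch(patch_vec, frame, position):
    """Patch embedding + modality + temporal + spatial embeddings."""
    return add_vectors(patch_vec, tables["modality"][0],
                       tables["video_time"][frame],
                       tables["video_space"][position])


def embed_audio_patch(patch_vec, time_idx, freq_idx):
    """Patch embedding + modality + temporal + frequency embeddings."""
    return add_vectors(patch_vec, tables["modality"][1],
                       tables["audio_time"][time_idx],
                       tables["audio_freq"][freq_idx])


# A zero patch vector makes the contribution of the other embeddings easy to see.
video_token = embed_video_patch([0.0] * DIM, frame=1, position=3)
audio_token = embed_audio_patch([0.0] * DIM, time_idx=2, freq_idx=5)
```

Because every component has the same width, video and audio tokens end up in one shared space and can be concatenated into a single input sequence for the encoder.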

TVLT is pretrained with two objectives: vision-audio matching (VAM) and masked autoencoding (MAE). VAM is used to learn global cross-modal representations: a linear layer with sigmoid activation is applied to the encoder output to obtain a matching probability, and a binary cross-entropy loss is then computed on that probability.
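The VAM head described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the encoder output has already been reduced to one pooled vector per video-audio pair, and the names `vam_probability`, `binary_cross_entropy`, `pooled`, `weights` and `bias` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def vam_probability(pooled, weights, bias):
    """Linear layer followed by sigmoid: probability that video and audio match."""
    logit = sum(w * h for w, h in zip(weights, pooled)) + bias
    return sigmoid(logit)


def binary_cross_entropy(prob, label, eps=1e-12):
    """BCE loss for one example; label is 1 for a matched pair, 0 otherwise."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# Hypothetical pooled encoder output and head parameters.
pooled = [0.2, -0.4, 0.9]
weights = [0.5, 0.1, -0.3]
bias = 0.05

prob = vam_probability(pooled, weights, bias)
loss_if_matched = binary_cross_entropy(prob, label=1)
loss_if_mismatched = binary_cross_entropy(prob, label=0)
```

During pretraining, mismatched pairs would typically be built by pairing a video with audio from a different clip, so the head sees both labels.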

MAE is used to improve unimodal representations by masking random patches of the visual frames and the audio spectrogram and reconstructing the missing inputs. After encoding, TVLT slices the audio and video parts of the encoder output and feeds them to the decoder independently rather than jointly, which saves compute and boosts finetuning performance.
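The masking and slicing steps can be sketched as follows. This is a simplified illustration under stated assumptions: the mask ratio of 0.75 and the patch counts are chosen for the example, the encoder is replaced by a pass-through stub, and the helper names `random_mask`, `encode` and `split_by_modality` are hypothetical.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Randomly choose patches to hide; returns (visible, masked) index lists."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def encode(tokens):
    """Placeholder for the joint encoder: returns the tokens unchanged."""
    return list(tokens)


def split_by_modality(encoded, num_video_tokens):
    """Slice the joint encoder output back into its video and audio parts."""
    return encoded[:num_video_tokens], encoded[num_video_tokens:]


rng = random.Random(0)
video_patches = [("video", i) for i in range(16)]  # stand-ins for patch embeddings
audio_patches = [("audio", i) for i in range(12)]

# Assumed mask ratio of 0.75 for both modalities.
video_visible, video_masked = random_mask(len(video_patches), 0.75, rng)
audio_visible, audio_masked = random_mask(len(audio_patches), 0.75, rng)

# Only the visible patches of both modalities are encoded, as one joint sequence.
joint_input = ([video_patches[i] for i in video_visible]
               + [audio_patches[i] for i in audio_visible])
encoded = encode(joint_input)

# The output is sliced so each modality can be passed to the decoder on its own.
video_part, audio_part = split_by_modality(encoded, len(video_visible))
```

Decoding the two slices separately means the decoder attends over two shorter sequences instead of one long one, which is consistent with the compute savings described above.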

In their empirical study, the team compared TVLT with text-based counterparts on audio-to-video retrieval, video-based multimodal sentiment analysis, and visual question-answering benchmarks.

In the experiments, TVLT achieved performance competitive with its text-based counterparts on visual question answering, image retrieval, video retrieval and multimodal sentiment analysis. Moreover, it required only 1/3 of the parameters, and its inference was 28x faster than that of the text-based methods.

Overall, this paper demonstrates TVLT’s strong performance and supports the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals, without the need for traditional but computationally expensive text modelling.

The code and checkpoints are available on the project’s GitHub. The paper TVLT: Textless Vision-Language Transformer is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
