In recent years, large language models (LLMs) trained on massive corpora have driven rapid progress in natural language processing and powered many successful real-world applications. Broadly speaking, LLMs are used in two main settings: a text2vec (text-to-vector) setting for natural language understanding tasks, and a text2text setting for generating output text from an input text in tasks such as machine translation.
In the new paper Vec2text With Round-Trip Translations, a Google Brain research team explores LLMs’ capabilities for generating arbitrary natural language text from inputs of fixed-size vectors — a vec2text setting — and proposes a simple data augmentation approach based on round-trip translations to improve vec2text model performance.
The team summarizes their work’s main contributions as follows:
- We define the vec2text setting and propose four properties that such a model should possess: universality, fluency, semantic structure, and diversity.
- We further derive several quantitative and qualitative analyses to assess a vec2text model in these dimensions.
- We implement and train a T5-based autoencoder model on sentences extracted from the massive C4 dataset (Raffel et al., 2019) and confirm commonly held beliefs that the decoder of such models has a poorly structured input space.
- We propose a novel approach that uses round trip translations (RTT) to obtain a nicely behaved vec2text model.
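The bottleneck at the heart of such an auto-encoder can be illustrated with a toy sketch: variable-length sequences of token embeddings are collapsed into one fixed-size vector, which the decoder then conditions on. This is a minimal numpy sketch, assuming mean-pooling into the bottleneck and tiling for the decoder (both are illustrative choices, not the paper's exact architecture):

```python
import numpy as np

def bottleneck_encode(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a variable-length sequence of token embeddings
    (seq_len, d_model) into one fixed-size vector via mean-pooling.
    (Illustrative choice; the paper's bottleneck may differ.)"""
    return token_embeddings.mean(axis=0)

def expand_for_decoder(z: np.ndarray, seq_len: int) -> np.ndarray:
    """Tile the fixed-size vector so a seq2seq decoder can
    cross-attend to it as a length-seq_len 'memory'."""
    return np.tile(z, (seq_len, 1))

# Two sentences of different lengths map to same-size vectors,
# which is what makes the decoder's input space a fixed vector space.
s1 = np.random.randn(7, 512)   # 7 tokens, d_model = 512
s2 = np.random.randn(12, 512)  # 12 tokens
z1, z2 = bottleneck_encode(s1), bottleneck_encode(s2)
```

Whatever the pooling choice, the key point is that the decoder's entire input is this one fixed-size vector, so the structure of that vector space determines how controllable generation is.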
The proposed vec2text models aim to control LLM outputs semantically through continuous vector spaces. The team believes universal vec2text models should be able to generate arbitrary texts for a wide variety of tasks, and the paper defines four essential qualities for such models:
- Universality: For each English sentence, there should be an embedding in the control space C that generates sentences that have the same meaning as the initial sentence with high probability. Intuitively, this property guarantees that any meaning can be expressed through a suitable choice of input to the vec2text model.
- Diversity: Decoding from each embedding in the compact space C should have high entropy. Intuitively, this encourages the vec2text model to express any meaning in a variety of different ways.
- Fluency: Decoding from each embedding in the compact space C should lead to valid English sentences. Intuitively, this guarantees that there are no “holes” in C where the vec2text model produces sentences that are not perceived as natural by humans.
- Semantic structure: Two embeddings in the compact space C that are close to each other should lead to similar distributions over sentences. Intuitively, this guarantees that the meaning of the vec2text model's output changes gradually, not abruptly, as its input changes.
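The semantic-structure property can be probed by walking a straight line between two embeddings in C and decoding at each intermediate point: if nearby points yield similar sentence distributions, meaning should drift smoothly along the path. A minimal numpy sketch of such an interpolation probe (decoding itself is omitted; the dimensionality 512 is an arbitrary assumption):

```python
import numpy as np

def interpolate(z_a: np.ndarray, z_b: np.ndarray, steps: int = 5):
    """Evenly spaced points on the straight line between two
    embeddings in the control space C."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1 - t) * z_a + t * z_b for t in ts]

z_a = np.zeros(512)  # embedding of sentence A (placeholder)
z_b = np.ones(512)   # embedding of sentence B (placeholder)
path = interpolate(z_a, z_b, steps=5)

# Consecutive points are equidistant and close; a model with good
# semantic structure should decode them into gradually changing
# sentences rather than jumping between unrelated meanings.
gaps = [np.linalg.norm(p - q) for p, q in zip(path, path[1:])]
```

In practice each point on the path would be fed to the decoder, and the outputs inspected for smooth semantic drift.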
To build their universal vec2text model, the team trained a T5-based universal sentence auto-encoder using round-trip translations (RTT) — a simple, automatic and scalable data augmentation approach that creates datasets containing both sentences and their paraphrases. In RTT, a sentence in a source language is translated into a second "pivot" language, and that translation is then translated back into the source language, typically yielding a rephrased version of the original.
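The RTT pipeline itself is mechanically simple. The sketch below mocks the two translation steps with tiny lookup tables; the table entries and the `round_trip` helper are hypothetical stand-ins for a real machine-translation system, used only to show the shape of the data this augmentation produces:

```python
# Round-trip translation (RTT): source -> pivot -> source, where the
# return trip usually rephrases the original. The dictionaries below
# are mock translators standing in for a real MT model.
EN_TO_DE = {"the movie was great": "der Film war großartig"}
DE_TO_EN = {"der Film war großartig": "the film was wonderful"}

def round_trip(sentence: str) -> str:
    pivot = EN_TO_DE[sentence]   # translate source -> pivot language
    return DE_TO_EN[pivot]       # translate pivot -> back to source

# Each round trip yields a (sentence, paraphrase) pair that can be
# used to train the vec2text auto-encoder.
pair = ("the movie was great", round_trip("the movie was great"))
```

Because both translation directions are fully automatic, RTT scales to paraphrase datasets of arbitrary size without human annotation.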
In their empirical study, the team evaluated the proposed universal vec2text model with regard to the four desirable properties of universality, fluency, semantic structure, and diversity. The results show that the universal vec2text model neatly satisfies these properties while significantly outperforming both standard and denoising auto-encoders.
The paper Vec2text With Round-Trip Translations is on arXiv.
Author: Hecate He | Editor: Michael Sarazen