Transformer architectures have advanced the state of the art on many natural language processing (NLP) tasks. These performance improvements, however, have relied on extensive scaling that has pushed model sizes past one hundred billion parameters, resulting in massive memory and computational burdens.
In the new paper Improving Language Models by Retrieving From Trillions of Tokens, a DeepMind research team endeavors to enhance the performance of transformer language models without significantly increasing their size or training on additional data. They propose RETRO (Retrieval-Enhanced Transformer), an enhanced auto-regressive language model that conditions on document chunks retrieved from a large corpus and achieves performance comparable to GPT-3 and Jurassic-1 models on the Pile dataset while using 25× fewer parameters.
The team summarizes their study’s contributions as:
- We introduce RETRO, a retrieval-enhanced autoregressive language model. We use a chunked cross-attention module to incorporate the retrieved text, with time complexity linear in the amount of retrieved data. We show that retrieving based on a pretrained frozen BERT model works at scale, removing the need for training and updating a retriever network.
- We show that our method scales well with model size and database size: RETRO provides a constant gain for models ranging from 150M to 7B parameters, and can be improved at evaluation time by increasing the database size and the number of retrieved neighbours. Our largest model obtains state-of-the-art results on a range of downstream evaluation datasets including Wikitext103 (Merity et al., 2017) and the Pile (Gao et al., 2020). We show that RETRO can be fine-tuned to achieve competitive performance on downstream tasks such as question answering.
- We propose an evaluation aware of proximity of test documents with the training set, addressing the problem of test set leakage (Lee et al., 2021). This is relevant for all language models, and especially for retrieval-enhanced models since they have direct access to the training dataset during evaluation. Using this methodology, we show that the performance of RETRO comes from both explicit neighbour copying and general knowledge extraction.
The team first constructs a key-value database, where the keys are frozen BERT embeddings and the values store raw chunks of text tokens. Because the embeddings are frozen, the stored key-value pairs never have to be re-computed over the entire database during training. Each training sequence is then split into chunks, and each chunk is augmented with its 𝑘 nearest neighbours retrieved from the database. Finally, RETRO’s encoder-decoder architecture integrates these retrieved chunks into the model’s predictions.
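The retrieval pipeline described above can be illustrated with a toy sketch. This is not the paper's implementation: `toy_embed` is a hypothetical stand-in for the frozen BERT embedder (a hashed bag of words), the brute-force similarity search replaces whatever nearest-neighbour index is used at scale, and `CHUNK_SIZE`, `EMBED_DIM`, and the class and function names are all assumptions chosen for illustration.

```python
import zlib
import numpy as np

CHUNK_SIZE = 4    # illustrative value, kept tiny for the toy example
EMBED_DIM = 256   # illustrative embedding width

def split_into_chunks(tokens, chunk_size=CHUNK_SIZE):
    """Split a token list into consecutive fixed-size chunks (last one may be shorter)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def toy_embed(chunk, dim=EMBED_DIM):
    """Hypothetical stand-in for a frozen embedder: hashed bag of words, L2-normalised.

    The function has no trainable state, so keys computed once stay valid,
    which is the property the frozen BERT embeddings provide in the article.
    """
    vec = np.zeros(dim)
    for token in chunk:
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

class ChunkDatabase:
    """Key-value store: keys are chunk embeddings, values are the raw token chunks."""

    def __init__(self, corpus_tokens):
        self.values = split_into_chunks(corpus_tokens)
        self.keys = np.stack([toy_embed(chunk) for chunk in self.values])

    def retrieve(self, query_chunk, k=2):
        """Return the k stored chunks whose keys are most similar to the query."""
        scores = self.keys @ toy_embed(query_chunk)  # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]
        return [self.values[i] for i in top]

def augment_sequence(tokens, database, k=2):
    """Pair every chunk of a training sequence with its k retrieved neighbours."""
    return [(chunk, database.retrieve(chunk, k)) for chunk in split_into_chunks(tokens)]

corpus = "the cat sat on the mat while the dog slept by the door".split()
database = ChunkDatabase(corpus)
augmented = augment_sequence("a cat sat on a mat near the door".split(), database, k=2)
```

Each element of `augmented` is a `(chunk, neighbours)` pair, which is the form of input a retrieval-conditioned model would consume. Since the keys depend only on the fixed embedder, the database can be built once and reused throughout training.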
The proposed RETRO model incorporates the retrieved data through a chunked cross-attention mechanism. The retrieved tokens are first fed into an encoder transformer to compute the encoded neighbours set; the transformer decoder then interleaves RETRO blocks with standard transformer blocks. This design enables the RETRO architecture to model arbitrary text sequences whilst retrieving from databases with trillions of tokens, with time complexity that is linear in the amount of retrieved data.
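To see why attending chunk by chunk keeps the cost linear in the amount of retrieved data, consider the simplified NumPy sketch below. It is an assumption-laden illustration rather than the paper's module: it uses a single head with no learned projections, lets each chunk attend only to the encoded neighbours retrieved for that same chunk, and leaves out the offset between chunks and neighbours that the paper applies to keep the model autoregressive. The function name and argument shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_cross_attention(hidden, neighbours, chunk_size):
    """Simplified chunk-local cross-attention.

    hidden:     (seq_len, d) decoder activations; seq_len must be a multiple of chunk_size
    neighbours: (n_chunks, k, r, d) encoded neighbours, k neighbours of r tokens per chunk
    returns:    (seq_len, d) attended values
    """
    seq_len, d = hidden.shape
    n_chunks = seq_len // chunk_size
    if n_chunks * chunk_size != seq_len or neighbours.shape[0] != n_chunks:
        raise ValueError("sequence length and neighbour count must match the chunking")

    queries = hidden.reshape(n_chunks, chunk_size, d)
    memory = neighbours.reshape(n_chunks, -1, d)              # (n_chunks, k * r, d)

    # Each chunk scores only its own k * r retrieved tokens, so the score
    # tensor has n_chunks * chunk_size * k * r entries in total.
    scores = queries @ memory.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    attended = weights @ memory                               # (n_chunks, chunk_size, d)
    return attended.reshape(seq_len, d)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))             # 8 tokens, i.e. 2 chunks of 4
neighbours = rng.normal(size=(2, 3, 5, 16))   # 3 neighbours of 5 tokens per chunk
output = chunked_cross_attention(hidden, neighbours, chunk_size=4)
```

For a fixed sequence length, doubling either the number of neighbours or their length doubles the size of `scores`, so the attention cost grows linearly with the amount of retrieved data.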
The researchers evaluated RETRO on C4 (Raffel et al., 2020), Wikitext103 (Merity et al., 2017), Curation Corpus (Curation, 2020), Lambada (Paperno et al., 2016) and the Pile (Gao et al., 2020) datasets, and on a set of manually selected Wikipedia articles. They reported results on question answering on the Natural Questions dataset (Kwiatkowski et al., 2019) and on evaluation metrics with leakage filtering to better understand the source of the gains produced using their novel retrieval process.
The RETRO model attained performance comparable to GPT-3 and Jurassic-1 models on the Pile dataset while using 25× fewer parameters. On Wikitext103, RETRO outperformed previous models trained on large-scale datasets and was competitive on retrieval-intensive downstream tasks such as question answering. RETRO also outperformed the baseline models at all leakage levels.
The DeepMind team believes theirs is the first work to demonstrate the benefits of scaling the retrieval database to trillions of tokens for large parametric language models, and that semi-parametric approaches can provide a more efficient method than raw parameter scaling for enhancing language models.
The paper Improving Language Models by Retrieving From Trillions of Tokens is on arXiv.
Author: Hecate He | Editor: Michael Sarazen