In recent years, there has been a surge of interest in generative retrieval approaches, which represent a fresh paradigm aiming to transform traditional information retrieval methods. These approaches leverage the power of a single sequence-to-sequence Transformer model to encode and process an entire document corpus. However, up until now, generative retrieval approaches have been primarily confined to smaller document corpora, typically containing around 100k entries.
In a new paper How Does Generative Retrieval Scale to Millions of Passages?, a research team from Google Research and the University of Waterloo presents the first empirical study of generative retrieval across a range of corpus scales, going up to the entire MS MARCO passage ranking task with its 8.8M passages. The aim is to provide insights on scaling generative retrieval to millions of passages.
Generative retrieval casts information retrieval as a single sequence-to-sequence task in which a model maps queries directly to relevant document identifiers (docids). The approach is based on the Differentiable Search Index (DSI), which covers both indexing and retrieval. During training, the model learns to generate a docid from either the document content or a relevant query; at inference, it processes a query and outputs the retrieval results as a ranked list of identifiers.
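The two kinds of training pairs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Example` class, the `build_training_examples` helper and the toy corpus are hypothetical names made up for this sketch, not code from the paper or from DSI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    source: str  # text fed to the encoder
    target: str  # docid string the decoder learns to generate


def build_training_examples(corpus, labeled_queries):
    """Build (source, target) pairs for both training signals.

    corpus: dict mapping docid -> passage text (hypothetical toy format)
    labeled_queries: list of (query, relevant docid) tuples
    """
    examples = []
    # Document content -> docid
    for docid, text in corpus.items():
        examples.append(Example(source=text, target=docid))
    # Relevant query -> docid (queries pointing outside the corpus are skipped)
    for query, docid in labeled_queries:
        if docid in corpus:
            examples.append(Example(source=query, target=docid))
    return examples


# Toy data, invented for illustration only.
corpus = {
    "101": "Waterloo is a city in Ontario.",
    "202": "Transformers rely on attention.",
}
labeled_queries = [("where is waterloo", "101")]
examples = build_training_examples(corpus, labeled_queries)
```

Both kinds of pairs share the same target space, so a single sequence-to-sequence model can be trained on their union.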
In this work, the team considers different design choices for document representations and identifiers, and discusses two gaps: the gap between the index and retrieval tasks, and the coverage gap. They examine four kinds of document identifiers: unstructured atomic identifiers (Atomic IDs), naive string identifiers (Naive IDs), semantically structured identifiers (Semantic IDs), and 2D Semantic IDs. They also review three model components: the Prefix-Aware Weight-Adaptive Decoder (PAWA), constrained decoding, and consistency loss.
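To make constrained decoding over string identifiers more concrete, here is a minimal generic sketch: a prefix trie over the valid docids restricts which token may come next, so the decoder can only emit identifiers that exist in the corpus. It assumes character-level tokens and a stand-in scoring function (`toy_score`); all names and values are hypothetical, and this is not the paper's implementation.

```python
END = "<eos>"  # marks the end of a complete docid in the trie


def build_trie(docids):
    """Build a nested-dict prefix trie over docid strings."""
    root = {}
    for docid in docids:
        node = root
        for ch in docid:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def allowed_next(trie, prefix):
    """Return the tokens that may follow `prefix` (empty if prefix is invalid)."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    return sorted(node.keys())


def constrained_greedy_decode(trie, score_fn, max_len=16):
    """Greedy decoding restricted to trie paths.

    score_fn(prefix, token) stands in for the model's next-token score.
    If max_len is reached before END, the partial prefix is returned.
    """
    prefix = ""
    for _ in range(max_len):
        candidates = allowed_next(trie, prefix)
        if not candidates:
            break
        best = max(candidates, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return prefix
        prefix += best
    return prefix


# Toy setup: three valid docids and fixed, made-up token preferences.
trie = build_trie(["1203", "1250", "877"])
preferences = {"1": 0.9, "2": 0.8, "5": 0.7, "0": 0.6, "7": 0.3, "3": 0.2, "8": 0.1, END: 1.0}


def toy_score(prefix, token):
    return preferences.get(token, 0.0)


decoded = constrained_greedy_decode(trie, toy_score)
```

A ranked list of identifiers would need beam search rather than greedy decoding, but the same `allowed_next` mask would apply at every step.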
Lastly, the researchers explored the behavior of generative retrieval models on the MS MARCO passage ranking task, whose corpus contains 8.8M passages, and evaluated model sizes of up to 11B parameters. They summarize their main findings as follows:
- Of the methods considered, we find synthetic query generation to be the single most critical component as corpus size grows.
- As corpus size increases, discussion of compute cost is crucial.
- Increasing the model size is necessary for improved generative retrieval effectiveness.
Overall, this work provides insights into how generative retrieval scales to millions of passages and raises new questions for the field, such as how to properly leverage large language models and their scaling power to assist generative retrieval on large corpora. The researchers say they will explore these questions in future work.
The paper How Does Generative Retrieval Scale to Millions of Passages? is available on arXiv.
Author: Hecate He | Editor: Chain Zhang