Text embeddings are low-dimensional vector representations of arbitrary-length texts that play a crucial role in natural language processing tasks such as large-scale retrieval. While contrastive learning approaches can improve the quality of text embeddings by enhancing their sequence-level representations from text pairs, the resulting embeddings still struggle to match the performance of the popular BM25 baseline ranking function without further finetuning.
In the new paper Text Embeddings by Weakly-Supervised Contrastive Pre-training, a Microsoft research team introduces Embeddings from Bidirectional Encoder Representations (E5), a general-purpose text embedding model for tasks requiring a single-vector representation of texts and the first model to surpass the BM25 baseline on the BEIR retrieval benchmark under a zero-shot setting.
In their first step, the researchers mine the Internet to compile CCPairs (Colossal Clean Text Pairs), a huge, diverse and high-quality text-pair dataset for training general-purpose text embeddings. They employ a novel consistency-based filtering technique to further improve data quality, and end up with ∼270M text pairs for contrastive pretraining.
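To make the contrastive pretraining step more concrete, the following is a minimal sketch of an InfoNCE-style loss with in-batch negatives, a common objective for training on text pairs. It is not taken from the E5 code: the function names, the toy two-dimensional vectors, and the temperature value are all illustrative assumptions. Each query is pulled toward its own paired passage and pushed away from the other passages in the batch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss, assuming queries[i] is paired with passages[i].

    Every other passage in the batch serves as a negative for query i.
    The temperature value is an assumption chosen for illustration.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the correct passage.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): each query matches the passage at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]

aligned = info_nce_loss(queries, passages)
shuffled = info_nce_loss(queries, list(reversed(passages)))
print(aligned < shuffled)
```

In this toy batch, the correctly paired embeddings give a loss near zero, while reversing the passage order gives a much larger loss, which is the signal that training would minimise.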
After applying contrastive pretraining to the CCPairs, the team further refines the output embeddings and injects human knowledge by training their model on a small, high-quality labelled dataset compiled from the NLI (Natural Language Inference), MS-MARCO passage ranking, and Natural Questions datasets.
The resulting high-quality text embeddings demonstrate strong transferability across a wide range of tasks without any finetuning of the model parameters, validating the approach’s suitability for both zero-shot and finetuned settings.
In the team’s empirical experiments, E5 became the first model to beat the strong BM25 baseline under a zero-shot setting on the BEIR retrieval benchmark. In a finetuned setting on the MTEB benchmark, E5 outperformed the existing state-of-the-art embedding model, which has 40x more parameters.
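As a rough illustration of what single-vector retrieval looks like in practice, the sketch below ranks passages by cosine similarity to a query embedding. The embeddings here are hand-written toy vectors and `rank_passages` is a hypothetical helper; in a real zero-shot setup the vectors would come from an embedding model such as E5, which this sketch does not call.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, p), idx) for idx, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy embeddings (hypothetical): one vector per text, as in single-vector retrieval.
query_vec = [1.0, 0.0]
passage_vecs = [
    [0.0, 1.0],  # unrelated to the query
    [1.0, 0.1],  # closest to the query
    [0.5, 0.5],  # partially related
]

ranking = rank_passages(query_vec, passage_vecs)
print(ranking)
```

Because each text is reduced to one vector, the ranking needs only one similarity computation per passage, which is what makes this style of retrieval practical at large scale.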
Overall, the study shows that E5 text embeddings can be contrastively trained with only unlabelled text pairs, that the approach offers strong, off-the-shelf performance on tasks requiring single-vector text representations, and that it produces superior finetuned performance on downstream tasks.
The code is available on the project’s GitHub. The paper Text Embeddings by Weakly-Supervised Contrastive Pre-training is on arXiv.
Author: Hecate He | Editor: Michael Sarazen