Microsoft’s E5 Text Embedding Model Tops the MTEB Benchmark With 40x Fewer Parameters

In the new paper Text Embeddings by Weakly-Supervised Contrastive Pre-training, a Microsoft research team introduces E5 (EmbEddings from bidirEctional Encoder rEpresentations), a general-purpose text embedding model for any task requiring a single-vector representation of text, and the first model to surpass the strong BM25 baseline on the BEIR retrieval benchmark in a zero-shot setting.

Text embeddings are low-dimensional vector representations of arbitrary-length texts that play a crucial role in natural language processing tasks such as large-scale retrieval. While contrastive learning over text pairs can improve embedding quality by strengthening sequence-level representations, the resulting embeddings still struggle to match the performance of the popular BM25 ranking baseline without further finetuning.
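Concretely, contrastive pretraining of this kind typically optimizes an InfoNCE-style objective with in-batch negatives. The sketch below is a minimal illustration assuming L2-normalized encoder outputs and a placeholder temperature, not the paper's exact configuration:

```python
# Minimal sketch of an InfoNCE-style contrastive loss with in-batch
# negatives, as commonly used for text-pair pretraining. The temperature
# and embedding shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.01) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim) L2-normalized embeddings.
    The passage at index i is query i's positive; all other passages
    in the batch act as negatives."""
    logits = query_emb @ passage_emb.T / temperature  # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Example with random stand-in embeddings from a hypothetical encoder.
q = F.normalize(torch.randn(32, 768), dim=-1)
p = F.normalize(torch.randn(32, 768), dim=-1)
print(info_nce_loss(q, p))
```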

In their first step, the researchers mine the Internet to compile CCPairs (Colossal Clean Text Pairs), a huge, diverse and high-quality text-pair dataset for training general-purpose text embeddings. They employ a novel consistency-based filtering technique to further improve data quality, and end up with ∼270M text pairs for contrastive pretraining.
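While the paper's full filtering recipe has more moving parts, the sketch below captures the core idea of consistency-based filtering under stated assumptions: a model first trained on the noisy pairs re-scores each pair against a pool of random distractor passages, and a pair is kept only if its own passage ranks in the top k. Here score_fn, the distractor pool, and top_k are hypothetical placeholders:

```python
# Hedged sketch of consistency-based filtering. `score_fn` stands in for
# a relevance model already trained on the noisy pairs; the distractor
# pool size and top-k cutoff are illustrative, not the paper's settings.
import numpy as np

def consistency_filter(score_fn, pairs, distractor_pool, top_k=2):
    """Keep (query, positive) pairs the model itself finds consistent.

    score_fn(query, passages) -> np.ndarray of relevance scores.
    """
    kept = []
    for query, positive in pairs:
        candidates = [positive] + list(distractor_pool)
        scores = score_fn(query, candidates)
        # Rank of the positive passage (index 0) among all candidates.
        rank = int(np.argsort(-scores).tolist().index(0))
        if rank < top_k:
            kept.append((query, positive))
    return kept
```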

After contrastive pretraining on CCPairs, the team further refines the output embeddings and injects human knowledge by training the model on a small, high-quality labelled dataset compiled from the NLI (Natural Language Inference), MS-MARCO passage ranking, and Natural Questions datasets.
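A simplified sketch of this supervised stage appears below: the contrastive objective is reused, with each labelled query now scored against its positive, a mined hard negative, and the other in-batch passages. The batch layout and temperature are illustrative assumptions, and further refinements reported in the paper (such as distillation from a cross-encoder teacher) are omitted:

```python
# Simplified fine-tuning step: contrastive loss over batch positives plus
# mined hard negatives. Embedding shapes and temperature are illustrative;
# the paper's additional distillation terms are left out for brevity.
import torch
import torch.nn.functional as F

def finetune_contrastive_step(q_emb, pos_emb, hard_neg_emb,
                              temperature=0.01):
    """All inputs: (batch, dim) L2-normalized embeddings."""
    # Candidates: each query's positive, every other query's positive
    # (in-batch negatives), and all mined hard negatives.
    candidates = torch.cat([pos_emb, hard_neg_emb], dim=0)  # (2B, dim)
    logits = q_emb @ candidates.T / temperature             # (B, 2B)
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)
```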

The resulting high-quality text embeddings demonstrate strong transferability across a wide range of tasks without any finetuning of the model parameters, validating the approach’s suitability for both zero-shot and finetuned settings.
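As a usage illustration, the snippet below sketches zero-shot similarity scoring with a released E5 checkpoint. It assumes the Hugging Face model id intfloat/e5-base and the checkpoints' convention of prefixing inputs with "query: " and "passage: ", followed by mean pooling and L2 normalization; treat these identifiers as assumptions rather than canonical:

```python
# Illustrative zero-shot use of an E5 checkpoint; the model id and the
# "query: " / "passage: " prefix convention are assumptions taken from
# the released checkpoints, not prescribed by this article.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModel.from_pretrained("intfloat/e5-base")

texts = ["query: how do text embeddings help retrieval",
         "passage: Text embeddings map texts to dense vectors that can "
         "be compared with simple similarity functions."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (2, seq_len, dim)

# Mean-pool over non-padding tokens, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, dim=-1)

print((emb[0] @ emb[1]).item())                      # cosine similarity
```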

In the team’s empirical experiments, E5 became the first model to beat the strong BM25 baseline on the BEIR retrieval benchmark under a zero-shot setting. In the finetuned setting, E5 outperformed state-of-the-art embedding models with 40x more parameters on the MTEB benchmark.

Overall, the study shows that E5 text embeddings can be contrastively trained with only unlabelled text pairs, that the approach offers strong, off-the-shelf performance on tasks requiring single-vector text representations, and that it produces superior fine-tuned performance on downstream tasks.

The code is available on the project’s GitHub. The paper Text Embeddings by Weakly-Supervised Contrastive Pre-training is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


