For years, embedding models built on bidirectional language models led the field, excelling at retrieval and general-purpose embedding tasks. More recently, however, the top-performing methods have relied on fine-tuning Large Language Models (LLMs) with large amounts of proprietary synthetic data from GPT-4, which is not accessible to the broader community.
In a new paper NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, an NVIDIA research team introduces NV-Embed. This generalist embedding model significantly boosts the performance of decoder-only LLMs in embedding and retrieval tasks while maintaining simplicity and reproducibility.


The team presents a novel latent attention layer that pools the embeddings of a token sequence into a single vector. Unlike the mean pooling commonly used in bidirectional embedding models or the last-token embedding used with decoder-only LLMs, this pooling method consistently improves accuracy on retrieval and other downstream tasks.
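Conceptually, the layer lets the token hidden states attend to a small trainable latent array, passes the result through an MLP, and mean-pools the output into one embedding. The PyTorch sketch below illustrates this idea; the class name, dimensions, and the residual MLP detail are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    """Sketch of a latent-attention pooling head: token hidden states act as
    queries against a small trainable latent array (keys/values), the result
    passes through an MLP, and the outputs are mean-pooled into one vector.
    Hypothetical sizes; not the paper's code."""

    def __init__(self, hidden_dim: int = 4096, num_latents: int = 512, num_heads: int = 8):
        super().__init__()
        # Trainable latent "dictionary" shared across all inputs.
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4),
            nn.GELU(),
            nn.Linear(hidden_dim * 4, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the LLM's last layer
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
        batch = hidden_states.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Token states query the latent array (keys/values).
        attended, _ = self.cross_attn(query=hidden_states, key=latents, value=latents)
        out = attended + self.mlp(attended)  # residual MLP block (assumed detail)
        # Mean-pool over non-padding tokens to get one embedding per sequence.
        mask = attention_mask.unsqueeze(-1).float()
        return (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
```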
To further improve representation learning, the researchers eliminate the causal attention mask during the contrastive training of decoder-only LLMs, leading to substantial performance gains. This approach is simpler yet more effective than recent related methods that require additional training phases involving masked token prediction or mixed training objectives.
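The change can be pictured as swapping the lower-triangular causal mask for a full mask built only from the padding pattern, so every token can attend to every other token during contrastive fine-tuning. The snippet below is a minimal plain-PyTorch illustration; the function names are ours, and real training frameworks construct these masks internally.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Standard decoder-only mask: token i attends only to tokens <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(padding_mask: torch.Tensor) -> torch.Tensor:
    # padding_mask: (batch, seq_len), True for real tokens, False for padding.
    # With the causal constraint removed, every real token attends to every
    # other real token, as in a bidirectional encoder.
    return padding_mask.unsqueeze(1) & padding_mask.unsqueeze(2)
```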
For model training, they employ a two-stage contrastive instruction-tuning method, starting from the pretrained Mistral-7B. The first stage applies contrastive training with instructions on retrieval datasets, using in-batch negatives and curated hard negatives. In the second stage, they blend carefully curated non-retrieval datasets into the stage-one training data and disable in-batch negative training, which can be misleading for non-retrieval tasks.
This strategy not only improves classification, clustering, and semantic textual similarity tasks, but also, surprisingly, enhances retrieval performance. Importantly, the training data is entirely publicly available and contains no proprietary synthetic data from models such as GPT-4. Furthermore, the model is not fine-tuned from an existing embedding model.
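To make the role of the in-batch-negative switch concrete, here is a minimal InfoNCE-style sketch, assuming query, positive, and hard-negative embeddings have already been computed; the function signature, temperature value, and flag name are illustrative, not the authors' training code. Stage one would run with use_in_batch_negatives=True, stage two with it set to False.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor,          # (B, d) query embeddings
                     pos: torch.Tensor,        # (B, d) positive passage embeddings
                     hard_neg: torch.Tensor,   # (B, K, d) curated hard negatives
                     temperature: float = 0.05,
                     use_in_batch_negatives: bool = True) -> torch.Tensor:
    # Illustrative InfoNCE-style loss; hyperparameters are assumptions.
    q, pos, hard_neg = (F.normalize(x, dim=-1) for x in (q, pos, hard_neg))

    pos_score = (q * pos).sum(-1, keepdim=True)            # (B, 1)
    hard_score = torch.einsum("bd,bkd->bk", q, hard_neg)   # (B, K)
    logits = torch.cat([pos_score, hard_score], dim=1)

    if use_in_batch_negatives:
        # Stage one: treat other queries' positives in the batch as extra negatives.
        in_batch = q @ pos.T                                # (B, B)
        self_mask = torch.eye(q.size(0), dtype=torch.bool, device=q.device)
        in_batch = in_batch.masked_fill(self_mask, float("-inf"))  # drop own positive
        logits = torch.cat([logits, in_batch], dim=1)

    # The positive sits at index 0 of each row of logits.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits / temperature, labels)
```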


Combining these techniques, the NV-Embed model achieves a record-high score of 69.32, ranking first (as of May 22, 2024) on the Massive Text Embedding Benchmark (MTEB) across 56 embedding tasks. It surpasses previous leading embedding models such as E5-mistral-7b-instruct (66.63), SFR-Embedding (67.56), and Voyage-large-2-instruct (68.28). Notably, NV-Embed also achieves the highest score of 59.35 on the 15 retrieval tasks within MTEB, which are derived from the BEIR benchmark.
The code is available on Hugging Face. The paper NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models is on arXiv.
Author: Hecate He | Editor: Chain Zhang

