LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
AI-generated Key Points
- Text embedding models are crucial in NLP for encoding semantic content into vector representations
- Pre-trained bidirectional encoders and encoder-decoders such as BERT and T5 have traditionally been used for text embedding tasks
- Decoder-only large language models (LLMs) are state of the art on most NLP tasks and benchmarks, yet the community is only slowly adopting them for text embedding tasks
- LLM2Vec transforms a decoder-only LLM into a text encoder in three steps: enabling bidirectional attention, masked next token prediction, and unsupervised contrastive learning
- The approach is applied to 3 popular LLMs ranging from 1.3B to 7B parameters, and the transformed models are evaluated on English word- and sequence-level tasks
- LLM2Vec outperforms encoder-only models by a large margin on word-level tasks and reaches a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB)
- Combining LLM2Vec with supervised contrastive learning techniques leads to state-of-the-art results on MTEB among models trained solely on publicly available data
- The results and analysis by Parishad BehnamGhader et al. indicate that LLMs can be transformed into universal text encoders in a parameter-efficient manner, without expensive adaptation or synthetic GPT-4 generated data
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy
Abstract: Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
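The first and third steps named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the mean-pooling choice, the in-batch InfoNCE-style loss and the temperature value of 0.05 are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def attention_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len 0/1 matrix.

    Entry [i][j] == 1 means token i may attend to token j.
    A causal (decoder-only) mask only allows j <= i; a bidirectional
    mask allows every position, which is what step 1 enables.
    """
    return [[1 if (bidirectional or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]

def mean_pool(token_vectors):
    """Average token vectors into one sequence embedding (assumed pooling)."""
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / len(token_vectors)
            for d in range(dim)]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss (hypothetical helper, assumed formulation).

    anchors[i] and positives[i] are two embeddings of the same text;
    every other positive in the batch serves as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

causal = attention_mask(3, bidirectional=False)   # lower-triangular
full = attention_mask(3, bidirectional=True)      # all ones
```

With a causal mask, the representation of an early token cannot depend on later tokens, which limits how much context each token embedding carries; the bidirectional mask removes that restriction. The contrastive loss is small when each anchor is most similar to its own positive and large when it is more similar to another item in the batch.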