Towards General Text Embeddings with Multi-stage Contrastive Learning

AI-generated keywords: GTE, Text Embedding, Contrastive Learning, NLP, Code-related Tasks

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • GTE is a general-purpose text embedding model trained using multi-stage contrastive learning.
  • The authors aim to unify various NLP tasks into a single format by training a unified text embedding model through contrastive learning over diverse datasets from multiple sources.
  • By increasing the amount of training data during unsupervised pre-training and supervised fine-tuning stages, the authors achieve substantial performance improvements over existing embedding models.
  • Despite a modest parameter count of 110M, GTE$_\text{base}$ outperforms OpenAI's black-box embedding API and surpasses text embedding models 10 times its size on the Massive Text Embedding Benchmark (MTEB).
  • GTE outperforms previous best code retrievers of similar size without additional fine-tuning on each programming language individually, highlighting its capability to handle code-related tasks effectively.
  • Multi-stage contrastive learning is the key factor behind these impressive results.
  • GTE offers a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.

Authors: Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang

License: CC BY-NC-ND 4.0

Abstract: We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the number of training data during both unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE$_\text{base}$ outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.
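The abstract does not spell out the training objective, so the following is only a minimal sketch of a loss commonly used for contrastive text embedding training (InfoNCE with in-batch negatives), not necessarily the objective used for GTE. The function names, the temperature of 0.05, and the toy vectors are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where documents[i] is the positive for queries[i].

    Every other document in the batch acts as a negative for query i.
    The temperature value is an assumed default, not taken from the paper.
    """
    if len(queries) != len(documents) or not queries:
        raise ValueError("need equally sized, non-empty batches")
    q = [l2_normalize(v) for v in queries]
    d = [l2_normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every document.
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
        # Negative log-probability assigned to the matching document.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy batch: each query points roughly in the direction of its own document.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
# Same vectors, but the positives are swapped, so the pairing is wrong.
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each query is closest to its own document and grows when another document in the batch is closer, which is the signal that pulls matching text pairs together during training.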

Submitted to arXiv on 07 Aug. 2023

Results of the summarizing process for the arXiv paper: 2308.03281v1

The paper introduces GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. Following recent work that unifies NLP tasks into a single format, the authors train one embedding model by applying contrastive learning to a diverse mixture of datasets from multiple sources. Substantially increasing the amount of training data in both the unsupervised pre-training stage and the supervised fine-tuning stage yields large gains over existing embedding models. Notably, with only 110M parameters, GTE$_\text{base}$ outperforms OpenAI's black-box embedding API and surpasses text embedding models 10 times its size on the Massive Text Embedding Benchmark (MTEB).

GTE also handles code: by treating code as text, and without separate fine-tuning for each programming language, it outperforms the previous best code retrievers of similar size. The authors attribute these results to multi-stage contrastive learning and present GTE as a powerful and efficient embedding model applicable across a broad range of NLP and code-related tasks.
Created on 24 Aug. 2023
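
To make "treating code as text" concrete, here is a minimal retrieval sketch in which a natural-language query and code snippets pass through the same embedding function and are ranked by cosine similarity. The `toy_embed` function is a hypothetical stand-in (a hashed bag of tokens), not GTE or any real encoder; `rank`, the corpus, and the dimension of 256 are likewise assumptions for illustration.

```python
import math
import re
import zlib


def toy_embed(text, dim=256):
    """Hypothetical stand-in for a trained encoder: hashed bag of tokens."""
    vec = [0.0] * dim
    # Code and prose are tokenized identically: lowercase letter runs and digit runs.
    for token in re.findall(r"[a-z]+|\d+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Unit-normalize so that a dot product equals cosine similarity.
    return [x / norm for x in vec] if norm else vec


def rank(query, corpus, embed=toy_embed):
    """Return (score, document) pairs sorted by descending cosine similarity."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(doc))), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


corpus = [
    "def binary_search(items, target): ...",
    "SELECT name FROM users WHERE id = 1",
    "A recipe for sourdough bread",
]
results = rank("binary search over sorted items", corpus)
```

A trained embedding model would replace `toy_embed` with its own encoder; the ranking step stays the same, which is why a single text model can serve both text and code retrieval without a dedicated model per programming language.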
