ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

AI-generated keywords: Replaced Token Detection, Masked Language Modeling, Pre-training, Sample Efficiency, Contextual Representations

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors propose a new pre-training task called replaced token detection as an alternative to masked language modeling (MLM) methods like BERT
  • Approach corrupts input tokens by replacing them with plausible alternatives from a small generator network
  • Discriminative model is trained to predict whether each token in the corrupted input was replaced or not
  • New pre-training task is more efficient than MLM because it is defined over all input tokens, not just the masked subset
  • Contextual representations learned by this approach substantially outperform those learned by BERT, given the same model size, data, and compute
  • Particularly strong gains for small models: a model trained on one GPU for 4 days outperforms GPT (trained using 30x more compute) on the GLUE benchmark
  • Performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute, and outperforms them when using the same amount of compute
  • Effective and efficient alternative to MLM for improving contextual representations in natural language processing tasks
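
The corruption and labeling steps in the key points above can be sketched in a few lines of Python. This is a toy illustration only, not the authors' implementation: the function names `corrupt` and `replaced_labels`, the `mask_prob` parameter, and the uniform draw from the vocabulary (standing in for the small generator network) are all assumptions made for the example.

```python
import random

def corrupt(tokens, vocab, mask_prob=0.15, rng=None):
    """Replace a random subset of tokens with samples from a stand-in generator."""
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            # Assumption: a uniform draw from the vocabulary stands in for
            # the small generator network described in the paper.
            corrupted[i] = rng.choice(vocab)
    return corrupted

def replaced_labels(original, corrupted):
    """Discriminator targets: 1 where the token differs from the original, else 0."""
    return [int(o != c) for o, c in zip(original, corrupted)]

if __name__ == "__main__":
    tokens = "the chef cooked the meal".split()
    vocab = ["the", "chef", "cooked", "ate", "meal", "a", "dog"]
    corrupted = corrupt(tokens, vocab, mask_prob=0.5, rng=random.Random(1))
    print(corrupted)
    print(replaced_labels(tokens, corrupted))
```

Because the labels compare each corrupted token against the original, a position where the sampler happens to reproduce the original token is labeled 0 (not replaced), and every position in the sequence receives a label.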

Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

ICLR 2020

Abstract: Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

Submitted to arXiv on 23 Mar. 2020

Results of the summarizing process for the arXiv paper: 2003.10555v1

The authors of the paper propose a new pre-training task called replaced token detection as an alternative to the commonly used masked language modeling (MLM) methods like BERT. Rather than masking input tokens with [MASK], the approach corrupts them by replacing some tokens with plausible alternatives sampled from a small generator network. Rather than training a model to predict the original identities of the corrupted tokens, a discriminative model is trained to predict whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate that this new pre-training task is more efficient than MLM because it is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by this approach substantially outperform those learned by BERT, given the same model size, data and compute resources. The gains are particularly strong for small models, as evidenced by a model trained on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Furthermore, this approach also works well at scale. It performs comparably to RoBERTa and XLNet while utilizing less than 1/4 of their compute resources and even outperforms them when using the same amount of compute. Overall, this research presents an effective and efficient alternative to MLM pre-training methods for improving contextual representations in natural language processing tasks.
Created on 18 Dec. 2023
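
The efficiency argument in the summary above (a loss defined over all input tokens rather than only the corrupted subset) can be made concrete with a small sketch. The names `rtd_loss` and `supervised_positions` are hypothetical, and plain binary cross-entropy is assumed here as the per-token discriminator loss; this is an illustration under those assumptions, not the paper's training code.

```python
import math

def rtd_loss(p_replaced, labels):
    """Mean binary cross-entropy over every position in the sequence."""
    eps = 1e-12  # guards against log(0)
    losses = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(p_replaced, labels)
    ]
    return sum(losses) / len(losses)

def supervised_positions(selected, masked_only):
    """Count positions that contribute to the loss.

    selected: 0/1 flags marking positions chosen for corruption.
    masked_only=True mimics an MLM-style loss (corrupted positions only);
    masked_only=False mimics replaced token detection (all positions).
    """
    return sum(selected) if masked_only else len(selected)

if __name__ == "__main__":
    selected = [0, 1, 0, 0, 0, 0, 1, 0]
    print(supervised_positions(selected, masked_only=True))
    print(supervised_positions(selected, masked_only=False))
    print(rtd_loss([0.5] * 4, [0, 1, 0, 1]))
```

In this toy sequence, an MLM-style objective would draw a training signal from two of eight positions, while the discriminator objective draws one from all eight.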
