What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

AI-generated keywords: Zero-shot Generalization

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Large pretrained Transformer language models can perform tasks they were not explicitly trained on (zero-shot generalization)
  • Limited systematic comparison of different model architectures and pretraining objectives
  • Comprehensive evaluation conducted on text-to-text models with three different architectures: causal decoder-only, non-causal decoder-only, and encoder-decoder
  • Models trained using autoregressive language modeling and masked language modeling as pretraining objectives
  • Evaluation done with and without multitask prompted finetuning
  • Models have over 5 billion parameters and are trained for more than 170 billion tokens
  • Causal decoder-only models trained on autoregressive language modeling objective show strongest zero-shot generalization after unsupervised pretraining
  • Models with non-causal visibility on their input, trained on a masked language modeling objective and then multitask finetuned, perform best overall
  • Pretrained non-causal decoder models can be adapted into generative causal decoder models using autoregressive language modeling as a downstream task
  • Pretrained causal decoder models can be efficiently adapted into non-causal decoder models, reaching competitive performance after multitask finetuning
  • Insights provided for selecting appropriate architecture and pretraining objectives to maximize zero-shot generalization in large pretrained Transformer language models
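
The architectures named in the key points differ mainly in which positions a token may attend to. The sketch below illustrates that difference with boolean visibility masks. It assumes the common reading of "non-causal visibility on their input" as full bidirectional attention over the input (prefix) positions with causal attention over the rest; the function names and the `prefix_len` parameter are illustrative and do not come from the paper's code.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def non_causal_mask(seq_len, prefix_len):
    """Causal mask in which every position also sees the whole input prefix.

    Assumption: the first `prefix_len` positions hold the input, and the
    remaining positions hold the text being generated.
    """
    return [[j <= i or j < prefix_len for j in range(seq_len)]
            for i in range(seq_len)]


def encoder_decoder_masks(input_len, target_len):
    """Masks for a standard encoder-decoder layout (illustrative only)."""
    return {
        # Encoder: every input position sees every input position.
        "encoder_self": [[True] * input_len for _ in range(input_len)],
        # Decoder: causal over the target positions.
        "decoder_self": causal_mask(target_len),
        # Cross-attention: every target position sees the full encoded input.
        "cross": [[True] * input_len for _ in range(target_len)],
    }


def show(mask):
    """Render a mask as text: 'x' means visible, '.' means hidden."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(show(causal_mask(4)))
    print()
    print(show(non_causal_mask(4, prefix_len=2)))
```

With `prefix_len=2`, the first two rows of the non-causal mask are identical, because both input positions see each other. That is the only difference from the causal mask in this toy case.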

Authors: Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel

Abstract: Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are available at https://github.com/bigscience-workshop/architecture-objective.

Submitted to arXiv on 12 Apr. 2022

Results of the summarizing process for the arXiv paper: 2204.05832v1

The authors study zero-shot generalization in large pretrained Transformer language models, that is, the ability to perform tasks the models were not explicitly trained on. State-of-the-art models differ significantly in architecture and pretraining objective, yet these factors have rarely been compared systematically. To address this gap, the authors run a large-scale evaluation of modeling choices and their impact on zero-shot generalization. The focus is on text-to-text models with three architectures: causal decoder-only, non-causal decoder-only, and encoder-decoder. These are trained with two pretraining objectives, autoregressive language modeling and masked language modeling, and evaluated both with and without multitask prompted finetuning. The models have over 5 billion parameters and are trained for more than 170 billion tokens, which increases the likelihood that the conclusions will transfer to even larger scales.
The experiments show that causal decoder-only models trained with an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. Across all experiments, however, the best performance comes from models with non-causal visibility on their input, trained with a masked language modeling objective and then multitask finetuned. Since the strongest configuration differs between the two settings, the authors also examine whether pretrained models can be adapted across architectures and objectives. They find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models by using autoregressive language modeling as a downstream task. Conversely, pretrained causal decoder models can be efficiently adapted into non-causal decoder models, which ultimately reach competitive performance after multitask finetuning.
Overall, this research provides valuable insights into selecting appropriate architecture and pretraining objectives for maximizing zero-shot generalization in large pretrained Transformer language models. The code and checkpoints used in this study are made available for further exploration.
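
To make the two pretraining objectives in the summary concrete, here is a minimal sketch of how one token sequence could be turned into a training example under each. The metadata does not say how masked language modeling is cast as a text-to-text task, so the sketch assumes a T5-style span-corruption format, which is one common choice. The sentinel strings, function names, and hand-picked spans are all illustrative assumptions.

```python
def autoregressive_example(tokens):
    """Next-token prediction: position i of the input predicts token i + 1."""
    return tokens[:-1], tokens[1:]


def span_corruption_example(tokens, spans):
    """Replace each span with a sentinel; the target lists the removed spans.

    `spans` must be sorted, non-overlapping, half-open (start, end) index
    pairs. Sentinel names follow the T5 convention (an assumption here).
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # mark where the span was
        targets.append(sentinel)             # announce which span follows
        targets.extend(tokens[start:end])    # the tokens to reconstruct
        cursor = end
    inputs.extend(tokens[cursor:])           # keep text after the last span
    return inputs, targets


if __name__ == "__main__":
    toks = "the quick brown fox jumps over the lazy dog".split()
    print(autoregressive_example(toks))
    print(span_corruption_example(toks, [(1, 3), (6, 7)]))
```

In practice spans are sampled at random rather than chosen by hand; fixed spans are used here only to keep the example deterministic.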
Created on 19 Oct. 2023
