Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

AI-generated keywords: Transformer model

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The transformer model is widely used for natural language processing tasks
  • A fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training?
  • The authors propose a simple and efficient method called Attention with Linear Biases (ALiBi) to address this issue
  • Instead of adding positional embeddings to the word embeddings, ALiBi biases the query-key attention scores with a term proportional to the distance between query and key
  • ALiBi allows for extrapolation, and its inductive bias towards recency lets it outperform multiple strong position methods on the WikiText-103 benchmark
  • The authors provide an analysis of ALiBi to understand why it leads to better performance
  • ALiBi enables training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 while being 11% faster and using 11% less memory
  • The paper contributes a method for extrapolating at inference time to longer sequences in natural language processing tasks

Authors: Ofir Press, Noah A. Smith, Mike Lewis

Abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
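The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the geometric slope schedule (starting at 2^(-8/n) for n heads) is an assumption taken from the full paper rather than from the abstract, and the 1/sqrt(d) score scaling and causal mask are the usual transformer conventions.

```python
import numpy as np


def alibi_slopes(num_heads: int) -> np.ndarray:
    """Per-head slopes as a geometric sequence.

    Assumed schedule: ratio 2^(-8/num_heads), intended for num_heads
    that is a power of two (e.g. 8 heads -> 1/2, 1/4, ..., 1/256).
    """
    ratio = 2.0 ** (-8.0 / num_heads)
    return ratio ** np.arange(1, num_heads + 1)


def alibi_bias(seq_len: int, slopes: np.ndarray) -> np.ndarray:
    """Return a (heads, seq_len, seq_len) bias equal to -slope * (i - j).

    i is the query position and j the key position. Entries with j > i
    are positive here but are removed later by the causal mask.
    """
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]  # i - j
    return -slopes[:, None, None] * distance[None, :, :]


def causal_attention_weights(q: np.ndarray, k: np.ndarray, slopes: np.ndarray) -> np.ndarray:
    """Attention weights for q, k of shape (heads, seq_len, dim).

    No positional embeddings are added anywhere; position enters only
    through the linear bias on the query-key scores.
    """
    _, seq_len, dim = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)
    scores = scores + alibi_bias(seq_len, slopes)
    # Causal mask: a query may not attend to later positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    slopes = alibi_slopes(8)
    q = rng.standard_normal((8, 16, 4))
    k = rng.standard_normal((8, 16, 4))
    weights = causal_attention_weights(q, k, slopes)
    print(weights.shape)
```

Because the bias depends only on the distance i - j, the same function yields a valid bias for any sequence length, with no learned position table to outgrow. That is what makes evaluation on longer inputs possible in the first place; how well a trained model actually extrapolates is the empirical question the paper addresses.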

Submitted to arXiv on 27 Aug. 2021

Results of the summarizing process for the arXiv paper: 2108.12409v1

This paper's license does not allow us to build upon its content, so the summary here was generated from the paper's metadata rather than the full article.

The transformer model, introduced by Vaswani et al. (2017), is widely used for natural language processing tasks, yet a fundamental question remains open: how can a model extrapolate at inference time to sequences longer than those seen during training? The authors first show that extrapolation can be improved by changing the position representation method, but find that existing proposals do not allow efficient extrapolation. They then propose a simple and efficient method, Attention with Linear Biases (ALiBi). Instead of adding positional embeddings to the word embeddings, ALiBi biases the query-key attention scores with a term proportional to the distance between query and key. With this method, a 1.3 billion parameter model trained on input sequences of length 1024 extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 while being 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, the authors analyze ALiBi to understand why it leads to better performance. Overall, the paper contributes a method for extrapolating at inference time to longer sequences in natural language processing tasks.
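As a small worked illustration of the distance-proportional bias, the biased score for a query at position i and a key at position j can be written as follows. This is a sketch under assumed notation: the symbol m for the per-head slope and the indices i and j are not fixed by the metadata, and the example value m = 1/2 is chosen only for the arithmetic.

```latex
\[
  s_{ij} \;=\; \mathbf{q}_i \cdot \mathbf{k}_j \;-\; m\,(i - j), \qquad j \le i .
\]
With $m = \tfrac{1}{2}$ and a query at position $i = 4$, the biases added to the
scores for keys $j = 1, 2, 3, 4$ are
\[
  -\tfrac{1}{2}(4-1),\quad -\tfrac{1}{2}(4-2),\quad -\tfrac{1}{2}(4-3),\quad -\tfrac{1}{2}(4-4)
  \;=\; -1.5,\quad -1,\quad -0.5,\quad 0 .
\]
```

The most recent key receives no penalty and each step further back costs another m, which is the recency bias the summary refers to.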
Created on 11 May. 2023
