Hyena Hierarchy: Towards Larger Convolutional Language Models

AI-generated keywords: Hyena, Attention, NLP, Transformers, Efficiency

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Deep learning has made significant strides in NLP tasks, with large Transformers being a popular choice.
  • The attention operator, a core building block of Transformers, exhibits quadratic cost in sequence length, which limits the amount of context that can be accessed.
  • Existing subquadratic methods based on low-rank and sparse approximations have been developed but still need to be combined with dense attention layers to match the performance of Transformers.
  • Michael Poli and his co-authors propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
  • In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models.
  • The team also set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K.
  • Hyena operators are twice as fast as highly optimized attention at sequence length 8K and 100x faster at sequence length 64K.
  • Overall, Hyena offers an efficient alternative to traditional attention mechanisms used in NLP tasks. Its success highlights the potential for further research into developing more efficient deep learning models that can handle larger sequences while maintaining high levels of accuracy.
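
The construction named in the key points (long convolutions interleaved with data-controlled gating) can be illustrated with a minimal single-channel sketch. This is not the authors' implementation: the function names, the decaying-cosine filter, and the choice to pass the gate sequences in directly are all assumptions made for illustration. In a real model the filters would be produced by a learned function of position and the gates would be learned projections of the input. The FFT path is what keeps the convolution cost below quadratic in sequence length.

```python
import cmath
import math


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def causal_conv_fft(h, u):
    """Causal convolution y[t] = sum_{s<=t} h[s] * u[t-s] via FFT, O(L log L)."""
    L = len(u)
    n = 1
    while n < 2 * L:  # zero-pad so the circular convolution equals the linear one
        n *= 2
    H = fft([complex(x) for x in h] + [0j] * (n - len(h)))
    U = fft([complex(x) for x in u] + [0j] * (n - L))
    y = fft([a * b for a, b in zip(H, U)], invert=True)
    return [(v / n).real for v in y[:L]]  # unnormalized inverse, so divide by n


def causal_conv_direct(h, u):
    """Reference O(L^2) convolution, used only to check the FFT path."""
    return [sum(h[s] * u[t - s] for s in range(t + 1)) for t in range(len(u))]


def implicit_filter(L, decay=0.3, freq=1.5):
    """Hypothetical implicit filter: values are a function of position t,
    so the filter can be as long as the sequence without storing L weights."""
    return [math.exp(-decay * t) * math.cos(freq * t) for t in range(L)]


def gated_long_conv(v, gates, filters):
    """Alternate a long convolution with elementwise (data-controlled) gating."""
    z = list(v)
    for x, h in zip(gates, filters):
        z = causal_conv_fft(h, z)              # long convolution over the sequence
        z = [xi * zi for xi, zi in zip(x, z)]  # gate by an input-dependent signal
    return z


if __name__ == "__main__":
    u = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
    h = implicit_filter(len(u))
    print(gated_long_conv(u, gates=[u, u], filters=[h, h]))
```

Here the same toy sequence is reused as value and gates purely to keep the example short; `causal_conv_direct` exists only so the FFT result can be compared against a straightforward implementation.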

Authors: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

Abstract: Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.

Submitted to arXiv on 21 Feb. 2023

Results of the summarizing process for the arXiv paper: 2302.10866v1

The license of this paper does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

In recent years, deep learning has made significant strides in natural language processing (NLP), with large Transformers a popular choice because of their ability to learn at scale. However, the attention operator, a core building block of Transformers, exhibits quadratic cost in sequence length, which limits the amount of context that can be accessed. Existing subquadratic methods based on low-rank and sparse approximations still need to be combined with dense attention layers to match the performance of Transformers, which indicates a gap in capability.

To close this gap, Michael Poli and his co-authors propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. The authors also set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Furthermore, Hyena operators are twice as fast as highly optimized attention at sequence length 8K and 100x faster at sequence length 64K.

Overall, Hyena offers an efficient alternative to the attention mechanisms used in NLP tasks. Its results highlight the potential for further research into more efficient deep learning models that can handle longer sequences while maintaining high accuracy.
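
To make the scaling argument concrete, the short sketch below compares idealized operation counts for a quadratic operator (L^2) and an FFT-style subquadratic operator (L log2 L) at the sequence lengths quoted above. The function names are hypothetical, and the counts ignore constant factors, memory traffic, and hardware effects, so they illustrate only how the gap widens with length and are not expected to reproduce the measured 2x and 100x speedups.

```python
import math


def quadratic_cost(seq_len):
    """Idealized count for an operator that compares every pair of positions."""
    return seq_len * seq_len


def fft_conv_cost(seq_len):
    """Idealized count for an FFT-based long convolution, L * log2(L)."""
    return seq_len * math.log2(seq_len)


if __name__ == "__main__":
    for seq_len in (2048, 8192, 65536):  # 2K, 8K, 64K tokens
        ratio = quadratic_cost(seq_len) / fft_conv_cost(seq_len)
        print(f"L={seq_len:>6}: L^2 / (L log2 L) = {ratio:,.0f}")
```

The ratio simplifies to L / log2(L), which keeps growing as the sequence gets longer; this is the sense in which a subquadratic operator pulls ahead at long context.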
Created on 23 Apr. 2023
