Fast Inference from Transformers via Speculative Decoding

Summaries already available in other languages: fr

Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias

Abstract: Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method supports existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.

Submitted to arXiv on 30 Nov. 2022

Ask questions about this paper to our AI assistant

You can also chat with multiple papers at once here.

AI assistant instructions?

Results of the summarizing process for the arXiv paper: 2211.17192v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

The summary is not ready yet

The key points are not ready yet

The Layman's summary is not ready yet

The blog article is not ready yet

Created on 21 Sep. 2023

Available in other languages: fr

Assess the quality of the AI-generated content by voting

Score: 0

The previous summary was created more than a year ago and can be re-run (if necessary) by clicking on the Run button below.

Similar papers summarized with our AI tools

58.5%

Efficiently Scaling Transformer Inference

cs.LG

58.3%

Inference with Reference: Lossless Acceleration of Large Language Models

cs.CL

57.8%

Trusting Your Evidence: Hallucinate Less with Context-aware Decoding

cs.CL

57.0%

Contrastive Decoding Improves Reasoning in Large Language Models

cs.CL

56.7%

Emergent Abilities of Large Language Models

cs.CL

56.4%

A Hierarchical Bayesian Model for Deep Few-Shot Meta Learning

cs.LG

55.5%

LLaMA: Open and Efficient Foundation Language Models

cs.CL

Navigate through even more similar papers through a

tree representation

Look for similar papers (in beta version)

By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.