Inference with Reference: Lossless Acceleration of Large Language Models

AI-generated keywords: Natural Language Processing, LLMA, LLM, Cache-assisted Generation, Barack Obama

AI-generated Key Points

  • LLMA is an LLM accelerator proposed to losslessly speed up Large Language Model (LLM) inference with references.
  • LLMA selects a text span from the reference, copies its tokens to the decoder, and then checks in parallel, within one decoding step, whether those tokens are appropriate as the decoding result.
  • LLMA achieves over 2x speed-up while producing generation results identical to greedy decoding, in practical scenarios where the in-context reference and the output overlap significantly, such as search engines and multi-turn conversations.
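  • As an example of generation with retrieved documents, a query asks where real insulin comes from; the retrieved documents state that insulin was derived from the pancreases of cows or pigs until the 1980s, when human insulin and insulin analogs began to be produced through genetic engineering using bacteria or yeast.
  • Cache-assisted generation is illustrated with queries about Barack Obama's educational background, answered from cached sessions.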

Authors: Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei

9 pages
License: CC BY 4.0

Abstract: We propose LLMA, an LLM accelerator to losslessly speed up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the reference that is available in many real world scenarios (e.g., retrieved documents). LLMA first selects a text span from the reference and copies its tokens to the decoder and then efficiently checks the tokens' appropriateness as the decoding result in parallel within one decoding step. The improved computational parallelism allows LLMA to achieve over 2x speed-up for LLMs with identical generation results as greedy decoding in many practical generation scenarios where significant overlap between in-context reference and outputs exists (e.g., search engines and multi-turn conversations).
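The copy-then-verify loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `make_toy_model` is a hypothetical stand-in for an LLM's greedy next-token choice, and the names `find_copy_span`, `match_len`, and `copy_len`, as well as the rule of triggering a copy when the last `match_len` output tokens appear in the reference, are assumptions made for the example. The inner loop simulates what a real transformer would compute in one parallel forward pass.

```python
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"

def make_toy_model(target: Sequence[str]) -> Callable[[Sequence[str]], str]:
    """Hypothetical stand-in for an LLM: returns the greedy next token."""
    def next_token(generated: Sequence[str]) -> str:
        return target[len(generated)] if len(generated) < len(target) else EOS
    return next_token

def greedy_decode(next_token, max_len: int = 64) -> Tuple[List[str], int]:
    """Plain greedy decoding: one token per decoding step."""
    out, steps = [], 0
    while len(out) < max_len:
        tok = next_token(out)
        steps += 1
        if tok == EOS:
            break
        out.append(tok)
    return out, steps

def find_copy_span(out, reference, match_len: int, copy_len: int) -> List[str]:
    """Assumed trigger: if the last match_len output tokens occur in the
    reference, return up to copy_len reference tokens that follow them."""
    if len(out) < match_len:
        return []
    tail = list(out[-match_len:])
    for i in range(len(reference) - match_len + 1):
        if list(reference[i:i + match_len]) == tail:
            return list(reference[i + match_len:i + match_len + copy_len])
    return []

def copy_and_verify_decode(next_token, reference, match_len: int = 1,
                           copy_len: int = 4, max_len: int = 64):
    """Copy candidate tokens from the reference, keep the prefix that agrees
    with the model's greedy choices, then take the model's own token."""
    out, steps = [], 0
    while len(out) < max_len:
        candidates = find_copy_span(out, reference, match_len, copy_len)
        steps += 1  # counted as ONE decoding step: candidates are checked together
        accepted, done = [], False
        # The trailing None yields one extra model token after the candidates.
        for cand in candidates + [None]:
            pred = next_token(out + accepted)  # simulated parallel prediction
            if pred == EOS:
                done = True
                break
            accepted.append(pred)  # always the model's own greedy token
            if pred != cand:       # first disagreement ends this step
                break
        out.extend(accepted)
        if done:
            break
    return out[:max_len], steps

reference = "insulin was derived from the pancreases of cows or pigs".split()
model = make_toy_model(["Historically"] + reference)
print(greedy_decode(model))
print(copy_and_verify_decode(model, reference))
```

Because every appended token is the model's own greedy prediction, the output matches plain greedy decoding regardless of what the reference contains; the reference only affects how many tokens are confirmed per step.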

Submitted to arXiv on 10 Apr. 2023


Results of the summarizing process for the arXiv paper: 2304.04487v1

Researchers in Natural Language Processing have proposed LLMA, an LLM accelerator that losslessly speeds up Large Language Model (LLM) inference with references. The method is motivated by the observation that the decoding result of an LLM often shares identical text spans with a reference that is available in many real-world scenarios, such as retrieved documents. LLMA selects a text span from the reference, copies its tokens to the decoder, and then checks in parallel, within one decoding step, whether those tokens are appropriate as the decoding result. This improved computational parallelism allows LLMA to achieve over 2x speed-up while producing generation results identical to greedy decoding, in practical scenarios where the in-context reference and the output overlap significantly, such as search engines and multi-turn conversations.

The paper also presents two example scenarios. In the first, a query asks where real insulin comes from. The answer is found in documents from an external corpus, which state that insulin was derived from the pancreases of cows or pigs until the 1980s, when human insulin and insulin analogs began to be produced through genetic engineering using bacteria or yeast.

The second example concerns cache-assisted generation: the LM decoder answers queries about which universities Barack Obama attended and about his educational background, drawing on cached sessions. According to the generated answer, Obama attended public schools in Hawaii before transferring to Columbia University in New York City, where he studied political science and graduated with a B.A. degree in 1983.
Created on 26 Apr. 2023

