APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding

AI-generated keywords: Large Language Models, Parallel Auto-regressive Generation, Efficient Deployment Strategies, Hierarchical Structures, High-throughput Serving Scenarios

AI-generated Key Points

  • Introduction of parallel auto-regressive generation method to enhance efficiency of large language models (LLMs) in serving scenarios
  • Instruction-tuning on general domain data with hierarchical structures allows LLMs to independently plan and perform auto-parallel auto-regressive (APAR) generation
  • Up to 2x speed-up achieved, up to 4x when combined with speculative decoding
  • APAR reduces key-value cache consumption and attention computation during generation, leading to throughput increase of 20-70% and latency reduction of 20-35%
  • Structuring paragraphs in root-and-details format for accurate extraction of hierarchical structures from LLM responses
  • Unstructured data such as code and math content is excluded from structure extraction but used as negative examples during model training
  • Experimental setup involves fine-tuning vicuna-v1.3-{7B,13B} models to produce APAR-{7B,13B}, evaluated in three settings: Vanilla-APAR, Medusa-APAR, and Batched-APAR
  • Training samples drawn from structured and unstructured data sources at a ratio of 1:1; models are fine-tuned with specific batch sizes and learning rates, after which additional Medusa heads are trained
  • Highlighting how APAR enhances LLM generation efficiency by enabling parallel decoding threads while maintaining coherence and accuracy in text prediction across serving scenarios
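
The 1:1 mix of structured and unstructured training samples mentioned above can be illustrated with a minimal sketch. This is not the authors' data pipeline: the function name `mix_one_to_one`, the `seed` parameter, and the choice to sample with replacement are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_one_to_one(
    structured: Sequence[T],
    unstructured: Sequence[T],
    n_pairs: int,
    seed: int = 0,
) -> List[T]:
    """Return 2 * n_pairs samples, half from each source, in shuffled order.

    Hypothetical helper: sampling with replacement is an assumption,
    not something the summary specifies.
    """
    rng = random.Random(seed)
    # Draw the same number of samples from each source to get a 1:1 ratio.
    batch = [rng.choice(structured) for _ in range(n_pairs)]
    batch += [rng.choice(unstructured) for _ in range(n_pairs)]
    # Shuffle so the two kinds of samples are interleaved.
    rng.shuffle(batch)
    return batch
```

Calling `mix_one_to_one(structured_samples, unstructured_samples, n_pairs=4)` would return eight samples, four from each source.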

Authors: Mingdao Liu, Aohan Zeng, Bowen Wang, Peng Zhang, Jie Tang, Yuxiao Dong

14 pages
License: CC BY 4.0

Abstract: The massive adoption of large language models (LLMs) demands efficient deployment strategies. However, the auto-regressive decoding process, which is fundamental to how most LLMs generate text, poses challenges to achieving efficient serving. In this work, we introduce a parallel auto-regressive generation method. By instruct-tuning on general domain data that contains hierarchical structures, we enable LLMs to independently plan their generation process and perform auto-parallel auto-regressive (APAR) generation, significantly reducing the number of generation steps. APAR alone can achieve up to 2x speed-up, and when combined with speculative decoding, the speed-up can reach up to 4x. In addition, APAR reduces the key-value cache consumption and attention computation during generation. This leads to a throughput increase of 20-70% and a latency reduction of 20-35% in high-throughput scenarios, compared to state-of-the-art serving frameworks.

Submitted to arXiv on 12 Jan. 2024

Results of the summarizing process for the arXiv paper: 2401.06761v1

This work introduces a parallel auto-regressive generation method to enhance the efficiency of large language models (LLMs) in various serving scenarios. By instruct-tuning on general domain data with hierarchical structures, LLMs can independently plan their generation process and perform auto-parallel auto-regressive (APAR) generation. This approach yields up to a 2x speed-up on its own and up to 4x when combined with speculative decoding. Furthermore, APAR reduces key-value cache consumption and attention computation during generation, resulting in a throughput increase of 20-70% and a latency reduction of 20-35% compared to existing serving frameworks.

To accurately extract hierarchical structures from LLM responses, paragraphs are structured in a root-and-details format, where the first sentence serves as the root summarizing the main idea. Unstructured data such as code and math content is excluded from structure extraction but included as negative examples during model training to prevent excessive branching during decoding.

The experimental setup involves fine-tuning vicuna-v1.3-{7B,13B} models to produce APAR-{7B,13B}, which are evaluated in three settings: Vanilla-APAR, implemented with transformers; Medusa-APAR, implemented with Medusa for speculative decoding; and Batched-APAR, implemented with vLLM for high-throughput serving scenarios. During training, samples are drawn from both structured and unstructured data sources at a ratio of 1:1. The models are first fine-tuned with specific batch sizes and learning rates; additional Medusa heads are then trained for the Medusa-APAR setting.

Overall, this work highlights how APAR enhances LLM generation efficiency by enabling parallel decoding threads while maintaining coherence and accuracy in text prediction across various serving scenarios.
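
The root-and-details format described above can be sketched as follows. This is only an illustration of the idea that the first sentence acts as the root, not the paper's extraction code: the names `Paragraph` and `split_root_and_details` are hypothetical, and the regex-based sentence boundary is a simplifying assumption that will mis-split on abbreviations such as "e.g.".

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    root: str     # first sentence, summarizing the main idea
    details: str  # remaining sentences (empty if there are none)


def split_root_and_details(paragraph: str) -> Paragraph:
    """Split a paragraph into its first sentence (root) and the rest (details).

    Hypothetical helper: a sentence is assumed to end at '.', '!' or '?'
    followed by whitespace.
    """
    # Collapse runs of whitespace so line breaks do not affect the split.
    text = " ".join(paragraph.split())
    # Split once, at the first sentence boundary.
    parts = re.split(r"(?<=[.!?])\s+", text, maxsplit=1)
    root = parts[0]
    details = parts[1] if len(parts) > 1 else ""
    return Paragraph(root=root, details=details)
```

Under these assumptions, a paragraph with a single sentence yields an empty `details` field, which a caller could treat as having nothing to decode in a parallel thread.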
Created on 18 Nov. 2024
