Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

AI-generated keywords: Generative Artificial Intelligence, Large Language Models, Edge Intelligence Optimization, Transformer Decoder-based LLMs

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Generative Artificial Intelligence (GAI) and Large Language Models (LLMs) revolutionize content creation
  • LLMs are resource-intensive and often require cloud hosting, which raises privacy, latency, and usage-limit concerns
  • Edge intelligence has long addressed such concerns by running AI computation in real time on edge resources close to the data source
  • Existing research focuses on conventional AI models, leaving a gap in addressing LLM inference characteristics
  • Researchers formulate an edge intelligence optimization problem tailored for LLM inference on resource-limited edge devices
  • Approach maximizes inference throughput through batching techniques and model quantization
  • Optimization process considers edge resource limitations and user requirements for latency and accuracy
  • Researchers propose an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) to solve this NP-hard problem within a feasible time complexity
  • Simulation results show DFTSP outperforms other batching benchmarks in throughput and reduces time complexity by over 45% compared with brute-force search
  • Zhang et al.'s work improves the efficiency of LLM inference at the edge through batch scheduling, model quantization, and joint allocation of communication and computation resources
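
To make the interplay between quantization and batching in the key points concrete, here is a minimal back-of-the-envelope sketch. It is not taken from the paper: the function names, the 7-billion-parameter model, the 16 GiB device, and the assumption of a standard multi-head-attention decoder with a 16-bit KV cache are all illustrative. It only shows why fewer bits per weight leave room for a larger batch on a memory-limited device.

```python
def weight_memory_bytes(num_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store the model weights at a given precision."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes_per_token(num_layers: int, hidden_size: int,
                             bytes_per_value: int = 2) -> int:
    """KV-cache bytes for one token: one key and one value vector per layer.

    Assumes plain multi-head attention (no grouped-query attention).
    """
    return 2 * num_layers * hidden_size * bytes_per_value


def max_batch_size(device_gib: float, num_params: float, bits_per_weight: int,
                   num_layers: int, hidden_size: int,
                   tokens_per_request: int) -> int:
    """Largest number of equal-length requests whose KV caches fit next to the weights."""
    free_bytes = device_gib * 2**30 - weight_memory_bytes(num_params, bits_per_weight)
    if free_bytes <= 0:
        return 0  # the weights alone do not fit on the device
    per_request = tokens_per_request * kv_cache_bytes_per_token(num_layers, hidden_size)
    return int(free_bytes // per_request)


if __name__ == "__main__":
    # Hypothetical model and device, chosen only for illustration.
    for bits in (16, 8, 4):
        batch = max_batch_size(device_gib=16, num_params=7e9, bits_per_weight=bits,
                               num_layers=32, hidden_size=4096,
                               tokens_per_request=1024)
        print(f"{bits:>2}-bit weights -> up to {batch} requests per batch")
```

Under these assumed numbers the memory budget, not compute, caps the batch size, which is one reason quantization level and batch scheduling are worth optimizing jointly.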

Authors: Xinyuan Zhang, Jiang Liu, Zehui Xiong, Yudong Huang, Gaochang Xie, Ran Zhang

Abstract: Generative Artificial Intelligence (GAI) is taking the world by storm with its unparalleled content creation ability. Large Language Models (LLMs) are at the forefront of this movement. However, the significant resource demands of LLMs often require cloud hosting, which raises issues regarding privacy, latency, and usage limitations. Although edge intelligence has long been utilized to solve these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources, most research has focused on traditional AI models and has left a gap in addressing the unique characteristics of LLM inference, such as considerable model size, auto-regressive processes, and self-attention mechanisms. In this paper, we present an edge intelligence optimization problem tailored for LLM inference. Specifically, with the deployment of the batching technique and model quantization on resource-limited edge devices, we formulate an inference model for transformer decoder-based LLMs. Furthermore, our approach aims to maximize the inference throughput via batch scheduling and joint allocation of communication and computation resources, while also considering edge resource constraints and varying user requirements of latency and accuracy. To address this NP-hard problem, we develop an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) that operates within a feasible time complexity. Simulation results indicate that DFTSP surpasses other batching benchmarks in throughput across diverse user settings and quantization techniques, and it reduces time complexity by over 45% compared to the brute-force searching method.
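
The abstract names a depth-first tree search with online pruning but gives no details, so the following is only a toy sketch of that general idea, not the authors' DFTSP. Everything in it is an assumption made for illustration: the `Request` fields, a memory budget expressed in tokens, a linear latency model (`sec_per_token`), and "throughput" reduced to the number of requests served in one batch. It contrasts exhaustive enumeration with a depth-first search that skips infeasible branches and branches that cannot beat the best batch found so far.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Request:
    tokens: int      # prompt + output tokens; drives memory and latency in this toy model
    deadline: float  # longest batch latency this user tolerates, in seconds


def feasible(batch, mem_capacity, sec_per_token):
    """A batch is feasible if it fits in memory and meets every member's deadline."""
    total = sum(r.tokens for r in batch)
    if total > mem_capacity:
        return False
    latency = total * sec_per_token  # assumed linear latency model
    return all(latency <= r.deadline for r in batch)


def brute_force(requests, mem_capacity, sec_per_token):
    """Check every subset; returns (largest feasible batch, subsets examined)."""
    best, checked = (), 0
    for k in range(len(requests) + 1):
        for batch in combinations(requests, k):
            checked += 1
            if len(batch) > len(best) and feasible(batch, mem_capacity, sec_per_token):
                best = batch
    return best, checked


def dfs_pruned(requests, mem_capacity, sec_per_token):
    """Depth-first include/exclude search with two pruning rules."""
    best, visited = [], 0

    def visit(i, chosen):
        nonlocal best, visited
        visited += 1
        if len(chosen) > len(best):
            best = list(chosen)
        # Bound pruning: even taking every remaining request cannot beat `best`.
        if i == len(requests) or len(chosen) + (len(requests) - i) <= len(best):
            return
        candidate = chosen + [requests[i]]
        # Feasibility pruning: adding requests only raises memory use and latency,
        # so no extension of an infeasible batch can become feasible.
        if feasible(candidate, mem_capacity, sec_per_token):
            visit(i + 1, candidate)
        visit(i + 1, chosen)

    visit(0, [])
    return tuple(best), visited


if __name__ == "__main__":
    reqs = [Request(200, 1.0), Request(300, 0.5), Request(150, 2.0),
            Request(400, 0.8), Request(100, 0.3), Request(250, 1.5)]
    bf_best, bf_checked = brute_force(reqs, mem_capacity=800, sec_per_token=0.001)
    dfs_best, dfs_visited = dfs_pruned(reqs, mem_capacity=800, sec_per_token=0.001)
    print(f"brute force: batch of {len(bf_best)}, {bf_checked} subsets examined")
    print(f"pruned DFS:  batch of {len(dfs_best)}, {dfs_visited} nodes visited")
```

Both searches return a batch of the same size because each pruning rule only discards branches that cannot contain a better answer. How much work pruning saves depends on the instance, and this sketch makes no attempt to reproduce the 45% figure reported in the abstract.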

Submitted to arXiv on 12 May 2024


Results of the summarizing process for the arXiv paper: 2405.07140v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

Generative Artificial Intelligence (GAI) has transformed content creation, and Large Language Models (LLMs) lead this field. However, the resource-intensive nature of LLMs often necessitates cloud hosting, which raises concerns about privacy, latency, and usage restrictions. Edge intelligence has traditionally addressed these challenges by enabling real-time AI computation on edge devices close to data sources, but existing research has focused mainly on conventional AI models, leaving a gap in addressing the unique characteristics of LLM inference.

The team introduces an edge intelligence optimization problem specifically tailored for LLM inference. By applying batching and model quantization on resource-constrained edge devices, they formulate an inference model for transformer decoder-based LLMs. The objective is to maximize inference throughput through batch scheduling and joint allocation of communication and computation resources, subject to edge resource limits and users' differing latency and accuracy requirements. Because this problem is NP-hard, the researchers develop an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) that operates within a feasible time complexity. Simulation results indicate that DFTSP outperforms other batching benchmarks in throughput across different user settings and quantization techniques, and that it reduces time complexity by over 45% compared to brute-force search. Overall, Zhang et al. contribute a way to make LLM inference at the edge more efficient through batch scheduling and model quantization.
Created on 20 Sep. 2024
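
The summary mentions that quantization must be balanced against users' accuracy requirements. As a generic illustration of that trade-off (not the quantization techniques evaluated in the paper, which the metadata does not specify), the sketch below applies simple symmetric uniform quantization to a handful of made-up weights and measures the round-trip error at two bit widths. The function names and the sample values are hypothetical.

```python
def quantize(values, bits):
    """Symmetric uniform quantization: map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between a value and its quantized round trip."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.9, -0.77, 0.05]  # made-up sample values
    for bits in (8, 4):
        print(f"{bits}-bit round-trip error: {max_abs_error(weights, bits):.4f}")
```

Fewer bits shrink the memory footprint but coarsen the grid of representable values, so the reconstruction error grows. A scheduler that honors per-user accuracy requirements would need some measure of this loss as an input.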

