Punica: Multi-Tenant LoRA Serving

AI-generated keywords: Punica LoRA GPU Efficiency Scheduler

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Punica is a system for serving multiple low-rank adaptation (LoRA) models in a shared GPU cluster.
  • It introduces a new CUDA kernel design that enables batching of GPU operations for different LoRA models.
  • This approach enhances GPU efficiency in terms of both memory and computation by allowing a GPU to hold only one copy of the underlying pre-trained model while serving multiple, diverse LoRA models.
  • Punica includes a scheduler that consolidates multi-tenant LoRA serving workloads within the shared GPU cluster.
  • Evaluations on a fixed-size GPU cluster show that Punica achieves 12 times higher throughput than state-of-the-art LLM serving systems when serving multiple LoRA models, while adding only 2 ms of latency per token.
  • The authors of Punica are Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy.
  • Punica's open source code repository is available on GitHub at https://github.com/punica-ai/punica.
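The memory benefit of holding a single copy of the pre-trained model can be illustrated with a rough calculation. The sketch below is not taken from the paper: the model size, layer count, hidden dimension, LoRA rank, and 16-bit weights are all hypothetical values chosen for illustration. It relies only on the standard fact that a rank-r LoRA adapter on a d_in x d_out weight matrix adds r * (d_in + d_out) parameters.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter on a d_in x d_out weight matrix."""
    return rank * (d_in + d_out)


def shared_base_bytes(base_params, adapter_params, num_models, bytes_per_param=2):
    """Memory when one base model copy is shared and only adapters are replicated."""
    return bytes_per_param * (base_params + num_models * adapter_params)


def dedicated_copies_bytes(base_params, num_models, bytes_per_param=2):
    """Memory when every LoRA model gets its own full (merged) copy of the base model."""
    return bytes_per_param * num_models * base_params


# Hypothetical configuration, assumed for illustration only.
BASE_PARAMS = 7_000_000_000                          # a 7B-parameter base model
ADAPTER_PARAMS = 128 * lora_params(4096, 4096, 16)   # 32 layers x 4 adapted matrices, rank 16
NUM_MODELS = 8

shared = shared_base_bytes(BASE_PARAMS, ADAPTER_PARAMS, NUM_MODELS)
dedicated = dedicated_copies_bytes(BASE_PARAMS, NUM_MODELS)

print(f"shared base model: {shared / 1e9:.1f} GB")
print(f"dedicated copies:  {dedicated / 1e9:.1f} GB")
```

Under these assumed numbers the adapters are a small fraction of the base model, so the shared layout stays close to the cost of a single model regardless of how many LoRA models are served.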

Authors: Lequn Chen (University of Washington), Zihao Ye (University of Washington), Yongji Wu (Duke University), Danyang Zhuo (Duke University), Luis Ceze (University of Washington), Arvind Krishnamurthy (University of Washington)

Abstract: Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. Punica is open source at https://github.com/punica-ai/punica .

Submitted to arXiv on 28 Oct. 2023


Results of the summarizing process for the arXiv paper: 2310.18547v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

Punica is a system for serving multiple low-rank adaptation (LoRA) models in a shared GPU cluster. It introduces a new CUDA kernel design that batches GPU operations across different LoRA models. As a result, a GPU needs to hold only one copy of the underlying pre-trained model while serving multiple, different LoRA models, which significantly improves GPU efficiency in both memory and computation. The system also includes a scheduler that consolidates multi-tenant LoRA serving workloads within the shared GPU cluster. In evaluations on a fixed-size GPU cluster, Punica achieves 12 times higher throughput than state-of-the-art large language model (LLM) serving systems when serving multiple LoRA models, while adding only 2 ms of latency per token. The authors of Punica are Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy; its open source code repository is available on GitHub at https://github.com/punica-ai/punica. In summary, Punica addresses the need to serve multiple LoRA models efficiently through batched GPU operations and workload consolidation, making it a relevant contribution to model adaptation and deployment in shared GPU clusters.
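The summary says the scheduler consolidates workloads but gives no details of the policy. One plausible way to consolidate, shown here purely as an assumption and not as Punica's actual algorithm, is to send each new request to the busiest GPU that still has room, so that lightly loaded GPUs stay empty. The function name `assign_requests` and the `max_batch` capacity limit are hypothetical.

```python
def assign_requests(num_requests, num_gpus, max_batch):
    """Greedy consolidation: place each request on the busiest GPU with spare capacity.

    Returns (placement, loads), where placement[i] is the GPU chosen for
    request i and loads[g] is the final number of requests on GPU g.
    """
    loads = [0] * num_gpus
    placement = []
    for _ in range(num_requests):
        candidates = [g for g in range(num_gpus) if loads[g] < max_batch]
        if not candidates:
            raise RuntimeError("cluster is full")
        # Prefer the highest load; break ties by the lowest GPU index.
        gpu = max(candidates, key=lambda g: (loads[g], -g))
        loads[gpu] += 1
        placement.append(gpu)
    return placement, loads


placement, loads = assign_requests(num_requests=5, num_gpus=3, max_batch=4)
print(placement)  # [0, 0, 0, 0, 1]
print(loads)      # [4, 1, 0]
```

Because requests for different LoRA models can share a batch, a policy like this does not need to separate requests by adapter; it only tracks how full each GPU is.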
Created on 16 Nov. 2023

