Punica: Multi-Tenant LoRA Serving
AI-generated Key Points
⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.
- Punica is a system designed for serving multiple Low-rank adaptation (LoRA) models in a shared GPU cluster.
- It introduces a new CUDA kernel design that enables batching of GPU operations for different LoRA models.
- This approach enhances GPU efficiency in terms of both memory and computation by allowing a GPU to hold only one copy of the underlying pre-trained model while serving multiple, diverse LoRA models.
- Punica includes a scheduler that consolidates multi-tenant LoRA serving workloads within the shared GPU cluster.
- Evaluations on a fixed-size GPU cluster show that Punica achieves 12 times higher throughput than state-of-the-art LLM serving systems when serving multiple LoRA models, while adding only 2 ms of latency per token.
- The authors of Punica are Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy.
- Punica's open source code repository is available on GitHub at https://github.com/punica-ai/punica.
Authors: Lequn Chen (University of Washington), Zihao Ye (University of Washington), Yongji Wu (Duke University), Danyang Zhuo (Duke University), Luis Ceze (University of Washington), Arvind Krishnamurthy (University of Washington)
Abstract: Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. Punica is open source at https://github.com/punica-ai/punica .
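To make the batching idea in the abstract concrete, here is a minimal sketch of the arithmetic involved, assuming the standard LoRA formulation y = xW + xAB, where W is the shared base weight and (A, B) is a per-model low-rank pair. It is plain Python for illustration only and is not Punica's CUDA kernel; the function names, the `adapters` dictionary, and the tenant ids are hypothetical.

```python
# Illustrative only: plain-Python arithmetic, not Punica's CUDA kernel.

def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def batched_lora_forward(xs, adapter_ids, W, adapters):
    """Compute y = x W + (x A) B for every request in one batch.

    xs          : list of input row vectors, one per request
    adapter_ids : adapter_ids[k] names the LoRA adapter used by request k
    W           : the single shared base weight (n x m)
    adapters    : dict mapping adapter id -> (A, B); A is n x r, B is r x m
    """
    outputs = []
    for x, aid in zip(xs, adapter_ids):
        base = matvec(x, W)              # same base weight for every tenant
        A, B = adapters[aid]             # adapter looked up per request
        delta = matvec(matvec(x, A), B)  # low-rank update of rank r
        outputs.append([b + d for b, d in zip(base, delta)])
    return outputs

# Two hypothetical tenants sharing one 2x2 base weight, with rank-1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "tenant_a": ([[1.0], [0.0]], [[0.0, 1.0]]),
    "tenant_b": ([[0.0], [1.0]], [[2.0, 0.0]]),
}
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ids = ["tenant_a", "tenant_b", "tenant_a"]
ys = batched_lora_forward(xs, ids, W, adapters)
print(ys)  # [[1.0, 3.0], [11.0, 4.0], [5.0, 11.0]]
```

The point of the sketch is that only one copy of `W` exists no matter how many adapters are registered, and requests for different adapters can sit in the same batch. A real GPU implementation would fuse the per-request adapter lookup and the two small matrix products into a kernel rather than looping in Python.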