SliceGPT: Compress Large Language Models by Deleting Rows and Columns

AI-generated keywords: SliceGPT, Sparsification, Post-training, Computational Invariance, Transformer Networks

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Authors Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman address the need to alleviate compute and memory resource costs associated with large language models.
The paper introduces SliceGPT, a post-training sparsification scheme that replaces weight matrices with smaller dense matrices to reduce network embedding dimensions.
SliceGPT removes up to 25% of model parameters (including embeddings) from LLAMA2-70B, OPT 66B, and Phi-2 while retaining 99%, 99%, and 90% of the dense model's zero-shot task performance, respectively.
One key advantage of SliceGPT is its ability to enable sliced models to run on fewer GPUs and operate faster without additional code optimization.
On 24GB consumer GPUs, SliceGPT reduces the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs, it reduces it to 66%.
The authors offer a new insight, computational invariance in transformer networks, which enables SliceGPT and which they hope will inspire future approaches for reducing memory and computation demands in pre-trained models.


Authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

arXiv: 2401.15024v1 - DOI (cs.LG)

22 pages, 8 figures, accepted at ICLR24

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
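The abstract names computational invariance without spelling it out. As a rough illustration, and not the authors' code, the NumPy sketch below checks one form of the idea: rotating the input to an RMSNorm by an orthogonal matrix `Q`, and rotating the following weight matrix by `Q.T`, leaves the output unchanged. The toy sizes, the scale-free `rms_norm` helper, and all variable names are assumptions made for this example.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale: divide each row by its root-mean-square.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)


rng = np.random.default_rng(0)
n, d, d_out = 4, 8, 5              # hypothetical toy sizes
X = rng.normal(size=(n, d))        # n token vectors of width d
W = rng.normal(size=(d, d_out))    # a weight matrix that reads those vectors

# Build a random orthogonal matrix from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

baseline = rms_norm(X) @ W
# Rotate the signal by Q and undo the rotation inside the weights.
rotated = rms_norm(X @ Q) @ (Q.T @ W)

max_abs_diff = float(np.max(np.abs(baseline - rotated)))
print(max_abs_diff)  # differs from zero only by floating-point rounding
```

The check works because an orthogonal rotation preserves each row's norm, so the normalization factor is the same before and after rotation, and `Q @ Q.T` is the identity.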

Submitted to arXiv on 26 Jan. 2024


Results of the summarizing process for the arXiv paper: 2401.15024v1



In their paper "SliceGPT: Compress Large Language Models by Deleting Rows and Columns," authors Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman address the substantial compute and memory costs of large language models used in natural language processing. The paper spans 22 pages with 8 figures and has been accepted at ICLR24. The code for SliceGPT is available at https://github.com/microsoft/TransformerCompression.

SliceGPT is a post-training sparsification scheme that replaces each weight matrix in a model with a smaller (dense) matrix, reducing the embedding dimension of the network. Sparsification is one way to ease the resource constraints of large language models, but existing techniques need additional data structures and offer constrained speedup on current hardware. Through extensive experimentation, the authors show that SliceGPT can remove up to 25% of model parameters (including embeddings) for LLAMA2-70B, OPT 66B, and Phi-2 while maintaining 99%, 99%, and 90% of the dense model's zero-shot task performance, respectively.

Sliced models run on fewer GPUs and run faster without additional code optimization. On 24GB consumer GPUs, SliceGPT reduces the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs, it reduces it to 66%, a slightly smaller saving. The authors also offer a new insight, computational invariance in transformer networks, which enables SliceGPT. They hope it will inspire future avenues for reducing memory and computation demands for pre-trained models.
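To make the "fewer GPUs" claim concrete, here is a back-of-the-envelope sketch. It is not a figure from the paper. It assumes a nominal 70 billion parameters stored in 16 bits (2 bytes) each, counts only the weights, and ignores activations, the key-value cache, and framework overhead. The helper `gpus_needed` is a hypothetical name introduced for this example.

```python
import math


def gpus_needed(n_params, bytes_per_param, gpu_mem_gb):
    # Smallest number of GPUs whose combined memory holds the weights alone.
    weight_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_mem_gb)


dense_params = 70e9                          # assumed nominal size of a 70B model
sliced_params = dense_params * (1 - 0.25)    # 25% of parameters removed

dense_24 = gpus_needed(dense_params, 2, 24)    # 140 GB of weights -> 6 GPUs
sliced_24 = gpus_needed(sliced_params, 2, 24)  # 105 GB of weights -> 5 GPUs
dense_40 = gpus_needed(dense_params, 2, 40)    # 140 GB of weights -> 4 GPUs
sliced_40 = gpus_needed(sliced_params, 2, 40)  # 105 GB of weights -> 3 GPUs

print(dense_24, sliced_24, dense_40, sliced_40)
```

Under these assumptions, removing a quarter of the weights saves one GPU in both the 24GB and 40GB settings. Real deployments need extra headroom beyond the weights, so actual counts may differ.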


Summary
Authors Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman found a way to make big language models use less computing power and memory. They made SliceGPT, which makes a model smaller while keeping most of its ability on different tasks. The smaller models can run on fewer GPUs and run faster without needing extra changes in the code. SliceGPT saves computing power on both consumer GPUs and A100 GPUs. The authors also describe a new idea called "computational invariance" that makes SliceGPT possible and could help make models more efficient in the future.

Definitions
- Authors: People who write books or papers.
- Compute: The amount of calculation a computer has to do to finish a task.
- Memory: A computer's ability to store information.
- Language Models: Programs that understand and generate human language.
- Sparsification: Removing parts of a model, such as some of its weights, so that it is smaller and cheaper to run.
- Parameters: The numerical values (weights) that a model learns during training.
- GPUs: Graphics Processing Units, special computer chips for handling graphics and complex calculations efficiently.
- Inference: Running a trained model to produce outputs, such as generating text.

Introduction: Natural Language Processing (NLP) has seen a significant boost in recent years with the development of large language models such as GPT-3, BERT, and T5. These models have been trained on massive amounts of data and have shown impressive performance on various NLP tasks. However, their success comes at a cost: substantial compute and memory resources are required to train and run them. This poses a challenge for researchers and organizations that do not have access to high-end hardware or cloud computing resources. To address this issue, Saleh Ashkboos et al. propose a post-training sparsification scheme called SliceGPT in their paper "SliceGPT: Compress Large Language Models by Deleting Rows and Columns." The paper has been accepted at the International Conference on Learning Representations (ICLR) 2024 and is available online.

Overview of SliceGPT: The main idea behind SliceGPT is to reduce the embedding dimension of large language models by replacing each weight matrix with a smaller dense matrix. This results in a more compact model that retains most of its performance on downstream tasks. The authors demonstrate the effectiveness of SliceGPT through extensive experimentation on three pre-trained transformer-based models: LLAMA2-70B, OPT 66B, and Phi-2.

Experimental Results: The experiments show that SliceGPT can remove up to 25% of model parameters (including embeddings) while maintaining 99%, 99%, and 90% of the dense model's zero-shot task performance for LLAMA2-70B, OPT 66B, and Phi-2, respectively. Sliced models also run on fewer GPUs and run faster without additional code optimization. For example, on 24GB consumer GPUs, SliceGPT reduces the total compute for inference on LLAMA2-70B to 64% of that required by the dense model. On 40GB A100 GPUs, it reduces it to 66%, a slightly smaller saving.

Insight behind SliceGPT's Effectiveness: The authors introduce a new insight, computational invariance in transformer networks, which enables SliceGPT. In the paper, this refers to the observation that a transformer's weight matrices can be multiplied by an orthogonal matrix without changing the model's output; rows and columns of the rotated matrices can then be deleted. The authors hope that this insight will inspire future avenues for reducing memory and computation demands for pre-trained models.

Availability of Code: The code for SliceGPT is available at https://github.com/microsoft/TransformerCompression. This allows researchers and practitioners to replicate the results presented in the paper and apply SliceGPT to their own models.

Conclusion: Saleh Ashkboos et al.'s paper "SliceGPT: Compress Large Language Models by Deleting Rows and Columns" introduces a promising way to address the resource constraints faced by large language models used in NLP tasks. Through extensive experimentation, the authors show that it reduces model parameters while maintaining most of the dense model's task performance. The availability of code also makes it easier for others to adopt the technique and potentially improve upon it in future research. By reducing memory and computation demands for pre-trained models, SliceGPT could make large language models more accessible and practical for real-world applications.
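The idea of deleting rows and columns can be sketched on a single weight matrix. The example below is a toy illustration under stated assumptions, not the authors' implementation. It uses synthetic activations `X` with a decaying spectrum and picks an orthogonal matrix `Q` from the eigenvectors of `X.T @ X` (a principal-component rotation). It then rotates both the activations and the weights, and keeps only the first `keep` directions. All sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_out = 512, 64, 32   # hypothetical toy sizes
keep = 48                   # keep 75% of the embedding dimension

# Synthetic activations whose energy is concentrated in a few directions.
scales = np.exp(-np.arange(d) / 8.0)
mix, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = (rng.normal(size=(n, d)) * scales) @ mix
W = rng.normal(size=(d, d_out))

# Orthogonal Q: eigenvectors of X^T X, ordered by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
Q = eigvecs[:, np.argsort(eigvals)[::-1]]

X_sliced = (X @ Q)[:, :keep]     # delete columns of the rotated activations
W_sliced = (Q.T @ W)[:keep, :]   # delete the matching rows of the rotated weights

dense_out = X @ W
sliced_out = X_sliced @ W_sliced
rel_err = np.linalg.norm(dense_out - sliced_out) / np.linalg.norm(dense_out)

print(W.shape, "->", W_sliced.shape)
print(float(rel_err))
```

`W_sliced` is a smaller dense matrix, so it needs no sparse data structures. The approximation error is small here only because the synthetic activations were built to have low effective rank; how well this holds for a real model is an empirical question that the paper's experiments address.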

Created on 31 Mar. 2024


Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.