SliceGPT: Compress Large Language Models by Deleting Rows and Columns

AI-generated keywords: SliceGPT, Sparsification, Post-training, Computational Invariance, Transformer Networks

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman address the need to alleviate compute and memory resource costs associated with large language models.
  • The paper introduces SliceGPT, a post-training sparsification scheme that replaces weight matrices with smaller dense matrices to reduce network embedding dimensions.
  • SliceGPT removes up to 25% of model parameters (including embeddings) for LLAMA2-70B, OPT 66B, and Phi-2 models while maintaining 99%, 99%, and 90% of the dense model's zero-shot task performance, respectively.
  • One key advantage of SliceGPT is its ability to enable sliced models to run on fewer GPUs and operate faster without additional code optimization.
  • On 24GB consumer GPUs and 40GB A100 GPUs, SliceGPT reduces the total compute for inference on LLAMA2-70B to 64% and 66% of that of the dense model, respectively.
  • The authors introduce a new insight, computational invariance in transformer networks, which enables SliceGPT and could inspire future approaches for reducing memory and computation demands in pre-trained models.

Authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

22 pages, 8 figures, accepted at ICLR24

Abstract: Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
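
The abstract names computational invariance but does not spell it out. The sketch below is one reading of the idea, under the assumption that it refers to rotating the weights on either side of a normalization layer by an orthogonal matrix without changing the output. The two-layer toy network, the scale-free `rms_norm`, and all variable names are illustrative assumptions, not code from the paper or its repository.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale: divide each row by its root-mean-square.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def random_orthogonal(d, rng):
    # The Q factor of a QR decomposition of a Gaussian matrix is orthogonal.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def two_layer(x, w_in, w_out):
    # Toy stand-in for a block: linear map, normalization, linear map.
    return rms_norm(x @ w_in) @ w_out

rng = np.random.default_rng(0)
d_in, d, d_out = 6, 8, 5
x = rng.standard_normal((4, d_in))
w_in = rng.standard_normal((d_in, d))
w_out = rng.standard_normal((d, d_out))
q = random_orthogonal(d, rng)

y_ref = two_layer(x, w_in, w_out)
# Rotate the hidden space: Q on the way in, Q^T on the way out.
y_rot = two_layer(x, w_in @ q, q.T @ w_out)

max_diff = float(np.max(np.abs(y_ref - y_rot)))
print(max_diff)  # tiny: floating-point round-off only
```

The outputs agree because an orthogonal rotation preserves each row's norm, so the normalization commutes with `q`, and `q @ q.T` then cancels. Under this reading, the freedom to pick any orthogonal `q` is what leaves room to choose a basis in which some dimensions can be dropped.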

Submitted to arXiv on 26 Jan. 2024

Results of the summarizing process for the arXiv paper: 2401.15024v1

This paper's license does not allow us to build upon its content, so the summary below was produced from the paper's metadata rather than the full article.

In their paper titled "SliceGPT: Compress Large Language Models by Deleting Rows and Columns," authors Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman address the substantial compute and memory costs of large language models used in natural language processing. The paper spans 22 pages with 8 figures and has been accepted at ICLR24. The code for SliceGPT is available at https://github.com/microsoft/TransformerCompression.

SliceGPT introduces a novel post-training sparsification scheme that replaces each weight matrix in a model with a smaller (dense) matrix, effectively reducing the embedding dimension of the network. This addresses the need for sparsification as a solution to the resource constraints faced by large language models. Existing sparsification techniques are limited by their need for additional data structures and by the constrained speedup they offer on current hardware. Through extensive experimentation, SliceGPT demonstrates its effectiveness by removing up to 25% of model parameters (including embeddings) for LLAMA2-70B, OPT 66B, and Phi-2 models while maintaining 99%, 99%, and 90% of the dense model's zero-shot task performance, respectively.

One key advantage of SliceGPT is that sliced models run on fewer GPUs and run faster without requiring additional code optimization. For instance, on 24GB consumer GPUs, it reduces total compute for inference on LLAMA2-70B to 64% of that required by the dense model; on 40GB A100 GPUs, it reduces it to 66%. Furthermore, the authors introduce a new insight, computational invariance in transformer networks, which underpins SliceGPT's effectiveness. They hope that this insight will inspire future avenues for reducing memory and computation demands for pre-trained models.
Created on 31 Mar. 2024
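
To make "deleting rows and columns" concrete, here is a minimal sketch under stated assumptions: a toy pair of linear layers (no normalization, not a transformer), a rotation taken from a PCA of the hidden activations, and a hypothetical sliced width `k`. The metadata above does not say how the paper picks its rotation, so the PCA choice and every name here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d, d_out = 256, 16, 16, 8
k = 12  # hypothetical sliced width: keep 75% of the hidden dimension

# Toy inputs that are close to rank 4, so the hidden activations are compressible.
x = rng.standard_normal((n, 4)) @ rng.standard_normal((4, d_in))
x += 0.01 * rng.standard_normal((n, d_in))
w_in = rng.standard_normal((d_in, d))
w_out = rng.standard_normal((d, d_out))

h = x @ w_in
y_dense = h @ w_out

# PCA basis of the hidden activations; eigh sorts eigenvalues ascending,
# so reverse the columns to put the strongest directions first.
_, vecs = np.linalg.eigh(h.T @ h)
q = vecs[:, ::-1]

# Rotate, then delete columns of the first matrix and rows of the second.
w_in_sliced = (w_in @ q)[:, :k]
w_out_sliced = (q.T @ w_out)[:k, :]
y_sliced = x @ w_in_sliced @ w_out_sliced

# Baseline for comparison: zero the last d - k hidden columns without rotating.
h_plain = h.copy()
h_plain[:, k:] = 0.0
h_pca = (h @ q[:, :k]) @ q[:, :k].T

err_pca = np.linalg.norm(h - h_pca)
err_plain = np.linalg.norm(h - h_plain)
rel_err = np.linalg.norm(y_dense - y_sliced) / np.linalg.norm(y_dense)
param_ratio = (w_in_sliced.size + w_out_sliced.size) / (w_in.size + w_out.size)
print(w_in_sliced.shape, w_out_sliced.shape, param_ratio)
```

Both sliced matrices stay dense, so they need no sparse data structures, and in this toy setup the parameter count shrinks by the same factor as the hidden width (`k / d`). Comparing `err_pca` with `err_plain` shows why the rotation matters: deleting coordinates in the original basis discards more of the signal than deleting them in the PCA basis.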
