Accuracy is Not All You Need

AI-generated keywords: Large Language Models, Compression Techniques, Accuracy Evaluation, User Experience, Distance Metrics

AI-generated Key Points

  • The study by Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, and Ramachandran Ramjee examines the compression of Large Language Models (LLMs) with techniques such as quantization.
  • Compressed models are traditionally evaluated by comparing their benchmark accuracy with that of the baseline model.
  • Flips occur when answers that the baseline model gets right become wrong in the compressed model, and vice versa.
  • Analysis across multiple compression techniques, models, and datasets shows that the behavior visible to end users can differ significantly even when accuracy is similar.
  • On MT-Bench, a free-form generative task, compressed models perform significantly worse than baseline models in both qualitative and quantitative evaluations.
  • The researchers propose evaluating compression techniques with distance metrics such as KL-Divergence and flips, in addition to accuracy.

Authors: Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, Ramachandran Ramjee

https://proceedings.neurips.cc/paper_files/paper/2024/hash/e0e956681b04ac126679e8c7dd706b2e-Abstract-Conference.html
License: CC BY 4.0

Abstract: When Large Language Models (LLMs) are compressed using techniques such as quantization, the predominant way to demonstrate the validity of such techniques is by measuring the model's accuracy on various benchmarks. If the accuracies of the baseline model and the compressed model are close, it is assumed that there was negligible degradation in quality. However, even when the accuracy of baseline and compressed model are similar, we observe the phenomenon of flips, wherein answers change from correct to incorrect and vice versa in proportion. We conduct a detailed study of metrics across multiple compression techniques, models and datasets, demonstrating that the behavior of compressed models as visible to end-users is often significantly different from the baseline model, even when accuracy is similar. We further evaluate compressed models qualitatively and quantitatively using MT-Bench and show that compressed models are significantly worse than baseline models in this free-form generative task. Thus, we argue that compression techniques should also be evaluated using distance metrics. We propose two such metrics, KL-Divergence and flips, and show that they are well correlated.
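To make the flips idea concrete, here is a minimal Python sketch under one reading of the abstract: a flip is counted whenever a question's correctness differs between the baseline and the compressed model. The function names and the toy answer lists are hypothetical illustrations, not the authors' code or data.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the reference answers."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def flip_rate(base: Sequence[str], comp: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions whose correctness changes between the two models
    (correct -> incorrect or incorrect -> correct). Assumed definition."""
    flips = sum((b == g) != (c == g) for b, c, g in zip(base, comp, gold))
    return flips / len(gold)


# Toy multiple-choice data (made up for illustration only).
gold       = ["A", "B", "C", "D", "A", "B", "C", "D"]
baseline   = ["A", "B", "C", "D", "A", "B", "A", "A"]  # last two wrong
compressed = ["A", "B", "C", "D", "B", "C", "C", "D"]  # fifth and sixth wrong

print("baseline accuracy:  ", accuracy(baseline, gold))
print("compressed accuracy:", accuracy(compressed, gold))
print("flip rate:          ", flip_rate(baseline, compressed, gold))
```

In this toy data both models score the same accuracy, yet half of the questions change correctness, which is the kind of divergence an accuracy comparison alone would hide.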

Submitted to arXiv on 12 Jul. 2024


Results of the summarizing process for the arXiv paper: 2407.09141v1

In the study "Accuracy is Not All You Need," conducted by Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, and Ramachandran Ramjee, the focus is on the compression of Large Language Models (LLMs) using techniques like quantization. The traditional method of evaluating these compressed models is by comparing their accuracy on various benchmarks to that of the baseline model. If the accuracies are similar, it is generally assumed that there has been minimal degradation in quality. However, even when accuracy levels are comparable between baseline and compressed models, a phenomenon known as flips occurs. Flips refer to instances where correct answers in the baseline model become incorrect in the compressed model and vice versa. The researchers conducted an extensive analysis across multiple compression techniques, models, and datasets to explore how the behavior of compressed models differs from that of baseline models when presented to end-users. Despite similar accuracy levels, they found significant discrepancies in user experience. Additionally, qualitative and quantitative evaluations using MT-Bench revealed that compressed models performed considerably worse than baseline models in free-form generative tasks. To address these issues, the researchers propose evaluating compression techniques using distance metrics in addition to accuracy measurements. They introduce two such metrics: KL-Divergence and flips, which demonstrate a strong correlation. By incorporating these metrics into evaluation processes, a more comprehensive understanding of model performance can be achieved. Overall, the study highlights the importance of considering factors beyond just accuracy when assessing compressed LLMs. By taking into account user experience and employing appropriate distance metrics for evaluation, researchers can gain deeper insights into the true impact of compression techniques on model quality and performance.
Created on 07 May. 2025
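The KL-Divergence metric mentioned in the summary can be sketched as follows. This is an illustration only: it assumes each model exposes a probability distribution over the same answer options for every question, and it averages the per-question KL(baseline || compressed). How the paper itself aggregates the divergence is not stated here, so the helper names, the epsilon guard, and the toy probabilities are all assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same options.
    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero (an assumed numerical guard)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(base_probs: Sequence[Sequence[float]],
            comp_probs: Sequence[Sequence[float]]) -> float:
    """Average per-question KL divergence from baseline to compressed model."""
    total = sum(kl_divergence(p, q) for p, q in zip(base_probs, comp_probs))
    return total / len(base_probs)


# Toy answer-option probabilities for two questions (made up for illustration).
baseline_probs   = [[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]]
compressed_probs = [[0.7, 0.1, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1]]

print("mean KL(baseline || compressed):", mean_kl(baseline_probs, compressed_probs))
```

On the second toy question the most likely option changes from the first to the second, so a small nonzero divergence coincides with a changed answer. Unlike flips, this measure needs no reference answers, only the two models' output distributions.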


