LogShrink: Effective Log Compression by Leveraging Commonality and Variability of Log Data

AI-generated keywords: Log data, system execution, log compression, empirical study, commonality and variability analysis

AI-generated Key Points

  • Log data plays a vital role in system execution by capturing events and states.
  • The increasing scale of systems has led to a surge in log data generation, reaching several petabytes per day in production environments.
  • Log compression is crucial to reduce storage burden while enabling comprehensive log analysis.
  • Existing compression methods struggle to effectively leverage the unique characteristics of log data.
  • A comprehensive empirical study identified key characteristics of log data that can enhance the compression process.
  • LogShrink was introduced as an innovative and efficient log compression method that utilizes commonality and variability in log data.
  • LogShrink employs an analyzer based on longest common subsequence and entropy techniques to identify shared patterns and differences within log messages for more compact representations.
  • The clustering-based sequence sampler in LogShrink expedites the identification of commonality and variability in log data.
  • Extensive experiments using 16 public datasets showed that LogShrink outperforms log-specific and general-purpose compression methods, exceeding their compression ratios by 16% to 356% on average while maintaining a reasonable compression speed.

Authors: Xiaoyun Li, Hongyu Zhang, Van-Hoang Le, Pengfei Chen

Accepted by ICSE 2024 Research Track
License: CC BY 4.0

Abstract: Log data is a crucial resource for recording system events and states during system execution. However, as systems grow in scale, log data generation has become increasingly explosive, leading to an expensive overhead on log storage, such as several petabytes per day in production. To address this issue, log compression has become a crucial task in reducing disk storage while allowing for further log analysis. Unfortunately, existing general-purpose and log-specific compression methods have been limited in their ability to utilize log data characteristics. To overcome these limitations, we conduct an empirical study and obtain three major observations on the characteristics of log data that can facilitate the log compression task. Based on these observations, we propose LogShrink, a novel and effective log compression method by leveraging commonality and variability of log data. An analyzer based on longest common subsequence and entropy techniques is proposed to identify the latent commonality and variability in log messages. The key idea behind this is that the commonality and variability can be exploited to shrink log data with a shorter representation. Besides, a clustering-based sequence sampler is introduced to accelerate the commonality and variability analyzer. The extensive experimental results demonstrate that LogShrink can exceed baselines in compression ratio by 16% to 356% on average while preserving a reasonable compression speed.
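The commonality idea in the abstract can be illustrated with a minimal sketch. It assumes log messages are split on whitespace and compared pairwise with a token-level longest common subsequence (LCS). The function names `lcs_tokens` and `variable_tokens` and the sample messages are hypothetical; this is not LogShrink's actual implementation, only an illustration of how an LCS separates the shared part of two messages from the parts that vary.

```python
from typing import List


def lcs_tokens(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to recover the subsequence itself.
    common: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return common


def variable_tokens(tokens: List[str], common: List[str]) -> List[str]:
    """Return the tokens left over after matching `common` in order."""
    leftover: List[str] = []
    k = 0
    for tok in tokens:
        if k < len(common) and tok == common[k]:
            k += 1  # this token belongs to the shared part
        else:
            leftover.append(tok)  # this token differs between messages
    return leftover


# Hypothetical sample messages, made up for illustration.
msg1 = "Received block blk_1 of size 512 from 10.0.0.1".split()
msg2 = "Received block blk_2 of size 1024 from 10.0.0.2".split()

common = lcs_tokens(msg1, msg2)
vars1 = variable_tokens(msg1, common)
vars2 = variable_tokens(msg2, common)
```

In this sketch the shared tokens would need to be stored only once, while each message keeps just its short list of varying tokens, which is the intuition behind "a shorter representation".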

Submitted to arXiv on 18 Sep. 2023

Results of the summarizing process for the arXiv paper: 2309.09479v1

Log data plays a vital role in recording system events and states during execution. As systems grow in scale, log generation has surged, placing a substantial burden on storage that can reach several petabytes per day in production environments. Log compression has therefore become a critical task: it reduces disk usage while keeping the logs available for later analysis. Existing general-purpose and log-specific compression methods, however, make only limited use of the characteristics of log data.

To address these limitations, the authors conducted an empirical study of log data and made three key observations about characteristics that can aid compression. Building on these observations, they propose LogShrink, a log compression method that exploits the commonality and variability present in log data. An analyzer based on longest common subsequence and entropy techniques identifies the shared patterns and the differences within log messages so that they can be stored in a shorter representation. A clustering-based sequence sampler accelerates this analysis.

LogShrink was evaluated on 16 public datasets from diverse software systems. It exceeded two log-specific and three general-purpose compression methods in compression ratio by 16% to 356% on average while maintaining a reasonable compression speed. An ablation study further confirmed that the commonality and variability analyzer and the clustering-based sequence sampling improve both compression ratio and speed.

In summary, the paper contributes an empirical study of real-world log datasets, a novel log compression approach based on commonality and variability analysis, an extensive experimental evaluation showing superior performance, and publicly available source code and experimental data for further research.
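The summary mentions entropy as one of the analyzer's techniques but does not say how it is applied. As a hedged illustration, the sketch below assumes that each field position across a group of similar log messages is scored with Shannon entropy, so that near-constant fields score low and highly variable fields score high. The function names `shannon_entropy` and `column_entropies` and the sample rows are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


def shannon_entropy(values: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def column_entropies(rows: Sequence[Sequence[str]]) -> List[float]:
    """Entropy of each field position, assuming all rows have equal length."""
    # zip(*rows) transposes the rows so each tuple holds one field position.
    return [shannon_entropy(column) for column in zip(*rows)]


# Hypothetical tokenized log messages, made up for illustration.
rows = [
    ["INFO", "blk_1", "512"],
    ["INFO", "blk_2", "512"],
    ["INFO", "blk_3", "1024"],
    ["INFO", "blk_4", "1024"],
]

entropies = column_entropies(rows)
```

Under this assumption, the first field (always `INFO`) has zero entropy and could be stored once, the second field has the highest entropy and must be stored per message, and the third falls in between.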
Created on 08 Apr. 2024
