The Technical Debt Dataset

AI-generated keywords: Technical Debt Dataset, Code Analysis, Empirical Studies, Project Measurement

AI-generated Key Points

Technical Debt analysis is increasingly popular in research and industry
Various tools for static code analysis are used to assess code quality
Conducting empirical studies on software projects can be costly and time-consuming
The Technical Debt Dataset provides curated project measurement data from 33 Java projects within the Apache Software Foundation
The dataset includes information on SonarQube issues, code smells, faults, refactorings, and fault-inducing commits
PyDriller was used to extract detailed information from Git logs for file modifications in each commit
Information collected includes software metrics, complexity metrics, test coverage details, and duplications
The dataset aims to facilitate comparisons in Technical Debt research by providing a common dataset for researchers to utilize

Authors: Valentina Lenarduzzi, Nyyti Saarimäki, Davide Taibi

The Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE'19), September 18, 2019, Recife, Brazil

arXiv: 1908.00827v1 (cs.SE)

License: CC BY-SA 4.0

Abstract: Technical Debt analysis is increasing in popularity as nowadays researchers and industry are adopting various tools for static code analysis to evaluate the quality of their code. Despite this, empirical studies on software projects are expensive because of the time needed to analyze the projects. In addition, the results are difficult to compare as studies commonly consider different projects. In this work, we propose the Technical Debt Dataset, a curated set of project measurement data from 33 Java projects from the Apache Software Foundation. In the Technical Debt Dataset, we analyzed all commits from separately defined time frames with SonarQube to collect Technical Debt information and with Ptidej to detect code smells. Moreover, we extracted all available commit information from the git logs, the refactoring applied with Refactoring Miner, and fault information reported in the issue trackers (Jira). Using this information, we executed the SZZ algorithm to identify the fault-inducing and -fixing commits. We analyzed 78K commits from the selected 33 projects, detecting 1.8M SonarQube issues, 38K code smells, 28K faults and 57K refactorings. The project analysis took more than 200 days. In this paper, we describe the data retrieval pipeline together with the tools used for the analysis. The dataset is made available through CSV files and an SQLite database to facilitate queries on the data. The Technical Debt Dataset aims to open up diverse opportunities for Technical Debt research, enabling researchers to compare results on common projects.

Submitted to arXiv on 02 Aug. 2019

Comprehensive Summary

Technical Debt analysis is becoming increasingly popular in both research and industry, and various static code analysis tools are being adopted to assess code quality. However, empirical studies on software projects are costly and time-consuming, and their results are often difficult to compare because studies consider different projects. To address this, the Technical Debt Dataset was introduced as a curated collection of project measurement data from 33 Java projects of the Apache Software Foundation.

To create the dataset, all commits from separately defined time frames were analyzed with SonarQube to gather Technical Debt information and with Ptidej to detect code smells. The refactorings applied in each commit were collected with Refactoring Miner, and fault information was taken from the issue trackers (Jira). The SZZ algorithm was then run on this information to identify fault-inducing and fault-fixing commits. In total, 78K commits from the selected projects were analyzed, yielding 1.8M SonarQube issues, 38K code smells, 28K faults, and 57K refactorings. The project analysis took more than 200 days.

PyDriller was used to extract detailed information from the Git logs, including the file modifications in each commit. The code quality information collected covers software metrics (size-related metrics), complexity metrics such as Cyclomatic Complexity and cognitive complexity, test coverage details including lines not covered by tests, and duplications in terms of duplicated lines and files.

Overall, by combining several tools, including PyDriller and Ptidej, with a common data retrieval pipeline, the Technical Debt Dataset offers a shared resource for researchers studying technical debt in software projects. The dataset is made available as CSV files and an SQLite database to ease querying, and it aims to let researchers compare results on common projects.
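As a rough illustration of the kind of query an SQLite distribution makes possible, the sketch below counts static-analysis issues per project. The table and column names (`commits`, `sonar_issues`, `project`, `commit_hash`, `rule`, `severity`) and all rows are hypothetical and exist only for this example; the real schema should be taken from the dataset's own documentation.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a HYPOTHETICAL, simplified schema.

    None of these table or column names come from the dataset itself.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE commits (project TEXT, commit_hash TEXT PRIMARY KEY);
        CREATE TABLE sonar_issues (commit_hash TEXT, rule TEXT, severity TEXT);
        """
    )
    # Made-up rows, only so that the query below has something to aggregate.
    conn.executemany(
        "INSERT INTO commits VALUES (?, ?)",
        [("project-a", "a1"), ("project-a", "a2"), ("project-b", "b1")],
    )
    conn.executemany(
        "INSERT INTO sonar_issues VALUES (?, ?, ?)",
        [
            ("a1", "rule-1", "MAJOR"),
            ("a1", "rule-2", "MINOR"),
            ("a2", "rule-1", "MAJOR"),
            ("b1", "rule-3", "CRITICAL"),
        ],
    )
    conn.commit()
    return conn


def issues_per_project(conn: sqlite3.Connection) -> list:
    """Return (project, issue count) pairs, sorted by project name."""
    query = """
        SELECT c.project, COUNT(i.rule) AS n_issues
        FROM commits AS c
        LEFT JOIN sonar_issues AS i ON i.commit_hash = c.commit_hash
        GROUP BY c.project
        ORDER BY c.project
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for project, n_issues in issues_per_project(build_demo_db()):
        print(project, n_issues)
```

With a real copy of the database, `sqlite3.connect(":memory:")` would be replaced by a connection to the downloaded file, and the query adapted to the actual table names.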

Layman's Summary

1. People in research and business want to study and understand technical debt.
2. Tools are used to check how well the code is written.
3. Studying software projects can take a lot of time and money.
4. A special dataset gives information about many Java projects from the Apache Software Foundation.
5. This dataset helps researchers compare different technical debt studies easily.

Definitions

- Technical Debt: The cost of fixing issues in software development that accumulate over time due to shortcuts or poor coding practices.
- Code Quality: How well-written and efficient the code is in a software project.
- Dataset: A collection of data or information organized for analysis or reference.
- Empirical Studies: Research based on observation and measurement rather than theory alone.
- Refactorings: Restructuring existing code without changing its external behavior to improve readability or maintainability.

Technical Debt Dataset: A Comprehensive Resource for Technical Debt Analysis

Introduction

In today's fast-paced software development industry, the pressure to deliver high-quality code quickly has led to a rise in technical debt. Technical debt refers to the additional work that needs to be done in the future because of shortcuts or poor coding practices in the present. If not managed properly, it can lead to increased maintenance costs, decreased productivity, and even project failure. As a result, there is growing interest in understanding and managing technical debt within software projects.

One way of analyzing technical debt is through static code analysis tools, which are becoming increasingly popular among both researchers and industry professionals. However, conducting empirical studies on software projects can be time-consuming and expensive, and comparing results from different studies is difficult because the studies consider different projects.

To address these issues, a team of researchers introduced the Technical Debt Dataset, a curated collection of project measurement data from 33 Java projects within the Apache Software Foundation (ASF). The dataset aims to give researchers common ground for analyzing technical debt by offering information on SonarQube issues, code smells, faults, refactoring activities, and fault-inducing commits.

Data Collection Process

Creating the Technical Debt Dataset involved analyzing all commits from defined time frames with SonarQube, an open-source platform for continuous inspection of code quality. The tool was used to gather information on technical debt issues such as bugs, vulnerabilities, and code smells across all 33 ASF projects. In addition, Ptidej was used to detect code smells, that is, potential design flaws or bad coding practices, in the analyzed commits.

Information on the refactorings applied in each commit was collected with Refactoring Miner, a tool that detects refactorings performed by developers. Fault information was taken from the reports in the issue trackers (Jira). The SZZ algorithm was then run on this information to identify fault-inducing and fault-fixing commits by tracing back through each project's commit history.

In total, about 78,000 commits were analyzed, resulting in the detection of 1.8 million SonarQube issues, 38,000 code smells, 28,000 faults, and 57,000 refactorings across the projects. The analysis took more than 200 days.

Data Extraction

PyDriller was used to extract detailed information from the Git logs, which record all changes made to a project, including the file modifications in each commit. Alongside this, several sets of code quality data were collected: software metrics (e.g., size-related metrics), complexity metrics (e.g., Cyclomatic Complexity and cognitive complexity), test coverage details (e.g., lines not covered by tests), and duplications (e.g., duplicated lines and files).

Accessing the Dataset

The Technical Debt Dataset is available both as CSV files and as an SQLite database, so researchers can query the data according to their specific research needs.

Implications for Future Research

The Technical Debt Dataset is a resource for researchers who want to look more closely at technical debt within software projects. By providing common ground for comparison between studies, it aims to foster a better understanding of the implications of technical debt for software development practices.

Conclusion

As organizations continue to prioritize speed in software development, managing technical debt becomes crucial. The Technical Debt Dataset provides an extensive collection of project measurement data that can help researchers gain insight into this growing issue, and its accessibility may in turn help organizations make informed decisions about managing their technical debt.
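The SZZ step mentioned in the data collection process can be sketched on toy data. This is a simplified illustration of the basic SZZ idea, not the pipeline used to build the dataset: for each fault-fixing commit, take the lines the fix removed or changed, look up which earlier commit last touched each of them (what `git blame` would report), and keep those commits that predate the fault report. The `Commit` class, the function name, and all data below are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Commit:
    """Hypothetical minimal commit record used only in this sketch."""
    sha: str
    date: datetime


def szz_candidates(fix_changed_lines, blame, commits, issue_created):
    """Return sorted SHAs of candidate fault-inducing commits for one fix.

    fix_changed_lines: set of (file, line_no) pairs removed or changed by the fix
    blame: dict mapping (file, line_no) to the SHA that last touched the line
           before the fix (the information git blame would give)
    commits: dict mapping SHA to Commit
    issue_created: datetime at which the fault was reported
    """
    candidates = set()
    for location in fix_changed_lines:
        sha = blame.get(location)
        if sha is None:
            continue  # no blame information for this line
        # A change made after the fault was reported cannot have caused it.
        if commits[sha].date < issue_created:
            candidates.add(sha)
    return sorted(candidates)


if __name__ == "__main__":
    # Toy history, invented for the example.
    commits = {
        "c1": Commit("c1", datetime(2019, 1, 1)),
        "c2": Commit("c2", datetime(2019, 2, 1)),
        "c3": Commit("c3", datetime(2019, 3, 10)),
    }
    blame = {
        ("Foo.java", 10): "c1",
        ("Foo.java", 11): "c2",
        ("Bar.java", 5): "c3",
    }
    fix_changed_lines = {("Foo.java", 10), ("Foo.java", 11), ("Bar.java", 5)}
    reported = datetime(2019, 3, 1)
    print(szz_candidates(fix_changed_lines, blame, commits, reported))
```

In this toy history, `c3` is excluded because it was committed after the fault was reported. Practical SZZ implementations add further filtering, for example ignoring comment-only or whitespace-only changes, which this sketch leaves out.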

Created on 11 Nov. 2024
