The Technical Debt Dataset

AI-generated keywords: Technical Debt, Dataset, Code Analysis, Empirical Studies, Project Measurement

AI-generated Key Points

  • Technical Debt analysis is increasingly popular in research and industry
  • Various tools for static code analysis are used to assess code quality
  • Conducting empirical studies on software projects can be costly and time-consuming
  • The Technical Debt Dataset provides curated project measurement data from 33 Java projects within the Apache Software Foundation
  • The dataset includes information on SonarQube issues, code smells, faults, refactorings, and fault-inducing commits
  • PyDriller was used to extract detailed information from the Git logs, including the file modifications in each commit
  • Information collected includes software metrics, complexity metrics, test coverage details, and duplications
  • The dataset aims to facilitate comparisons in Technical Debt research by providing a common dataset for researchers to utilize

Authors: Valentina Lenarduzzi, Nyyti Saarimäki, Davide Taibi

The Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE'19), September 18, 2019, Recife, Brazil
License: CC BY-SA 4.0

Abstract: Technical Debt analysis is increasing in popularity as nowadays researchers and industry are adopting various tools for static code analysis to evaluate the quality of their code. Despite this, empirical studies on software projects are expensive because of the time needed to analyze the projects. In addition, the results are difficult to compare as studies commonly consider different projects. In this work, we propose the Technical Debt Dataset, a curated set of project measurement data from 33 Java projects from the Apache Software Foundation. In the Technical Debt Dataset, we analyzed all commits from separately defined time frames with SonarQube to collect Technical Debt information and with Ptidej to detect code smells. Moreover, we extracted all available commit information from the git logs, the refactoring applied with Refactoring Miner, and fault information reported in the issue trackers (Jira). Using this information, we executed the SZZ algorithm to identify the fault-inducing and -fixing commits. We analyzed 78K commits from the selected 33 projects, detecting 1.8M SonarQube issues, 38K code smells, 28K faults and 57K refactorings. The project analysis took more than 200 days. In this paper, we describe the data retrieval pipeline together with the tools used for the analysis. The dataset is made available through CSV files and an SQLite database to facilitate queries on the data. The Technical Debt Dataset aims to open up diverse opportunities for Technical Debt research, enabling researchers to compare results on common projects.

Submitted to arXiv on 02 Aug. 2019

Results of the summarizing process for the arXiv paper: 1908.00827v1

Technical Debt analysis is becoming increasingly popular in both research and industry, and various static code analysis tools are being adopted to assess code quality. However, empirical studies on software projects are costly and time-consuming, and their results are hard to compare because different studies usually consider different projects. To address this, the Technical Debt Dataset was introduced as a curated collection of project measurement data from 33 Java projects of the Apache Software Foundation.

To build the dataset, all commits from separately defined time frames were analyzed with SonarQube to gather Technical Debt information and with Ptidej to detect code smells. The refactorings applied in each commit were collected with Refactoring Miner, and fault information was taken from the projects' issue trackers (Jira). PyDriller was used to extract detailed information from the Git logs, including the file modifications in each commit. The SZZ algorithm was then run on this information to identify fault-inducing and fault-fixing commits. In total, 78K commits from the 33 projects were analyzed, yielding 1.8M SonarQube issues, 38K code smells, 28K faults, and 57K refactorings. The analysis took more than 200 days.

The code quality information collected covers several groups of measures: size-related software metrics, complexity metrics such as Cyclomatic Complexity and cognitive complexity, test coverage details including lines not covered by tests, and duplications in terms of duplicated lines and files.
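The core idea behind SZZ can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified variant in which every line that a fault-fixing commit deletes or modifies is traced back to the commit that last touched it, and those commits are flagged as fault-inducing. The `Commit` class, the `blame` mapping (a stand-in for what `git blame` would report on the parent of the fix), and all file names and hashes are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Commit:
    """Hypothetical, simplified view of a fault-fixing commit."""
    sha: str
    # file path -> line numbers (in the parent version) that the commit
    # deleted or modified
    changed_lines: dict[str, set[int]] = field(default_factory=dict)


def find_fault_inducing(fix: Commit, blame: dict[tuple[str, int], str]) -> set[str]:
    """Return the commits that last touched the lines changed by `fix`.

    `blame` maps (path, line number) in the parent of `fix` to the hash of
    the commit that last modified that line. It is an assumed stand-in for
    real blame data.
    """
    inducing: set[str] = set()
    for path, lines in fix.changed_lines.items():
        for line in lines:
            sha = blame.get((path, line))
            # Skip lines with no blame entry and never blame the fix itself.
            if sha is not None and sha != fix.sha:
                inducing.add(sha)
    return inducing


# Made-up example data: two lines of Foo.java are changed by the fix "f9".
blame = {
    ("Foo.java", 10): "a1",
    ("Foo.java", 11): "b2",
    ("Foo.java", 12): "a1",
}
fix = Commit("f9", {"Foo.java": {10, 11}})
print(sorted(find_fault_inducing(fix, blame)))  # ['a1', 'b2']
```

Real SZZ implementations add further steps that this sketch leaves out, such as linking commits to issue-tracker entries and filtering out candidate commits made after the fault was reported.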
Overall, by combining the output of several tools (SonarQube, Ptidej, Refactoring Miner, PyDriller, and SZZ) for the same set of projects, the Technical Debt Dataset gives researchers a common basis for Technical Debt studies and lets them compare results on the same projects. The dataset is distributed as CSV files and an SQLite database so that the data can be queried directly.
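As an illustration of the kind of query an SQLite distribution makes possible, the following sketch counts issues per project using Python's built-in `sqlite3` module. The table and column names (`commits`, `issues`, `commit_sha`, `project`, `severity`) and the rows are hypothetical and chosen only for the example. They are not the dataset's actual schema, which should be looked up in the dataset documentation.

```python
import sqlite3

# In-memory database with a made-up schema and made-up rows; the real
# dataset would instead be opened with sqlite3.connect(<path to the file>).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE commits (commit_sha TEXT PRIMARY KEY, project TEXT);
    CREATE TABLE issues (
        issue_id   INTEGER PRIMARY KEY,
        commit_sha TEXT,
        severity   TEXT
    );
    INSERT INTO commits VALUES ('a1', 'demo-a'), ('b2', 'demo-a'), ('c3', 'demo-b');
    INSERT INTO issues (commit_sha, severity) VALUES
        ('a1', 'MAJOR'), ('a1', 'MINOR'), ('b2', 'MAJOR'), ('c3', 'MINOR');
    """
)


def issues_per_project(connection):
    """Count issues per project by joining issues to their commits."""
    query = """
        SELECT c.project, COUNT(i.issue_id)
        FROM commits AS c
        LEFT JOIN issues AS i ON i.commit_sha = c.commit_sha
        GROUP BY c.project
        ORDER BY c.project
    """
    return connection.execute(query).fetchall()


print(issues_per_project(conn))  # [('demo-a', 3), ('demo-b', 1)]
```

The `LEFT JOIN` keeps projects whose commits have no issues, which would then be reported with a count of 0 rather than dropped.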
Created on 11 Nov. 2024
