Technical Debt analysis is becoming increasingly popular in both research and industry, and static code analysis tools are widely adopted to assess code quality. However, empirical studies on software projects are costly and time-consuming, and their results are often hard to compare because each study considers different projects. To address this issue, the Technical Debt Dataset was introduced as a curated collection of project measurement data from 33 Java projects within the Apache Software Foundation. To build it, all commits from defined time frames were analyzed with SonarQube to gather Technical Debt information and with Ptidej to detect code smells. The refactorings applied in each commit were collected with Refactoring Miner, and fault information was taken from issue trackers such as Jira. The SZZ algorithm was then used to identify fault-inducing and fault-fixing commits among the 78K commits of the selected projects. In total, the analysis detected 1.8M SonarQube issues, 38K code smells, 28K faults, and 57K refactorings across the projects. It ran for more than 200 days and is intended to make Technical Debt studies comparable by giving researchers a common dataset. PyDriller was used to extract detailed information from the Git logs, including the files modified in each commit. The dataset also records several sets of code quality information: size-related metrics, complexity metrics such as cyclomatic complexity and cognitive complexity, test coverage details such as the lines not covered by tests, and duplications in terms of duplicated lines and files.
Overall, through the integration of multiple tools like PyDriller and Ptidej alongside comprehensive data collection methods, the Technical Debt Dataset offers a valuable resource for researchers looking to delve deeper into technical debt analysis within software projects. The dataset is made accessible through CSV files and an SQLite database for ease of querying data and aims to foster further advancements in understanding technical debt implications on software development practices.
- Technical Debt analysis is increasingly popular in research and industry
- Various tools for static code analysis are used to assess code quality
- Conducting empirical studies on software projects can be costly and time-consuming
- The Technical Debt Dataset provides curated project measurement data from 33 Java projects within the Apache Software Foundation
- The dataset includes information on SonarQube issues, code smells, faults, refactorings, and fault-inducing commits
- PyDriller was used to extract detailed information from Git logs for file modifications in each commit
- Information collected includes software metrics, complexity metrics, test coverage details, and duplications
- The dataset aims to facilitate comparisons in Technical Debt research by providing a common dataset for researchers to utilize
Summary:
1. People in research and industry want to study and understand technical debt.
2. Tools are used to check how well the code is written.
3. Studying software projects can take a lot of time and money.
4. A special dataset gives information about 33 Java projects from the Apache Software Foundation.
5. This dataset helps researchers compare different technical debt studies easily.
Definitions:
- Technical Debt: The cost of fixing issues in software development that accumulate over time due to shortcuts or poor coding practices.
- Code Quality: How well-written and efficient the code is in a software project.
- Dataset: A collection of data or information organized for analysis or reference.
- Empirical Studies: Research based on observation and measurement of real projects rather than on theory alone.
- Refactorings: Restructuring existing code without changing its external behavior to improve readability or maintainability.
Technical Debt Dataset: A Comprehensive Resource for Technical Debt Analysis
Introduction:
In today's fast-paced software development industry, the pressure to deliver high-quality code quickly often leads teams to accumulate technical debt. Technical debt refers to the additional work that needs to be done in the future due to shortcuts or poor coding practices in the present. It can lead to increased maintenance costs, decreased productivity, and even project failure if not managed properly. As a result, there is a growing interest in understanding and managing technical debt within software projects.
One way of analyzing technical debt is through static code analysis tools, which are becoming increasingly popular among both researchers and industry professionals. However, conducting empirical studies on software projects can be time-consuming and expensive. Additionally, comparing results from different studies can be challenging due to variations in project characteristics.
To address these issues, a team of researchers introduced the Technical Debt Dataset – a curated collection of project measurement data from 33 Java projects within the Apache Software Foundation (ASF). This dataset aims to provide a common ground for researchers to analyze technical debt by offering comprehensive information on various aspects such as code smells, refactoring activities, and fault-inducing commits.
Data Collection Process:
The creation of the Technical Debt Dataset involved analyzing all commits from defined time frames using SonarQube – an open-source platform for continuous inspection of code quality. The tool was used to gather information on technical debt issues such as bugs, vulnerabilities, and code smells across all 33 ASF projects.
Furthermore, Ptidej, a tool suite that includes code smell detection, was employed to identify potential design flaws or bad coding practices within each commit. Information on the refactorings applied in each commit was collected using Refactoring Miner, a tool that detects refactorings performed by developers in a project's commit history.
Fault information, such as bug reports and their fixes, was taken from issue trackers like Jira. The SZZ algorithm was then applied to identify fault-fixing commits and the fault-inducing commits behind them by tracing back through each project's commit history.
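The core idea of SZZ can be illustrated with a small, self-contained sketch. This is not the implementation used to build the dataset; it is a simplified toy that works on one file held in memory and assumes the fault-fixing commits are already known (in practice they are linked through the issue tracker). The function name `szz_single_file` and the `(sha, is_fix, lines)` history format are hypothetical and chosen only for illustration. For every fix, the sketch looks up which earlier commits last wrote the lines that the fix removed or rewrote, and reports those commits as fault-inducing candidates.

```python
import difflib


def szz_single_file(history):
    """Toy SZZ over one file.

    `history` is a list of (sha, is_fix, lines) tuples, oldest commit first,
    where `lines` is the full file content after that commit.
    Returns {fix_sha: [candidate fault-inducing shas]}.
    """
    inducing = {}
    prev_lines, prev_origin = [], []  # prev_origin[i] = commit that last wrote line i
    for sha, is_fix, lines in history:
        matcher = difflib.SequenceMatcher(a=prev_lines, b=lines, autojunk=False)
        origin = [sha] * len(lines)  # default: the line was written by this commit
        touched = set()
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep the commit that originally wrote them.
                origin[j1:j2] = prev_origin[i1:i2]
            elif tag in ("delete", "replace"):
                # Lines removed or rewritten here: remember who wrote them.
                touched.update(prev_origin[i1:i2])
        if is_fix:
            inducing[sha] = sorted(touched)
        prev_lines, prev_origin = lines, origin
    return inducing


# Hypothetical three-commit history: c2 introduces a division by zero, c3 fixes it.
history = [
    ("c1", False, ["int a = 1;", "int b = 2;", "return a + b;"]),
    ("c2", False, ["int a = 1;", "int b = 0;", "return a / b;"]),
    ("c3", True, ["int a = 1;", "int b = 2;", "return a / b;"]),
]
print(szz_single_file(history))  # {'c3': ['c2']}
```

Real SZZ implementations work on a Git repository through blame or annotate information, and common variants add filters, for example ignoring whitespace-only or comment-only changes.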
In total, about 78,000 commits were analyzed, resulting in the detection of 1.8 million SonarQube issues, 38,000 code smells, 28,000 faults, and 57,000 refactorings across the projects. The analysis ran for more than 200 days and aimed to provide a comprehensive dataset for researchers to use in their studies on technical debt.
Data Extraction:
To extract detailed information from Git logs, which contain a record of all changes made to a project, PyDriller was used; this includes the files modified in each commit. In addition, the dataset collects several sets of code quality information: software metrics (e.g., size-related metrics), complexity metrics (e.g., cyclomatic complexity and cognitive complexity), test coverage details (e.g., lines not covered by tests), and duplications (e.g., duplicated lines and files).
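As a rough illustration of this kind of extraction, the sketch below flattens each commit into one row per modified file. It assumes the PyDriller 2.x API (`Repository`, `traverse_commits()`, `modified_files`, `added_lines`, `deleted_lines`), which may differ from the PyDriller version used to build the dataset. The helper names `commit_rows` and `repository_rows` and the row layout are hypothetical, not the dataset's schema. PyDriller is imported lazily so that the flattening helper can be tried with a stand-in commit object even when PyDriller is not installed.

```python
from types import SimpleNamespace


def commit_rows(commit):
    """Flatten one commit into one dict per modified file."""
    return [
        {
            "commit": commit.hash,
            "file": mod.filename,
            "added": mod.added_lines,
            "deleted": mod.deleted_lines,
        }
        for mod in commit.modified_files
    ]


def repository_rows(repo_path):
    """Yield rows for every commit of a local or remote Git repository."""
    # Imported here so commit_rows() stays usable without PyDriller installed.
    from pydriller import Repository

    for commit in Repository(repo_path).traverse_commits():
        yield from commit_rows(commit)


# Stand-in object with the same attribute names, used only for demonstration.
fake_commit = SimpleNamespace(
    hash="abc123",
    modified_files=[
        SimpleNamespace(filename="Foo.java", added_lines=10, deleted_lines=2),
    ],
)
print(commit_rows(fake_commit))
```

With PyDriller installed, `repository_rows("path/to/repo")` would produce the same kind of rows for a real repository; the path here is a placeholder.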
The use of multiple tools like PyDriller and Ptidej alongside comprehensive data collection methods has resulted in a rich dataset that offers valuable insights into technical debt within software projects.
Accessing the Dataset:
The Technical Debt Dataset is available both as CSV files and as an SQLite database for ease of querying. This allows researchers to access and analyze the data according to their specific research needs.
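A query against an SQLite database of this kind might look like the sketch below. The table and column names (`commits`, `issues`, `commit_hash`, `project`, `rule`) are hypothetical and are not taken from the dataset's actual schema; the example builds a tiny in-memory database so that it runs on its own. Against the real database file, one would pass its path to `sqlite3.connect` and adapt the names after inspecting the schema.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE commits (commit_hash TEXT PRIMARY KEY, project TEXT);
        CREATE TABLE issues (commit_hash TEXT, rule TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO commits VALUES (?, ?)",
        [("a1", "project-a"), ("a2", "project-a"), ("b1", "project-b")],
    )
    conn.executemany(
        "INSERT INTO issues VALUES (?, ?)",
        [("a1", "rule-1"), ("a2", "rule-1"), ("a2", "rule-2"), ("b1", "rule-3")],
    )
    return conn


def issues_per_project(conn):
    """Count issues per project by joining issues to their commits."""
    query = """
        SELECT c.project, COUNT(*) AS n_issues
        FROM issues AS i
        JOIN commits AS c ON c.commit_hash = i.commit_hash
        GROUP BY c.project
        ORDER BY c.project
    """
    return conn.execute(query).fetchall()


conn = build_toy_db()
print(issues_per_project(conn))  # [('project-a', 3), ('project-b', 1)]
```

The CSV files can be read in a similar way with Python's standard `csv` module when a database is not needed.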
Implications for Future Research:
The Technical Debt Dataset offers a valuable resource for researchers looking to delve deeper into technical debt analysis within software projects. By providing a common ground for comparison between different studies, it aims to foster further advancements in understanding the implications of technical debt on software development practices.
Conclusion:
Where delivery speed is prioritized over quality in software development, managing technical debt becomes crucial. The Technical Debt Dataset provides an extensive collection of project measurement data that can aid researchers in gaining insights into this issue. With its accessibility and comprehensive nature, it has the potential to drive further advancements in the field of technical debt analysis and help organizations make informed decisions to manage their technical debt effectively.