Using AI/ML to Find and Remediate Enterprise Secrets in Code & Document Sharing Platforms

AI-generated keywords: Software development

AI-generated Key Points

  • Inadvertent exposure of sensitive information in code poses a significant risk to organizations
  • Over 100,000 GitHub repositories have been found containing secrets, highlighting the need for preventive measures
  • Existing solutions rely on heuristic methods like regular expressions but often generate high levels of noise due to false positives
  • AI/ML models are being used to reduce false positives and enhance accuracy in detecting and remediating vulnerabilities
  • Proposal to develop AI/ML models capable of pinpointing secrets in code and automatically remediating them, extending detection capability to free text within document sharing platforms

Authors: Gregor Kerr, David Algorry, Senad Ibraimoski, Peter Maciver, Sean Moran

License: CC BY-NC-SA 4.0

Abstract: We introduce a new challenge to the software development community: 1) leveraging AI to accurately detect and flag up secrets in code and on popular document sharing platforms that are frequently used by developers, such as Confluence, and 2) automatically remediating the detections (e.g. by suggesting password vault functionality). This is a challenging and mostly unaddressed task. Existing methods leverage heuristics and regular expressions, which can be very noisy and therefore increase toil on developers. The next step, modifying the code itself to automatically remediate a detection, is a complex task. We introduce two baseline AI models that have good detection performance and propose an automatic mechanism for remediating secrets found in code, opening up the study of this task to the wider community.

Submitted to arXiv on 03 Jan. 2024


Results of the summarizing process for the arXiv paper: 2401.01754v1

In software development, the inadvertent exposure of sensitive information such as passwords, API tokens, and private keys within code and document sharing platforms poses a significant risk to organizations. Over 100,000 GitHub repositories have been found to contain such secrets, underscoring the urgency of robust preventive measures [6]. Past large-scale breaches have shown the critical need for tools that can effectively detect and mitigate these vulnerabilities [3, 8, 11]. Existing solutions rely on heuristic methods like regular expressions to identify secrets in various environments [4, 5, 10]. While these tools offer a solid foundation for detection, they often generate high levels of noise due to false positive detections caused by faulty regular expressions [2]. To address this, AI/ML models have attracted interest as a way to reduce false positives and improve accuracy [9].
We present a novel challenge to the software development community: developing AI/ML models capable of accurately pinpointing secrets in code and automatically remediating them with integrated replacements. Additionally, we propose extending this detection capability to free text within collaborative document sharing platforms (DSPs). Our approach employs baseline AI/ML models in a human-in-the-loop learning setup to improve performance over time.
We introduce two distinct baseline models, one for code and one for natural language. For code detection, we use a language-agnostic machine learning model trained on labels annotated by subject matter experts (SMEs). For DSPs, our AI/ML model uses the outputs of heuristic tools to generate weak labels that are subsequently re-annotated based on SME input. Furthermore, we outline an OpenRewrite rules-based solution for automatically remediating detected secrets in code.
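The weak-labelling step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the regular expression, the `weak_label` and `apply_sme_review` helpers, the label names, and the sample lines are all assumptions made for illustration. A heuristic pattern proposes candidate secrets (including false positives such as placeholder text), and a simulated SME review overrides the weak labels before any model would be trained on them.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical heuristic: a secret-looking name assigned a quoted literal.
ASSIGNMENT = re.compile(
    r"""(?i)\b(password|passwd|secret|token|api_key)\b\s*[:=]\s*["']([^"']+)["']"""
)


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; a common extra signal for secrets."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def weak_label(line: str) -> Optional[Dict]:
    """Return a weakly labelled candidate if the heuristic fires, else None."""
    match = ASSIGNMENT.search(line)
    if match is None:
        return None
    value = match.group(2)
    return {
        "line": line,
        "value": value,
        "entropy": shannon_entropy(value),
        "weak_label": "secret",  # the heuristic cannot tell placeholders apart
    }


def apply_sme_review(candidates: List[Dict], sme_decisions: Dict[int, str]) -> List[Dict]:
    """Overwrite weak labels with SME decisions, keyed by candidate index."""
    reviewed = []
    for index, candidate in enumerate(candidates):
        label = sme_decisions.get(index, candidate["weak_label"])
        reviewed.append({**candidate, "label": label})
    return reviewed


if __name__ == "__main__":
    lines = [
        'password = "hunter2-Zx9!qL"',
        'password = "<enter your password>"',  # heuristic false positive
        "timeout = 30",
    ]
    candidates = [c for c in map(weak_label, lines) if c is not None]
    # Simulated SME input: the second candidate is only a placeholder.
    for row in apply_sme_review(candidates, {1: "not_secret"}):
        print(row["label"], round(row["entropy"], 2), row["line"])
```

The reviewed rows, rather than the raw heuristic output, would then serve as training data, which is how re-annotation is meant to keep regular-expression noise out of the model.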
By combining innovative AI technologies with expert input and rule-based strategies, we aim to revolutionize the identification and mitigation of security vulnerabilities within software development processes. This pioneering approach not only enhances data protection but also streamlines developer workflows by minimizing false positives and automating remediation tasks.
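To make the idea of rule-based remediation concrete, here is a small Python sketch. It does not use OpenRewrite (a separate refactoring tool) and is not the authors' mechanism: it is a regex-based stand-in, and the `remediate_line` and `remediate_source` helpers, the environment-variable naming scheme, and the use of `os.environ` in place of a real password vault are all assumptions for illustration.

```python
import re
from typing import Set

# Matches a simple `name = "literal"` assignment on a single line.
HARDCODED = re.compile(
    r"""^(?P<indent>\s*)(?P<name>[A-Za-z_]\w*)\s*=\s*(?P<quote>["'])(?P<value>.*?)(?P=quote)\s*$"""
)


def remediate_line(line: str, flagged_names: Set[str]) -> str:
    """Replace a flagged hardcoded literal with an environment lookup."""
    match = HARDCODED.match(line)
    if match is None or match.group("name") not in flagged_names:
        return line
    name = match.group("name")
    env_name = name.upper()  # hypothetical naming convention
    return f'{match.group("indent")}{name} = os.environ["{env_name}"]'


def remediate_source(source: str, flagged_names: Set[str]) -> str:
    """Rewrite every flagged assignment and add `import os` if needed."""
    lines = source.splitlines()
    rewritten = [remediate_line(line, flagged_names) for line in lines]
    if rewritten != lines and "import os" not in lines:
        rewritten.insert(0, "import os")
    return "\n".join(rewritten)


if __name__ == "__main__":
    original = 'db_password = "hunter2"\ntimeout = 30'
    # `flagged_names` would come from the detection model's output.
    print(remediate_source(original, {"db_password"}))
```

A production rule would operate on a syntax tree rather than on raw lines, so that multi-line strings, comments, and formatting are handled safely; the line-based regex here only shows the shape of the transformation.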
Created on 23 Feb. 2024

