Exploring LLM-based Agents for Root Cause Analysis

AI-generated keywords: Incident management, Cloud-based software systems, Root cause analysis, Automation, Large Language Models

AI-generated Key Points

  • Roy et al. present a study on incident management within cloud-based software systems, submitted to arXiv in March 2024.
  • Incident management plays a crucial role in the software development lifecycle due to the increasing complexity of these systems.
  • Root cause analysis (RCA) is identified as a key aspect of incident management that requires specialized knowledge and experience from on-call engineers.
  • Manual RCA is time-consuming and burdensome for engineers, prompting Roy et al. to propose automation as a solution to streamline the process.
  • Roy et al. evaluate a ReAct agent equipped with retrieval tools, which lets Large Language Models (LLMs) gather additional diagnostic information during RCA.
  • Empirical evaluation using real-world production incidents from Microsoft shows that ReAct performs competitively with strong retrieval and reasoning baselines while improving factual accuracy significantly.
  • Adding the discussions associated with incident reports as extra model input does not yield significant performance improvements according to semantic metrics.
  • A case study conducted with a team at Microsoft demonstrates how integrating external diagnostic services can overcome limitations observed in prior studies and provide practical insights for implementing such systems in operational settings.

Authors: Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan

License: CC BY-SA 4.0

Abstract: The growing complexity of cloud based software systems has resulted in incident management becoming an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience with a team's specific services. Automation of RCA can result in significant savings of time, and ease the burden of incident management on on-call engineers. Recently, researchers have utilized Large Language Models (LLMs) to perform RCA, and have demonstrated promising results. However, these approaches are not able to dynamically collect additional diagnostic information such as incident related logs, metrics or databases, severely restricting their ability to diagnose root causes. In this work, we explore the use of LLM based agents for RCA to address this limitation. We present a thorough empirical evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft. Results show that ReAct performs competitively with strong retrieval and reasoning baselines, but with highly increased factual accuracy. We then extend this evaluation by incorporating discussions associated with incident reports as additional inputs for the models, which surprisingly does not yield significant performance improvements. Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to external diagnostic services that are used by the team for manual RCA. Our results show how agents can overcome the limitations of prior work, and practical considerations for implementing such a system in practice.

Submitted to arXiv on 07 Mar. 2024

Results of the summarizing process for the arXiv paper: 2403.04123v1

Roy et al. study incident management within cloud-based software systems. They highlight the increasing complexity of these systems and the crucial role of incident management in the software development lifecycle. The authors identify root cause analysis (RCA) as a key aspect of incident management that requires specialized knowledge and experience from on-call engineers. Manual RCA is time-consuming and burdens engineers, so Roy et al. propose automation to streamline the process. Previous research applied Large Language Models (LLMs) to RCA, but those approaches cannot dynamically collect additional diagnostic information such as incident-related logs, metrics, or databases. To address this limitation, Roy et al. evaluate a ReAct agent equipped with retrieval tools.

In an empirical evaluation on real-world production incidents from Microsoft, ReAct performs competitively with strong retrieval and reasoning baselines while significantly improving factual accuracy. Incorporating the discussions associated with incident reports as additional input does not yield significant performance improvements according to semantic metrics. Finally, Roy et al. conduct a case study with a team at Microsoft in which the ReAct agent is given access to the external diagnostic services the team uses for manual RCA. This integration shows how agents can overcome limitations observed in prior studies and provides practical insights for implementing such systems in operational settings.
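
As a rough illustration of what a ReAct-style agent with a retrieval tool looks like, the sketch below alternates between a reasoning step and a tool call until the model decides to answer. It is not the authors' implementation: the incident corpus, the `search_incidents` tool, and the `scripted_llm` stand-in (which replaces a real LLM call) are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy corpus of past incidents (invented for illustration only).
HISTORICAL_INCIDENTS = [
    {"title": "Checkout latency spike",
     "root_cause": "connection pool exhausted after config change"},
    {"title": "Login failures in region A",
     "root_cause": "expired TLS certificate on auth gateway"},
]


def search_incidents(query: str) -> str:
    """Toy retrieval tool: return the past incident whose title shares the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(incident: Dict[str, str]) -> int:
        return len(query_words & set(incident["title"].lower().split()))

    best = max(HISTORICAL_INCIDENTS, key=overlap)
    return f"{best['title']}: {best['root_cause']}"


# Registry of tools the agent may call (assumption: a single retrieval tool).
TOOLS: Dict[str, Callable[[str], str]] = {"search_incidents": search_incidents}


def scripted_llm(incident: str, observations: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call; returns (thought, action, action_input)."""
    if not observations:
        return ("I should look for similar past incidents.",
                "search_incidents", incident)
    # Use the root-cause part of the latest observation as the hypothesis.
    hypothesis = observations[-1].split(": ", 1)[1]
    return ("A similar past incident suggests a likely cause.",
            "finish", hypothesis)


def react_rca(incident: str,
              llm: Callable[[str, List[str]], Tuple[str, str, str]] = scripted_llm,
              max_steps: int = 5) -> str:
    """Run a thought -> action -> observation loop until the model finishes."""
    observations: List[str] = []
    for _ in range(max_steps):
        _thought, action, action_input = llm(incident, observations)
        if action == "finish":
            return action_input
        tool = TOOLS.get(action)
        observation = tool(action_input) if tool else f"unknown tool: {action}"
        observations.append(observation)
    return "no root cause identified within step budget"


if __name__ == "__main__":
    print(react_rca("Login failures reported by users in region A"))
```

In a real deployment the scripted function would be replaced by a prompted LLM, and the tool registry could hold additional entries, for example wrappers around a team's diagnostic services as in the case study.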
Created on 05 Jul. 2025
