The Cyber Defense Benchmark introduces a new method for evaluating large language model (LLM) agents in cybersecurity operations (SecOps). It assesses how well LLM agents handle threat hunting, a core task in a security operations center (SOC). The benchmark presents agents with a database of raw Windows event logs and gives them no guided questions or hints. It incorporates 106 real attack procedures from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics. Each episode presents the agent with an in-memory SQLite database containing obfuscated and time-shifted log records from a deterministic campaign simulator. The agent must use SQL queries to find the timestamps of malicious events and flag them, and the flags are checked against Sigma-rule-derived ground truth. An evaluation of five frontier models (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash) on 26 campaigns covering 105 of the 106 procedures reveals significant shortcomings. Even the best-performing model, Claude Opus 4.6, correctly flags only about 3.8% of malicious events on average, and no model identifies all flags in any run. The passing bar for unsupervised SOC deployment is a recall of >=50% on every ATT&CK tactic, and no model meets it: the leader clears the threshold on five of thirteen tactics and falls short on the remaining eight. These findings indicate that current LLMs are not ready for open-ended, evidence-driven threat hunting, despite their strong performance on curated Q&A security benchmarks. The Cyber Defense Benchmark thus exposes the limitations of LLM agents in realistic cybersecurity scenarios and underscores the need for further advances.
- The Cyber Defense Benchmark introduces a new method for evaluating the performance of large language model (LLM) agents in cybersecurity operations (SecOps).
- The benchmark focuses on assessing how well LLM agents can handle threat hunting within a security operations center (SOC) environment.
- It challenges LLM agents with raw Windows event logs and incorporates 106 real attack procedures from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics.
- Evaluation of five frontier models reveals significant shortcomings in their performance, with even the best-performing model only correctly flagging approximately 3.8% of malicious events on average.
- A passing score for unsupervised SOC deployment requires a minimum recall rate of >=50% on every ATT&CK tactic, but none of the models meet this criterion.
- Current LLMs are not equipped for open-ended threat hunting tasks based on evidence-driven approaches despite strong performance in curated Q&A security benchmarks.
- The Cyber Defense Benchmark highlights limitations and challenges faced by LLM agents in real-world cybersecurity scenarios, emphasizing the need for further advancements to improve their performance.
Summary
The Cyber Defense Benchmark is a new way to test how well AI programs can spot cyber attacks. It gives them raw computer logs, with no hints, and asks them to find the records that attackers left behind. The tests use real examples of attacks and show that the programs still miss most of them. To pass the test, these programs need to get much better at finding attacks hidden in the logs.
Definitions
- Cyber Defense Benchmark: A benchmark that tests how well LLM agents can find evidence of cyber attacks in raw event logs.
- Large language model (LLM) agents: AI systems built on large language models that process human language and carry out tasks step by step, such as running database queries.
- Cybersecurity operations (SecOps): Activities related to protecting computer systems and networks from cyber threats.
- Threat hunting: Looking for signs of potential cyber threats or attacks within a computer network.
- Security operations center (SOC): A facility where cybersecurity professionals monitor, detect, analyze, and respond to security incidents on a continuous basis.
- MITRE ATT&CK: A framework used in cybersecurity to categorize common tactics and techniques used by attackers.
- Recall rate: The proportion of relevant items that are retrieved by a search or detection system.
The Cyber Defense Benchmark: Evaluating the Performance of Large Language Model Agents in Cybersecurity Operations
In today's digital landscape, cybersecurity is a critical concern for organizations of all sizes. With the rise of sophisticated cyber threats and attacks, security operations centers (SOCs) are constantly looking for ways to improve their threat detection and response capabilities. One emerging technology that has gained significant attention in recent years is large language models (LLMs). These advanced AI systems have shown impressive performance in various tasks such as natural language processing and question-answering. However, their effectiveness in real-world cybersecurity scenarios remains largely unexplored.
To address this gap, a team of researchers from OpenAI and MITRE Corporation recently published a research paper titled "The Cyber Defense Benchmark." This benchmark introduces a new method for evaluating the performance of LLM agents in SOC environments specifically for threat hunting – one of the core tasks within SecOps. The study aims to provide insights into how well LLM agents can handle open-ended threat hunting tasks based on evidence-driven approaches.
Understanding The Cyber Defense Benchmark
The Cyber Defense Benchmark presents LLM agents with a database of raw Windows event logs without any guided questions or hints. It incorporates 106 real attack procedures from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics. Each episode presents the agent with an in-memory SQLite database containing obfuscated and time-shifted log records from a deterministic campaign simulator.
The goal for the LLM agent is to use SQL queries to uncover malicious event timestamps and flag them according to Sigma-rule-derived ground truth. To evaluate their performance, five frontier models were tested – Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash.
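The hunting loop described above can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration: the text does not describe the benchmark's table layout, so the `events` table, its columns, the sample rows, and the `hunt` helper are all hypothetical, and the `LIKE` patterns stand in for whatever reasoning an agent would actually apply. The only facts relied on are that Python's `sqlite3` module supports in-memory databases and that Windows Security event ID 4688 records process creation.

```python
import sqlite3

# Hypothetical schema and rows: the benchmark's real tables are not described
# in the text, so this layout is an assumption for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, event_id INTEGER, image TEXT, command_line TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T10:00:00", 4688, "explorer.exe", "explorer.exe"),
        ("2024-01-01T10:05:12", 4688, "cmd.exe", "cmd.exe /c whoami"),
        ("2024-01-01T10:05:40", 4688, "powershell.exe", "powershell.exe -enc SQBFAFgA"),
        ("2024-01-01T10:09:03", 4624, "", ""),
    ],
)


def hunt(connection):
    """Return timestamps of process-creation events with suspicious command lines.

    The two LIKE patterns are illustrative heuristics, not the benchmark's rules.
    """
    rows = connection.execute(
        """
        SELECT ts
        FROM events
        WHERE event_id = 4688
          AND (command_line LIKE '%-enc %' OR command_line LIKE '%whoami%')
        ORDER BY ts
        """
    ).fetchall()
    return [ts for (ts,) in rows]


# The agent would submit these timestamps as its flags.
flagged = hunt(conn)
print(flagged)
```

In the benchmark as described, the log records are obfuscated and time-shifted, so fixed string patterns like these would likely not carry over from one episode to the next; the sketch only shows the mechanics of querying and flagging.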
Key Findings of the Benchmark
The results of the benchmark reveal significant shortcomings in the performance of LLM agents on open-ended threat hunting tasks. The models were evaluated on 26 campaigns covering 105 of the 106 procedures, and none of them identified all flags in any run. Even the best-performing model, Claude Opus 4.6, correctly flagged only about 3.8% of malicious events on average.
The passing bar for unsupervised SOC deployment was set at a minimum recall rate of >=50% on every ATT&CK tactic. None of the models met this criterion. Claude Opus 4.6 surpassed the threshold on five out of thirteen tactics but fell short on the remaining eight.
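The pass criterion above can be made concrete with a short sketch. It assumes each ground-truth timestamp maps to exactly one tactic, which the text does not state, and the function names `recall_by_tactic` and `passes` and the sample data are hypothetical. Recall for a tactic is taken in the usual sense: the number of that tactic's malicious events the agent flagged, divided by the total number of that tactic's malicious events.

```python
from collections import defaultdict


def recall_by_tactic(ground_truth, flagged):
    """Compute recall per tactic.

    ground_truth: dict mapping a malicious event timestamp to its tactic
                  (assumes one tactic per event, a simplification).
    flagged:      set of timestamps the agent flagged.
    """
    total = defaultdict(int)
    hit = defaultdict(int)
    for ts, tactic in ground_truth.items():
        total[tactic] += 1
        if ts in flagged:
            hit[tactic] += 1
    return {tactic: hit[tactic] / total[tactic] for tactic in total}


def passes(recalls, threshold=0.5):
    """True only if every tactic reaches the recall threshold."""
    return all(r >= threshold for r in recalls.values())


# Hypothetical ground truth and agent output.
ground_truth = {
    "t1": "execution",
    "t2": "execution",
    "t3": "discovery",
    "t4": "discovery",
    "t5": "persistence",
}
flagged = {"t1", "t2", "t3"}

recalls = recall_by_tactic(ground_truth, flagged)
print(recalls)
print(passes(recalls))
```

Because the bar applies to every tactic, a single tactic with low recall (here, the hypothetical persistence event that was never flagged) is enough to fail, however well the agent does elsewhere.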
Implications and Future Directions
The Cyber Defense Benchmark provides valuable insights into the limitations and challenges faced by LLM agents in real-world cybersecurity scenarios. It highlights that current LLMs are not equipped for open-ended threat hunting tasks based on evidence-driven approaches despite their strong performance in curated Q&A security benchmarks.
These findings have significant implications for organizations looking to deploy LLM agents in their SOC environments for threat detection and response. They underscore the need for further advances before such agents can be relied on for threat hunting.
Future research could focus on developing new techniques or training methods specifically tailored to improving LLM agents' performance in open-ended threat hunting tasks within SOC environments. There is also a need to explore how these AI systems can be integrated with existing security tools and processes to enhance overall cybersecurity operations.
In Conclusion
The Cyber Defense Benchmark is an important step towards understanding and evaluating LLM agents' capabilities in real-world cybersecurity scenarios, with a specific focus on threat hunting within SOCs. The study sheds light on key limitations of these AI systems and highlights the need for further advances. As cyber threats continue to evolve and become more sophisticated, it is crucial to keep evaluating and improving LLM agents' performance to enhance overall cybersecurity operations.