Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

AI-generated keywords: Cyber Defense Benchmark, Large Language Model Agents, Security Operations Center, Threat Hunting, Performance Evaluation

AI-generated Key Points

The Cyber Defense Benchmark introduces a new method for evaluating the performance of large language model (LLM) agents in cybersecurity operations (SecOps).
The benchmark focuses on assessing how well LLM agents can handle threat hunting within a security operations center (SOC) environment.
It challenges LLM agents with raw Windows event logs and incorporates 106 real attack procedures from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics.
Evaluation of five frontier models reveals significant shortcomings in their performance, with even the best-performing model only correctly flagging approximately 3.8% of malicious events on average.
A passing score, defined as the minimum bar for unsupervised SOC deployment, requires at least 50% recall on every ATT&CK tactic; none of the models meets it.
Current LLMs are poorly suited to open-ended, evidence-driven threat hunting, despite strong performance on curated Q&A security benchmarks.
The Cyber Defense Benchmark highlights limitations and challenges faced by LLM agents in real-world cybersecurity scenarios, emphasizing the need for further advancements to improve their performance.

Authors: Alankrit Chona, Igor Kozlov, Ambuj Kumar

arXiv: 2604.19533v1 (cs.CR)

13 pages, 3 figures, 5 tables. Complete benchmark and hunt traces available on request

License: CC BY-NC-SA 4.0

Abstract: We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.

Submitted to arXiv on 21 Apr. 2026

Results of the summarizing process for the arXiv paper: 2604.19533v1

Comprehensive Summary

The Cyber Defense Benchmark evaluates how well large language model (LLM) agents perform threat hunting, a core task of security operations center (SOC) analysts. The agent receives a database of raw Windows event logs with no guided questions or hints and must identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics, into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent iteratively submits SQL queries to find malicious event timestamps and explicitly flags them; the flags are scored CTF-style against Sigma-rule-derived ground truth.

Five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - were evaluated on 26 campaigns covering 105 of the 106 procedures. All performed poorly: the best model (Claude Opus 4.6) submitted correct flags for only 3.8% of malicious events on average, and no run of any model found all flags. The authors define a passing score as at least 50% recall on every ATT&CK tactic, which they treat as the minimum bar for unsupervised SOC deployment. No model passes: the leader clears the bar on 5 of 13 tactics, and the other four models clear it on none.

These results suggest that current LLMs are poorly suited to open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks. The benchmark shows how far LLM agents remain from real-world SOC work and where further progress is needed.
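The CTF-style scoring described above can be illustrated with a small sketch. The summary does not give the benchmark's exact scoring rules, so everything here is an assumption for illustration: the function name `flag_recall`, exact-match comparison of timestamps, no penalty for wrong flags, and the made-up campaign data.

```python
from statistics import mean


def flag_recall(submitted, ground_truth):
    """Fraction of ground-truth malicious timestamps that the agent flagged.

    Assumption: a flag counts only on an exact timestamp match, and wrong
    flags are simply ignored (the real benchmark may treat them differently).
    """
    truth = set(ground_truth)
    if not truth:
        raise ValueError("campaign has no ground-truth flags")
    return len(truth & set(submitted)) / len(truth)


# Hypothetical campaigns: (timestamps the agent flagged, ground-truth timestamps).
campaigns = [
    (
        ["2026-01-01T00:00:05", "2026-01-01T00:09:00"],
        [
            "2026-01-01T00:00:05",
            "2026-01-01T00:00:07",
            "2026-01-01T00:01:30",
            "2026-01-01T00:02:00",
        ],
    ),
    ([], ["2026-01-02T10:00:00", "2026-01-02T10:00:01"]),
]

per_campaign = [flag_recall(flagged, truth) for flagged, truth in campaigns]
average_recall = mean(per_campaign)
print(per_campaign, average_recall)
```

In this toy data the agent finds one of four flags in the first campaign and none in the second, so the average is pulled down by every campaign in which nothing is found. An "on average" figure such as the reported 3.8% would aggregate per-campaign results in a broadly similar way, though the paper's exact aggregation is not stated here.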

Layman's Summary

Summary

The Cyber Defense Benchmark is a new way to test how well big computer programs can spot cyber attacks. It gives them large piles of computer records, with no hints, and asks them to find the exact moments when something bad happened. The tests use real examples of attacks and show that the programs miss almost all of them. To pass the test, these programs need to get much better at finding signs of attacks on their own.

Definitions

- Cyber Defense Benchmark: A method for testing how well computer programs can find cyber attacks in computer records.
- Large language model (LLM) agents: Big computer programs that can understand and process human language and carry out tasks step by step.
- Cybersecurity operations (SecOps): Activities related to protecting computer systems and networks from cyber threats.
- Threat hunting: Looking for signs of potential cyber threats or attacks within a computer network.
- Security operations center (SOC): A facility where cybersecurity professionals monitor, detect, analyze, and respond to security incidents on a continuous basis.
- MITRE ATT&CK: A framework used in cybersecurity to categorize common tactics and techniques used by attackers.
- Recall rate: The proportion of relevant items that are retrieved by a search or detection system.

The Cyber Defense Benchmark: Evaluating the Performance of Large Language Model Agents in Cybersecurity Operations

In today's digital landscape, cybersecurity is a critical concern for organizations of all sizes. As cyber threats grow more sophisticated, security operations centers (SOCs) are constantly looking for ways to improve their threat detection and response capabilities. One technology that has gained significant attention in recent years is large language models (LLMs). These AI systems have shown impressive performance on tasks such as natural language processing and question answering, including curated Q&A security benchmarks. Whether that performance carries over to open-ended, real-world security work is a separate question. To address it, Alankrit Chona, Igor Kozlov, and Ambuj Kumar published a paper titled "Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps." The benchmark measures how well LLM agents perform threat hunting, one of the core tasks of a SOC analyst, when they must work from raw evidence rather than guided questions.

Understanding The Cyber Defense Benchmark

The Cyber Defense Benchmark presents LLM agents with a database of raw Windows event logs without any guided questions or hints. It wraps 106 real attack procedures from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics, into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent's goal is to iteratively submit SQL queries, discover the timestamps of malicious events, and explicitly flag them. The flags are scored CTF-style against Sigma-rule-derived ground truth. Five frontier models were tested: Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash.
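To make the query-then-flag loop concrete, here is a minimal sketch using Python's standard `sqlite3` module. It is not the benchmark's code: the `events` table, its column names, the sample rows, and the `run_query` helper are all hypothetical, chosen only to show what one hunting step against an in-memory database might look like.

```python
import sqlite3

# Hypothetical schema and rows; the benchmark's real log schema is not
# described in this summary.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, event_id INTEGER, host TEXT, command_line TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2026-01-01T00:00:01", 4624, "host-a", None),  # logon event
        ("2026-01-01T00:00:05", 4688, "host-a", "powershell.exe -enc SQBFAFgA"),
        ("2026-01-01T00:00:09", 4688, "host-b", "notepad.exe report.txt"),
        ("2026-01-01T00:00:12", 4688, "host-b", "powershell.exe -EncodedCommand ZQBjAGgAbwA="),
    ],
)


def run_query(sql, params=()):
    """One agent action: run a SQL query and return all result rows."""
    return conn.execute(sql, params).fetchall()


flags = set()  # timestamps the agent chooses to flag as malicious

# One hunting step: process-creation events (ID 4688) whose command line
# looks like encoded PowerShell. SQLite's LIKE is case-insensitive for ASCII.
rows = run_query(
    "SELECT ts FROM events WHERE event_id = 4688 AND command_line LIKE ? ORDER BY ts",
    ("%powershell%-enc%",),
)
flags.update(ts for (ts,) in rows)
print(sorted(flags))
```

A real episode would repeat this step many times over tens of thousands of rows, with the agent deciding each next query from the previous results. The difficulty the paper reports lies in that open-ended search, not in the mechanics of any single query.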

Key Findings of the Benchmark

The results reveal significant shortcomings in the performance of LLM agents on open-ended threat hunting. Across 26 campaigns covering 105 of the 106 procedures, no run of any model found all flags. Even the best-performing model (Claude Opus 4.6) submitted correct flags for only 3.8% of malicious events on average. The authors define a passing score as at least 50% recall on every ATT&CK tactic, which they treat as the minimum bar for unsupervised SOC deployment. No model met this criterion: Claude Opus 4.6 cleared the threshold on 5 of 13 tactics, and the other four models cleared it on none.
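The pass criterion is strict because it is applied per tactic rather than to an overall average. The sketch below illustrates the rule as stated (at least 50% recall on every tactic); the function names, tactic labels, and counts are invented for the example and do not come from the paper's results.

```python
PASS_THRESHOLD = 0.5  # "at least 50% recall", as stated in the abstract


def tactic_recall(found_by_tactic, total_by_tactic):
    """Recall per tactic: flags found divided by ground-truth flags."""
    return {
        tactic: found_by_tactic.get(tactic, 0) / total
        for tactic, total in total_by_tactic.items()
        if total > 0
    }


def passes(recalls, threshold=PASS_THRESHOLD):
    """A model passes only if every tactic reaches the threshold."""
    return all(recall >= threshold for recall in recalls.values())


# Hypothetical counts for three tactics (not the paper's numbers).
total = {"execution": 10, "persistence": 4, "discovery": 8}
found = {"execution": 6, "persistence": 2, "discovery": 1}

recalls = tactic_recall(found, total)
cleared = sorted(t for t, r in recalls.items() if r >= PASS_THRESHOLD)
verdict = passes(recalls)
print(recalls, cleared, verdict)
```

In this made-up case the model clears two of three tactics but still fails overall, because a single weak tactic is enough to fail. This mirrors how a model can clear the bar on some tactics, as the leader does, and still not pass.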

Implications and Future Directions

The Cyber Defense Benchmark exposes the limitations of LLM agents in realistic cybersecurity scenarios. Its results suggest that current LLMs are poorly suited to open-ended, evidence-driven threat hunting despite their strong performance on curated Q&A security benchmarks. This matters for organizations considering LLM agents for threat detection and response in their SOCs: by the benchmark's own bar, none of the tested models is ready for unsupervised deployment. Future research could focus on techniques or training methods tailored to improving LLM agents' performance on open-ended threat hunting. There is also a need to explore how these systems can be integrated with existing security tools and processes.

In Conclusion

The Cyber Defense Benchmark is an important step towards understanding and evaluating LLM agents' capabilities in realistic SOC threat hunting. The study sheds light on key limitations of these systems and on how much progress is still needed. As cyber threats continue to evolve, it is crucial to keep evaluating and improving LLM agents before relying on them in security operations.

Created on 26 May. 2026
