Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

AI-generated keywords: Cyber Defense Benchmark, Large Language Model Agents, Security Operations Center, Threat Hunting, Performance Evaluation

AI-generated Key Points

  • The Cyber Defense Benchmark introduces a new method for evaluating the performance of large language model (LLM) agents in cybersecurity operations (SecOps).
  • The benchmark focuses on assessing how well LLM agents can handle threat hunting within a security operations center (SOC) environment.
  • It challenges LLM agents with raw Windows event logs and incorporates 106 real attack procedures from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics.
  • All five frontier models evaluated perform poorly: even the best one (Claude Opus 4.6) correctly flags only 3.8% of malicious events on average.
  • The passing bar for unsupervised SOC deployment is at least 50% recall on every ATT&CK tactic, and no model meets it.
  • Despite strong performance on curated Q&A security benchmarks, current LLMs are poorly suited for open-ended, evidence-driven threat hunting.
  • The Cyber Defense Benchmark highlights limitations and challenges faced by LLM agents in real-world cybersecurity scenarios, emphasizing the need for further advancements to improve their performance.

Authors: Alankrit Chona, Igor Kozlov, Ambuj Kumar

13 pages, 3 figures, 5 tables. Complete benchmark and hunt traces available on request
License: CC BY-NC-SA 4.0

Abstract: We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four models clear it on none. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.
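To make the task format in the abstract concrete, here is a minimal sketch of one hunting step: an in-memory SQLite database of log records, one SQL query, a set of flagged timestamps, and exact-match scoring. It is an illustration only, not the benchmark's code. The table name `events`, its columns, the sample rows, the ground-truth set, and the helpers `hunt` and `score` are all assumptions made for this example. The real environment is described as a Gymnasium environment with 75,000-135,000 records per episode.

```python
import sqlite3

# Hypothetical schema and rows; the benchmark's real table layout is not given in the abstract.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, event_id INTEGER, image TEXT, command_line TEXT)"
)
rows = [
    ("2026-01-05T10:00:01Z", 4688, "C:\\Windows\\explorer.exe", "explorer.exe"),
    ("2026-01-05T10:02:17Z", 4688, "C:\\Windows\\System32\\rundll32.exe",
     "rundll32.exe comsvcs.dll, MiniDump 624 C:\\temp\\l.dmp full"),
    ("2026-01-05T10:03:40Z", 4688, "C:\\Windows\\System32\\notepad.exe", "notepad.exe"),
    ("2026-01-05T10:05:02Z", 4688, "C:\\Windows\\System32\\vssadmin.exe",
     "vssadmin.exe delete shadows /all /quiet"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Assumed ground truth: timestamps that a Sigma-style rule would mark as malicious.
ground_truth = {"2026-01-05T10:02:17Z", "2026-01-05T10:05:02Z"}


def hunt(conn):
    """One toy hunting step: query for a suspicious pattern and return timestamps to flag."""
    cur = conn.execute(
        "SELECT ts FROM events WHERE command_line LIKE ?", ("%comsvcs.dll%",)
    )
    return {ts for (ts,) in cur.fetchall()}


def score(flags, truth):
    """CTF-style scoring: only exact timestamp matches count as correct flags."""
    correct = flags & truth
    recall = len(correct) / len(truth) if truth else 0.0
    return len(correct), recall


flags = hunt(conn)
correct, recall = score(flags, ground_truth)
print(correct, recall)  # 1 0.5
```

In this toy run the single query finds one of the two malicious events, so recall is 0.5. An actual agent would issue many such queries iteratively, with no hint about which patterns to look for.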

Submitted to arXiv on 21 Apr. 2026

Results of the summarizing process for the arXiv paper: 2604.19533v1

The Cyber Defense Benchmark is a new method for evaluating large language model (LLM) agents in security operations (SecOps). It measures how well an agent performs the core security operations center (SOC) task of threat hunting: given a database of raw Windows event logs, with no guided questions or hints, the agent must identify the exact timestamps of malicious events.

The benchmark incorporates 106 real attack procedures from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent iteratively submits SQL queries to uncover malicious event timestamps and explicitly flags them. Flags are scored CTF-style against Sigma-rule-derived ground truth.

Five frontier models (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash) were evaluated on 26 campaigns covering 105 of the 106 procedures, and all of them performed poorly. The best model, Claude Opus 4.6, correctly flags only 3.8% of malicious events on average, and no run by any model finds all flags. The authors define a passing score, the minimum bar for unsupervised SOC deployment, as at least 50% recall on every ATT&CK tactic. No model passes: the leader clears the bar on 5 of 13 tactics, and the other four models clear it on none.

These findings suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite their strong performance on curated Q&A security benchmarks.
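The pass criterion described above (at least 50% recall on every ATT&CK tactic) can be sketched as a per-tactic recall computation. This is a minimal illustration under stated assumptions, not the authors' scoring code. The function names `per_tactic_recall` and `passes`, the tactic labels, and the sample timestamps are hypothetical. Only the 50% threshold comes from the source.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.5  # from the source: at least 50% recall on every tactic


def per_tactic_recall(truth, flagged):
    """Compute recall per tactic.

    truth:   iterable of (tactic, timestamp) pairs for malicious events
    flagged: set of timestamps the agent flagged
    """
    total = defaultdict(int)
    found = defaultdict(int)
    for tactic, ts in truth:
        total[tactic] += 1
        if ts in flagged:
            found[tactic] += 1
    return {tactic: found[tactic] / total[tactic] for tactic in total}


def passes(recalls, threshold=PASS_THRESHOLD):
    """A model passes only if every tactic reaches the threshold."""
    return all(r >= threshold for r in recalls.values())


# Hypothetical ground truth and agent flags, for illustration only.
truth = [
    ("execution", "t1"), ("execution", "t2"),
    ("persistence", "t3"), ("persistence", "t4"),
    ("discovery", "t5"),
]
flagged = {"t1", "t5", "t9"}  # "t9" is a wrong flag and matches nothing

recalls = per_tactic_recall(truth, flagged)
cleared = sum(r >= PASS_THRESHOLD for r in recalls.values())
print(recalls)          # {'execution': 0.5, 'persistence': 0.0, 'discovery': 1.0}
print(cleared, passes(recalls))  # 2 False
```

The example clears the bar on two of three tactics yet still fails overall, which mirrors the structure of the reported result: clearing some tactics is not enough, because the criterion requires every tactic to reach the threshold.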
Created on 26 May. 2026

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.