The untapped potential of AI for security operations
In August 2025, Anthropic announced it had caught a threat actor using Claude in a series of cyber attacks. The large language model (LLM) helped the attacker in their reconnaissance and network compromise, but also in more strategic tasks: choosing which data to exfiltrate, crafting persuasive ransom notes, and even advising on how high to set the ransom demand.
Beyond identifying and preventing malicious usage, Anthropic is set on ensuring that defenders also benefit from its technologies. “We should not cede the cyber advantage derived from AI to attackers and criminals,” it said in a research blog post. “We think the most scalable solution is to build AI systems that empower those safeguarding our digital environments.”
It shared some promising results demonstrating Claude’s ability to identify software vulnerabilities in benchmark tests, but to me, that’s unsurprising. Finding and patching vulnerabilities involves two tasks we know LLMs are good at: identifying patterns and writing code. I work in detection and response, so I’m far more interested in how well AI can detect malicious activity in large log datasets and support analysts in their investigations.

Technical limitations and vendor tradeoffs
As in most industries, AI hype in security operations is at fever pitch. If you believed the marketers, you’d think most security operations centres (SOCs) are in a post-analyst era, delegating detection and response to hordes of autonomous agents that investigate and remediate incidents in a flash.
But most AI implementations I’ve seen to date have been text-based: writing log queries from plain language, summarising detections, and writing tickets to send to users and customers. My problems, especially with the last of these, are:
- LLMs only include detail provided as input. It’s likely that one day they’ll be able to reliably analyse data and provide enriched output, but today such attempts mostly return vague results and hallucinations. For that reason, automated detection summaries are often constrained to the limited context in the detection itself. To generate a full write-up, an analyst must take copious notes on their actions and thought process, provide these in their prompt, and then vet the output - by which point they may as well have written the full ticket themselves.
- LLMs produce long, convoluted output. I’ll point again to Paul Graham’s remark that AI writes like a schoolkid targeting an essay word count. Even with extensive prompting, it’s difficult to convince models to produce write-ups concise enough for security operations, where seconds count and the recipients of escalations are not always technical. Straightforward, easy-to-parse messaging is critical, and most output from current LLMs is the exact opposite.
You could also debate the value of using LLMs to write queries, tickets, and summaries at all, because these are relatively simple tasks. A good analyst can quickly write a log query (if they need to at all, since most investigations revolve around a few common searches), and if they’ve been keeping notes during their investigation then a ticket won’t take very long, either.
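To make the first of these implementations concrete, here is a minimal sketch of plain-language-to-query translation using the Anthropic Python SDK. The model id, field names, and prompt are illustrative assumptions, not a description of any vendor’s feature.

```python
# Minimal sketch: translate an analyst's question into a log query.
# Model id, field list, and prompt are illustrative assumptions only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "You translate analyst questions into SIEM queries. "
    "Return only the query, with no explanation. "
    "Available fields: @timestamp, host.name, user.name, "
    "process.command_line, destination.ip, event.action."
)

def question_to_query(question: str) -> str:
    """Ask the model to turn a plain-language question into a log query."""
    response = client.messages.create(
        model="claude-sonnet-4-5",   # assumed model id for Sonnet 4.5
        max_tokens=300,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text.strip()

# The result is a suggestion, not a verdict - it still needs vetting.
print(question_to_query(
    "Show PowerShell launched by non-admin users on finance hosts in the last 24 hours"
))
```

Even in this form, the output is only a suggestion: the analyst still has to vet the query before running it, which is part of why the time saved is modest.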
Moving beyond detection summaries
My hope is that AI will one day break free of the confines of summaries and suggested verdicts. I can foresee a future where it can reliably support SOC teams in a much more difficult task: detection and analysis across the large log datasets stored in security information and event management (SIEM) and extended detection and response (XDR) platforms.
Detection metadata and indicators of compromise are enough for a finger-in-the-air suggested verdict, but an analyst would come to a similar conclusion after a few seconds skimming the detection page. Sifting through the logs to uncover the full extent of malicious activity and identify root cause can take hours in complex cases, and holds far greater potential for time savings.
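To illustrate the kind of assistance I have in mind, here is a rough sketch of automated scoping across surrounding logs. The search_siem helper, field names, and prompt are hypothetical stand-ins, not a real SIEM API.

```python
# Rough sketch of log scoping around a detection. The search_siem() helper,
# fields, and event cap are hypothetical stand-ins, not any vendor's feature.
import json
from datetime import datetime, timedelta

def search_siem(query: dict) -> list[dict]:
    """Placeholder for a real SIEM search API call."""
    raise NotImplementedError

def gather_context(detection: dict, window_minutes: int = 30) -> list[dict]:
    """Pull events around the detection for the same host and user."""
    ts = datetime.fromisoformat(detection["timestamp"])
    return search_siem({
        "host": detection["host"],
        "user": detection["user"],
        "from": (ts - timedelta(minutes=window_minutes)).isoformat(),
        "to": (ts + timedelta(minutes=window_minutes)).isoformat(),
    })

def build_scoping_prompt(detection: dict, events: list[dict]) -> str:
    """Ask the model to scope the incident and cite the events it relied on,
    so an analyst can verify each conclusion."""
    return (
        "Given this detection and the surrounding log events, list any related "
        "malicious activity, the likely root cause, and the event IDs that "
        "support each conclusion.\n\n"
        f"Detection:\n{json.dumps(detection, indent=2)}\n\n"
        f"Events:\n{json.dumps(events[:200], indent=2)}"  # cap events to control token use
    )
```

The plumbing is straightforward; the open question is whether the model’s conclusions about scope and root cause can be trusted.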
An AI model that could quickly check surrounding logs to add context and identify the full scope of an incident would be invaluable in cutting time to contain and remediate. Notably, Anthropic’s report also showed some promise in this area when Claude was set against Cybench, a benchmark derived from capture the flag (CTF) challenges representative of deeper security operations work like network traffic and malware analysis.
Given ten attempts, Sonnet 4.5 had a 76.5 percent success rate against these challenges - a substantial increase on the 35.9 percent scored by Sonnet 3.7, which was released only a few months ago in February 2025.
This is encouraging, but the limitations of such a study should also be noted. CTFs provide participants with only a small portion of the logs that would be present in a real environment, and Anthropic itself admitted that “real world defensive security is more complicated in practice than our evaluations can capture”, with more complex problems and implementation challenges.
The AI cost conundrum
One also has to wonder about the cost of AI detection and investigation at scale. Most current AI features likely focus on detection metadata precisely to limit the number of tokens consumed in each request. Running similar searches across days of corporate log data is on another scale entirely, and in a detection scenario this would need to happen routinely.
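A back-of-envelope estimate shows why. Every figure below is an illustrative assumption - not any vendor’s pricing or any organisation’s real log volume - but the order of magnitude is the point.

```python
# Back-of-envelope estimate of routinely pushing log data through an LLM.
# All figures are illustrative assumptions, not real prices or volumes.
DAILY_LOG_GB = 50                       # assumed log volume fed to the model per day
TOKENS_PER_BYTE = 0.3                   # rough assumption: roughly 3-4 bytes per token
PRICE_PER_MILLION_INPUT_TOKENS = 3.0    # assumed USD price; varies widely by model

daily_tokens = DAILY_LOG_GB * 1_000_000_000 * TOKENS_PER_BYTE
daily_cost = daily_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"Tokens per day: {daily_tokens:,.0f}")
print(f"Input cost per day: ${daily_cost:,.0f}")
print(f"Input cost per year: ${daily_cost * 365:,.0f}")
```

Even if only a small fraction of log volume were routed through a model, the numbers climb quickly, which helps explain why current features stick to detection metadata.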
This is representative of another problem across the AI sector - and the one that fuels accusations of a bubble. While LLMs are undoubtedly useful, that usage is heavily subsidised by vendors burning through cash, and it remains to be seen whether their widespread use will be financially viable for most organisations when prices reflect the cost of development and operation.
Transparency is another industry-wide issue that crops up in security operations. The old adage in cyber security is that tools are useful, but analysts must always understand how they reach their conclusions and be able to replicate their findings when validation is needed. Trust, but verify.
Some vendors are better than others, but in early AI implementations it’s not unusual to encounter detections or verdicts that seem to exist “because the model said so”. If the tool doesn’t show its workings, the investigating analyst must explore every possible avenue to understand the reason for the detection before performing any real analysis. Therefore, to truly save SOCs time, AI-enabled tools must detail the why as well as the what.
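As a sketch of what that could look like, a verdict format that pairs each conclusion with the evidence and reasoning behind it would let an analyst verify rather than re-investigate. The field names below are purely illustrative.

```python
# Illustrative shape for a verdict that shows its workings: every conclusion
# carries the evidence and reasoning behind it. Field names are assumptions only.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    event_id: str        # log event the conclusion rests on
    query: str           # the search used to find it, so it can be re-run
    observation: str     # what was seen in that event

@dataclass
class Verdict:
    conclusion: str                      # e.g. "credential theft via LSASS dump"
    confidence: str                      # e.g. "high" / "medium" / "low"
    reasoning: str                       # why the evidence supports the conclusion
    evidence: list[Evidence] = field(default_factory=list)
```

The specifics matter less than the principle: every conclusion should point back to evidence an analyst can re-run.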
The pace of change
I’m not a complete AI sceptic - the progress from Sonnet 3.7 to 4.5 in just a few months shows genuine momentum and potential for wider applications in security operations. But current implementations don’t match the level of buzz, and significant hurdles remain before AI delivers on vendor promises.
The vision of an autonomous SOC is compelling, but it’s still countless iterations away. In the meantime, security teams should approach AI with the same rigour applied to all tools: understand how it works, verify its conclusions, and recognise its limitations. AI will transform security operations, but it will be a gradual process, not an overnight revolution.