AI-generated security reports fail where it matters most

By Craig Nash
Tech writer at All Things Geek. Covers artificial intelligence, semiconductors, and computing hardware.

AI-generated security reports sound like a dream for overworked incident response teams. Cisco tested this dream by feeding raw incident notes to ChatGPT, Claude, and Gemini, asking each to produce formal technical reports. The results looked polished. They were also dangerously unreliable.

Key Takeaways

  • AI-generated security reports contain significant inaccuracies despite appearing polished and professional.
  • Four distinct failure modes plague AI incident reporting: inconsistent research, contradictory conclusions, formatting drift, and context pollution.
  • Recommendations from the same AI model can contradict each other on identical data, ranging from targeted to organization-wide remediation.
  • Human review and accountability remain mandatory—AI cannot replace human expertise in security documentation.
  • The gap between AI’s presentation quality and actual reliability makes it unsuitable for formal incident reports without extensive human oversight.

This is the core problem with relying on AI for security documentation: the polished exterior masks fundamental unreliability. Cisco’s findings expose why AI is not yet ready for the kind of formal, repeatable, high-stakes reporting that security teams depend on.

Why AI-Generated Security Reports Keep Failing

The failures Cisco identified fall into four distinct categories, each rooted in how large language models actually work. Unlike traditional software that follows deterministic rules, LLMs generate output by predicting the next token based on model weights and training data. This probability-driven approach creates inconsistency at scale.
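
To make the point concrete, here is a minimal Python sketch of next-token sampling. It is a toy illustration, not how any of the tested models is implemented: the three-word vocabulary, the scores, and the function names are all made up for the example. It contrasts always picking the single most likely token with sampling from the probability distribution.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(vocab, logits):
    """Deterministic choice: always the highest-scoring token."""
    return vocab[logits.index(max(logits))]


def sample_token(vocab, logits, rng, temperature=1.0):
    """Probabilistic choice: draw one token according to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical candidates for the next word after "Reset passwords for ..."
VOCAB = ["affected", "all", "privileged"]
LOGITS = [2.0, 1.6, 0.5]  # made-up scores, not taken from any real model


def run_samples(n, seed):
    """Draw n tokens with a seeded generator so the demo is repeatable."""
    rng = random.Random(seed)
    return [sample_token(VOCAB, LOGITS, rng) for _ in range(n)]


if __name__ == "__main__":
    print("greedy: ", greedy_token(VOCAB, LOGITS))
    print("sampled:", run_samples(10, seed=0))
```

With these made-up scores, greedy selection always returns the same word, while repeated sampling returns a mix of candidates. Chat models commonly sample rather than pick greedily, which is one reason two runs on identical notes can diverge.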

Inconsistent research and sourcing is the first failure mode. When tasked with generating a report from raw notes, different runs of the same model pull from different underlying data sources. One execution might reference one set of findings; on the next run, identical inputs might produce citations from entirely different sources. For security teams that need repeatable, standardized research outcomes, this is disqualifying. How do you validate a remediation strategy if the threat intelligence behind it changes with each report generation?

Inconsistent conclusions compound the problem. Give an AI model the same incident data twice, and it may suggest a targeted password reset in one report and a full organization-wide reset in another. Cisco warns that models frequently default to whichever recommendation they generate first, which may be poor advice. In incident response, contradictory remediation guidance is worse than no guidance at all—it introduces confusion and delays critical decisions.
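
One way a team could surface this kind of disagreement is to generate the same report several times and compare the headline recommendation from each run. The sketch below is an assumption about how such a check might look, not something Cisco describes: the function names, the exact-match comparison, and the sample strings are all hypothetical.

```python
from collections import Counter


def normalize(recommendation):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(recommendation.lower().split())


def agreement(recommendations):
    """Return (most common recommendation, share of runs that gave it)."""
    if not recommendations:
        raise ValueError("need at least one run")
    counts = Counter(normalize(r) for r in recommendations)
    top, hits = counts.most_common(1)[0]
    return top, hits / len(recommendations)


def needs_human_review(recommendations, threshold=1.0):
    """Flag the incident if fewer than `threshold` of the runs agree."""
    _, share = agreement(recommendations)
    return share < threshold


# Hypothetical headline recommendations from three runs on the same notes.
runs = [
    "Reset passwords for affected accounts",
    "reset passwords for  affected accounts",
    "Reset passwords organization-wide",
]

if __name__ == "__main__":
    top, share = agreement(runs)
    print(f"most common: {top!r} ({share:.0%} of runs)")
    print("needs review:", needs_human_review(runs))
```

Exact string matching is deliberately crude. Two runs can phrase the same advice differently, so a check like this can only flag disagreement for a human to resolve; it cannot decide which recommendation is right.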

The Four Failure Modes in AI Incident Reporting

Inconsistent output format represents a third critical weakness. Because LLMs generate text token by token, the structure and formatting of the final report can vary between runs. Executive summaries might appear in different formats, recommendation sections might shift in length or organization, and standardized layouts dissolve. For security teams that rely on consistent report structures for quality control and compliance documentation, this variability is a significant operational liability.
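
Formatting drift is the easiest of the four problems to check mechanically. The following sketch assumes a hypothetical report template with four required sections and headings marked with "## "; neither the section names nor the heading convention come from Cisco's findings. It reports which sections are missing and whether the ones present appear in the expected order.

```python
# Hypothetical template; a real team would substitute its own section names.
REQUIRED_SECTIONS = ["Executive Summary", "Timeline", "Findings", "Recommendations"]


def section_headings(report):
    """Return heading text for lines that start with '## ' (assumed convention)."""
    return [line[3:].strip() for line in report.splitlines() if line.startswith("## ")]


def check_structure(report, required=REQUIRED_SECTIONS):
    """Report missing sections and whether present ones follow template order."""
    found = section_headings(report)
    missing = [name for name in required if name not in found]
    present = [name for name in found if name in required]
    expected = [name for name in required if name in found]
    # Repeated headings also count as out of order in this simple check.
    return {"missing": missing, "in_order": present == expected}


# A made-up draft with one section missing and two swapped.
draft = """## Executive Summary
Short summary text.
## Findings
What was found.
## Timeline
When it happened.
"""

if __name__ == "__main__":
    print(check_structure(draft))
```

A check like this only confirms that the skeleton matches the template. It says nothing about whether the content under each heading is accurate, which is the harder problem.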

Context drift and pollution seal the case against current AI-generated reports. LLMs operate within a fixed context window—a limit on how much information they can retain during a single session. As longer incident notes are processed, older information gets discarded, potentially losing critical initial instructions or threat context. Worse, when multiple unrelated tasks run in a single session, the model’s context becomes polluted, leading to unpredictable blending of results. A report on one incident might inadvertently incorporate details from another.
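
The loss of early instructions can be illustrated with a simplified sliding-window model. This is a sketch under stated assumptions: real products manage context in different ways, tokens are approximated here by whitespace-separated words, and the function names, session contents, and limits are invented for the example.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit the budget, dropping oldest first."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # everything older than this point is discarded
        kept.append(message)
        used += cost
    return list(reversed(kept))


# Hypothetical session: one instruction followed by three incident notes.
session = [
    "INSTRUCTION: use the standard report template",
    "note: phishing email received by finance",
    "note: credentials entered on fake portal",
    "note: attacker logged in from new device",
]

if __name__ == "__main__":
    for line in fit_to_window(session, max_tokens=20):
        print(line)
```

With a budget of 20 words, the three notes fit but the opening instruction does not, so the model in this toy setup would write its report without ever seeing the template requirement.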

Cisco also found that AI-generated recommendations were frequently duplicative, irrelevant, or entirely unactionable. The model might suggest the same mitigation multiple times, recommend steps that don’t apply to the incident, or propose actions that security teams cannot realistically execute.

What This Means for Security Teams Considering AI Reporting

The temptation to automate incident reporting is real. Security teams are drowning in alerts and incidents. If AI could handle the documentation burden, that would free analysts for higher-value work. But Cisco’s findings make clear that this vision is premature. The efficiency gains that AI promises are negated by the inconsistencies it introduces.

Cisco’s own guidance is unambiguous: human report authors must edit, understand, and take ownership of every word of the final report. This means the human workload doesn’t actually decrease; it shifts. Instead of writing the report from scratch, analysts must now validate, correct, and rewrite AI output. In many cases, that can be at least as much work as writing the report by hand.

The comparison with human-authored reports is stark. A skilled incident analyst produces consistent research, sound conclusions, standardized formatting, and coherent context throughout. These are not luxuries; they are foundational requirements for security documentation that will be reviewed by executives, regulators, and potentially lawyers. AI cannot yet meet these standards reliably.

Can Any AI Model Handle Security Reporting Better Than Others?

Cisco tested multiple leading models—ChatGPT, Claude, and Gemini—and found that all three exhibited the same fundamental failure modes. No single model proved significantly more reliable at generating formal incident reports. This suggests the problem is not specific to any one architecture but rather inherent to how current large language models approach long-form generation under constrained, high-stakes conditions.

The issue is not a lack of intelligence or training data. It is the probabilistic nature of token generation itself. Until LLMs move beyond predicting the next token and toward truly deterministic, verifiable reasoning, these inconsistencies will persist.

What Would It Take to Make AI-Generated Security Reports Viable?

For AI-generated reports to replace human authorship, the technology would need to guarantee consistency across multiple dimensions: identical research outcomes, deterministic conclusions, stable formatting, and reliable context management. Current models fail on all four counts. Future improvements might include hybrid approaches—where AI drafts sections and humans validate each component—or architectural changes that enforce consistency and verifiability. But today’s models are not there yet.

The broader lesson extends beyond security reporting. Any high-stakes documentation that requires consistency, accuracy, and accountability—compliance reports, legal briefs, medical records, financial audits—faces the same fundamental challenge with AI. Polished output is not the same as reliable output.

Why do AI-generated security reports struggle with consistency?

Large language models generate text by predicting the next token based on probability distributions learned during training. This approach naturally produces variation between runs, even with identical inputs. Security reporting demands consistency: the same data should yield the same conclusions and recommendations. AI cannot guarantee this because it is fundamentally probabilistic, not deterministic.

Can humans fix AI-generated security reports by editing them?

Yes, but it defeats the purpose of automation. Cisco found that human authors must review, understand, and take ownership of every word in the final report. This means analysts must validate the research, check the conclusions, verify the recommendations, and ensure the formatting meets standards. In most cases, this editing burden is as heavy as writing the report from scratch.

Are there any AI models that perform better at security incident reporting?

Cisco tested ChatGPT, Claude, and Gemini and found all three exhibited the same failure modes: inconsistent research, contradictory conclusions, formatting drift, and context pollution. No single model demonstrated significantly better performance at generating formal incident reports. The problem appears to be fundamental to how current large language models work, not specific to any one vendor.

The takeaway is clear: AI-generated security reports are not yet ready for production use without extensive human oversight. Cisco’s public findings serve as a cautionary tale for any security team tempted by the promise of automated incident documentation. The polished presentation masks serious reliability gaps. Until AI systems can guarantee consistency and accuracy, human analysts remain indispensable for security reporting.

Edited by the All Things Geek team.

Source: TechRadar
