From Guesswork to Grounded
Grounding AI Agents in Real-World Intelligence
The Hidden Dependency Gap in AI Agents
As organizations increasingly delegate critical security decisions to AI systems, we face a fundamental challenge: even state-of-the-art language models lack access to real-time vulnerability databases, supply chain intelligence, and breaking change data. As a result, AI agents confidently recommend nonexistent versions, introduce known vulnerabilities, and even suggest malware-infected packages, all while appearing authoritative.
Traditional upgrade strategies expose similar blind spots. "Latest version" heuristics, in which developers simply upgrade open source components whenever a newer version is available, assume newer means better, ignoring CVE disclosures, stability signals, and the cascade of breaking changes that transform simple updates into multi-week migrations. Meanwhile, ungrounded AI recommendations, regardless of the sophistication of the underlying model, operate on theoretical patterns rather than live security intelligence. Both approaches share a critical flaw: they make decisions without the data that actually matters and without the guardrails to ensure the resulting code is compliant.
In this research, Sonatype demonstrates a different path: AI that is grounded in live intelligence, validated against real registries, and guided by breaking-change analytics governed by policy. When AI operates with this foundation, its capabilities shift from theoretical suggestion engines to trusted, production-grade decision systems.
REAL-TIME INTELLIGENCE FOR AI AGENTS
Give AI agents the context they need to select the best components from the start and maintain the safest dependency versions with Sonatype Guide.
This chapter analyzes nearly 37,000 real dependency upgrades across Maven, npm, PyPI, and NuGet to quantify how ungrounded AI coding agents behave in practice and how security-intelligent governance closes the gap.
This isn't an indictment of AI capabilities. It's a recognition that automation without live intelligence is dangerous at scale.
LLMs Hallucinate Versions at Scale
Across 36,870 upgrade recommendations, 27.76% referenced non-existent versions, including more than 10,000 hallucinated package releases that would never resolve in a live repository. The performance analysis of the LLM strategy (detailed in the LLM Recommendation Generation section of the Methodology below) reveals an interesting finding regarding confidence:
- GPT-5 was 98% accurate when it expressed high confidence
- It expressed high confidence in just 3.68% of recommendations
- Nearly half of all “low confidence” answers were incorrect.
This confidence pattern was observed in a sample of real-world enterprise applications. While production AI systems might decline to answer when uncertain, the core issue remains: package ecosystems evolve constantly. New versions ship hourly. Security vulnerabilities are constantly emerging. No training dataset, however comprehensive, can predict tomorrow's CVE or next week's breaking change.
Sonatype's approach doesn't compete with agentic AI — it completes it. By grounding recommendations in live package registries, proprietary vulnerability and malware data, and breaking change calculations, we achieved zero AI hallucinations across the same 36,870 components. Every recommendation is verified against real repositories. Every upgrade is assessed for actual security impact.
The future isn't choosing between AI and traditional tools. It's AI agents operating with real-time intelligence that teams can trust in production.
Hallucination Rates by Confidence Level
| Confidence | Hallucinated | Valid | Total | Hallucination Rate | Share of Hallucinations |
|---|---|---|---|---|---|
| HIGH | 23 | 1,336 | 1,359 | 1.69% | 0.22% |
| MEDIUM | 4,504 | 18,959 | 23,463 | 19.20% | 44.01% |
| LOW | 5,708 | 6,340 | 12,048 | 47.38% | 55.77% |
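The derived columns in the table above follow directly from the raw counts. A minimal Python sketch, using the hallucinated/valid counts reported per confidence bucket:

```python
# Reproduce the table's derived columns from the raw counts.
counts = {
    "HIGH":   {"hallucinated": 23,    "valid": 1_336},
    "MEDIUM": {"hallucinated": 4_504, "valid": 18_959},
    "LOW":    {"hallucinated": 5_708, "valid": 6_340},
}

total_hallucinated = sum(c["hallucinated"] for c in counts.values())

for level, c in counts.items():
    total = c["hallucinated"] + c["valid"]
    rate = 100 * c["hallucinated"] / total                # rate within bucket
    share = 100 * c["hallucinated"] / total_hallucinated  # share of all hallucinations
    print(f"{level}: total={total:,} rate={rate:.2f}% share={share:.2f}%")
```

Running this reproduces every percentage in the table (e.g., HIGH: 1.69% rate, 0.22% share), and the bucket totals sum to the 36,870 recommendations analyzed.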
Security Improvement by Upgrade Strategy
Software ages like milk, not wine. As new vulnerabilities are discovered and disclosed, older package versions accumulate security debt while newer releases incorporate patches. Every day without upgrading increases exposure. Yet, not all upgrade paths are created equal. We compared four upgrade strategies across 856 enterprise applications. All strategies improved security, but not equally.
- LLM-generated versions (LLM)
- Most Recently Published Version (Latest)
- Sonatype ‘No Breaking Changes’ (NBC)
- Sonatype Best (Best)
Figure 4.2 outlines the mean security score improvement for each application by strategy. Percent improvement is calculated as (total target security - total baseline security) / total baseline security × 100, averaged across vulnerable components from 856 enterprise applications. Security scores aggregate the severity and count of known vulnerability types on a 0–100 scale. For example, an application with 450 baseline points improving to 614 target points represents +36.4% security gain.
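The improvement formula and the worked example above can be expressed directly; a small sketch using the figures from the text:

```python
def percent_improvement(baseline: float, target: float) -> float:
    """Security gain as defined in the report:
    (target - baseline) / baseline * 100."""
    return (target - baseline) / baseline * 100

# Worked example from the text: 450 baseline points -> 614 target points.
gain = percent_improvement(450, 614)
print(f"{gain:.1f}%")  # -> 36.4%
```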
Figure 4.2: Mean Secure Score Improvement Per Application by Strategy
Source: Sonatype
All four strategies improved security, though the gains varied widely. The LLM-generated (LLM) upgrade recommendations show the smallest uplift, suggesting generally newer versions without proper guidance. Sonatype's No Breaking Changes (NBC) delivers a significant improvement while identifying versions that minimize or eliminate breaking changes.
The Latest version strategy also delivers a significant security improvement, but at a high engineering cost, as we will see later. The best overall improvement comes from the Sonatype Best (Best) strategy, which considers component security more holistically (severity in combination) when identifying the best upgrade path.
LLM recommendations present a troubling paradox. While showing an improvement overall, the model degraded security posture for 345 components, recommending newer versions that introduced more vulnerabilities than they resolved. This occurred when the model unknowingly chose versions that:
- Were compromised after its training cutoff
- Carried additional CVEs
- Were newer, but also riskier
Ungrounded AI Agents: Exploring Malware and Protestware Recommendations
The LLM strategy did more than hallucinate versions. It recommended sweetalert2 11.21.2, which is confirmed protestware executing political payloads, as well as color 5.0.1 and color-string 2.1.1, which were compromised in a major supply chain attack. These packages were not obscure edge cases. They were widely downloaded and part of a high-profile security event that occurred after the model’s training data cutoff.
This is the core problem: AI cannot detect threats that happened after it was trained. AI needs real-time intelligence.
While security improvements justify upgrades, the practical question remains: what does it cost? Breaking changes drive developer effort, transforming version bumps into multi-day refactoring projects. The following analysis quantifies these costs across strategies, revealing trade-offs between security gains and implementation burden.
AI cannot detect threats that happened after it was trained — unless given real-time intelligence
Breaking Change Cost Analysis
Security improvements come at a price measured in developer hours and refactoring effort. Across 856 enterprise applications with representative dependency footprints, upgrade strategies impose dramatically different implementation costs.
Figure 4.3 below compares median per‑application upgrade budgets across the four strategies. NBC delivers the lowest-friction path: roughly one engineer-week to modernize an entire app while avoiding destabilizing work. Best still holds costs under $20K and under 200 hours per app, yet it absorbs the additional change needed to drive higher security scores.
Figure 4.3: Upgrade Cost & Effort Per Application
Source: Sonatype
Both outclass the unmanaged options: unconstrained Latest upgrades result in nearly 5x the median spend versus NBC, and LLM-only selections land in the same cost bracket as Best without the significant risk reduction.
Applying a generic ~8% copilot uplift to the same per-app upgrade totals, NBC still modernizes an app for a little over $5K and ~53 hours, while chasing Latest upgrades soaks up nearly $27K and 288 hours — over five times the spend and the engineering time.
That gap isn’t just a bookkeeping line; it’s opportunity cost. Every extra week poured into unmanaged upgrades is a week not spent on security hardening, paying down tech debt, or feature delivery. LLM-only picks land in the same budget band as Best yet lack its curated risk reduction, reinforcing that disciplined Sonatype strategies are the only way to keep upgrade budgets predictable without cannibalizing roadmap work.
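A rough sketch of the uplift arithmetic, assuming the generic ~8% copilot uplift means an 8% reduction applied uniformly to both cost and hours (an interpretation; the report does not spell this out), starting from the per-app Latest figures cited later ($29,516 and 314 hours):

```python
# Assumed: copilot uplift = 8% reduction in effort, applied uniformly.
COPILOT_UPLIFT = 0.08

def with_uplift(value: float) -> float:
    return value * (1 - COPILOT_UPLIFT)

cost_latest = with_uplift(29_516)   # ~$27.2K, matching the "nearly $27K" figure
hours_latest = with_uplift(314)     # ~289 by this arithmetic; the report cites 288
print(f"${cost_latest:,.0f}, {hours_latest:.0f} hours")
```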
How Costs Scale: Organizational Impact
This projection scales each strategy’s median per-application effort across a representative large enterprise portfolio. It illustrates the cumulative impact of decentralized upgrade decisions over time.
Figure 4.4: At Enterprise Scale: Upgrade Cost & Effort
Source: Sonatype
In practice, organizations don’t upgrade every dependency in every application all at once. Instead, they perform ongoing dependency maintenance — small, continuous updates that, across hundreds or thousands of applications, represent a near-constant workload. Without a cohesive strategy, these distributed efforts can quietly accumulate into multimillion-dollar annual costs.
Sonatype’s NBC automation keeps portfolio-level upgrade effort roughly an order of magnitude lower than unmanaged Latest adoption, while achieving a similar security posture. Teams targeting the most secure baseline can adopt Best selectively, reserving deeper migrations for critical systems where maximum vulnerability reduction warrants the additional investment.
Our analysis of 36,870 dependency upgrade recommendations exposes a critical divergence between the promise of autonomous AI agents and the reality of software supply chain security. The data suggests that without access to real-time package registry intelligence, both state-of-the-art LLMs and traditional Latest heuristics fail to balance security risk with engineering effort.
The "Intelligence" Gap in Generative AI
The most alarming finding is not merely that ungrounded AI makes mistakes, but that it makes dangerous ones with high confidence.
The observed 27.8% AI hallucination rate in GPT-5 recommendations confirms that language models, when isolated from live repositories, struggle to distinguish between existing and non-existent software.
More critically, the "AI hallucinations" were not merely harmless version number errors; they included data corruption, protestware, and hijacked packages. This illustrates a fundamental limitation: training data has a cutoff, but supply chain attacks operate in real time. A model trained before a package compromise cannot "know" a version is unsafe without a live feed of vulnerability intelligence.
Furthermore, the LLM strategy delivered the lowest security improvement (+120.4%) of all methods tested. In 345 specific instances, following the AI’s advice actually degraded the component's security posture by introducing more vulnerabilities than it resolved.
sweetalert2 version 11.21.2
Data Corruption & Protestware
This package displays a ‘noWarMessageForRussians’ banner on any website using this component when it runs in a browser set to Russian.
color version 5.0.1 & color-string version 2.1.1
Cryptostealer & Hijack
Taken over as part of the chalk/debug campaign, color and color-string were manipulated to extract victims’ cryptocurrency from browser wallets.
The False Economy of "Latest Version"
While the industry often defaults to "always upgrade to latest" as a best practice, our cost analysis reveals this to be a financially inefficient strategy. While "Latest" achieved strong security gains (+267.1%), it did so at a brute-force cost: approximately $29,516 and 314 developer hours per application.
When scaled to a portfolio of 1,500 applications, the "Latest" strategy demands nearly $44.3 million in estimated labor costs. This 5x cost multiplier, compared to intelligent automation, represents a massive opportunity cost; every hour spent resolving breaking changes from an unnecessary major version jump is an hour lost to feature development or debt reduction.
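The portfolio projection is straightforward arithmetic; a minimal sketch using the per-application Latest-strategy cost cited above:

```python
# Scale the per-application Latest-strategy cost to a 1,500-app portfolio.
PER_APP_COST = 29_516    # dollars per application (Latest strategy)
PORTFOLIO_SIZE = 1_500   # representative large enterprise portfolio

portfolio_cost = PER_APP_COST * PORTFOLIO_SIZE
print(f"${portfolio_cost / 1e6:.1f}M")  # -> $44.3M
```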
Sonatype Security Hybrid
You can also take a hybrid approach that puts security first in the version scoring algorithm: when a version has a perfect security score, it recommends NBC; otherwise, it defaults to the Best recommendation.
Grounding AI Agents is the Missing Link
The high accuracy (98%) of GPT-5 in the rare instances (3.68%) where it expressed "High Confidence" suggests that the reasoning capabilities of modern models are sound, but their context is insufficient.
The path forward is not to choose between AI and traditional tools, but to ground autonomous AI agents in verified intelligence. By feeding the model real-time data, including computed breaking changes and enhanced vulnerability and malware intelligence, Sonatype’s approach eliminates AI hallucinations entirely while empowering teams to choose the upgrade path (Best vs. No BC) that aligns with their risk tolerance and budget.
Download the Full Report