From Guesswork to Grounded
Grounding AI Agents in Real-World Intelligence
The Hidden Dependency Gap in AI Agents
As organizations increasingly delegate critical security decisions to AI systems, we face a fundamental challenge: even state-of-the-art language models lack access to real-time vulnerability databases, supply chain intelligence, and breaking change data. As a result, AI agents confidently recommend nonexistent versions, introduce known vulnerabilities, and even suggest malware-infected packages, all while appearing authoritative.
Traditional upgrade strategies expose similar blind spots. "Latest version" heuristics, in which developers simply upgrade open source components whenever a newer version is available, assume newer means better, ignoring CVE disclosures, stability signals, and the cascade of breaking changes that transform simple updates into multi-week migrations. Meanwhile, ungrounded AI recommendations, regardless of the sophistication of the underlying model, operate on theoretical patterns rather than live security intelligence. Both approaches share a critical flaw: they make decisions without the data that actually matters and without the guardrails to ensure the resulting code is compliant.
In this research, Sonatype demonstrates a different path: AI that is grounded in live intelligence, validated against real registries, and guided by breaking-change analytics governed by policy. When AI operates with this foundation, its capabilities shift from theoretical suggestion engines to trusted, production-grade decision systems.
REAL-TIME INTELLIGENCE FOR AI AGENTS
Give AI agents the context they need to select the best components from the start and maintain the safest dependency versions with Sonatype Guide.
This chapter analyzes nearly 37,000 real dependency upgrades across Maven, npm, PyPI, and NuGet to quantify how ungrounded AI coding agents behave in practice and how security-intelligent governance closes the gap.
This isn't an indictment of AI capabilities. It's a recognition that automation without live intelligence is dangerous at scale.
LLMs Hallucinate Versions at Scale
Across 36,870 upgrade recommendations, 27.76% referenced non-existent versions, including more than 10,000 hallucinated package releases that would never resolve in a live repository. The performance analysis of the LLM strategy (detailed in the LLM Recommendation Generation section of the Methodology below) reveals an interesting finding regarding confidence:
- GPT-5 was 98% accurate when it expressed high confidence
- It expressed high confidence in just 3.68% of recommendations
- Nearly half of all “low confidence” answers were incorrect.
This confidence pattern was observed in a sample of real-world enterprise applications. While production AI systems might decline to answer when uncertain, the core issue remains: package ecosystems evolve constantly. New versions ship hourly. Security vulnerabilities are constantly emerging. No training dataset, however comprehensive, can predict tomorrow's CVE or next week's breaking change.
Sonatype's approach doesn't compete with agentic AI — it completes it. By grounding recommendations in live package registries, proprietary vulnerability and malware data, and breaking change calculations, we achieved zero AI hallucinations across the same 36,870 components. Every recommendation is verified against real repositories. Every upgrade is assessed for actual security impact.
The future isn't choosing between AI and traditional tools. It's AI agents operating with real-time intelligence that teams can trust in production.
Hallucination Rates by Confidence Level
| Confidence | Hallucinated | Valid | Total | Hallucination Rate | Share of Hallucinations |
|---|---|---|---|---|---|
| HIGH | 23 | 1,336 | 1,359 | 1.69% | 0.22% |
| MEDIUM | 4,504 | 18,959 | 23,463 | 19.20% | 44.01% |
| LOW | 5,708 | 6,340 | 12,048 | 47.38% | 55.77% |
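The derived columns in the table above follow directly from the raw counts. A minimal Python sketch, using the hallucinated/valid counts reported per confidence bucket:

```python
# Reproduce the table's derived columns from the raw counts.
counts = {
    "HIGH":   {"hallucinated": 23,    "valid": 1_336},
    "MEDIUM": {"hallucinated": 4_504, "valid": 18_959},
    "LOW":    {"hallucinated": 5_708, "valid": 6_340},
}

total_hallucinated = sum(c["hallucinated"] for c in counts.values())

for level, c in counts.items():
    total = c["hallucinated"] + c["valid"]
    rate = 100 * c["hallucinated"] / total                # rate within bucket
    share = 100 * c["hallucinated"] / total_hallucinated  # share of all hallucinations
    print(f"{level}: total={total:,} rate={rate:.2f}% share={share:.2f}%")
```

Running this reproduces every percentage in the table (e.g., HIGH: 1.69% rate, 0.22% share), and the bucket totals sum to the 36,870 recommendations analyzed.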
Security Improvement by Upgrade Strategy
Software ages like milk, not wine. As new vulnerabilities are discovered and disclosed, older package versions accumulate security debt while newer releases incorporate patches. Every day without upgrading increases exposure. Yet, not all upgrade paths are created equal. We compared four upgrade strategies across 856 enterprise applications. All strategies improved security, but not equally.
- LLM-generated versions (LLM)
- Most Recently Published Version (Latest)
- Sonatype ‘No Breaking Changes’ (NBC)
- Sonatype Best (Best)
Figure 4.2 outlines the mean security score improvement for each application by strategy. Percent improvement is calculated as (total target security - total baseline security) / total baseline security × 100, averaged across vulnerable components from 856 enterprise applications. Security scores aggregate the severity and count of known vulnerability types on a 0–100 scale. For example, an application with 450 baseline points improving to 614 target points represents +36.4% security gain.
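The improvement formula and the worked example above can be expressed directly; a small sketch using the figures from the text:

```python
def percent_improvement(baseline: float, target: float) -> float:
    """Security gain as defined in the report:
    (target - baseline) / baseline * 100."""
    return (target - baseline) / baseline * 100

# Worked example from the text: 450 baseline points -> 614 target points.
gain = percent_improvement(450, 614)
print(f"{gain:.1f}%")  # -> 36.4%
```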
Figure 4.2: Mean Secure Score Improvement Per Application by Strategy
Source: Sonatype
All four strategies improved security, though the gains varied widely. The LLM-generated (LLM) upgrade recommendations show the smallest uplift, suggesting generally newer versions without proper guidance. Sonatype's No Breaking Changes (NBC) delivers a significant improvement while identifying versions that minimize or eliminate breaking changes.
The Latest version strategy also delivers a significant security improvement, but at a high engineering cost, as we will see later. The best overall improvement comes from the Sonatype Best (Best) strategy, which considers component security more holistically (severity in combination) when identifying the best upgrade path.
LLM recommendations present a troubling paradox. While showing an improvement overall, the model degraded security posture for 345 components, recommending newer versions that introduced more vulnerabilities than they resolved. This occurred when the model unknowingly chose versions that:
- Were compromised after its training cutoff
- Carried additional CVEs
- Were newer, but also riskier
Ungrounded AI Agents: Exploring Malware and Protestware Recommendations
The LLM strategy did more than hallucinate versions. It recommended sweetalert2 11.21.2, which is confirmed protestware executing political payloads, as well as color 5.0.1 and color-string 2.1.1, which were compromised in a major supply chain attack. These packages were not obscure edge cases. They were widely downloaded and part of a high-profile security event that occurred after the model’s training data cutoff.
This is the core problem: AI cannot detect threats that happened after it was trained. AI needs real-time intelligence.
While security improvements justify upgrades, the practical question remains: what does it cost? Breaking changes drive developer effort, transforming version bumps into multi-day refactoring projects. The following analysis quantifies these costs across strategies, revealing trade-offs between security gains and implementation burden.
AI cannot detect threats that happened after it was trained — unless given real-time intelligence
Breaking Change Cost Analysis
Security improvements come at a price measured in developer hours and refactoring effort. Across 856 enterprise applications with representative dependency footprints, upgrade strategies impose dramatically different implementation costs.
Figure 4.3 below compares median per‑application upgrade budgets across the four strategies. NBC delivers the lowest-friction path: roughly one engineer-week to modernize an entire app while avoiding destabilizing work. Best still holds costs under $20K and under 200 hours per app, yet it absorbs the additional change needed to drive higher security scores.
Figure 4.3: Upgrade Cost & Effort Per Application
Source: Sonatype
Both outclass the unmanaged options: unconstrained Latest upgrades result in nearly 5x the median spend versus NBC, and LLM-only selections land in the same cost bracket as Best without the significant risk reduction.
Applying a generic ~8% copilot uplift to the same per-app upgrade totals, NBC still modernizes an app for a little over $5K and ~53 hours, while chasing Latest upgrades soaks up nearly $27K and 288 hours — over five times the spend and the engineering time.
That gap isn’t just a bookkeeping line; it’s opportunity cost. Every extra week poured into unmanaged upgrades is a week not spent on security hardening, paying down tech debt, or feature delivery. LLM-only picks land in the same budget band as Best yet lack its curated risk reduction, reinforcing that disciplined Sonatype strategies are the only way to keep upgrade budgets predictable without cannibalizing roadmap work.
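A rough sketch of the uplift arithmetic, assuming the generic ~8% copilot uplift means an 8% reduction applied uniformly to both cost and hours (an interpretation; the report does not spell this out), starting from the per-app Latest figures cited later ($29,516 and 314 hours):

```python
# Assumed: copilot uplift = 8% reduction in effort, applied uniformly.
COPILOT_UPLIFT = 0.08

def with_uplift(value: float) -> float:
    return value * (1 - COPILOT_UPLIFT)

cost_latest = with_uplift(29_516)   # ~$27.2K, matching the "nearly $27K" figure
hours_latest = with_uplift(314)     # ~289 by this arithmetic; the report cites 288
print(f"${cost_latest:,.0f}, {hours_latest:.0f} hours")
```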
How Costs Scale: Organizational Impact
This projection scales each strategy’s median per-application effort across a representative large enterprise portfolio. It illustrates the cumulative impact of decentralized upgrade decisions over time.
Figure 4.4: At Enterprise Scale: Upgrade Cost & Effort
Source: Sonatype
In practice, organizations don’t upgrade every dependency in every application all at once. Instead, they perform ongoing dependency maintenance — small, continuous updates that, across hundreds or thousands of applications, represent a near-constant workload. Without a cohesive strategy, these distributed efforts can quietly accumulate into multimillion-dollar annual costs.
Sonatype’s NBC automation keeps portfolio-level upgrade effort roughly an order of magnitude lower than unmanaged Latest adoption, while achieving a similar security posture. Teams targeting the most secure baseline can adopt Best selectively, reserving deeper migrations for critical systems where maximum vulnerability reduction warrants the additional investment.
Our analysis of 36,870 dependency upgrade recommendations exposes a critical divergence between the promise of autonomous AI agents and the reality of software supply chain security. The data suggests that without access to real-time package registry intelligence, both state-of-the-art LLMs and traditional Latest heuristics fail to balance security risk with engineering effort.
The "Intelligence" Gap in Generative AI
The most alarming finding is not merely that ungrounded AI makes mistakes, but that it makes dangerous ones with high confidence.
The observed 27.8% AI hallucination rate in GPT-5 recommendations confirms that language models, when isolated from live repositories, struggle to distinguish between existing and non-existent software.
More critically, the "AI hallucinations" were not merely harmless version number errors; they included data corruption, protestware, and hijacked packages. This illustrates a fundamental limitation: training data has a cutoff, but supply chain attacks operate in real time. A model trained before a package compromise cannot "know" a version is unsafe without a live feed of vulnerability intelligence.
Furthermore, the LLM strategy delivered the lowest security improvement (+120.4%) of all methods tested. In 345 specific instances, following the AI’s advice actually degraded the component's security posture by introducing more vulnerabilities than it resolved.
sweetalert2 version 11.21.2
Data Corruption & Protestware
This package displays a ‘noWarMessageForRussians’ banner on any website using this component when it runs in a browser set to Russian.
color version 5.0.1 & color-string version 2.1.1
Cryptostealer & Hijack
Taken over as part of the chalk/debug campaign, color and color-string were manipulated to extract victims’ cryptocurrency from browser wallets.
The False Economy of "Latest Version"
While the industry often defaults to "always upgrade to latest" as a best practice, our cost analysis reveals this to be a financially inefficient strategy. While "Latest" achieved strong security gains (+267.1%), it did so at a brute-force cost: approximately $29,516 and 314 developer hours per application.
When scaled to a portfolio of 1,500 applications, the "Latest" strategy demands nearly $44.3 million in estimated labor costs. This 5x cost multiplier, compared to intelligent automation, represents a massive opportunity cost; every hour spent resolving breaking changes from an unnecessary major version jump is an hour lost to feature development or debt reduction.
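The portfolio projection is straightforward arithmetic; a minimal sketch using the per-application Latest-strategy cost cited above:

```python
# Scale the per-application Latest-strategy cost to a 1,500-app portfolio.
PER_APP_COST = 29_516    # dollars per application (Latest strategy)
PORTFOLIO_SIZE = 1_500   # representative large enterprise portfolio

portfolio_cost = PER_APP_COST * PORTFOLIO_SIZE
print(f"${portfolio_cost / 1e6:.1f}M")  # -> $44.3M
```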
Sonatype Security Hybrid
You can also take a hybrid approach that puts security first in the version scoring algorithm: when a version has a perfect security score, it recommends NBC; otherwise, it defaults to the Best recommendation.
Grounding AI Agents is the Missing Link
The high accuracy (98%) of GPT-5 in the rare instances (3.68%) where it expressed "High Confidence" suggests that the reasoning capabilities of modern models are sound, but their context is insufficient.
The path forward is not to choose between AI and traditional tools, but to ground autonomous AI agents in verified intelligence. By feeding the model real-time data, including computed breaking changes and enhanced vulnerability and malware intelligence, Sonatype’s approach eliminates AI hallucinations entirely while empowering teams to choose the upgrade path (Best vs. No BC) that aligns with their risk tolerance and budget.
Download the Full Report