One experience has become nearly universal as AI systems move deeper into software development: their confidence when they're wrong.
Modern LLMs can generate code, recommend fixes, and even suggest dependency upgrades. But they also routinely invent package names, versions, and upgrade paths that don't exist, and present them with total certainty. In environments where automation already operates at scale, this isn't just inconvenient. It's dangerous.
As part of our 2026 State of the Software Supply Chain report, Sonatype analyzed nearly 37,000 dependency upgrade recommendations generated by GPT-5 across major open source ecosystems. The results were striking: 27% of all recommendations were hallucinations. These were fabricated versions that couldn't be resolved, forcing teams to spend time validating outputs, fixing broken builds, and reworking AI-generated suggestions.
This year, we expanded the research and evaluated a new generation of frontier models, including Claude Sonnet 3.7 and 4.5, Claude Opus 4.6, Gemini 2.5 Pro and 3 Pro, GPT-5, GPT-5.2, and GPT-5 Nano, to find out whether newer, larger models are actually making safer dependency decisions.
And, we introduced real-time intelligence into the process, including:
Live package registries.
Current vulnerability data.
Compatibility and breaking change analysis.
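The first of those checks can be sketched as a simple pre-flight gate: before accepting a model's upgrade suggestion, confirm that the suggested version actually exists in a live registry. The Python sketch below is purely illustrative, not Sonatype's implementation; it uses PyPI's public JSON API as the example registry, and the function names (`version_exists`, `vet_recommendation`) are invented for this post.

```python
import urllib.request
from urllib.error import HTTPError

# PyPI's public JSON API returns 200 for a published release, 404 otherwise.
PYPI_URL = "https://pypi.org/pypi/{name}/{version}/json"

def version_exists(name: str, version: str, opener=urllib.request.urlopen) -> bool:
    """Return True if this exact version is published on PyPI."""
    try:
        with opener(PYPI_URL.format(name=name, version=version)) as resp:
            return resp.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False  # fabricated or unpublished version
        raise

def vet_recommendation(name: str, current: str, suggested: str,
                       exists=version_exists) -> str:
    """Classify a model's upgrade suggestion against the live registry."""
    if suggested == current:
        return "no-change"        # inaction: preserves whatever risk exists
    if not exists(name, suggested):
        return "hallucination"    # reject before it breaks a build
    return "candidate-upgrade"    # real version: still needs vuln/compat checks
```

The `exists` parameter is injectable so the gate can be pointed at any registry (or a cached mirror) rather than a live HTTP call; a real grounding layer would layer vulnerability and breaking-change checks on top of this existence test.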
With that grounding layer in place, hallucinations disappeared across the entire dataset.
For a comprehensive look at this new research, read the whitepaper, Making AI Software Development Safe at Machine Scale.
Newer models are improving, with hallucination rates dropping significantly from earlier generations. But even the best ungrounded systems still hallucinate about 1 in 16 recommendations.
At scale, this presents a serious reliability issue. The latest generation of models didn't learn which versions actually exist. Instead, they learned to stop guessing when uncertain. "No change" recommendations nearly doubled, and about 1 in 3 components now receive a same-version recommendation.
On the surface, this looks like progress, but it trades hallucination for inaction, which is a different kind of risk.
Without access to real-time data, models face the choice of guessing and risking hallucinating a non-existent version, or doing nothing and preserving whatever risk already exists. Newer models increasingly choose the second option, but "do nothing" is not neutral.
If a dependency contains known vulnerabilities, a same-version recommendation locks that exposure in place. Over time, this leads to accumulated technical debt and persistent security risk.
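That lock-in is checkable: a "no change" recommendation is only neutral if the pinned version has no known vulnerabilities. As an illustrative sketch (again, not Sonatype's implementation), the public OSV.dev query API can answer that question for an exact version; the helper names here are invented for the example.

```python
import json
import urllib.request

OSV_QUERY = "https://api.osv.dev/v1/query"

def known_vulns(name: str, version: str, ecosystem: str = "PyPI") -> list:
    """Ask the public OSV.dev index which advisories affect this exact version."""
    body = json.dumps({
        "version": version,
        "package": {"name": name, "ecosystem": ecosystem},
    }).encode()
    req = urllib.request.Request(
        OSV_QUERY, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("vulns", [])

def same_version_is_safe(name: str, version: str, vulns_for=known_vulns) -> bool:
    """'No change' is only neutral when the pinned version is clean."""
    return not vulns_for(name, version)
```

If this check fails, a same-version recommendation should be flagged for review rather than accepted, since it silently carries the existing exposure forward.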
Hallucination and inaction are symptoms of the same underlying issue: reasoning without the data required to make correct decisions.
This extended research makes it clear that the problem is an intelligence gap. When models operate without access to live package registries, current vulnerability data, and breaking change analysis, they hit a ceiling. They can either guess or abstain, but they can't reliably choose the safest upgrade path.
But when real-time intelligence is introduced, the results change dramatically. A hybrid approach, with Sonatype Guide at the center, combines model reasoning with real-time software supply chain intelligence: it eliminates hallucinations, reduces critical and high vulnerability exposure by up to 70%, and consistently outperforms even the largest ungrounded models.
Ungrounded models show significant variability (10,830–14,325 vulnerabilities), driven by differences in model quality and how recently they were trained. In contrast, the gains from grounding are consistently strong across all models. This reliability allows organizations to use older, more cost-effective models without compromising their ability to identify safe, high-quality dependencies and stay within AI budgets.
Stronger detection of malicious components and open source malware enhances overall resilience and reinforces market leadership. Meanwhile, developers can shift their focus away from avoidable fixes and toward building differentiated, high-impact features.
Model improvements alone won't solve this problem. Scaling parameters, refreshing training data, or switching vendors does not close the gap. Across providers and model generations, the same pattern emerges: ungrounded systems converge on the same limitations.
AI can accelerate development, but without grounding in real-time intelligence, it cannot make safe dependency decisions. Download our latest research, including an in-depth exploration of our methodologies.