Sonatype Research Finds AI Coding Safety Gains Rely on Real-Time Software Intelligence, Not Just Larger Models
Across nearly 37,000 software upgrade recommendations, AI grounded in real-time intelligence delivered safer outcomes than larger models operating without live context
Fulton, Md. – March 24, 2025 – Sonatype®, the leader in AI-driven DevSecOps, today unveiled new research showing that larger-scale AI models alone do not produce the safest software dependency recommendations. In a study of roughly 37,000 open source upgrade recommendations, models grounded in real-time software intelligence reduced retained Critical and High risk beyond what larger ungrounded models from Anthropic, Google, and OpenAI could achieve on their own.
Sonatype found that newer, bigger models with improved reasoning and speed became more cautious, increasingly recommending “no change” to an open source component rather than a safer upgrade path. Although that restraint reduced hallucinations, it often left meaningful Critical and High vulnerability exposure in place.
The findings show that safer AI-assisted software dependency decisions require more than model scale alone. They require real-time software intelligence that helps models validate package availability, assess upgrade paths, account for known vulnerabilities, and reduce the tradeoff between remediation and disruption. Key findings include:
- Frontier models are improving, but hallucinations persist: Even the best ungrounded models, which lack real-time intelligence, still fabricated roughly 1 in 16 dependency recommendations.
- Greater restraint still left meaningful risk exposure: Newer models increasingly recommended “no change,” with the most cautious models still carrying roughly 800 to 900 Critical and High vulnerabilities.
- Grounded intelligence outperformed standalone LLMs on security outcomes: Across Maven Central, npm, and PyPI, Sonatype’s Hybrid approach, which selects the most secure upgrade path, delivered 269% to 309% mean security score improvement, versus only 24% to 68% for the best LLM in each ecosystem.
- Real-time intelligence mattered more than model size alone: A small grounded model resulted in significantly lower Critical and High risk at up to 71x lower cost than frontier models.
“Larger models may be improving at reasoning, but dependency management is not a reasoning problem alone; it is a data problem. If a model does not know your actual environment, current vulnerability data, and the policies you operate under, it is just making educated guesses,” said Brian Fox, Co-founder and CTO at Sonatype. “Grounding AI in that reality is what makes its recommendations useful, credible, and safe for enterprise use.”
This study builds on the From Guesswork to Grounded chapter of the 2026 State of the Software Supply Chain® report. The study evaluated roughly 37,000 open source upgrade recommendations across Maven Central, PyPI, npm, and NuGet, comparing ungrounded frontier models with approaches augmented by real-time software intelligence.
Sonatype Guide, powered by this real-time intelligence, helps organizations identify safer open source upgrade paths, reduce avoidable risk, and limit unnecessary developer disruption from breaking changes or poor recommendations. By grounding AI in live software supply chain data, Sonatype helps teams operationalize AI-assisted remediation with greater safety and confidence.
To read the full study, Making AI Software Development Safe at Machine Scale, visit: https://www.sonatype.com/resources/research/making-ai-work-safely.
About Sonatype
Sonatype is the leader in AI-driven DevSecOps. As the maintainers of Maven Central and creators of Nexus Repository, Sonatype has spent two decades pioneering how the world manages and secures open source software, making Sonatype the trusted authority for modern software supply chains. With unmatched open source visibility and a unified product suite built for modern software development, Sonatype gives enterprises the intelligence and automated governance they need to harness the full potential of open source and AI. Sonatype handles the complexity behind the scenes: guiding component and model selection, blocking malicious code, automating dependency and vulnerability management, and ensuring faster, more reliable builds, so developers spend more time on innovation and less time on remediation and rework. Trusted by more than 15 million developers, Sonatype helps power secure, modern software development at nearly 2,000 global organizations including 70% of the Fortune 100. To learn more about Sonatype, please visit www.sonatype.com.
Methodology
We analyzed direct dependencies from enterprise applications scanned between June and August 2025, using the same application sample as the original study and limiting the dataset to Maven, npm, PyPI, and NuGet. This produced roughly 37,000 unique package-version pairs and about 258,000 recommendations evaluated across seven frontier models from OpenAI, Anthropic, and Google. Each model received the same prompt, and every recommended version was checked against Sonatype’s package registry: non-existent versions were classified as hallucinations, and same-version recommendations were treated as inaction. Security outcomes were measured using Sonatype’s enriched severity scoring and deduplicated advisory counts. Hallucinated versions were treated as no-ops because package managers would reject them in practice, and Sonatype’s Hybrid strategy served as the benchmark throughout. We also tested whether real-time ecosystem intelligence matters more than model scale by evaluating GPT-5 Nano on a 397-component adversarial sample skewed toward known failure modes, using a single function-calling tool backed by Sonatype Guide’s version recommendation API and applying the same validation and security methodology.
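The validation step described in the methodology can be illustrated with a minimal Python sketch. This is not Sonatype's implementation: the names `Outcome`, `classify`, and `effective_version`, and the sample version sets, are hypothetical and exist only to show the rule that non-existent versions count as hallucinations, same-version recommendations count as inaction, and hallucinated versions are scored as no-ops. The order of the two checks is also an assumption.

```python
from enum import Enum


class Outcome(Enum):
    HALLUCINATION = "hallucination"  # recommended version is not in the registry
    NO_CHANGE = "no_change"          # model recommended the version already in use
    CHANGE = "change"                # a different version that exists in the registry


def classify(current: str, recommended: str, registry_versions: set[str]) -> Outcome:
    """Classify one recommendation against the set of versions known to exist.

    `registry_versions` stands in for a lookup against a package registry
    (hypothetical; the real check would query live registry data).
    """
    if recommended not in registry_versions:
        return Outcome.HALLUCINATION
    if recommended == current:
        return Outcome.NO_CHANGE
    return Outcome.CHANGE


def effective_version(current: str, recommended: str, registry_versions: set[str]) -> str:
    """Return the version that would actually be in use after the recommendation.

    Hallucinated versions are treated as no-ops, since a package manager
    would reject them, so the current version stays in place.
    """
    outcome = classify(current, recommended, registry_versions)
    return recommended if outcome is Outcome.CHANGE else current


# Illustrative data only: a made-up component with three published versions.
known = {"1.0.0", "1.0.1", "1.1.0"}
for rec in ("1.1.0", "1.0.0", "9.9.9"):
    print(rec, classify("1.0.0", rec, known).value, effective_version("1.0.0", rec, known))
```

Under this sketch, security scoring would then be applied to the effective version rather than to the raw recommendation, so a hallucinated upgrade earns no security credit.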