CHAPTER 1

Open Source Supply, Demand, and Security

 

A monk by the name of John of Salisbury wrote a famous phrase in a 12th century manuscript, borrowed by Sir Isaac Newton and hundreds of others since:

"We are like dwarfs sitting on the shoulders of giants. We see more, and things that are more distant, than they did, not because our sight is superior or because we are taller than they, but because they raise us up, and by their great stature add to ours."

The meaning of the passage is simple: The progress we make only happens because of the progress in learning and understanding others have made before us. 

Nowhere is this more evident than in the adoption of open source. Nearly all software shipped today relies on previous innovation: scaffolding built by some of the world's foremost experts and available to all developers free of charge.

In past State of the Software Supply Chain reports, we estimated that up to 90% of the code we run in production is of open source origin. Therefore, the economics of open source are good indicators of trends and challenges in the wider software market. 

For the 9th consecutive year, we continue to track the growth of open source adoption across the four major open source ecosystems. These collectively account for four of the top five languages on GitHub and a 60% share of the most popular programming languages according to the PYPL language popularity index [1]. Drawing on our continued monitoring, we present the combined statistics of each ecosystem in the table below.

FIGURE 1.1. SOFTWARE SUPPLY CHAIN STATISTICS, 2023

| Ecosystem | Total Projects | Total Project Versions | 2023 Annual Request Volume Estimate | YoY Project Growth | YoY Download Growth | Average Versions Released per Project |
| --- | --- | --- | --- | --- | --- | --- |
| Java (Maven) | 557K | 12.2M | 1.0T | 28% | 25% | 22 |
| JavaScript (npm) | 2.5M | 37M | 2.6T [2] | 27% | 18% | 15 |
| Python (PyPI) | 475K | 4.8M | 261B [3] | 28% | 31% | 10 |
| .NET (NuGet Gallery) | 367K | 6M | 162B [4] | 28% | 43% | 17 |
| Totals/Averages | 3.9M | 60M | 4T | 29% | 33% | 15 |

 

Open source supply sees a resurgence 

The supply side of open source is an interesting metric to gauge the pace and scale of innovation that occurs in a given ecosystem. The more open source projects are published every year, the more innovation occurs in a given ecosystem.

New open source projects across the monitored ecosystems have been published at a relatively steady 15% average rate [5] in recent years, which was a significant reduction in pace from highs seen in 2019 and before.

This two-year slump is most likely related to the COVID-19 pandemic period and associated slowdown. While some studies suggest productivity did increase during the 2020-2023 period in the U.S., a negative correlation emerges in open source production trends. This is further supported by another study that found productivity rates in information and communication technology did decline towards 2022. One other explanation could be that a lot of these projects are in fact coming from commercial activity and not people with spare time, which was abundant during the pandemic.

To date, the data in 2023 shows the innovation slowdown is now over. Each monitored ecosystem showed a remarkably consistent project growth rate, varying just 2% across all four monitored ecosystems to a total average growth rate of 29% year-over-year.

The rate of production growth is recovering across the board, and both Maven Central and NuGet are on track to exceed the rate of growth seen in 2020. 

PyPI and npm, although growing, have not yet caught up to their original rate of growth but are on an upward trend. In a later section, we will see how breakthroughs and interest in AI and its related tooling are fueling the rate of growth in these ecosystems.

FIGURE 1.2. OPEN SOURCE NEW PROJECT GROWTH RATE OVER THE PAST 4 YEARS

Between 2022 and 2023, the number of available open source projects grew an average of 29%. The average open source project in 2023 has released 15 versions available for consumption, with specific ecosystem averages ranging from 10 to 22 versions across the different open source registries. That means 1-2 new versions are released every month and adds up to 60 million combined releases made available in the observed ecosystems.
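The per-project averages quoted above can be reproduced, approximately, from the totals in Figure 1.1. The following is a minimal sketch that uses the rounded figures from the table, so a result may differ from the published column by about one version; the function and variable names are illustrative and not part of the report's methodology.

```python
# Rounded (projects, versions) figures taken from Figure 1.1; not exact registry counts.
ECOSYSTEMS = {
    "Java (Maven)": (557_000, 12_200_000),
    "JavaScript (npm)": (2_500_000, 37_000_000),
    "Python (PyPI)": (475_000, 4_800_000),
    ".NET (NuGet Gallery)": (367_000, 6_000_000),
}


def versions_per_project(projects: int, versions: int) -> float:
    """Average number of released versions per project."""
    return versions / projects


# Per-ecosystem averages.
averages = {
    name: versions_per_project(projects, versions)
    for name, (projects, versions) in ECOSYSTEMS.items()
}

# Combined average across all four ecosystems.
total_projects = sum(projects for projects, _ in ECOSYSTEMS.values())
total_versions = sum(versions for _, versions in ECOSYSTEMS.values())
overall = versions_per_project(total_projects, total_versions)

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} versions per project")
print(f"All ecosystems: {overall:.1f} versions per project")
```

Summing the rounded inputs gives the roughly 3.9 million projects and 60 million versions quoted in the text.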

FIGURE 1.3. OPEN SOURCE PROJECTS AND VERSIONS GROWTH, 2023

 

Open source consumption is decelerating

While we’re seeing supply increase, consumption isn’t keeping pace. The rate of download growth in open source consumption has slowed over the past two years. In 2023, this trend continued, with the average download growth rate sitting at 33%, exactly what it was last year. This is a stark contrast to the all-time high of 2021, which saw 73% year-over-year (YoY) growth. Growth is slowing in the largest ecosystems, which is not surprising given how saturated their markets already are.

Despite this, both of the largest ecosystems, Maven and npm, are each estimated to reach over a trillion requests in 2023, with npm reaching a staggering 2.6 trillion requests in total, continuing a modest growth that surpasses the total request rate of PyPI in 2022.

These two ecosystems account for 90% of the requests served with the remaining two growing at above average pace.
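The 90% figure can be sanity-checked against the request estimates in Figure 1.1. This is a rough sketch using the rounded volumes from that table; the dictionary and function names are illustrative.

```python
# Estimated 2023 request volumes, as rounded in Figure 1.1.
REQUESTS_2023 = {
    "Maven": 1.0e12,
    "npm": 2.6e12,
    "PyPI": 261e9,
    "NuGet": 162e9,
}


def share_of_total(volumes: dict, names: list) -> float:
    """Fraction of all requests served by the named ecosystems."""
    return sum(volumes[name] for name in names) / sum(volumes.values())


top_two_share = share_of_total(REQUESTS_2023, ["Maven", "npm"])
print(f"Maven + npm share of requests: {top_two_share:.1%}")
```

With these rounded inputs the share comes out just under 90%, consistent with the statement above.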

FIGURE 1.4. CUMULATIVE ESTIMATED REQUESTS PER ECOSYSTEM OVER 6 YEARS

Annual request growth rate of each ecosystem

Requests are the fundamental measure of how popular an open source ecosystem is and how lively its usage is. Other factors within an ecosystem may vary, such as the larger size and complexity of Java packages compared to JavaScript packages.

Investigating the rate of growth for requests can reveal information about the state of open source adoption, as well as the growth of the software industry at large.

Figure 1.5 charts these individual growth rates over time and displays an average across all four major ecosystems.

FIGURE 1.5. GROWTH RATE OF THE MONITORED OPEN SOURCE ECOSYSTEMS OVER 5 YEARS

Download requests for all open source ecosystems are still growing, but for a third year in a row there are signs that the pace of growth is slowing. 

We can see a clear delineation between the stabilization of large ecosystems like Maven and npm, and continued accelerated growth in PyPI and NuGet. 

Figure 1.6 charts the overall aggregate request growth across all ecosystems. It illustrates that although the pace of growth is slowing, the absolute scale of growth continues to compound on previous years' rates. To put it simply, the pace of open source adoption still shows no signs of stopping.

FIGURE 1.6. TOTAL OPEN SOURCE REQUESTS OVER YEARS

 

Individual ecosystem analysis

Through the first 7 months of 2023, 512 billion Java components were requested from the Maven Central Repository. This puts 2023 on pace to exceed the 821 billion requests served in all of 2022.

Java continues to grow at a healthy pace, hitting an estimated 25% YoY request growth rate. If previous years are any indication, we may well see a spike towards the end of the year.

JAVA 2023 BY THE NUMBERS:

1.0 trillion
projected request volume
25%
YoY growth estimated

npm is the juggernaut of open source registries, with an estimated download request count of over 2.6 trillion components (or to display it in full numbers: 2,579,310,885,518). 

The growth of npm is the slowest of all the monitored ecosystems, estimated at 18% YoY. Nevertheless, owing to npm's substantial footprint, this translates to a staggering 400 billion additional requests, surpassing the combined total of requests served by PyPI and NuGet.

JAVASCRIPT 2023 BY THE NUMBERS:

2.6 trillion
projected download volume
18%
YoY growth

Python continues to expand at a high pace, fueled by the language’s popularity and innovative uses, including AI. In 2022, PyPI served over 178 billion requests. This year, we estimate PyPI request volume will reach 261 billion. This represents 31% YoY growth.

PYTHON 2023 BY THE NUMBERS:

261 billion
projected download volume
31%
YoY growth

NuGet is the chosen ecosystem of the .NET family of languages and continues to serve engineers working with the growing set of Microsoft technologies. The rate of growth in NuGet is estimated to be the fastest among the ecosystems. Developers downloaded 113 billion NuGet packages in 2022, which was well above our estimate last year. In 2023, NuGet is estimated to serve 162 billion requests, representing 43% YoY growth.

.NET 2023 BY THE NUMBERS:

162 billion
projected download volume
43%
YoY growth
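The growth figures in this section follow from simple year-over-year arithmetic. As a minimal sketch, the snippet below recomputes the NuGet growth rate from its 2022 and 2023 volumes and derives the additional npm requests implied by an 18% growth rate; the function names are illustrative, and the inputs are the figures quoted in the text.

```python
def yoy_growth(previous: float, current: float) -> float:
    """Year-over-year growth as a fraction of the previous year's volume."""
    return (current - previous) / previous


def added_volume(current: float, growth_rate: float) -> float:
    """Requests added this year, given this year's total and its YoY growth rate."""
    previous = current / (1 + growth_rate)
    return current - previous


# NuGet: 113 billion requests in 2022, an estimated 162 billion in 2023.
nuget_growth = yoy_growth(113e9, 162e9)

# npm: an estimated 2,579,310,885,518 requests in 2023 at 18% YoY growth.
npm_added = added_volume(2_579_310_885_518, 0.18)

print(f"NuGet YoY growth: {nuget_growth:.0%}")
print(f"npm additional requests: {npm_added / 1e9:.0f} billion")
```

The npm result lands slightly under the 400 billion quoted above, which is what one would expect from a rounded headline figure.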

 

Open source software security concerns see no sign of slowing 

In 2022, we reported a massive increase in the growth of malicious attacks on the software supply chain. Since our last report, this method of propagating security threats using trusted developer utilities and ecosystems has continued to evolve and flourish. 

Over the past few years, a troubling trend has emerged in the software supply chain: tailor-made packages designed to run a malicious payload on download, without any developer interaction. This form of intrusion relies on developers not recognizing that the build breakage caused by the fake package may indicate that something nefarious has already happened on their system. We did a deep dive into types of malicious attacks in last year’s report.

In our YoY monitoring, at the time of writing in September 2023, we have logged 245,032 malicious packages, meaning the number of malicious packages has tripled in the last year. Put differently, in one year alone we’ve seen twice as many supply chain attacks as in all previous years combined.
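The two ways of stating this growth are equivalent: if a cumulative count triples, the newly added portion is twice the earlier total. A small sketch makes the arithmetic explicit. It assumes the count tripled exactly, which the report states only approximately, so the derived split is illustrative rather than a reported figure.

```python
def split_cumulative(total_now: float, multiple: float) -> tuple:
    """Split a cumulative total into (earlier total, newly added) given its growth multiple."""
    previous = total_now / multiple
    new = total_now - previous
    return previous, new


# 245,032 packages logged to date; assumed to be exactly 3x the prior cumulative total.
previous, new = split_cumulative(245_032, 3)
ratio = new / previous

print(f"Earlier cumulative total (implied): {previous:,.0f}")
print(f"Added in the last year (implied): {new:,.0f}")
print(f"New packages relative to all previous years: {ratio:.1f}x")
```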

This pace of growth is astonishing. It signals the role of the software supply chain as one of the fastest growing vectors for adversaries to execute malicious code. Furthermore, we have seen an increase in nation-state actors leveraging these vectors (see our deep dive section below).

FIGURE 1.7. NEXT GENERATION SOFTWARE SUPPLY CHAIN ATTACKS (2019-2023)

245,000

Malicious packages discovered, 2x all previous years combined

This is alarming news. Many open source ecosystems have implemented new security policies, such as mandatory MFA, but these usually only protect existing open source publishers from attack. Packages containing malicious code are often treated much like packages with newly discovered security vulnerabilities: they are taken down by volunteers following a vulnerability removal process, which is not appropriate when the code was designed to be malicious from the start. This approach can leave malicious packages available longer than necessary, putting developers at risk.

 

Notable malicious packages and vulnerabilities

As we continue to document an overall rise in malicious attacks on open source ecosystems, the monitored 2022-2023 period has also seen more professional criminal campaigns emerge. The software supply chain lends itself well to the cybercriminal ecosystem, either as an initial access vector for initial access brokers or as a means of distributing initial access malware for Advanced Persistent Threat (APT) groups. Here are several examples we’ve seen this year:

Lazarus-created PyPI package 'VMConnect' imitates VMware vSphere connector

In August 2023, Sonatype discovered a malicious Python package, 'VMConnect,' which mimics a legitimate VMware module on PyPI. This is part of a wider cyber campaign called "PaperPin," and is widely thought to originate from the Lazarus Group, a North Korean state-affiliated organization. The packages aim to download further malicious payloads from attacker-controlled URLs. The focus on VMware, a widely used virtualization platform, is particularly concerning, as a successful compromise could have far-reaching implications for enterprise networks and is widely attractive to state-affiliated actors.

ChatGPT histories uncovered due to a vulnerability in Redis component used by OpenAI

In March 2023, ChatGPT users experienced a data leak in which chat histories displayed other people's queries. OpenAI traced the issue to a race condition in redis-py, the open source Redis client library it uses when caching user data. This flaw made sensitive data of about 1.2% of ChatGPT Plus subscribers accessible to others. The vulnerability was exacerbated by a recent server change that increased the probability of the race condition occurring. The issue underscores the importance of even rarely occurring vulnerabilities, especially in widely used components like Redis, given their potential to cause widespread disruption and data exposure.

PyTorch namespace confusion attack targeted utilities aimed at AI developers

In the past couple of holiday seasons, we've seen some big supply chain attacks, including one on PyTorch, a popular machine learning (ML) framework. The attackers used a tactic known as namespace confusion to specifically go after the experimental "nightly" build of PyTorch. They managed to steal sensitive data, signaling hackers are increasingly setting their sights on AI and ML tools. These tools are becoming more critical in various sectors, making them attractive targets. While only the experimental build was hit, the incident serves as a wake-up call for better security in the booming field of AI.

 

A timeline of attacks

We have continued to curate a timeline of known malicious packages and software supply chain campaigns. This interactive timeline summarizes notable supply chain incidents, next-gen attacks, and other incidents propagated using the software supply chain.

FIGURE 1.8. SOFTWARE SUPPLY CHAIN ATTACKS

AUG 2023

Malicious PyPI package imitates VMware vSphere connector module

A fake PyPI package, ‘VMConnect,’ copied VMware's vSphere connector but harbored hidden malicious code. It was part of an ongoing campaign, "PaperPin," along with similar packages. These packages were removed from PyPI.

JULY 2023

A French-meme-inspired PyPI package targets Windows with an info-stealer

A PyPI package called ‘feur’ cleverly disguised a Windows Remote Access Trojan (RAT) behind a meme-related name. This RAT had surveillance features, such as clipboard access, network monitoring, webcam usage, and screenshots.

JUNE 2023

PyPI attackers unleash trojans and info-stealers

Sonatype detected malicious PyPI packages posing as npm "colors" library, targeting Windows with trojans hosted on Discord. One package affected Windows and Unix with trojans and Python code. Others used variable obfuscation similar to crypto-miner malware.

JUNE 2023

Manifest confusion in npm

"Manifest confusion" was revealed in the npm ecosystem. A package's metadata (dependencies and scripts) is published separately from its actual contents, stored in a tarball containing package.json. This disconnect can result in issues like cache poisoning or hidden dependencies/scripts.

APR 2023

Threat actors compromise 3CX desktop app in software supply chain attack

A software supply chain attack struck 3CX's Mac and Windows client apps, impacting 600,000 users. This month-long, state-actor-led attack prompted 3CX to urge users to uninstall compromised apps and migrate to safer frameworks.

MAR 2023

W4SP copycats continue to infiltrate PyPI registry

Microsoft-helper package reveals copycat info-stealer

OpenAI data breach traced to unpatched Redis vulnerability

FEB 2023

https package attempts to sneak in through GTA 5 mods

Info-stealers distributed via Python packages on the PyPI registry

Malware campaign floods PyPI with thousands of malicious packages

JAN 2023

Malicious Python package attempts to download and install a Trojan virus

This malware validates the presence of a VM before attempting to execute. Sonatype confirmed the “minimums” package as malicious. It contains a payload in the setup.py file that attempts to download a Trojan virus from a rogue server, install it, and log the installation result using a Discord webhook.

DEC 2022

PyTorch-nightly build compromised

Malicious 'Cabo Custody Restful' attack tries to trick developers using macOS

NOV 2022

Malicious reverse shell and bind shell scripts taint packages

Sonatype discovered packages tainted with malicious reverse shell and bind shell scripts. Other packages looked for information on the target computer’s OS such as hostnames, IPs, credentials, and other configuration details with the purpose of exfiltrating such data to malicious servers.

SEPT 2022

‘JuiceLedger’ tries to catch PyPI maintainers unaware

A phishing attack attempted to distribute a .NET-based malware, dubbed 'JuiceStealer,' that steals credential, browser, and cryptocurrency vault information and feeds the ill-gotten goods to a domain purportedly controlled by JuiceLedger.

AUG 2022

Cryptomining packages flood npm, PyPI

PyPI package ‘secretslib’ drops Linux malware to mine Monero

‘Requests’ library typosquats install ransomware

 

JULY 2022

PyPI packages steal Telegram cache files, add Windows Remote Desktop accounts

Sonatype discovered malicious PyPI packages that set up new Remote Desktop user accounts on your Windows computer and steal encrypted Telegram data files from your Telegram Desktop client.
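The "manifest confusion" entry in the timeline above (June 2023) describes a gap between the metadata a registry publishes for a package and the package.json inside its tarball. One way to illustrate the idea is to compare the two directly. The following is a minimal offline sketch, not npm's or Sonatype's tooling: it assumes the tarball stores its manifest at `package/package.json`, the checked fields are a small illustrative subset, and the package name and script in the demo are hypothetical.

```python
import io
import json
import tarfile

# Fields compared in this sketch; a real check would likely cover more.
FIELDS = ("name", "version", "dependencies", "scripts")


def read_package_json(tarball_bytes: bytes) -> dict:
    """Return the parsed package/package.json from a gzip-compressed package tarball."""
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes), mode="r:gz") as archive:
        member = archive.extractfile("package/package.json")
        if member is None:
            raise ValueError("package/package.json is not a regular file")
        return json.load(member)


def manifest_mismatches(registry_manifest: dict, tarball_manifest: dict) -> dict:
    """Map each checked field to (registry value, tarball value) where the two differ."""
    return {
        field: (registry_manifest.get(field), tarball_manifest.get(field))
        for field in FIELDS
        if registry_manifest.get(field) != tarball_manifest.get(field)
    }


def build_tarball(manifest: dict) -> bytes:
    """Build an in-memory tarball containing only package/package.json (demo helper)."""
    data = json.dumps(manifest).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo("package/package.json")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Hypothetical example: the registry metadata hides an install script.
registry_view = {"name": "demo-pkg", "version": "1.0.0", "dependencies": {}, "scripts": {}}
actual = dict(registry_view, scripts={"install": "node collect.js"})

mismatches = manifest_mismatches(registry_view, read_package_json(build_tarball(actual)))
print(mismatches)
```

Any non-empty result means the published metadata and the shipped contents disagree, which is the condition that makes hidden dependencies or scripts possible.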

 

Differentiating software vulnerabilities and malware

Up until now, we’ve been talking about malware and malicious attacks on the software supply chain — or maybe better stated as malware propagated using the open source supply chain. In this next section, we’re going to discuss software vulnerabilities. While the two concepts are related, they are very distinct, so we’d like to quickly define the difference between a vulnerability and a piece of malware.

Software vulnerability: A flaw in the code

A software vulnerability is akin to a flaw in code, much like a faulty lock on a door. However, unlike malware, vulnerabilities are not intentional. Instead, they represent weaknesses in software components or projects.

Similar to how a faulty lock compromises the security of a building by allowing unauthorized access, a software vulnerability creates a gap in the software's security perimeter. This gap becomes an entry point for intruders to exploit, gaining unapproved access to the system, application, or component.

Malware: Malicious intent in open source

Malware, short for “malicious software,” poses a significant threat to open source software ecosystems. It encompasses a wide range of malicious programs, such as viruses, worms, trojans, ransomware, spyware, and adware, all designed to gain unauthorized access to information or systems.

With its various forms, malware’s primary purpose is to steal data, install harmful software, gain control of a network, or compromise software or hardware. Threat actors employ diverse distribution methods, such as infected email attachments, malicious websites, or compromised software downloads.

 

Consumption behavior contributing to security concerns

Our report last year revealed a startling statistic: nearly 96% of component downloads with known vulnerabilities could be avoided as a better, fixed version was already available. This illustrates a clear need for organizations to pay closer attention to what versions they are adopting. 

There is growing evidence that the standard practices for avoiding vulnerable components are not reducing the attack surface as much as needed. For example, as of September 2023, versions vulnerable to the infamous Log4Shell vulnerability still account for nearly a quarter of all net new downloads of Log4j. Almost two years after the vulnerability was first disclosed, this pace continues every week. This is only part of the story.

As we discussed last year, the numbers for other critical vulnerabilities that have not received as much widespread media attention are even more depressing.

Log4j downloads since Dec 15, 2021: 29% vulnerable
Log4j downloads in the last 7 days: 3,490,799 total

This warrants concern and calls for organizations to change their behavior, because critical vulnerabilities are widely exploited by bad actors even at the state level. For example, Log4Shell has topped CISA/NSA charts for active state-sponsored exploitation for well over a year now. This is also echoed in the OpenSSF's recently released Consumption Manifesto, which calls for organizations to "take responsibility for the open source they use, how it is consumed, and how they manage the risk associated with that consumption."

According to a joint consortium of national operators including CISA, NSA, NCSC-UK and others, attackers are exploiting older well-known vulnerabilities much more frequently than new zero-day vulnerabilities. This is extremely important to understand. While we should of course worry about zero-days, we also know that 96% of vulnerable open source downloads have a non-vulnerable fix available. Those 96% need to be addressed.
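Because a fixed version usually exists, checking whether a consumed version falls inside an advisory's affected range, and which available release is the nearest fix, is straightforward to automate. The sketch below is a simplified illustration: the component, its affected range, and the list of available versions are hypothetical, and it handles only purely numeric dotted versions (no pre-release tags or backport branches, which real advisories often include).

```python
from typing import Iterable, Optional, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Parse a purely numeric dotted version such as '1.4.3'."""
    return tuple(int(part) for part in text.split("."))


def is_affected(version: str, first_affected: str, first_fixed: str) -> bool:
    """True if first_affected <= version < first_fixed."""
    return parse_version(first_affected) <= parse_version(version) < parse_version(first_fixed)


def nearest_fix(
    current: str, available: Iterable[str], first_affected: str, first_fixed: str
) -> Optional[str]:
    """Lowest available version above `current` that is outside the affected range."""
    current_parsed = parse_version(current)
    candidates = [
        candidate
        for candidate in available
        if parse_version(candidate) > current_parsed
        and not is_affected(candidate, first_affected, first_fixed)
    ]
    return min(candidates, key=parse_version) if candidates else None


# Hypothetical advisory: versions from 1.2.0 up to (not including) 1.4.3 are affected.
AFFECTED_FROM, FIXED_IN = "1.2.0", "1.4.3"
AVAILABLE = ["1.2.5", "1.3.1", "1.4.3", "2.0.0"]

print(is_affected("1.3.0", AFFECTED_FROM, FIXED_IN))
print(nearest_fix("1.3.0", AVAILABLE, AFFECTED_FROM, FIXED_IN))
```

Choosing the lowest non-affected version keeps the upgrade as small as possible; a team might instead prefer the latest release, which is a policy decision this sketch does not make.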
For this year’s report, we’ve taken a closer look at how vulnerabilities are consumed from Maven Central, with a special focus on what sort of geographic variance might exist.

Vulnerable components consumed

Let’s start off by looking at the top level. In 2022, we saw that 12% of downloads served by Maven Central [6] contained at least one known security vulnerability.

This number is important when considering that the easiest way to reduce the risk of a supply chain incident caused by a vulnerability is to simply choose a better, non-vulnerable version of a component [7]. However, there is some improvement here. The share of vulnerable downloads in 2021 was 14%, and the figure to date in 2023 sits around 10%.

FIGURE 1.9. PERCENTAGE OF COMPONENTS WITH KNOWN VULNERABILITIES SERVED FROM MAVEN CENTRAL

However, when investigating downloads that contained a vulnerability in 2022, it emerges that well over a third of the components consumed that had known vulnerabilities were Critical [8] in severity, and a further 30.5% had a High severity rating.

FIGURE 1.10. VULNERABLE DOWNLOADS BY SEVERITY

FIGURE 1.11. NVD KNOWN VULNERABILITY SCORE

Over 16% of all downloads in 2022, on average, contained a High or Critical severity vulnerability. This trend holds true nearly universally across all regions, suggesting that component consumption is largely an unmanaged decision today. It also contrasts with the share of critical vulnerabilities in the National Vulnerability Database (NVD): the proportion of "criticals" among consumed vulnerabilities is more than double their proportion among all known vulnerabilities.

The increase of critically vulnerable components being consumed could be due to the fact that these vulnerabilities are found and reported primarily in more popular and widely adopted open source software. Popularity begets more attention from good and bad actors, resulting in increased likelihood of a critical issue being present. It’s also worth noting that these more popular components have an official disclosure process. This means, on average, these critical vulnerabilities should be the ones that are most noticed. But, as we’ve seen with the vulnerable version of Log4j, “knowing” is only half the battle. Organizations have to care, and they have to have an automated way to address this issue.

A global view of vulnerable open source downloads

Software development has evolved into one of the most globally influential industries, shaping various sectors and regions in unique ways. However, not all regions share the same level of emphasis on software development. To gain insight into how the trends we've explored thus far manifest on a global scale, we conducted an analysis that looks at open source vulnerability consumption by country. 

Our study focused on countries that collectively downloaded over 100 million open source components from Maven Central in the past year. By scrutinizing the percentage of vulnerabilities associated with the software downloaded in each region, we start to gain insights into how different parts of the world manage their software supply chains.

In Figure 1.12, we delineate those that have stronger management programs from those who don’t by plotting the percentage of vulnerabilities against the average number of vulnerable downloads (approximately 22%) and applying a ranking based on how countries compare to that average. But it’s important to consider the context, and this is one of the most important figures to come out of Sonatype’s research: 96% of known vulnerabilities downloaded from Maven Central have a non-vulnerable version available.

The countries covered in the graph below include twenty of the largest consumers of open source software in the world. Even at the low end of our criteria (around 100 million downloads), 9.5% of those downloads are vulnerable components. When you consider juggernauts of open source consumption like the United States, the European Union (collectively), and China, tens of billions of vulnerabilities have entered the supply chains that produce the software we all use and our governments run on.

FIGURE 1.12. AVERAGE VULNERABILITIES BY COUNTRY WITH OVER 1 BILLION DOWNLOAD VOLUME

As we’re only scratching the surface with this regional view of vulnerable downloads, you can explore a deeper dive into open source consumption patterns within specific economic regions in Chapter 3 of this report, where we further unravel the intricacies of dependency management on a global scale. We also summarize the role regulations are having on the industry in Chapter 5.

 
