Maven Central Repository Traffic: Using S3

December 24, 2008 By Brian Fox


Yesterday, I wrote about the analysis involved in tracking down the increasing load on Maven Central. By identifying some misbehaving tools, we were able to reduce the traffic from a 98 Mbps average to 60-80 Mbps. In this post, I discuss the next step toward a Maven Central that can scale to meet the load generated by millions of developers using an ecosystem of tools that rely on Central.

Where We Left Off...

As a refresher, here's what the load looked like at the end. To summarize: Central experienced load problems on httpd, so we moved to NGINX, which fixed the load problem but caused us to regularly saturate a 100 Mbps pipe. After a few weeks of investigation, we discovered that much of the traffic was due to a misconfigured product repeatedly downloading the Nexus index. We notified the responsible project and blocked access to the index until the problem was fixed. The end product of all that work was a Maven Central with fewer saturation events, operating in a consistent 60-80 Mbps range in the middle of a work week.

We still experienced significant traffic on Monday mornings, as users fired up M2Eclipse and downloaded the updated weekly index. We tried shifting the index creation from Sunday night to Friday afternoon to smooth out some of the traffic over the weekend, but ultimately this didn't help much. We ended up having to apply QoS rules that throttled transfers of the index zip down to something like 30 kbps to keep the rest of the repository available during these loads. This wasn't a great solution, as it significantly increased the time it took users to get the index.

Considering Amazon S3

One of the ideas we had been pursuing was to use Amazon S3 to host Central. For those unfamiliar with S3, it is a cloud-based system for storing and serving data, and is part of the Amazon Web Services family of products (EC2, S3, CloudFront). The nearly unlimited bandwidth and the option to use CloudFront to physically bring the data closer to users are a definite bonus. It was clear to us that we needed something that could scale indefinitely, as we continue to see increased adoption of tools that rely on the Central Repository. The drawback with S3 is that we would not have the direct ability to monitor the traffic and uncover abuse or misconfigured tools the way we do now. Still, we all agreed that moving to something like S3 was the future.

Despite our frequent pleas to use repository managers and not scrape the repo, we continue to find people doing exactly that nearly every day. Once you start designing systems at Internet scale, you realize that bandwidth isn't free. S3 is no exception, and opening it up without the ability to protect the bottom line from abusers is not a great idea.

Moving to "the Cloud": Amazon S3

We decided to take baby steps to see if S3 could help us with a very specific problem: the index downloads. Instead of downloading multiple GB of repository data and creating an index locally, we think it makes more sense for people and tools to download a repository index once a week. This index weighs in at 30 MB, and while that might not seem very large at first glance, multiply 30 MB by 50,000 downloads and you are serving 1.5 TB of data. This is exactly what was happening, and it seemed like an easy target to offload to Amazon S3.
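As a quick check of that arithmetic, here is a minimal Python sketch. The 30 MB index size and the 50,000 downloads come from the paragraph above; the function name and the use of decimal units (1 TB = 1,000,000 MB) are assumptions made for illustration.

```python
# Figures quoted in the post: a ~30 MB index and 50,000 downloads.
INDEX_SIZE_MB = 30
DOWNLOADS = 50_000


def total_transfer_tb(size_mb: float, downloads: int) -> float:
    """Total data served, in terabytes (decimal units: 1 TB = 1,000,000 MB)."""
    return size_mb * downloads / 1_000_000


index_transfer_tb = total_transfer_tb(INDEX_SIZE_MB, DOWNLOADS)
print(f"{index_transfer_tb:.1f} TB")  # prints "1.5 TB"
```

At that rate, index downloads alone account for well over a terabyte before any artifact is served.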

We found a handy Ruby script called S3Sync that we use to synchronize the maven2/.index folder over to a "repo1.maven.org" bucket on S3. NGINX is then configured to send temporary redirects (302) for all /.index/ requests over to S3. By doing this, we can still manage traffic to the index, but offload the bulk of the data transfer to the S3 network.
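The redirect decision described above can be sketched in a few lines of Python. This is only an illustration of the routing logic, not the actual server configuration: the S3 base URL, the function name, and the choice to reuse the request path as the object key are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical placeholder URL; the post names the bucket ("repo1.maven.org")
# but not the S3 URL that requests are redirected to.
S3_BASE_URL = "http://s3.example.invalid/repo1.maven.org"
INDEX_MARKER = "/.index/"


def route(path: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect_location) for a request path.

    Index requests get a temporary redirect (302) to S3; everything else
    is served locally (200, no redirect location).
    """
    if INDEX_MARKER in path:
        # Assumption: the S3 object key mirrors the original request path.
        return 302, S3_BASE_URL + path
    return 200, None


print(route("/maven2/.index/example-index.zip"))
print(route("/maven2/junit/junit/4.5/junit-4.5.jar"))
```

Because the redirect is temporary, clients keep asking Central first, so the target location can change later without breaking anyone who follows the redirect.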

So how did it work? Unbelievably well. Take a look at the week prior to the shift to S3 and the week after the shift to S3.

First, consider the weekly traffic before the switch to S3, keeping in mind that the weekly traffic numbers mask the periodic 100 Mbps saturation that we were seeing in the hourly graphs. A new index was published, and we saw a spike in download activity during the morning of December 1st. After that, our weekly traffic gradually diminished to a background level of between 40 Mbps and 80 Mbps on an average weekday. Again, note that even those good days had periods of complete saturation. After blocking the offending tools, we saw an improvement, but we wanted to offload the bulk of our index downloads to S3 to gain further improvements.

Now, look at the weekly traffic graph after moving the index to S3. Unless you note the difference in the Y axis scale, this might not seem as impressive. The number we're most interested in decreasing is the 95th percentile value: the bandwidth level that 95% of our 5-minute average samples fall at or below. If our 95th percentile number is close to 100 Mbps, it means we're likely to saturate the 100 Mbps pipe often. If our 95th percentile is down around 20 Mbps, we're much more likely to have a stable and available repository. After moving the index to S3, we have a roughly 8x decrease in the 95th percentile, from 98.7 Mbps to 12.3 Mbps. We went from a weekly bandwidth average of 49.28 Mbps to 8.24 Mbps, and our total transfer for the week went from 3.81 TB to 629.7 GB.
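For readers unfamiliar with the 95th percentile metric, here is a minimal Python sketch. It uses the nearest-rank method, which is an assumption (the graphing tool behind the post may compute it differently), and the sample values are made up for illustration. The before and after figures at the end are the ones quoted in the paragraph above.

```python
def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile: the smallest sample such that at least
    95% of all samples are at or below it."""
    ordered = sorted(samples_mbps)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Made-up 5-minute averages (Mbps): mostly quiet, with two bursts.
samples = [10] * 18 + [95, 100]
p95 = percentile_95(samples)
print(p95)  # 95: the single highest burst is discarded, the next one is not

# Ratios implied by the before/after figures quoted in the post.
p95_ratio = 98.7 / 12.3        # 95th percentile, Mbps
avg_ratio = 49.28 / 8.24       # weekly average, Mbps
total_ratio = 3810 / 629.7     # weekly transfer, GB (3.81 TB = 3810 GB)
print(f"{p95_ratio:.1f}x  {avg_ratio:.1f}x  {total_ratio:.1f}x")
```

The 95th percentile fell by about a factor of eight, while the weekly average and total transfer each fell by about a factor of six.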

Before moving the index to S3, we suffered from slow response times and index downloads that took a few minutes during peak traffic. After we moved the index to S3, response times improved by 300%, and the index download now takes a few seconds to complete. Just moving the index to S3 yielded dramatic improvements for the Central Maven Repository.

The Result: Greater Speed, Higher Availability

In the two weeks since we moved the index to S3, it has served over 4 TB of data across 730,000 requests. Even though the bandwidth bill for the S3 service is significant, we estimate it is half of what we save on the Central connection traffic. In other words, we've increased availability and reduced the overhead costs associated with the Central Maven Repository.

A note of caution: To protect the system from abuse, we may have to change the URLs on S3 from time to time. Don't point directly at the S3 URL. Continue to request the data from Central and you'll be OK.

We won't stop here in finding ways to optimize the repository experience for users. One thing that will be rolled out shortly is Incremental Index support. This will enable tools that use the Nexus Indexer API to grab only chunks of the index that have changed. This should have a significant impact on the amount of traffic. We also continue to investigate the possibility of leveraging the cloud to host the entire repository, so stay tuned.

Note: If you liked the screenshots used in these past two entries, check out Jing Project. Jing is a great, free, easy-to-use tool for capturing screen images or videos and marking them up. Both OS X and Windows versions are available.
