Does Your Tool Depend on the Central Maven Repository?


Here are three quick notes for people who write tools that depend on the Central Maven Repository. Adhering to these standards will help preserve the free, public resource that millions of Maven users depend on.

We think it is great that people interact with the repository, but we want to ensure that we're all doing so responsibly and conserving bandwidth.

#1: Populate User-Agent Headers

If your build tool or repository manager interacts with the Central Maven Repository, you need to set a meaningful User-Agent. Nexus, Artifactory, Archiva, Maven, and Ivy all identify themselves in the User-Agent header. This is important for the health of the Central repository. If a bug in a release of one of these tools saturates Central's bandwidth, it can jeopardize the availability of this central resource for others. Appropriate User-Agent headers help repository maintainers quickly identify problems, so that we can ensure that Central remains available for most users.

In general, if your tool ever interacts with the Central Maven repository, it is a good idea to maintain contact with at least one member of the Apache Maven PMC. Several of them work for Sonatype, including Brian, Jason, and John.

If you decide to write a new build tool, that's great. Before you distribute it to tens of thousands of developers, make sure your client sets the User-Agent. Recently, we had cases that involved misconfigured tools that lacked an identifying User-Agent. If the tool in question had sent a meaningful User-Agent header, it would have taken all of five minutes to find the problem and identify the project in question. Instead, it took members of the team multiple weeks of effort.
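As a rough illustration of what setting an identifying User-Agent can look like, here is a minimal sketch using the standard java.net.http API (Java 11 or later). The tool name, version, contact URL, and the header layout are all hypothetical choices made for this example, not a format that Central requires.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class UserAgentExample {

    // Builds an identifying string such as
    // "example-build-tool/1.0.0 (https://example.org/contact; Java 17)".
    // The layout is an assumption for illustration; any value that names
    // the tool and its version serves the same purpose.
    static String buildUserAgent(String tool, String version, String contactUrl) {
        return String.format("%s/%s (%s; Java %s)",
                tool, version, contactUrl, System.getProperty("java.version"));
    }

    // Creates a GET request that carries the User-Agent header.
    // Nothing is sent here; the caller decides when to send it.
    static HttpRequest newRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical tool identity and target URL.
        String userAgent = buildUserAgent(
                "example-build-tool", "1.0.0", "https://example.org/contact");
        HttpRequest request = newRequest(
                "https://example.org/repository/some/artifact.pom", userAgent);

        // Print the header that would be sent with the request.
        System.out.println(
                request.headers().firstValue("User-Agent").orElse("<missing>"));
    }
}
```

Including a version number and a way to reach the maintainers is what lets a repository operator trace a misbehaving release back to its project.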

#2: Don't Scrape Central, Don't Walk the Repo

There are a few services which have decided to scrape the entire contents of the Central repository into another copy and then operate on that copy. While there are different ways to do this (rsync, getting it from a mirror), we often see people using a tool like wget with a modified User-Agent header field to constantly scan the entire repository. This creates a storm of requests against Central and wastes bandwidth. It is also another way to crowd out other people trying to use the Central Repository.

Again, this is a case of communicating with the Maven PMC. If you are building a service that consumes gigabytes of bandwidth on Central, you need to contact the PMC, as they oversee and support the Central Maven repository. If you want to do this, you can, but people will likely point you at a mirror. Even with a mirror, you need to make sure that the operators of that particular mirror don't mind you siphoning off a few gigs every month. Bandwidth isn't free, and the first priority is always availability.

#3: Don't Use a 404 as a Search Tool

There are a few tools out there that haven't figured out how to use the Nexus index from the Maven repository (Maven's one of them). Going forward, most tools should start consulting the Nexus index to test for the presence or absence of an artifact. 404 requests are not a problem in themselves, but we're trying to encourage people to minimize remote interactions with Central so we can maximize availability. We could serve tens of thousands of 404 requests a second, but we'd like to think that people want to minimize remote interaction when possible.

If you are writing a tool and you want to know how to interact with the Nexus index, the code is licensed under the Eclipse Public License and is available from http://nexus.sonatype.com.
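The idea of checking a local index before contacting Central can be sketched as follows. This does not use the Nexus index code or its API. A plain in-memory set of coordinates stands in for a locally available index, and the class name, method names, and sample coordinates are all hypothetical.

```java
import java.util.Set;

class LocalIndexLookup {

    // Stand-in for a locally downloaded index: "groupId:artifactId:version" keys.
    private final Set<String> knownCoordinates;

    LocalIndexLookup(Set<String> knownCoordinates) {
        this.knownCoordinates = Set.copyOf(knownCoordinates);
    }

    // Joins the coordinates into the key format used by this sketch.
    static String key(String groupId, String artifactId, String version) {
        return groupId + ":" + artifactId + ":" + version;
    }

    // Answers "does this artifact exist?" locally, so a tool only needs
    // to contact the remote repository for artifacts the index lists.
    boolean isListed(String groupId, String artifactId, String version) {
        return knownCoordinates.contains(key(groupId, artifactId, version));
    }

    public static void main(String[] args) {
        // Hypothetical index contents.
        LocalIndexLookup index = new LocalIndexLookup(
                Set.of("org.example:demo:1.0.0", "org.example:demo:1.1.0"));

        String[] versions = {"1.0.0", "2.0.0"};
        for (String version : versions) {
            if (index.isListed("org.example", "demo", version)) {
                System.out.println("Listed, fetch remotely: " + version);
            } else {
                System.out.println("Not listed, skip remote request: " + version);
            }
        }
    }
}
```

The point of the design is that the question of presence or absence is answered without a round trip, so a missing artifact never turns into a 404 against Central.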

Written by Tim OBrien

Tim is a Software Architect with experience in all aspects of software development from project inception to developing scaleable production architectures for large-scale systems during critical, high-risk events such as Black Friday. He has helped many organizations ranging from small startups to ...
