The Hudson Build Farm Experience, Volume III

February 04, 2009 By John Casey


I've been working on a Hudson-based build farm for Sonatype and Maven open source builds since sometime in September 2008. I've learned a lot, so I thought I'd share some experiences from the trenches. In this third (and probably final) installment, I'll discuss some issues we tackled with our VMware environment itself, and look ahead to some issues with which we still grapple daily.

VMware, Efficiency, and the Space-Time Continuum

Compared to what we went through trying to get Windows builds running reliably out on the build farm, this discussion will seem somewhat nitpicky. However, there are some important things to understand when you're running a build farm on VMware ESXi, so let's dive in and take a look.

The first thing to understand is that the hardware specs of your ESXi machine represent a theoretical maximum. Just looking at those numbers (we have 8 cores at 3.16 GHz and 32 gigs of RAM), you'll be tempted to salivate and wring your hands as you dream about all the simultaneous builds you can run. Resist. Remember that you'll have multiple virtual machines sharing that hardware, each of which has a certain sunk cost in terms of memory (and, minimally, CPU) overhead. This overhead comes from the RAM and CPU necessary to run a full-blown operating system, on which your Hudson instance executes.

In some cases, like Ubuntu JeOS (Just enough OS), which is designed for use in virtual machines, the overhead is pretty minimal, though still noticeable. In other cases, like Solaris or Windows, you're stuck with the same operating system your desktop machine might run, complete with a GUI. OK, I'm sure you can turn off the GUI on Solaris; it runs webservers, after all. But I'm not a Solaris expert, and more to the point I'm not interested in tainting that environment too much with customizations. Too much customization can render your build platform unique, which is a bad thing.

Additionally, there can be a bit of inefficiency in the allocation of RAM and CPU resources if you structure your VMs to grab and reserve those resources no matter what. This means that even if those VMs are completely idle, they may hold onto a certain amount of RAM (usually not CPU, in my experience) and choke out other competing VMs. On the other hand, if you don't reserve resources for your VMs, you may face sudden lock-ups if you have too many VMs competing for what is fundamentally a finite resource.

In theory, this should simply slow down all VMs on the system; sort of a reverse rising-tide-lifts-all-ships effect. In practice, we've found that this sort of competition can lead to full-out system crashes. Funny thing: some operating systems don't respond favorably to having less RAM than they thought. If it's just a CPU-competition issue, then your VMs may simply leak time, but we'll talk about this in a minute. After groping around in the dark for several days, we gradually determined the best policy was to try to limit the total pseudo-hardware configurations for all running VMs to 90% of what the ESXi machine actually has. Note that you must always tell each VM how many CPUs and how much RAM it "owns," even if you don't reserve those resources by messing with the Resources tab in the VM settings. (Reserving them via the Resources tab should force more of a hard allocation, limiting VMware's ability to shuffle resources to where they're most needed, as I alluded to above.) What I'm talking about is trying to keep the total resources "owned" by all running VMs just below the actual hardware resources available on the machine. It seems to function more smoothly that way.
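To make the 90% rule of thumb concrete, here is a minimal Python sketch of the bookkeeping. The host numbers (8 cores, 32 GB) come from our machine; the VM names and sizes, the `VM` class, and the `allocation_report` helper are all invented for illustration and have nothing to do with any VMware API.

```python
from dataclasses import dataclass

# Host specs quoted in the article.
HOST_CORES = 8
HOST_RAM_GB = 32
# The 90% rule of thumb from the text.
BUDGET_FRACTION = 0.90


@dataclass
class VM:
    name: str      # hypothetical VM label
    vcpus: int     # CPUs the VM is told it "owns"
    ram_gb: float  # RAM the VM is told it "owns"


def allocation_report(vms, host_cores=HOST_CORES, host_ram_gb=HOST_RAM_GB,
                      fraction=BUDGET_FRACTION):
    """Compare what the running VMs "own" against a fraction of the host."""
    cpu_used = sum(vm.vcpus for vm in vms)
    ram_used = sum(vm.ram_gb for vm in vms)
    cpu_budget = host_cores * fraction
    ram_budget = host_ram_gb * fraction
    return {
        "cpu_used": cpu_used,
        "cpu_budget": cpu_budget,
        "ram_used": ram_used,
        "ram_budget": ram_budget,
        "within_budget": cpu_used <= cpu_budget and ram_used <= ram_budget,
    }


# Hypothetical farm layout; these sizes are made up for the example.
farm = [
    VM("ubuntu-jeos", vcpus=2, ram_gb=4),
    VM("solaris", vcpus=2, ram_gb=8),
    VM("windows", vcpus=2, ram_gb=8),
    VM("freebsd", vcpus=1, ram_gb=4),
]

report = allocation_report(farm)
print(report)
```

With this made-up layout the farm "owns" 7 CPUs and 24 GB, which stays under the 90% budget. Adding one more single-CPU VM would push the CPU total to 8 and over the line.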

Managing Resources: Understanding Your Builds' Needs

I need to stop things here and provide a disclaimer. Some of our builds are quite large, and can take a long time to complete. In the past, each time we encountered resource problems in our build farm, it's been due to these huge builds running simultaneously on all available VMs. So, the load put on our particular build farm varies tremendously from moment to moment. This may seem like a strange niche case, but there's a critical lesson here.

You have to plan for the maximum momentary load you're likely to see on the whole build farm.

A single instant of maxed-out RAM on your ESXi hardware is all it takes to cause one or more of your VMs to grind to a halt. If you have more than one build that can run for a long time or runs simultaneously on all VMs, you need to be prepared for saturating your server's hardware. You can limit the effects of this a little bit by using the Locks and Latches Hudson plugin and keeping long-running jobs on the same lock. This will cause your build times for any particular distributed job to balloon, so be prepared. Failing to limit the load this way can completely lobotomize a VM, leaving it with a corrupted disk or something similar. You'll have to ask someone else for a technical explanation of why this is, but believe me: I've had to rebuild VMs multiple times because of this problem.
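One rough way to think about "maximum momentary load" is to add up the worst case: every executor on every VM running its hungriest build at the same moment. The sketch below does that arithmetic in Python. Every number in it is hypothetical, and the per-VM fields (`os_overhead_gb`, `executors`, `largest_build_gb`) are names I made up for the example, not anything Hudson or VMware reports.

```python
def peak_ram_gb(vms):
    """Worst-case RAM if every executor on every VM runs its largest build at once."""
    return sum(
        vm["os_overhead_gb"] + vm["executors"] * vm["largest_build_gb"]
        for vm in vms
    )


def fits_on_host(vms, host_ram_gb):
    """True when the worst-case total stays within the host's physical RAM."""
    return peak_ram_gb(vms) <= host_ram_gb


# Hypothetical farm: four identical VMs, two executors each.
# All figures are invented for illustration.
vms = [
    {"name": f"vm-{i}", "os_overhead_gb": 1.0, "executors": 2, "largest_build_gb": 3.0}
    for i in range(4)
]

print("peak RAM (GB):", peak_ram_gb(vms))
print("fits on a 32 GB host:", fits_on_host(vms, host_ram_gb=32))
```

If the worst case doesn't fit, the choices are the ones described above: fewer executors, smaller VMs, or a shared lock that keeps the biggest jobs from overlapping.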

On the other hand, if you have a lot of small builds that are unlikely to jam up the works for long by themselves, you can probably tune the number of Hudson executors on each VM and leave the CPU/RAM allocations to each VM as suggestions. That way, VMware doesn't have to set aside that segment of its resources for an idle VM. Even if you have this sort of setup, but still have that one huge build, you can avoid Hudson gridlock by having at least two executors on each VM where the long-running job will build. This way, the more agile builds have a passing lane to get around that trundling, grindingly slow 18-wheeler of a build.

We’ve actually been able to cheat the resource allocation rule I mentioned above to a certain extent. Our private build farm tends to have much faster, less frequent builds, so we've almost doubled the number of running VMs on the ESXi server, since the VMs allocated to the private build farm are idle much of the time. As we add new jobs to each build farm, I'm sure this will cease to be true. But for now, the two farms look like they're running on twice as much hardware as we physically have, and they seem happy as two peas in a pod.

Keeping Time

Virtual machines running on ESXi tend to have some trouble keeping time. It's a little embarrassing, and we try not to talk about it in public, but there it is. Left to their own devices, VM operating systems may move backward or forward in time relative to any outside fixed point. To the outsider, some VMs will appear slightly blue, while others will appear slightly reddish. Einstein would be impressed.

Okay, bad physics jokes aside, they're not moving in time; they just sort of lose track of it. The problem is pretty well documented on the internet, and there are some pretty good instructions for compensating. It seems the timekeeping problem arises from CPU allocation and kernels that count CPU "ticks" to keep time. The best practice seems to be taking a two-pronged approach to keep everything synchronized. First, install VMware Tools; second, configure NTP time synchronization on each VM operating system.
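To get a feel for how quickly tick counting can go wrong, here is a deliberately simplified bit of arithmetic in Python. It assumes a guest kernel that advances its clock by one tick at a fixed nominal rate and simply never sees some of those ticks; the rates below are hypothetical, and real hypervisors do more (such as replaying missed ticks later), so treat this as intuition only.

```python
def drift_seconds(expected_hz, delivered_ticks_per_sec, wall_seconds):
    """Seconds the guest clock falls behind (positive) or runs ahead (negative).

    expected_hz: ticks per second the guest kernel assumes it receives.
    delivered_ticks_per_sec: ticks per wall-clock second it actually receives.
    """
    # Guest time advances by 1/expected_hz per delivered tick.
    guest_seconds = delivered_ticks_per_sec * wall_seconds / expected_hz
    return wall_seconds - guest_seconds


# Hypothetical numbers: a 1000 Hz guest that only gets 990 ticks per second.
lag = drift_seconds(expected_hz=1000, delivered_ticks_per_sec=990, wall_seconds=3600)
print(f"Guest clock lag after one hour: {lag} s")
```

In this toy model, losing just 1% of ticks costs the guest 36 seconds every hour, which is plenty to throw timestamps off across a build farm.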

VMware Tools is meant to keep VMs in sync by catching them up when they fall behind (probably due to not getting the CPU access they expect). However, the tools are apparently useless for reining in VMs that run out ahead of the bunch. Personally, I have no idea why a VM operating system would skip ahead, but the internet assures me it's possible, and I've actually seen it happen in our build farm. To handle this problem, we must enable NTP clock synchronization for our VMs.

Installing VMware Tools is a breeze on most operating systems, except for FreeBSD. It seems there is no version available for BSD, so you're left with NTP to keep things up-to-date. That's okay; it does well. As for enabling NTP, this is also a breeze on most operating systems. Most already have NTP installed, or can have it installed through a simple command like:

<code>sudo aptitude install ntp
</code>

…Except, of course, on Windows. On Windows, you'll need to dig into the Policies section of the Control Panel. This is far less intuitive and simple than on any other OS (except possibly Solaris, and on Solaris the cure for all problems is a good manual).

One other interesting point about NTP: if you're thinking about using an NTP server that runs on your VMware machine, STOP! Use an NTP server external to VMware; remember how VMware has some problems with timekeeping? I may have touched on this point somewhere above. In our build farm, we're using the following NTP configuration (or approximations of this, on the Windows systems):

<code>$ cat /etc/ntp.conf

server 0.north-america.pool.ntp.org
server 1.north-america.pool.ntp.org
server 2.north-america.pool.ntp.org
server 3.north-america.pool.ntp.org
</code>

Pretty simple, really. Using multiple time sources allows your network to compensate for any clock skew that may appear in any one of the sources. It should also make your configuration more resilient to partial network outages, such as when the entire east coast of the US disappears from the internet (it's happened before).
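Here is a tiny Python sketch of why several sources help. It is not what ntpd actually does (its selection and clustering logic is considerably more involved); it just shows the intuition that a robust combination such as the median shrugs off one bad source, where a plain average does not. The source names and offsets are invented for the example.

```python
from statistics import mean, median


def combined_offset(offsets_seconds):
    """Pick a clock offset that one wildly wrong source cannot drag around."""
    return median(offsets_seconds)


# Hypothetical measured offsets (local clock minus source), in seconds.
# "source-d" stands in for a single time source with a badly skewed clock.
offsets = {
    "source-a": 0.012,
    "source-b": 0.015,
    "source-c": 0.011,
    "source-d": 2.5,
}

values = list(offsets.values())
robust = combined_offset(values)
naive = mean(values)
print(f"median offset: {robust:.4f} s, mean offset: {naive:.4f} s")
```

With only one source configured, the skewed clock in this example would have been taken at face value.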

But why go to all this trouble? Why does it matter that all OS clocks tick in perfect harmony? Apparently, Hudson can lose build results if the timestamps are off by too much. It's not just an urban legend; we've had this problem (which is why I know so much about VMware's timekeeping). Again, I'm not sure why Hudson loses build results, or why it relies on timestamps from slave instances. These are questions best answered by Hudson developers. What I can tell you is that keeping time in sync throughout your build farm is in your best interest.

Written by John Casey

John is a former Engineer at Sonatype and is a software engineering expert specializing in build process / automation (particularly for Java software). His experience emphasizes engineering, not just software development; he is interested in the process of making software reliable and supportable in ...
