Buildbot and Intermittent Tests

By Jon Watte, Technical Director at IMVU, Inc.

At IMVU, we do continuous deployment. You may have read previous blog entries describing how we ship code to production 50 times a day. Today, I’d like to talk a little bit about what we’ve learned as IMVU has grown as a company and engineering organization, and what we’re doing to support this growth.

The workflow for anyone who wants to check code or assets into our website looks something like this:

As part of developing a feature increment, we also build the automated tests that prove that the feature works as intended. Over time, as we develop more and more features, we also accumulate more and more tests. Each time we check in code or assets, that check-in is run through all of the tests that have been developed over the years — at last count, this is over 40,000 tests, with seven digits’ worth of individual test assertions (the exact number varied depending on how you count).

The great thing about this is that, if your feature accidentally breaks some part of the code that you didn’t think or know about, the buildbot will tell you. You have the ability to run any test you want in your local sandbox before you check in, and you’ll generally run the tests that you think may be relevant, but in a code base the size and complexity of all of IMVU, you can’t possibly know or run everything.We have an informal goal for our buildbot to run all the tests in 12 minutes or less. This means that we can build 5 times an hour. Some contributors show up for work 8 am (Eastern time); others work until late in the night (Pacific time), so there’s opportunity for well over 50 builds per day.

However, not all builds result in success — the whole reason for having a buildbot is to catch the mistakes that inevitably slip in. When a submitted change causes one or more tests to fail, we call that “making buildbot red,” and that change is not allowed out onto the live website. Instead, you as committer will back out those changes (we have scripts that make this easy and painless), and investigate the failure (buildbot saves the appropriate log files and other information) to fix it before you re-submit.

Unfortunately, tests, like code, are written by humans. This means that tests will have bugs, Some bugs will immediately be obvious — but others will sit silently, working 99% of the time (or, more likely, 99.999% of the time). Then, once in a blue moon, some timing issue or bad assumption will surface, and the test will spuriously fail. What’s worse is that, most often, running the test again will likely not fail, so tracking down the failure requires a fair amount of diligence and effort. When this happens in your local sandbox, it’s not so bad, but if one of these intermittent test failures happens on buildbot, it’s a bit of a problem, because nobody can deploy their code to production while buildbot is red. At a minimum, the test failure has to be investigated to the point that someone understands that it’s an intermittent failure, rather than caused by the code that you committed.

Common causes of intermittent tests may be things that are hard to get right for humans (like systems with sophisticated caching strategies), or things that are hard or expensive to control for in a testing environment (assuming some pre-existing data, or absence of data, in particular databases, for example), or just because of the underlying platform. For example, we test user visible web pages using a system that drives web browsers, and then we assert things about both the state of the web page within the browser, the requests that the browser has made, and even screen shots taken of certain pages. Unfortunately, web browsers will sometimes decide to crash for no discernible reason. There are other components that are not 100% reliable, either, including tests that have to integrate with external services over the internet, although I’ll refrain from going too deep into that particular discussion in this post.

We have been trying to flush out intermittent tests by running tests all night, every night. Because no code changes between the runs, any test that goes red during this nightly grind is intermittent. We will look at the status in the morning, and any test that went red is distributed to the team that seems likely to know the most about the particular failure, for fixing.

Still, intermittent tests have been one of the most annoying thorns in the side of this entire process. As we scale the organization, buildbot will be busy most of the day, generally building three to five different change sets as part of the same build. Any of those committers will be prevented from deploying their changes to production until at least the next build has completed. This can add half an hour to the build-tests-deploy cycle, which hashes the mellow of continuous deployment driven development goodness!

To illustrate, the worst week of our internal buildbot for 2010 looked like this:

Clearly, in an environment that values quick iteration and agile development, if you find yourself in that position, you want to do something about it. So we did!

The key insight is that an intermittent test only fails some of the time, but a real failure caused by new code will fail all of the time. The second insight was that we could gather the specific outcome of each test within each test run, to track metrics on tests over time. We started by just adding these measurements and tracking tests over time. Once we learned that we could predict an intermittent test based on two successive test runs, and that we get clear green-or-red status out of each test run, we re-structured the way that we run tests.

Because we have 40,000 tests to run, allocated over several thousand test implementation files, we split the entire batch of tests across a large number of virtual machines, hosted on a few really beefy server-class machines (raid-6 SSD, 48 GB RAM, dual-CPU six-core-Xeon type machines). Our first implementation spreads the build files across different machines, and uses measurements of how long each test file is to statically sort the files into reasonably equal run lengths. This worked for many years, but now we have better information!

We updated our test runners to first discover all tests that need to run, and put those into a database of available tests. Each build slave will then request one test at a time from this database (which we call the Test Distributor), run the test, and report back. We start by doling out the slowest tests, and then dole out faster and faster tests, to achieve almost 100% perfect balancing in runtime. Even better, though — we know whether a test fails, while the build is still running. Thus, we made the distributor re-distribute a test that is reported as red to some other build slave. If the test comes back as green, we then classify the test as intermittent, and do not make that one red test get in the way of declaring the change set we’re testing as successful. In addition, we remove the build slave that reported the intermittent from the pool of available builders, so that we can analyze the cause of the intermittency whenever convenient.

Does this work?
Oh, yes! Here is a screen shot from the buildbot availability status this week. You will see that the buildbot caught some real, red check-ins, but intermittent tests are not getting in the way anymore. Thursday, we had an entire day with 100% green availability!

Pretty nifty, huh? I’m glad that I work in an environment as dynamic as IMVU that really cares about keeping our productivity up. The more we can remove annoying roadblocks from the path between developers doing hard work, and them seeing their hard work pay off, the better we will do.