Buildbot and Intermittent Tests

By Jon Watte, Technical Director at IMVU, Inc.

At IMVU, we do continuous deployment. You may have read previous blog entries describing how we ship code to production 50 times a day. Today, I’d like to talk a little bit about what we’ve learned as IMVU has grown as a company and engineering organization, and what we’re doing to support this growth.

The workflow for anyone who wants to check code or assets into our website looks something like this:



As part of developing a feature increment, we also build the automated tests that prove that the feature works as intended. Over time, as we develop more and more features, we also accumulate more and more tests. Each time we check in code or assets, that check-in is run through all of the tests that have been developed over the years — at last count, this is over 40,000 tests, with seven digits' worth of individual test assertions (the exact number varies depending on how you count).

The great thing about this is that, if your feature accidentally breaks some part of the code that you didn't think or know about, the buildbot will tell you. You can run any test you want in your local sandbox before you check in, and you'll generally run the tests that you think may be relevant, but in a code base the size and complexity of all of IMVU, you can't possibly know or run everything.

We have an informal goal for our buildbot to run all the tests in 12 minutes or less. This means that we can build 5 times an hour. Some contributors show up for work at 8 am (Eastern time); others work until late in the night (Pacific time), so there's opportunity for well over 50 builds per day.

However, not all builds result in success — the whole reason for having a buildbot is to catch the mistakes that inevitably slip in. When a submitted change causes one or more tests to fail, we call that “making buildbot red,” and that change is not allowed out onto the live website. Instead, you as committer will back out those changes (we have scripts that make this easy and painless), and investigate the failure (buildbot saves the appropriate log files and other information) to fix it before you re-submit.

Unfortunately, tests, like code, are written by humans. This means that tests will have bugs. Some bugs will be immediately obvious — but others will sit silently, working 99% of the time (or, more likely, 99.999% of the time). Then, once in a blue moon, some timing issue or bad assumption will surface, and the test will spuriously fail. What's worse is that, most often, running the test again will not reproduce the failure, so tracking it down requires a fair amount of diligence and effort. When this happens in your local sandbox, it's not so bad, but if one of these intermittent test failures happens on buildbot, it's a bit of a problem, because nobody can deploy their code to production while buildbot is red. At a minimum, the test failure has to be investigated to the point that someone understands that it's an intermittent failure, rather than caused by the code that you committed.

Common causes of intermittent tests include things that are hard for humans to get right (like systems with sophisticated caching strategies), things that are hard or expensive to control for in a testing environment (assuming some pre-existing data, or absence of data, in particular databases, for example), or just quirks of the underlying platform. For example, we test user-visible web pages using a system that drives web browsers, and then we assert things about the state of the web page within the browser, the requests that the browser has made, and even screen shots taken of certain pages. Unfortunately, web browsers will sometimes decide to crash for no discernible reason. There are other components that are not 100% reliable, either, including tests that have to integrate with external services over the internet, although I'll refrain from going too deep into that particular discussion in this post.
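
To make this concrete, here is a toy example (not from our code base) of the kind of test that is green almost every run: it hides a timing assumption that only breaks when the machine running it happens to be slow or under load.

```python
# Hypothetical illustration of an intermittent test: the hard-coded time
# budget holds on an idle machine, but fails spuriously under load.
import time
import unittest


def build_search_index(items):
    """Toy stand-in for real work whose duration varies with machine load."""
    return {item: len(item) for item in sorted(items)}


class TestSearchIndexPerformance(unittest.TestCase):
    def test_index_builds_quickly(self):
        start = time.time()
        build_search_index(["apple", "banana", "cherry"] * 10000)
        elapsed = time.time() - start
        # Bad assumption: "this always finishes in under 30 ms".
        # On a loaded or slower build slave this assertion fails once in a
        # blue moon, and re-running it will almost always pass.
        self.assertLess(elapsed, 0.03)


if __name__ == "__main__":
    unittest.main()
```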

We have been trying to flush out intermittent tests by running tests all night, every night. Because no code changes between the runs, any test that goes red during this nightly grind is intermittent. We will look at the status in the morning, and any test that went red is distributed to the team that seems likely to know the most about the particular failure, for fixing.
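
The nightly grind itself is conceptually simple; the sketch below shows the idea (run_suite is a hypothetical stand-in for invoking the real suite). Since the revision under test never changes between iterations, any test that goes red even once is, by definition, intermittent.

```python
# Minimal sketch of the nightly grind: re-run an unchanged revision many
# times and flag every test that ever fails as intermittent.
import collections
import random


def run_suite():
    """Hypothetical stand-in for a full test run; returns {test_name: passed}."""
    return {
        "test_login": True,
        "test_checkout": random.random() > 0.01,  # flaky roughly 1% of the time
    }


def nightly_grind(iterations=200):
    red_counts = collections.Counter()
    for _ in range(iterations):
        for test, passed in run_suite().items():
            if not passed:
                red_counts[test] += 1
    return red_counts


if __name__ == "__main__":
    for test, count in nightly_grind().items():
        print(f"{test}: red {count} times -> intermittent, assign to owning team")
```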

Still, intermittent tests have been one of the most annoying thorns in the side of this entire process. As we scale the organization, buildbot will be busy most of the day, generally building three to five different change sets as part of the same build. Any of those committers will be prevented from deploying their changes to production until at least the next build has completed. This can add half an hour to the build-test-deploy cycle, which harshes the mellow of continuous-deployment-driven development goodness!

To illustrate, the worst week of our internal buildbot for 2010 looked like this:

Clearly, in an environment that values quick iteration and agile development, if you find yourself in that position, you want to do something about it. So we did!

The key insight is that an intermittent test only fails some of the time, but a real failure caused by new code will fail all of the time. The second insight was that we could gather the specific outcome of each test within each test run, to track metrics on tests over time. We started by just adding these measurements and tracking tests over time. Once we learned that we could predict an intermittent test based on two successive test runs, and that we get clear green-or-red status out of each test run, we re-structured the way that we run tests.
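
A rough sketch of what that outcome tracking can look like is below; the schema and helper names are placeholders for illustration, not our actual system. Recording every test's result for every build makes it easy to compute a per-test flakiness rate over time.

```python
# Illustrative per-test result tracking with a tiny SQLite schema
# (placeholder names, not the real database).
import sqlite3


def open_results_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS test_results (
                      build_id INTEGER,
                      test_name TEXT,
                      passed INTEGER)""")
    return db


def record(db, build_id, test_name, passed):
    db.execute("INSERT INTO test_results VALUES (?, ?, ?)",
               (build_id, test_name, int(passed)))


def flakiness(db, test_name):
    """Fraction of recorded runs in which the test failed."""
    total, failed = db.execute(
        "SELECT COUNT(*), SUM(passed = 0) FROM test_results WHERE test_name = ?",
        (test_name,)).fetchone()
    return (failed or 0) / total if total else 0.0


if __name__ == "__main__":
    db = open_results_db()
    for build in range(1, 101):
        record(db, build, "test_checkout", passed=(build % 50 != 0))
    print(flakiness(db, "test_checkout"))  # 0.02: rare failures, a flakiness signal
```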

Because we have 40,000 tests to run, allocated over several thousand test implementation files, we split the entire batch of tests across a large number of virtual machines, hosted on a few really beefy server-class machines (RAID-6 SSD, 48 GB RAM, dual-CPU six-core Xeon type machines). Our first implementation spread the test files across different machines, using measurements of how long each test file takes to run to statically sort the files into chunks of reasonably equal run length. This worked for many years, but now we have better information!
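
For the static scheme, one reasonable way to cut the files into equal chunks is the classic greedy heuristic: hand the slowest file to whichever machine currently has the least total runtime. The sketch below illustrates that heuristic; treat it as an example, not our exact implementation.

```python
# Greedy "slowest file first onto the least-loaded machine" balancing.
import heapq


def balance(test_files, num_machines):
    """test_files: list of (filename, measured_seconds). Returns one bucket per machine."""
    # Min-heap of (total_seconds_assigned, machine_index, bucket).
    machines = [(0.0, i, []) for i in range(num_machines)]
    heapq.heapify(machines)
    for name, seconds in sorted(test_files, key=lambda t: -t[1]):
        total, idx, bucket = heapq.heappop(machines)
        bucket.append(name)
        heapq.heappush(machines, (total + seconds, idx, bucket))
    return [bucket for _, _, bucket in sorted(machines, key=lambda m: m[1])]


if __name__ == "__main__":
    files = [("test_chat.py", 90), ("test_shop.py", 75), ("test_avatar.py", 60),
             ("test_login.py", 20), ("test_search.py", 15)]
    for i, bucket in enumerate(balance(files, num_machines=2)):
        print(f"machine {i}: {bucket}")
```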

We updated our test runners to first discover all tests that need to run, and put those into a database of available tests. Each build slave will then request one test at a time from this database (which we call the Test Distributor), run the test, and report back. We start by doling out the slowest tests, and then dole out faster and faster tests, to achieve almost 100% perfect balancing in runtime. Even better, though — we know whether a test fails while the build is still running. Thus, we made the distributor re-distribute a test that is reported as red to some other build slave. If the test comes back green, we classify the test as intermittent, and do not let that one red test get in the way of declaring the change set we're testing successful. In addition, we remove the build slave that reported the intermittent failure from the pool of available builders, so that we can analyze the cause of the intermittency whenever convenient.
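
Here is a minimal sketch of that behavior; the names and structure are illustrative rather than our actual Test Distributor code. Slaves pull one test at a time, slowest first; a red result is re-run on a different slave; a green re-run marks the test intermittent and quarantines the slave that saw the failure, while a second red marks a real failure that blocks the change set.

```python
# Sketch of a pull-based test distributor with intermittent-test detection.
import heapq


class TestDistributor:
    def __init__(self, tests_with_durations):
        # Max-heap on historical duration: hand out the slowest tests first.
        self._queue = [(-seconds, name) for name, seconds in tests_with_durations]
        heapq.heapify(self._queue)
        self._pending_retries = []      # tests that failed once, awaiting a re-run
        self._first_failure_slave = {}  # test -> slave that reported the failure
        self.intermittent = set()
        self.real_failures = set()
        self.quarantined_slaves = set()

    def next_test(self, slave):
        # Prefer re-running a failed test, but never on the slave that failed it.
        for i, test in enumerate(self._pending_retries):
            if self._first_failure_slave[test] != slave:
                return self._pending_retries.pop(i)
        if self._queue:
            return heapq.heappop(self._queue)[1]
        return None  # nothing left for this slave right now

    def report(self, slave, test, passed):
        if test in self._first_failure_slave:
            # Second run of a test that already failed once somewhere else.
            if passed:
                self.intermittent.add(test)
                self.quarantined_slaves.add(self._first_failure_slave[test])
            else:
                self.real_failures.add(test)  # red on two slaves: block the build
        elif not passed:
            self._first_failure_slave[test] = slave
            self._pending_retries.append(test)
```

One nice property of the pull model is that load balancing comes essentially for free: faster slaves simply come back for more work sooner.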

Does this work?
Oh, yes! Here is a screen shot from the buildbot availability status this week. You will see that the buildbot caught some real, red check-ins, but intermittent tests are not getting in the way anymore. Thursday, we had an entire day with 100% green availability!

Pretty nifty, huh? I’m glad that I work in an environment as dynamic as IMVU that really cares about keeping our productivity up. The more we can remove annoying roadblocks from the path between developers doing hard work, and them seeing their hard work pay off, the better we will do.

13 thoughts to “Buildbot and Intermittent Tests”

  1. Awesome! How long has the central test database been in place? Any data on what happens to the rate of intermittent failures? (Are they being ignored more or dealt with more now?)

  2. Wow. That’s really close to a system I developed with some colleagues all the way back in 2004. It had intermittent test detection from re-running on different build agents, ordering of tests from slowest to fastest, etc. We also had some custom reporting of performance tests, and running devs’ builds pre commit (as you say, there’s no way they can run all the tests themselves). Last I heard, it was running around 330,000 tests in about an hour.

    I really think there’s a lot of potential in systems like these, but the trouble is that very few projects get big enough to need it or justify the expense. There were a lot of ideas I would have loved to implement, but they weren’t needed by the company I was working for, so they didn’t happen.

    We never did make anything public about what we were doing, so well done for putting something up and (hopefully) getting some discussion going, and new ideas flowing.

    The area of integrating rich CI systems with source control, release processes, bug tracking, and feature tracking is really underdeveloped, IMO. People seem content for all of the components of development to be relatively basic and unaware of each other. There's basically nothing on the market that does this stuff, so you have to write all the custom parts yourself, and if it took you as long as it did us, that's a significant investment in time!

    Anyway, this was a cool post to read, glad there’s someone else out there with similar ideas!

  3. I think that the engineering that went behind distributing the tests is pretty fascinating, but have you ever run into a case where an intermittent test failure was actually indicative of a real failure?

    Put another way, if a test failure can be ignored (even if intermittent), is there any reason to run the test in the first place?

  4. Thanks for sharing about the technical side of the continuous deployment activities. I’d be interested in hearing about whether your team(s) started with a continuous deployment behavior or if it was transitioned to over time. If you transitioned to it, would you be willing to share about how the behavior transition was fostered?

  5. We use good old teamwork to keep the intermittent tests at a tolerable level — it's about the same as before, with a little more flexibility on exactly when we work on the tests. Because slaves are pulled from the pool, build times would go up if too many slaves were pulled without the underlying tests getting fixed; that's partly by design, to make sure there's feedback that keeps fixing tests a priority!

  6. At what point do you guys do real world testing on the new code? Like, at what part does one or all of your tech guys install it on a non development pc and test it out?

  7. An underlying assumption of this article seems to be that intermittent test failures are due to test bugs instead of product bugs. In my experience with automated test systems, the root cause is more often than not in the test. But sometimes the root cause is a flaw in the product itself.

    How often, in your experience, does retrying of tests wind up masking “real” bugs?

  8. Hi Ken,

    Chad here. Intermittent failures are indeed often issues in the product. However, such an issue should not interrupt the main flow of development. When an intermittent failure occurs, we pull the test for investigation when the appropriate team is ready for a new task. In the meantime, other teams can continue to work!
