Mutation Testing

By Llorens Marti Garcia

We’ve been actively working for almost 3 years on IMVU’s new 3D engine called Northstar, which is written in C++ for performance and portability, and compiled to many platforms, including the web using Emscripten.

Because Test Driven Development (TDD) has demonstrated over the years that we can write software in a more robust and efficient manner as well as with higher quality, our 3D engine has a lot of unit tests trying to cover as many features as possible over more than 100K lines of code.

Problem with Unit tests

With this scenario we recently noticed something interesting about TDD. What happens if after hundreds of thousands of lines of code and thousands of unit tests the rhythm of adding those test for new code decreases? The Answer is simple, the quality of the code decays, the probability of future bugs increases or worse, the ability to re-factor some code ends up being more challenging.

IMVU tried to solve the problem in the past with code coverage tools. These tools help, but they do not give the full picture in some scenarios, such as when a test calls one function that executes big portions of the code in one test. Here I want to explain how we attacked the problem from another point of view.

While looking at some source files of our 3D engine, our initial question for testing the quality of our tests was: If we remove this line from this source file, will the tests yell at us because it is missing?

We found this initial idea very interesting. First, we are not testing if the tests cover the code but if there was any test testing a feature related with this removed line of code. And second, the process can end up giving us a list of lines to look at, which is very handy to detect missing tested features. We can find more about Mutation testing and how it help us to find these deficiencies here.

Overview of the solution

We use Git for a source control. C++ is the main language of the project. We use Python for the framework that drives code testing. Additionally, our commit history is as linear as possible inside our repository.

This is a quick overview of the solution we implemented:

1.- Git pull 3D engine code
2.- Checkout commit with hash X
3.- Git submodule update
4.- Get the diff text between commit X~1 and X
5.- Parse the diff and extract all the added lines in commit X
6.- For every line from the step 5:
6.1.- Decide which kind of mutation we will attempt.

// for a normal line we select:  // —  (a commented line)

// for an if (…) line we select if (true) {  and if (false) {

// for a for (…;…;…) we select for (…; false; …) {

// for a for (… : …) (explained below)

7.- For every line and its substitution line/lines:

7.1.- Find the the original line in the code

7.2.- Substitute the original line with the substitution line

7.3.- Build Northstar

7.3.1.- If the build fails, we return because this line was necessary.

7.4.- Build Run Northstar tests

7.4.1.- If at least one of the tests fail, we return because this line was necessary.

7.4.2.- If the tests hung, we treat it as if the tests failed. (see later)

7.5.- If both 7.3 and 7.4 finished with OK then the original line was not covered by any test

8.- Compute % between all the lines tested and all the lines not covered.

Details of the solution and some tips

Many parts of the algorithm above can be tricky:

1.- Git diff (step 4)

What we want is to know is the exact diff between commit X and the commit before it. With this in mind, it is important to have a commit history as linear as possible. Using git rebase often may do the job (we default git pull to use the -r option). In case we don’t have a linear history, then the operation ‘git diff X~1 X’ will give us inaccurate code to test. This can happen when, for example, we are asking for a diff between a normal commit and a merge commit.

2.- Line substitution (step 6.1)

Here comes one of the trickiest part in this project: trying to parse C++, understanding it, and replacing the code by one specially crafted for the occasion.

The most common case is replacing a normal C++ line of code with a blank line (or a comment line ‘//—’ in case we are debugging). The problem with this approximation is that it does not work everywhere. For example, what happens if the statement line is split across multiple lines? In our solution, we use a combination of regex helpers and more complex C++ parsers.

With conditionals, what we want to do is force them to be active or inactive, this is why when we find an IF statement, we replace them with two possible lines: ‘if (true) {‘ and ‘if (false) {‘

This two scenarios tell us if there is any ‘test’ exercising the opposite of the line we are mutating.  For example, if we substitute ‘if (true) {‘ and after compiling the project and running the tests everything is reported as OK, then we can be sure that no ‘test’ covers the feature that is represented in code by this IF statement.

In the same way, when we find a FOR statement, we replace it depending on the type:

A normal FOR statement will be replaced by:

     for(;false;) {

A C++11 FOR statement with the form for(<variable> : <container>) will be replaced by:

for(<variable> = *(<container>.begin()); false; ) {

These are some methods of replacing IF and FOR statements to force them to be disabled or not. Again, keep in mind that parsing C++ is very difficult, and this code can be improved in many different ways.

A special point about FOR statements: There is the possibility that we will replace a line of code inside the body that will make it loop forever. This is why we have a timeout while running the tests, because they may hang forever. If this is the case, wes hould treat this scenarios as if the tests failed when run.

3.- Removing lines (inside step 6.1)

There are many lines we can ignore testing; this is because we know that the code will fail to compile if we remove them. For example, rules like:

  • Line only contains ‘{‘ or ‘}’
  • Line starts with a common type: int, float, double, char, etc.
  • Line starts with ‘class’ and ends with ‘{‘
  • Line starts with #include

As we can see, we can improve the time of testing a given commit by not testing those lines that will make the compilation to fail or their value is not interesting for us, like ‘#include’.

4.- Parallelization

Due to the nature of these code testing, the process is highly parallelizable. For a given project and a given commit of that project, given N lines we want to test, each of those lines can be tested on a separate machine.

Examples

1.- Simple function

Fig 1: Function onDetach examined

BlogFig1
If we look at Fig 1, we can see some interesting output:

Line 107: If we mutate that line to a comment line, any test will fail.
Line 109: Not tested IF statement.
Line 115: Not tested FOR statement.
Line 117: Not tested IF statement.
Line 118: Line mutated to a comment, any test fail.
Line 121: Line mutated to a comment, any test fail.
Line 124: Line mutated to a comment, any test fail.

The main idea behind this output about onDetach() function is that we are lacking tests for the behaviour of this function. We are not testing the reset of the mesh component. We are neither testing what happens if bonesAsGameObjects is empty. Also we are not testing that when calling onDetach(), we reassign GameObject children.

In the images below, we can see how the algorithm changes the original code (Fig 2) with some mutations (Fig 3 and 4).

Fig 2: Original file.

BlogFig2

In Fig 3 we can see how line 107 mutated to a commented line. After this change, the source code is compiled without any errors and all the tests finish without any problem. Because of this, line 107 is detected as a line that needs to be tested, and is marked on Fig 1.

Fig 3: Line 107 mutated to a comment.

BlogFig3

meshComponent_ is a weak pointer to a mesh component, and when we detach the skeleton component, we want to reset this weak pointer.

By not doing it, we will end up with a skeleton component already detached from a Game Object that still can have access to a mesh component. The meaning of this line is to prevent detached components to have access to previous neighbour components. If we don’t test that line and someone remove it, we can have weird scenarios where detached components still affects current components.

In Fig 4 we can see how line 109 mutated to a if (false) {. After this change, the source code is compiled without any errors and all the tests finish without any problem. Line 109 mutation denotes there is no test exercising that if bonesAsGameObjects_ is empty, we remove the first child. Because of this it is marked with red on Fig 1.

Fig 4: Line 109 mutated to if (false) {

BlogFig4

2.- False Negatives

While testing with this tool, we can have false negatives scenarios, and here is an example:

Fig 5: False negative

BlogFig5

In figure 5,  we can see a DomainBase object that appears to be completely tested. The problem here is that we don’t have any test for that class. What happens is that because DomainBase is used by other classes on the project, if we mutate any line, other tests will fail because of the mutation.

This does not means that DomainBase is robust, it only means that the code that is using DomainBase is not exercising any exploit on it.

3.- False positive

Fig 6: False positive

BlogFig6

Normally we can find false positives iin these lines of code that tries to optimize memory for some algorithm. A good example is reserve() from a std::vector. These lines can be removed, and build and testing processes will not become affected. This is because reserve() makes memory access more performant but in any case it change the behavior of the algorithm. To prevent these kind of false positives we can add filters to avoid any mutation in these lines.

Conclusion

The most important idea of this project is the ability to spot areas in the source code that need mutation testing. This project has already found many bugs in our source code, including dead code and code that was forgotten and not compiled.

The way we found that some code was not compiled at all was because our tool was marking the constructor declaration line of one class as a discardable one. Here we can see the affected class:

Fig 7: Not compiled code.

BlogFig7

If we think about it, how can this class be compiled if we mutate the line 11 with an empty line. After some investigation we discovered that this class was removed from the build process script.

We also noticed that in large code bases, some files are tested not by its own tests but by the code that it is exercising it. This means that the code is safely tested only while it remains in the actual project and with no major changes. A good move to solve this problem is to not compile the whole project for testing a line but instead to compile only the tests associated with the file we are modifying.

Something that still requires work is the speed of getting the test results. Currently it takes minutes to get feedback about one given commit, and this disrupts the development iteration cycle. This can be fixed by,for example, parallelizing the process with more build machines.

We find mutation testing to be highly valuable, and I hope this idea can help other engineers deliver solid, well-tested code.