Mutation Testing

By Llorens Marti Garcia

We’ve been actively working for almost 3 years on IMVU’s new 3D engine called Northstar, which is written in C++ for performance and portability, and compiled to many platforms, including the web using Emscripten.

Because Test Driven Development (TDD) has demonstrated over the years that we can write software in a more robust and efficient manner as well as with higher quality, our 3D engine has a lot of unit tests trying to cover as many features as possible over more than 100K lines of code.

Problem with Unit tests

With this scenario we recently noticed something interesting about TDD. What happens if after hundreds of thousands of lines of code and thousands of unit tests the rhythm of adding those test for new code decreases? The Answer is simple, the quality of the code decays, the probability of future bugs increases or worse, the ability to re-factor some code ends up being more challenging.

IMVU tried to solve the problem in the past with code coverage tools. These tools help, but they do not give the full picture in some scenarios, such as when a test calls one function that executes big portions of the code in one test. Here I want to explain how we attacked the problem from another point of view.

While looking at some source files of our 3D engine, our initial question for testing the quality of our tests was: If we remove this line from this source file, will the tests yell at us because it is missing?

We found this initial idea very interesting. First, we are not testing if the tests cover the code but if there was any test testing a feature related with this removed line of code. And second, the process can end up giving us a list of lines to look at, which is very handy to detect missing tested features. We can find more about Mutation testing and how it help us to find these deficiencies here.

Overview of the solution

We use Git for a source control. C++ is the main language of the project. We use Python for the framework that drives code testing. Additionally, our commit history is as linear as possible inside our repository.

This is a quick overview of the solution we implemented:

1.- Git pull 3D engine code
2.- Checkout commit with hash X
3.- Git submodule update
4.- Get the diff text between commit X~1 and X
5.- Parse the diff and extract all the added lines in commit X
6.- For every line from the step 5:
6.1.- Decide which kind of mutation we will attempt.

// for a normal line we select:  // —  (a commented line)

// for an if (…) line we select if (true) {  and if (false) {

// for a for (…;…;…) we select for (…; false; …) {

// for a for (… : …) (explained below)

7.- For every line and its substitution line/lines:

7.1.- Find the the original line in the code

7.2.- Substitute the original line with the substitution line

7.3.- Build Northstar

7.3.1.- If the build fails, we return because this line was necessary.

7.4.- Build Run Northstar tests

7.4.1.- If at least one of the tests fail, we return because this line was necessary.

7.4.2.- If the tests hung, we treat it as if the tests failed. (see later)

7.5.- If both 7.3 and 7.4 finished with OK then the original line was not covered by any test

8.- Compute % between all the lines tested and all the lines not covered.

Details of the solution and some tips

Many parts of the algorithm above can be tricky:

1.- Git diff (step 4)

What we want is to know is the exact diff between commit X and the commit before it. With this in mind, it is important to have a commit history as linear as possible. Using git rebase often may do the job (we default git pull to use the -r option). In case we don’t have a linear history, then the operation ‘git diff X~1 X’ will give us inaccurate code to test. This can happen when, for example, we are asking for a diff between a normal commit and a merge commit.

2.- Line substitution (step 6.1)

Here comes one of the trickiest part in this project: trying to parse C++, understanding it, and replacing the code by one specially crafted for the occasion.

The most common case is replacing a normal C++ line of code with a blank line (or a comment line ‘//—’ in case we are debugging). The problem with this approximation is that it does not work everywhere. For example, what happens if the statement line is split across multiple lines? In our solution, we use a combination of regex helpers and more complex C++ parsers.

With conditionals, what we want to do is force them to be active or inactive, this is why when we find an IF statement, we replace them with two possible lines: ‘if (true) {‘ and ‘if (false) {‘

This two scenarios tell us if there is any ‘test’ exercising the opposite of the line we are mutating.  For example, if we substitute ‘if (true) {‘ and after compiling the project and running the tests everything is reported as OK, then we can be sure that no ‘test’ covers the feature that is represented in code by this IF statement.

In the same way, when we find a FOR statement, we replace it depending on the type:

A normal FOR statement will be replaced by:

     for(;false;) {

A C++11 FOR statement with the form for(<variable> : <container>) will be replaced by:

for(<variable> = *(<container>.begin()); false; ) {

These are some methods of replacing IF and FOR statements to force them to be disabled or not. Again, keep in mind that parsing C++ is very difficult, and this code can be improved in many different ways.

A special point about FOR statements: There is the possibility that we will replace a line of code inside the body that will make it loop forever. This is why we have a timeout while running the tests, because they may hang forever. If this is the case, wes hould treat this scenarios as if the tests failed when run.

3.- Removing lines (inside step 6.1)

There are many lines we can ignore testing; this is because we know that the code will fail to compile if we remove them. For example, rules like:

  • Line only contains ‘{‘ or ‘}’
  • Line starts with a common type: int, float, double, char, etc.
  • Line starts with ‘class’ and ends with ‘{‘
  • Line starts with #include

As we can see, we can improve the time of testing a given commit by not testing those lines that will make the compilation to fail or their value is not interesting for us, like ‘#include’.

4.- Parallelization

Due to the nature of these code testing, the process is highly parallelizable. For a given project and a given commit of that project, given N lines we want to test, each of those lines can be tested on a separate machine.

Examples

1.- Simple function

Fig 1: Function onDetach examined

BlogFig1
If we look at Fig 1, we can see some interesting output:

Line 107: If we mutate that line to a comment line, any test will fail.
Line 109: Not tested IF statement.
Line 115: Not tested FOR statement.
Line 117: Not tested IF statement.
Line 118: Line mutated to a comment, any test fail.
Line 121: Line mutated to a comment, any test fail.
Line 124: Line mutated to a comment, any test fail.

The main idea behind this output about onDetach() function is that we are lacking tests for the behaviour of this function. We are not testing the reset of the mesh component. We are neither testing what happens if bonesAsGameObjects is empty. Also we are not testing that when calling onDetach(), we reassign GameObject children.

In the images below, we can see how the algorithm changes the original code (Fig 2) with some mutations (Fig 3 and 4).

Fig 2: Original file.

BlogFig2

In Fig 3 we can see how line 107 mutated to a commented line. After this change, the source code is compiled without any errors and all the tests finish without any problem. Because of this, line 107 is detected as a line that needs to be tested, and is marked on Fig 1.

Fig 3: Line 107 mutated to a comment.

BlogFig3

meshComponent_ is a weak pointer to a mesh component, and when we detach the skeleton component, we want to reset this weak pointer.

By not doing it, we will end up with a skeleton component already detached from a Game Object that still can have access to a mesh component. The meaning of this line is to prevent detached components to have access to previous neighbour components. If we don’t test that line and someone remove it, we can have weird scenarios where detached components still affects current components.

In Fig 4 we can see how line 109 mutated to a if (false) {. After this change, the source code is compiled without any errors and all the tests finish without any problem. Line 109 mutation denotes there is no test exercising that if bonesAsGameObjects_ is empty, we remove the first child. Because of this it is marked with red on Fig 1.

Fig 4: Line 109 mutated to if (false) {

BlogFig4

2.- False Negatives

While testing with this tool, we can have false negatives scenarios, and here is an example:

Fig 5: False negative

BlogFig5

In figure 5,  we can see a DomainBase object that appears to be completely tested. The problem here is that we don’t have any test for that class. What happens is that because DomainBase is used by other classes on the project, if we mutate any line, other tests will fail because of the mutation.

This does not means that DomainBase is robust, it only means that the code that is using DomainBase is not exercising any exploit on it.

3.- False positive

Fig 6: False positive

BlogFig6

Normally we can find false positives iin these lines of code that tries to optimize memory for some algorithm. A good example is reserve() from a std::vector. These lines can be removed, and build and testing processes will not become affected. This is because reserve() makes memory access more performant but in any case it change the behavior of the algorithm. To prevent these kind of false positives we can add filters to avoid any mutation in these lines.

Conclusion

The most important idea of this project is the ability to spot areas in the source code that need mutation testing. This project has already found many bugs in our source code, including dead code and code that was forgotten and not compiled.

The way we found that some code was not compiled at all was because our tool was marking the constructor declaration line of one class as a discardable one. Here we can see the affected class:

Fig 7: Not compiled code.

BlogFig7

If we think about it, how can this class be compiled if we mutate the line 11 with an empty line. After some investigation we discovered that this class was removed from the build process script.

We also noticed that in large code bases, some files are tested not by its own tests but by the code that it is exercising it. This means that the code is safely tested only while it remains in the actual project and with no major changes. A good move to solve this problem is to not compile the whole project for testing a line but instead to compile only the tests associated with the file we are modifying.

Something that still requires work is the speed of getting the test results. Currently it takes minutes to get feedback about one given commit, and this disrupts the development iteration cycle. This can be fixed by,for example, parallelizing the process with more build machines.

We find mutation testing to be highly valuable, and I hope this idea can help other engineers deliver solid, well-tested code.

The case of the “Page can’t be displayed” intermittent selenium test

By Eric Hohenstein

 

IMVU relies heavily on unit testing with dependency injection for most of the testing of its website code written in PHP. Unit testing is great because it can be crafted in such a way that it will (hopefully) only test the things you’re interested in testing to verify the behavior of a particular component. This ideally makes unit tests both fast and reliable.

However, it’s possible to write unit that all pass, but that fail to catch an incompatibility between components that causes your application to break horribly. Because of this, IMVU uses a testing tool called selenium for end-to-end acceptance testing. Selenium works by loading a web page or pages in an actual browser window and it performs a set of configurable actions on that web page and verifies that the page behaves as expected. For example, if you have a web page with a button with the text “Hide” next to a widget and you expect that when the button is clicked, the widget is hidden, you can write a selenium test that will load the page in question and click the button and confirm that the widget is hidden.
For at least the past several months, IMVU’s selenium test infrastructure has had a problem that causes otherwise reliable selenium tests to fail intermittently when run in Internet Explorer, showing an error page with the text “This page can’t be displayed”. This has been a constant source of frustration and a drag on engineering efficiency. A great deal of time has been spent by engineers on multiple teams investigating the cause. Up until now, the cause has remained a mystery.

 

Ericblog1
 

DNS failure
At first, the problem seemed to be related to a DNS lookup failure. The reason for this is that inspecting the frame where the web page was supposed to have loaded reveals that the URL of the error page is res://ieframe.dll/dnserrordiagoff.htm*. The path of that URL (dnserrordiagoff.htm) suggests a DNS lookup failure. Some experimentation shows that DNS errors are, indeed, displayed in this way in Internet Explorer. However, a lot of other network errors will cause the same error page to be displayed so this was not a valid conclusion and this information did little to uncover the source of the problem.
In order to eliminate our corporate DNS service as a possible source of the problem, we tried hard-coding the address of the remote web server host in the hosts file of the Windows host running the selenium test. We initially thought that this had resolved the problem but we soon found that the problem continued.
* It turns out that the “diagoff” part of the path of the error page disables the presentation of a “Fix connection problems” button on the error page. Internet Explorer will show the “diagoff” version of the error page inside an iframe and the alternate dnserror.htm version with the button when shown in the top most frame.

 

SSL failure
It had been observed that this problem occurred only when the selenium test was trying to load a web page using the HTTPS scheme. This was a useful observation but it didn’t give us much insight into what the problem was early on.

 

Network reliability
We tried capturing a network trace of the problem occurring several times. One of these time, the capture showed a “Bad record MAC” SSL fatal alert message during an SSL handshake. This made us wonder if the the problem was being caused by an unreliable network between the Windows host where the selenium test was running and the host where the HTTPS page was being served from. Checksums are supposed to catch packet transmission errors but if the errors are happening regularly enough, some garbled packets might slip through with matching checksums. We inspected our network traces for signs of packets with bad checksums but didn’t find any other evidence of that. We inspected several network traces and found that while they all seemed to fail during the SSL handshake, only one failed with a “Bad record MAC” SSL fatal alert message. We decided that this was likely a red herring.

 

Error duplication
In order to narrow down the source of the problem, we wrote a python script to try and duplicate it. The python script would, 4 times per second, do a DNS lookup of the domain name that the HTTPS URL was using when the problem occurred, and then make an HTTP followed by an HTTPS request to that domain. The script would report when any of these operations took more than a small threshold to complete. This script was run in parallel to the selenium test in the hopes that it would reveal some kind of temporary network lag between the selenium test host and the web server host. The idea was that if this test showed a problem that coincided with the selenium failure then it would indicate that the problem was between the Windows TCP/IP stack and the remote web server. If it didn’t then the problem had to be at a higher level. The results of this experiment showed no failure in the script when the selenium failure occurred so the problem had to be in Internet Explorer or something it relies upon above the low level Windows networking layer.

 

Lots of moving parts
When we first noticed the problem, we had recently upgraded to a newer version of selenium so that we could run our selenium tests on a newer version of Internet Explorer so that we could run our selenium tests on Windows 7 rather than Windows XP. All of these things changed at approximately the same time and we knew that the problem could potentially be related to any of them.

 

Windows Defender
In Windows 7, a feature called “Windows Defender” will periodically scan the system for potential problems. We noticed that there was an alert showing on every one of our Windows 7 selenium test hosts warning about changes we had made to the hosts file to facilitate our testing. We thought that Windows Defender might be interfering with Internet Explorer in some way that could be causing the problem. Disabling Windows Defender got rid of the alert, but it the problem continued.

 

Background tasks
The intermittent nature of the problem seemed to suggest that it might be happening when the test coincided with a background task in Windows 7 that wasn’t happening in Windows XP or that had at least not been causing the same type of problem. One type of background task that we thought might be related was periodically checking for certificate revocation, since we had identified that the problem only affected HTTPS pages. We tried disabling all Windows scheduled tasks, but the problem continued.

 

Internet Explorer settings
We wondered if some Internet Explorer setting in the new version of Internet Explorer we were using was causing the problem. We tried setting all Internet Explorer settings to the lowest security, lowest privacy settings, disabling extensions, disabling HTTP 1.1, disabling checking SSL certificate revocation, disabling “Do Not Track,” but the problem continued.

 

Internet Explorer cache/cookies
We had noticed in the past that leftover state in Internet Explorer from previous tests could sometimes cause intermittent selenium test failures. We had some infrastructure in place that would attempt to delete cookies and cache between selenium tests to avoid this problem. We noticed that the file system and registry paths used to find and eliminate leftover Internet Explorer state had been changed since we upgrade Internet Explorer and Windows versions but our infrastructure hadn’t been updated. We fixed this, but the problem continued.

 

Running out of ideas
At this point, it seemed that IMVU might have to either abandon selenium testing or it might have to abandon using Internet Explorer for selenium testing. Both solutions would be very expensive and would also be a very frustrating failure. That’s when an AHA moment happened. We captured some more network traces when the problem occurred and noticed all of them (by luck, as it turns out) contained “Bad record MAC” SSL fatal alerts. This information combined with the fact that the problem had to be above the low level Windows networking layer really narrowed the problem down quite a bit. Furthermore, all the “Bad record MAC” alerts were being sent by the server, which indicated that the server was rejecting the MAC of the SSL handshake calculated and sent by the Internet Explorer. The conversations all looked like this in Wireshark (in this trace, the address of the server is 172.16.100.16):

 

Ericblog2
 

Note that there are warning alerts issued by the server about “Unrecognized Name”. This is because the client is sending the SNI TLS extension record and the server is mis-configured in our testing environment. We tried configuring it correctly to eliminate this warning, but the problem continued.
In order to test the hypothesis that a bug in the SSL implementation used by Internet Explorer might be miscalculating the handshake MAC, we tried disabling various SSL/TLS versions. Finally we found a solution! It turned out that disabling TLS 1.2 in Internet Explorer settings resolved the problem. It seemed like we’d narrowed down the problem all the way to the root cause.

 

Internet Explorer or openssl
But was Internet Explorer really to blame? What we knew from the network trace was that the server’s implementation of TLS 1.2 was rejecting the handshake MAC that it was being given by Internet Explorer. However, this isn’t enough to conclude that the MAC calculated by Internet Explorer was incorrect. It was equally possible that Internet Explorer was calculating the MAC correctly but that the server’s implementation of TLS 1.2 was incorrectly rejecting it. Our server is Apache 2.2.25 with mod_ssl. The mod_ssl Apache module is implemented in terms of openssl, for which we use version 1.0.1. It was possible that what we were witnessing was a bug in openssl. If so, this problem might have affected clients other than just Internet Explorer. If so, rather than just ignoring the problem and disabling TLS 1.2 in Internet Explorer settings on our selenium test hosts, it would be better to fix or reconfigure openssl in our servers. But with only one client (Internet Explorer) and one server (Apache/openssl) demonstrating the problem, we couldn’t know which one had the bug. Since openssl is open source, we could theoretically find the problem if that was where the problem was, given enough time, but it would take a very long time to prove that the problem wasn’t in openssl just by code inspection. Not having Internet Explorer source, we couldn’t do the same there. If we could duplicate the problem between Internet Explorer and a separate SSL/TLS server implementation then that would point very strongly to the problem being in Internet Explorer and not in openssl.
 

Analysis using Wireshark
Hoping to get a tool like Wireshark to validate the MAC sent by Internet Explorer so that we could narrow the problem to either Internet Explorer or openssl, we configured Wireshark with the private key used by our server. However, this didn’t help because it turned out that the key exchange algorithm negotiated by Internet Explorer and Apache was using Diffie-Hellman. Here’s the relevant portion of the Wireshark SSL debug output file for the “Client Key Exchange” message when the error occurred:
ssl_generate_pre_master_secret: found SSL_HND_CLIENT_KEY_EXCHG, state 17
ssl_decrypt_pre_master_secret session uses DH (17) key exchange, which is impossible to decrypt
ssl_generate_pre_master_secret: can’t decrypt pre master secret
The key material used by the client and server to exchange secrets can’t be obtained using the private key of the server because the exchange uses an ephemeral key pair generated by the server on-the-fly and signed using its private key. This strategy provides a very powerful security feature known as “Perfect Forward Secrecy“. It also means that even if Wireshark has the private key of the server, it still can’t decode the SSL conversation, including validating the handshake MAC.

 

Ericblog3

 

We then configured the server such that it and Internet Explorer would negotiate the exact same cipher suite but using just RSA rather than DHE-RSA for key exchange. We were successful in doing this, but we weren’t able to isolate the problem to Internet Explorer or openssl using this approach because it eliminated the problem. If the problem was in IE, it seems that it had to be related to it use of Diffie-Hellman key exchange.
 

New error duplication
In order to more easily and quickly duplicate the problem, we wrote a test program that uses the same networking Windows API that Internet Explorer uses. The library that provides this API on Windows is called WinInet. Our program did the same thing as our python script did, however it failed to duplicate the problem. A network trace while the program was running showed that it only ever opened one SSL connection to the server and made all requests using that single connection. This was worked around by making the test program only make one HTTPS request but we called it in a loop from the Windows command line. This duplicated the problem. Here’s the program:
#include <windows.h>
#include <wininet.h>
#include <stdio.h>
#include <tchar.h>
 
void ReadUrl(LPCTSTR url) {
    HINTERNET hInternet = NULL;
    HINTERNET hRequest = NULL;
    char buffer[1024];
    DWORD bytesRead = 0L;
    BOOL result = FALSE;
 
    hInternet = InternetOpen(_T(“Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko”), INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
    if (NULL == hInternet) {
        _tprintf(_T(“InternetOpen error: %08X\n”), GetLastError());
        return;
    }
    
    hRequest = InternetOpenUrl(hInternet, url, NULL, 0, INTERNET_FLAG_RELOAD | INTERNET_FLAG_SECURE, (DWORD_PTR)NULL);
    if (NULL == hRequest) {
        _tprintf(_T(“InternetOpenUrl error: %08X\n”), GetLastError());
        InternetCloseHandle(hInternet);
        return;
    }
 
    do {
        bytesRead = 0L;
        result = InternetReadFile(hRequest, buffer, (DWORD)sizeof(buffer), &bytesRead);
        if (result && (bytesRead == 0L)) {
            break;
        }
    } while (result);
 
    if (!result) {
        DWORD error = GetLastError();
        TCHAR errorText[1024];
        DWORD length = (DWORD)(sizeof(errorText) / sizeof(TCHAR));
        result = InternetGetLastResponseInfo(&error, errorText, &length);
        if (!result) {
            _tcscpy(errorText, “unknown”);
            length = 7L;
        }
        _tprintf(_T(“error reading request: (%08X) %.*s\n”), error, length, errorText);
    }
 
    InternetCloseHandle(hInternet);
    InternetCloseHandle(hRequest);
}
 
int _tmain(int argc, LPCTSTR * argv) {
    if (argc != 2) {
        _tprintf(_T(“%s usage: %s <URL>\r\n”), argv[0], argv[0]);
        return 1;
    }
 
    ReadUrl(argv[1]);
 
    return 0;
}
When run, in just a few seconds, it produces output similar to the following:
C:\imvu\temp>for /l %i in (1, 1, 1000) do @wininet_test.exe https://localhost.imvu.com/dummy.txt
InternetOpenUrl error: 00000008
InternetOpenUrl error: 00002F7D
InternetOpenUrl error: 00002F7D
^C
Out of 1000 requests, roughly 10-12 will fail. About half of them will fail with error code 8 and the other half will fail with error code 0x2F7D. Error code 8 corresponds to ERROR_NOT_ENOUGH_MEMORY. Error code 0x2F7D corresponds to ERROR_INTERNET_SECURITY_CHANNEL_ERROR. The description of this error is “The application experienced an internal error loading the SSL libraries.” I suspect this description is misleading and is actually a generic error used for SSL handshake errors and possibly other kinds of failures as well. It seems to be associated with the schannel.dll Windows system library.
A network trace capturing the SSL conversations when these errors occur shows that the failures with error code 8 manifest as the client closing the connection with a [FIN, ACK] packet after receiving the “Server Key Exchange” message without ever sending a “Client Key Exchange” message. The server responds by sending an “Unexpected Message” fatal alert message. The failures with error code 0x2F7D manifest as the server sending a “Bad record MAC” fatal alert after the client sends the final SSL handshake message containing a “Change Cipher Spec” message and terminated with the handshake MAC.
 

Duplication with gnutls-serv
In order to finally isolate the problem to either Internet Explorer or openssl, we used a tool called gnutls-serv. Gnutls is an SSL/TLS library from GNU. It comes with a tool called gnutls-serv that will act as a simple HTTPS server. After downloading and compiling the latest version (gnutls-3.3.9), we started it with this command line (after killing Apache) on our web server host:
#LD_LIBRARY_PATH=/usr/local/lib/ src/gnutls-serv -p 443 -g –http -a –x509certfile /etc/apache2/certs/ssl.crt/sandbox.crt –x509keyfile /etc/apache2/certs/ssl.key/sandbox.key –priority “SECURE:-ECDHE-RSA” -d 9999
The –priority “SECURE:-ECDHE-RSA” option is given in order to force the cipher suite negotiated between WinInet and gnutls-serv to be the exact one negotiated between Internet Explorer and Apache. It was initially choosing the Eliptic Curve Diffie-Hellman key exchange algorithm. It turns out that the problem is not reproducible when using Eliptic Curve Diffie-Hellman.
Running the WinInet test program against this server, the problem continued to be observed with the same frequency as when using Apache/openssl. The network trace of these failures look more-or-less identical to the ones generated with Apache/openssl:
Ericblog4
The only significant difference is the missing server name warning. gnutls-serv apparently ignores the SNI TLS extension.
Inspecting the relevant debugging output of gnutls-serv corresponding to the failure shows this:
|<3>| ASSERT: gnutls_cipher.c:728
|<3>| ASSERT: gnutls_cipher.c:167
|<3>| ASSERT: gnutls_record.c:1239
|<0x22ae770>| Discarded message[0] due to invalid decryption
|<3>| ASSERT: gnutls_buffers.c:1358
|<3>| ASSERT: gnutls_handshake.c:1428
|<3>| ASSERT: gnutls_handshake.c:785
|<3>| ASSERT: gnutls_handshake.c:3054
|<3>| ASSERT: gnutls_handshake.c:3200
Error in handshake
|<5>| REC: Sending Alert[2|20] – Bad record MAC
The original failure is reported on line 728 of gnutls_cipher.c which is this bit of code:
    if (unlikely
        (memcmp(tag, tag_ptr, tag_size) != 0 || pad_failed != 0)) {
        /* HMAC was not the same. */
        dummy_wait(params, compressed, pad_failed, pad,
               length + preamble_size);
 
        return gnutls_assert_val(GNUTLS_E_DECRYPTION_FAILED);
    }
In this code, the unlikely() function is just a compiler hint and can be replaced with the condition passed to it.
This shows that the problem could be either the MAC is bad or the padding calculation failed. To narrow down the problem even further we modified the code to be this:
    if (unlikely
        (memcmp(tag, tag_ptr, tag_size) != 0 || pad_failed != 0)) {
        if (memcmp(tag, tag_ptr, tag_size) != 0) {
            char buf[1024];
            _gnutls_assert_log(“ASSERT: %s:%d – mismatching MAC\n”, __FILE__, __LINE__);
            _gnutls_assert_log(“ASSERT: expected MAC: %s\n”,
                _gnutls_bin2hex(tag, tag_size, buf, sizeof(buf), NULL));
            _gnutls_assert_log(“ASSERT: MAC received: %s\n”,
                _gnutls_bin2hex(tag_ptr, tag_size, buf, sizeof(buf), NULL));
        }
        if (pad_failed != 0) {
            gnutls_assert();
        }
        /* HMAC was not the same. */
        dummy_wait(params, compressed, pad_failed, pad,
               length + preamble_size);
 
        return gnutls_assert_val(GNUTLS_E_DECRYPTION_FAILED);
    }
With this change, we get this debug output:
|<3>| ASSERT: gnutls_cipher.c:726 – mismatching MAC
|<3>| ASSERT: expected MAC: ab628ad135b519d2f71686e78d4d84fc
|<3>| ASSERT: MAC received: b053a9400b47c8ed3796ddb317b6a957
|<3>| ASSERT: gnutls_cipher.c:739
|<3>| ASSERT: gnutls_cipher.c:167
|<3>| ASSERT: gnutls_record.c:1239
|<0x156c770>| Discarded message[0] due to invalid decryption
|<3>| ASSERT: gnutls_buffers.c:1358
|<3>| ASSERT: gnutls_handshake.c:1428
|<3>| ASSERT: gnutls_handshake.c:785
|<3>| ASSERT: gnutls_handshake.c:3054
|<3>| ASSERT: gnutls_handshake.c:3200
Error in handshake
|<5>| REC: Sending Alert[2|20] – Bad record MAC
This shows that the actual failure is a mismatch between the MAC calculated over the handshake by the client and server and that there was not a problem detected with the padding.
 

Conclusion
The fact that the problem occurs with both the openssl and gnutls server implementations indicates that this is almost certainly a bug in the WinInet SSL handshake MAC calculation. The fact that the problem only occurs when using DHE-RSA key exchange algorithms indicates that the failure is caused by a bug in the WinInet Diffie-Hellman MAC calculation or else the derivation of the client MAC key when using Diffie-Hellman key exchange.
This analysis has focused only on the “Bad record MAC” failures. There were also another set of failures with roughly the same frequency that didn’t involve “Bad record MAC” fatal alert messages. The fact that these failures occurred after the server had sent its final handshake message suggests that this could actually be another symptom of the same problem. The same or equivalent bug could cause the client to miscalculate the expected server’s handshake MAC. If the client is rejecting the server’s handshake MAC but not sending “Bad record MAC” alerts when this happens then the observed behavior is exactly what we’d expect to see. I have no explanation as to why the error code 8 (ERROR_NOT_ENOUGH_MEMORY) would be emitted in this condition.
The bottom line is that the implementation of Diffie-Hellman key exchange used by Internet Explorer 11 on Windows 7 is broken. We could disable Diffie-Hellman key exchange in our Apache server configuration. The frequency of this problem and the market share of Internet Explorer likely don’t warrant that course of action since “Perfect Forward Secrecy” is rather valuable. Instead, IMVU will be disabling TLS 1.2 in Internet Explorer configuration on selenium test hosts.

Code Reviews: Follow the Data

After years of reviewing other people’s code, I’d like to share a couple practices that improve the effectiveness of code reviews.

Why Review Code?

First, why review code at all? There are a few reasons:

  • Catching bugs
  • Enforce stylistic consistency with the rest of the code
  • Finding opportunities to share code across systems
  • Helping new engineers spin up with the team and project
  • Helping API authors see actual problems that API consumers face
  • Maintain the health of the system overall

It seems most people reach for one of two standard techniques when reviewing code; they either review diffs as they arrive or they review every change in one large chunk. We’ve used both techniques, but I find there’s a more effective way to spend code review time.

Reviewing Diffs

It seems the default code review technique for most people is to sign up for commit emails or proposed patches and read every diff as it goes by. This has some benefits – it’s nice to see when someone is working in an area or that a library had a deficiency needing correction. However, on a large application, diffs become blinding. When all you see is streams of diffs, you can lose sight of how the application’s architecture is evolving.

I’ve ended up in situations where every diff looked perfectly reasonable but, when examining the application at a higher level, key system invariants had been broken.

In addition, diffs tends to emphasize small, less important details over more important integration and design risks. I’ve noticed that, when people only review diffs, they tend to point out things like whitespace style, how iteration over arrays is expressed, and the names of local variables. In some codebases these are important, but higher-level structure is often much more important over the life of the code.

Reviewing diffs can also result in wasted work. Perhaps someone is iterating towards a solution. The code reviewer may waste time reviewing code that its author is intending to rework anyway.

Reviewing Everything

Less often, I’ve seen another code review approach similar to reviewing diffs, but on entire bodies of work at a time. This approach can work, but it’s often mindnumbing. See, there are two types of code: core, foundational code and leaf code. Foundational code sits beneath and between everything else in the application. It’s important that it be correct, extensible, and maintainable. Leaf code is the specific functionality a feature needs. It is likely to be used in a single place and may never be touched again. Leaf code is not that interesting, so most of your code review energy should go towards the core pieces. Reviewing all the code in a project or user story mixes the leaf code in with the foundational code and makes it harder see exactly what’s going on.

I think there’s a better way to run code reviews. It’s not as boring, tends to catch important changes to core systems, and is fairly efficient in terms of time spent.

Follow the Data

My preferred technique for reviewing code is to trace data as it flows through the system. This should be done after a meaningful, but not TOO large, body of work. You want about as much code as you can review in an hour: perhaps more than a user story, but less than an entire feature. Start with a single piece of data, perhaps some text entered on a website form. Then, trace that data all the way through the system to the output. This includes any network protocols, transformation functions, text encoding, decoding, storage in databases, caching, and escaping.

Following data through the code makes its high-level structure apparent. After all, code only exists to transform data. You may even notice scenarios where two transformations can be folded into one. Or perhaps eliminated entirely — sometimes abstraction adds no value at all.

This style of code review frequently covers code that wasn’t written by the person or team that initiated the code review. But that’s okay! It helps people get a bigger picture, and if the goal is to maintain overall system health, new code and existing shouldn’t be treated differently.

It’s also perfectly fine for the code review not to cover every new function. You’ll likely hit most of them while tracing the data (otherwise the functions would be dead code) but it’s better to emphasize the main flows. Once the code’s high-level structure is apparent, it’s usually clear which functions are more important than others.

After experimenting with various code review techniques, this approach has been the most effective and reliable over time. Make sure code reviews are somewhat frequent, however. After completion of every “project” or “story” or “module” or whatever, sit down for an hour with the code’s authors and appropriate tech leads and review the code. If the code review takes longer than an hour, people become too fatigued to add value.

Handling Code Review Follow-Up Tasks

At IMVU in particular, as we’re reviewing the code, someone will rapidly take notes into a shared document. The purpose of the review meeting is to review the code, not discuss the appropriate follow-up actions. It’s important not to interrupt the flow of the review meeting with a drawn-out discussion about what to do about one particular issue.

After the meeting, the team leads should categorize follow-up tasks into one of three categories:

  1. Do it right now. Usually tiny tweaks, for example: rename a function, call a different API, delete some commented-out code, move a function to a different file.
  2. Do it at the top of the next iteration. This is for small or medium-sized tasks that are worth doing. Examples: fix a bug, rework an API a bit, change an important but not-yet-ubiquitous file format.
  3. Would be nice someday. Delete these tasks. Don’t put them in a backlog. Maybe mention them to a research or infrastructure team. Example: “It would be great if our job scheduling system could specify dependencies declaratively.”

Nothing should float around on an amorphous backlog. If they are important, they’ll come up again. Plus, it’s very tempting to say “We’ll get to it” but you never will, and even if you have time, nobody will have context. So either get it done right away or be honest with yourself and consciously drop it.

Now go and review some code! :)

The Real-time Web in REST Services at IMVU

By Jon Watte, VP Technology @ IMVU

IMVU has built a rich, graph-shaped REST (REpresentational State Transfer) API (Application Programming Interface) to our data. This data includes a full social network, as well as e-commerce, virtual currencies, and the biggest 3D user generated content catalog in the world. This post discusses how IMVU addresses two of the bigger draw-backs of REST-based service architectures for real-time interactive content: Cache Invalidation (where users want to know about new data as soon as it becomes available,) and Request Chattiness (where request latency kills your performance.)

Cache Invalidation

REST principles like cacheability and hypertext-based documents work great for exposing data to a variety of clients (desktop, web, and mobile,) but runs into trouble when it meets the expectation of real-time interaction. For example, when a user changes their Motto in their profile, they would like for the world to see the new Motto right away — yet, much of the scalability wins of REST principles rely on caching, and the web does not have a good invalidation model. Here is an illustration of the problem:

At 10:03 am, Bob logs in and the client application fetches the profile information about his friend Alice. This data potentially gets cached at several different layers:

  • in application caches at the server end, such as Varnish or Squid
  • in content delivery network caches at the network edge, such as Akamai or Cloudfront
  • in the user’s own browser (every web browser has a local cache)

Let’s say that our service marks the data as cacheable for one hour.

At 10:04 am, Alice updates her Motto to say “there’s no business like show business.”

At 10:05 am, Alice sends a message to Bob asking “how do you like my motto?”

At 10:06 am, Bob looks at Alice’s profile, but because he has a stale version in cache, he sees the old version. Confusion (and, if this were a TV show, hilarity) ensues.

HTTP provides two options to solve this problem. One is to not do caching at all, which certainly “solves” the problem, but then also removes all benefits of a caching architecture. The other is to add the “must-validate” option to the cache control headers of the delivered data. This tells any client that, while it might want to store the data locally, it has to check back with the server to see whether the data has changed or not before re-using it. In the case that data has not changed, this saves on bytes transferred — the data doesn’t need to be sent twice — but it still requires the client to make a request to the server before presenting the data.

In modern web architectures, while saving throughput is nice, the real application performance killer is latency, which means that even the “zero bytes” response of checking in with the server introduces an unacceptable cost in end-user responsiveness. Cache validation and/or E-tags might sound like a big win, but for a piece of data like a “motto,” the overhead in HTTP headers (several kilobytes) dwarfs the savings of 30 bytes of payload — the client might as well just re-get the resource for approximately the same cost.

Another option that’s used in some public APIs is to version all the data, and when data is updated, update the version, which means that the data now has a new URL. A client asking for the latest version of the data would then not get a cached version. Because of HATEOAS (Hypertext As The Engine Of Application State) we would be able to discover the new URL for “Alice’s Profile Information,” and thus read the updated data. Unfortunately, there is no good way to discover that the new version is there — the client running on Bob’s machine would have to walk the tree of data from the start to get back to Alice’s new profile link, which is even more round-trip requests and makes the latency even worse.

A third option is to use REST transfer for the bulk data, but use some other, out-of-band (from the point of view of the HTTP protocol) mechanism to send changes to interested clients. Examples of this approach include the Meteor web framework, and the MQTT based push approach taken by Facebook Mobile Messenger. Meteor doesn’t really scale past a few hundred online users, and has an up-to-10-seconds-delay once it’s put across multiple hosts. Even with multiple hosts and “oplog tailing,” it ends up using a lot of CPU on each server, which means that a large write volume ends up with unacceptably low performance, and a scalability ceiling determined by overall write load, that doesn’t shard. At any time, IMVU has hundreds of thousands of concurrent users, which is a volume Meteor doesn’t support.

As for the MQTT-based mobile data push, Facebook isn’t currently making their solution available on the open market, and hadn’t even begun talking about it when we started our own work. Small components of that solution (such as MQTT middleware) are available for clients that can use direct TCP connections, and could be a building block for a solution to the problem.

The good news is that we at IMVU already have a highly scalable, multi-cast architecture, in the form of IMQ (the IMVU Message Queue.) This queue allows us to send lightweight messages to all connected users in real-time (typical latencies are less than 10 milliseconds plus one-way network delay.) Thus, if we can know what kinds of things that a user is currently interested in seeing, and we can know whether those things change, we can let the user know that the data changed and needs to be re-fetched.

Jonblog1

 

The initial version of IMQ used Google Protocol Buffers on top of a persistent TCP connection for communications. This works great for desktop applications, and may work for some mobile applications as long as the device is persistently connected, but it does not work well for web browsers with no raw TCP connection ability, or intermittently connected mobile devices. To solve for these use cases, we added the ability to connect to IMQ using the websockets protocol, and additionally to fall back to an occasionally polled mail-drop pick-up model over HTTP for the worst-case connectivity situations. Note that this is still much more efficient than polling individual services for updated data — IMQ will buffer all the things that received change notifications across our service stack, and deliver them in a single HTTP response back to the client, when the client manages to make a HTTP request.

To make sure that the data for an endpoint is not stale when it is re-fetched by the client, we then mark the output of real-time updated REST services as non-cacheable by the intermediate caching layers. We have to do this, because we cannot tell the intermediate actors (especially, the browser cache) about the cache invalidation — even though we have JavaScript code running in the browser, and it knows about the invalidation of a particular URL, it cannot tell the browser cache that the data at the end of that URL is now updated.

Instead, we keep a local cache inside the web page. This cache maps URL to JSON payload, and our wrapper on top of XMLHttpRequest will first check this cache, and deliver the data if it’s there. When we receive an invalidation request over IMQ, we mark it stale (although we may still deliver it, for example for offline browsing purposes.)

Request Chattiness (Latency)

Our document-like object model looks like a connected graph with URL links as the edges, and JSON documents as the nodes. When receiving a particular set of data (such as the set of links that comprises my friends list) it is very likely that I will immediately turn around and ask for the data that’s pointed to by those links. If the browser and server both support the SPDY protocol, we could pre-stuff the right answers into the SPDY connection, in anticipation of the client requests. However, not all our clients have this support, and not even popular server-side tools like Nginx or Apache HTTPd support pre-caching, so instead we accomplish the same thing in our REST response envelope.

Instead of responding with just a single JSON document, we respond with a look-up table of URLs to JSON documents, including all the information we believe the client will want, based on the original request. This is entirely optional — the server doesn’t have to add any extra information; the client doesn’t have to pay attention to the extra data; but servers and clients that are in cahoots and pay attention will end up delivering a user experience with more than 30x fewer server round-trips! On internet connections where latency matters more than individual byte counts, this is a huge win. On very narrow-band connections (like 2G cell phones or dial-up modems,) the client can provide a header that tells the server to never send any data more than what’s immediately requested.

Jonblog2

Because the server knows all the data it has sent (including the speculatively pre-loaded, or “denormalized” data,) the server can now make arrangements for the client to receive real-time updates through IMQ when the data backing those documents changes. Thus, when a friend comes online, or when a new catalog item from a creator I’m interested in is released, or when I purchase more credits, the server sends an invalidation message on the appropriate topic through the message queue, and any client that is interested in this topic will receive it, and update its local cache appropriately.

Putting it Together

This, in turn, ties into a reactive UI model. The authority of the data, within the application, lives in the in-process JSON cache, and the IMQ invalidation events are received by this cache. The cache can then know whether any piece of UI is currently displaying this data; if so, it issues a request to the server to fetch it, and once received, it updates the UI. If not, then it can just mark the element as stale, and re-fetch it if it’s later requested by some piece of UI or other application code.

The end-to-end flow is then:

  1. Bob loads Alice’s profile information
  2. Specific elements on the screen are tied to the information such as “name” or “motto”
  3. Bob’s client creates a subscription to updates to Alice’s information
  4. Alice changes her motto
  5. The back-end generates a message saying “Alice’s information changed” to everyone who is subscribed (which includes Bob)
  6. Bob’s client receives the invalidation message
  7. Bob’s client re-requests Alice’s profile information
  8. The underlying data model for Alice’s profile information on Bob’s display page changes
  9. The reactive UI updates the appropriate fields on the screen, so Bob sees the new data

All of these pieces means re-thinking a number of building blocks of the standard web stack, which means more work for our foundational libraries. In return, we get a more reactive web application, where anything you see on the screen is always up to date, and changes respond quickly, both through the user interface, and through the back-end, with minimal per-request overhead.

This might seem complex, but it ends up working really well, and with the proper attention to library design for back- and front-end development, building a reactive application like this is no harder than building an old-style, slow polling (or manually refreshed) application.

It would be great if SPDY (and, future, HTTP2) could support pre-stuffing responses in the real world. It would also be great if the browser DOM had an interface to the local cache, so that the application could tell the browser about a particular URL being invalidated. However, the solution we’ve built up achieves the same benefits, using existing protocols, which goes to show the fantastic flexibility and resilience inherent in the protocols and systems that make up the web!

 

The Case of the Trailing Space

Solve the case of the trailing space with IMVU’s Senior Engineer, Michael Slezak.

In this post, I wanted to discuss a problem we ran into recently dealing with REST authentication in our IMVU client application which ultimately boiled down to small discrepancies between JSON encoders. In this case, I’m going to focus on Python’s JSON library and javascript’s JSON encoder. Before I dive into the problem, I want to provide a quick background on the client application architecture.

Client Architecture

The IMVU client application contains several layers in its architecture ranging from rendering, business logic, and front-end development. We use C++ for rendering and other low level functionality such as windows, call stacks, and interfacing with the business logic. Our business/client logic layer is all in Python. We use Python for communicating to the front-end, pinging our servers, and maintaining advanced chat logic and chat state, among many other things. Finally, we use the Gecko SDK (a.k.a. XULRunner) to handle all of our UI needs. This means we can write our front-end using HTML, Javascript, and CSS. We also have our own library to allow the front-end to call out to Python for things such as user data. Using web technologies for client UI development has allowed us to unify the technology we use for our site such as jQuery, Underscore.js, and even a few in-house libraries, resulting in increased engineering productivity.

With that said, we’ve recently hand rolled our own implementation of Promises as identified by the upcoming ECMAScript 6 proposal in our open-sourced imvu.js library. I decided to drop this implementation into the client for our immediate use. With this change, I was able to also drop in a new REST client to help with chaining our requests in a more synchronous-like fashion (despite the fact that it’s asynchronous). Hooray! Asynchronous programming has just gotten easier for the client! However, that wasn’t the case…

XMLHttpRequest, are you there?

We got our feet wet using this exciting change with a new feature that my team is developing. There was one issue: Not authorized for request.

Uh oh.

Looks like we need an auth token for these requests. Specifically, the POST and DELETE requests. Simple enough. I found that we handle authentication in our Python code within this “securePostRaw” function. Every single request ever made in the client goes through this. To my knowledge, POSTing with XMLHttpRequest has never been used in the client. Ever. Needless to say, this was news to me…

It’s Dangerous to go Alone…

Looking at the securePostRaw function, we seem to take the auth token that the server initially gave us, and hash everything and use that as the new auth token. For example, if there is a request body, we JSON encode it and also utf-8 encode it. We then take the the customer id, the original auth token, the JSON encoded body, and the query parameters (if they exist), and then run a hash function over the whole, concatenated blob

OK… a little odd way of securing a POST request since we are running over HTTPS. But this is legacy code! So, it’s understandable.

I took this hashing function and copied it into another file for our front-end to call directly. I don’t need Python to send off the request, I just want to set the correct auth headers via XMLHttpRequest so that we can use this new REST client. You might be wondering, “Why not just let Python handle it then?” Partly because we have a bigger vision in the near future where we bring in a bigger, hand-rolled front-end library that is now ubiquitous to how we write front-end software at IMVU. To achieve this, I need XHR to work so that it’s easier to just “plop it in”.

Anyways, I finally get the right tokens to put in our headers. We’re on our way! All is right with the world, so we test again and: Not authorized for request

…Take This!

Wait, but I did what you told me. I did all the right things. The old and new client tokens match up! This isn’t fair!!

Luckily, I can run the IMVU client off a local server and dig into what the server is seeing. Through our REST middleware stack on the server, we’re failing authentication! When we attempt to grab the logged in user, it fails to identify us! But the old and new client tokens are the same! I’m in parity! Not so fast…

Here is a dump from the server of when Python makes the request using the securePostRaw function:

slezakblog1

And here is the dump when XHR makes the request:

slezakblog2

sign is what the server comes up and sent is what the client sent (obviously). Why in the hell are they different? The server is playing tricks on us… So, I decide to log the final output of the string the server uses before it runs the hash function on it.

The Case of the Trailing Space

The log looked something like this:

slezakblog3

….There is an extra space right before the "http://" starts!!! , It turns out JSON.stringify doesn’t leave any spaces like this. Since we let Python compute the hash, it also encodes the request body into JSON which means that it’s causing the spacing! Since we do xhr.send(JSON.stringify(body)), we have a mismatch between what the client calculates and what the server calculates because the server technically has a different request body (by one single space!).

Fortunately, the json library in Python has a keyword argument in it’s dump function called separators. So, the code now looks like json.dumps(body, separators=(‘,’, ‘:’)) which gives us a more compact version of the encoding. We are now matching with JSON.stringify.

After this change, we were finally able to come to a solution and it works!

Conclusion

Several lessons learned from this:

  • Hashing things that are dependent on variable data (which is also encoded) can be problematic. Since JSON is flexible in its encoding and allows for spacing, it can throw off the whole hash. Things that are more in our control such as integer values and constant strings are probably better to use.
  • All JSON encoders/decoders aren’t created equal.

If you enjoyed reading this article and are excited about solving problems such as unifying web technologies across multiple platforms, you’re in luck! We’re hiring!

 

 

 

How IMVU Builds Web Services: Part 3

In this 3-part series, IMVU senior engineer Bill Welden describes the means and technology behind IMVU’s web services.

Part 3: Documents and Links

In the previous entry in this series I described how IMVU uses a structured network model to implement the uniform contract for our REST services and showed how they might apply to a set of services for a hypothetical high school scheduling system.

Engblog1102-01

Under this model we have Node Groups (represented by the different colors of circles in this diagram), Nodes (the circles themselves), Edge Groups (the rounded boxes hanging off of the circles) and Edges (the lines connecting the circles).

We model these various notions using HTTP documents linked together with URLs. There are four kinds of documents, corresponding to each of the four concepts.

A Node Group determines the properties and relationships of the Nodes, and the corresponding Node Group Document contains a list of URLs, each locating a Node.

A Node, through its properties and relationships, is the basic repository of information in the network. The Node Group Document contains the values of the Node’s properties, as well as a list of URLs, each locating an Edge Group.

An Edge Group groups together all of the links that go to a particular type of Node. The corresponding  Edge Group Document, specific to a Node, contains a list of URLs, each locating an Edge within that group.

An Edge is a link from its Node to another Node, usually of a different class. Edges can also contain properties, as we saw last time. An Edge Document therefore contains a URL for the Node at the other end of the Edge, as well as the values of any properties of the Edge.

Each type of document has a distinctive URL.

The URL for a Node Group Document consists of a single path segment, a singular noun describing the Node Group:

https://api.imvu.com/course

The URL for a Node consists of two path segments: the Node Group URL extended by a segment consisting of a unique identifier for the Node. The Node identifier can be any string, but we encourage the use of Node Identifiers which contain the name of the Node Group:

https://api.imvu.com/course/course-1035

The URL for an Edge Group consists of three segments: the Node URL extended either by a plural noun naming the Node Group that the Edge Group points off to, or by a noun describing the relationship created by the Edge Group:

https://api.imvu.com/course/course-1035/teachers
https://api.imvu.com/course/course-1035/roster

The URL for an Edge consists of four segments: the Edge Group URL extended by an identifier for the Edge. This identifier need only be unique within the Edge Group. Sometimes (for convenience) it is identical to the unique identifier of the Node that the Edge points to, but this is not required:

https://api.imvu.com/course/course-1035/teachers/1
https://api.imvu.com/course/course-1035/roster/student-5331

These URL formats are a constraint on the service design – a strong suggestion but not a strict requirement.

Note especially that, according to the principle of HATEOAS (Hypertext As The Engine Of Application State), the client cannot depend on any specific format, and is not allowed to construct these URLs or attempt to extract information from them. All URLs required by the client must be obtained whole, as opaque strings, from the response to an earlier service request.

Specifically, URLs returned from the server must be fully qualified, including the protocol and server name (“https://api.imvu.com”). However, in our own internal discussions and here in this presentation, we will often leave off the protocol and server name for clarity. So:

/course
/course/course-1035
/course-1035/teachers

and so forth. Wherever one of these relative URLs appears in the discussion below, the actual implementation will return a fully qualified version.

One of the biggest differences between REST services and the earlier Remote Procedure Call style is that RPC APIs define as many verbs as required by or convenient for implementation of the application. Designing RPC services is primarily about designing new verbs and specifying their parameters and their semantics.

With IMVU Rest services, the verbs are specified, and there are only five of them. Designing a service consists of designing a set of Nodes and Edges, together with their properties and relationships, in such a way that these five verbs can provide all of the required functionality.

The verbs are

GET (any endpoint)
POST (to a Node or Edge),
POST (to a Node Group or an Edge Group),
POST (to a Node Group, including Edges), and
DELETE (a Node or Edge)

GET retrieves the document associated with the URL, which can be a Node Group Document, Node Document, Edge Group Document or an Edge Document. GETs must be nullipotent, which is to say that they can have no side-effects.

POST, when applied to a Node or Edge URL makes changes to the properties and relations of the Node or Edge. Such POSTs must be idempotent, which means that sending to POST twice must have exactly the same effect as sending it once.

When POST is applied to a Node Group or Edge Group, it adds a new Node or a new Edge. Such POSTs will not be idempotent, since sending the POST twice will add two new Nodes or two new Edges.

There is a third sort of POST, a POST specifically to a Node Group which adds a new Node, but which also contains data for a set of new Edges to be added along with the Node (including Edge Groups as necessary).

Finally there is DELETE, which is used to delete a Node or an Edge. The implementation of DELETE must be idempotent.

GET returns the document associated with the URL, but in order to minimize round trips the service is allowed to return any additional documents that it thinks the client may soon need.

You can see this in the format of the JSON document returned by a GET:

{
  "id": "/course/course-1035",
  "status": "success",
  "denormalized":
  {
   "/course/course-1035": { … } 
   "/course/course-1035/teachers": { … }
   "/course/course-1035/teachers/1": { … }
   … etc …
  }
}

There is a status, “success” or “failure”, and if the GET fails, some information about why, but if it succeeds a package of endpoints is included, grouped under the response member “denormalized”.

In this example the client has asked for a Course, and the server has chosen to return not only the Course Node, but the Edge Group and Edges for all of the teachers that teach that course.

Each of the four kinds of documents in these responses has a specific JSON format.

A Node Group Document has a member “nodes” which is an array of URLs, one for each Node.

{
"nodes": [
 "/course/course-1035",  
 "/course/coures-1907", 
 ...
 ]
}

If the client wants to present the list of Nodes in a certain order, it is responsible for sorting them itself. It cannot depend on the order of entries in this array.

A Node Document has two members. The “data” member contains the Node’s properties. The “relations” member contains the URLs that link the Node to its Edge Groups and to other Nodes in the system.

{
 "data": {
 "description": "Algebra I",
 "starting_time": "10:00"
 }
"relations": {
 "teacher": "/teacher/teacher-372",
 "roster": "/course/course-1035/roster"
 }
}

Names in these objects represent a contract with the client. Based on the name, the server guarantees the semantics of the value including its type, the allowed values, the meaning of the value and if it is a link, the Node Group of the Node it points to.

Note that here “teacher” is a link to a Teacher Node. This design precludes the possibility of team teaching where there are two teachers for a class.

Links directly from one Node to another create a one-to-N relationships. One Teacher to many Courses. As a rule, however, relationships are more often N-to-N than not, and such restrictions on cardinality are a red flag. Not wrong, necessarily, but something that may be called out in design reviews.

The design could be made to support many teachers per course by implementing an Edge Group “teachers” for Course Nodes. This is a more flexible design, because the cardinality restrictions, say a maximum of two teachers, or allowing multiple teachers only for certain courses, can be implemented on the server, where they are easier to change.

An Edge Group Document has a member “edges” which is a JSON array of URLs for the Node’s Edges.

{
  "edges": [
   "/course/course-1035/roster/student-5331",
   "/course/course-1035/roster/student-1070",
   ...
  ],
}

Again, the client cannot depend on the order that the Edges come back from the server.

Finally, an Edge Document has an optional “data” member for when the Edge has properties, and a “relations” member which contains the URL for the Node at the other end of the Edge.

{
 "data": {
  "tardies": 1,
 },
 "relations": {
  "ref": "/student/student-5331"
 }
}

Here is an Edge between one Course (course-1035) and one Student (student-513312).

Edges go both ways. Here are the Edges between that same student and his courses.

{
 "edges": [
   "/student/student-53312/schedule/01",
   "/student/student-53312/schedule/02",
   ...
 ],
}

Now it’s not required to implement every Edge Group implied by the Node/Edge model, so it’s acceptable to implement only half of this Edge relationship without its symmetrical partner. When we have an Edge Group in our design, however, we think carefully about the symmetrical Edge Group. It’s often a very interesting view on the data, and it’s seldom very difficult to implement.

Note that the Edge document itself can come back in two different ways, but the underlying database record will be the same. Here is one of the Edge documents linked from the Edge Group above:

{
 "data": {
  "tardies": 1,
 },
 "relations": {
  "ref": "/course/course-1035",
 }
}

Whether you look at this link from the Student or the Course perspective, the count of tardies is the same data element.

Nodes and Edges are updated using the same JSON document format as the response from a GET, though there is no denormalization envelope.

Here is the document for a POST to a Student Node (/student/student-513312) intended to update the student’s birth date and counsellor (documents for POSTs to Edges look pretty much the same):

{
 "data": {
  "birth_date": "5/11/96",
 },
 "relations": {
  "counsellor": "/teacher/teacher-121"
 }
}

If a property is missing from the data or relations sections, it is left unchanged in the Node or Edge.

The response which comes back from a POST to a Node or Edge is the same as the response from a GET to that Node or Edge, including additional denormalized data at the discretion of the server.

New Nodes and Edges are added by posting to the corresponding Node Group or Edge Group  and providing the data and relations for the new Node or Edge in the body of the POST. This is a document POSTed to /course/course-1035/roster to create a new Edge, enrolling a new student in a course:

{
 "relations": {
   "ref": "/student/student-10706"
 }
}

All of the required properties must be included in this kind of POST. A POST creating a new Node looks pretty much the same.

Again, the response to this kind of POST is the same as you would receive from a GET to the newly created Node or Edge (though you will only know the id of the new Node or Edge once the response comes back).

It is often useful to be able to add a Node and a number of Edges in one POST request. The response would include the new Node, a new Edge Group and all of the specified Edges. Here is a POST to /student which adds a new Student along with Enrollment Edges in two courses:

{
"data": {
  "name": "Andrew",
    "birth_date": "7/21/95"
  },
  "edges": {
    "schedule": [ 
      {
        "relations": {
          "ref": "/course/course-1035"
        }
      }, 
      {
        "relations": {
          "ref": "/course/course-2995"
        }
      },
    ]
  }
}

The only other verb is DELETE, which can be applied to a Node or an Edge by providing the URL of the Node or Edge. No document is passed with a DELETE request. DELETEing a Node will also delete all of the Edge Groups and Edges for that Node.

The service implements much of the functionality of an application by implementing business rules which add (within limits) to the semantics of POSTs and DELETEs.

Business rules cannot change the fundamental semantics of these verbs. A successful POST to a Node Group or Edge Group must still add a new Node or Edge. A successful POST to a Node or Edge must still make the specified modifications to the Node or Edge, and a successful DELETE must still result in the specified deletion.

In particular, POST to a Node or Edge must remain idempotent. POSTing twice to a Node or Edge must have exactly the same effect as POSTing once.

Business rules, however, can reject requests which violate a desired constraint. We might want to disallow deletion of students prior to their 21st birthday.

Business rules can also limit operations to specific users or classes of user. Our school system might have an administrator class who are responsible for placing students in classes. An attempt to add a Student Edge to a class would be rejected if the client making the request was not identified as an administrator.

Business rules also have broad authority to make additional changes to the back end data structures based on a POST or DELETE. We could add a property to Student showing the number of classes each student is enrolled in, and then when the client POSTs to the schedule Edge Group to add a new Course, increment this count in the Student Node.

As I mentioned earlier, the client is not allowed to construct URLs, but is required to retrieve them from the server. It is, however, allowed to append query parameters. These must have no effect on the query other than limiting the set of records returned.

Here are some examples.

GET /student?name=piers*
GET /student/student-10706/schedule?tardies=0
GET /student?schedule.tardies=gt.0

In the first two cases the query is based on a property of the Node or Edge. The first is intended to retrieve all students whose name begins with “piers”, the second to get all courses where the given student is currently enrolled and has no tardies.

The third example shows a query that would be achieved with a join in SQL. It is imagined to retrieve all students which have been late for at least one class.

Note that the HTTP query syntax doesn’t really support the kind of database queries we want to perform very well, and because of that we have not yet settled on standards for specifying queries (and different projects have come up with different solutions). We are still striking out into new territory, but remain clear that we want to be working toward a company wide solution in the long term.

Some Node Groups contain a lot of Nodes. At IMVU we have a Node Group for the tens of millions of products in our catalog. We don’t currently support querying our product Node Group, but if we did, we would have to get a response back something like this:

{ …
  {
    "nodes": [
      "/product/product-10057",
      "/product/product-10092",
      ...
      "/product/product-11033",
    ],
    "next": "/product?offset=50",
  }
}

The list of Nodes includes only the first fifty, but provides us with a URL – with a built-in query parameter – which allows us to retrieve another group of fifty.

There are a number of services which do paging like this, but note that offsets don’t work very well for paging when Nodes are often being added and deleted, since the offsets associated with specific Nodes can change. Paging is another area in which our standards are still under development.

Finally, in order to allow clients to cache the responses they receive we provide a way for the service to let the client know when cached URL responses are no longer valid.

This involves the use of IMQ, IMVU’s proprietary system for efficiently pushing data to clients in real time. In our case, the data is simply a notification that an earlier response to a particular URL is no longer valid. We’ll go into detail on IMQ and how it works in a future post.

Most endpoints do not provide invalidation. IMVU product information, for example, doesn’t change often enough to make the overhead associated with invalidation worthwhile.

When invalidation is available, the response to an endpoint will include a member called “updates”, which provides the information necessary to subscribe to the appropriate IMQ queue. Here is a response to a GET of a particular enrollment Edge 
(/student/student-513312/schedule/01):
{
 {
    "data": { 
      "tardies": 1,
    },
    "relations": {
      "ref": "/course/course-1035",
    }
    "updates": "imq://inv.student.student-513312"
  }
}

Invalidations for different endpoints come in on different queues, and it is the server’s prerogative to choose the queue for each endpoint.

Currently, for the endpoints which support invalidation, we have one queue per Node. Invalidations for the Node come in on this queue, but the same queue also provides invalidations for the Node’s Edge Groups and Edges. This design allows us to achieve a balance between the number of queues we create and the number of subscribers to each queue.

In the example above, only clients that have an active interest in Student 513312 will be subscribed to this queue. With this mechanism, if someone POSTs a change to the number of tardies, every client which has this record cached will receive a notification. If the number of tardies is displayed on the screen, it can be updated immediately.

Note that under this scheme, there can be no queue associated with a Node Group. Such a queue would be used to notify all clients that, for example, a new Student Node had been added. To be useful, however, every client application would need to subscribe to the queue, and IMQ is not built to support such a large number of subscribers. There are solutions to this problem, but for the moment we do not support invalidation for Node Groups.

The discipline of REST and the specific uniform contract IMVU has adopted give us the power and flexibility to quickly create, enhance and share back end services. The principles behind REST (and especially the principles of uniform contract, HATEAOS and separation of concerns) provide us with a solid framework as we continue to complete and codify our standards.

We’ll keep you up to date as things develop.

Optimizing WebGL Shaders by Reading D3D Shader Assembly

We are optimizing WebGL shaders for the Intel GMA 950 chipset, which is basically the slowest WebGL-capable device we care about. Unfortunately, it’s a fairly common chipset too. On the plus side, if we run well on the GMA 950, we should basically run well anywhere. :)

When you’re writing GLSL in WebGL on Windows, your code is three layers of abstraction away from what actually runs on the GPU. First, ANGLE translates your GLSL into HLSL. Then, D3DX compiles the HLSL into optimized shader assembly bytecode. That shader bytecode is given to the driver, where it’s translated into hardware instructions for execution on the silicon.

Ideally, when writing GLSL, we’d like to see at least the resulting D3D shader assembly.

With a great deal of effort and luck, I have finally succeeded at extracting Direct3D shader instructions from WebGL. I installed NVIDIA Nsight and Visual Studio 2013 in Boot Camp on my Mac. To use Nsight, you need to create a dummy Visual Studio project. Without a project, it won’t launch at all. Once you have your dummy project open, configure Nsight (via Nsight User Properties under Project Settings) to launch Firefox.exe instead. Firefox is easier to debug than Chrome because it runs in a single process.

If you’re lucky — and I’m not sure why it’s so unreliable — the Nsight UI will show up inside Firefox. If it doesn’t show up, try launching it again. Move the window around, open various menus. Eventually you should have the ability to capture a frame. If you try to capture a frame before the Nsight UI shows up in Firefox, Firefox will hang.

Once you capture a frame, find an interesting draw call, make sure the geometry is from your WebGL application, and then begin looking at the shader. (Note: again, this tool is unreliable. Sometimes you get to look at the HLSL that ANGLE produced, which you can compile and disassemble with fxc.exe, and sometimes you get the raw shader assembly.)

Anyway, check this out. We’re going to focus on the array lookup in the following bone skinning GLSL:

attribute vec4 vBoneIndices;
uniform vec4 uBones[3 * 68];

ivec3 i = ivec3(vBoneIndices.xyz) * 3;
vec4 row0 = uBones[i.x];

ANGLE translates the array lookup into HLSL. Note the added bounds check for security. (Why? D3D claims it already bounds-checks array accesses.)

int3 _i = (ivec3(_vBoneIndices) * 3);
float4 _row0 = _uBones[int(clamp(float(_i.x), 0.0, 203.0))]

This generates the following shader instructions:

# NOTE: v0 is vBoneIndices
# The actual load isn't shown here.  This is just index calculation.

def c0, 2.00787401, -1, 3, 0
def c218, 203, 2, 0.5, 0

# r1 = v0, truncated towards zero
slt r1.xyz, v0, -v0
frc r2.xyz, v0
add r3.xyz, -r2, v0
slt r2.xyz, -r2, r2
mad r1.xyz, r1, r2, r3

mul r2.xyz, r1, c0.z # times three

# clamp
max r2.xyz, r2, c0.w
min r2.xyz, r2, c218.x

# get ready to load, using a0.x as index into uBones
mova a0.x, r2.y

That blob of instructions that implements truncation towards zero? Let’s decode that.

r1.xyz = (v0 < 0) ? 1 : 0
r2.xyz = v0 - floor(v0)
r3.xyz = v0 - r2
r2.xyz = (-r2 < r2) ? 1 : 0
r1.xyz = r1 * r2 + r3

Simplified further:

r1.xyz = (v0 < 0) ? 1 : 0
r2.xyz = (floor(v0) < v0) ? 1 : 0
r1.xyz = (r1 * r2) + floor(v0)

In short, r1 = floor(v0), UNLESS v0 < 0 and floor(v0) < v0, in which case r1 = floor(v0) + 1.

That’s a lot of instructions just to calculate an array index. After the index is calculated, it’s multiplied by three, clamped to the array boundaries (securitah!), and loaded into the address register.

Can we convince ANGLE and HLSL that the array index will never be negative? (It should know this, since it’s already clamping later on, but whatever.) Perhaps avoid a bunch of generated code? Let’s tweak the GLSL a bit.

ivec3 i = ivec3(max(vec3(0), vBoneIndices.xyz)) * 3;
vec4 row0 = uBones[i.x];

Now the instruction stream is substantially reduced!

def c0, 2.00787401, -1, 0, 3
def c218, 203, 1, 2, 0.5

# clamp v0 positive
max r1, c0.z, v0.xyzx

# r1 = floor(r1)
frc r2, r1.wyzw
add r1, r1, -r2

mul r1, r1, c0.w # times three

# bound-check against array
min r2.xyz, r1.wyzw, c218.x
mova a0.x, r2.y

By clamping the bone indices against zero before converting to integer, the shader optimizer eliminates several instructions.

Can we get rid of the two floor instructions? We know that the mova instruction rounds to the nearest integer when converting a float to an index. Given that knowledge, I tried to eliminate the floor by making my GLSL match mova semantics, but the HLSL compiler didn’t seem smart enough to elide the two floor instructions. If you can figure this out, please let me know!

Either way, I wanted to demonstrate that, by reading the generated Direct3D shader code from your WebGL shaders, you may find small changes that result in big wins.

Software engineering and product development best practices at IMVU

Follow

Get every new post delivered to your Inbox.

Join 98 other followers

%d bloggers like this: