The case of the “Page can’t be displayed” intermittent selenium test

By Eric Hohenstein

 

IMVU relies heavily on unit testing with dependency injection for most of the testing of its website code written in PHP. Unit testing is great because it can be crafted in such a way that it will (hopefully) only test the things you’re interested in testing to verify the behavior of a particular component. This ideally makes unit tests both fast and reliable.

However, it’s possible to write unit that all pass, but that fail to catch an incompatibility between components that causes your application to break horribly. Because of this, IMVU uses a testing tool called selenium for end-to-end acceptance testing. Selenium works by loading a web page or pages in an actual browser window and it performs a set of configurable actions on that web page and verifies that the page behaves as expected. For example, if you have a web page with a button with the text “Hide” next to a widget and you expect that when the button is clicked, the widget is hidden, you can write a selenium test that will load the page in question and click the button and confirm that the widget is hidden.
For at least the past several months, IMVU’s selenium test infrastructure has had a problem that causes otherwise reliable selenium tests to fail intermittently when run in Internet Explorer, showing an error page with the text “This page can’t be displayed”. This has been a constant source of frustration and a drag on engineering efficiency. A great deal of time has been spent by engineers on multiple teams investigating the cause. Up until now, the cause has remained a mystery.

 

Ericblog1
 

DNS failure
At first, the problem seemed to be related to a DNS lookup failure. The reason for this is that inspecting the frame where the web page was supposed to have loaded reveals that the URL of the error page is res://ieframe.dll/dnserrordiagoff.htm*. The path of that URL (dnserrordiagoff.htm) suggests a DNS lookup failure. Some experimentation shows that DNS errors are, indeed, displayed in this way in Internet Explorer. However, a lot of other network errors will cause the same error page to be displayed so this was not a valid conclusion and this information did little to uncover the source of the problem.
In order to eliminate our corporate DNS service as a possible source of the problem, we tried hard-coding the address of the remote web server host in the hosts file of the Windows host running the selenium test. We initially thought that this had resolved the problem but we soon found that the problem continued.
* It turns out that the “diagoff” part of the path of the error page disables the presentation of a “Fix connection problems” button on the error page. Internet Explorer will show the “diagoff” version of the error page inside an iframe and the alternate dnserror.htm version with the button when shown in the top most frame.

 

SSL failure
It had been observed that this problem occurred only when the selenium test was trying to load a web page using the HTTPS scheme. This was a useful observation but it didn’t give us much insight into what the problem was early on.

 

Network reliability
We tried capturing a network trace of the problem occurring several times. One of these time, the capture showed a “Bad record MAC” SSL fatal alert message during an SSL handshake. This made us wonder if the the problem was being caused by an unreliable network between the Windows host where the selenium test was running and the host where the HTTPS page was being served from. Checksums are supposed to catch packet transmission errors but if the errors are happening regularly enough, some garbled packets might slip through with matching checksums. We inspected our network traces for signs of packets with bad checksums but didn’t find any other evidence of that. We inspected several network traces and found that while they all seemed to fail during the SSL handshake, only one failed with a “Bad record MAC” SSL fatal alert message. We decided that this was likely a red herring.

 

Error duplication
In order to narrow down the source of the problem, we wrote a python script to try and duplicate it. The python script would, 4 times per second, do a DNS lookup of the domain name that the HTTPS URL was using when the problem occurred, and then make an HTTP followed by an HTTPS request to that domain. The script would report when any of these operations took more than a small threshold to complete. This script was run in parallel to the selenium test in the hopes that it would reveal some kind of temporary network lag between the selenium test host and the web server host. The idea was that if this test showed a problem that coincided with the selenium failure then it would indicate that the problem was between the Windows TCP/IP stack and the remote web server. If it didn’t then the problem had to be at a higher level. The results of this experiment showed no failure in the script when the selenium failure occurred so the problem had to be in Internet Explorer or something it relies upon above the low level Windows networking layer.

 

Lots of moving parts
When we first noticed the problem, we had recently upgraded to a newer version of selenium so that we could run our selenium tests on a newer version of Internet Explorer so that we could run our selenium tests on Windows 7 rather than Windows XP. All of these things changed at approximately the same time and we knew that the problem could potentially be related to any of them.

 

Windows Defender
In Windows 7, a feature called “Windows Defender” will periodically scan the system for potential problems. We noticed that there was an alert showing on every one of our Windows 7 selenium test hosts warning about changes we had made to the hosts file to facilitate our testing. We thought that Windows Defender might be interfering with Internet Explorer in some way that could be causing the problem. Disabling Windows Defender got rid of the alert, but it the problem continued.

 

Background tasks
The intermittent nature of the problem seemed to suggest that it might be happening when the test coincided with a background task in Windows 7 that wasn’t happening in Windows XP or that had at least not been causing the same type of problem. One type of background task that we thought might be related was periodically checking for certificate revocation, since we had identified that the problem only affected HTTPS pages. We tried disabling all Windows scheduled tasks, but the problem continued.

 

Internet Explorer settings
We wondered if some Internet Explorer setting in the new version of Internet Explorer we were using was causing the problem. We tried setting all Internet Explorer settings to the lowest security, lowest privacy settings, disabling extensions, disabling HTTP 1.1, disabling checking SSL certificate revocation, disabling “Do Not Track,” but the problem continued.

 

Internet Explorer cache/cookies
We had noticed in the past that leftover state in Internet Explorer from previous tests could sometimes cause intermittent selenium test failures. We had some infrastructure in place that would attempt to delete cookies and cache between selenium tests to avoid this problem. We noticed that the file system and registry paths used to find and eliminate leftover Internet Explorer state had been changed since we upgrade Internet Explorer and Windows versions but our infrastructure hadn’t been updated. We fixed this, but the problem continued.

 

Running out of ideas
At this point, it seemed that IMVU might have to either abandon selenium testing or it might have to abandon using Internet Explorer for selenium testing. Both solutions would be very expensive and would also be a very frustrating failure. That’s when an AHA moment happened. We captured some more network traces when the problem occurred and noticed all of them (by luck, as it turns out) contained “Bad record MAC” SSL fatal alerts. This information combined with the fact that the problem had to be above the low level Windows networking layer really narrowed the problem down quite a bit. Furthermore, all the “Bad record MAC” alerts were being sent by the server, which indicated that the server was rejecting the MAC of the SSL handshake calculated and sent by the Internet Explorer. The conversations all looked like this in Wireshark (in this trace, the address of the server is 172.16.100.16):

 

Ericblog2
 

Note that there are warning alerts issued by the server about “Unrecognized Name”. This is because the client is sending the SNI TLS extension record and the server is mis-configured in our testing environment. We tried configuring it correctly to eliminate this warning, but the problem continued.
In order to test the hypothesis that a bug in the SSL implementation used by Internet Explorer might be miscalculating the handshake MAC, we tried disabling various SSL/TLS versions. Finally we found a solution! It turned out that disabling TLS 1.2 in Internet Explorer settings resolved the problem. It seemed like we’d narrowed down the problem all the way to the root cause.

 

Internet Explorer or openssl
But was Internet Explorer really to blame? What we knew from the network trace was that the server’s implementation of TLS 1.2 was rejecting the handshake MAC that it was being given by Internet Explorer. However, this isn’t enough to conclude that the MAC calculated by Internet Explorer was incorrect. It was equally possible that Internet Explorer was calculating the MAC correctly but that the server’s implementation of TLS 1.2 was incorrectly rejecting it. Our server is Apache 2.2.25 with mod_ssl. The mod_ssl Apache module is implemented in terms of openssl, for which we use version 1.0.1. It was possible that what we were witnessing was a bug in openssl. If so, this problem might have affected clients other than just Internet Explorer. If so, rather than just ignoring the problem and disabling TLS 1.2 in Internet Explorer settings on our selenium test hosts, it would be better to fix or reconfigure openssl in our servers. But with only one client (Internet Explorer) and one server (Apache/openssl) demonstrating the problem, we couldn’t know which one had the bug. Since openssl is open source, we could theoretically find the problem if that was where the problem was, given enough time, but it would take a very long time to prove that the problem wasn’t in openssl just by code inspection. Not having Internet Explorer source, we couldn’t do the same there. If we could duplicate the problem between Internet Explorer and a separate SSL/TLS server implementation then that would point very strongly to the problem being in Internet Explorer and not in openssl.
 

Analysis using Wireshark
Hoping to get a tool like Wireshark to validate the MAC sent by Internet Explorer so that we could narrow the problem to either Internet Explorer or openssl, we configured Wireshark with the private key used by our server. However, this didn’t help because it turned out that the key exchange algorithm negotiated by Internet Explorer and Apache was using Diffie-Hellman. Here’s the relevant portion of the Wireshark SSL debug output file for the “Client Key Exchange” message when the error occurred:
ssl_generate_pre_master_secret: found SSL_HND_CLIENT_KEY_EXCHG, state 17
ssl_decrypt_pre_master_secret session uses DH (17) key exchange, which is impossible to decrypt
ssl_generate_pre_master_secret: can’t decrypt pre master secret
The key material used by the client and server to exchange secrets can’t be obtained using the private key of the server because the exchange uses an ephemeral key pair generated by the server on-the-fly and signed using its private key. This strategy provides a very powerful security feature known as “Perfect Forward Secrecy“. It also means that even if Wireshark has the private key of the server, it still can’t decode the SSL conversation, including validating the handshake MAC.

 

Ericblog3

 

We then configured the server such that it and Internet Explorer would negotiate the exact same cipher suite but using just RSA rather than DHE-RSA for key exchange. We were successful in doing this, but we weren’t able to isolate the problem to Internet Explorer or openssl using this approach because it eliminated the problem. If the problem was in IE, it seems that it had to be related to it use of Diffie-Hellman key exchange.
 

New error duplication
In order to more easily and quickly duplicate the problem, we wrote a test program that uses the same networking Windows API that Internet Explorer uses. The library that provides this API on Windows is called WinInet. Our program did the same thing as our python script did, however it failed to duplicate the problem. A network trace while the program was running showed that it only ever opened one SSL connection to the server and made all requests using that single connection. This was worked around by making the test program only make one HTTPS request but we called it in a loop from the Windows command line. This duplicated the problem. Here’s the program:
#include <windows.h>
#include <wininet.h>
#include <stdio.h>
#include <tchar.h>
 
void ReadUrl(LPCTSTR url) {
    HINTERNET hInternet = NULL;
    HINTERNET hRequest = NULL;
    char buffer[1024];
    DWORD bytesRead = 0L;
    BOOL result = FALSE;
 
    hInternet = InternetOpen(_T(“Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko”), INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
    if (NULL == hInternet) {
        _tprintf(_T(“InternetOpen error: %08X\n”), GetLastError());
        return;
    }
    
    hRequest = InternetOpenUrl(hInternet, url, NULL, 0, INTERNET_FLAG_RELOAD | INTERNET_FLAG_SECURE, (DWORD_PTR)NULL);
    if (NULL == hRequest) {
        _tprintf(_T(“InternetOpenUrl error: %08X\n”), GetLastError());
        InternetCloseHandle(hInternet);
        return;
    }
 
    do {
        bytesRead = 0L;
        result = InternetReadFile(hRequest, buffer, (DWORD)sizeof(buffer), &bytesRead);
        if (result && (bytesRead == 0L)) {
            break;
        }
    } while (result);
 
    if (!result) {
        DWORD error = GetLastError();
        TCHAR errorText[1024];
        DWORD length = (DWORD)(sizeof(errorText) / sizeof(TCHAR));
        result = InternetGetLastResponseInfo(&error, errorText, &length);
        if (!result) {
            _tcscpy(errorText, “unknown”);
            length = 7L;
        }
        _tprintf(_T(“error reading request: (%08X) %.*s\n”), error, length, errorText);
    }
 
    InternetCloseHandle(hInternet);
    InternetCloseHandle(hRequest);
}
 
int _tmain(int argc, LPCTSTR * argv) {
    if (argc != 2) {
        _tprintf(_T(“%s usage: %s <URL>\r\n”), argv[0], argv[0]);
        return 1;
    }
 
    ReadUrl(argv[1]);
 
    return 0;
}
When run, in just a few seconds, it produces output similar to the following:
C:\imvu\temp>for /l %i in (1, 1, 1000) do @wininet_test.exe https://localhost.imvu.com/dummy.txt
InternetOpenUrl error: 00000008
InternetOpenUrl error: 00002F7D
InternetOpenUrl error: 00002F7D
^C
Out of 1000 requests, roughly 10-12 will fail. About half of them will fail with error code 8 and the other half will fail with error code 0x2F7D. Error code 8 corresponds to ERROR_NOT_ENOUGH_MEMORY. Error code 0x2F7D corresponds to ERROR_INTERNET_SECURITY_CHANNEL_ERROR. The description of this error is “The application experienced an internal error loading the SSL libraries.” I suspect this description is misleading and is actually a generic error used for SSL handshake errors and possibly other kinds of failures as well. It seems to be associated with the schannel.dll Windows system library.
A network trace capturing the SSL conversations when these errors occur shows that the failures with error code 8 manifest as the client closing the connection with a [FIN, ACK] packet after receiving the “Server Key Exchange” message without ever sending a “Client Key Exchange” message. The server responds by sending an “Unexpected Message” fatal alert message. The failures with error code 0x2F7D manifest as the server sending a “Bad record MAC” fatal alert after the client sends the final SSL handshake message containing a “Change Cipher Spec” message and terminated with the handshake MAC.
 

Duplication with gnutls-serv
In order to finally isolate the problem to either Internet Explorer or openssl, we used a tool called gnutls-serv. Gnutls is an SSL/TLS library from GNU. It comes with a tool called gnutls-serv that will act as a simple HTTPS server. After downloading and compiling the latest version (gnutls-3.3.9), we started it with this command line (after killing Apache) on our web server host:
#LD_LIBRARY_PATH=/usr/local/lib/ src/gnutls-serv -p 443 -g –http -a –x509certfile /etc/apache2/certs/ssl.crt/sandbox.crt –x509keyfile /etc/apache2/certs/ssl.key/sandbox.key –priority “SECURE:-ECDHE-RSA” -d 9999
The –priority “SECURE:-ECDHE-RSA” option is given in order to force the cipher suite negotiated between WinInet and gnutls-serv to be the exact one negotiated between Internet Explorer and Apache. It was initially choosing the Eliptic Curve Diffie-Hellman key exchange algorithm. It turns out that the problem is not reproducible when using Eliptic Curve Diffie-Hellman.
Running the WinInet test program against this server, the problem continued to be observed with the same frequency as when using Apache/openssl. The network trace of these failures look more-or-less identical to the ones generated with Apache/openssl:
Ericblog4
The only significant difference is the missing server name warning. gnutls-serv apparently ignores the SNI TLS extension.
Inspecting the relevant debugging output of gnutls-serv corresponding to the failure shows this:
|<3>| ASSERT: gnutls_cipher.c:728
|<3>| ASSERT: gnutls_cipher.c:167
|<3>| ASSERT: gnutls_record.c:1239
|<0x22ae770>| Discarded message[0] due to invalid decryption
|<3>| ASSERT: gnutls_buffers.c:1358
|<3>| ASSERT: gnutls_handshake.c:1428
|<3>| ASSERT: gnutls_handshake.c:785
|<3>| ASSERT: gnutls_handshake.c:3054
|<3>| ASSERT: gnutls_handshake.c:3200
Error in handshake
|<5>| REC: Sending Alert[2|20] – Bad record MAC
The original failure is reported on line 728 of gnutls_cipher.c which is this bit of code:
    if (unlikely
        (memcmp(tag, tag_ptr, tag_size) != 0 || pad_failed != 0)) {
        /* HMAC was not the same. */
        dummy_wait(params, compressed, pad_failed, pad,
               length + preamble_size);
 
        return gnutls_assert_val(GNUTLS_E_DECRYPTION_FAILED);
    }
In this code, the unlikely() function is just a compiler hint and can be replaced with the condition passed to it.
This shows that the problem could be either the MAC is bad or the padding calculation failed. To narrow down the problem even further we modified the code to be this:
    if (unlikely
        (memcmp(tag, tag_ptr, tag_size) != 0 || pad_failed != 0)) {
        if (memcmp(tag, tag_ptr, tag_size) != 0) {
            char buf[1024];
            _gnutls_assert_log(“ASSERT: %s:%d – mismatching MAC\n”, __FILE__, __LINE__);
            _gnutls_assert_log(“ASSERT: expected MAC: %s\n”,
                _gnutls_bin2hex(tag, tag_size, buf, sizeof(buf), NULL));
            _gnutls_assert_log(“ASSERT: MAC received: %s\n”,
                _gnutls_bin2hex(tag_ptr, tag_size, buf, sizeof(buf), NULL));
        }
        if (pad_failed != 0) {
            gnutls_assert();
        }
        /* HMAC was not the same. */
        dummy_wait(params, compressed, pad_failed, pad,
               length + preamble_size);
 
        return gnutls_assert_val(GNUTLS_E_DECRYPTION_FAILED);
    }
With this change, we get this debug output:
|<3>| ASSERT: gnutls_cipher.c:726 – mismatching MAC
|<3>| ASSERT: expected MAC: ab628ad135b519d2f71686e78d4d84fc
|<3>| ASSERT: MAC received: b053a9400b47c8ed3796ddb317b6a957
|<3>| ASSERT: gnutls_cipher.c:739
|<3>| ASSERT: gnutls_cipher.c:167
|<3>| ASSERT: gnutls_record.c:1239
|<0x156c770>| Discarded message[0] due to invalid decryption
|<3>| ASSERT: gnutls_buffers.c:1358
|<3>| ASSERT: gnutls_handshake.c:1428
|<3>| ASSERT: gnutls_handshake.c:785
|<3>| ASSERT: gnutls_handshake.c:3054
|<3>| ASSERT: gnutls_handshake.c:3200
Error in handshake
|<5>| REC: Sending Alert[2|20] – Bad record MAC
This shows that the actual failure is a mismatch between the MAC calculated over the handshake by the client and server and that there was not a problem detected with the padding.
 

Conclusion
The fact that the problem occurs with both the openssl and gnutls server implementations indicates that this is almost certainly a bug in the WinInet SSL handshake MAC calculation. The fact that the problem only occurs when using DHE-RSA key exchange algorithms indicates that the failure is caused by a bug in the WinInet Diffie-Hellman MAC calculation or else the derivation of the client MAC key when using Diffie-Hellman key exchange.
This analysis has focused only on the “Bad record MAC” failures. There were also another set of failures with roughly the same frequency that didn’t involve “Bad record MAC” fatal alert messages. The fact that these failures occurred after the server had sent its final handshake message suggests that this could actually be another symptom of the same problem. The same or equivalent bug could cause the client to miscalculate the expected server’s handshake MAC. If the client is rejecting the server’s handshake MAC but not sending “Bad record MAC” alerts when this happens then the observed behavior is exactly what we’d expect to see. I have no explanation as to why the error code 8 (ERROR_NOT_ENOUGH_MEMORY) would be emitted in this condition.
The bottom line is that the implementation of Diffie-Hellman key exchange used by Internet Explorer 11 on Windows 7 is broken. We could disable Diffie-Hellman key exchange in our Apache server configuration. The frequency of this problem and the market share of Internet Explorer likely don’t warrant that course of action since “Perfect Forward Secrecy” is rather valuable. Instead, IMVU will be disabling TLS 1.2 in Internet Explorer configuration on selenium test hosts.

Code Reviews: Follow the Data

After years of reviewing other people’s code, I’d like to share a couple practices that improve the effectiveness of code reviews.

Why Review Code?

First, why review code at all? There are a few reasons:

  • Catching bugs
  • Enforce stylistic consistency with the rest of the code
  • Finding opportunities to share code across systems
  • Helping new engineers spin up with the team and project
  • Helping API authors see actual problems that API consumers face
  • Maintain the health of the system overall

It seems most people reach for one of two standard techniques when reviewing code; they either review diffs as they arrive or they review every change in one large chunk. We’ve used both techniques, but I find there’s a more effective way to spend code review time.

Reviewing Diffs

It seems the default code review technique for most people is to sign up for commit emails or proposed patches and read every diff as it goes by. This has some benefits – it’s nice to see when someone is working in an area or that a library had a deficiency needing correction. However, on a large application, diffs become blinding. When all you see is streams of diffs, you can lose sight of how the application’s architecture is evolving.

I’ve ended up in situations where every diff looked perfectly reasonable but, when examining the application at a higher level, key system invariants had been broken.

In addition, diffs tends to emphasize small, less important details over more important integration and design risks. I’ve noticed that, when people only review diffs, they tend to point out things like whitespace style, how iteration over arrays is expressed, and the names of local variables. In some codebases these are important, but higher-level structure is often much more important over the life of the code.

Reviewing diffs can also result in wasted work. Perhaps someone is iterating towards a solution. The code reviewer may waste time reviewing code that its author is intending to rework anyway.

Reviewing Everything

Less often, I’ve seen another code review approach similar to reviewing diffs, but on entire bodies of work at a time. This approach can work, but it’s often mindnumbing. See, there are two types of code: core, foundational code and leaf code. Foundational code sits beneath and between everything else in the application. It’s important that it be correct, extensible, and maintainable. Leaf code is the specific functionality a feature needs. It is likely to be used in a single place and may never be touched again. Leaf code is not that interesting, so most of your code review energy should go towards the core pieces. Reviewing all the code in a project or user story mixes the leaf code in with the foundational code and makes it harder see exactly what’s going on.

I think there’s a better way to run code reviews. It’s not as boring, tends to catch important changes to core systems, and is fairly efficient in terms of time spent.

Follow the Data

My preferred technique for reviewing code is to trace data as it flows through the system. This should be done after a meaningful, but not TOO large, body of work. You want about as much code as you can review in an hour: perhaps more than a user story, but less than an entire feature. Start with a single piece of data, perhaps some text entered on a website form. Then, trace that data all the way through the system to the output. This includes any network protocols, transformation functions, text encoding, decoding, storage in databases, caching, and escaping.

Following data through the code makes its high-level structure apparent. After all, code only exists to transform data. You may even notice scenarios where two transformations can be folded into one. Or perhaps eliminated entirely — sometimes abstraction adds no value at all.

This style of code review frequently covers code that wasn’t written by the person or team that initiated the code review. But that’s okay! It helps people get a bigger picture, and if the goal is to maintain overall system health, new code and existing shouldn’t be treated differently.

It’s also perfectly fine for the code review not to cover every new function. You’ll likely hit most of them while tracing the data (otherwise the functions would be dead code) but it’s better to emphasize the main flows. Once the code’s high-level structure is apparent, it’s usually clear which functions are more important than others.

After experimenting with various code review techniques, this approach has been the most effective and reliable over time. Make sure code reviews are somewhat frequent, however. After completion of every “project” or “story” or “module” or whatever, sit down for an hour with the code’s authors and appropriate tech leads and review the code. If the code review takes longer than an hour, people become too fatigued to add value.

Handling Code Review Follow-Up Tasks

At IMVU in particular, as we’re reviewing the code, someone will rapidly take notes into a shared document. The purpose of the review meeting is to review the code, not discuss the appropriate follow-up actions. It’s important not to interrupt the flow of the review meeting with a drawn-out discussion about what to do about one particular issue.

After the meeting, the team leads should categorize follow-up tasks into one of three categories:

  1. Do it right now. Usually tiny tweaks, for example: rename a function, call a different API, delete some commented-out code, move a function to a different file.
  2. Do it at the top of the next iteration. This is for small or medium-sized tasks that are worth doing. Examples: fix a bug, rework an API a bit, change an important but not-yet-ubiquitous file format.
  3. Would be nice someday. Delete these tasks. Don’t put them in a backlog. Maybe mention them to a research or infrastructure team. Example: “It would be great if our job scheduling system could specify dependencies declaratively.”

Nothing should float around on an amorphous backlog. If they are important, they’ll come up again. Plus, it’s very tempting to say “We’ll get to it” but you never will, and even if you have time, nobody will have context. So either get it done right away or be honest with yourself and consciously drop it.

Now go and review some code! 🙂