January 2015 – IMVU Engineering Blog

By Eric Hohenstein

IMVU relies heavily on unit testing with dependency injection for most of the testing of its website code written in PHP. Unit testing is great because it can be crafted in such a way that it will (hopefully) only test the things you’re interested in testing to verify the behavior of a particular component. This ideally makes unit tests both fast and reliable.

However, it’s possible to write unit that all pass, but that fail to catch an incompatibility between components that causes your application to break horribly. Because of this, IMVU uses a testing tool called selenium for end-to-end acceptance testing. Selenium works by loading a web page or pages in an actual browser window and it performs a set of configurable actions on that web page and verifies that the page behaves as expected. For example, if you have a web page with a button with the text “Hide” next to a widget and you expect that when the button is clicked, the widget is hidden, you can write a selenium test that will load the page in question and click the button and confirm that the widget is hidden.

For at least the past several months, IMVU’s selenium test infrastructure has had a problem that causes otherwise reliable selenium tests to fail intermittently when run in Internet Explorer, showing an error page with the text “This page can’t be displayed”. This has been a constant source of frustration and a drag on engineering efficiency. A great deal of time has been spent by engineers on multiple teams investigating the cause. Up until now, the cause has remained a mystery.

DNS failure

At first, the problem seemed to be related to a DNS lookup failure. The reason for this is that inspecting the frame where the web page was supposed to have loaded reveals that the URL of the error page is res://ieframe.dll/dnserrordiagoff.htm*. The path of that URL (dnserrordiagoff.htm) suggests a DNS lookup failure. Some experimentation shows that DNS errors are, indeed, displayed in this way in Internet Explorer. However, a lot of other network errors will cause the same error page to be displayed so this was not a valid conclusion and this information did little to uncover the source of the problem.

In order to eliminate our corporate DNS service as a possible source of the problem, we tried hard-coding the address of the remote web server host in the hosts file of the Windows host running the selenium test. We initially thought that this had resolved the problem but we soon found that the problem continued.

* It turns out that the “diagoff” part of the path of the error page disables the presentation of a “Fix connection problems” button on the error page. Internet Explorer will show the “diagoff” version of the error page inside an iframe and the alternate dnserror.htm version with the button when shown in the top most frame.

SSL failure

It had been observed that this problem occurred only when the selenium test was trying to load a web page using the HTTPS scheme. This was a useful observation but it didn’t give us much insight into what the problem was early on.

Network reliability

We tried capturing a network trace of the problem occurring several times. One of these time, the capture showed a “Bad record MAC” SSL fatal alert message during an SSL handshake. This made us wonder if the the problem was being caused by an unreliable network between the Windows host where the selenium test was running and the host where the HTTPS page was being served from. Checksums are supposed to catch packet transmission errors but if the errors are happening regularly enough, some garbled packets might slip through with matching checksums. We inspected our network traces for signs of packets with bad checksums but didn’t find any other evidence of that. We inspected several network traces and found that while they all seemed to fail during the SSL handshake, only one failed with a “Bad record MAC” SSL fatal alert message. We decided that this was likely a red herring.

Error duplication

In order to narrow down the source of the problem, we wrote a python script to try and duplicate it. The python script would, 4 times per second, do a DNS lookup of the domain name that the HTTPS URL was using when the problem occurred, and then make an HTTP followed by an HTTPS request to that domain. The script would report when any of these operations took more than a small threshold to complete. This script was run in parallel to the selenium test in the hopes that it would reveal some kind of temporary network lag between the selenium test host and the web server host. The idea was that if this test showed a problem that coincided with the selenium failure then it would indicate that the problem was between the Windows TCP/IP stack and the remote web server. If it didn’t then the problem had to be at a higher level. The results of this experiment showed no failure in the script when the selenium failure occurred so the problem had to be in Internet Explorer or something it relies upon above the low level Windows networking layer.

Lots of moving parts

When we first noticed the problem, we had recently upgraded to a newer version of selenium so that we could run our selenium tests on a newer version of Internet Explorer so that we could run our selenium tests on Windows 7 rather than Windows XP. All of these things changed at approximately the same time and we knew that the problem could potentially be related to any of them.

Windows Defender

In Windows 7, a feature called “Windows Defender” will periodically scan the system for potential problems. We noticed that there was an alert showing on every one of our Windows 7 selenium test hosts warning about changes we had made to the hosts file to facilitate our testing. We thought that Windows Defender might be interfering with Internet Explorer in some way that could be causing the problem. Disabling Windows Defender got rid of the alert, but it the problem continued.

Background tasks

The intermittent nature of the problem seemed to suggest that it might be happening when the test coincided with a background task in Windows 7 that wasn’t happening in Windows XP or that had at least not been causing the same type of problem. One type of background task that we thought might be related was periodically checking for certificate revocation, since we had identified that the problem only affected HTTPS pages. We tried disabling all Windows scheduled tasks, but the problem continued.

Internet Explorer settings

We wondered if some Internet Explorer setting in the new version of Internet Explorer we were using was causing the problem. We tried setting all Internet Explorer settings to the lowest security, lowest privacy settings, disabling extensions, disabling HTTP 1.1, disabling checking SSL certificate revocation, disabling “Do Not Track,” but the problem continued.

Internet Explorer cache/cookies

We had noticed in the past that leftover state in Internet Explorer from previous tests could sometimes cause intermittent selenium test failures. We had some infrastructure in place that would attempt to delete cookies and cache between selenium tests to avoid this problem. We noticed that the file system and registry paths used to find and eliminate leftover Internet Explorer state had been changed since we upgrade Internet Explorer and Windows versions but our infrastructure hadn’t been updated. We fixed this, but the problem continued.

Running out of ideas

At this point, it seemed that IMVU might have to either abandon selenium testing or it might have to abandon using Internet Explorer for selenium testing. Both solutions would be very expensive and would also be a very frustrating failure. That’s when an AHA moment happened. We captured some more network traces when the problem occurred and noticed all of them (by luck, as it turns out) contained “Bad record MAC” SSL fatal alerts. This information combined with the fact that the problem had to be above the low level Windows networking layer really narrowed the problem down quite a bit. Furthermore, all the “Bad record MAC” alerts were being sent by the server, which indicated that the server was rejecting the MAC of the SSL handshake calculated and sent by the Internet Explorer. The conversations all looked like this in Wireshark (in this trace, the address of the server is 172.16.100.16):

Note that there are warning alerts issued by the server about “Unrecognized Name”. This is because the client is sending the SNI TLS extension record and the server is mis-configured in our testing environment. We tried configuring it correctly to eliminate this warning, but the problem continued.

In order to test the hypothesis that a bug in the SSL implementation used by Internet Explorer might be miscalculating the handshake MAC, we tried disabling various SSL/TLS versions. Finally we found a solution! It turned out that disabling TLS 1.2 in Internet Explorer settings resolved the problem. It seemed like we’d narrowed down the problem all the way to the root cause.

Internet Explorer or openssl

But was Internet Explorer really to blame? What we knew from the network trace was that the server’s implementation of TLS 1.2 was rejecting the handshake MAC that it was being given by Internet Explorer. However, this isn’t enough to conclude that the MAC calculated by Internet Explorer was incorrect. It was equally possible that Internet Explorer was calculating the MAC correctly but that the server’s implementation of TLS 1.2 was incorrectly rejecting it. Our server is Apache 2.2.25 with mod_ssl. The mod_ssl Apache module is implemented in terms of openssl, for which we use version 1.0.1. It was possible that what we were witnessing was a bug in openssl. If so, this problem might have affected clients other than just Internet Explorer. If so, rather than just ignoring the problem and disabling TLS 1.2 in Internet Explorer settings on our selenium test hosts, it would be better to fix or reconfigure openssl in our servers. But with only one client (Internet Explorer) and one server (Apache/openssl) demonstrating the problem, we couldn’t know which one had the bug. Since openssl is open source, we could theoretically find the problem if that was where the problem was, given enough time, but it would take a very long time to prove that the problem wasn’t in openssl just by code inspection. Not having Internet Explorer source, we couldn’t do the same there. If we could duplicate the problem between Internet Explorer and a separate SSL/TLS server implementation then that would point very strongly to the problem being in Internet Explorer and not in openssl.

Analysis using Wireshark

Hoping to get a tool like Wireshark to validate the MAC sent by Internet Explorer so that we could narrow the problem to either Internet Explorer or openssl, we configured Wireshark with the private key used by our server. However, this didn’t help because it turned out that the key exchange algorithm negotiated by Internet Explorer and Apache was using Diffie-Hellman. Here’s the relevant portion of the Wireshark SSL debug output file for the “Client Key Exchange” message when the error occurred:

ssl_generate_pre_master_secret: found SSL_HND_CLIENT_KEY_EXCHG, state 17

ssl_decrypt_pre_master_secret session uses DH (17) key exchange, which is impossible to decrypt

ssl_generate_pre_master_secret: can’t decrypt pre master secret

The key material used by the client and server to exchange secrets can’t be obtained using the private key of the server because the exchange uses an ephemeral key pair generated by the server on-the-fly and signed using its private key. This strategy provides a very powerful security feature known as “Perfect Forward Secrecy“. It also means that even if Wireshark has the private key of the server, it still can’t decode the SSL conversation, including validating the handshake MAC.

We then configured the server such that it and Internet Explorer would negotiate the exact same cipher suite but using just RSA rather than DHE-RSA for key exchange. We were successful in doing this, but we weren’t able to isolate the problem to Internet Explorer or openssl using this approach because it eliminated the problem. If the problem was in IE, it seems that it had to be related to it use of Diffie-Hellman key exchange.

New error duplication

In order to more easily and quickly duplicate the problem, we wrote a test program that uses the same networking Windows API that Internet Explorer uses. The library that provides this API on Windows is called WinInet. Our program did the same thing as our python script did, however it failed to duplicate the problem. A network trace while the program was running showed that it only ever opened one SSL connection to the server and made all requests using that single connection. This was worked around by making the test program only make one HTTPS request but we called it in a loop from the Windows command line. This duplicated the problem. Here’s the program:

#include <windows.h>

#include <wininet.h>

#include <stdio.h>

#include <tchar.h>

void ReadUrl(LPCTSTR url) {

HINTERNET hInternet = NULL;

HINTERNET hRequest = NULL;

char buffer[1024];

DWORD bytesRead = 0L;

BOOL result = FALSE;

hInternet = InternetOpen(_T(“Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko”), INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);

if (NULL == hInternet) {

_tprintf(_T(“InternetOpen error: %08X\n”), GetLastError());

return;

}

hRequest = InternetOpenUrl(hInternet, url, NULL, 0, INTERNET_FLAG_RELOAD | INTERNET_FLAG_SECURE, (DWORD_PTR)NULL);

if (NULL == hRequest) {

_tprintf(_T(“InternetOpenUrl error: %08X\n”), GetLastError());

InternetCloseHandle(hInternet);

return;

}

do {

bytesRead = 0L;

result = InternetReadFile(hRequest, buffer, (DWORD)sizeof(buffer), &bytesRead);

if (result && (bytesRead == 0L)) {

break;

}

} while (result);

if (!result) {

DWORD error = GetLastError();

TCHAR errorText[1024];

DWORD length = (DWORD)(sizeof(errorText) / sizeof(TCHAR));

result = InternetGetLastResponseInfo(&error, errorText, &length);

if (!result) {

_tcscpy(errorText, “unknown”);

length = 7L;

}

_tprintf(_T(“error reading request: (%08X) %.*s\n”), error, length, errorText);

}

InternetCloseHandle(hInternet);

InternetCloseHandle(hRequest);

}

int _tmain(int argc, LPCTSTR * argv) {

if (argc != 2) {

_tprintf(_T(“%s usage: %s <URL>\r\n”), argv[0], argv[0]);

return 1;

}

ReadUrl(argv[1]);

return 0;

}

When run, in just a few seconds, it produces output similar to the following:

C:\imvu\temp>for /l %i in (1, 1, 1000) do @wininet_test.exe https://localhost.imvu.com/dummy.txt

InternetOpenUrl error: 00000008

InternetOpenUrl error: 00002F7D

Out of 1000 requests, roughly 10-12 will fail. About half of them will fail with error code 8 and the other half will fail with error code 0x2F7D. Error code 8 corresponds to ERROR_NOT_ENOUGH_MEMORY. Error code 0x2F7D corresponds to ERROR_INTERNET_SECURITY_CHANNEL_ERROR. The description of this error is “The application experienced an internal error loading the SSL libraries.” I suspect this description is misleading and is actually a generic error used for SSL handshake errors and possibly other kinds of failures as well. It seems to be associated with the schannel.dll Windows system library.

A network trace capturing the SSL conversations when these errors occur shows that the failures with error code 8 manifest as the client closing the connection with a [FIN, ACK] packet after receiving the “Server Key Exchange” message without ever sending a “Client Key Exchange” message. The server responds by sending an “Unexpected Message” fatal alert message. The failures with error code 0x2F7D manifest as the server sending a “Bad record MAC” fatal alert after the client sends the final SSL handshake message containing a “Change Cipher Spec” message and terminated with the handshake MAC.

Duplication with gnutls-serv

In order to finally isolate the problem to either Internet Explorer or openssl, we used a tool called gnutls-serv. Gnutls is an SSL/TLS library from GNU. It comes with a tool called gnutls-serv that will act as a simple HTTPS server. After downloading and compiling the latest version (gnutls-3.3.9), we started it with this command line (after killing Apache) on our web server host:

#LD_LIBRARY_PATH=/usr/local/lib/ src/gnutls-serv -p 443 -g –http -a –x509certfile /etc/apache2/certs/ssl.crt/sandbox.crt –x509keyfile /etc/apache2/certs/ssl.key/sandbox.key –priority “SECURE:-ECDHE-RSA” -d 9999

The –priority “SECURE:-ECDHE-RSA” option is given in order to force the cipher suite negotiated between WinInet and gnutls-serv to be the exact one negotiated between Internet Explorer and Apache. It was initially choosing the Eliptic Curve Diffie-Hellman key exchange algorithm. It turns out that the problem is not reproducible when using Eliptic Curve Diffie-Hellman.

Running the WinInet test program against this server, the problem continued to be observed with the same frequency as when using Apache/openssl. The network trace of these failures look more-or-less identical to the ones generated with Apache/openssl:

The only significant difference is the missing server name warning. gnutls-serv apparently ignores the SNI TLS extension.

Inspecting the relevant debugging output of gnutls-serv corresponding to the failure shows this:

|<3>| ASSERT: gnutls_cipher.c:728

|<3>| ASSERT: gnutls_cipher.c:167

|<3>| ASSERT: gnutls_record.c:1239

|<0x22ae770>| Discarded message[0] due to invalid decryption

|<3>| ASSERT: gnutls_buffers.c:1358

|<3>| ASSERT: gnutls_handshake.c:1428

|<3>| ASSERT: gnutls_handshake.c:785

|<3>| ASSERT: gnutls_handshake.c:3054

|<3>| ASSERT: gnutls_handshake.c:3200

Error in handshake

|<5>| REC: Sending Alert[2|20] – Bad record MAC

The original failure is reported on line 728 of gnutls_cipher.c which is this bit of code:

if (unlikely

(memcmp(tag, tag_ptr, tag_size) != 0 || pad_failed != 0)) {

/* HMAC was not the same. */

dummy_wait(params, compressed, pad_failed, pad,

length + preamble_size);

return gnutls_assert_val(GNUTLS_E_DECRYPTION_FAILED);

}

In this code, the unlikely() function is just a compiler hint and can be replaced with the condition passed to it.

This shows that the problem could be either the MAC is bad or the padding calculation failed. To narrow down the problem even further we modified the code to be this:

if (unlikely

(memcmp(tag, tag_ptr, tag_size) != 0 || pad_failed != 0)) {

if (memcmp(tag, tag_ptr, tag_size) != 0) {

char buf[1024];

_gnutls_assert_log(“ASSERT: %s:%d – mismatching MAC\n”, __FILE__, __LINE__);

_gnutls_assert_log(“ASSERT: expected MAC: %s\n”,

_gnutls_bin2hex(tag, tag_size, buf, sizeof(buf), NULL));

_gnutls_assert_log(“ASSERT: MAC received: %s\n”,

_gnutls_bin2hex(tag_ptr, tag_size, buf, sizeof(buf), NULL));

}

if (pad_failed != 0) {

gnutls_assert();

}

/* HMAC was not the same. */

dummy_wait(params, compressed, pad_failed, pad,

length + preamble_size);

return gnutls_assert_val(GNUTLS_E_DECRYPTION_FAILED);

}

With this change, we get this debug output:

|<3>| ASSERT: gnutls_cipher.c:726 – mismatching MAC

|<3>| ASSERT: expected MAC: ab628ad135b519d2f71686e78d4d84fc

|<3>| ASSERT: MAC received: b053a9400b47c8ed3796ddb317b6a957

|<3>| ASSERT: gnutls_cipher.c:739

|<3>| ASSERT: gnutls_cipher.c:167

|<3>| ASSERT: gnutls_record.c:1239

|<0x156c770>| Discarded message[0] due to invalid decryption

|<3>| ASSERT: gnutls_buffers.c:1358

|<3>| ASSERT: gnutls_handshake.c:1428

|<3>| ASSERT: gnutls_handshake.c:785

|<3>| ASSERT: gnutls_handshake.c:3054

|<3>| ASSERT: gnutls_handshake.c:3200

Error in handshake

|<5>| REC: Sending Alert[2|20] – Bad record MAC

This shows that the actual failure is a mismatch between the MAC calculated over the handshake by the client and server and that there was not a problem detected with the padding.

Conclusion

The fact that the problem occurs with both the openssl and gnutls server implementations indicates that this is almost certainly a bug in the WinInet SSL handshake MAC calculation. The fact that the problem only occurs when using DHE-RSA key exchange algorithms indicates that the failure is caused by a bug in the WinInet Diffie-Hellman MAC calculation or else the derivation of the client MAC key when using Diffie-Hellman key exchange.

This analysis has focused only on the “Bad record MAC” failures. There were also another set of failures with roughly the same frequency that didn’t involve “Bad record MAC” fatal alert messages. The fact that these failures occurred after the server had sent its final handshake message suggests that this could actually be another symptom of the same problem. The same or equivalent bug could cause the client to miscalculate the expected server’s handshake MAC. If the client is rejecting the server’s handshake MAC but not sending “Bad record MAC” alerts when this happens then the observed behavior is exactly what we’d expect to see. I have no explanation as to why the error code 8 (ERROR_NOT_ENOUGH_MEMORY) would be emitted in this condition.

The bottom line is that the implementation of Diffie-Hellman key exchange used by Internet Explorer 11 on Windows 7 is broken. We could disable Diffie-Hellman key exchange in our Apache server configuration. The frequency of this problem and the market share of Internet Explorer likely don’t warrant that course of action since “Perfect Forward Secrecy” is rather valuable. Instead, IMVU will be disabling TLS 1.2 in Internet Explorer configuration on selenium test hosts.

Month: January 2015

The case of the “Page can’t be displayed” intermittent selenium test

By Eric Hohenstein

Code Reviews: Follow the Data

Why Review Code?

Reviewing Diffs

Reviewing Everything

Follow the Data

Handling Code Review Follow-Up Tasks