BlueSteel

While writing the source code of a 3D engine, we can run into many types of problems, ranging from simple ones, like code that will not compile, to harder ones, like memory leaks. At IMVU, we have automated solutions in place to help find these kinds of problems. For example, unit testing combined with Valgrind has helped us track down a lot of memory leaks.

But what about performance problems?

A single code commit can subtly affect the speed and performance of the 3D engine in ways that are not obvious to the engineer. We suffered from this problem enough that I found it worthwhile to look into it in more detail. We hacked together a rough solution to minimize the performance regressions in our 3D engine, and it quickly paid off, telling us which commits were affecting the engine’s performance. While it wasn’t part of the original IMVU schedule, we encourage engineers to solve problems they see, and I took it upon myself to create a tool that solved this one. I spent several late evenings and even some weekends to build what we now call BlueSteel.

BlueSteel is a tool to help track the performance of our projects by commit, branch, and machine.

Given a set of commands that we call a “Benchmark Definition”, BlueSteel will execute them and store their output in something called a “Benchmark Execution”. If this output follows a specific schema, BlueSteel will understand the data and inform the commit author if a fluctuation exists between Benchmark Executions. The ability to check every commit on any branch of our projects is very important.

This gives us the ability to react quickly to a performance degradation, and more importantly, it means the author of the commit, who has the most context on that code, will be notified very quickly, before that context is lost. In addition, it gives us the ability to freely experiment with new ideas on non-stable branches and see whether the resulting code performs better or worse than the stable branch.
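
To make this concrete, here is a minimal Python sketch of the kind of comparison BlueSteel performs between consecutive Benchmark Executions. The function and the sample numbers are illustrative, not BlueSteel’s actual API; only the 2% default mirrors the threshold configured later in this post.

def exceeds_threshold(previous, current, max_fluctuation=0.02):
    """Return True if the benchmark value moved more than the threshold."""
    if previous == 0:
        return current != 0
    return abs(current - previous) / float(previous) > max_fluctuation

# For example, 46.0 ms -> 47.0 ms is a ~2.17% increase,
# which would trip a 2% threshold.
assert exceeds_threshold(46.0, 47.0)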

As an example, we are going to walk through a real situation in which BlueSteel detected a performance issue and notified us so we could fix it. We started from the idea of splitting the list of elements to render between opaque and transparent ones. This was an experiment in preparation for rendering transparent elements using the Depth Peeling technique. With git, we created a branch to start the experiment, and we committed some code. After that, the commits were pushed to our GitHub repository. At this point, BlueSteel took a few minutes to notice the change and feed all the information into its database.

git checkout -b wip-render-slot-split
git push -u origin wip-render-slot-split

Afterwards, we went to the BlueSteel main page, which allows us to manage the different concepts listed below:

  • Layouts: A Layout is a group of projects that work together for benchmarking purposes, and this tab will show us the list of Layouts configured in BlueSteel.
  • Projects: A project represents a git repository with the code we want to benchmark or the project we want to use to benchmark other projects.
  • Definitions: A Benchmark Definition is a set of properties. These properties are:
    • A list of commands that will give the result of the benchmark.
    • A list of allowed machines to run this benchmark on.
    • Fluctuation thresholds and overrides.
    • Notification controls.
  • Executions: A Benchmark Execution is the combination of Benchmark Definitions, commits, and machines. Each Benchmark Execution will store the result of a Benchmark Definition executed while the project was checked out at a given commit on a certain machine. When all these Benchmark Executions are stacked together, then we are able to see the progression of our performance on a branch.

As a quick note, we have a Benchmark Definition configured with the following commands:

Image 1

This Benchmark Definition is going to be executed on the af050975 and benchmark3 machines:

Image 2

And BlueSteel will send notifications if it detects a fluctuation greater than 2%.

Image 3

If we click on the Projects tab (1), we will see a list of registered projects, and each of these projects allows us to list all of its branches (2).

Image 4

After clicking on the ‘See project’s branches’ button, BlueSteel presents us with a paginated list of all the public branches of our project and a quick view of the status of every commit. On this page we can see that BlueSteel is aware of the branch we pushed, and all the required workers are already working on it.

Image 5

After that, we clicked on the ‘Executions’ tab, which allows us to retrieve the results of Benchmark Definitions for commits on a given machine. In this case, we selected:

  • Layout: northstar-layout
  • Project: northstar-project
  • Branch: wip-render-slot-split
  • Definition: cpp-render-executer
  • Worker: af050975.dmz.imvu.com

This combination of options gave us the following result:

Image 6

As we can see, this branch only has 3 commits on it (the darkest blue ones); the rest of the commits are from the master branch. We can also see that there is a problem with the latest commit: the time spent on the main render pipeline increased more than expected. The increase over the previous commit’s result exceeded the maximum allowed increase between commits, which this Benchmark Definition defines as 2%.

We can have a look at the email that BlueSteel sent us because of this issue:

Image 7

We can see that the fluctuation was ~2.17% compared to the previous commit, enough to draw our attention to the affected code and investigate what was causing the performance degradation. After a while, we discovered that splitting the list of render slots into two lists was forcing the engine to make more pointer dereferences than expected inside a high-frequency loop. With this information, we took a step back and rethought the whole strategy for the required solution.

Now the render pipeline has a more optimal way of separating opaque and transparent render slots while paying hardly any performance penalty.

BlueSteel is now a publicly available tool under the MIT license: https://github.com/imvu/bluesteel

Llorenç Marti Garcia

Promises in Unity

By Michael Powell

The IMVU Unity API makes heavy use of promises to simplify asynchronous operations. Unfortunately, they’re different from the way things are normally done in Unity, and thus many people using our API have been understandably confused by them. The objective of this post is to alleviate that confusion. We first illustrate the problem we face without promises, then we explore a number of partial solutions, and finally we pull them all together to show just how powerful promises are.

Synchronous vs Asynchronous

The IMVU Unity API is focused almost entirely on loading things over the network, which really needs to be done asynchronously. To illustrate, consider this code snippet, which uses a hypothetical naive API:

class MyAvatar: LocalAssetLoader {
    void Start() {
        UserModel user = Imvu.Login();
        Load(user);
        // ... Do stuff to the loaded avatar
    }
}

That looks great! It’s super short, simple, and clean.

But there’s a problem. The call to Imvu.Login() can’t return a UserModel right away. This will, at the very least, require a few network roundtrips, which could easily take a second or two, and it may need to redirect the user to a website to go through the login flow, which could take minutes. The call to LocalAssetLoader.Load() will likewise take some time. It will kick off dozens of network requests, which in the optimistic case will take a few seconds to complete, and 10-20 seconds is pretty common.

This is a problem because, if we wrote the API this way, the entire program would be frozen waiting for these calls to complete. No frames would render. No user input would be read. It would feel like the program had crashed, because it’s just sitting there waiting for these calls to return.

This is what it means to execute code synchronously. It allows you to specify things in a simple sequence with guaranteed order. This makes it easy to reason about, which is great, and is why we do things synchronously when we can. However, it falls apart with slow operations. If the program is to remain responsive, slow operations need to be executed asynchronously. Unfortunately, asynchronous code is much more difficult to reason about, and it requires a new set of abstractions.

There’s another problem with the above code: It doesn’t handle failure. Both of those calls make network requests, and network requests can fail. So we need to do something like this:

class MyAvatar: LocalAssetLoader {
    void Start() {
        UserModel user = Imvu.Login();
        if (user == null) {
            Debug.LogError("Failed to log in for some unknown reason.");
            return;
        }
        bool loaded = Load(user);
        if (!loaded) {
            Debug.LogError("Failed to load avatar for some unknown reason.");
            return;
        }
        // ... Do stuff to the loaded avatar
    }
}

Well, that just got a lot uglier. Every one of these steps needs an error check after it, and we have no mechanism for getting any information on what actual failure occurred, so they can’t output very useful error messages.

Callback Hell

The most obvious approach to deal with asynchronous programming is to use callbacks, also known as continuation passing style, or CPS. When you call one of these functions that will take a while to complete, you also pass it another function. That function will be called when it’s done, with the result as its argument. For instance:

class MyAvatar: LocalAssetLoader {
    void Start() {
        Imvu.Login(LoginCallback);
    }

    void LoginCallback(UserModel user) {
        Load(user, AvatarLoadedCallback);
    }

    void AvatarLoadedCallback() {
        // ... Do stuff with the loaded avatar
    }
}

Well, that’s a lot more verbose, but it’s still pretty comprehensible, right? Unfortunately, we have the same problem as before: no error handling. Thankfully, CPS provides a fairly straightforward mechanism for handling errors. Instead of passing in one callback, you pass in two: a success and a failure callback. This also allows us to pass back an error condition, for better error messages. For instance:

class MyAvatar: LocalAssetLoader {
    void Start() {
        Imvu.Login(LoginCallback, LoginErrorCallback);
    }

    void LoginCallback(UserModel user) {
        Load(user, AvatarLoadedCallback, AvatarLoadFailedCallback);
    }

    void LoginErrorCallback(Error error) {
        Debug.LogError("Failed to login: " + error);
    }

    void AvatarLoadedCallback() {
        // ... Do stuff with the loaded avatar
    }

    void AvatarLoadFailedCallback(Error error) {
        Debug.LogError("Failed to load avatar: " + error);
    }
}

It’s starting to get a bit complicated there, and this is one of the simplest possible examples. Now you’re getting a glimpse of what many programmers have come to refer to as “callback hell”. All of your code gets broken into tons of little functions, defined not by what they do but by where they fall in the execution flow. This makes it hard to trace the flow of execution. As you solve complicated, real problems, this quickly gets completely unmanageable. So let’s explore some ways to make this better.

Achieving Closure

Closures are a feature of C# (and many other languages) which tends to be under-utilized in Unity code. They’re a special case of lambdas, which are functions that are declared as an expression.

In code, lambda values don’t intrinsically have a name — which is why they are sometimes called “anonymous functions” — though they can be given one by assigning them to a variable. They’re usually small, one-line functions, though they can be just as elaborate as any other function. In C#, they look like this:

Func<int, int> myLambda = (int a) => { return a*2; };
Debug.Log(myLambda(5));

There are a lot of things to dissect here, so let’s start at the beginning. Func is the type used for storing functions, which includes lambdas or normal methods. The last type parameter to Func is the return type, and the others are the argument types. So Func<string, float, int> can store a function with the signature int function(string, float). Note that if you want a function which returns nothing, you use Action instead. For instance, Action<string, float> can store a function with the signature void function(string, float).

The lambda syntax itself starts with the argument list, in parentheses, followed by =>, followed by the body of the function. Note that I wrote this in the most verbose way, for clarity. When C# already knows the function signature, you can omit the types from the argument list, allowing you to write:

Func<int, int> myLambda = (a) => { return a*2; };

Also, when there is only one argument and its type can be inferred, you can leave off the parentheses:

Func<int, int> myLambda = a => { return a*2; };

And finally, when the body of the function is just a single return statement, you can omit the curly braces and the return, and just provide the expression to be returned:

Func<int, int> myLambda = a => a*2;

Most of the places where lambdas are used meet all of these constraints: a single argument, of a known type, which just returns a single expression. So lambdas tend to be very compact.

Note that when you want to make a lambda which takes no arguments, you use () for the argument list, like this:

Func<int> myLambda = () => 5;

Now it just so happens that C#’s lambdas are actually closures. A closure is a lambda that allows you to access any variables that were in scope when the lambda was declared. For instance:

int num = 5;
Func<int, int> myClosure = a => a*num;
Debug.Log(myClosure(2)); // logs "10"

Notice that the closure uses the variable num in its body. When you do this, it’s said that it “closes over” num.

If you want to read more about C#’s support for lambdas, check out this MSDN article on the subject: https://msdn.microsoft.com/en-us/library/bb397687.aspx

So, let’s try using closures to make the earlier example less verbose:

class MyAvatar: LocalAssetLoader {
    void Start() {
        Imvu.Login(
            user => Load(
                () => { /* ... Do stuff with the loaded avatar */ },
                error => Debug.LogError("Failed to load avatar: " + error)
            ),
            error => Debug.LogError("Failed to login: " + error)
        );
    }
}

This is a lot less verbose than the earlier example, and the flow is more ordered, but the nested closures are a little hard to read. Again, this is the simplest possible example. If we need to do anything much more complicated, this will still get unmanageable quickly, possibly even worse than the callback hell above.

We can do better, though closures will be very important to our solution.

Error Results

Let’s go back to our earlier examples, where we needed to do an explicit error check after each call, and examine a different problem that crops up there. The solution will introduce another class that’s broadly useful to us.

UserModel user = Imvu.Login();
if (user == null) {
    Debug.LogError("Failed to login for ... some unknown reason?");
    return;
}

The trouble is, those error checks are easy to skip. You may not even realize that a given call could fail. And every time you skip a necessary error check, you introduce a potential crash into your program.

But what if we had a type that could be a useful value OR an error, which makes it impossible to access the data without doing a proper error check? Sounds like a pipe dream, I know, but using closures we can simulate what’s known in computer science as a sum type with pattern matching, which gives us exactly what we’re looking for.

The IMVU Unity API provides this type, which we call Result. It’s generic over the type that it can store. So if you want to return a string or an error, you use a Result<string>. The key thing about Result is that you can’t just directly access the value it stores. You need to do so through a function you pass in, which will only run if the value is actually there to be accessed. This makes it impossible to access the value if it’s not present, which is what we mean by making it impossible to access the data without doing a proper error check. The most common way to do this is with the Match() function, which takes two functions as arguments. The first is given the value, if it exists. The second is given the error, if the value doesn’t exist. For example:

result.Match(
    value => Debug.Log("Success! We got: " + value),
    error => Debug.LogError("Error! " + error)
);

If we rebuilt our initial error handling example to work with Result, we’d get something like this:

class MyAvatar: LocalAssetLoader {
    void Start() {
        Imvu.Login().Match(
            user => {
                Load(user).Match(
                    _ => { /* ... Do stuff to the loaded avatar */ },
                    error => Debug.LogError("Failed to load avatar: " + error)
                );
            },
            error => Debug.LogError("Failed to login: " + error)
        );
    }
}

That looks a lot like our closure example in the previous section, and it’ll have the same problem. This is one of the simplest possible examples, and it’s already looking kind of complicated. But, there’s something more interesting we can do here to make this better.

Railway Oriented Programming

The functions passed in to Match() can have a return type. Since only one of the two functions will actually be run, whatever that function returns can be returned by Match(). Note that both of them must return the same type for this to work, because Match() itself can only have one return type.

Now, imagine that our functions return another Result. That means we could call Match() on the return value of the previous call to Match(), instead of having to nest them. Let’s see what that looks like:

class MyAvatar: LocalAssetLoader {
    void Start() {
        Imvu.Login().Match(
            user => Load(user),
            error => Result<Unit>.Err(error)
        ).Match(
            _ => { /* ... Do stuff to the loaded avatar */ },
            error => Debug.LogError(error)
        );
    }
}

Note that Unit is a placeholder type. It has no public constructor, and will always be null. It’s used here because Load() doesn’t have any useful data to return. The Result<Unit> in this case is just to represent the fact that Load() can fail, and capture the resulting error.

Anyway, that’s looking a bit cleaner. As we add complexity to this example, it will just add more calls to Match() to the sequence. It won’t increase the indentation, and won’t further separate the error handlers from the code that’s generating the errors. Let’s try something a bit more complicated, like loading the user’s first saved outfit:

class MyAvatar: LocalAssetLoader {
    void Start() {
        Imvu.Login().Match(
            user => user.GetOutfits(),
            error => Result<OutfitCollection>.Err(error)
        ).Match(
            outfits => outfits.info.items[0],
            error => Result<OutfitModel>.Err(error)
        ).Match(
            outfit => Load(outfit),
            error => Result<Unit>.Err(error)
        ).Match(
            _ => { /* ... Do stuff to the loaded avatar */ },
            error => Debug.LogError(error)
        );
    }
}

Note that the error function in the first three calls to Match() just creates a new Result with the same error. This causes Match() to return the error, passing the buck to the next error function. Thus any error, regardless of which step generates it, will be handled by the Debug.LogError() in the last step. Also note that the type of the Result we create must match the type returned by the success lambda.

What you’re seeing here is the beginning of what’s known as Railway Oriented Programming (also referred to as chaining). Imagine you have two parallel railway tracks, which represent the flow of execution. One of them is the success rail, and the other is the failure rail. At each step of execution, we’re either on the success rail or the failure rail, and we can either stay on the same rail or switch rails.

In the example above, the call to Imvu.Login() starts our railway. It can start us out on the success or failure rails. If it’s on the success rail, we call user.GetOutfits(), and what it returns determines whether we stay on the success rail, or move over to the failure rail. If Imvu.Login() put us on the failure rail, then we choose to stay on the failure rail by returning a new Result with the same error. This is the first step along our railway. The next call to Match() is the second step. The second and third steps are similarly structured. If we’re still on the success rail when we arrive at the fourth step, then we’ve succeeded, and it’s time to do whatever we want to do with the avatar. If we’re on the failure rail, then it’s time to log whatever error we hit along the way.

Staying on the success rail can be thought of as the normal flow of execution. Switching to the failure rail is like throwing an exception. Every error callback in the railway can be thought of as catching an exception. When it packs the error into a new Result and returns that, we can think of that as re-throwing the exception.

This is a pretty typical example of a railway, where all steps but the last one do something interesting on the success rail that generates a new Result, but on the failure rail they just pass along the error. So having to always manually pass through the error generates a lot of boilerplate. This is why Result has another set of functions designed specifically for railways like this.

The first of these is Then(), which has overloads that take one or two arguments. The single-argument version takes just a success callback, and passes through errors automatically. The two-argument version takes both success and failure callbacks. The second function is Catch(), which takes a single failure callback. It passes through successes the same way the single-argument Then() passes through errors. Both Then() and Catch() always return a Result, and all of the callbacks passed into them must return a Result, which determines which rail you end up on.

So you can clean up the above example a bit using these functions:

class MyAvatar: LocalAssetLoader {
    void Start() {
        Imvu.Login().Then(
            user => user.GetOutfits()
        ).Then(
            outfits => outfits.info.items[0]
        ).Then(
            outfit => Load(outfit)
        ).Match(
            _ => { /* ... Do stuff to the loaded avatar */ },
            error => Debug.LogError(error)
        );
    }
}

Well that’s now looking pretty nice. It’s very easy to follow the sequence of actions we take, and we don’t even have to think about error handling beyond having the single log statement at the bottom.

Note that the last step in the chain uses Match() instead of Then(). This is because Then() requires the lambdas to return a Result, so that Then() can itself return a Result. However, our lambdas in that last step don’t return anything. So we need to use Match().

Going Asynchronous

The problem is, Result is still synchronous. It’s given a value or an error in its constructor, and then when Match() or Then() are called, the appropriate callback is immediately called. There’s still no opportunity for slow network requests.

However, everything is being handled through callbacks, so it should be possible to build something similar to Result, but which allows for asynchronous behavior, right? Well, yes, in fact. That’s exactly what Promise is!

A Promise is like a Result, except the value it represents may not exist yet. It’s a promise to deliver that value. When it’s given a value, the promise is said to be accepted. If it fails, it’s said to be rejected. Either accepting or rejecting the promise is considered to be resolving it, and when it’s resolved the appropriate callbacks will be called. If the Promise is already resolved when the callbacks are registered, then they’re called immediately.

The API for using a Promise is nearly identical to Result. In fact, the actual API, which uses promises, looks like this:

class MyAvatar: LocalAssetLoader {
    void Start() {
        Imvu.Login().Then(
            user => user.GetOutfits()
        ).Then(
            outfits => outfits.info.items[0]
        ).Then(
            outfit => Load(outfit)
        ).Match(
            _ => { /* ... Do stuff to the loaded avatar */ },
            error => Debug.LogError(error)
        );
    }
}

That is character-for-character identical to our last example.

There are some differences between Promise and Result, though. The callbacks passed to Then() or Catch() on a Promise can return either another Promise, or a Result. If they return a Result, it will be converted into an equivalent Promise. Also, Match() on Promise can’t return anything, because the callbacks passed into it may not run immediately, so it doesn’t necessarily have a value to return. If you wish to return something from a Match() on a Promise, use Then() instead, which will allow you to return it in another Promise.

Also, and this is possibly the most important difference, you must account for the fact that the callbacks do not run synchronously. For instance, consider this broken code:

class MyAvatar: LocalAssetLoader {
    void Start() {
        UserModel userModel = null;
        Imvu.Login().MatchValue(
            user => userModel = user
        );
        string name = userModel.info.name;
    }
}

If the user isn’t already logged in, this would generate a null reference exception. This is because the lambda passed in to Imvu.Login().MatchValue() doesn’t run right away. We’re setting it up to run later, when the login completes. So when we try to access a field on userModel immediately after we set that up, it’s still null.

Conclusion

When designing this API, we considered many different options that we dismissed:

  • Continuation Passing Style – We dismissed this because it turned into callback hell.
  • Coroutines – We examined Unity’s coroutines, which are specifically designed to handle asynchronous cases cleanly. They’re quite nice. Unfortunately, they’re only usable within a MonoBehaviour, and we needed to be able to run asynchronous code from anywhere. We considered implementing our own version of coroutines using roughly the same syntax, but that was too large an undertaking.
  • async/await – C#’s async and await keywords bake an even better version of coroutines into the language. They look great, but it turns out Unity’s version of C# doesn’t support them yet.
  • Threads – This would allow you to write something like the synchronous example at the top of this article. However, it has some major problems. First, threading doesn’t translate to WebGL at all, so if we relied on threading it would fall apart on one of the major platforms we’re targeting. Second, developers using the API would be forced to write code that goes into threads, and threading is infamously error prone and difficult to get right. We’d rather not put that burden on our developers. Third, it would be difficult to load an avatar on a thread, because only the main thread can interact with the GPU. There are ways of working around this, but combined with the other problems it wouldn’t be worth it.

So given all of this, promises are the best way we could come up with for handling the heavily asynchronous code our API demands. We understand that they’re confusing at first, particularly to programmers unaccustomed to working with lambdas and asynchronous code. However, hopefully you now have a stronger understanding of how they work, and why they’re necessary.

Further Reading

  • Railway Oriented Programming – This is a slide deck that introduces the railway metaphor in a lot more detail, from the perspective of F#.
  • How do Promises Work? – This is a deep dive into the underlying mechanics of promises, as they’re implemented in JavaScript. Our promises are very similar in principle, but have a slightly different API.
  • You’re Missing the Point of Promises – A critique of some promise polyfills common in JavaScript, which goes into detail on some often-missed points about what makes promises really useful.
  • Promises are the Monad of Asynchronous Programming – This is a more advanced article discussing the relationship between promises and monads. This provides an alternate viewpoint on promises for those familiar with monads.

Testable IO in Haskell

By Andy Friesen

At IMVU, we write a lot of tests. Ideally, we write tests for every feature and bugfix we write. The problem we run into is one of scale: if each of IMVU’s tests were 99.9% reliable, 1 out of every 5 runs would result in an intermittent failure.

Tests erroneously fail for lots of reasons: the test could be running in the midst of the “extra” daylight-savings hour or a leap day (or a leap second!). The database could have been left corrupted by another test. CPU scheduling could prioritize one process over another. Maybe the random number generator just so happened to produce two zeroes in a row.

All of these things boil down to the same root cause: nondeterminism within the test.

We’ve done a lot of work at IMVU to isolate and control nondeterminism in our test frameworks. One of my favourite techniques is the way we make our Haskell tests provably perfectly deterministic.

Here’s how it works.

This post is Literate Haskell, which basically means you can point GHC at it directly and run it. You can download it here.

We’ll start with some boilerplate.

{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE NamedFieldPuns #-}

module Main where

import Control.Monad.State.Lazy as S

What we’re looking to achieve here is a syntax-lightweight way of writing side effectful logic in a way that permits easy unit testing.

In particular, a property we’d very much like to have is the ability to deny our actions access to IO when they are running in a unit test.

For this example, we’ll posit that the very important business action we wish to test is to prompt the user for their name, then say hello:

importantBusinessAction = do
    writeLine "Please enter your name: "
    name <- readLine
    if "" == name
      then do
        writeLine "I really really need a name!"
        importantBusinessAction
      else
        writeLine $ "Hello, " ++ name ++ "!"

We’ll achieve this by defining a class of monad in which testable side effects can occur. We’ll name this class World.

class Monad m => World m where
    writeLine :: String -> m ()
    readLine :: m String

We can now write the type of our importantBusinessAction:

importantBusinessAction :: World m => m ()

The name of this type can be read as “an action producing unit for some monad m in World.”

When our application is running in production, we don’t require anything except IO to run, so it’s perfectly sensible for IO to be a context in which World actions can be run. The Haskell Prelude already offers the exact functions we need, so this instance is completely trivial:

instance World IO where
    writeLine = putStrLn
    readLine = getLine

In unit tests, we specifically want to deny access to any kind of nondeterminism, so we’ll use the State monad. State provides the illusion of a mutable piece of data through a pure computation. We’ll pack the state of our application up in a record.

type FakeIO = S.State FakeState

(I’ll get to FakeState in a second)

Aside from reliability, this design has another very useful property: it is impossible for tests to interfere with one another, even if many tests share the same state. This means that “test fixtures” can be created trivially by simply running an action and using the resulting state in as many tests as desired.

The state record FakeState itself essentially captures the full state of the fake application at any one moment.

The writeLine implementation is very easy: We just need to accumulate a list of lines that were printed. We can carry that directly in our state record.

The readLine action is a bit more complicated. We’re going to write all kinds of tests for our application, and we really don’t want to burn any one particular behaviour into the framework. We want to parameterize this on a per-test basis.

We’ll solve this by embedding an action directly into our state record.

data FakeState = FS
    { fsWrittenLines :: [String]
    , fsReadLine     :: FakeIO String
    }

def :: FakeState
def = FS
    { fsWrittenLines = []
    , fsReadLine = return ""
    }

Now, given this record, we can declare that FakeIO is also a valid World Monad, and provide implementations for our platform when run under unit test.

instance World (S.State FakeState) where
    writeLine s = do
        st <- S.get
        let oldLines = fsWrittenLines st
        S.put st { fsWrittenLines = s:oldLines }

    readLine = do
        st <- S.get
        let readLineAction = fsReadLine st
        readLineAction

We also write a small helper function to make unit tests read a bit more naturally:

runFakeWorld :: b -> State b a -> (a, b)
runFakeWorld = flip S.runState

Now, let’s write our first unit test.

We wish to test that our application rejects the empty string as a name. When the user does this, we wish to verify that the customer sees an error message and is asked again for their name.

First, we’ll craft a readLine implementation that produces the empty string once, then the string “Joe.”

Making this function more natural without compromising extensibility is left as an exercise to the reader. 🙂

Note that by providing the type FakeIO String, we have effectively authored an action that can only be used in a unit test. The build will fail if production code tries to use this action.

main :: IO ()
main = do
    let readLine_for_test :: FakeIO String
        readLine_for_test = do
            S.modify $
                \s ->
                    s { fsReadLine = return "Joe" }
            return ""

Now that we have that, we can create a FakeState that represents the scenario we wish to test.

    let initState = def
            { fsReadLine = readLine_for_test }

And go!

    let ((), endState) =
            runFakeWorld
                initState importantBusinessAction

Note that runFakeWorld produces a pair of the result of the action and the final state. We can inspect this record freely:

    forM_ (reverse $ fsWrittenLines endState) $
        \line ->
            print line

That’s it!

In a real application, your FakeState analogue will be much more complex, potentially including things like a clock, a pseudo-random number generator, and state for a pure database of some sort. Some of these things are themselves complex to build out but, as long as those implementations are pure, everything snaps together neatly.

If complete isolation from IO is impractical, this technique could also be adjusted to run atop a StateT rather than pure State. This allows for imperfect side-effect isolation where necessary.

Happy testing!

Source Code.

Mutation Testing

By Llorenç Marti Garcia

We’ve been actively working for almost 3 years on IMVU’s new 3D engine called Northstar, which is written in C++ for performance and portability, and compiled to many platforms, including the web using Emscripten.

Because Test Driven Development (TDD) has demonstrated over the years that it lets us write more robust, more efficient, and higher-quality software, our 3D engine has a lot of unit tests, trying to cover as many features as possible across more than 100K lines of code.

Problem with Unit tests

With this scenario in mind, we recently noticed something interesting about TDD. What happens if, after hundreds of thousands of lines of code and thousands of unit tests, the rhythm of adding tests for new code decreases? The answer is simple: the quality of the code decays, the probability of future bugs increases, or, worse, refactoring some code becomes more challenging.

IMVU tried to solve this problem in the past with code coverage tools. These tools help, but they do not give the full picture in some scenarios, such as when a single test calls one function that executes big portions of the code. Here I want to explain how we attacked the problem from another point of view.

While looking at some source files of our 3D engine, our initial question for testing the quality of our tests was: If we remove this line from this source file, will the tests yell at us because it is missing?

We found this initial idea very interesting. First, we are not testing whether the tests cover the code, but whether any test exercises the feature related to the removed line of code. And second, the process can end up giving us a list of lines to look at, which is very handy for detecting untested features. We can find out more about mutation testing and how it helps us find these deficiencies here.

Overview of the solution

We use Git for source control. C++ is the main language of the project. We use Python for the framework that drives the code testing. Additionally, our commit history is kept as linear as possible inside our repository.

This is a quick overview of the solution we implemented:

1.- Git pull 3D engine code
2.- Checkout commit with hash X
3.- Git submodule update
4.- Get the diff text between commit X~1 and X
5.- Parse the diff and extract all the added lines in commit X
6.- For every line from step 5:
6.1.- Decide which kind of mutation we will attempt.

// for a normal line we select:  // —  (a commented line)

// for an if (…) line we select if (true) {  and if (false) {

// for a for (…;…;…) we select for (…; false; …) {

// for a for (… : …) (explained below)

7.- For every line and its substitution line/lines:

7.1.- Find the original line in the code

7.2.- Substitute the original line with the substitution line

7.3.- Build Northstar

7.3.1.- If the build fails, we return because this line was necessary.

7.4.- Run Northstar tests

7.4.1.- If at least one of the tests fails, we return because this line was necessary.

7.4.2.- If the tests hang, we treat it as if the tests failed. (see later)

7.5.- If both 7.3 and 7.4 finish OK, then the original line was not covered by any test

8.- Compute the percentage of lines not covered by any test out of all the lines tested.

Details of the solution and some tips

Many parts of the algorithm above can be tricky:

1.- Git diff (step 4)

What we want to know is the exact diff between commit X and the commit before it. With this in mind, it is important to have a commit history as linear as possible. Using git rebase often may do the job (we default git pull to use the -r option). If we don’t have a linear history, then the operation ‘git diff X~1 X’ will give us inaccurate code to test. This can happen when, for example, we are asking for a diff between a normal commit and a merge commit.

2.- Line substitution (step 6.1)

Here comes one of the trickiest parts of this project: trying to parse C++, understanding it, and replacing the code with a line specially crafted for the occasion.

The most common case is replacing a normal C++ line of code with a blank line (or a comment line ‘//—’ in case we are debugging). The problem with this approach is that it does not work everywhere. For example, what happens if the statement is split across multiple lines? In our solution, we use a combination of regex helpers and more complex C++ parsers.

With conditionals, what we want to do is force them to be active or inactive. This is why, when we find an IF statement, we replace it with two possible lines: ‘if (true) {‘ and ‘if (false) {‘

These two scenarios tell us whether there is any test exercising the opposite of the line we are mutating. For example, if we substitute ‘if (true) {‘ and, after compiling the project and running the tests, everything is reported as OK, then we can be sure that no test covers the feature represented in code by this IF statement.

In the same way, when we find a FOR statement, we replace it depending on the type:

A normal FOR statement will be replaced by:

     for(;false;) {

A C++11 FOR statement with the form for(<variable> : <container>) will be replaced by:

for(<variable> = *(<container>.begin()); false; ) {

These are some methods of replacing IF and FOR statements to force them to be enabled or disabled. Again, keep in mind that parsing C++ is very difficult, and this code can be improved in many different ways.
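
As a rough illustration of step 6.1, here is a Python sketch of how the substitutions above might be chosen. The regexes and the helper are simplified stand-ins; as noted, real C++ statements need real parsing.

import re

def candidate_mutations(line):
    """Return the substitution lines we would try for one source line."""
    stripped = line.strip()
    if stripped.startswith('if'):
        # Force the condition both ways to see whether any test notices.
        return ['if (true) {', 'if (false) {']
    if re.match(r'for\s*\(.*;.*;.*\)', stripped):
        # Classic FOR statement: make the condition false so the body never runs.
        return ['for (; false; ) {']
    ranged = re.match(r'for\s*\((.+):(.+)\)', stripped)
    if ranged:
        variable = ranged.group(1).strip()
        container = ranged.group(2).strip()
        # C++11 range-based FOR: bind the variable once, then never loop.
        return ['for (%s = *(%s.begin()); false; ) {' % (variable, container)]
    # Default: comment the whole line out.
    return ['// ---']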

A special point about FOR statements: there is the possibility that we will replace a line of code inside the body in a way that makes the loop run forever. This is why we have a timeout while running the tests, because they may hang forever. If this happens, we should treat the scenario as if the tests failed when run.
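
A minimal sketch of that timeout guard, assuming the tests run as an external binary (the binary name and timeout value are placeholders):

import subprocess

def run_tests_with_timeout(test_binary='./northstar_tests', timeout_seconds=600):
    """Run the test binary; treat a hang (timeout) the same as a failure."""
    try:
        completed = subprocess.run([test_binary], timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The mutation probably created an infinite loop: count it as failed.
        return False
    return completed.returncode == 0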

3.- Removing lines (inside step 6.1)

There are many lines we can skip testing; this is because we know that the code will fail to compile if we remove them. For example, rules like:

  • Line only contains ‘{‘ or ‘}’
  • Line starts with a common type: int, float, double, char, etc.
  • Line starts with ‘class’ and ends with ‘{‘
  • Line starts with #include

As we can see, we can improve the time spent testing a given commit by not testing those lines that would make the compilation fail, or whose value is not interesting to us, like ‘#include’ lines.
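
For illustration, filters like these can be expressed as a small list of patterns; this sketch is hypothetical and far from exhaustive:

import re

SKIP_PATTERNS = [
    r'^[{}]$',                        # line only contains '{' or '}'
    r'^(int|float|double|char)\b',    # line starts with a common type
    r'^class\b.*\{$',                 # class declaration line
    r'^#include\b',                   # preprocessor includes
]

def should_skip(line):
    """True if mutating this line would only break the build or teach us nothing."""
    stripped = line.strip()
    return any(re.match(pattern, stripped) for pattern in SKIP_PATTERNS)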

4.- Parallelization

Due to the nature of this kind of code testing, the process is highly parallelizable: for a given project and a given commit of that project, each of the N lines we want to test can be tested on a separate machine.
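
A toy sketch of this fan-out using local worker processes; in practice each job would go to a separate build machine, and line_is_covered is a placeholder for the whole mutate-build-test cycle:

from multiprocessing import Pool

def line_is_covered(line_number):
    # Placeholder for: mutate the line, build, run the tests,
    # and report whether any test failed (True means covered).
    return False

def test_commit(line_numbers, workers=8):
    # Each candidate line is independent, so the jobs parallelize trivially.
    with Pool(workers) as pool:
        return pool.map(line_is_covered, line_numbers)

if __name__ == '__main__':
    print(test_commit(range(1, 101)))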

Examples

1.- Simple function

Fig 1: Function onDetach examined

BlogFig1

If we look at Fig 1, we can see some interesting output:

Line 107: If we mutate this line to a comment, no test fails.
Line 109: Untested IF statement.
Line 115: Untested FOR statement.
Line 117: Untested IF statement.
Line 118: Line mutated to a comment; no test fails.
Line 121: Line mutated to a comment; no test fails.
Line 124: Line mutated to a comment; no test fails.

The main idea behind this output about the onDetach() function is that we are lacking tests for its behaviour. We are not testing the reset of the mesh component. Nor are we testing what happens if bonesAsGameObjects is empty. And we are not testing that, when calling onDetach(), we reassign the GameObject children.

In the images below, we can see how the algorithm changes the original code (Fig 2) with some mutations (Fig 3 and 4).

Fig 2: Original file.

BlogFig2

In Fig 3 we can see how line 107 mutated to a commented line. After this change, the source code is compiled without any errors and all the tests finish without any problem. Because of this, line 107 is detected as a line that needs to be tested, and is marked on Fig 1.

Fig 3: Line 107 mutated to a comment.

BlogFig3

meshComponent_ is a weak pointer to a mesh component, and when we detach the skeleton component, we want to reset this weak pointer.

If we don’t reset it, we will end up with a skeleton component, already detached from a Game Object, that still has access to a mesh component. The point of this line is to prevent detached components from having access to their previous neighbour components. If we don’t test that line and someone removes it, we can get weird scenarios where detached components still affect current components.

In Fig 4 we can see how line 109 mutated to an if (false) {. After this change, the source code compiles without any errors and all the tests finish without any problem. The mutation of line 109 shows that there is no test exercising the case where bonesAsGameObjects_ is empty and we remove the first child. Because of this, it is marked in red on Fig 1.

Fig 4: Line 109 mutated to if (false) {

BlogFig4

2.- False Negatives

While testing with this tool, we can run into false negative scenarios. Here is an example:

Fig 5: False negative

BlogFig5

In Fig 5, we can see a DomainBase object that appears to be completely tested. The problem here is that we don’t have any tests for that class. What happens is that, because DomainBase is used by other classes in the project, if we mutate any line, other tests will fail because of the mutation.

This does not mean that DomainBase is robust; it only means that the code using DomainBase does not currently exercise any of its weaknesses.

3.- False positive

Fig 6: False positive

BlogFig6

Normally we find false positives in lines of code that try to optimize memory use for some algorithm. A good example is reserve() on a std::vector. These lines can be removed, and the build and testing processes will not be affected. This is because reserve() makes memory access more performant but never changes the behavior of the algorithm. To prevent these kinds of false positives, we can add filters to avoid any mutation of these lines.

Conclusion

The most important idea of this project is the ability to spot areas in the source code that need testing. This project has already found many bugs in our source code, including dead code and code that was forgotten and not compiled.

The way we found out that some code was not compiled at all was that our tool marked the constructor declaration line of one class as discardable. Here we can see the affected class:

Fig 7: Not compiled code.

BlogFig7

If we think about it, how can this class compile if we mutate line 11 into an empty line? After some investigation, we discovered that this class had been removed from the build process script.

We also noticed that in large code bases, some files are tested not by their own tests but by the code that exercises them. This means that the code is safely tested only while it remains in the actual project and undergoes no major changes. A good way to solve this problem is to not compile the whole project when testing a line, but instead to compile only the tests associated with the file we are modifying.

Something that still requires work is the speed of getting the test results. Currently it takes minutes to get feedback about a given commit, and this disrupts the development iteration cycle. This can be fixed by, for example, parallelizing the process with more build machines.

We find mutation testing to be highly valuable, and I hope this idea can help other engineers deliver solid, well-tested code.

The case of the “Page can’t be displayed” intermittent selenium test

By Eric Hohenstein

IMVU relies heavily on unit testing with dependency injection for most of the testing of its website code written in PHP. Unit testing is great because it can be crafted in such a way that it will (hopefully) only test the things you’re interested in testing to verify the behavior of a particular component. This ideally makes unit tests both fast and reliable.

However, it’s possible to write unit tests that all pass, but that fail to catch an incompatibility between components that causes your application to break horribly. Because of this, IMVU uses a testing tool called selenium for end-to-end acceptance testing. Selenium works by loading a web page or pages in an actual browser window; it performs a set of configurable actions on that web page and verifies that the page behaves as expected. For example, if you have a web page with a button with the text “Hide” next to a widget, and you expect that when the button is clicked the widget is hidden, you can write a selenium test that will load the page in question, click the button, and confirm that the widget is hidden.

For at least the past several months, IMVU’s selenium test infrastructure has had a problem that causes otherwise reliable selenium tests to fail intermittently when run in Internet Explorer, showing an error page with the text “This page can’t be displayed”. This has been a constant source of frustration and a drag on engineering efficiency. A great deal of time has been spent by engineers on multiple teams investigating the cause. Up until now, the cause has remained a mystery.

Ericblog1

DNS failure

At first, the problem seemed to be related to a DNS lookup failure. The reason for this is that inspecting the frame where the web page was supposed to have loaded reveals that the URL of the error page is res://ieframe.dll/dnserrordiagoff.htm*. The path of that URL (dnserrordiagoff.htm) suggests a DNS lookup failure. Some experimentation shows that DNS errors are, indeed, displayed in this way in Internet Explorer. However, a lot of other network errors will cause the same error page to be displayed, so this was not a valid conclusion, and this information did little to uncover the source of the problem.

In order to eliminate our corporate DNS service as a possible source of the problem, we tried hard-coding the address of the remote web server host in the hosts file of the Windows host running the selenium test. We initially thought that this had resolved the problem, but we soon found that the problem continued.

* It turns out that the “diagoff” part of the path of the error page disables the presentation of a “Fix connection problems” button on the error page. Internet Explorer will show the “diagoff” version of the error page inside an iframe, and the alternate dnserror.htm version with the button when shown in the top-most frame.

SSL failure

It had been observed that this problem occurred only when the selenium test was trying to load a web page using the HTTPS scheme. This was a useful observation, but it didn’t give us much insight into what the problem was early on.

Network reliability

We tried capturing a network trace of the problem occurring several times. One of these times, the capture showed a “Bad record MAC” SSL fatal alert message during an SSL handshake. This made us wonder if the problem was being caused by an unreliable network between the Windows host where the selenium test was running and the host where the HTTPS page was being served from. Checksums are supposed to catch packet transmission errors, but if the errors are happening regularly enough, some garbled packets might slip through with matching checksums. We inspected our network traces for signs of packets with bad checksums but didn’t find any other evidence of that. We inspected several network traces and found that while they all seemed to fail during the SSL handshake, only one failed with a “Bad record MAC” SSL fatal alert message. We decided that this was likely a red herring.

Error duplication

In order to narrow down the source of the problem, we wrote a python script to try to duplicate it. The python script would, 4 times per second, do a DNS lookup of the domain name that the HTTPS URL was using when the problem occurred, and then make an HTTP request followed by an HTTPS request to that domain. The script would report when any of these operations took more than a small threshold to complete. This script was run in parallel with the selenium test in the hope that it would reveal some kind of temporary network lag between the selenium test host and the web server host. The idea was that if this test showed a problem that coincided with the selenium failure, it would indicate that the problem was between the Windows TCP/IP stack and the remote web server. If it didn’t, then the problem had to be at a higher level. The results of this experiment showed no failure in the script when the selenium failure occurred, so the problem had to be in Internet Explorer or in something it relies upon above the low-level Windows networking layer.
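
A probe along these lines can be sketched in a few lines of Python; the host name and the reporting threshold are placeholders, not the values we actually used:

import socket
import time
import urllib.request

HOST = "test-server.example.com"  # placeholder for the real web server host
THRESHOLD = 0.25                  # seconds; report anything slower than this

def timed(label, operation):
    # Run one probe operation and report it if it was suspiciously slow.
    start = time.time()
    operation()
    elapsed = time.time() - start
    if elapsed > THRESHOLD:
        print("%s took %.3fs" % (label, elapsed))

while True:
    timed("DNS lookup", lambda: socket.gethostbyname(HOST))
    timed("HTTP GET", lambda: urllib.request.urlopen("http://%s/" % HOST).read())
    timed("HTTPS GET", lambda: urllib.request.urlopen("https://%s/" % HOST).read())
    time.sleep(0.25)  # roughly 4 probes per second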


Lots of moving parts

When we first noticed the problem, we had recently upgraded to a newer version of selenium so that we could run our selenium tests on a newer version of Internet Explorer, which in turn let us run them on Windows 7 rather than Windows XP. All of these things changed at approximately the same time, and we knew that the problem could potentially be related to any of them.

Windows Defender

In Windows 7, a feature called “Windows Defender” will periodically scan the system for potential problems. We noticed that there was an alert showing on every one of our Windows 7 selenium test hosts, warning about changes we had made to the hosts file to facilitate our testing. We thought that Windows Defender might be interfering with Internet Explorer in some way that could be causing the problem. Disabling Windows Defender got rid of the alert, but the problem continued.

Background tasks

The intermittent nature of the problem seemed to suggest that it might be happening when the test coincided with a background task in Windows 7 that wasn’t happening in Windows XP, or that had at least not been causing the same type of problem. One type of background task that we thought might be related was the periodic check for certificate revocation, since we had identified that the problem only affected HTTPS pages. We tried disabling all Windows scheduled tasks, but the problem continued.

Internet Explorer settings

We wondered if some setting in the new version of Internet Explorer we were using was causing the problem. We tried setting all Internet Explorer settings to the lowest security and lowest privacy levels, disabling extensions, disabling HTTP 1.1, disabling checking for SSL certificate revocation, and disabling “Do Not Track,” but the problem continued.

Internet Explorer cache/cookies

We had noticed in the past that leftover state in Internet Explorer from previous tests could sometimes cause intermittent selenium test failures. We had some infrastructure in place that would attempt to delete cookies and cache between selenium tests to avoid this problem. We noticed that the file system and registry paths used to find and eliminate leftover Internet Explorer state had changed since we upgraded Internet Explorer and Windows versions, but our infrastructure hadn’t been updated. We fixed this, but the problem continued.

Running out of ideas

At this point, it seemed that IMVU might have to either abandon selenium testing or abandon using Internet Explorer for selenium testing. Both solutions would be very expensive and would also be a very frustrating failure. That’s when an AHA moment happened. We captured some more network traces when the problem occurred and noticed that all of them (by luck, as it turns out) contained “Bad record MAC” SSL fatal alerts. This information, combined with the fact that the problem had to be above the low-level Windows networking layer, really narrowed the problem down quite a bit. Furthermore, all the “Bad record MAC” alerts were being sent by the server, which indicated that the server was rejecting the MAC of the SSL handshake calculated and sent by Internet Explorer. The conversations all looked like this in Wireshark (in this trace, the address of the server is 172.16.100.16):

Ericblog2

Note that there are warning alerts issued by the server about “Unrecognized Name”. This is because the client is sending the SNI TLS extension record and the server is mis-configured in our testing environment. We tried configuring it correctly to eliminate this warning, but the problem continued.

In order to test the hypothesis that a bug in the SSL implementation used by Internet Explorer might be miscalculating the handshake MAC, we tried disabling various SSL/TLS versions. Finally we found a solution! It turned out that disabling TLS 1.2 in Internet Explorer settings resolved the problem. It seemed like we’d narrowed down the problem all the way to the root cause.

Internet Explorer or openssl
But was Internet Explorer really to blame? What we knew from the network trace was that the server’s implementation of TLS 1.2 was rejecting the handshake MAC that it was being given by Internet Explorer. However, this isn’t enough to conclude that the MAC calculated by Internet Explorer was incorrect. It was equally possible that Internet Explorer was calculating the MAC correctly but that the server’s implementation of TLS 1.2 was incorrectly rejecting it. Our server is Apache 2.2.25 with mod_ssl. The mod_ssl Apache module is implemented in terms of openssl, for which we use version 1.0.1. It was possible that what we were witnessing was a bug in openssl. If so, this problem might have affected clients other than just Internet Explorer. If so, rather than just ignoring the problem and disabling TLS 1.2 in Internet Explorer settings on our selenium test hosts, it would be better to fix or reconfigure openssl in our servers. But with only one client (Internet Explorer) and one server (Apache/openssl) demonstrating the problem, we couldn’t know which one had the bug. Since openssl is open source, we could theoretically find the problem if that was where the problem was, given enough time, but it would take a very long time to prove that the problem wasn’t in openssl just by code inspection. Not having Internet Explorer source, we couldn’t do the same there. If we could duplicate the problem between Internet Explorer and a separate SSL/TLS server implementation then that would point very strongly to the problem being in Internet Explorer and not in openssl.

Analysis using Wireshark
Hoping to get a tool like Wireshark to validate the MAC sent by Internet Explorer so that we could narrow the problem to either Internet Explorer or openssl, we configured Wireshark with the private key used by our server. However, this didn’t help because it turned out that the key exchange algorithm negotiated by Internet Explorer and Apache was using Diffie-Hellman. Here’s the relevant portion of the Wireshark SSL debug output file for the “Client Key Exchange” message when the error occurred:
ssl_generate_pre_master_secret: found SSL_HND_CLIENT_KEY_EXCHG, state 17
ssl_decrypt_pre_master_secret session uses DH (17) key exchange, which is impossible to decrypt
ssl_generate_pre_master_secret: can’t decrypt pre master secret
The key material used by the client and server to exchange secrets can’t be obtained using the private key of the server because the exchange uses an ephemeral key pair generated by the server on-the-fly and signed using its private key. This strategy provides a very powerful security feature known as “Perfect Forward Secrecy”. It also means that even if Wireshark has the private key of the server, it still can’t decode the SSL conversation, including validating the handshake MAC.

Ericblog3

We then configured the server such that it and Internet Explorer would negotiate the exact same cipher suite but using just RSA rather than DHE-RSA for key exchange. We were successful in doing this, but we weren’t able to isolate the problem to Internet Explorer or openssl using this approach because it eliminated the problem. If the problem was in IE, it seemed that it had to be related to its use of Diffie-Hellman key exchange.

New error duplication
In order to more easily and quickly duplicate the problem, we wrote a test program that uses the same Windows networking API that Internet Explorer uses. The library that provides this API on Windows is called WinInet. Our program did the same thing our Python script did; however, it failed to duplicate the problem. A network trace while the program was running showed that it only ever opened one SSL connection to the server and made all requests over that single connection. We worked around this by making the test program issue just one HTTPS request and calling it in a loop from the Windows command line, which duplicated the problem. Here’s the program:
#include <windows.h>
#include <wininet.h>
#include <stdio.h>
#include <tchar.h>
 
void ReadUrl(LPCTSTR url) {
    HINTERNET hInternet = NULL;
    HINTERNET hRequest = NULL;
    char buffer[1024];
    DWORD bytesRead = 0L;
    BOOL result = FALSE;
 
    hInternet = InternetOpen(_T("Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"), INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
    if (NULL == hInternet) {
        _tprintf(_T("InternetOpen error: %08X\n"), GetLastError());
        return;
    }
    
    hRequest = InternetOpenUrl(hInternet, url, NULL, 0, INTERNET_FLAG_RELOAD | INTERNET_FLAG_SECURE, (DWORD_PTR)NULL);
    if (NULL == hRequest) {
        _tprintf(_T("InternetOpenUrl error: %08X\n"), GetLastError());
        InternetCloseHandle(hInternet);
        return;
    }
 
    do {
        bytesRead = 0L;
        result = InternetReadFile(hRequest, buffer, (DWORD)sizeof(buffer), &bytesRead);
        if (result && (bytesRead == 0L)) {
            break;
        }
    } while (result);
 
    if (!result) {
        DWORD error = GetLastError();
        TCHAR errorText[1024];
        DWORD length = (DWORD)(sizeof(errorText) / sizeof(TCHAR));
        result = InternetGetLastResponseInfo(&error, errorText, &length);
        if (!result) {
            _tcscpy(errorText, _T("unknown"));
            length = 7L;
        }
        _tprintf(_T("error reading request: (%08X) %.*s\n"), error, length, errorText);
    }
 
    InternetCloseHandle(hRequest);
    InternetCloseHandle(hInternet);
}
 
int _tmain(int argc, LPCTSTR * argv) {
    if (argc != 2) {
        _tprintf(_T("%s usage: %s <URL>\r\n"), argv[0], argv[0]);
        return 1;
    }
 
    ReadUrl(argv[1]);
 
    return 0;
}
When run, in just a few seconds, it produces output similar to the following:
C:\imvu\temp>for /l %i in (1, 1, 1000) do @wininet_test.exe https://localhost.imvu.com/dummy.txt
InternetOpenUrl error: 00000008
InternetOpenUrl error: 00002F7D
InternetOpenUrl error: 00002F7D
^C
Out of 1000 requests, roughly 10-12 will fail. About half of them will fail with error code 8 and the other half will fail with error code 0x2F7D. Error code 8 corresponds to ERROR_NOT_ENOUGH_MEMORY. Error code 0x2F7D corresponds to ERROR_INTERNET_SECURITY_CHANNEL_ERROR. The description of this error is “The application experienced an internal error loading the SSL libraries.” I suspect this description is misleading and is actually a generic error used for SSL handshake errors and possibly other kinds of failures as well. It seems to be associated with the schannel.dll Windows system library.
A network trace capturing the SSL conversations when these errors occur shows that the failures with error code 8 manifest as the client closing the connection with a [FIN, ACK] packet after receiving the “Server Key Exchange” message without ever sending a “Client Key Exchange” message. The server responds by sending an “Unexpected Message” fatal alert message. The failures with error code 0x2F7D manifest as the server sending a “Bad record MAC” fatal alert after the client sends the final SSL handshake message containing a “Change Cipher Spec” message and terminated with the handshake MAC.

Duplication with gnutls-serv
In order to finally isolate the problem to either Internet Explorer or openssl, we used a tool called gnutls-serv. Gnutls is an SSL/TLS library from GNU. It comes with a tool called gnutls-serv that will act as a simple HTTPS server. After downloading and compiling the latest version (gnutls-3.3.9), we started it with this command line (after killing Apache) on our web server host:
# LD_LIBRARY_PATH=/usr/local/lib/ src/gnutls-serv -p 443 -g --http -a --x509certfile /etc/apache2/certs/ssl.crt/sandbox.crt --x509keyfile /etc/apache2/certs/ssl.key/sandbox.key --priority "SECURE:-ECDHE-RSA" -d 9999
The --priority "SECURE:-ECDHE-RSA" option is given in order to force the cipher suite negotiated between WinInet and gnutls-serv to be exactly the one negotiated between Internet Explorer and Apache. gnutls-serv was initially choosing the Elliptic Curve Diffie-Hellman key exchange algorithm, and it turns out that the problem is not reproducible when using Elliptic Curve Diffie-Hellman.
Running the WinInet test program against this server, the problem continued to be observed with the same frequency as when using Apache/openssl. The network traces of these failures look more-or-less identical to the ones generated with Apache/openssl:
Ericblog4
The only significant difference is the missing server name warning. gnutls-serv apparently ignores the SNI TLS extension.
Inspecting the relevant debugging output of gnutls-serv corresponding to the failure shows this:
|<3>| ASSERT: gnutls_cipher.c:728
|<3>| ASSERT: gnutls_cipher.c:167
|<3>| ASSERT: gnutls_record.c:1239
|<0x22ae770>| Discarded message[0] due to invalid decryption
|<3>| ASSERT: gnutls_buffers.c:1358
|<3>| ASSERT: gnutls_handshake.c:1428
|<3>| ASSERT: gnutls_handshake.c:785
|<3>| ASSERT: gnutls_handshake.c:3054
|<3>| ASSERT: gnutls_handshake.c:3200
Error in handshake
|<5>| REC: Sending Alert[2|20] - Bad record MAC
The original failure is reported on line 728 of gnutls_cipher.c which is this bit of code:
    if (unlikely
        (memcmp(tag, tag_ptr, tag_size) != 0 || pad_failed != 0)) {
        /* HMAC was not the same. */
        dummy_wait(params, compressed, pad_failed, pad,
               length + preamble_size);
 
        return gnutls_assert_val(GNUTLS_E_DECRYPTION_FAILED);
    }
In this code, unlikely() is just a macro providing a branch-prediction hint to the compiler, and can be read as the condition passed to it.
This shows that the problem could be either the MAC is bad or the padding calculation failed. To narrow down the problem even further we modified the code to be this:
    if (unlikely
        (memcmp(tag, tag_ptr, tag_size) != 0 || pad_failed != 0)) {
        if (memcmp(tag, tag_ptr, tag_size) != 0) {
            char buf[1024];
            _gnutls_assert_log("ASSERT: %s:%d - mismatching MAC\n", __FILE__, __LINE__);
            _gnutls_assert_log("ASSERT: expected MAC: %s\n",
                _gnutls_bin2hex(tag, tag_size, buf, sizeof(buf), NULL));
            _gnutls_assert_log("ASSERT: MAC received: %s\n",
                _gnutls_bin2hex(tag_ptr, tag_size, buf, sizeof(buf), NULL));
        }
        if (pad_failed != 0) {
            gnutls_assert();
        }
        /* HMAC was not the same. */
        dummy_wait(params, compressed, pad_failed, pad,
               length + preamble_size);
 
        return gnutls_assert_val(GNUTLS_E_DECRYPTION_FAILED);
    }
With this change, we get this debug output:
|<3>| ASSERT: gnutls_cipher.c:726 - mismatching MAC
|<3>| ASSERT: expected MAC: ab628ad135b519d2f71686e78d4d84fc
|<3>| ASSERT: MAC received: b053a9400b47c8ed3796ddb317b6a957
|<3>| ASSERT: gnutls_cipher.c:739
|<3>| ASSERT: gnutls_cipher.c:167
|<3>| ASSERT: gnutls_record.c:1239
|<0x156c770>| Discarded message[0] due to invalid decryption
|<3>| ASSERT: gnutls_buffers.c:1358
|<3>| ASSERT: gnutls_handshake.c:1428
|<3>| ASSERT: gnutls_handshake.c:785
|<3>| ASSERT: gnutls_handshake.c:3054
|<3>| ASSERT: gnutls_handshake.c:3200
Error in handshake
|<5>| REC: Sending Alert[2|20] - Bad record MAC
This shows that the actual failure is a mismatch between the MAC calculated over the handshake by the client and server and that there was not a problem detected with the padding.

Conclusion
The fact that the problem occurs with both the openssl and gnutls server implementations indicates that this is almost certainly a bug in the WinInet SSL handshake MAC calculation. The fact that the problem only occurs when using DHE-RSA key exchange algorithms indicates that the failure is caused by a bug in the WinInet Diffie-Hellman MAC calculation or else the derivation of the client MAC key when using Diffie-Hellman key exchange.
This analysis has focused only on the “Bad record MAC” failures. There was also another set of failures, with roughly the same frequency, that didn’t involve “Bad record MAC” fatal alert messages. The fact that these failures occurred after the server had sent its final handshake message suggests that they could actually be another symptom of the same problem. The same or an equivalent bug could cause the client to miscalculate the expected server handshake MAC. If the client is rejecting the server’s handshake MAC but not sending “Bad record MAC” alerts when this happens, then the observed behavior is exactly what we’d expect to see. I have no explanation for why error code 8 (ERROR_NOT_ENOUGH_MEMORY) would be emitted in this condition.
The bottom line is that the implementation of Diffie-Hellman key exchange used by Internet Explorer 11 on Windows 7 is broken. We could disable Diffie-Hellman key exchange in our Apache server configuration, but the low frequency of this problem and the market share of Internet Explorer don’t warrant giving up the considerable value of “Perfect Forward Secrecy”. Instead, IMVU will be disabling TLS 1.2 in the Internet Explorer configuration on our selenium test hosts.

Code Reviews: Follow the Data

After years of reviewing other people’s code, I’d like to share a couple of practices that improve the effectiveness of code reviews.

Why Review Code?

First, why review code at all? There are a few reasons:

  • Catching bugs
  • Enforcing stylistic consistency with the rest of the code
  • Finding opportunities to share code across systems
  • Helping new engineers spin up with the team and project
  • Helping API authors see actual problems that API consumers face
  • Maintaining the health of the system overall

It seems most people reach for one of two standard techniques when reviewing code; they either review diffs as they arrive or they review every change in one large chunk. We’ve used both techniques, but I find there’s a more effective way to spend code review time.

Reviewing Diffs

It seems the default code review technique for most people is to sign up for commit emails or proposed patches and read every diff as it goes by. This has some benefits – it’s nice to see when someone is working in an area you care about, or when a library has a deficiency that needs correcting. However, on a large application, diffs become blinding. When all you see is streams of diffs, you can lose sight of how the application’s architecture is evolving.

I’ve ended up in situations where every diff looked perfectly reasonable but, when examining the application at a higher level, key system invariants had been broken.

In addition, diffs tend to emphasize small, less important details over more important integration and design risks. I’ve noticed that, when people only review diffs, they tend to point out things like whitespace style, how iteration over arrays is expressed, and the names of local variables. In some codebases these are important, but higher-level structure is often much more important over the life of the code.

Reviewing diffs can also result in wasted work. Perhaps someone is iterating towards a solution. The code reviewer may waste time reviewing code that its author is intending to rework anyway.

Reviewing Everything

Less often, I’ve seen another code review approach similar to reviewing diffs, but on entire bodies of work at a time. This approach can work, but it’s often mind-numbing. See, there are two types of code: core, foundational code and leaf code. Foundational code sits beneath and between everything else in the application. It’s important that it be correct, extensible, and maintainable. Leaf code is the specific functionality a feature needs. It is likely to be used in a single place and may never be touched again. Leaf code is not that interesting, so most of your code review energy should go towards the core pieces. Reviewing all the code in a project or user story mixes the leaf code in with the foundational code and makes it harder to see exactly what’s going on.

I think there’s a better way to run code reviews. It’s not as boring, tends to catch important changes to core systems, and is fairly efficient in terms of time spent.

Follow the Data

My preferred technique for reviewing code is to trace data as it flows through the system. This should be done after a meaningful, but not TOO large, body of work. You want about as much code as you can review in an hour: perhaps more than a user story, but less than an entire feature. Start with a single piece of data, perhaps some text entered on a website form. Then, trace that data all the way through the system to the output. This includes any network protocols, transformation functions, text encoding, decoding, storage in databases, caching, and escaping.

Following data through the code makes its high-level structure apparent. After all, code only exists to transform data. You may even notice scenarios where two transformations can be folded into one. Or perhaps eliminated entirely — sometimes abstraction adds no value at all.

This style of code review frequently covers code that wasn’t written by the person or team that initiated the code review. But that’s okay! It helps people get a bigger picture, and if the goal is to maintain overall system health, new code and existing code shouldn’t be treated differently.

It’s also perfectly fine for the code review not to cover every new function. You’ll likely hit most of them while tracing the data (otherwise the functions would be dead code) but it’s better to emphasize the main flows. Once the code’s high-level structure is apparent, it’s usually clear which functions are more important than others.

After experimenting with various code review techniques, I’ve found this approach to be the most effective and reliable over time. Make sure code reviews are somewhat frequent, however. After completion of every “project” or “story” or “module” or whatever, sit down for an hour with the code’s authors and appropriate tech leads and review the code. If the code review takes longer than an hour, people become too fatigued to add value.

Handling Code Review Follow-Up Tasks

At IMVU in particular, as we’re reviewing the code, someone will rapidly take notes into a shared document. The purpose of the review meeting is to review the code, not discuss the appropriate follow-up actions. It’s important not to interrupt the flow of the review meeting with a drawn-out discussion about what to do about one particular issue.

After the meeting, the team leads should categorize follow-up tasks into one of three categories:

  1. Do it right now. Usually tiny tweaks, for example: rename a function, call a different API, delete some commented-out code, move a function to a different file.
  2. Do it at the top of the next iteration. This is for small or medium-sized tasks that are worth doing. Examples: fix a bug, rework an API a bit, change an important but not-yet-ubiquitous file format.
  3. Would be nice someday. Delete these tasks. Don’t put them in a backlog. Maybe mention them to a research or infrastructure team. Example: “It would be great if our job scheduling system could specify dependencies declaratively.”

Nothing should float around on an amorphous backlog. If a task is important, it will come up again. Plus, it’s very tempting to say “we’ll get to it,” but you never will, and even if you find the time later, nobody will have the context anymore. So either get it done right away or be honest with yourself and consciously drop it.

Now go and review some code! 🙂

The Real-time Web in REST Services at IMVU

By Jon Watte, VP Technology @ IMVU

IMVU has built a rich, graph-shaped REST (REpresentational State Transfer) API (Application Programming Interface) to our data. This data includes a full social network, as well as e-commerce, virtual currencies, and the biggest 3D user generated content catalog in the world. This post discusses how IMVU addresses two of the bigger drawbacks of REST-based service architectures for real-time interactive content: Cache Invalidation (where users want to know about new data as soon as it becomes available) and Request Chattiness (where request latency kills your performance).

Cache Invalidation

REST principles like cacheability and hypertext-based documents work great for exposing data to a variety of clients (desktop, web, and mobile), but run into trouble when they meet the expectation of real-time interaction. For example, when a user changes their Motto in their profile, they would like for the world to see the new Motto right away — yet much of the scalability win of REST principles relies on caching, and the web does not have a good invalidation model. Here is an illustration of the problem:

At 10:03 am, Bob logs in and the client application fetches the profile information about his friend Alice. This data potentially gets cached at several different layers:

  • in application caches at the server end, such as Varnish or Squid
  • in content delivery network caches at the network edge, such as Akamai or Cloudfront
  • in the user’s own browser (every web browser has a local cache)

Let’s say that our service marks the data as cacheable for one hour.

At 10:04 am, Alice updates her Motto to say “there’s no business like show business.”

At 10:05 am, Alice sends a message to Bob asking “how do you like my motto?”

At 10:06 am, Bob looks at Alice’s profile, but because he has a stale version in cache, he sees the old version. Confusion (and, if this were a TV show, hilarity) ensues.

HTTP provides two options to solve this problem. One is to not do caching at all, which certainly “solves” the problem, but then also removes all benefits of a caching architecture. The other is to add a revalidation requirement (such as the “no-cache” directive) to the Cache-Control headers of the delivered data. This tells any client that, while it might want to store the data locally, it has to check back with the server to see whether the data has changed before re-using it. When the data has not changed, this saves on bytes transferred — the data doesn’t need to be sent twice — but it still requires the client to make a request to the server before presenting the data.

In modern web architectures, while saving throughput is nice, the real application performance killer is latency, which means that even the “zero bytes” response of checking in with the server introduces an unacceptable cost in end-user responsiveness. Cache validation and/or ETags might sound like a big win, but for a piece of data like a “motto,” the overhead in HTTP headers (potentially several kilobytes) dwarfs the savings of 30 bytes of payload — the client might as well just re-GET the resource for approximately the same cost.
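To make that round-trip cost concrete, here is a minimal sketch of revalidation using Python’s requests library; the profile URL and the presence of an ETag are assumptions for illustration:

import requests  # assumes the requests library is installed

# First fetch: the server includes a validator (an ETag) with the response.
response = requests.get("https://api.imvu.com/user/alice/profile")  # hypothetical URL
etag = response.headers.get("ETag")

# Revalidation: a 304 Not Modified response saves bytes, but the client
# still pays a full network round trip before it can present the data.
revalidated = requests.get(
    "https://api.imvu.com/user/alice/profile",
    headers={"If-None-Match": etag},
)
print(revalidated.status_code)  # 200 if changed, 304 if the cached copy is still good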

Another option that’s used in some public APIs is to version all the data, and when data is updated, update the version, which means that the data now has a new URL. A client asking for the latest version of the data would then not get a cached version. Because of HATEOAS (Hypertext As The Engine Of Application State) we would be able to discover the new URL for “Alice’s Profile Information,” and thus read the updated data. Unfortunately, there is no good way to discover that the new version is there — the client running on Bob’s machine would have to walk the tree of data from the start to get back to Alice’s new profile link, which is even more round-trip requests and makes the latency even worse.

A third option is to use REST transfer for the bulk data, but use some other, out-of-band (from the point of view of the HTTP protocol) mechanism to send changes to interested clients. Examples of this approach include the Meteor web framework, and the MQTT-based push approach taken by Facebook Mobile Messenger. Meteor doesn’t really scale past a few hundred online users, and exhibits delays of up to ten seconds once it’s spread across multiple hosts. Even with multiple hosts and “oplog tailing,” it ends up using a lot of CPU on each server, which means that a large write volume ends up with unacceptably low performance and a scalability ceiling, determined by overall write load, that doesn’t shard. At any time, IMVU has hundreds of thousands of concurrent users, which is a volume Meteor doesn’t support.

As for the MQTT-based mobile data push, Facebook isn’t currently making their solution available on the open market, and hadn’t even begun talking about it when we started our own work. Small components of that solution (such as MQTT middleware) are available for clients that can use direct TCP connections, and could be a building block for a solution to the problem.

The good news is that we at IMVU already have a highly scalable, multi-cast architecture, in the form of IMQ (the IMVU Message Queue.) This queue allows us to send lightweight messages to all connected users in real-time (typical latencies are less than 10 milliseconds plus one-way network delay.) Thus, if we can know what kinds of things that a user is currently interested in seeing, and we can know whether those things change, we can let the user know that the data changed and needs to be re-fetched.

Jonblog1


The initial version of IMQ used Google Protocol Buffers on top of a persistent TCP connection for communications. This works great for desktop applications, and may work for some mobile applications as long as the device is persistently connected, but it does not work well for web browsers with no raw TCP connection ability, or intermittently connected mobile devices. To solve for these use cases, we added the ability to connect to IMQ using the websockets protocol, and additionally to fall back to an occasionally polled mail-drop pick-up model over HTTP for the worst-case connectivity situations. Note that this is still much more efficient than polling individual services for updated data — IMQ will buffer all the things that received change notifications across our service stack, and deliver them in a single HTTP response back to the client, when the client manages to make a HTTP request.

To make sure that the data for an endpoint is not stale when it is re-fetched by the client, we then mark the output of real-time updated REST services as non-cacheable by the intermediate caching layers. We have to do this, because we cannot tell the intermediate actors (especially, the browser cache) about the cache invalidation — even though we have JavaScript code running in the browser, and it knows about the invalidation of a particular URL, it cannot tell the browser cache that the data at the end of that URL is now updated.

Instead, we keep a local cache inside the web page. This cache maps URL to JSON payload, and our wrapper on top of XMLHttpRequest will first check this cache, and deliver the data if it’s there. When we receive an invalidation request over IMQ, we mark it stale (although we may still deliver it, for example for offline browsing purposes.)
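Here is a minimal sketch of that cache logic, written in Python for brevity (the real cache is JavaScript running inside the page):

class DocumentCache(object):
    # A sketch of the in-page cache described above: it maps URL to JSON
    # payload, and an IMQ invalidation marks an entry stale rather than
    # evicting it, so stale data can still serve offline browsing.
    def __init__(self, fetch):
        self.fetch = fetch    # function that performs the real HTTP request
        self.entries = {}     # url -> (payload, is_stale)

    def get(self, url, allow_stale=False):
        if url in self.entries:
            payload, stale = self.entries[url]
            if not stale or allow_stale:
                return payload
        payload = self.fetch(url)
        self.entries[url] = (payload, False)
        return payload

    def invalidate(self, url):
        # Called when an IMQ invalidation message names this URL.
        if url in self.entries:
            payload, _ = self.entries[url]
            self.entries[url] = (payload, True)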

Request Chattiness (Latency)

Our document-like object model looks like a connected graph with URL links as the edges, and JSON documents as the nodes. When receiving a particular set of data (such as the set of links that comprises my friends list) it is very likely that I will immediately turn around and ask for the data that’s pointed to by those links. If the browser and server both support the SPDY protocol, we could pre-stuff the right answers into the SPDY connection, in anticipation of the client requests. However, not all our clients have this support, and not even popular server-side tools like Nginx or Apache HTTPd support pre-caching, so instead we accomplish the same thing in our REST response envelope.

Instead of responding with just a single JSON document, we respond with a look-up table of URLs to JSON documents, including all the information we believe the client will want, based on the original request. This is entirely optional — the server doesn’t have to add any extra information; the client doesn’t have to pay attention to the extra data; but servers and clients that are in cahoots and pay attention will end up delivering a user experience with more than 30x fewer server round-trips! On internet connections where latency matters more than individual byte counts, this is a huge win. On very narrow-band connections (like 2G cell phones or dial-up modems,) the client can provide a header that tells the server to never send any data more than what’s immediately requested.

Jonblog2

Because the server knows all the data it has sent (including the speculatively pre-loaded, or “denormalized” data,) the server can now make arrangements for the client to receive real-time updates through IMQ when the data backing those documents changes. Thus, when a friend comes online, or when a new catalog item from a creator I’m interested in is released, or when I purchase more credits, the server sends an invalidation message on the appropriate topic through the message queue, and any client that is interested in this topic will receive it, and update its local cache appropriately.

Putting it Together

This, in turn, ties into a reactive UI model. The authority of the data, within the application, lives in the in-process JSON cache, and the IMQ invalidation events are received by this cache. The cache can then know whether any piece of UI is currently displaying this data; if so, it issues a request to the server to fetch it, and once received, it updates the UI. If not, then it can just mark the element as stale, and re-fetch it if it’s later requested by some piece of UI or other application code.

The end-to-end flow is then:

  1. Bob loads Alice’s profile information
  2. Specific elements on the screen are tied to the information such as “name” or “motto”
  3. Bob’s client creates a subscription to updates to Alice’s information
  4. Alice changes her motto
  5. The back-end generates a message saying “Alice’s information changed” to everyone who is subscribed (which includes Bob)
  6. Bob’s client receives the invalidation message
  7. Bob’s client re-requests Alice’s profile information
  8. The underlying data model for Alice’s profile information on Bob’s display page changes
  9. The reactive UI updates the appropriate fields on the screen, so Bob sees the new data

All of these pieces mean re-thinking a number of building blocks of the standard web stack, which means more work for our foundational libraries. In return, we get a more reactive web application, where anything you see on the screen is always up to date, and changes propagate quickly, both through the user interface and through the back-end, with minimal per-request overhead.

This might seem complex, but it ends up working really well, and with the proper attention to library design for back- and front-end development, building a reactive application like this is no harder than building an old-style, slow polling (or manually refreshed) application.

It would be great if SPDY (and, in the future, HTTP/2) could support pre-stuffing responses in the real world. It would also be great if the browser DOM had an interface to the local cache, so that the application could tell the browser about a particular URL being invalidated. However, the solution we’ve built achieves the same benefits using existing protocols, which goes to show the fantastic flexibility and resilience inherent in the protocols and systems that make up the web!


The Case of the Trailing Space

Solve the case of the trailing space with IMVU’s Senior Engineer, Michael Slezak.

In this post, I wanted to discuss a problem we ran into recently dealing with REST authentication in our IMVU client application, which ultimately boiled down to small discrepancies between JSON encoders. In this case, I’m going to focus on Python’s json library and JavaScript’s JSON encoder. Before I dive into the problem, I want to provide a quick background on the client application architecture.

Client Architecture

The IMVU client application contains several layers in its architecture ranging from rendering, business logic, and front-end development. We use C++ for rendering and other low level functionality such as windows, call stacks, and interfacing with the business logic. Our business/client logic layer is all in Python. We use Python for communicating to the front-end, pinging our servers, and maintaining advanced chat logic and chat state, among many other things. Finally, we use the Gecko SDK (a.k.a. XULRunner) to handle all of our UI needs. This means we can write our front-end using HTML, Javascript, and CSS. We also have our own library to allow the front-end to call out to Python for things such as user data. Using web technologies for client UI development has allowed us to unify the technology we use for our site such as jQuery, Underscore.js, and even a few in-house libraries, resulting in increased engineering productivity.

With that said, we’ve recently hand-rolled our own implementation of Promises, as identified by the upcoming ECMAScript 6 proposal, in our open-sourced imvu.js library. I decided to drop this implementation into the client for our immediate use. With this change, I was able to also drop in a new REST client to help with chaining our requests in a more synchronous-like fashion (despite the fact that it’s asynchronous). Hooray! Asynchronous programming has just gotten easier for the client! However, that wasn’t the case…

XMLHttpRequest, are you there?

We got our feet wet using this exciting change with a new feature that my team is developing. There was one issue: Not authorized for request.

Uh oh.

Looks like we need an auth token for these requests. Specifically, the POST and DELETE requests. Simple enough. I found that we handle authentication in our Python code within this “securePostRaw” function. Every single request ever made in the client goes through this. To my knowledge, POSTing with XMLHttpRequest has never been used in the client. Ever. Needless to say, this was news to me…

It’s Dangerous to go Alone…

Looking at the securePostRaw function, we seem to take the auth token that the server initially gave us, hash it together with everything else, and use the result as the new auth token. For example, if there is a request body, we JSON encode it and also utf-8 encode it. We then take the customer id, the original auth token, the JSON encoded body, and the query parameters (if they exist), and run a hash function over the whole concatenated blob.
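As a sketch, that scheme looks something like the following. The function name, field order, and choice of hash algorithm here are illustrative assumptions, not IMVU’s actual code:

import hashlib
import json

def sign_request(customer_id, auth_token, body, query_string=""):
    # Hash the customer id, the original auth token, the JSON-and-utf-8
    # encoded body, and the query parameters, all concatenated together.
    body_json = json.dumps(body).encode("utf-8") if body is not None else b""
    blob = (str(customer_id) + auth_token).encode("utf-8") + body_json \
        + query_string.encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

The detail that matters for this story: the signature covers the exact bytes of the JSON encoding of the body.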

OK… a little odd way of securing a POST request since we are running over HTTPS. But this is legacy code! So, it’s understandable.

I took this hashing function and copied it into another file for our front-end to call directly. I don’t need Python to send off the request, I just want to set the correct auth headers via XMLHttpRequest so that we can use this new REST client. You might be wondering, “Why not just let Python handle it then?” Partly because we have a bigger vision in the near future where we bring in a bigger, hand-rolled front-end library that is now ubiquitous to how we write front-end software at IMVU. To achieve this, I need XHR to work so that it’s easier to just “plop it in”.

Anyways, I finally get the right tokens to put in our headers. We’re on our way! All is right with the world, so we test again and: Not authorized for request

…Take This!

Wait, but I did what you told me. I did all the right things. The old and new client tokens match up! This isn’t fair!!

Luckily, I can run the IMVU client off a local server and dig into what the server is seeing. Through our REST middleware stack on the server, we’re failing authentication! When we attempt to grab the logged in user, it fails to identify us! But the old and new client tokens are the same! I’m in parity! Not so fast…

Here is a dump from the server of when Python makes the request using the securePostRaw function:

slezakblog1

And here is the dump when XHR makes the request:

slezakblog2

sign is what the server comes up with, and sent is what the client sent (obviously). Why in the hell are they different? The server is playing tricks on us… So, I decided to log the final output of the string the server uses before it runs the hash function on it.

The Case of the Trailing Space

The log looked something like this:

slezakblog3

…There is an extra space right before the "http://" starts! It turns out JSON.stringify doesn’t leave any spaces like this. Since we let Python compute the hash, it also encodes the request body into JSON, which means it’s introducing the spacing! And since we do xhr.send(JSON.stringify(body)), we have a mismatch between what the client calculates and what the server calculates, because the server technically has a different request body (by one single space!).

Fortunately, the json library in Python has a keyword argument in its dumps function called separators. So the code now looks like json.dumps(body, separators=(',', ':')), which gives us a more compact version of the encoding. We now match JSON.stringify.
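The difference is easy to see in a Python shell:

import json

body = {"url": "http://example.com"}

json.dumps(body)
# '{"url": "http://example.com"}'   <- a space sneaks in after the colon

json.dumps(body, separators=(',', ':'))
# '{"url":"http://example.com"}'    <- byte-for-byte what JSON.stringify produces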

After this change, the hashes finally matched and authentication worked!

Conclusion

Several lessons learned from this:

  • Hashing things that are dependent on variable data (which is also encoded) can be problematic. Since JSON is flexible in its encoding and allows for spacing, it can throw off the whole hash. Things that are more in our control such as integer values and constant strings are probably better to use.
  • All JSON encoders/decoders aren’t created equal.

If you enjoyed reading this article and are excited about solving problems such as unifying web technologies across multiple platforms, you’re in luck! We’re hiring!


How IMVU Builds Web Services: Part 3

In this 3-part series, IMVU senior engineer Bill Welden describes the means and technology behind IMVU’s web services.

Part 3: Documents and Links

In the previous entry in this series I described how IMVU uses a structured network model to implement the uniform contract for our REST services and showed how they might apply to a set of services for a hypothetical high school scheduling system.

Engblog1102-01

Under this model we have Node Groups (represented by the different colors of circles in this diagram), Nodes (the circles themselves), Edge Groups (the rounded boxes hanging off of the circles) and Edges (the lines connecting the circles).

We model these various notions using HTTP documents linked together with URLs. There are four kinds of documents, corresponding to each of the four concepts.

A Node Group determines the properties and relationships of the Nodes, and the corresponding Node Group Document contains a list of URLs, each locating a Node.

A Node, through its properties and relationships, is the basic repository of information in the network. The Node Document contains the values of the Node’s properties, as well as a list of URLs, each locating an Edge Group.

An Edge Group groups together all of the links that go to a particular type of Node. The corresponding Edge Group Document, specific to a Node, contains a list of URLs, each locating an Edge within that group.

An Edge is a link from its Node to another Node, usually of a different class. Edges can also contain properties, as we saw last time. An Edge Document therefore contains a URL for the Node at the other end of the Edge, as well as the values of any properties of the Edge.

Each type of document has a distinctive URL.

The URL for a Node Group Document consists of a single path segment, a singular noun describing the Node Group:

https://api.imvu.com/course

The URL for a Node consists of two path segments: the Node Group URL extended by a segment consisting of a unique identifier for the Node. The Node identifier can be any string, but we encourage the use of Node Identifiers which contain the name of the Node Group:

https://api.imvu.com/course/course-1035

The URL for an Edge Group consists of three segments: the Node URL extended either by a plural noun naming the Node Group that the Edge Group points off to, or by a noun describing the relationship created by the Edge Group:

https://api.imvu.com/course/course-1035/teachers
https://api.imvu.com/course/course-1035/roster

The URL for an Edge consists of four segments: the Edge Group URL extended by an identifier for the Edge. This identifier need only be unique within the Edge Group. Sometimes (for convenience) it is identical to the unique identifier of the Node that the Edge points to, but this is not required:

https://api.imvu.com/course/course-1035/teachers/1
https://api.imvu.com/course/course-1035/roster/student-5331

These URL formats are a constraint on the service design – a strong suggestion but not a strict requirement.

Note especially that, according to the principle of HATEOAS (Hypertext As The Engine Of Application State), the client cannot depend on any specific format, and is not allowed to construct these URLs or attempt to extract information from them. All URLs required by the client must be obtained whole, as opaque strings, from the response to an earlier service request.

Specifically, URLs returned from the server must be fully qualified, including the protocol and server name (“https://api.imvu.com”). However, in our own internal discussions and here in this presentation, we will often leave off the protocol and server name for clarity. So:

/course
/course/course-1035
/course/course-1035/teachers

and so forth. Wherever one of these relative URLs appears in the discussion below, the actual implementation will return a fully qualified version.

One of the biggest differences between REST services and the earlier Remote Procedure Call style is that RPC APIs define as many verbs as required by or convenient for implementation of the application. Designing RPC services is primarily about designing new verbs and specifying their parameters and their semantics.

With IMVU REST services, the verbs are specified, and there are only five of them. Designing a service consists of designing a set of Nodes and Edges, together with their properties and relationships, in such a way that these five verbs can provide all of the required functionality.

The verbs are

GET (any endpoint),
POST (to a Node or Edge),
POST (to a Node Group or an Edge Group),
POST (to a Node Group, including Edges), and
DELETE (a Node or Edge).

GET retrieves the document associated with the URL, which can be a Node Group Document, Node Document, Edge Group Document or an Edge Document. GETs must be nullipotent, which is to say that they can have no side-effects.

POST, when applied to a Node or Edge URL, makes changes to the properties and relations of the Node or Edge. Such POSTs must be idempotent, which means that sending the POST twice must have exactly the same effect as sending it once.

When POST is applied to a Node Group or Edge Group, it adds a new Node or a new Edge. Such POSTs will not be idempotent, since sending the POST twice will add two new Nodes or two new Edges.

There is a third sort of POST, a POST specifically to a Node Group which adds a new Node, but which also contains data for a set of new Edges to be added along with the Node (including Edge Groups as necessary).

Finally there is DELETE, which is used to delete a Node or an Edge. The implementation of DELETE must be idempotent.

GET returns the document associated with the URL, but in order to minimize round trips the service is allowed to return any additional documents that it thinks the client may soon need.

You can see this in the format of the JSON document returned by a GET:

{
  "id": "/course/course-1035",
  "status": "success",
  "denormalized":
  {
    "/course/course-1035": { … },
    "/course/course-1035/teachers": { … },
    "/course/course-1035/teachers/1": { … },
    … etc …
  }
}

There is a status, “success” or “failure” (and, if the GET fails, some information about why). If the GET succeeds, a package of endpoints is included, grouped under the response member “denormalized”.

In this example the client has asked for a Course, and the server has chosen to return not only the Course Node, but the Edge Group and Edges for all of the teachers that teach that course.

Each of the four kinds of documents in these responses has a specific JSON format.

A Node Group Document has a member “nodes” which is an array of URLs, one for each Node.

{
  "nodes": [
    "/course/course-1035",
    "/course/course-1907",
    ...
  ]
}

If the client wants to present the list of Nodes in a certain order, it is responsible for sorting them itself. It cannot depend on the order of entries in this array.

A Node Document has two members. The “data” member contains the Node’s properties. The “relations” member contains the URLs that link the Node to its Edge Groups and to other Nodes in the system.

{
  "data": {
    "description": "Algebra I",
    "starting_time": "10:00"
  },
  "relations": {
    "teacher": "/teacher/teacher-372",
    "roster": "/course/course-1035/roster"
  }
}

Names in these objects represent a contract with the client. Based on the name, the server guarantees the semantics of the value, including its type, the allowed values, the meaning of the value, and, if it is a link, the Node Group of the Node it points to.

Note that here “teacher” is a link to a Teacher Node. This design precludes the possibility of team teaching where there are two teachers for a class.

Links directly from one Node to another create a one-to-N relationship. One Teacher to many Courses. As a rule, however, relationships are more often N-to-N than not, and such restrictions on cardinality are a red flag. Not wrong, necessarily, but something that may be called out in design reviews.

The design could be made to support many teachers per course by implementing an Edge Group “teachers” for Course Nodes. This is a more flexible design, because the cardinality restrictions, say a maximum of two teachers, or allowing multiple teachers only for certain courses, can be implemented on the server, where they are easier to change.

An Edge Group Document has a member “edges” which is a JSON array of URLs for the Node’s Edges.

{
  "edges": [
    "/course/course-1035/roster/student-5331",
    "/course/course-1035/roster/student-1070",
    ...
  ]
}

Again, the client cannot depend on the order that the Edges come back from the server.

Finally, an Edge Document has an optional “data” member for when the Edge has properties, and a “relations” member which contains the URL for the Node at the other end of the Edge.

{
 "data": {
  "tardies": 1,
 },
 "relations": {
  "ref": "/student/student-5331"
 }
}

Here is an Edge between one Course (course-1035) and one Student (student-5331).

Edges go both ways. Here are the Edges between that same student and his courses.

{
  "edges": [
    "/student/student-5331/schedule/01",
    "/student/student-5331/schedule/02",
    ...
  ]
}

Now it’s not required to implement every Edge Group implied by the Node/Edge model, so it’s acceptable to implement only half of this Edge relationship without its symmetrical partner. When we have an Edge Group in our design, however, we think carefully about the symmetrical Edge Group. It’s often a very interesting view on the data, and it’s seldom very difficult to implement.

Note that the Edge document itself can come back in two different ways, but the underlying database record will be the same. Here is one of the Edge documents linked from the Edge Group above:

{
  "data": {
    "tardies": 1
  },
  "relations": {
    "ref": "/course/course-1035"
  }
}

Whether you look at this link from the Student or the Course perspective, the count of tardies is the same data element.

Nodes and Edges are updated using the same JSON document format as the response from a GET, though there is no denormalization envelope.

Here is the document for a POST to a Student Node (/student/student-513312) intended to update the student’s birth date and counsellor (documents for POSTs to Edges look pretty much the same):

{
  "data": {
    "birth_date": "5/11/96"
  },
  "relations": {
    "counsellor": "/teacher/teacher-121"
  }
}

If a property is missing from the data or relations sections, it is left unchanged in the Node or Edge.
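A small sketch of these merge semantics in Python (the handler name and in-memory representation are hypothetical):

def apply_node_post(node, patch):
    # Only the properties present in the patch change; everything
    # else in the Node is left untouched.
    node["data"].update(patch.get("data", {}))
    node["relations"].update(patch.get("relations", {}))
    return node

student = {"data": {"name": "Piers", "birth_date": "1/1/95"},
           "relations": {"counsellor": "/teacher/teacher-372"}}
patch = {"data": {"birth_date": "5/11/96"},
         "relations": {"counsellor": "/teacher/teacher-121"}}
apply_node_post(student, patch)   # the name is unchanged; the rest is updated

Note that this also preserves idempotency: applying the same patch twice leaves the Node exactly as it was after the first application.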

The response which comes back from a POST to a Node or Edge is the same as the response from a GET to that Node or Edge, including additional denormalized data at the discretion of the server.

New Nodes and Edges are added by posting to the corresponding Node Group or Edge Group and providing the data and relations for the new Node or Edge in the body of the POST. This is a document POSTed to /course/course-1035/roster to create a new Edge, enrolling a new student in a course:

{
 "relations": {
   "ref": "/student/student-10706"
 }
}

All of the required properties must be included in this kind of POST. A POST creating a new Node looks pretty much the same.

Again, the response to this kind of POST is the same as you would receive from a GET to the newly created Node or Edge (though you will only know the id of the new Node or Edge once the response comes back).

It is often useful to be able to add a Node and a number of Edges in one POST request. The response would include the new Node, a new Edge Group and all of the specified Edges. Here is a POST to /student which adds a new Student along with Enrollment Edges in two courses:

{
  "data": {
    "name": "Andrew",
    "birth_date": "7/21/95"
  },
  "edges": {
    "schedule": [
      {
        "relations": {
          "ref": "/course/course-1035"
        }
      },
      {
        "relations": {
          "ref": "/course/course-2995"
        }
      }
    ]
  }
}

The only other verb is DELETE, which can be applied to a Node or an Edge by providing the URL of the Node or Edge. No document is passed with a DELETE request. DELETEing a Node will also delete all of the Edge Groups and Edges for that Node.

The service implements much of the functionality of an application by implementing business rules which add (within limits) to the semantics of POSTs and DELETEs.

Business rules cannot change the fundamental semantics of these verbs. A successful POST to a Node Group or Edge Group must still add a new Node or Edge. A successful POST to a Node or Edge must still make the specified modifications to the Node or Edge, and a successful DELETE must still result in the specified deletion.

In particular, POST to a Node or Edge must remain idempotent. POSTing twice to a Node or Edge must have exactly the same effect as POSTing once.

Business rules, however, can reject requests which violate a desired constraint. We might want to disallow deletion of students prior to their 21st birthday.

Business rules can also limit operations to specific users or classes of user. Our school system might have an administrator class who are responsible for placing students in classes. An attempt to add a Student Edge to a class would be rejected if the client making the request was not identified as an administrator.

Business rules also have broad authority to make additional changes to the back end data structures based on a POST or DELETE. We could add a property to Student showing the number of classes each student is enrolled in, and then when the client POSTs to the schedule Edge Group to add a new Course, increment this count in the Student Node.
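A sketch of how such business rules might wrap the fundamental POST semantics; every helper name here (is_administrator, reject, add_edge) is hypothetical:

def enroll_student(request, student, course):
    # Business rules may reject the request outright...
    if not is_administrator(request.user):
        return reject("only administrators may place students in classes")
    # ...but a successful POST must still add the Edge as specified,
    add_edge(student, "schedule", course)
    # ...and may additionally update denormalized data, such as the
    # per-student enrollment count described above.
    student["data"]["enrolled_count"] += 1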

As I mentioned earlier, the client is not allowed to construct URLs, but is required to retrieve them from the server. It is, however, allowed to append query parameters. These must have no effect on the query other than limiting the set of records returned.

Here are some examples.

GET /student?name=piers*
GET /student/student-10706/schedule?tardies=0
GET /student?schedule.tardies=gt.0

In the first two cases the query is based on a property of the Node or Edge. The first is intended to retrieve all students whose name begins with “piers”, the second to get all courses where the given student is currently enrolled and has no tardies.

The third example shows a query that would be achieved with a join in SQL. It is imagined to retrieve all students who have been late for at least one class.

Note that the HTTP query syntax doesn’t really support the kinds of database queries we want to perform, and because of that we have not yet settled on standards for specifying queries (different projects have come up with different solutions). We are still striking out into new territory, but remain clear that we want to work toward a company-wide solution in the long term.

Some Node Groups contain a lot of Nodes. At IMVU we have a Node Group for the tens of millions of products in our catalog. We don’t currently support querying our product Node Group, but if we did, we would get a response back something like this:

{ …
  {
    "nodes": [
      "/product/product-10057",
      "/product/product-10092",
      ...
      "/product/product-11033"
    ],
    "next": "/product?offset=50"
  }
}

The list of Nodes includes only the first fifty, but provides us with a URL – with a built-in query parameter – which allows us to retrieve another group of fifty.
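A client-side sketch of walking such a paged Node Group, assuming a get_json helper that performs the GET and decodes the response. Note that the client never constructs the paging URLs itself; it only follows the ones the server hands back:

def iter_node_urls(first_page_url, get_json):
    url = first_page_url
    while url:
        page = get_json(url)
        for node_url in page["nodes"]:
            yield node_url
        # Follow the server-provided "next" link, if any.
        url = page.get("next")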

There are a number of services which do paging like this, but note that offsets don’t work very well for paging when Nodes are often being added and deleted, since the offsets associated with specific Nodes can change. Paging is another area in which our standards are still under development.

Finally, in order to allow clients to cache the responses they receive we provide a way for the service to let the client know when cached URL responses are no longer valid.

This involves the use of IMQ, IMVU’s proprietary system for efficiently pushing data to clients in real time. In our case, the data is simply a notification that an earlier response to a particular URL is no longer valid. We’ll go into detail on IMQ and how it works in a future post.

Most endpoints do not provide invalidation. IMVU product information, for example, doesn’t change often enough to make the overhead associated with invalidation worthwhile.

When invalidation is available, the response to an endpoint will include a member called “updates”, which provides the information necessary to subscribe to the appropriate IMQ queue. Here is a response to a GET of a particular enrollment Edge (/student/student-513312/schedule/01):

{
  "data": {
    "tardies": 1
  },
  "relations": {
    "ref": "/course/course-1035"
  },
  "updates": "imq://inv.student.student-513312"
}

Invalidations for different endpoints come in on different queues, and it is the server’s prerogative to choose the queue for each endpoint.

Currently, for the endpoints which support invalidation, we have one queue per Node. Invalidations for the Node come in on this queue, but the same queue also provides invalidations for the Node’s Edge Groups and Edges. This design allows us to achieve a balance between the number of queues we create and the number of subscribers to each queue.

In the example above, only clients that have an active interest in Student 513312 will be subscribed to this queue. With this mechanism, if someone POSTs a change to the number of tardies, every client which has this record cached will receive a notification. If the number of tardies is displayed on the screen, it can be updated immediately.

Note that under this scheme, there can be no queue associated with a Node Group. Such a queue would be used to notify all clients that, for example, a new Student Node had been added. To be useful, however, every client application would need to subscribe to the queue, and IMQ is not built to support such a large number of subscribers. There are solutions to this problem, but for the moment we do not support invalidation for Node Groups.

The discipline of REST and the specific uniform contract IMVU has adopted give us the power and flexibility to quickly create, enhance and share back end services. The principles behind REST (and especially the principles of uniform contract, HATEOAS and separation of concerns) provide us with a solid framework as we continue to complete and codify our standards.

We’ll keep you up to date as things develop.

Optimizing WebGL Shaders by Reading D3D Shader Assembly

We are optimizing WebGL shaders for the Intel GMA 950 chipset, which is basically the slowest WebGL-capable device we care about. Unfortunately, it’s a fairly common chipset too. On the plus side, if we run well on the GMA 950, we should basically run well anywhere. 🙂

When you’re writing GLSL in WebGL on Windows, your code is three layers of abstraction away from what actually runs on the GPU. First, ANGLE translates your GLSL into HLSL. Then, D3DX compiles the HLSL into optimized shader assembly bytecode. That shader bytecode is given to the driver, where it’s translated into hardware instructions for execution on the silicon.

Ideally, when writing GLSL, we’d like to see at least the resulting D3D shader assembly.

With a great deal of effort and luck, I have finally succeeded at extracting Direct3D shader instructions from WebGL. I installed NVIDIA Nsight and Visual Studio 2013 in Boot Camp on my Mac. To use Nsight, you need to create a dummy Visual Studio project. Without a project, it won’t launch at all. Once you have your dummy project open, configure Nsight (via Nsight User Properties under Project Settings) to launch Firefox.exe instead. Firefox is easier to debug than Chrome because it runs in a single process.

If you’re lucky — and I’m not sure why it’s so unreliable — the Nsight UI will show up inside Firefox. If it doesn’t show up, try launching it again. Move the window around, open various menus. Eventually you should have the ability to capture a frame. If you try to capture a frame before the Nsight UI shows up in Firefox, Firefox will hang.

Once you capture a frame, find an interesting draw call, make sure the geometry is from your WebGL application, and then begin looking at the shader. (Note: again, this tool is unreliable. Sometimes you get to look at the HLSL that ANGLE produced, which you can compile and disassemble with fxc.exe, and sometimes you get the raw shader assembly.)

Anyway, check this out. We’re going to focus on the array lookup in the following bone skinning GLSL:

attribute vec4 vBoneIndices;
uniform vec4 uBones[3 * 68];

ivec3 i = ivec3(vBoneIndices.xyz) * 3;
vec4 row0 = uBones[i.x];

ANGLE translates the array lookup into HLSL. Note the added bounds check for security. (Why? D3D claims it already bounds-checks array accesses.)

int3 _i = (ivec3(_vBoneIndices) * 3);
float4 _row0 = _uBones[int(clamp(float(_i.x), 0.0, 203.0))];

This generates the following shader instructions:

# NOTE: v0 is vBoneIndices
# The actual load isn't shown here.  This is just index calculation.

def c0, 2.00787401, -1, 3, 0
def c218, 203, 2, 0.5, 0

# r1 = v0, truncated towards zero
slt r1.xyz, v0, -v0
frc r2.xyz, v0
add r3.xyz, -r2, v0
slt r2.xyz, -r2, r2
mad r1.xyz, r1, r2, r3

mul r2.xyz, r1, c0.z # times three

# clamp
max r2.xyz, r2, c0.w
min r2.xyz, r2, c218.x

# get ready to load, using a0.x as index into uBones
mova a0.x, r2.y

That blob of instructions that implements truncation towards zero? Let’s decode that.

r1.xyz = (v0 < 0) ? 1 : 0
r2.xyz = v0 - floor(v0)
r3.xyz = v0 - r2
r2.xyz = (-r2 < r2) ? 1 : 0
r1.xyz = r1 * r2 + r3

Simplified further:

r1.xyz = (v0 < 0) ? 1 : 0
r2.xyz = (floor(v0) < v0) ? 1 : 0
r1.xyz = (r1 * r2) + floor(v0)

In short, r1 = floor(v0), UNLESS v0 < 0 and floor(v0) < v0, in which case r1 = floor(v0) + 1.
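Here is the same logic as a small Python sanity check:

import math

def truncate_toward_zero(v):
    # What the slt/frc/add/mad sequence computes: floor(v), plus one
    # when v is negative and has a fractional part.
    correction = 1 if (v < 0 and math.floor(v) < v) else 0
    return math.floor(v) + correction

assert truncate_toward_zero(2.7) == 2
assert truncate_toward_zero(-2.7) == -2   # floor(-2.7) is -3; the correction adds 1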

That’s a lot of instructions just to calculate an array index. After the index is calculated, it’s multiplied by three, clamped to the array boundaries (securitah!), and loaded into the address register.

Can we convince ANGLE and HLSL that the array index will never be negative? (It should know this, since it’s already clamping later on, but whatever.) Perhaps avoid a bunch of generated code? Let’s tweak the GLSL a bit.

ivec3 i = ivec3(max(vec3(0), vBoneIndices.xyz)) * 3;
vec4 row0 = uBones[i.x];

Now the instruction stream is substantially reduced!

def c0, 2.00787401, -1, 0, 3
def c218, 203, 1, 2, 0.5

# clamp v0 positive
max r1, c0.z, v0.xyzx

# r1 = floor(r1)
frc r2, r1.wyzw
add r1, r1, -r2

mul r1, r1, c0.w # times three

# bound-check against array
min r2.xyz, r1.wyzw, c218.x
mova a0.x, r2.y

By clamping the bone indices against zero before converting to integer, the shader optimizer eliminates several instructions.

Can we get rid of the two floor instructions? We know that the mova instruction rounds to the nearest integer when converting a float to an index. Given that knowledge, I tried to eliminate the floor by making my GLSL match mova semantics, but the HLSL compiler didn’t seem smart enough to elide the two floor instructions. If you can figure this out, please let me know!

Either way, I wanted to demonstrate that, by reading the generated Direct3D shader code from your WebGL shaders, you may find small changes that result in big wins.