The Real-time Web in REST Services at IMVU

By Jon Watte, VP Technology @ IMVU

IMVU has built a rich, graph-shaped REST (REpresentational State Transfer) API (Application Programming Interface) to our data. This data includes a full social network, as well as e-commerce, virtual currencies, and the biggest 3D user-generated content catalog in the world. This post discusses how IMVU addresses two of the bigger drawbacks of REST-based service architectures for real-time interactive content: Cache Invalidation (where users want to know about new data as soon as it becomes available) and Request Chattiness (where request latency kills your performance).

Cache Invalidation

REST principles like cacheability and hypertext-based documents work great for exposing data to a variety of clients (desktop, web, and mobile), but run into trouble when they meet the expectation of real-time interaction. For example, when a user changes their Motto in their profile, they would like the world to see the new Motto right away — yet many of the scalability wins of REST principles rely on caching, and the web does not have a good invalidation model. Here is an illustration of the problem:

At 10:03 am, Bob logs in and the client application fetches the profile information about his friend Alice. This data potentially gets cached at several different layers:

  • in application caches at the server end, such as Varnish or Squid
  • in content delivery network caches at the network edge, such as Akamai or Cloudfront
  • in the user’s own browser (every web browser has a local cache)

Let’s say that our service marks the data as cacheable for one hour.

At 10:04 am, Alice updates her Motto to say “there’s no business like show business.”

At 10:05 am, Alice sends a message to Bob asking “how do you like my motto?”

At 10:06 am, Bob looks at Alice’s profile, but because he has a stale version in cache, he sees the old version. Confusion (and, if this were a TV show, hilarity) ensues.

HTTP provides two options to solve this problem. One is to not do caching at all, which certainly “solves” the problem, but then also removes all benefits of a caching architecture. The other is to add the “must-revalidate” option to the cache control headers of the delivered data. This tells any client that, while it may store the data locally, it has to check back with the server to see whether the data has changed before re-using it. When the data has not changed, this saves on bytes transferred — the data doesn’t need to be sent twice — but it still requires the client to make a request to the server before presenting the data.

In modern web architectures, while saving throughput is nice, the real application performance killer is latency, which means that even the “zero bytes” response of checking in with the server introduces an unacceptable cost in end-user responsiveness. Cache validation and ETags might sound like a big win, but for a piece of data like a “motto,” the overhead in HTTP headers (several kilobytes) dwarfs the savings of 30 bytes of payload — the client might as well just re-GET the resource for approximately the same cost.
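To make the round-trip cost concrete, here is a minimal sketch of cache re-validation from the client's side, written in TypeScript against the standard fetch API (the ETag bookkeeping shown is illustrative, not IMVU's actual client code). Note that even a 304 "not modified" answer costs a full network round trip:

// Even with a cached copy in hand, must-revalidate forces a
// conditional request before the data may be shown.
async function revalidate(
  url: string,
  cached: { etag: string; body: unknown }
): Promise<unknown> {
  const response = await fetch(url, {
    headers: { "If-None-Match": cached.etag },
  });
  if (response.status === 304) {
    // Zero payload bytes transferred, but the latency was paid in full.
    return cached.body;
  }
  return response.json();
}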

Another option that’s used in some public APIs is to version all the data: when data is updated, the version is updated too, which means that the data now has a new URL. A client asking for the latest version of the data would then never get a stale cached version. Because of HATEOAS (Hypermedia As The Engine Of Application State), we would be able to discover the new URL for “Alice’s Profile Information,” and thus read the updated data. Unfortunately, there is no good way to discover that the new version is there — the client running on Bob’s machine would have to walk the tree of data from the start to get back to Alice’s new profile link, which means even more round-trip requests and makes the latency even worse.

A third option is to use REST transfer for the bulk data, but use some other, out-of-band (from the point of view of the HTTP protocol) mechanism to send changes to interested clients. Examples of this approach include the Meteor web framework and the MQTT-based push approach taken by Facebook Mobile Messenger. Meteor doesn’t really scale past a few hundred online users, and exhibits delays of up to ten seconds once it is spread across multiple hosts. Even with multiple hosts and “oplog tailing,” it ends up using a lot of CPU on each server, which means that a large write volume leads to unacceptably low performance and a scalability ceiling, determined by overall write load, that doesn’t shard. At any time, IMVU has hundreds of thousands of concurrent users, which is a volume Meteor doesn’t support.

As for the MQTT-based mobile data push, Facebook isn’t currently making their solution available on the open market, and hadn’t even begun talking about it when we started our own work. Small components of that solution (such as MQTT middleware) are available for clients that can use direct TCP connections, and could be a building block for a solution to the problem.

The good news is that we at IMVU already have a highly scalable, multicast architecture in the form of IMQ (the IMVU Message Queue). This queue allows us to send lightweight messages to all connected users in real time (typical latencies are less than 10 milliseconds, plus one-way network delay). Thus, if we can know what kinds of things a user is currently interested in seeing, and we can know when those things change, we can let the user know that the data changed and needs to be re-fetched.


The initial version of IMQ used Google Protocol Buffers on top of a persistent TCP connection for communications. This works great for desktop applications, and may work for some mobile applications as long as the device is persistently connected, but it does not work well for web browsers with no raw TCP connection ability, or for intermittently connected mobile devices. To solve for these use cases, we added the ability to connect to IMQ using the WebSocket protocol, and additionally to fall back to an occasionally polled mail-drop pick-up model over HTTP for the worst-case connectivity situations. Note that this is still much more efficient than polling individual services for updated data — IMQ buffers all the change notifications received across our service stack, and delivers them in a single HTTP response when the client manages to make an HTTP request.
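As a rough sketch of that transport selection (with hypothetical endpoint names, since the real IMQ addresses are internal), the client logic looks something like this in TypeScript:

// Prefer a persistent WebSocket; fall back to occasionally polling
// an HTTP "mail drop" that returns every buffered notification in
// one response.
function connectToIMQ(onMessage: (msg: unknown) => void): void {
  if (typeof WebSocket !== "undefined") {
    const socket = new WebSocket("wss://imq.example.com/connect"); // hypothetical URL
    socket.onmessage = (event) => onMessage(JSON.parse(event.data));
    socket.onerror = () => pollMailDrop(onMessage);
  } else {
    pollMailDrop(onMessage);
  }
}

async function pollMailDrop(onMessage: (msg: unknown) => void): Promise<void> {
  // A single HTTP request picks up everything buffered since last time.
  const response = await fetch("https://imq.example.com/maildrop"); // hypothetical URL
  for (const msg of await response.json()) {
    onMessage(msg);
  }
  setTimeout(() => pollMailDrop(onMessage), 30_000); // occasional polling
}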

To make sure that the data for an endpoint is not stale when the client re-fetches it, we mark the output of real-time updated REST services as non-cacheable by the intermediate caching layers. We have to do this because we cannot tell the intermediate actors (especially the browser cache) about the cache invalidation — even though we have JavaScript code running in the browser that knows a particular URL has been invalidated, it has no way to tell the browser cache that the data behind that URL is now updated.

Instead, we keep a local cache inside the web page. This cache maps URLs to JSON payloads, and our wrapper on top of XMLHttpRequest first checks this cache and delivers the data if it’s there. When we receive an invalidation message over IMQ, we mark the corresponding entry stale (although we may still deliver it, for example for offline browsing purposes).
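A minimal sketch of that in-page cache, assuming the wrapper-and-invalidation design described above (names are illustrative):

type CacheEntry = { payload: unknown; stale: boolean };

class DocumentCache {
  private entries = new Map<string, CacheEntry>();

  async get(url: string): Promise<unknown> {
    const entry = this.entries.get(url);
    if (entry && !entry.stale) {
      return entry.payload; // served without any network traffic
    }
    const payload = await (await fetch(url)).json();
    this.entries.set(url, { payload, stale: false });
    return payload;
  }

  // Called when an invalidation message for this URL arrives over IMQ.
  invalidate(url: string): void {
    const entry = this.entries.get(url);
    if (entry) {
      entry.stale = true; // kept around, e.g. for offline browsing
    }
  }
}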

Request Chattiness (Latency)

Our document-like object model looks like a connected graph with URL links as the edges and JSON documents as the nodes. When a client receives a particular set of data (such as the set of links that comprises my friends list), it is very likely to immediately turn around and ask for the data pointed to by those links. If the browser and server both support the SPDY protocol, we could pre-stuff the right answers into the SPDY connection, in anticipation of the client requests. However, not all our clients have this support, and popular server-side tools like Nginx and Apache httpd don’t support pre-caching either, so instead we accomplish the same thing in our REST response envelope.

Instead of responding with just a single JSON document, we respond with a look-up table of URLs to JSON documents, including all the information we believe the client will want, based on the original request. This is entirely optional — the server doesn’t have to add any extra information; the client doesn’t have to pay attention to the extra data; but servers and clients that are in cahoots and pay attention will end up delivering a user experience with more than 30x fewer server round-trips! On internet connections where latency matters more than individual byte counts, this is a huge win. On very narrow-band connections (like 2G cell phones or dial-up modems), the client can provide a header that tells the server to never send more data than what’s immediately requested.
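Here is a sketch of how a cooperating client might ingest that envelope, assuming the envelope shape shown later in this series (an "id" member plus a "denormalized" look-up table); the cache plumbing is illustrative:

// Store every document from the look-up table in the local
// URL-to-JSON cache, so anticipated follow-ups never hit the network.
async function getWithDenormalized(
  cache: Map<string, unknown>,
  url: string
): Promise<unknown> {
  const envelope = await (await fetch(url)).json();
  const docs: Record<string, unknown> = envelope.denormalized ?? {};
  for (const [docUrl, doc] of Object.entries(docs)) {
    cache.set(docUrl, doc); // pre-stuff the speculative data
  }
  return docs[envelope.id]; // the document that was actually requested
}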


Because the server knows all the data it has sent (including the speculatively pre-loaded, or “denormalized,” data), it can now arrange for the client to receive real-time updates through IMQ when the data backing those documents changes. Thus, when a friend comes online, or when a new catalog item from a creator I’m interested in is released, or when I purchase more credits, the server sends an invalidation message on the appropriate topic through the message queue, and any client that is interested in this topic receives it and updates its local cache appropriately.

Putting it Together

This, in turn, ties into a reactive UI model. The authority of the data, within the application, lives in the in-process JSON cache, and the IMQ invalidation events are received by this cache. The cache can then know whether any piece of UI is currently displaying this data; if so, it issues a request to the server to fetch it, and once received, it updates the UI. If not, then it can just mark the element as stale, and re-fetch it if it’s later requested by some piece of UI or other application code.
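In TypeScript-flavored pseudocode, that dispatch decision looks roughly like this (a sketch, assuming one visible binding per URL for brevity):

class ReactiveCache {
  private stale = new Set<string>();
  private watchers = new Map<string, (doc: unknown) => void>();

  // A piece of UI declares that it is currently displaying this URL.
  bind(url: string, render: (doc: unknown) => void): void {
    this.watchers.set(url, render);
  }

  async onInvalidate(url: string): Promise<void> {
    const render = this.watchers.get(url);
    if (render) {
      render(await (await fetch(url)).json()); // on screen: re-fetch now
    } else {
      // Off screen: a get() path (not shown) would consult this set
      // and re-fetch lazily the next time the URL is requested.
      this.stale.add(url);
    }
  }
}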

The end-to-end flow is then:

  1. Bob loads Alice’s profile information
  2. Specific elements on the screen are tied to the information such as “name” or “motto”
  3. Bob’s client creates a subscription to updates to Alice’s information
  4. Alice changes her motto
  5. The back-end generates a message saying “Alice’s information changed” to everyone who is subscribed (which includes Bob)
  6. Bob’s client receives the invalidation message
  7. Bob’s client re-requests Alice’s profile information
  8. The underlying data model for Alice’s profile information on Bob’s display page changes
  9. The reactive UI updates the appropriate fields on the screen, so Bob sees the new data

All of these pieces mean re-thinking a number of building blocks of the standard web stack, which means more work for our foundational libraries. In return, we get a more reactive web application, where anything you see on the screen is always up to date, and changes propagate quickly, both through the user interface and through the back end, with minimal per-request overhead.

This might seem complex, but it ends up working really well, and with the proper attention to library design for back- and front-end development, building a reactive application like this is no harder than building an old-style, slow polling (or manually refreshed) application.

It would be great if SPDY (and, in the future, HTTP/2) could support pre-stuffing responses in the real world. It would also be great if the browser DOM had an interface to the local cache, so that the application could tell the browser about a particular URL being invalidated. However, the solution we’ve built achieves the same benefits using existing protocols, which goes to show the fantastic flexibility and resilience inherent in the protocols and systems that make up the web!

 

How IMVU Builds Web Services: Part 3

In this 3-part series, IMVU senior engineer Bill Welden describes the means and technology behind IMVU’s web services.

Part 3: Documents and Links

In the previous entry in this series I described how IMVU uses a structured network model to implement the uniform contract for our REST services and showed how they might apply to a set of services for a hypothetical high school scheduling system.

[Diagram: four Node Groups shown as differently colored circles (the Nodes), with rounded boxes (Edge Groups) hanging off the circles and lines (Edges) connecting them]

Under this model we have Node Groups (represented by the different colors of circles in this diagram), Nodes (the circles themselves), Edge Groups (the rounded boxes hanging off of the circles) and Edges (the lines connecting the circles).

We model these various notions using HTTP documents linked together with URLs. There are four kinds of documents, corresponding to each of the four concepts.

A Node Group determines the properties and relationships of the Nodes, and the corresponding Node Group Document contains a list of URLs, each locating a Node.

A Node, through its properties and relationships, is the basic repository of information in the network. The corresponding Node Document contains the values of the Node’s properties, as well as a list of URLs, each locating an Edge Group.

An Edge Group groups together all of the links that go to a particular type of Node. The corresponding Edge Group Document, specific to a Node, contains a list of URLs, each locating an Edge within that group.

An Edge is a link from its Node to another Node, usually of a different class. Edges can also contain properties, as we saw last time. An Edge Document therefore contains a URL for the Node at the other end of the Edge, as well as the values of any properties of the Edge.

Each type of document has a distinctive URL.

The URL for a Node Group Document consists of a single path segment, a singular noun describing the Node Group:

https://api.imvu.com/course

The URL for a Node consists of two path segments: the Node Group URL extended by a segment consisting of a unique identifier for the Node. The Node identifier can be any string, but we encourage the use of Node Identifiers which contain the name of the Node Group:

https://api.imvu.com/course/course-1035

The URL for an Edge Group consists of three segments: the Node URL extended either by a plural noun naming the Node Group that the Edge Group points off to, or by a noun describing the relationship created by the Edge Group:

https://api.imvu.com/course/course-1035/teachers
https://api.imvu.com/course/course-1035/roster

The URL for an Edge consists of four segments: the Edge Group URL extended by an identifier for the Edge. This identifier need only be unique within the Edge Group. Sometimes (for convenience) it is identical to the unique identifier of the Node that the Edge points to, but this is not required:

https://api.imvu.com/course/course-1035/teachers/1
https://api.imvu.com/course/course-1035/roster/student-5331

These URL formats are a constraint on the service design – a strong suggestion but not a strict requirement.

Note especially that, according to the principle of HATEOAS (Hypermedia As The Engine Of Application State), the client cannot depend on any specific format, and is not allowed to construct these URLs or attempt to extract information from them. All URLs required by the client must be obtained whole, as opaque strings, from the response to an earlier service request.

Specifically, URLs returned from the server must be fully qualified, including the protocol and server name (“https://api.imvu.com”). However, in our own internal discussions and here in this presentation, we will often leave off the protocol and server name for clarity. So:

/course
/course/course-1035
/course/course-1035/teachers

and so forth. Wherever one of these relative URLs appears in the discussion below, the actual implementation will return a fully qualified version.

One of the biggest differences between REST services and the earlier Remote Procedure Call style is that RPC APIs define as many verbs as required by or convenient for implementation of the application. Designing RPC services is primarily about designing new verbs and specifying their parameters and their semantics.

With IMVU REST services, the verbs are specified, and there are only five of them. Designing a service consists of designing a set of Nodes and Edges, together with their properties and relationships, in such a way that these five verbs can provide all of the required functionality.

The verbs are:

GET (any endpoint),
POST (to a Node or Edge),
POST (to a Node Group or an Edge Group),
POST (to a Node Group, including Edges), and
DELETE (a Node or Edge).

GET retrieves the document associated with the URL, which can be a Node Group Document, Node Document, Edge Group Document or an Edge Document. GETs must be nullipotent, which is to say that they can have no side-effects.

POST, when applied to a Node or Edge URL, makes changes to the properties and relations of the Node or Edge. Such POSTs must be idempotent, which means that sending the POST twice must have exactly the same effect as sending it once.

When POST is applied to a Node Group or Edge Group, it adds a new Node or a new Edge. Such POSTs will not be idempotent, since sending the POST twice will add two new Nodes or two new Edges.

There is a third sort of POST, a POST specifically to a Node Group which adds a new Node, but which also contains data for a set of new Edges to be added along with the Node (including Edge Groups as necessary).

Finally there is DELETE, which is used to delete a Node or an Edge. The implementation of DELETE must be idempotent.

GET returns the document associated with the URL, but in order to minimize round trips the service is allowed to return any additional documents that it thinks the client may soon need.

You can see this in the format of the JSON document returned by a GET:

{
  "id": "/course/course-1035",
  "status": "success",
  "denormalized":
  {
    "/course/course-1035": { … },
    "/course/course-1035/teachers": { … },
    "/course/course-1035/teachers/1": { … },
    … etc …
  }
}

There is a status, “success” or “failure”, and if the GET fails, some information about why; if it succeeds, a package of endpoints is included, grouped under the response member “denormalized”.

In this example the client has asked for a Course, and the server has chosen to return not only the Course Node, but the Edge Group and Edges for all of the teachers that teach that course.

Each of the four kinds of documents in these responses has a specific JSON format.

A Node Group Document has a member “nodes” which is an array of URLs, one for each Node.

{
  "nodes": [
    "/course/course-1035",
    "/course/course-1907",
    ...
  ]
}

If the client wants to present the list of Nodes in a certain order, it is responsible for sorting them itself. It cannot depend on the order of entries in this array.

A Node Document has two members. The “data” member contains the Node’s properties. The “relations” member contains the URLs that link the Node to its Edge Groups and to other Nodes in the system.

{
  "data": {
    "description": "Algebra I",
    "starting_time": "10:00"
  },
  "relations": {
    "teacher": "/teacher/teacher-372",
    "roster": "/course/course-1035/roster"
  }
}

Names in these objects represent a contract with the client. Based on the name, the server guarantees the semantics of the value, including its type, the allowed values, the meaning of the value, and, if it is a link, the Node Group of the Node it points to.

Note that here “teacher” is a link to a Teacher Node. This design precludes the possibility of team teaching where there are two teachers for a class.

Links directly from one Node to another create one-to-N relationships: one Teacher to many Courses. As a rule, however, relationships are more often N-to-N than not, and such restrictions on cardinality are a red flag. Not wrong, necessarily, but something that may be called out in design reviews.

The design could be made to support many teachers per course by implementing an Edge Group “teachers” for Course Nodes. This is a more flexible design, because the cardinality restrictions, say a maximum of two teachers, or allowing multiple teachers only for certain courses, can be implemented on the server, where they are easier to change.

An Edge Group Document has a member “edges” which is a JSON array of URLs for the Node’s Edges.

{
  "edges": [
    "/course/course-1035/roster/student-5331",
    "/course/course-1035/roster/student-1070",
    ...
  ]
}

Again, the client cannot depend on the order that the Edges come back from the server.

Finally, an Edge Document has an optional “data” member for when the Edge has properties, and a “relations” member which contains the URL for the Node at the other end of the Edge.

{
  "data": {
    "tardies": 1
  },
  "relations": {
    "ref": "/student/student-5331"
  }
}

Here is an Edge between one Course (course-1035) and one Student (student-5331).

Edges go both ways. Here are the Edges between that same student and his courses.

{
  "edges": [
    "/student/student-5331/schedule/01",
    "/student/student-5331/schedule/02",
    ...
  ]
}

Now it’s not required to implement every Edge Group implied by the Node/Edge model, so it’s acceptable to implement only half of this Edge relationship without its symmetrical partner. When we have an Edge Group in our design, however, we think carefully about the symmetrical Edge Group. It’s often a very interesting view on the data, and it’s seldom very difficult to implement.

Note that the Edge document itself can come back in two different ways, but the underlying database record will be the same. Here is one of the Edge documents linked from the Edge Group above:

{
  "data": {
    "tardies": 1
  },
  "relations": {
    "ref": "/course/course-1035"
  }
}

Whether you look at this link from the Student or the Course perspective, the count of tardies is the same data element.

Nodes and Edges are updated using the same JSON document format as the response from a GET, though there is no denormalization envelope.

Here is the document for a POST to a Student Node (/student/student-513312) intended to update the student’s birth date and counsellor (documents for POSTs to Edges look pretty much the same):

{
  "data": {
    "birth_date": "5/11/96"
  },
  "relations": {
    "counsellor": "/teacher/teacher-121"
  }
}

If a property is missing from the data or relations sections, it is left unchanged in the Node or Edge.

The response which comes back from a POST to a Node or Edge is the same as the response from a GET to that Node or Edge, including additional denormalized data at the discretion of the server.
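The update semantics are easy to express in code. Here is an illustrative TypeScript reduction of a POST to a Node (a sketch, not IMVU's implementation): members present in the body overwrite stored values, absent members survive, and applying the same body twice gives exactly the same document as applying it once, which is the idempotency requirement.

interface NodeDocument {
  data: Record<string, unknown>;
  relations: Record<string, string>;
}

// Merge the POSTed members over the stored document. Deterministic
// overwriting of the same keys makes this naturally idempotent.
function applyPost(
  stored: NodeDocument,
  body: Partial<NodeDocument>
): NodeDocument {
  return {
    data: { ...stored.data, ...(body.data ?? {}) },
    relations: { ...stored.relations, ...(body.relations ?? {}) },
  };
}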

New Nodes and Edges are added by posting to the corresponding Node Group or Edge Group and providing the data and relations for the new Node or Edge in the body of the POST. This is a document POSTed to /course/course-1035/roster to create a new Edge, enrolling a new student in a course:

{
 "relations": {
   "ref": "/student/student-10706"
 }
}

All of the required properties must be included in this kind of POST. A POST creating a new Node looks pretty much the same.

Again, the response to this kind of POST is the same as you would receive from a GET to the newly created Node or Edge (though you will only know the id of the new Node or Edge once the response comes back).
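An illustrative client helper for this create flow (hypothetical names; the response's "id" member is the envelope field shown earlier) might look like:

// POST the new Edge's document to the Edge Group URL, and learn the
// new Edge's URL only from the response, never by constructing it.
async function createEdge(
  edgeGroupUrl: string,
  body: { data?: Record<string, unknown>; relations: Record<string, string> }
): Promise<string> {
  const response = await fetch(edgeGroupUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const envelope = await response.json();
  return envelope.id; // e.g. "/course/course-1035/roster/student-10706"
}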

It is often useful to be able to add a Node and a number of Edges in one POST request. The response would include the new Node, a new Edge Group and all of the specified Edges. Here is a POST to /student which adds a new Student along with Enrollment Edges in two courses:

{
  "data": {
    "name": "Andrew",
    "birth_date": "7/21/95"
  },
  "edges": {
    "schedule": [
      {
        "relations": {
          "ref": "/course/course-1035"
        }
      },
      {
        "relations": {
          "ref": "/course/course-2995"
        }
      }
    ]
  }
}

The only other verb is DELETE, which can be applied to a Node or an Edge by providing the URL of the Node or Edge. No document is passed with a DELETE request. DELETEing a Node will also delete all of the Edge Groups and Edges for that Node.

The service implements much of the functionality of an application by implementing business rules which add (within limits) to the semantics of POSTs and DELETEs.

Business rules cannot change the fundamental semantics of these verbs. A successful POST to a Node Group or Edge Group must still add a new Node or Edge. A successful POST to a Node or Edge must still make the specified modifications to the Node or Edge, and a successful DELETE must still result in the specified deletion.

In particular, POST to a Node or Edge must remain idempotent. POSTing twice to a Node or Edge must have exactly the same effect as POSTing once.

Business rules, however, can reject requests which violate a desired constraint. We might want to disallow deletion of students prior to their 21st birthday.

Business rules can also limit operations to specific users or classes of user. Our school system might have an administrator class who are responsible for placing students in classes. An attempt to add a Student Edge to a class would be rejected if the client making the request was not identified as an administrator.

Business rules also have broad authority to make additional changes to the back end data structures based on a POST or DELETE. We could add a property to Student showing the number of classes each student is enrolled in, and then when the client POSTs to the schedule Edge Group to add a new Course, increment this count in the Student Node.

As I mentioned earlier, the client is not allowed to construct URLs, but is required to retrieve them from the server. It is, however, allowed to append query parameters. These must have no effect on the query other than limiting the set of records returned.

Here are some examples.

GET /student?name=piers*
GET /student/student-10706/schedule?tardies=0
GET /student?schedule.tardies=gt.0

In the first two cases the query is based on a property of the Node or Edge. The first is intended to retrieve all students whose name begins with “piers”, the second to get all courses where the given student is currently enrolled and has no tardies.

The third example shows a query that would be achieved with a join in SQL. It is imagined to retrieve all students who have been late for at least one class.

Note that the HTTP query syntax doesn’t map very well onto the kinds of database queries we want to perform, and because of that we have not yet settled on standards for specifying queries (different projects have come up with different solutions). We are still striking out into new territory, but remain clear that we want to work toward a company-wide solution in the long term.

Some Node Groups contain a lot of Nodes. At IMVU we have a Node Group for the tens of millions of products in our catalog. We don’t currently support querying our product Node Group, but if we did, we would get a response back something like this:

{ …
  {
    "nodes": [
      "/product/product-10057",
      "/product/product-10092",
      ...
      "/product/product-11033"
    ],
    "next": "/product?offset=50"
  }
}

The list of Nodes includes only the first fifty, but provides us with a URL – with a built-in query parameter – which allows us to retrieve another group of fifty.

There are a number of services which do paging like this, but note that offsets don’t work very well for paging when Nodes are often being added and deleted, since the offsets associated with specific Nodes can change. Paging is another area in which our standards are still under development.
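A client that wants every Node can simply chase the "next" links, as in this sketch (again, the client never constructs offsets itself; it only follows URLs the server handed back):

async function fetchAllNodes(firstPageUrl: string): Promise<string[]> {
  const nodes: string[] = [];
  let url: string | undefined = firstPageUrl;
  while (url) {
    const page = await (await fetch(url)).json();
    nodes.push(...page.nodes); // this page's Node URLs
    url = page.next; // absent on the last page
  }
  return nodes;
}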

Finally, in order to allow clients to cache the responses they receive we provide a way for the service to let the client know when cached URL responses are no longer valid.

This involves the use of IMQ, IMVU’s proprietary system for efficiently pushing data to clients in real time. In our case, the data is simply a notification that an earlier response to a particular URL is no longer valid. We’ll go into detail on IMQ and how it works in a future post.

Most endpoints do not provide invalidation. IMVU product information, for example, doesn’t change often enough to make the overhead associated with invalidation worthwhile.

When invalidation is available, the response to an endpoint will include a member called “updates”, which provides the information necessary to subscribe to the appropriate IMQ queue. Here is a response to a GET of a particular enrollment Edge (/student/student-513312/schedule/01):
{
  "data": {
    "tardies": 1
  },
  "relations": {
    "ref": "/course/course-1035"
  },
  "updates": "imq://inv.student.student-513312"
}

Invalidations for different endpoints come in on different queues, and it is the server’s prerogative to choose the queue for each endpoint.

Currently, for the endpoints which support invalidation, we have one queue per Node. Invalidations for the Node come in on this queue, but the same queue also provides invalidations for the Node’s Edge Groups and Edges. This design allows us to achieve a balance between the number of queues we create and the number of subscribers to each queue.

In the example above, only clients that have an active interest in Student 513312 will be subscribed to this queue. With this mechanism, if someone POSTs a change to the number of tardies, every client which has this record cached will receive a notification. If the number of tardies is displayed on the screen, it can be updated immediately.
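The client side of this scheme can be sketched as follows; the imq.subscribe call stands in for whatever the real IMQ client library exposes (a hypothetical API, since IMQ itself is covered in a future post):

declare const imq: {
  subscribe(queue: string, onMessage: () => void): void; // hypothetical client API
};

// After a GET, subscribe to the queue named in the response's
// "updates" member, and treat every message on it as an invalidation
// of that URL's cached response.
function watchEndpoint(
  url: string,
  doc: { updates?: string },
  invalidate: (url: string) => void
): void {
  if (!doc.updates) {
    return; // most endpoints do not provide invalidation
  }
  imq.subscribe(doc.updates, () => invalidate(url)); // e.g. "imq://inv.student.student-513312"
}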

Note that under this scheme, there can be no queue associated with a Node Group. Such a queue would be used to notify all clients that, for example, a new Student Node had been added. To be useful, however, every client application would need to subscribe to the queue, and IMQ is not built to support such a large number of subscribers. There are solutions to this problem, but for the moment we do not support invalidation for Node Groups.

The discipline of REST and the specific uniform contract IMVU has adopted give us the power and flexibility to quickly create, enhance, and share back-end services. The principles behind REST (and especially the principles of uniform contract, HATEOAS, and separation of concerns) provide us with a solid framework as we continue to complete and codify our standards.

We’ll keep you up to date as things develop.

What it’s like to use Haskell

By Andy Friesen

Since early 2013, we at IMVU have used Haskell to build several of the REST APIs that power our service.

When the company started, we chose PHP as our application server language, in part because the founders expected the website to be only a small part of the business!  IMVU was primarily about a downloadable 3D client.  We needed “a website or something” to give users a place to download our client from, but didn’t expect it would have to be much more than that. This shows that predicting the future is hard.

Years later, we have quite a lot of customers, and we primarily use PHP to serve them.  We’re big enough that we run multiple subteams on separate initiatives at the same time.  Performance is becoming important to us not just because it matters to our customers, but because it can easily make the difference between buying 4 servers and buying 40 servers to support some new feature.

So, early in 2012, we found ourselves ready to look for an alternative that would help us be more rigorous.  In particular, we were ready for the idea that sacrificing a tiny bit of short term, straight-line time to market might actually speed us up in the long run.

How We Got Here

I started learning Haskell in my spare time in part because Haskell seems like the exact opposite of PHP: Natively compiled, statically typed, and very principled.

My initial exploration left me interested in evaluating Haskell at real scale.  A year later, we did a live-fire test in which we taught multiple teammates Haskell while delivering an important new feature under a deadline.

Today, a lot of our backend code is still driven by PHP, but we have a growing amount of Haskell that powers newer features. The process has been exciting, not only because we got to answer a lot of the questions that keep many people from trying Haskell, but also because it’s simply a better solution.

The experiment to start developing in Haskell took a lot of internal courage and dedication, and we had to overcome a number of quite rational concerns related to adopting a whole new language. Here are the main ones and how they worked out for us:

Scalability

The first thing we did was to replace a single service with a Haskell implementation.  We picked a service that was high-volume but was not mission critical.

We didn’t do any particular optimization of this new service, but it nevertheless showed excellent performance characteristics in the field.  Our little Haskell server was running on a pair of spare servers that were otherwise set for retirement, and despite this, each machine was handling about 20x as many requests as one of our high-spec PHP servers could manage.

Reliability

The second thing we did was to take our hands off the Haskell service and leave it running until it fell over.  It ran for months without intervention.

Training

After the reliability test, we were ready to try a live fire exercise, but we had to wait a bit for the right project.  We got our chance in early 2013.

The rules of the experiment were simple: Train 3 engineers to write the backend for an important new project and keep up with a separate frontend team.  Most of the code was to be new, so there was relatively little room for legacy complications.

We very quickly learned that we had also signed up for a lot of catch-up work to bring the Haskell infrastructure in line with what we’d had for years in PHP.  We were very busy for a while, but once we got this infrastructure out of the way, the tables turned and the front-end team became the limiting factor.

Today, training an engineer to be productive in our Haskell code is not much harder than training someone to be productive in our PHP environment.  People who have prior functional programming knowledge seem to find their stride in just a few days.

Testing

Correctness is becoming very important for us because we sometimes have to change code that predates every current developer.  We have enough users that mistakes become very costly, very quickly.  Solving these sorts of issues in PHP is sometimes achievable but always difficult.  We usually solve them with unit tests and production alerts, but these approaches aren’t sufficient for all cases.

Unit tests are incredible and great, but you’re always at the mercy of the level of discipline of every engineer at every moment. It’s easy to tell your teammates to write tests for everything, but this basically boils down to asking everyone to be at their very best every day.  People make mistakes and things slip through the cracks.

When using Haskell, we remove an entire class of defects that we would otherwise have to write tests for. The number of tests we have to write is smaller, and so there are fewer cases we can forget to write tests for.

We like unit testing and test-driven development (TDD) at IMVU, and we’ve found that Haskell is better with TDD, but also that TDD is better with Haskell.  It takes fewer tests to get the same degree of reliability out of Haskell.  The static verification takes care of quite a lot of error checking that has to be manually implemented (or forgotten) in PHP.  The Haskell QuickCheck tool is also a wonderful help for developers.

The way Haskell separates pure computations from side effects lets us build something that isn’t practical in other languages: we built a custom monad that lets us “switch off” side effects in our tests. This is incredible, because it means that trying to escape the testing sandbox breaks compilation. While we have had to fight intermittent test failures for eight years in PHP (and at times have had multiple engineers simultaneously dedicated to the problem of test intermittency), our unit tests in Haskell cannot intermittently fail.

Deployment

Deployment is great. At IMVU, we do continuous deployment, and Haskell is no exception. We build our application as a statically linked executable, and rsync it out to our servers. We can also keep old versions around, so we can switch back, should a deployment result in unexpected errors.

I wouldn’t write an OS kernel in it, but Haskell is way better than PHP as a systems language. We needed a Memcached client for our Haskell code, and rather than try to talk to a C implementation, we just wrote one in Haskell.  It took about a half day to write and performs really well. And, as a side effect, if we ever read back some data we don’t expect from memcached (say, because of an unexpected version change) then Haskell will automatically detect and reject this data.

We’ve consistently found that we unmake whole classes of bugs by defining new data types for concepts to wrap primitive types like integers and strings.  For instance, we have two lines of code that say that “customer IDs” and “product IDs” are represented to the hardware as numbers, but they are not mutually convertible.  Setting up these new types doesn’t take very much work, and it makes the type checker a LOT more helpful. PHP, and other popular dynamic server languages like JavaScript or Ruby, make doing the same very hard.

Refactoring is a breeze.  We just write the change we want and follow the compile errors.  If it builds, it almost certainly also passes tests.

Not All Sunshine and Rainbows

Resource leaks in Haskell are nasty.  We once had a bug where an unevaluated dictionary was the source of a space leak that would eventually take our servers down.  We also ran into an issue where an upstream library opened /dev/urandom for randomness, but never closed the file handle.  These issues don’t happen in PHP, with its process-per-request model, and they were more difficult to track down and resolve than they would have been in C++.

The Haskell package manager, Cabal, ended up getting in the way of our development. It lets you specify version ranges of the particular packages you want, but it’s important for everyone on the team to have exactly the same versions of every package.  That means controlling transitive dependencies, and Cabal doesn’t really offer a way to handle this precisely. For a language that is so principled about type algebra, it’s surprising that the package manager doesn’t follow suit regarding package versioning. Instead, we use Cabal for basic package installation, and a custom build tool (written in Haskell).

Hiring

I’ll admit that I was very worried that we wouldn’t be able to hire great people if our criterion was expertise in an uncommon language with a comparatively sparse industrial track record, but the honest truth is that we found a great Haskell hacker in the Bay Area after about 4 days of looking.

We had a chance to hire him because we were using Haskell, not in spite of it.

Final Thoughts

While it’s usually difficult to objectively measure things like choice of programming language or software stack, we’re now seeing fantastic, obvious productivity and efficiency gains.  Even a year later, all the Haskell code we have runs on just a tiny number of servers, and when we have to make changes to the code, we can do so quickly and confidently.

REST and Games Don’t Mix

By Jon Watte, Technical Director at IMVU, Inc.

In web development, the REST (REpresentational State Transfer) approach to services often works well to build scalable, robust systems. Unfortunately, the principles behind REST collide with some necessary requirements of games, and the end result is that many problems faced by multiplayer games cannot be solved using REST techniques.

One pillar of REST design is that all necessary data for servicing a request is included in the request, and a client imposes no load on a server when not within the context of a request. For example, a web server serves a file to you when you pass it a URL; how to serve that file may be governed by cookies sent by your web browser; but once the file is served, the server conceptually needs to do nothing else to satisfy the client. If you are building, say, a photo sharing site, then this makes a lot of sense. A request may upload a new photo for some particular user, identified through URL tokens or cookies. Another request may list photos that match some particular criteria. A third request may remove or otherwise edit metadata surrounding the photos that are uploaded, by uploading some JSON or XML document to go with the photo. All in all, each request is isolated, and the server needs to do nothing (other than store the photos on disk) between requests.

Compare this to a game of chess: each move in a game of chess isn’t a complete new document (or chess board), but instead a delta over some previous state. Moves are checked for validity before they are accepted, by being interpreted in the context of the board. You could, conceivably, remember the old board state, and have a player upload a new board state that represents the state after his move, and derive what his move was by running a diff against the old board. Unfortunately, even that is not enough to properly implement the rules of chess, because certain moves (castling, en passant pawn capture) depend on the historical state of the board. To work around this, we can include a full set of historical state with the board, as well as the actual board state, and still implement a chess game using REST techniques.

All that remains, then, is to figure out how your opponent learns about a move having taken place, and knows how to download the new board state to display it. This is something that REST techniques simply do not solve, because the infrastructure of REST depends on the direction of requests, the completeness of state transfer, and the cacheability of data. To get rid of cache problems in the chess game above, you’d have to also keep a “next move id” counter as part of the request URL, so that the state of the board, including history, is addressed by a separate URL for each new move.

Another problem with the implementation of the chess game as a REST service is that it transfers the entire game state, including the board, for each move. This uses more network bandwidth than necessary. For something simple, like a chess board, this doesn’t really matter that much. Broadband makes bandwidth cheap, and even a cell phone will transfer 50 kilobytes of data in less than a second these days. However, for more complex games, like a 3D role-playing game or a real-time military simulation game, the amount of data transferred per frame, and the rate at which frames need to be transferred, makes REST bandwidth requirements a real problem.

However, that’s not the worst part. Some games have rules that require secrecy. For example, in a game of Texas Hold’em poker, some state is known, such as what the community cards are and what the amounts of the pot and each player’s chip stack are. Other state, though, is hidden. I do not know what cards you have in your hand, and to avoid cheating, I should have no way of finding this out until and unless you reveal your hand at the end of the game. What cards each of us holds is state that is shared between only us and the server. This means that a player cannot upload a “new game state” that includes the cards in each hand. Yet, to determine the winner, all the state is required. Additionally, the deck that cards are being dealt from is secret state that only the server knows. Thus, at a minimum, to apply REST to our poker game, we’d have to decompose the game into many separate states (each hand, the deck, the pot, each player’s chip stack, and so on), and we’d have to enforce separate access control for each of these values. More troubling is the fact that the cards in a player’s hand are changed by the dealer: an action someone else takes affects what my state is. Again, REST does not support server-to-client push, nor does it generally support “active” state, so implementing even a simple game like poker requires us to break the REST mold.

The World Wide Web is a large instance of largely RESTful services. Various work-arounds exist to get around the limitations of the REST architecture, such as long-poll XMLHttpRequests from JavaScript, chunked-encoding IFRAME script execution, and so on. The main reason to use the Flash player or other browser plug-ins is to break the REST mold. Once games expand beyond the simple rules of something like poker or chess, the contortions necessary to work around the limitations of REST become byzantine enough that the benefits of running within a browser may be outweighed by the costs, in development, deployment, and operation alike. You simply can’t describe a typical server instance of a game like World of Warcraft using a REST API and expect any normal consumer network or computer hardware to keep up.

Why does this matter? The World Wide Web is built on REST. New web standards that allow us to do more and more with the internet browser are driven by the same people who developed REST. As a case in point, the HTML5 standards umbrella does not address streaming media. You can “play” a sound or video URL, and the implementation of those URLs may optimize by not delivering the data until you need it, but the only way to “record” sound or video is to record into a static piece of data, which can later be uploaded. This means that voice or video chat cannot be done in a web browser without relying on third party plug-ins, such as the Flash player. Additionally, HTML5 describes a mechanism called Web Sockets to try to work around the server push limitations of REST, but that technology has run into significant implementation problems, and will not be part of the round of web browsers releasing in the first half of 2011.

It’s clear that people want to play games and do other interactive, bidirectional activities with their network-connected computers. The success of networked games and the success of even non-interactive games like FarmVille or Backyard Monsters leave this without a doubt. It would be great if those who understand interactive entertainment (such as game makers) could make friends with those who are currently driving the web, and find common ground to allow for a bidirectional, interactive, real-time world wide web of human communications. HTML6, anyone? 🙂