The Real-time Web in REST Services at IMVU

By Jon Watte, VP Technology @ IMVU

IMVU has built a rich, graph-shaped REST (REpresentational State Transfer) API (Application Programming Interface) to our data. This data includes a full social network, as well as e-commerce, virtual currencies, and the biggest 3D user generated content catalog in the world. This post discusses how IMVU addresses two of the bigger draw-backs of REST-based service architectures for real-time interactive content: Cache Invalidation (where users want to know about new data as soon as it becomes available,) and Request Chattiness (where request latency kills your performance.)

Cache Invalidation

REST principles like cacheability and hypertext-based documents work great for exposing data to a variety of clients (desktop, web, and mobile,) but runs into trouble when it meets the expectation of real-time interaction. For example, when a user changes their Motto in their profile, they would like for the world to see the new Motto right away — yet, much of the scalability wins of REST principles rely on caching, and the web does not have a good invalidation model. Here is an illustration of the problem:

At 10:03 am, Bob logs in and the client application fetches the profile information about his friend Alice. This data potentially gets cached at several different layers:

  • in application caches at the server end, such as Varnish or Squid
  • in content delivery network caches at the network edge, such as Akamai or Cloudfront
  • in the user’s own browser (every web browser has a local cache)

Let’s say that our service marks the data as cacheable for one hour.

At 10:04 am, Alice updates her Motto to say “there’s no business like show business.”

At 10:05 am, Alice sends a message to Bob asking “how do you like my motto?”

At 10:06 am, Bob looks at Alice’s profile, but because he has a stale version in cache, he sees the old version. Confusion (and, if this were a TV show, hilarity) ensues.

HTTP provides two options to solve this problem. One is to not do caching at all, which certainly “solves” the problem, but then also removes all benefits of a caching architecture. The other is to add the “must-validate” option to the cache control headers of the delivered data. This tells any client that, while it might want to store the data locally, it has to check back with the server to see whether the data has changed or not before re-using it. In the case that data has not changed, this saves on bytes transferred — the data doesn’t need to be sent twice — but it still requires the client to make a request to the server before presenting the data.

In modern web architectures, while saving throughput is nice, the real application performance killer is latency, which means that even the “zero bytes” response of checking in with the server introduces an unacceptable cost in end-user responsiveness. Cache validation and/or E-tags might sound like a big win, but for a piece of data like a “motto,” the overhead in HTTP headers (several kilobytes) dwarfs the savings of 30 bytes of payload — the client might as well just re-get the resource for approximately the same cost.

Another option that’s used in some public APIs is to version all the data, and when data is updated, update the version, which means that the data now has a new URL. A client asking for the latest version of the data would then not get a cached version. Because of HATEOAS (Hypertext As The Engine Of Application State) we would be able to discover the new URL for “Alice’s Profile Information,” and thus read the updated data. Unfortunately, there is no good way to discover that the new version is there — the client running on Bob’s machine would have to walk the tree of data from the start to get back to Alice’s new profile link, which is even more round-trip requests and makes the latency even worse.

A third option is to use REST transfer for the bulk data, but use some other, out-of-band (from the point of view of the HTTP protocol) mechanism to send changes to interested clients. Examples of this approach include the Meteor web framework, and the MQTT based push approach taken by Facebook Mobile Messenger. Meteor doesn’t really scale past a few hundred online users, and has an up-to-10-seconds-delay once it’s put across multiple hosts. Even with multiple hosts and “oplog tailing,” it ends up using a lot of CPU on each server, which means that a large write volume ends up with unacceptably low performance, and a scalability ceiling determined by overall write load, that doesn’t shard. At any time, IMVU has hundreds of thousands of concurrent users, which is a volume Meteor doesn’t support.

As for the MQTT-based mobile data push, Facebook isn’t currently making their solution available on the open market, and hadn’t even begun talking about it when we started our own work. Small components of that solution (such as MQTT middleware) are available for clients that can use direct TCP connections, and could be a building block for a solution to the problem.

The good news is that we at IMVU already have a highly scalable, multi-cast architecture, in the form of IMQ (the IMVU Message Queue.) This queue allows us to send lightweight messages to all connected users in real-time (typical latencies are less than 10 milliseconds plus one-way network delay.) Thus, if we can know what kinds of things that a user is currently interested in seeing, and we can know whether those things change, we can let the user know that the data changed and needs to be re-fetched.

Jonblog1

 

The initial version of IMQ used Google Protocol Buffers on top of a persistent TCP connection for communications. This works great for desktop applications, and may work for some mobile applications as long as the device is persistently connected, but it does not work well for web browsers with no raw TCP connection ability, or intermittently connected mobile devices. To solve for these use cases, we added the ability to connect to IMQ using the websockets protocol, and additionally to fall back to an occasionally polled mail-drop pick-up model over HTTP for the worst-case connectivity situations. Note that this is still much more efficient than polling individual services for updated data — IMQ will buffer all the things that received change notifications across our service stack, and deliver them in a single HTTP response back to the client, when the client manages to make a HTTP request.

To make sure that the data for an endpoint is not stale when it is re-fetched by the client, we then mark the output of real-time updated REST services as non-cacheable by the intermediate caching layers. We have to do this, because we cannot tell the intermediate actors (especially, the browser cache) about the cache invalidation — even though we have JavaScript code running in the browser, and it knows about the invalidation of a particular URL, it cannot tell the browser cache that the data at the end of that URL is now updated.

Instead, we keep a local cache inside the web page. This cache maps URL to JSON payload, and our wrapper on top of XMLHttpRequest will first check this cache, and deliver the data if it’s there. When we receive an invalidation request over IMQ, we mark it stale (although we may still deliver it, for example for offline browsing purposes.)

Request Chattiness (Latency)

Our document-like object model looks like a connected graph with URL links as the edges, and JSON documents as the nodes. When receiving a particular set of data (such as the set of links that comprises my friends list) it is very likely that I will immediately turn around and ask for the data that’s pointed to by those links. If the browser and server both support the SPDY protocol, we could pre-stuff the right answers into the SPDY connection, in anticipation of the client requests. However, not all our clients have this support, and not even popular server-side tools like Nginx or Apache HTTPd support pre-caching, so instead we accomplish the same thing in our REST response envelope.

Instead of responding with just a single JSON document, we respond with a look-up table of URLs to JSON documents, including all the information we believe the client will want, based on the original request. This is entirely optional — the server doesn’t have to add any extra information; the client doesn’t have to pay attention to the extra data; but servers and clients that are in cahoots and pay attention will end up delivering a user experience with more than 30x fewer server round-trips! On internet connections where latency matters more than individual byte counts, this is a huge win. On very narrow-band connections (like 2G cell phones or dial-up modems,) the client can provide a header that tells the server to never send any data more than what’s immediately requested.

Jonblog2

Because the server knows all the data it has sent (including the speculatively pre-loaded, or “denormalized” data,) the server can now make arrangements for the client to receive real-time updates through IMQ when the data backing those documents changes. Thus, when a friend comes online, or when a new catalog item from a creator I’m interested in is released, or when I purchase more credits, the server sends an invalidation message on the appropriate topic through the message queue, and any client that is interested in this topic will receive it, and update its local cache appropriately.

Putting it Together

This, in turn, ties into a reactive UI model. The authority of the data, within the application, lives in the in-process JSON cache, and the IMQ invalidation events are received by this cache. The cache can then know whether any piece of UI is currently displaying this data; if so, it issues a request to the server to fetch it, and once received, it updates the UI. If not, then it can just mark the element as stale, and re-fetch it if it’s later requested by some piece of UI or other application code.

The end-to-end flow is then:

  1. Bob loads Alice’s profile information
  2. Specific elements on the screen are tied to the information such as “name” or “motto”
  3. Bob’s client creates a subscription to updates to Alice’s information
  4. Alice changes her motto
  5. The back-end generates a message saying “Alice’s information changed” to everyone who is subscribed (which includes Bob)
  6. Bob’s client receives the invalidation message
  7. Bob’s client re-requests Alice’s profile information
  8. The underlying data model for Alice’s profile information on Bob’s display page changes
  9. The reactive UI updates the appropriate fields on the screen, so Bob sees the new data

All of these pieces means re-thinking a number of building blocks of the standard web stack, which means more work for our foundational libraries. In return, we get a more reactive web application, where anything you see on the screen is always up to date, and changes respond quickly, both through the user interface, and through the back-end, with minimal per-request overhead.

This might seem complex, but it ends up working really well, and with the proper attention to library design for back- and front-end development, building a reactive application like this is no harder than building an old-style, slow polling (or manually refreshed) application.

It would be great if SPDY (and, future, HTTP2) could support pre-stuffing responses in the real world. It would also be great if the browser DOM had an interface to the local cache, so that the application could tell the browser about a particular URL being invalidated. However, the solution we’ve built up achieves the same benefits, using existing protocols, which goes to show the fantastic flexibility and resilience inherent in the protocols and systems that make up the web!

 

How IMVU Builds Web Services: Part 3

In this 3-part series, IMVU senior engineer Bill Welden describes the means and technology behind IMVU’s web services.

Part 3: Documents and Links

In the previous entry in this series I described how IMVU uses a structured network model to implement the uniform contract for our REST services and showed how they might apply to a set of services for a hypothetical high school scheduling system.

Engblog1102-01

Under this model we have Node Groups (represented by the different colors of circles in this diagram), Nodes (the circles themselves), Edge Groups (the rounded boxes hanging off of the circles) and Edges (the lines connecting the circles).

We model these various notions using HTTP documents linked together with URLs. There are four kinds of documents, corresponding to each of the four concepts.

A Node Group determines the properties and relationships of the Nodes, and the corresponding Node Group Document contains a list of URLs, each locating a Node.

A Node, through its properties and relationships, is the basic repository of information in the network. The Node Group Document contains the values of the Node’s properties, as well as a list of URLs, each locating an Edge Group.

An Edge Group groups together all of the links that go to a particular type of Node. The corresponding  Edge Group Document, specific to a Node, contains a list of URLs, each locating an Edge within that group.

An Edge is a link from its Node to another Node, usually of a different class. Edges can also contain properties, as we saw last time. An Edge Document therefore contains a URL for the Node at the other end of the Edge, as well as the values of any properties of the Edge.

Each type of document has a distinctive URL.

The URL for a Node Group Document consists of a single path segment, a singular noun describing the Node Group:

https://api.imvu.com/course

The URL for a Node consists of two path segments: the Node Group URL extended by a segment consisting of a unique identifier for the Node. The Node identifier can be any string, but we encourage the use of Node Identifiers which contain the name of the Node Group:

https://api.imvu.com/course/course-1035

The URL for an Edge Group consists of three segments: the Node URL extended either by a plural noun naming the Node Group that the Edge Group points off to, or by a noun describing the relationship created by the Edge Group:

https://api.imvu.com/course/course-1035/teachers
https://api.imvu.com/course/course-1035/roster

The URL for an Edge consists of four segments: the Edge Group URL extended by an identifier for the Edge. This identifier need only be unique within the Edge Group. Sometimes (for convenience) it is identical to the unique identifier of the Node that the Edge points to, but this is not required:

https://api.imvu.com/course/course-1035/teachers/1
https://api.imvu.com/course/course-1035/roster/student-5331

These URL formats are a constraint on the service design – a strong suggestion but not a strict requirement.

Note especially that, according to the principle of HATEOAS (Hypertext As The Engine Of Application State), the client cannot depend on any specific format, and is not allowed to construct these URLs or attempt to extract information from them. All URLs required by the client must be obtained whole, as opaque strings, from the response to an earlier service request.

Specifically, URLs returned from the server must be fully qualified, including the protocol and server name (“https://api.imvu.com”). However, in our own internal discussions and here in this presentation, we will often leave off the protocol and server name for clarity. So:

/course
/course/course-1035
/course-1035/teachers

and so forth. Wherever one of these relative URLs appears in the discussion below, the actual implementation will return a fully qualified version.

One of the biggest differences between REST services and the earlier Remote Procedure Call style is that RPC APIs define as many verbs as required by or convenient for implementation of the application. Designing RPC services is primarily about designing new verbs and specifying their parameters and their semantics.

With IMVU Rest services, the verbs are specified, and there are only five of them. Designing a service consists of designing a set of Nodes and Edges, together with their properties and relationships, in such a way that these five verbs can provide all of the required functionality.

The verbs are

GET (any endpoint)
POST (to a Node or Edge),
POST (to a Node Group or an Edge Group),
POST (to a Node Group, including Edges), and
DELETE (a Node or Edge)

GET retrieves the document associated with the URL, which can be a Node Group Document, Node Document, Edge Group Document or an Edge Document. GETs must be nullipotent, which is to say that they can have no side-effects.

POST, when applied to a Node or Edge URL makes changes to the properties and relations of the Node or Edge. Such POSTs must be idempotent, which means that sending to POST twice must have exactly the same effect as sending it once.

When POST is applied to a Node Group or Edge Group, it adds a new Node or a new Edge. Such POSTs will not be idempotent, since sending the POST twice will add two new Nodes or two new Edges.

There is a third sort of POST, a POST specifically to a Node Group which adds a new Node, but which also contains data for a set of new Edges to be added along with the Node (including Edge Groups as necessary).

Finally there is DELETE, which is used to delete a Node or an Edge. The implementation of DELETE must be idempotent.

GET returns the document associated with the URL, but in order to minimize round trips the service is allowed to return any additional documents that it thinks the client may soon need.

You can see this in the format of the JSON document returned by a GET:

{
  "id": "/course/course-1035",
  "status": "success",
  "denormalized":
  {
   "/course/course-1035": { … } 
   "/course/course-1035/teachers": { … }
   "/course/course-1035/teachers/1": { … }
   … etc …
  }
}

There is a status, “success” or “failure”, and if the GET fails, some information about why, but if it succeeds a package of endpoints is included, grouped under the response member “denormalized”.

In this example the client has asked for a Course, and the server has chosen to return not only the Course Node, but the Edge Group and Edges for all of the teachers that teach that course.

Each of the four kinds of documents in these responses has a specific JSON format.

A Node Group Document has a member “nodes” which is an array of URLs, one for each Node.

{
"nodes": [
 "/course/course-1035",  
 "/course/coures-1907", 
 ...
 ]
}

If the client wants to present the list of Nodes in a certain order, it is responsible for sorting them itself. It cannot depend on the order of entries in this array.

A Node Document has two members. The “data” member contains the Node’s properties. The “relations” member contains the URLs that link the Node to its Edge Groups and to other Nodes in the system.

{
 "data": {
 "description": "Algebra I",
 "starting_time": "10:00"
 }
"relations": {
 "teacher": "/teacher/teacher-372",
 "roster": "/course/course-1035/roster"
 }
}

Names in these objects represent a contract with the client. Based on the name, the server guarantees the semantics of the value including its type, the allowed values, the meaning of the value and if it is a link, the Node Group of the Node it points to.

Note that here “teacher” is a link to a Teacher Node. This design precludes the possibility of team teaching where there are two teachers for a class.

Links directly from one Node to another create a one-to-N relationships. One Teacher to many Courses. As a rule, however, relationships are more often N-to-N than not, and such restrictions on cardinality are a red flag. Not wrong, necessarily, but something that may be called out in design reviews.

The design could be made to support many teachers per course by implementing an Edge Group “teachers” for Course Nodes. This is a more flexible design, because the cardinality restrictions, say a maximum of two teachers, or allowing multiple teachers only for certain courses, can be implemented on the server, where they are easier to change.

An Edge Group Document has a member “edges” which is a JSON array of URLs for the Node’s Edges.

{
  "edges": [
   "/course/course-1035/roster/student-5331",
   "/course/course-1035/roster/student-1070",
   ...
  ],
}

Again, the client cannot depend on the order that the Edges come back from the server.

Finally, an Edge Document has an optional “data” member for when the Edge has properties, and a “relations” member which contains the URL for the Node at the other end of the Edge.

{
 "data": {
  "tardies": 1,
 },
 "relations": {
  "ref": "/student/student-5331"
 }
}

Here is an Edge between one Course (course-1035) and one Student (student-513312).

Edges go both ways. Here are the Edges between that same student and his courses.

{
 "edges": [
   "/student/student-53312/schedule/01",
   "/student/student-53312/schedule/02",
   ...
 ],
}

Now it’s not required to implement every Edge Group implied by the Node/Edge model, so it’s acceptable to implement only half of this Edge relationship without its symmetrical partner. When we have an Edge Group in our design, however, we think carefully about the symmetrical Edge Group. It’s often a very interesting view on the data, and it’s seldom very difficult to implement.

Note that the Edge document itself can come back in two different ways, but the underlying database record will be the same. Here is one of the Edge documents linked from the Edge Group above:

{
 "data": {
  "tardies": 1,
 },
 "relations": {
  "ref": "/course/course-1035",
 }
}

Whether you look at this link from the Student or the Course perspective, the count of tardies is the same data element.

Nodes and Edges are updated using the same JSON document format as the response from a GET, though there is no denormalization envelope.

Here is the document for a POST to a Student Node (/student/student-513312) intended to update the student’s birth date and counsellor (documents for POSTs to Edges look pretty much the same):

{
 "data": {
  "birth_date": "5/11/96",
 },
 "relations": {
  "counsellor": "/teacher/teacher-121"
 }
}

If a property is missing from the data or relations sections, it is left unchanged in the Node or Edge.

The response which comes back from a POST to a Node or Edge is the same as the response from a GET to that Node or Edge, including additional denormalized data at the discretion of the server.

New Nodes and Edges are added by posting to the corresponding Node Group or Edge Group  and providing the data and relations for the new Node or Edge in the body of the POST. This is a document POSTed to /course/course-1035/roster to create a new Edge, enrolling a new student in a course:

{
 "relations": {
   "ref": "/student/student-10706"
 }
}

All of the required properties must be included in this kind of POST. A POST creating a new Node looks pretty much the same.

Again, the response to this kind of POST is the same as you would receive from a GET to the newly created Node or Edge (though you will only know the id of the new Node or Edge once the response comes back).

It is often useful to be able to add a Node and a number of Edges in one POST request. The response would include the new Node, a new Edge Group and all of the specified Edges. Here is a POST to /student which adds a new Student along with Enrollment Edges in two courses:

{
"data": {
  "name": "Andrew",
    "birth_date": "7/21/95"
  },
  "edges": {
    "schedule": [ 
      {
        "relations": {
          "ref": "/course/course-1035"
        }
      }, 
      {
        "relations": {
          "ref": "/course/course-2995"
        }
      },
    ]
  }
}

The only other verb is DELETE, which can be applied to a Node or an Edge by providing the URL of the Node or Edge. No document is passed with a DELETE request. DELETEing a Node will also delete all of the Edge Groups and Edges for that Node.

The service implements much of the functionality of an application by implementing business rules which add (within limits) to the semantics of POSTs and DELETEs.

Business rules cannot change the fundamental semantics of these verbs. A successful POST to a Node Group or Edge Group must still add a new Node or Edge. A successful POST to a Node or Edge must still make the specified modifications to the Node or Edge, and a successful DELETE must still result in the specified deletion.

In particular, POST to a Node or Edge must remain idempotent. POSTing twice to a Node or Edge must have exactly the same effect as POSTing once.

Business rules, however, can reject requests which violate a desired constraint. We might want to disallow deletion of students prior to their 21st birthday.

Business rules can also limit operations to specific users or classes of user. Our school system might have an administrator class who are responsible for placing students in classes. An attempt to add a Student Edge to a class would be rejected if the client making the request was not identified as an administrator.

Business rules also have broad authority to make additional changes to the back end data structures based on a POST or DELETE. We could add a property to Student showing the number of classes each student is enrolled in, and then when the client POSTs to the schedule Edge Group to add a new Course, increment this count in the Student Node.

As I mentioned earlier, the client is not allowed to construct URLs, but is required to retrieve them from the server. It is, however, allowed to append query parameters. These must have no effect on the query other than limiting the set of records returned.

Here are some examples.

GET /student?name=piers*
GET /student/student-10706/schedule?tardies=0
GET /student?schedule.tardies=gt.0

In the first two cases the query is based on a property of the Node or Edge. The first is intended to retrieve all students whose name begins with “piers”, the second to get all courses where the given student is currently enrolled and has no tardies.

The third example shows a query that would be achieved with a join in SQL. It is imagined to retrieve all students which have been late for at least one class.

Note that the HTTP query syntax doesn’t really support the kind of database queries we want to perform very well, and because of that we have not yet settled on standards for specifying queries (and different projects have come up with different solutions). We are still striking out into new territory, but remain clear that we want to be working toward a company wide solution in the long term.

Some Node Groups contain a lot of Nodes. At IMVU we have a Node Group for the tens of millions of products in our catalog. We don’t currently support querying our product Node Group, but if we did, we would have to get a response back something like this:

{ …
  {
    "nodes": [
      "/product/product-10057",
      "/product/product-10092",
      ...
      "/product/product-11033",
    ],
    "next": "/product?offset=50",
  }
}

The list of Nodes includes only the first fifty, but provides us with a URL – with a built-in query parameter – which allows us to retrieve another group of fifty.

There are a number of services which do paging like this, but note that offsets don’t work very well for paging when Nodes are often being added and deleted, since the offsets associated with specific Nodes can change. Paging is another area in which our standards are still under development.

Finally, in order to allow clients to cache the responses they receive we provide a way for the service to let the client know when cached URL responses are no longer valid.

This involves the use of IMQ, IMVU’s proprietary system for efficiently pushing data to clients in real time. In our case, the data is simply a notification that an earlier response to a particular URL is no longer valid. We’ll go into detail on IMQ and how it works in a future post.

Most endpoints do not provide invalidation. IMVU product information, for example, doesn’t change often enough to make the overhead associated with invalidation worthwhile.

When invalidation is available, the response to an endpoint will include a member called “updates”, which provides the information necessary to subscribe to the appropriate IMQ queue. Here is a response to a GET of a particular enrollment Edge 
(/student/student-513312/schedule/01):
{
 {
    "data": { 
      "tardies": 1,
    },
    "relations": {
      "ref": "/course/course-1035",
    }
    "updates": "imq://inv.student.student-513312"
  }
}

Invalidations for different endpoints come in on different queues, and it is the server’s prerogative to choose the queue for each endpoint.

Currently, for the endpoints which support invalidation, we have one queue per Node. Invalidations for the Node come in on this queue, but the same queue also provides invalidations for the Node’s Edge Groups and Edges. This design allows us to achieve a balance between the number of queues we create and the number of subscribers to each queue.

In the example above, only clients that have an active interest in Student 513312 will be subscribed to this queue. With this mechanism, if someone POSTs a change to the number of tardies, every client which has this record cached will receive a notification. If the number of tardies is displayed on the screen, it can be updated immediately.

Note that under this scheme, there can be no queue associated with a Node Group. Such a queue would be used to notify all clients that, for example, a new Student Node had been added. To be useful, however, every client application would need to subscribe to the queue, and IMQ is not built to support such a large number of subscribers. There are solutions to this problem, but for the moment we do not support invalidation for Node Groups.

The discipline of REST and the specific uniform contract IMVU has adopted give us the power and flexibility to quickly create, enhance and share back end services. The principles behind REST (and especially the principles of uniform contract, HATEAOS and separation of concerns) provide us with a solid framework as we continue to complete and codify our standards.

We’ll keep you up to date as things develop.

How IMVU Builds Web Services: Part 1

In this 3-part series, IMVU senior engineer Bill Welden describes the means and technology behind IMVU’s web services.

Part 1: REST

REST, or Representational State Transfer, is the model on which the protocols of the World Wide Web are built. It was originally described in the year 2000 in Roy Fielding’s doctoral dissertation, and was developed in order to impose some discipline on distributed hypermedia systems.

The benefits of REST have since proven valuable in defining APIs. Everybody has them these days: Twitter, Facebook, eBay, Paypal. And IMVU has been using REST as a standard for its back end services for many years.

Because the fit between REST (which is framed in terms of documents and links) and database-style applications (tables and keys) is not perfect, everybody means something slightly different when they talk about REST services.

Here is a rundown of the things that IMVU does under the aegis of REST, and the benefits that accrue:

  1. Virtual State Machine

In his dissertation, Fielding’s describes a REST API as a virtual state machine:

The name ‘Representational State Transfer’ is intended to evoke an image of how a well-designed Web application behaves: a network of Web pages (a virtual state-machine), where the user progresses through the application by selecting links (state transitions), resulting in the next page (representing the next state of the application) being transferred to the user and rendered for their use.

With each network request, a representation of the application state is transferred, first to the server, and then (as modified) back to the server. This is “representational state transfer” or  REST.

Note that this is a two-phase thing. Once a request goes out from the client, the state of the application is “in transition”. When the response comes back from the server, the state of the client is “at REST”.

By defining a rigorous protocol which frames outgoing requests from clients in terms of following structured links, and the server’s response in terms of structured documents, REST makes it possible to build general, powerful support layers on both the client and the server. These layers then offload much of the work of implementing new services and clients.

  1. Separation of Concerns

Separation of concerns means clients are responsible for interface with the user and servers are responsible for the storage and integrity of data.

This is important in building useful (and portable) services, but it is not always easy to achieve. The aspect ratio, and even the resolution of graphic images ought to be under the control of the client. However, it has been easy at times to design a back-end service for a specific application. For example, static images such as product thumbnails might be provided only in one specific aspect ratio,  limiting the service’s usefulness for other applications. This narrow view of the service limits the ability to roll out new versions and to share the service across products.

When we designed our Server Side Rendering service (which delivers a two-dimensional snapshot of a specific three-dimensional product model), we went to some pains to place this control – height and width of desired image – in the hands of the client through a custom HTTP header included with the service request.

  1. Uniform Contract

A uniform contract means that our services comply with a set of standards that we publish, including a consistent URI syntax and a limited set of verbs. I will talk in more detail about these standards in a future post, but for now note that they allow much of what would otherwise be service specific code – protocols for navigation, for manipulation of data, for security and so forth – to be implemented in generic layers. Just as important, they allow much of the documentation for our services to be consolidated in one place, so that designing becomes a more streamlined and focused process.

With a couple of very specific exceptions, the documents we send and receive are JSON, and structured in a very specific way. In particular, links are gathered together in one place within the document so that the generic software layers are better able to support the server code in generating them, and the application code in following them. This stands in contrast to web browsers which must be prepared to deal with a huge variety of text, graphic and other embedded object formats, and to find links scattered randomly throughout them.

For the API designer, these standards limit the ability to structure data in responses as freely as we might want. Our standards are still in a bit of flux as we try to navigate this tension between freedom of API design and the needs of the generic software layers we have built.

The fact, however, that our standards are based on JSON documents and are requested and provided using HTTP protocols means that the resulting services can be implemented using varied technologies (we have back end services in PHP, Haskell, Python and C++) and accessed from a broad variety of platforms.

The uniform contract also allows ancillary agents to extract generic information from messages, without knowing the specifics of the service implementation. For example, with a single piece of code, IMVU has configured its Istatd statistics collection system to collect real-time data on the amount of time each of our REST points is taking to do its work, and this data collection will now occur for every future endpoint without any initiative on the part of service implementers. The ready availability of these statistics allows for greater reliability through improved response time to outages.

In addition, the design of this uniform contract means that each service addresses a well-defined slice of functionality, allowing new services can be added in parallel with a minimum of disruption to existing code.

  1. HATEOAS

HATEOAS (Hypertext As The Engine Of Application State) is the discipline of treating all resource links as opaque to the clients that use them. Except for one well-known root URL, URLs are not hard coded, and they’re not constructed or parsed by clients. Links are retrieved from the server, and all navigation is done by following links. This allows us to move resources dynamically around our cluster (or even outside if necessary) and to add, refactor and extend services even while they are in active use, changing back-end software only, so that new releases of client software are required less often.

Stateless

Under the REST discipline, applications are stateless – or rather the entire state needed to process a service request is sent with the request (much of it in the form of an identity token sent as a cookie). In this way, between one request and the next, the server needs to know nothing about the client’s application state, and servers do not need to retain state for every active client, which allows us to distribute requests across our cluster according to load.

In this way servers do not need to know what is going on with every active client. Now servers no longer depend on the number and states of all the clients they might be asked to service, allowing the server code to be simpler and to scale well. In addition, we can cache and stage requests at different points in our cluster, keeping them close to where they will be serviced. Responses can also be cached (keyed again by the URL of the request).

Finally, though it is not a stipulation of REST, we use IMVU’s real-time IMQ message queue to push notifications to clients (keyed by the service URL) when their locally cached data becomes invalid. This gives client the information needed to update stale data, but also allows real-time updating of data that is displayed to the user. When one user changes the outfit of their avatar, for example, all of the users in that chat room will see the updated look.

  1. Internet Scale

The internet comes with challenges, and REST provides us with a framework for addressing those challenges in a systematic way, but it is no panacea.

Fielding uses the term “anarchic scalability” to describe these challenges – by which he means the need to continue operating in the face of unanticipated load, or when given malformed or maliciously constructed data.

At IMVU we take issue of load into account from the outset when designing our services, but the internet, as an interacting set of heterogeneous servers and services, often displays complex emergent behavior. Even within our own cluster we have over fifty different kinds of servers (what we call roles), each kind talking to a specific set of other roles, depending on its needs. Under load, failures can cascade and feedback loops can keep the dysfunctional behavior in place even after the broader internet has stopped stressing the system.

Every failure triggers a post-mortem inquiry, bringing together all of the relevant parties (IT, engineering, product management) to establish the history and the impact of the failure. The statistics collected and recorded by Istatd are invaluable in this process of diagnosis.

Remedies are designed to address not only the immediate dysfunctional behavior, but a range of possible similar problems in the future, and can include adding new hardware or modifying software to throttle or redirect traffic. Out of these post-mortems, our design review process continues to mature, adding checklist items and questions which require designers to think about their prospective services from many different perspectives, and particularly through the lens of past failures.

There are times when a failure cannot be fully understood, even with the diagnostic history available to us. Istatd makes it easy to add new metrics which may help in understanding future failures of a similar type. We monitor more than a hundred thousand metrics, and the number is growing.

The chaos injected by the internet includes the contents of packets. Our code, on both the client and server side, is written so that it makes no assumptions about the structure of the data it contains. Even in the case when a document is constructed by our service and consumed by a client we have written, the data may arrive damaged, or even deliberately modified. There is backbone hardware, for example, which attempts to insert advertisements into web pages.

Up next: Part 2 — Nodes and Edges

In my next post I will describe the process we go through to develop a service concept, leading up to implementation.