How IMVU Builds Web Services: Part 2

In this 3-part series, IMVU senior engineer Bill Welden describes the means and technology behind IMVU’s web services.

Part 2: Nodes and Edges

The Uniform Contract constraint of the REST discipline means that IMVU’s web services can provide functionality to many different web applications in a robust and reusable manner.

There are, of course, many different ways of implementing a uniform contract. At IMVU, we have chosen a structured network model – a graph consisting of nodes, which represent the objects of interest to our users, and edges representing the relationships among those objects. The functionality of our web services is then framed in terms of inspecting and manipulating this graph.

For example, here is a network diagram showing some high school enrollment data.

[Diagram: two students (Jan and Chris) and the classes they are enrolled in]

Two students, Jan and Chris, are each enrolled in two classes. They are both in the same Algebra class, but in different English classes.

Each of the students and each of the classes is represented by a node, shown here as a circle. The relationships between the students and the classes are represented by edges, shown here as lines connecting the circles.

Nodes come in different types. In this diagram we have student nodes and class nodes connected by edges representing course selections.

Nodes have properties, and the type of the node determines the properties. Each student, for example, has a name and an age. Each class has a starting time.

All of the nodes of a given type constitute a node group. This example has two node groups so far, one for students and one for classes. The node group determines the properties of the nodes in the group.

Node groups are important — they distinguish a structured network model from an unstructured one. Network models are usually assumed to be unstructured, where any sort of data can be associated with any node. In fact, data and links are often intermixed, as in a hyperlinked network of text documents.

IMVU REST services are built on a network model, but a structured one, where every node belongs to a node group, and it is the node group – not the node – that determines the specific properties found in each node. In this way, a structured network model is like a relational data model.

Let’s add some teachers.

[Diagram: teachers added to the enrollment graph]

Janos teaches the 10 AM Algebra class and Mariska teaches both of the English classes. It is useful to group the edges attached to each node into edge groups:

[Diagram: edges grouped into edge groups]

All of the edges in an edge group attach, at the opposite end, to the same type of node. Each class node, for example, has two edge groups: one for students and one for teachers.

Because edges create relationships between nodes, the edge groups are sometimes named according to the role that the links play. Here we call the edge group listing the students in the class the “roster”, and the symmetrical view the student’s “schedule”. More often, though, edge groups are named simply by the type of node they link to, as with “teachers” here.

We will see later that edges are implemented using directional links – but the edges themselves are not directional, and can be viewed from either end. Jan is enrolled in Algebra and Algebra has Jan as a student.

Nodes have properties, but edges can also have properties. The number of times the student has been late to class is a property of the edge between the student and the class. It is not a property of the student, because the student will have many classes and may not be late to all of them, nor is it a property of the class, because many students will be enrolled in the class, and not all of them will be late.

The edge group determines what kind of node is at the other end, whether its edges have properties, and if so what those properties are.

We model node groups, nodes, edge groups and edges using JSON documents served over HTTPS, and link them using URLs. In my next post, I will go into detail on the structure of these URLs and documents, and follow through on the high school enrollment example.
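
To make this concrete, a class node with its two edge groups might be served as a JSON document shaped roughly like the sketch below (written here as a Python literal). The URLs, field names and overall layout are illustrative only, not IMVU’s actual schema, which the next post describes.

class_document = {
    "data": {
        "start_time": "10:00",   # properties determined by the class node group
    },
    "relations": {
        # each edge group is exposed as a link to the related nodes
        "roster":   "https://api.example.com/class/algebra-10/roster",
        "teachers": "https://api.example.com/class/algebra-10/teachers",
    },
}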

Up next: Part 3 — Documents and Links

How IMVU Builds Web Services: Part 1

In this 3-part series, IMVU senior engineer Bill Welden describes the means and technology behind IMVU’s web services.

Part 1: REST

REST, or Representational State Transfer, is the model on which the protocols of the World Wide Web are built. It was originally described in the year 2000 in Roy Fielding’s doctoral dissertation, and was developed in order to impose some discipline on distributed hypermedia systems.

The benefits of REST have since proven valuable in defining APIs. Everybody has them these days: Twitter, Facebook, eBay, PayPal. And IMVU has been using REST as a standard for its back-end services for many years.

Because the fit between REST (which is framed in terms of documents and links) and database-style applications (tables and keys) is not perfect, everybody means something slightly different when they talk about REST services.

Here is a rundown of the things that IMVU does under the aegis of REST, and the benefits that accrue:

  1. Virtual State Machine

In his dissertation, Fielding describes a REST API as a virtual state machine:

The name ‘Representational State Transfer’ is intended to evoke an image of how a well-designed Web application behaves: a network of Web pages (a virtual state-machine), where the user progresses through the application by selecting links (state transitions), resulting in the next page (representing the next state of the application) being transferred to the user and rendered for their use.

With each network request, a representation of the application state is transferred, first to the server, and then (as modified) back to the client. This is “representational state transfer”, or REST.

Note that this is a two-phase thing. Once a request goes out from the client, the state of the application is “in transition”. When the response comes back from the server, the state of the client is “at REST”.

By defining a rigorous protocol which frames outgoing requests from clients in terms of following structured links, and the server’s response in terms of structured documents, REST makes it possible to build general, powerful support layers on both the client and the server. These layers then offload much of the work of implementing new services and clients.

  2. Separation of Concerns

Separation of concerns means that clients are responsible for the interface with the user, and servers are responsible for the storage and integrity of data.

This is important in building useful (and portable) services, but it is not always easy to achieve. The aspect ratio, and even the resolution, of graphic images ought to be under the control of the client. At times, however, it has been tempting to design a back-end service around one specific application. For example, static images such as product thumbnails might be provided only in one specific aspect ratio, limiting the service’s usefulness for other applications. This narrow view of the service limits the ability to roll out new versions and to share the service across products.

When we designed our Server Side Rendering service (which delivers a two-dimensional snapshot of a specific three-dimensional product model), we went to some pains to place this control – height and width of desired image – in the hands of the client through a custom HTTP header included with the service request.

  3. Uniform Contract

A uniform contract means that our services comply with a set of standards that we publish, including a consistent URI syntax and a limited set of verbs. I will talk in more detail about these standards in a future post, but for now note that they allow much of what would otherwise be service specific code – protocols for navigation, for manipulation of data, for security and so forth – to be implemented in generic layers. Just as important, they allow much of the documentation for our services to be consolidated in one place, so that designing becomes a more streamlined and focused process.

With a couple of very specific exceptions, the documents we send and receive are JSON, and structured in a very specific way. In particular, links are gathered together in one place within the document so that the generic software layers are better able to support the server code in generating them, and the application code in following them. This stands in contrast to web browsers which must be prepared to deal with a huge variety of text, graphic and other embedded object formats, and to find links scattered randomly throughout them.
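
To give a feel for this, a response under such a contract might be shaped roughly like the sketch below (written as a Python literal; the field names are hypothetical, not IMVU’s published standard). The point is that the client finds every followable link in one predictable place.

response = {
    "data": {
        "display_name": "Jan",
        "age": 16,
    },
    "links": {
        # all navigation happens through this section, never through
        # URLs embedded elsewhere in the payload
        "self": "https://api.example.com/user/123",
        "schedule": "https://api.example.com/user/123/schedule",
    },
}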

For the API designer, these standards limit the ability to structure data in responses as freely as we might want. Our standards are still in a bit of flux as we try to navigate this tension between freedom of API design and the needs of the generic software layers we have built.

The fact, however, that our standards are based on JSON documents and are requested and provided using HTTP protocols means that the resulting services can be implemented using varied technologies (we have back end services in PHP, Haskell, Python and C++) and accessed from a broad variety of platforms.

The uniform contract also allows ancillary agents to extract generic information from messages, without knowing the specifics of the service implementation. For example, with a single piece of code, IMVU has configured its Istatd statistics collection system to collect real-time data on the amount of time each of our REST endpoints takes to do its work, and this data collection will now occur for every future endpoint without any initiative on the part of service implementers. The ready availability of these statistics allows for greater reliability through improved response time to outages.
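
A single generic layer can do this because every request flows through the same dispatch path. A minimal sketch of the idea is below; record_timing() is a made-up stand-in for the real Istatd client, and the metric naming is purely illustrative.

import time

def record_timing(metric_name, milliseconds):
    # Stand-in for the real stats client; Istatd's actual interface is not shown here.
    print(metric_name, milliseconds)

def timed_endpoint(name, handler):
    # Wrap any endpoint handler so its wall-clock time is reported under a
    # per-endpoint metric, with no extra work by the endpoint's author.
    def wrapper(request):
        start = time.time()
        try:
            return handler(request)
        finally:
            record_timing("rest." + name + ".response_ms",
                          (time.time() - start) * 1000.0)
    return wrapper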

In addition, the design of this uniform contract means that each service addresses a well-defined slice of functionality, allowing new services to be added in parallel with a minimum of disruption to existing code.

  4. HATEOAS

HATEOAS (Hypermedia As The Engine Of Application State) is the discipline of treating all resource links as opaque to the clients that use them. Except for one well-known root URL, URLs are not hard coded, and they are not constructed or parsed by clients. Links are retrieved from the server, and all navigation is done by following links. This allows us to move resources dynamically around our cluster (or even outside it if necessary) and to add, refactor and extend services even while they are in active use, changing back-end software only, so that new releases of client software are required less often.
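
In client code, that discipline looks something like the sketch below (using the requests library; the root URL and link names are hypothetical). Nothing except the root URL is ever written down in the client.

import requests

ROOT_URL = "https://api.example.com/"   # the one well-known URL

def follow(document, link_name):
    # Links come out of the previously fetched document; the client never
    # constructs or parses URLs itself.
    return requests.get(document["links"][link_name]).json()

root = requests.get(ROOT_URL).json()
profile = follow(root, "current_user")
schedule = follow(profile, "schedule")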

  5. Stateless

Under the REST discipline, applications are stateless – or rather the entire state needed to process a service request is sent with the request (much of it in the form of an identity token sent as a cookie). In this way, between one request and the next, the server needs to know nothing about the client’s application state, and servers do not need to retain state for every active client, which allows us to distribute requests across our cluster according to load.

Because servers no longer depend on the number and state of all the clients they might be asked to service, the server code can be simpler and scales well. In addition, we can cache and stage requests at different points in our cluster, keeping them close to where they will be serviced. Responses can also be cached (keyed again by the URL of the request).

Finally, though it is not a stipulation of REST, we use IMVU’s real-time IMQ message queue to push notifications to clients (keyed by the service URL) when their locally cached data becomes invalid. This gives the client the information needed to refresh stale data, and it also allows real-time updating of data that is displayed to the user. When one user changes the outfit of their avatar, for example, all of the users in that chat room will see the updated look.

  6. Internet Scale

The internet comes with challenges, and REST provides us with a framework for addressing those challenges in a systematic way, but it is no panacea.

Fielding uses the term “anarchic scalability” to describe these challenges – by which he means the need to continue operating in the face of unanticipated load, or when given malformed or maliciously constructed data.

At IMVU we take the issue of load into account from the outset when designing our services, but the internet, as an interacting set of heterogeneous servers and services, often displays complex emergent behavior. Even within our own cluster we have over fifty different kinds of servers (what we call roles), each kind talking to a specific set of other roles, depending on its needs. Under load, failures can cascade, and feedback loops can keep the dysfunctional behavior in place even after the broader internet has stopped stressing the system.

Every failure triggers a post-mortem inquiry, bringing together all of the relevant parties (IT, engineering, product management) to establish the history and the impact of the failure. The statistics collected and recorded by Istatd are invaluable in this process of diagnosis.

Remedies are designed to address not only the immediate dysfunctional behavior, but a range of possible similar problems in the future, and can include adding new hardware or modifying software to throttle or redirect traffic. Out of these post-mortems, our design review process continues to mature, adding checklist items and questions which require designers to think about their prospective services from many different perspectives, and particularly through the lens of past failures.

There are times when a failure cannot be fully understood, even with the diagnostic history available to us. Istatd makes it easy to add new metrics which may help in understanding future failures of a similar type. We monitor more than a hundred thousand metrics, and the number is growing.

The chaos injected by the internet includes the contents of packets. Our code, on both the client and server side, is written so that it makes no assumptions about the structure of the data it receives. Even when a document is constructed by our service and consumed by a client we have written, the data may arrive damaged, or even deliberately modified. There is backbone hardware, for example, which attempts to insert advertisements into web pages.

Up next: Part 2 — Nodes and Edges

In my next post I will describe the process we go through to develop a service concept, leading up to implementation.

 

Web Platform Limitations, Part 1 – XMLHttpRequest Priority

The web is inching towards being a general-purpose applications platform that rivals native apps for performance and functionality. However, to this day, API gaps make certain use cases hard or impossible.

Today I want to talk about streaming network resources for a 3D world.

Streaming Data

In IMVU, when joining a 3D room, hundreds of resources need to be downloaded: object descriptions, 3D mesh geometry, textures, animations, and skeletons. The size of this data is nontrivial: a room might contain 40 MB of data, even after gzip. The largest assets are textures.

To improve load times, we first request low-resolution thumbnail versions of each texture. Then, once the scene is fully loaded and interactive, high-resolution textures are downloaded in the background. High-resolution textures are larger and not critical for interactivity. That is, each resource is requested with a priority:

High Priority             Low Priority
Skeletons                 High-resolution textures
Meshes
Low-resolution textures
Animations

Browser Connection Limits

Let’s imagine what would happen if we opened an XMLHttpRequest for each resource right away. What happens depends on whether the browser is using plain HTTP or SPDY.

HTTP

Browsers limit the number of simultaneous TCP connections to each domain. That is, if the browser’s limit is 8 connections per domain and we open 50 XMLHttpRequests, the 9th would not even be submitted until one of the first 8 finished. (In theory, HTTP pipelining helps, but browsers don’t enable it by default.) Since there is no way to indicate priority in the XMLHttpRequest API, you would have to be careful to issue XMLHttpRequests in order of decreasing priority, and hope no higher-priority requests arrive later. (Otherwise, they would be blocked by low-priority requests.) This assumes the browser issues requests in sequence, of course. If not, all bets are off.

There is a way to approximate a prioritizing network atop HTTP XMLHttpRequest. At the application layer, limit the number of open XMLHttpRequests to 8 or so and have them pull the next request from a priority queue as requests finish.
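
The scheduling half of that idea is sketched below in Python for brevity (a browser implementation would be JavaScript built around XMLHttpRequest): keep at most eight requests in flight, and whenever one finishes, start the highest-priority request still waiting.

import heapq
import itertools

MAX_IN_FLIGHT = 8   # mirror the browser's per-domain connection limit

class PrioritizedRequests(object):
    def __init__(self, start_request):
        self.start_request = start_request   # callback that issues the real request
        self.pending = []                    # heap of (priority, sequence, url)
        self.sequence = itertools.count()    # tie-breaker for equal priorities
        self.in_flight = 0

    def enqueue(self, priority, url):
        # Lower numbers mean higher priority.
        heapq.heappush(self.pending, (priority, next(self.sequence), url))
        self._pump()

    def on_finished(self):
        # Call this from each request's completion handler.
        self.in_flight -= 1
        self._pump()

    def _pump(self):
        while self.in_flight < MAX_IN_FLIGHT and self.pending:
            _, _, url = heapq.heappop(self.pending)
            self.in_flight += 1
            self.start_request(url)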

Soon I’ll show why this doesn’t work that well in practice.

SPDY

If the browser and server both support SPDY, then there is no theoretical limit on the number of simultaneous requests to a server. The browser could happily blast out hundreds of HTTP requests, and the responses will make full use of your bandwidth. However, a low-priority response might burn bandwidth that could otherwise be spent on a high-priority response.

SPDY has a mechanism for prioritizing requests. However, that mechanism is not exposed to JavaScript, so, like HTTP, you either issue requests from high priority to low priority or you build the prioritizing network approximation described above.

However, the prioritizing network reduces effective bandwidth utilization by limiting the number of concurrent requests at any given time.

Prioritizing Network Approximation

Let’s consider the prioritizing network implementation described above. Besides the fact that it doesn’t make good use of the browser’s available bandwidth, it has another serious problem: imagine we’re loading a 3D scene with some meshes, 100 low-resolution textures, and 100 high-resolution textures. Imagine the high-resolution textures are already in the browser’s disk cache, but the low-resolution textures aren’t.

Even though the high-resolution textures could be displayed immediately (and would trump the low-resolution textures), because they have low priority, the prioritizing network won’t even check the disk cache until all low-resolution textures have been downloaded.

That is, even though the customer’s computer has all of the high-resolution textures on disk, they won’t show up for several seconds! This is an unnecessarily poor experience.

Browser Knows Best

In short, the browser has all of the information needed to effectively prioritize HTTP requests. It knows whether it’s using HTTP or SPDY. It knows what’s in cache and not.

It would be super fantastic if browsers let you tell them each request’s priority. I’ve seen some discussions about adding priority hints, but they seem to have languished.

tl;dr Not being able to tell the browser about request priority prevents us from making effective use of available bandwidth.

FAQ

Why not download all resources in one large zip file or package?

Each resource lives at its own URL, which maximizes utilization of HTTP caches and data sharing. If we downloaded resources in a zip file, we wouldn’t be able to leverage CDNs and the rest of the HTTP ecosystem. In addition, HTTP allows trivially sharing resources across objects. Plus, with protocols like SPDY, per-request overhead is greatly reduced.

What it’s like to use Haskell

By Andy Friesen

Since early 2013, we at IMVU have used Haskell to build several of the REST APIs that power our service.

When the company started, we chose PHP as our application server language, in part, because the founders expected the website to only be a small part of the business!  IMVU was primarily about a downloadable 3D client.  We needed “a website or something” to give users a place to download our client from, but didn’t expect it would have to be much more than that. This shows that predicting the future is hard.

Years later, we have quite a lot of customers, and we primarily use PHP to serve them.  We’re big enough that we run multiple subteams on separate initiatives at the same time.  Performance is becoming important to us not just because it matters to our customers, but because it can easily make the difference between buying 4 servers and buying 40 servers to support some new feature.

So, early in 2012, we found ourselves ready to look for an alternative that would help us be more rigorous.  In particular, we were ready for the idea that sacrificing a tiny bit of short term, straight-line time to market might actually speed us up in the long run.

How We Got Here

I started learning Haskell in my spare time in part because Haskell seems like the exact opposite of PHP: Natively compiled, statically typed, and very principled.

My initial exploration left me interested in evaluating Haskell at real scale.  A year later, we did a live-fire test in which we taught multiple teammates Haskell while delivering an important new feature under a deadline.

Today, a lot of our backend code is still driven by PHP, but we have a growing amount of Haskell that powers newer features. The process has been exciting, not only because we got to answer a lot of the questions that keep many people from trying Haskell, but also because it’s simply a better solution.

The experiment to start developing in Haskell took a lot of internal courage and dedication, and we had to overcome a number of quite rational concerns related to adopting a whole new language. Here are the main ones and how they worked out for us:

Scalability

The first thing we did was to replace a single service with a Haskell implementation.  We picked a service that was high-volume but was not mission critical.

We didn’t do any particular optimization of this new service, but it nevertheless showed excellent performance characteristics in the field.  Our little Haskell server was running on a pair of spare servers that were otherwise set for retirement, and despite this, each machine was handling about 20x as many requests as one of our high-spec PHP servers could manage.

Reliability

The second thing we did was to take our hands off the Haskell service and leave it running until it fell over.  It ran for months without intervention.

Training

After the reliability test, we were ready to try a live fire exercise, but we had to wait a bit for the right project.  We got our chance in early 2013.

The rules of the experiment were simple: Train 3 engineers to write the backend for an important new project and keep up with a separate frontend team.  Most of the code was to be new, so there was relatively little room for legacy complications.

We very quickly learned that we had also signed up for a lot of catch-up work to bring the Haskell infrastructure in line with what we’d had for years in PHP.  We were very busy for a while, but once we got this infrastructure out of the way, the tables turned and the front-end team became the limiting factor.

Today, training an engineer to be productive in our Haskell code is not much harder than training someone to be productive in our PHP environment.  People who have prior functional programming knowledge seem to find their stride in just a few days.

Testing

Correctness is becoming very important for us because we sometimes have to change code that predates every current developer.  We have enough users that mistakes become very costly, very quickly.  Solving these sorts of issues in PHP is sometimes achievable but always difficult.  We usually solve them with unit tests and production alerts, but these approaches aren’t sufficient for all cases.

Unit tests are incredible and great, but you’re always at the mercy of the level of discipline of every engineer at every moment. It’s easy to tell your teammates to write tests for everything, but this basically boils down to asking everyone to be at their very best every day.  People make mistakes and things slip through the cracks.

When using Haskell, we actually remove an entire class of defects that we would otherwise have to write tests for. The number of tests we have to write is therefore smaller, and there are fewer cases we can forget to write tests for.

We like unit testing and test-driven development (TDD) at IMVU and we’ve found that Haskell is better with TDD, but also that TDD is better with Haskell.  It takes fewer tests to get the same degree of reliability out of Haskell.  The static verification takes care of quite a lot of error checking that has to be manually implemented (or forgotten) in PHP.  The Haskell QuickCheck tool is also a wonderful help for developers.

The way Haskell separates pure computations from side effects lets us build something that isn’t practical with other languages: we built a custom monad that lets us “switch off” side effects in our tests.  This is incredible because it means that trying to escape the testing sandbox breaks compilation. While we have had to fight intermittent test failures for eight years in PHP (and at times have had multiple engineers simultaneously dedicated to the problem of test intermittency), our unit tests in Haskell cannot intermittently fail.

Deployment

Deployment is great. At IMVU, we do continuous deployment, and Haskell is no exception. We build our application as a statically linked executable, and rsync it out to our servers. We can also keep old versions around, so we can switch back, should a deployment result in unexpected errors.

I wouldn’t write an OS kernel in it, but Haskell is way better than PHP as a systems language. We needed a Memcached client for our Haskell code, and rather than try to talk to a C implementation, we just wrote one in Haskell.  It took about a half day to write and performs really well. And, as a side effect, if we ever read back some data we don’t expect from memcached (say, because of an unexpected version change) then Haskell will automatically detect and reject this data.

We’ve consistently found that we unmake whole classes of bugs by defining new data types for concepts that wrap primitive types like integers and strings.  For instance, we have two lines of code that say that “customer IDs” and “product IDs” are represented as plain numbers under the hood, but are not mutually convertible.  Setting up these new types doesn’t take very much work and it makes the type checker a LOT more helpful. PHP, and other popular dynamic server languages like JavaScript or Ruby, make doing the same very hard.

Refactoring is a breeze.  We just write the change we want and follow the compile errors.  If it builds, it almost certainly also passes tests.

Not All Sunshine and Rainbows

Resource leaks in Haskell are nasty.  We once had a bug where an unevaluated dictionary was the source of a space leak that would eventually take our servers down.  We also ran into an issue where an upstream library opened /dev/urandom for randomness, but never closed the file handle.  These issues don’t happen in PHP, with its process-per-request model, and they were more difficult to track down and resolve than they would have been in C++.

The Haskell package manager, Cabal, ended up getting in the way of our development. It lets you specify version ranges of the particular packages you want, but it’s important for everyone on the team to have exactly the same versions of every package.  That means controlling transitive dependencies, and Cabal doesn’t really offer a way to handle this precisely. For a language so principled about its type algebra, it’s surprising that the package manager doesn’t follow suit regarding package versioning. Instead, we use Cabal for basic package installation, and a custom build tool (written in Haskell).

Hiring

I’ll admit that I was very worried that we wouldn’t be able to hire great people if our criterion was expertise in an uncommon language with a comparatively sparse industrial track record, but the honest truth is that we found a great Haskell hacker in the Bay Area after about 4 days of looking.

We had a chance to hire him because we were using Haskell, not in spite of it.

Final Thoughts

While it’s usually difficult to objectively measure things like choice of programming language or software stack, we’re now seeing fantastic, obvious productivity and efficiency gains.  Even a year later, all the Haskell code we have runs on just a tiny number of servers, and when we have to make changes to the code, we can do so quickly and confidently.

Charming Python: How to actually close a socket by calling shutdown() before calling close()

By Eric Hohenstein

You may never have to use the Python low-level socket interface but if you ever do, here’s a tip regarding some surprising Python socket behavior that might help you.

The standard Python library has a “socket” module written in Python that wraps an underlying “_socket” module written in C that wraps OS-level sockets. Together, these expose a cross-platform Berkeley socket-like interface to Python applications. I say Berkeley socket-like because the socket functions are exposed on an object rather than as global free functions. The surprise with this interface is that the close() function of the socket object does not close the socket. Instead, it de-references it, allowing garbage collection to eventually close the OS-level socket. If you only use the close() method on a socket, the socket will continue using both client and server resources until it is garbage collected. Worse, if the other end is continuing to send data to the locally discarded socket, it may block when the receive window of the local socket fills up. Even worse than that, if the software handling the other end of the socket is written in Erlang and zero-window packets are dropped somewhere in between (for instance because of overly aggressive firewall rules), the other end may block until the socket gets garbage collected, even if the socket is configured to immediately time out send operations that would block.

The solution is to first call shutdown() on the socket before calling close(). You have to be careful doing this if the socket is being read/written by multiple threads simultaneously but down that path lies madness, so don’t do that.
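
In its simplest form, the pattern looks like the snippet below (the host name is just an example; the try/except is there because shutdown() raises if the peer has already gone away):

import socket

sock = socket.socket()
sock.connect(("www.example.com", 80))
# ... use the socket ...
try:
    sock.shutdown(socket.SHUT_RDWR)   # actually tears the connection down now
except socket.error:
    pass                              # the peer may already have disconnected
sock.close()                          # then release the Python wrapper object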

The Python documentation for the socket module actually does briefly mention this but doesn’t really give much of an explanation as to why it’s necessary. Really, the Python documentation on the subject is correct in that calling shutdown is better form. However, in cases where you are sure that the application state of the connection is no longer valid, it should be possible to bypass the shutdown and close the socket to simply free its resources immediately.

Why is it necessary to call shutdown() before close()? Let’s take a look at the socket.py module from the standard Python library. These code snippets are all taken from Python 2.6 source, though I’ve confirmed that Python 2.7 and Python 3.0 behave similarly. First, socket.py imports everything from _socket, which includes a class called socket:

import _socket

from _socket import * 

Then, it saves a reference to _socket.socket called _realsocket

_realsocket = socket

Then it defines a list of functions that will be exposed from the real socket object to the wrapper class (see below):

_socketmethods = (
    'bind', 'connect', 'connect_ex', 'fileno', 'listen',
    'getpeername', 'getsockname', 'getsockopt', 'setsockopt',
    'sendall', 'setblocking',
    'settimeout', 'gettimeout', 'shutdown')

Note that ‘close’ is not in that list but ‘shutdown’ is.

Then it defines a _closedsocket class that will fail all operations:

class _closedsocket(object):
    __slots__ = []
    def _dummy(*args):
        raise error(EBADF, 'Bad file descriptor')
    # All _delegate_methods must also be initialized here.
    send = recv = recv_into = sendto = recvfrom = recvfrom_into = _dummy
    __getattr__ = _dummy

Then, it defines a class called _socketobject. Note the comment before the start of this class:

# Wrapper around platform socket objects. This implements
# a platform-independent dup() functionality. The
# implementation currently relies on reference counting
# to close the underlying socket object.
class _socketobject(object):

    __doc__ = _realsocket.__doc__
    __slots__ = ["_sock", "__weakref__"] + list(_delegate_methods)

    def __init__(self, family=AF_INET, type=SOCK_STREAM, proto=0, _sock=None):
        if _sock is None:
            _sock = _realsocket(family, type, proto)
        self._sock = _sock
        for method in _delegate_methods:
            setattr(self, method, getattr(_sock, method))

Within the _socketobject class, it defines a method for each of the functions exposed by the _socket class:

    _s = ("def %s(self, *args): return self._sock.%s(*args)\n\n"
          "%s.__doc__ = _realsocket.%s.__doc__\n")
    for _m in _socketmethods:
        exec _s % (_m, _m, _m, _m)
    del _m, _s

Remember that close was not in the _socketmethods list. Here’s the close method of the _socketobject class:

    def close(self):
        self._sock = _closedsocket()
        dummy = self._sock._dummy
        for method in _delegate_methods:
            setattr(self, method, dummy)
    close.__doc__ = _realsocket.close.__doc__

Assigning a new value to self._sock de-references the real socket object, allowing it to be garbage collected. It then exposes the _socketobject class as socket:

socket = SocketType = _socketobject

Client networking code using the Python socket library would create a socket like so:

import socket

sock = socket.socket() 

which will create a socket._socketobject instance that has a _sock member which is a _socket.socket instance. After connecting the socket to a remote endpoint, calling close() just assigns the _sock member to an instance of _closedsocket, allowing Python to eventually garbage collect the original _socket.socket object. After the previous code, the following will “leak” the OS socket for some arbitrary amount of time:

sock.connect(("www.foobar.com", 80))

sock.close()

Looking at the C code in _socketmodule.c, there is a close() method defined on the _socket.socket class but there is no code in socket.py that will call it:

static PyObject *
sock_close(PySocketSockObject *s)
{
    SOCKET_T fd;

    if ((fd = s->sock_fd) != -1) {
        s->sock_fd = -1;
        Py_BEGIN_ALLOW_THREADS
        (void) SOCKETCLOSE(fd);
        Py_END_ALLOW_THREADS
    }
    Py_INCREF(Py_None);
    return Py_None;
}

Actually, there is code in socket.py that will call this function, but it’s in the socket._fileobject class returned from the socket._socketobject makefile() function, which typical networking code will likely not use.

If the close() method of socket._socketobject deferred to the _socket.socket close() method, calling close() on a socket.socket object would actually close the socket (if it was open). Since it does not, calling shutdown() is the only way to close the socket immediately. I’ll not go into the difference between close() and shutdown() here since it’s fairly complex and mostly unimportant but a simplified explanation is that shutdown() may wait until any unsent data has been flushed to the network (although it may not) and close() will not.

Here’s the implementation of the _socket.socket shutdown() function:

static PyObject *
sock_shutdown(PySocketSockObject *s, PyObject *arg)
{
    int how;
    int res;

    how = PyInt_AsLong(arg);
    if (how == -1 && PyErr_Occurred())
        return NULL;
    Py_BEGIN_ALLOW_THREADS
    res = shutdown(s->sock_fd, how);
    Py_END_ALLOW_THREADS
    if (res < 0)
        return s->errorhandler();
    Py_INCREF(Py_None);
    return Py_None;
}

Calling shutdown() on an instance of socket._socketobject will actually shutdown the OS level socket before it is garbage collected. If the _socket.socket object is not already shutdown or closed, it will eventually be closed when it’s garbage collected:

 

static void
sock_dealloc(PySocketSockObject *s)
{
    if (s->sock_fd != -1)
        (void) SOCKETCLOSE(s->sock_fd);
    Py_TYPE(s)->tp_free((PyObject *)s);
}

The conclusion is that Python sockets should always be closed by first calling shutdown() and then calling close().

Multiplatform C++ on the Web with Emscripten

At GDC 2013, IMVU shared the analysis that led us to adopt Emscripten as a key component of our technology strategy to bring our rich 3D virtual goods catalog to new platforms, including web browsers.

Here are the slides from said presentation.

[Embedded SlideShare deck: Multiplatform C++ on the Web with Emscripten]

Efficient and Scalable Off-Site Backup to Amazon Glacier

By Ted Reed

The strength of IMVU is our large catalog of User-Generated Content (UGC). With more than ten million items in our virtual catalog, losing our UGC would be crippling to our business.  We in the Operations Team take the preservation of this data seriously. We recently upgraded our aging UGC backup system to use Amazon Glacier as the storage medium. This post will briefly explain the old backup system before detailing the new one.

In The Beginning

We store our UGC in a MogileFS instance, a system for cheap and efficient storage of files across commodity hardware. Our offsite backups originally took the form of USB drives onto which a process would copy the files as they were written to Mogile. As each drive filled up, we would then transport it from our colocation facility to a fire safe in our office. To cover the period of time when the disk was still being written to, we would keep a synced copy of the disk on a machine in a server closet in our offices. This copy would then be deleted once the USB disk had been safely stored.

As time went on, the rate at which our customers were adding photos or uploading products to our catalog grew and grew. We also started permitting higher-quality photos and made it easier to upload and manage them. In order to compensate for the increase in growth, we started buying larger and larger USB drives. This past summer we reached the point where we had the biggest USB drives we could reasonably get and we were still filling them up in about a week. Each time a drive filled up, one of us had to drive out to the facility to retrieve it.

Auditing the data also became woefully inconvenient, as the number of individual drives one had to juggle during the audit skyrocketed. It was an annoyingly manual process, even with helper scripts to manage everything but plugging and unplugging drives. Additionally, annoyingly manual processes tend not to get done as often as they ought to be.

Enter Amazon Glacier

We spent some time looking over options for better off-site backups. We talked to many vendors, and even for those who could handle our “monumental amount of data” (an actual term used by a vendor who shall remain nameless), the quotes left us wondering if we might not be better off just putting some cheap servers in another data center somewhere.

While we were evaluating these options in late August 2012, Amazon announced Glacier, an “extremely low-cost storage service” intended for long-term archival of infrequently-accessed data. At just one cent per gigabyte per month, it handily beat out every other vendor we’d seen in terms of price. We poked around and tested the service out, and found it to be right up our alley, although the pricing was annoyingly confusing. (I ended up writing a small Haskell program to re-implement the math from their FAQ. It rounds in weird places.)

The Backup Process

The backup process begins with a Perl daemon which manages worker pools of Fetchers and Uploaders. The master process figures out which files need to be backed up and hands work off to the Fetchers, which talk to Mogile and pull the file down to a local directory. Once a directory has about 25 MB of files, it’s closed off and handed to an Uploader. The Uploader will then tar the directory and send the tar to Glacier.

It’s worth pointing out here that the 25 MB bundling is actually pretty important if you don’t want to pay a great deal of money. Glacier charges five cents per thousand uploads. My first tests ignored that cost and we pretty quickly ran up an unexpectedly large bill. After some analysis, we found that 25 MB was about right as a middle point between cost and flexibility. This middle point is likely to differ for your use case and budget.

The state of the backup process is stored in a MySQL database, which has tables for archives and the files that go into them. We record roughly the same information that Mogile does about each file so that we can accurately restore it if the file is accidentally deleted from Mogile. For each archive, we store both its size as well as information about where to find it (region/vault/id). We also have tables to record information about backup failures for later investigation.
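
A stripped-down sketch of the bundling step is below; upload_to_glacier() is a stand-in for the real Uploader, and the 25 MB threshold is the tunable discussed above.

import os
import tarfile

BUNDLE_BYTES = 25 * 1024 * 1024   # the cost/flexibility middle point

def maybe_bundle_and_upload(staging_dir, upload_to_glacier):
    # Once the staging directory holds roughly 25 MB of fetched files,
    # tar it up and hand the single archive off for upload.
    total = sum(os.path.getsize(os.path.join(staging_dir, name))
                for name in os.listdir(staging_dir))
    if total < BUNDLE_BYTES:
        return None   # keep filling this directory
    base = staging_dir.rstrip("/")
    tar_path = base + ".tar"
    with tarfile.open(tar_path, "w") as tar:
        tar.add(staging_dir, arcname=os.path.basename(base))
    return upload_to_glacier(tar_path)   # e.g. returns the Glacier archive id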

Restoration

The restoration process begins with a simple front-end script, which can take a source and destination. The source can be in terms of Mogile ID, Mogile Domain/Key, or one of our URLs (the latter two are dereferenced to Mogile ID). The destination can be Mogile itself, a local file, or S3. There’s also a batch mode which takes input where each line represents one source and one destination. The script will record each file to restore in the database. It will then issue requests to Glacier to retrieve the archives needed to restore the files, unless there is already an active and uncompleted request for the same archive.

Elsewhere in our network, we have a daemon which sits and listens to an SQS queue tied to the SNS topic for restoration. The notification will include the database ID for the restore request. The daemon will pull the information needed to do the restoration and then clear it afterwards.

Auditing and Validation

One side benefit of a backup system that doesn’t involve managing disks is that we can automate testing the backups and our restoration process. There are two aspects that we want to test. First, we want to prove that the backups are intact and accurate. Second, we want to prove that we are able to restore the data. Glacier permits you to pull up to 5% of your total data per month, presumably for purposes such as these. You still need to pay bandwidth egress charges if the data leaves AWS, which informed our design for the automated audit process.

In order to audit the backups, there’s a cron on our side which runs hourly and selects a random sampling of one millionth of the archives we’ve uploaded. If we had 3,000,000 archives, we’d audit 3 per hour. For each archive, this cron pulls the files from Mogile and generates md5sums. It issues a request to Glacier, with a different SNS topic than the one we use for ordinary restoration. It then uploads the md5sums as a file to an EC2 micro instance which runs a daemon listening for that SNS topic. As the archives become available, the daemon does the same md5sum generation on the Glacier data, comparing it to the list uploaded beforehand. If there is a discrepancy, it sends an email to a mailing list that we check at least daily.
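
The sampling and checksum step itself is tiny. A rough sketch is below; fetch_file_contents() is a hypothetical helper standing in for the Mogile client.

import hashlib
import random

AUDIT_FRACTION = 1.0 / 1000000   # one millionth of the archives, per hourly run

def pick_archives_to_audit(all_archive_ids):
    sample_size = max(1, int(round(len(all_archive_ids) * AUDIT_FRACTION)))
    return random.sample(all_archive_ids, sample_size)

def md5sums_for_archive(fetch_file_contents, file_names):
    # Build the md5sum list that gets uploaded and later compared against
    # the checksums computed from the Glacier copy of the same archive.
    return dict((name, hashlib.md5(fetch_file_contents(name)).hexdigest())
                for name in file_names)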

To handle verifying the restoration path, there is a smaller monthly process. It will generate a list of 10 random files and use our normal restoration tool to restore them to test buckets in both Mogile and S3. If the files haven’t been restored within 24 hours, a notification goes to the same mailing list as for audit failures.

In The End

We’re still pushing our historical data into Glacier, but this project is already looking pretty successful. We’ve been adding new data since last November, and are in the process of checking through everything to gain a full trust in the system before we finally pull the plug on the old system and build a mighty throne out of the USB drives.

The backup process was mostly optimized to be able to push these historical files up to Amazon in a reasonable timeframe. As a result, the process which backs up our real-time uploads is quite fast with plenty of room to grow as our usage scales up. Files added to Mogile are in the backup process within seconds, and are typically in Glacier within less than a minute. I looked at our system during peak time for an example, and found that files were in Glacier 35 seconds after being uploaded to us by a user. Knowing this gives me great peace of mind that our UGC will be safe in the event a disaster strikes.

What about other forms of backup? We briefly looked at storing our offsite database backups in Glacier, but decided against this. Any data you put into Glacier will be billed for at least three months. Adding daily backups and then deleting some of them later makes for a rather expensive solution. There’s also little to gain as the process which moves the database backups to our office isn’t nearly as painful as our older UGC backup process was.

Monitoring Delayed Replication, with a focus on MySQL

By Stuart Cianos, CISSP

Author’s Preface

Many (many) years ago, I had the opportunity to work with a variety of organizations targeting harm reduction for at risk populations in the San Francisco Bay Area. One of the great lessons learned from that experience is that the future is mostly indeterminate. In 1997-1998, severe storms ravaged our region and very few of us dealing with relief services (myself included) anticipated the total impact. It was only by sheer luck and coincidence that there was available shelter space to house an entire community of displaced families after automatic flood gates failed to open on a large stream following years of drought. We didn’t think it would happen to us until it was too late to be better prepared. If I win the lottery tomorrow, my future might change in ways that were unexpected. Conversely, if I were to accidentally delete all of the files (“rm -rf /”) on a mission critical computer system, my future might also change in unexpected ways.

I no longer worry about the future, I prepare for it. What follows is a powerful technique which can help mitigate risks around unanticipated data loss, even when that data loss is due to human error.

Introduction

IMVU utilizes MySQL on a daily basis to drive our site and customer experience forward. As IMVU has grown and scaled, so have the number of database hosts and the importance of the data maintained in our environment. With over 100,000,000 registered user accounts and hundreds of thousands of concurrent transactions at any given time, it’s of paramount importance that our practices embody the CIA triad: confidentiality, integrity and availability of information. Implementing delayed replication at the data tier can enhance business continuity and improve our posture when recovering from events impacting data integrity and availability.

IMVU uses MogileFS to store user-generated binary content (images, content creator products, etc.). MogileFS is a user-space, fault-tolerant, distributed file system. MogileFS stores filesystem metadata in a MySQL database (PostgreSQL is also supported, but not used in our environment). The system is designed to ensure that no single failure will result in data loss or service degradation.

In June of 2012, human error resulted in the MogileFS metadata table (“file”) being dropped from the primary database instance rather than a standby under maintenance. Without any way to relate requests for files back to the content on disk, MogileFS could no longer serve any requests resulting in customers being unable to access some major features of the IMVU social network and rich 3D experience. There was a slave database instance lagging behind its master due to write heavy operations, which allowed us to use it as a recovery source. We were very fortunate that this slave was lagging behind unintentionally; had it not been, the impact of this incident would have been much more severe.

Delayed replication is being implemented across IMVU’s cluster of database instances in order to protect against these types of accidental operations moving forward.

Data Availability and the CIA Triad

The CIA triad is a set of common attributes which should be applied to information management and security. Information on the CIA triad is available abundantly online, but a brief summary is below:

[Diagram: the CIA triad]

  • Confidentiality: Prevention of information disclosure to unauthorized entities.
  • Integrity: Prevention and detection of the unauthorized modification of data.
  • Availability: Assurance that the information will be available when needed.

One question that comes up is what delayed replication has to do with information security… and my response is “a lot!” Delayed replication can help mitigate risks around data availability as well as data integrity, and the impact an incident will have on the bottom line.

What is Database Replication?

Database replication allows transactions on one database host to be replicated to one or more separate instances. There are many replication strategies, but the examples discussed will focus on a simple Master-Slave configuration throughout.

[Diagram: master-slave replication]

When the transaction “INSERT INTO foo VALUES (‘hello world’)” is committed to the master database instance, it is replicated on the slave and committed there as well. By having a slave or standby host available and relatively up to date, it is possible to recover very quickly from various hardware and software failures on the master by replacing it with the slave/standby host.

[Diagram: failing over from the master to the slave]

What is Delayed Replication?

Delayed replication is the technique of inserting a delay line into a database’s replication mechanism. Each transaction is held in a first-in, first-out queue for two hours before being committed on the delayed slave host. For the purposes of this example, “mytable” is a mission-critical set of data required for the application to function and contains 100 gigabytes of data. If the table “mytable” becomes unavailable for whatever reason (for example, a DROP TABLE statement executed accidentally on the primary that immediately replicates to and executes on the standby), customers will not have access to the service. In this example, fictitious business parameters will be defined to help us quantify possible benefits:

  • Recovery time objective (RTO): 8 hours
  • Recovery point objective (RPO): 4 hours
  • Single loss expectancy is calculated using the standard formula: SLE=(Asset Value) * (Exposure Factor).
  • Our exposure factor (a subjective percentage of functionality or impact): 100%, since no customers will be able to access the system.
  • Timespan used to calculate the single loss expectancies in terms of hours is the time from failure to service availability.
  • Value of customer transactions against “mytable”, per hour: $10,000 USD
  • Single loss expectancy based on recovery time objective of 8 hours: (10,000 * 8) * (1.00) = $80,000.
  • Total time to rebuild “mytable” from a backup: 20 hours
  • Total time to fail over a database from master to standby/slave: 5 minutes

Given the above, a single loss expectancy can be calculated for “mytable” assuming a worst-case scenario of ~20 hours. The realistic single loss expectancy (for the table, not the entire database server asset as a whole) may be calculated: (10,000 * 20) * (1.00) = $200,000. This is a $120,000 loss above and beyond what management would have expected. Worse, the recovery time objective and recovery point objectives cannot be consistently met.

Adding a two hour delay line (well within the recovery point objective) can help meet business requirements and reduce single loss expectancy.

[Diagram: replication with a two-hour delay line]

The above transaction seems pretty innocuous, and having a delay line enabled doesn’t make sense for some use cases such as read slaves that are expected to be relatively consistent with their master. So… why bother with a delay line at all?

The benefits become clear when looking at another example. What if the INSERT statement above is changed to DELETE FROM mytable (or even DROP TABLE mytable) due to a bug in code, human error or malicious intent? In an environment with immediate committal on all slaves, one must:

1. Hold their breath and:

  ◦ Hope that a slave can be stopped before the transaction propagates, dump the table, and restore it to the impacted master (likely 8+ hours).
  ◦ Hope that a slave can be stopped before the transaction propagates and fail over to it.

2. Restore from backup (20 hours).

Hope must not be factored into sound business practices; option one is off the table. With a delay line enabled, procedures may be implemented to create a recovery process:

[Diagram: recovering by promoting the delayed slave]

The recovery process and benefits when using delayed replication can be documented, made repeatable and proven. So long as the problem is caught before the offending transaction makes it through the delay line, the recovery process is:

  1. T+0 minutes: Service degraded. Investigation into cause begins.
  2. T+5 minutes: Cause determined. Stop all replication to the delayed slave (on MySQL, don’t forget to stop the slave IO thread in addition to the SQL thread).
  3. T+35 minutes: Remove the pending transaction in the relay log, or, reset the relay log based on best judgment and the importance of data consistency.
  4. T+40 minutes: Promote the delayed standby to master.
  5. T+50 minutes: Validate that the system is functional.
  6. Service recovery declared.
  7. Recover the demoted master to a consistent state.

The total time to recovery during the above incident was 50 minutes, with an approximate loss of (10,000 * 0.833) * (1.00) = $8,333. The recovery time objective and recovery point objective have been met.

Delayed Replication and Seconds Behind Master:

MySQL 5.6 natively includes delayed replication as a new feature, and will be configurable via the CHANGE MASTER TO statement. MySQL 5.6 is not a generally available release, however, so is not deployed in production for most environments. Organizations running versions 5.1 and 5.5 can effectively implement delayed replication using the Percona-Toolkit utilities (formerly Maatkit) from Percona Software. The Percona Toolkit is an open source collection of helpful tools focused towards management of the MySQL database server. At IMVU, we have successfully implemented delayed replication using the percona-toolkit tools (more specifically, pt-slave-delay).

One side effect of delayed MySQL replication via an external process is that it is no longer possible to fully determine slave state using SHOW SLAVE STATUS. As the slave’s SQL_THREAD is periodically stopped and started, “Seconds behind master” will be NULL most of the time.

In order to mitigate the lack of information from MySQL’s internal replication state, a second utility is available through the percona-toolkit: pt-heartbeat. Pt-heartbeat writes a timestamped entry to a table periodically, creating a heartbeat.

The heartbeat transaction will be replicated, and delayed. By comparing the slave’s current time with the timestamp on the last committed heartbeat, we can approximate with fair certainty the number of seconds the slave has been delayed (or is lagging behind the master).
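
Measuring the delay then reduces to a single query against the delayed slave. A sketch of the arithmetic is below; the heartbeat table and column names in the comment are assumptions based on pt-heartbeat’s defaults, not necessarily our exact configuration.

import datetime

# e.g. SELECT ts FROM percona.heartbeat ORDER BY ts DESC LIMIT 1   (run on the slave)
def approximate_seconds_behind(last_heartbeat_ts, now=None):
    # last_heartbeat_ts: newest heartbeat timestamp committed on the delayed
    # slave, as a datetime. Its age against the slave's clock approximates
    # Seconds_Behind_Master, which SHOW SLAVE STATUS can no longer report.
    now = now or datetime.datetime.utcnow()
    return (now - last_heartbeat_ts).total_seconds()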

Monitoring the Solution Effectively:

Monitoring delayed replication becomes more complex than simply watching for heartbeats and comparing timestamps:

  • Usually, the slave will be stopped during backups unless something like Percona Xtrabackup or Enterprise Backup is being used. The replication delay will increase for the duration of the backup, and then contract. It should be noted that IMVU is OK with taking backups from a two hour old data source given our business requirements. Always check with your organization’s business requirements; don’t assume!
  • The replication delay will be affected by MySQL’s replication characteristics, including the fact that it is single threaded. High transaction volume can cause replication delays to increase and subsequently contract. The replication delay is variable and will not be consistent on a heavily loaded database.
  • If the pt-heartbeat/pt-slave-delay fails to maintain the delay line and monitoring, the team must always be made aware.
  • If the replication fails due to IO or SQL thread errors, the team must always be made aware.
  • If the database replication or monitoring system fails for any other reason, the team must be made aware. Contingencies must be in place for failures in MySQL, as well as in code which monitors delayed replication.

Rather than looking at the difference in timestamps at a single moment, IMVU calculates the slope of the delay across a timespan. From that series of samples, the following descriptive statistics are derived (a sketch of the calculation follows the list):

  • The slope of the delay line’s values over a time period: Determines whether the delay line is stable, moving towards the desired value, or moving away from it, and how quickly.
  • The Y-intercept of the trend line: Treated as the approximate current number of seconds behind, functionally equivalent to MySQL’s Seconds_Behind_Master.
  • The correlation coefficient: Indicates how well the values on the trend line correlate as a series. A coefficient near 0 means little correlation, i.e. the values are widely scattered over the given timespan. For values -1 ≤ r < 0 the values correlate and the delay line is trending downward; for 0 < r ≤ 1 they correlate and the delay line is trending upward.
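Here is a minimal sketch of those three statistics, assuming the samples are (timestamp, delay) pairs collected over the monitoring window; IMVU’s production plugin carries considerably more bookkeeping than this.

```python
import math
import time

def delay_trend(samples, now=None):
    """samples: list of (unix_timestamp, delay_seconds) pairs.
    Expects at least two samples with distinct timestamps."""
    now = now or time.time()
    # Re-express time relative to "now" so the Y-intercept (x = 0)
    # approximates the current delay.
    xs = [t - now for t, _ in samples]
    ys = [d for _, d in samples]
    n = float(len(samples))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    slope = cov / var_x                    # change in delay per second
    intercept = mean_y - slope * mean_x    # approximate current delay
    correlation = (cov / math.sqrt(var_x * var_y)
                   if var_x and var_y else 0.0)
    return slope, intercept, correlation
```

With these three numbers the plugin can decide whether the delay line is stable, recovering towards the two-hour target, or drifting away from it.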

Additional information is helpful as well, and can be calculated based on data gathered from MySQL:

  1. The time when backups were kicked off.
  2. The number of seconds backups have been running (if applicable).
  3. The number of seconds the trend line has been in a degraded state due to: backups running, recovering after a backup, and trending towards recovery but not in a backup state.

If backups are being taken from the delayed standby/slave, the delay line will increase dramatically when replication is paused.

To prevent false alarms from paging our operations team, monitoring sensitivity must be decreased during backup windows. Once the backup finishes, it will take time for MySQL’s replication thread to catch up to the desired delay line. Again, sensitivity must be reduced during the recovery window.

If the delay line briefly dips below the configured value for 20 seconds but recovers 30 seconds later, the monitoring system should not page; that would be a false alarm at 3:00 in the morning. This is not a real-time computing system or database platform, so there is no guarantee that the delay line will hold at exactly the configured value… only a promise that it will be kept as close to the target as possible. On average, the delay line varies by up to 10 seconds when the database is not under load or otherwise impacted.

In a mission-critical system, any exception to standard monitoring should have clearly defined parameters: what triggers the exception, and what its limits are. In IMVU’s case, the limits around each exception are based on time. Every exception which results in reduced sensitivity has a corresponding limit on how long the service can remain in that state; if the limit is exceeded, the service goes into an alarm state for Nagios to process. No external process may therefore reduce the monitoring’s sensitivity beyond configurable limits.
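A sketch of this bounded-exception idea, using standard Nagios exit codes; the function names and the simplified state logic are illustrative and this is not IMVU’s actual plugin.

```python
import sys
import time

OK, WARNING, CRITICAL = 0, 1, 2   # standard Nagios exit codes

MAX_BACKUP_SECONDS = 64800        # cf. --max-backup-behind
MAX_RECOVERY_SECONDS = 21600      # cf. --max-recovery-time

def check(delay_ok, backup_started_at, backup_finished_at, now=None):
    """Return a (state, message) pair for Nagios to process."""
    now = now or time.time()

    if backup_started_at is not None:
        # Sensitivity is reduced during the backup, but only for so long.
        running = now - backup_started_at
        if running > MAX_BACKUP_SECONDS:
            return CRITICAL, "backup exceeded its window (%ds)" % running
        return WARNING, "active backup; sensitivity reduced"

    if delay_ok:
        return OK, "delay line within range"

    if backup_finished_at is not None:
        # Sensitivity stays reduced while replication catches back up.
        recovering = now - backup_finished_at
        if recovering > MAX_RECOVERY_SECONDS:
            return CRITICAL, "recovery window exceeded (%ds)" % recovering
        return WARNING, "recovering after backup; sensitivity reduced"

    return CRITICAL, "delay line out of bounds with no active exception"

if __name__ == "__main__":
    state, message = check(delay_ok=False, backup_started_at=None,
                           backup_finished_at=time.time() - 600)
    print(message)
    sys.exit(state)
```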

Delayed replication and monitoring in action!

Here is a real example of delayed replication running in a production environment (currently being staged across our cluster). Note that additional exceptions and limits are in place for our internal processing. Our implementation accepts the following configurable limits and ranges:

  • --max-seconds-behind=<seconds>: Max seconds behind for the local heartbeat (there needs to be a frequent heartbeat detected from the local host itself as well as the master – or there’s a problem!)
  • --max-relay-behind=<seconds>: Max seconds since the relay log was last updated
  • --max-master-behind=<seconds>: Max seconds since the master log was last updated
  • --max-backup-behind=<seconds>: Max seconds a backup may run
  • --max-deshard-behind=<seconds>: Max seconds a deshard job may run
  • --max-last-seen=<seconds>: Max seconds since the worker was last seen
  • --min-delay-time=<seconds>: Minimum allowed replication transaction delay time
  • --max-delay-time=<seconds>: Maximum allowed replication transaction delay time
  • --max-recovery-time=<seconds>: Seconds after a backup completes during which variances are allowed
  • --max-trend-time=<seconds>: Maximum time to allow the intercept to trend towards recovery

The actual limits passed to our Nagios plugin during our staged rollout are currently:

  • --max-seconds-behind 60
  • --max-master-behind 1800
  • --max-relay-behind 28800
  • --max-backup-behind 64800
  • --max-deshard-behind 32400
  • --max-last-seen 60
  • --min-delay-time 5400
  • --max-delay-time 9000
  • --max-recovery-time 21600
  • --max-trend-time 9000

A database instance in good health with a 2 hour replication delay:

stuart blog6

A database instance currently running an active backup. If the backup were not running, this service would be in an alarm/critical state, as the replication delay line has grown to 16,408 seconds (far above our window limit of 9,000 seconds and the target of 7,200 seconds):

stuart blog7.jpg

Once a backup finishes, the recovery window is entered. If recovery back to the desired delay line does not occur within the recovery window, the service will go critical:

stuart blog8

The complete set of delayed replication conditions trapped at IMVU is listed below, along with the corresponding service state (i.e. WARNING or CRITICAL):

  • WARNING: Replication delay intercept recovering – below minimum with slope: Replication delay is below the minimum desired value, but is moving towards recovery.
  • WARNING: Replication delay intercept recovering – above maximum with slope: Replication delay is above the maximum desired value, but is moving towards recovery.
  • WARNING: Replication delay intercept is out of bounds, currently in recovery window for backup which completed <seconds> seconds ago: Replication delay is above/below the desired range, and may not be moving towards recovery. State is held in warning to allow recovery post backup until the recovery window expires.
  • WARNING: Replication delay intercept is out of bounds, currently in recovery window for deshard which completed <seconds> seconds ago: Replication delay is above/below the desired range, and may not be moving towards recovery. State is held in warning to allow recovery post deshard until the recovery window expires.
  • WARNING: Replication delay intercept recovering – real value within range: Replication delay intercept is above or below the desired range, but the real value has already recovered. Replication delay intercept is trending towards recovery.
  • CRITICAL: Replication delay intercept is out of bounds, and slope does not indicate recovery: The replication delay is outside of the desired range, and is not moving towards recovery or is not moving towards recovery at the minimum desired rate (slope).
  • CRITICAL: Replication delay intercept recovery exceeded window: The replication delay intercept was recovering but exceeded the maximum recovery window time limit. Replication may be lagging or is not catching up in sufficient time.
  • WARNING: Active backup/deshard: An active backup or deshard job is running.

Naturally, the other parameters necessary for healthy replication should continue to be checked as well; delayed replication, like any replication, will fail outright if it is stopped due to an IO or SQL thread error.

Caveats

No solution is completely perfect or without trade-offs, and delayed replication is no exception. Some considerations that should be taken into account prior to implementation:

  • If failing over to a delayed standby/slave, recovery time can be impacted. For instance, if auto-incrementing columns are used in MySQL and statement-based or mixed (hybrid) replication is in use, then the standby must be caught up or data inconsistencies are likely. Make sure that the additional time to catch the host up is taken into account when estimating service restoration time.
  • The longer replication is delayed, the longer it will take for an instance to catch up.
  • If the application being targeted uses the delayed standby as a read slave, it is important to verify that the delay line doesn’t impact functionality or create edge/corner cases (this is not the case at IMVU, but is worth mentioning as it is not uncommon). Ideally, this should be validated by understanding the application’s logic or (even better) its code. A great example: A user is disabled in an application that actively queries read slaves for user information in a table. If the read slave is two hours behind, the user might be able to log in until the slaves commit the transaction which disabled the account.
  • Delayed replication will not mitigate data loss unless the impacting event is caught within the delay window and action is taken before the statements in question are applied. Clearly document that delayed replication serves only as one component of a comprehensive backup and recovery strategy.

Final Thoughts

Delayed replication is most useful for recovering from errors made by administrative users, particularly since most DDL statements cannot be wrapped in a transaction. It is not, however, a magic bullet; it only provides protection if the offending transactions are identified and removed before they make it through the delay line. The monitoring strategies for delayed replication are different from (and decoupled from) standard MySQL replication monitoring, and the decoupled processes themselves must be monitored as well.

Continuous Monitoring: Real-time statistics for a thousand servers and the application they serve

By Jon Watte, Technical Director, IMVU.com 

At IMVU, we push code to production fifty times a day. Each time an engineer finishes a task, the code goes through a large battery of unit tests, and when it passes, we deploy it on our servers right away. This makes the feedback loop immediate: If something is wrong, we hear about it and can fix it while the context is still fresh in the mind of the engineer.

An important part of this process is the “immune system.” The immune system monitors the status of the entire application and detects abrupt changes. If these abrupt changes are bad enough, and closely enough correlated with a recent code deployment, that deployment is rolled back, and the engineer in question is sent links to graphs and error logs to examine in order to figure out the problem.

For a long time, we used rrdtool with scripts that scraped counter values out of memcached to capture data, and cacti to plot that data into graphs. This was an easy way to get started when IMVU was small, and it has scaled to the size we’re at now. Two years ago, the system started showing its age. A year ago, we decided to do something about it. The problems we wanted to solve were:

1) The system we had only collected data at 5 minute intervals. This is way too slow to quickly detect problems after a bad code push. Bad code pushes are rare, but we want them to impact customers as briefly as possible.

2) The system we had would aggregate data as “average” over time, to keep coarser data available for a longer time. But this means that we lose useful resolution. What was the swing of the data within each “bucket” of measurements? What was the min, and the max?

3) The retention times for the data were too short. To tell whether the system is misbehaving right now, or whether it’s just normal high load for a weekend, we need accurate data from a week ago as a baseline.

4) The system that relied on metrics to be written into memcache, and then scraped back out into rrd files by cacti, was running out of steam, and we often had time intervals with missing data for many counters.

To solve these problems, we went looking for other counter management solutions. We tried a large number, wrote off a bunch, and then settled on “Graphite,” which our friends over at Etsy seemed to recommend highly. However, Graphite was still not quite right — it would still only allow a single aggregation function when aggregating metrics over time, and the built-in storage back-end had some performance problems, largely traced to the distribution model of “use NFS.”

So, we started writing our own back-end for the nice Graphite graphing front-end. We made the back-end fit into Graphite’s expectations, and exposed the different data from a single metric as separate sub-counters. For each data point in a graph, we could get the average, sample count, standard deviation, minimum, and maximum. Getting there required pretty heroic efforts, and pretty nasty hacks, though — the internals of Graphite simply weren’t made to support this use case. Also, Graphite used server-side rendering, which meant that just a few engineers keeping a dashboard of a dozen counters on their screen, refreshing every 10 seconds, would overload the machine collecting the metrics.
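The following is not istatd’s actual internals, just a sketch of the underlying idea: keep several aggregates per bucket instead of a single average, so that min, max, and spread survive when data is rolled up over time.

```python
import math

class Bucket(object):
    """Aggregates for one 10-second bucket of a single counter."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def record(self, value):
        self.count += 1
        self.total += value
        self.total_sq += value * value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def average(self):
        return self.total / self.count if self.count else 0.0

    def stddev(self):
        if self.count < 2:
            return 0.0
        mean = self.average()
        return math.sqrt(max(self.total_sq / self.count - mean * mean, 0.0))
```

Because each bucket keeps running sums rather than raw samples, two buckets can be merged exactly when older data is rolled up into coarser intervals.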

At some point, enough was enough, and we took the back-end we’d developed and wrote our own front-end. This front-end is an HTML5 application using client-side JavaScript for rendering, thus offloading the metrics server. It also uses HTTP for data transport, lending itself well to various kinds of clients, including web caching if needed!

Finally, to solve the intermittent data problem, we made the system capable of forwarding incoming data through a graph of agents. An agent can run on each server and receive local data, which it then forwards to the master database. Should the agent’s connection go down, the data is buffered while the agent attempts to re-connect.
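The sketch below illustrates the buffer-and-forward pattern in general terms; it does not implement istatd’s actual wire protocol or agent, and the class and method names are purely illustrative.

```python
import collections
import socket
import time

class ForwardingAgent(object):
    """Buffers metric lines locally and drains them to the master."""

    def __init__(self, master_host, master_port):
        self.master = (master_host, master_port)
        self.buffer = collections.deque()
        self.conn = None

    def receive(self, line):
        """Called for each metric line produced on the local host."""
        self.buffer.append(line)
        self.flush()

    def flush(self):
        while self.buffer:
            try:
                if self.conn is None:
                    self.conn = socket.create_connection(self.master, timeout=5)
                self.conn.sendall(self.buffer[0].encode("utf-8") + b"\n")
                self.buffer.popleft()
            except OSError:
                # Master unreachable: keep the data buffered and retry later.
                self.conn = None
                time.sleep(1)
                return
```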

Today, we’re releasing this (both back-end and front-end) on GitHub to the open source world as our contribution to operations and engineering teams everywhere. If you have a large number of counters to track, and want richer data than a “simple” aggregation function per data point, I encourage you to have a look, starting with the wiki:

https://github.com/imvu-open/istatd/wiki

The back-end application is written in C++ with boost::asio for threading and networking, and currently keeps half a million counter files, each updated every 10 seconds, on a mid-range Dell server with raid5 SSD drives. Currently, there is build and packaging support for Ubuntu 10.04 LTS, although any reasonable UNIX with GCC should be supported. Give it a spin, and let us know what you think!