Continuous Monitoring: Real-time statistics for a thousand servers and the application they serve
At IMVU, we push code to production fifty times a day. Each time an engineer finishes a task, the code goes through a large battery of unit tests, and when it passes, we deploy it on our servers right away. This makes the feedback loop immediate: If something is wrong, we hear about it and can fix it while the context is still fresh in the mind of the engineer.
An important part of this process is the “immune system.” The immune system monitors the status of the entire application and detects abrupt changes. If these changes are bad enough, and closely enough correlated with a recent code deployment, that deployment is rolled back, and the engineer in question is sent links to graphs and error logs to help figure out the problem.
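To make the rollback decision concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the production immune system: the z-score test, the 3-sigma threshold, the 5-minute correlation window, and the names `is_abrupt_change` and `should_roll_back` are all hypothetical.

```python
from statistics import mean, pstdev


def is_abrupt_change(baseline, recent, threshold=3.0):
    """Return True if the mean of `recent` is more than `threshold`
    baseline standard deviations away from the mean of `baseline`."""
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any movement at all counts as a change.
        return mean(recent) != mean(baseline)
    return abs(mean(recent) - mean(baseline)) / sigma > threshold


def should_roll_back(baseline, recent, seconds_since_deploy,
                     window=300, threshold=3.0):
    """Roll back only when the change is abrupt AND a deploy happened
    within the last `window` seconds (assumed correlation rule)."""
    recently_deployed = seconds_since_deploy <= window
    return recently_deployed and is_abrupt_change(baseline, recent, threshold)


# Error counts per interval before and after a hypothetical deploy.
errors_before = [10, 11, 9, 10, 10, 11, 9, 10]
errors_after = [25, 30, 28]
print(should_roll_back(errors_before, errors_after, seconds_since_deploy=60))
```

The same spike an hour after the last deploy would not trigger a rollback in this sketch, because the time correlation is missing.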
For a long time, we used rrdtool with scripts to scrape counter values out of memcached to capture data, and cacti to plot that data into graphs. This was an easy way to get started when IMVU was small, and it has scaled to the size we’re at now. Two years ago, the system started showing its age. A year ago, we decided to do something about it. The problems we wanted to solve were:
1) The system we had only collected data at 5-minute intervals. This is far too slow to quickly detect problems after a bad code push. Bad code pushes are rare, but we want them to impact customers as briefly as possible.
2) The system we had would aggregate data as “average” over time, to keep coarser data available for a longer time. But this means that we lose useful resolution. What was the swing of the data within each “bucket” of measurements? What was the min, and the max?
3) The retention times for the data were too short. To tell whether the system is misbehaving right now, or is just under normal high load for a weekend, we need accurate data from a week ago as a baseline.
4) The pipeline, which relied on metrics being written into memcached and then scraped back out into rrd files by cacti, was running out of steam, and we often had time intervals with missing data for many counters.
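Problem 3 is easier to see with an example. The Python sketch below compares the current value of a counter with the value at the same time one week earlier. The function name, the dictionary layout, and the 50% tolerance are assumptions made for illustration. The point is only that the check cannot run at all once the week-old sample has been dropped or averaged away.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60


def deviates_from_last_week(series, now, tolerance=0.5):
    """`series` maps epoch seconds to a counter value.

    Returns None when either sample is missing (for example because
    retention is too short), otherwise True if the current value differs
    from last week's value by more than `tolerance` as a fraction.
    """
    current = series.get(now)
    baseline = series.get(now - WEEK_SECONDS)
    if current is None or baseline is None:
        return None
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > tolerance


now = 1_000_000_000  # arbitrary timestamp for the example
weekend_load = {now - WEEK_SECONDS: 1400, now: 1500}
print(deviates_from_last_week(weekend_load, now))
```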
To solve these problems, we went looking for other counter management solutions. We tried a large number, wrote off a bunch, and then settled on “Graphite,” which our friends over at Etsy seemed to recommend highly. However, Graphite was still not quite right — it would still only allow a single aggregation function when aggregating metrics over time, and the built-in storage back-end had some performance problems, largely traced to the distribution model of “use NFS.”
So, we started writing our own back-end for the nice Graphite graphing front-end. We made the back-end fit into Graphite’s expectations, and exposed the different data from a single metric as separate sub-counters. For each data point in a graph, we could get the average, sample count, standard deviation, minimum, and maximum. Getting there required pretty heroic efforts, and pretty nasty hacks, though — the internals of Graphite simply weren’t made to support this use case. Also, Graphite used server-side rendering, which meant that just a few engineers keeping a dashboard of a dozen counters on their screen, refreshing every 10 seconds, would overload the machine collecting the metrics.
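One way to keep all five values per data point is to store a count, a sum, a sum of squares, a minimum, and a maximum for each time bucket. The Python sketch below shows that idea. It is an assumption about a reasonable layout, not a description of the actual back-end file format, and the `Bucket` class is hypothetical. Its useful property is that two buckets can be merged into a coarser one without losing the min, max, or spread.

```python
import math
from dataclasses import dataclass


@dataclass
class Bucket:
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value):
        """Fold one raw sample into the bucket."""
        self.count += 1
        self.total += value
        self.total_sq += value * value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        """Combine two buckets into one coarser bucket."""
        return Bucket(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0

    @property
    def stddev(self):
        """Population standard deviation derived from the running sums."""
        if self.count == 0:
            return 0.0
        variance = self.total_sq / self.count - self.average ** 2
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding


first, second = Bucket(), Bucket()
for v in (1, 2, 3):
    first.add(v)
for v in (4, 5):
    second.add(v)
coarse = first.merge(second)
print(coarse.count, coarse.average, coarse.minimum, coarse.maximum, coarse.stddev)
```

Averaging alone would report 3.0 for the merged bucket and hide the fact that the samples ranged from 1 to 5.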
Finally, to solve the intermittent data problem, we made the system capable of forwarding incoming data from one instance to another, so that instances can be arranged in a graph. An agent can run on each server and receive local data, which it then forwards to the master database. Should the agent connection go down, the data is buffered while the agent attempts to reconnect.
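The buffering behavior can be sketched in a few lines of Python. This is a simplified illustration, not the agent's real code: the `BufferingForwarder` class, the `send` callable, the record strings, and the bounded buffer size are all assumptions.

```python
from collections import deque


class BufferingForwarder:
    def __init__(self, send, max_buffered=10000):
        # `send` is any callable that delivers one record upstream and
        # raises ConnectionError when the upstream link is down.
        self._send = send
        # Bounded queue: when full, the oldest records are dropped.
        self._pending = deque(maxlen=max_buffered)

    def submit(self, record):
        """Queue a record, then try to deliver everything pending."""
        self._pending.append(record)
        return self.flush()

    def flush(self):
        """Deliver pending records in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # keep the record for the next attempt
            self._pending.popleft()
        return True

    @property
    def buffered(self):
        return len(self._pending)


# Simulated upstream link that starts out disconnected.
received = []
link = {"up": False}


def send_upstream(record):
    if not link["up"]:
        raise ConnectionError("master unreachable")
    received.append(record)


agent = BufferingForwarder(send_upstream)
agent.submit("cpu.idle 93")
agent.submit("cpu.idle 91")
link["up"] = True  # connection restored
agent.submit("cpu.idle 95")
print(received)
```

Records are removed from the queue only after a successful send, so a failure in the middle of a flush neither loses nor reorders data.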
Today, we’re releasing this (both back-end and front-end) on GitHub to the open source world as our contribution to operations and engineering teams everywhere. If you have a large number of counters to track, and want richer data than a “simple” aggregation function per data point, I encourage you to have a look, starting with the wiki:
The back-end application is written in C++ with boost::asio for threading and networking, and currently keeps half a million counter files, each updated every 10 seconds (about 50,000 updates per second on average), on a mid-range Dell server with RAID 5 SSD drives. Currently, there is build and packaging support for Ubuntu 10.04 LTS, although any reasonable UNIX with GCC should be supported. Give it a spin, and let us know what you think!