Archive

Author Archive

Efficient and Scalable Off-Site Backup to Amazon Glacier

March 11, 2013 1 comment

By Ted Reed

The strength of IMVU is our large catalog of User-Generated Content (UGC). With more than ten million items in our virtual catalog, losing our UGC would be crippling to our business.  We in the Operations Team take the preservation of this data seriously. We recently upgraded our aging UGC backup system to use Amazon Glacier as the storage medium. This post will briefly explain the old backup system before detailing the new one.

In The Beginning

We store our UGC in a MogileFS instance, a system for cheap and efficient storage of files across commodity hardware. Our offsite backups originally took the form of USB drives onto which a process would copy the files as they were written to Mogile. As each drive filled up, we would then transport it from our colocation facility to a fire safe in our office. To cover the period of time when the disk was still being written to, we would keep a synced copy of the disk on a machine in a server closet in our offices. This copy would then be deleted once the USB disk had been safely stored.

As time went on, the rate at which our customers were adding photos or uploading products to our catalog grew and grew. We also started permitting higher-quality photos and made it easier to upload and manage them. In order to compensate for the increase in growth, we started buying larger and larger USB drives. This past summer we reached the point where we had the biggest USB drives we could reasonably get and we were still filling them up in about a week. Each time a drive filled up, one of us had to drive out to the facility to retrieve it.

Auditing the data also became woefully inconvenient, as the number of individual drives one had to juggle during the audit skyrocketed. It was an annoyingly manual process, even with helper scripts to manage everything but plugging and unplugging drives. Additionally, annoyingly manual processes tend to not get done as often as we ought to.

Enter Amazon Glacier

We spent some time looking over options for better off-site backups. We talked to many vendors and even for those who could handle our “monumental amount of data” (actual term used by a vendor who shall remain nameless). The quotes left us wondering if we might not be better off just putting some cheap servers in another data center somewhere.

While we were evaluating these options in late August 2012, Amazon announced Glacier, an “extremely low-cost storage service” intended for long-term archival of infrequently-accessed data. At just one cent per gigabyte per month, it handily beat out every other vendor we’d seen in terms of price. We poked around and tested the service out, and found it to be right up our alley, although the pricing was annoyingly confusing. (I ended up writing a small Haskell program to re-implement the math from their FAQ. It rounds in weird places.)

The Backup Process

The backup process begins with a Perl daemon which manages worker pools of Fetchers and Uploaders. The master process figures out which files need to be backed up and hands work off to the Fetchers, which talk to Mogile and pull the file down to a local directory. Once a directory has about 25 MB of files, it’s closed off and handed to an Uploader. The Uploader will then tar the directory and send the tar to Glacier.

It’s worth pointing out here that the 25 MB bundling is actually pretty important if you don’t want to pay a great deal of money. Glacier charges five cents per thousand uploads. My first tests ignored that cost and we pretty quickly ran up an unexpectedly large bill. After some analysis, we found that 25 MB was about right as a middle point between cost and flexibility. This middle point is likely to differ for your use case and budget. The state of the backup process is stored in a MySQL database, which has tables for archives and the files that go into them. We record roughly the same information that Mogile does about each file so that we can accurately restore it if the file is accidentally deleted from Mogile. For each archive, we store both its size as well as information about where to find it (region/vault/id). We also have tables to record information about backup failures for later investigation.

Restoration

The restoration process begins with a simple front-end script, which can take a source and destination. The source can be in terms of Mogile ID, Mogile Domain/Key, or one of our URLs (the latter two are dereferenced to Mogile ID). The destination can be Mogile itself, a local file, or S3. There’s also a batch mode which takes input where each line represents one source and one destination. The script will record each file to restore in the database. It will then issue requests to Glacier to retrieve the archives needed to restore the files, unless there is already an active and uncompleted request for the same archive.

Elsewhere in our network, we have a daemon which sits and listens to an SQS queue tied to the SNS topic for restoration. The notification will include the database ID for the restore request. The daemon will pull the information needed to do the restoration and then clear it afterwards.

Auditing and Validation

One side benefit of a backup system that doesn’t involve managing disks is that we can automate testing the backups and our restoration process. There are two aspects that we want to test. First, we want to prove that the backups are intact and accurate. Second, we want to prove that we are able to restore the data. Glacier permits you to pull up to 5% of your total data per month, presumably for purposes such as these. You still need to pay bandwidth egress charges if the data leaves AWS, which informed our design for the automated audit process.

In order to audit the backups, there’s a cron on our side which runs hourly and selects a random sampling of one millionth of the archives we’ve uploaded. If we had 3,000,000 archives, we’d audit 3 per hour. For each archive, this cron pulls the files from Mogile and generates md5sums. It issues a request to Glacier, with a different SNS topic than the one we use for ordinary restoration. It then uploads the md5sums as a file to an EC2 micro instance which runs a daemon listening for that SNS topic. As the archives become available, the daemon does the same md5sum generation on the Glacier data, comparing it to the list uploaded beforehand. If there is a discrepancy, it sends an email to a mailing list that we check at least daily.

To handle verifying the restoration path, there is a smaller monthly process. It will generate a list of 10 random files and use our normal restoration tool to restore them to test buckets in both Mogile and S3. If the files haven’t been restored within 24 hours, a notification goes to the same mailing list as for audit failures.

In The End

We’re still pushing our historical data into Glacier, but this project is already looking pretty successful. We’ve been adding new data since last November, and are in the process of checking through everything to gain a full trust in the system before we finally pull the plug on the old system and build a mighty throne out of the USB drives.

The backup process was mostly optimized to be able to push these historical files up to Amazon in a reasonable timeframe. As a result, the process which backs up our real-time uploads is quite fast with plenty of room to grow as our usage scales up. Files added to Mogile are in the backup process within seconds, and are typically in Glacier within less than a minute. I looked at our system during peak time for an example, and found that files were in Glacier 35 seconds after being uploaded to us by a user. Knowing this gives me great peace of mind that our UGC will be safe in the event a disaster strikes.

What about other forms of backup? We briefly looked at storing our offsite database backups in Glacier, but decided against this. Any data you put into Glacier will be billed for at least three months. Adding daily backups and then deleting some of them later makes for a rather expensive solution. There’s also little to gain as the process which moves the database backups to our office isn’t nearly as painful as our older UGC backup process was.

Monitoring Delayed Replication, with a focus on MySQL

January 9, 2013 1 comment

By Stuart Cianos, CISSP

Author’s Preface

Many (many) years ago, I had the opportunity to work with a variety of organizations targeting harm reduction for at risk populations in the San Francisco Bay Area. One of the great lessons learned from that experience is that the future is mostly indeterminate. In 1997-1998, severe storms ravaged our region and very few of us dealing with relief services (myself included) anticipated the total impact. It was only by sheer luck and coincidence that there was available shelter space to house an entire community of displaced families after automatic flood gates failed to open on a large stream following years of drought. We didn’t think it would happen to us until it was too late to be better prepared. If I win the lottery tomorrow, my future might change in ways that were unexpected. Conversely, if I were to accidentally delete all of the files (“rm -rf /”) on a mission critical computer system, my future might also change in unexpected ways.

I no longer worry about the future, I prepare for it. What follows is a powerful technique which can help mitigate risks around unanticipated data loss, even when that data loss is due to human error.

Introduction

IMVU utilizes MySQL on a daily basis to drive our site and customer experience forward. As IMVU has grown and scaled, so have the number of database hosts and the importance of data maintained in our environment. With over 100,000,000 registered user accounts and hundreds of thousands of concurrent transactions at any given time, it’s of paramount importance that our practices embody the CIA triad: confidentiality, integrity and accessibility of information. Implementing delayed replication at the data tier can enhance business continuity and improve posture when recovering from events impacting data integrity and availability.

IMVU uses MogileFS to store user generated binary content (images, content creator products, etc.). MogileFS is a user-space fault tolerant, distributed file system. MogileFS stores filesystem metadata in a MySQL database (PostgreSQL is also supported, but not used in our environment). The system is designed to ensure that no single failure will result in data loss or service degradation.

In June of 2012, human error resulted in the MogileFS metadata table (“file”) being dropped from the primary database instance rather than a standby under maintenance. Without any way to relate requests for files back to the content on disk, MogileFS could no longer serve any requests resulting in customers being unable to access some major features of the IMVU social network and rich 3D experience. There was a slave database instance lagging behind its master due to write heavy operations, which allowed us to use it as a recovery source. We were very fortunate that this slave was lagging behind unintentionally; had it not been, the impact of this incident would have been much more severe.

Delayed replication is being implemented across IMVU’s cluster of database instances in order to protect against these types of accidental operations moving forward.

Data Availability and the CIA Triad

The CIA triad is a set of common attributes which should be applied to information management and security. Information on the CIA triad is available abundantly online, but a brief summary is below:

Stuart blog1

  • Confidentiality: Prevention of information disclosure to unauthorized entities.
  • Integrity: Prevention and detection of the unauthorized modification of data.
  • Availability: Assurance that the information will be available when needed.

One question that comes up is what delayed replication has to do with information security… and my response is “a lot!” Delayed replication can help mitigate risks around data availability as well as data integrity, and the impact an incident will have on the bottom line.

What is Database Replication?

Database replication allows transactions on one database host to be replicated to one or more separate instances. There are many replication strategies, but the examples discussed will focus on a simple Master-Slave configuration throughout.

Stuart blog2

When the transaction “INSERT INTO foo VALUES (‘hello world’)” is committed to the master database instance, it is replicated on the slave and committed there as well. By having a slave or standby host available and relatively up to date, it is possible to recover very quickly from various hardware and software failures on the master by replacing it with the slave/standby host.

Stuart blog3

What is Delayed Replication?

Delayed replication is the technique of inserting a delay line into a database’s replication mechanism. Transactions will each be held in a first-in, first-out queue for two hours prior to committal on the slave host being delayed. For the purposes of this example, “mytable” is a mission critical set of data required for the application to function and contains 100 gigabytes of data. If the table “mytable” becomes unavailable for whatever reason (for example, a drop table statement was executed accidentally/unintentionally on the primary and it immediately replicated and got executed on the standby), customers will not have access to the service. In this example, fictitious business parameters will be defined to help us quantify possible benefits:

  • Recovery time objective (RTO): 8 hours
  • Recovery point objective (RPO): 4 hours
  • Single loss expectancy is calculated using the standard formula: SLE=(Asset Value) * (Exposure Factor).
  • Our exposure factor (a subjective percentage of functionality or impact): 100%, since no customers will be able to access the system.
  • Timespan used to calculate the single loss expectancies in terms of hours is the time from failure to service availability.
  • Value of customer transactions against “mytable”, per hour: $10,000 USD
  • Single loss expectancy based on recovery time objective of 8 hours: (10,000 * 8) * (1.00) = $80,000.
  • Total time to rebuild “mytable” from a backup: 20 hours
  • Total time to fail over a database from master to standby/slave: 5 minutes

Given the above, a single loss expectancy can be calculated for “mytable” assuming a worst case scenario of ~20 hours. The realistic single loss expectancy (for the table, not the entire database server asset as a whole) may be calculated: (10,000 * 20) * (1.00) = $200,000. This is $120,000 loss above and beyond what management would have expected. Worse, the recovery time objective and recovery point objectives cannot be consistently met.

Adding a two hour delay line (well within the recovery point objective) can help meet business requirements and reduce single loss expectancy.

stuart blog5

The above transaction seems pretty innocuous, and having a delay line enabled doesn’t make sense for some use cases such as read slaves that are expected to be relatively consistent with their master. So… why bother with a delay line at all?

The benefits become clear when looking at another example. What if the INSERT statement above is changed to DELETE FROM mytable (or even DROP TABLE mytable) due to a bug in code, human error or malicious intent? In an environment with immediate committal on all slaves, one must:

1. Hold their breath and:

◦     Hope that a slave can be stopped before the transaction propagates, dump the table, and restore to the impacted master (likely 8+ hours).
◦     Hope that a slave can be stopped before the transaction propagates and fail over to it.

2. Restore from backup (20 hours).

Hope must not be factored into sound business practices; option one is off the table. With a delay line enabled procedures may be implemented to create a recovery process:

stuart blog4

The recovery process and benefits when using delayed replication can be documented, made repeatable and proven. So long as the problem is caught before the offending transaction makes it through the delay line, the recovery process is:

  1. T+0 minutes: Service degraded. Investigation into cause begins.
  2. T+5 minutes: Cause determined. Stop all replication to the delayed slave (on MySQL, don’t forget to stop the slave IO thread in addition to the SQL thread).
  3. T+35 minutes: Remove the pending transaction in the relay log, or, reset the relay log based on best judgment and the importance of data consistency.
  4. T+40 minutes: Promote the delayed standby to master.
  5. T+50 minutes: Validate that the system is functional.
  6. Service recovery declared.
  7. Recover the demoted master to a consistent state.

The total time to recovery during the above incident was 50 minutes, with an approximate loss of (10,000 * 0.833) * (1.00) = $8,333. The recovery time objective and recovery point objective have been met.

Delayed Replication and Seconds Behind Master:

MySQL 5.6 natively includes delayed replication as a new feature, and will be configurable via the CHANGE MASTER TO statement. MySQL 5.6 is not a generally available release, however, so is not deployed in production for most environments. Organizations running versions 5.1 and 5.5 can effectively implement delayed replication using the Percona-Toolkit utilities (formerly Maatkit) from Percona Software. The Percona Toolkit is an open source collection of helpful tools focused towards management of the MySQL database server. At IMVU, we have successfully implemented delayed replication using the percona-toolkit tools (more specifically, pt-slave-delay).

One side effect of delayed MySQL replication via an external process is that it is no longer possible to fully determine slave state using SHOW SLAVE STATUS. As the slave’s SQL_THREAD is periodically stopped and started, “Seconds behind master” will be NULL most of the time.

In order to mitigate the lack of information from MySQL’s internal replication state, a second utility is available through the percona-toolkit: pt-heartbeat. Pt-heartbeat writes a timestamped entry to a table periodically, creating a heartbeat.

The heartbeat transaction will be replicated, and delayed. By comparing the slave’s current time with the timestamp on the last committed heartbeat, we can approximate with fair certainty the number of seconds the slave has been delayed (or is lagging behind the master).

Monitoring the Solution, effectively:

Monitoring delayed replication becomes more complex than simply watching for heartbeats and comparing timestamps:

  • Usually, the slave will be stopped during backups unless something like Percona Xtrabackup or Enterprise Backup is being used. The replication delay will increase for the duration of the backup, and then contract. It should be noted that IMVU is OK with taking backups from a two hour old data source given our business requirements. Always check with your organization’s business requirements; don’t assume!
  • The replication delay will be affected by MySQL’s replication characteristics, including the fact that it is single threaded. High transaction volume can cause replication delays to increase and subsequently contract. The replication delay is variable and will not be consistent on a heavily loaded database.
  • If the pt-heartbeat/pt-slave-delay fails to maintain the delay line and monitoring, the team must always be made aware.
  • If the replication fails due to IO or SQL thread errors, the team must always be made aware.
  • If the database replication or monitoring system fails for any other reason, the team must be made aware. Contingencies must be in place for failures in MySQL, as well as in code which monitors delayed replication.

Rather than looking at the difference in timestamps at a moment in time, IMVU took the approach of calculating the slope of the delay across a timespan. With the additional information, descriptive statistics are calculated:

  • The slope of the delay line’s values over a time period: Allows determination of whether or not the delay line is stable, moving towards the desired value, or away from it and how quickly.
  • The Y-intercept of the slope: Treated as the approximate current number of seconds behind, functionally equivalent to MySQL’s Seconds Behind Master.
  • Correlation Coefficient: Allows determination of how well the current values on the trend line correlate as a series. If the correlation is 0, there is very little correlation meaning that the values are highly distributed over the given timespan. For values -1 < p < 0, there is correlation between values and the line is trending downward. For values 0 < p < 1, there is correlation between values and the delay line is trending upwards.

Additional information is helpful as well, and can be calculated based on data gathered from MySQL:

  1. The time when backups were kicked off.
  2. The number of seconds backups have been running (if applicable).
  3. The number of seconds the trend line has been in a degraded state due to: backups running, recovering after a backup, and trending towards recovery but not in a backup state.

If backups are being taken from the delayed standby/slave, the delay line will increase dramatically when replication is paused.

To prevent false alarms from paging our operations team, monitoring sensitivity must be decreased during backup windows. Once the backup finishes, it will take time for MySQL’s replication thread to catch up to the desired delay line. Again, sensitivity must be reduced during the recovery window.

If the delay line briefly dips below the configured value for 20 seconds but recovers 30 seconds later, the monitoring system should not page as this would be considered a false alarm at 3:00 in the morning. This is not a real time computing system/database platform, so there is no guarantee that the delay line will maintain at exactly the configured value… only a promise that it will maintain the delay line as close to the target as possible. On average, the delay line varies by up to 10 seconds when not under load and not under impact.

In a mission critical system, any type of exception to standard monitoring should have clearly defined parameters and values around the specific exception trigger, and the limits of the exception. In IMVU’s case, the limits around the exception are based on time. Every exception which results in reduced sensitivity has a corresponding limit to how long the service can remain in that state. If the limit is exceeded, the service goes into an alarm state for Nagios to process. Therefore, no external process may lessen sensitivity of the monitoring beyond configurable limits.

Delayed replication and monitoring in action!

Here is a real example of delayed replication running in a production environment (currently being staged across our cluster). It should be noted that additional exceptions and limits are in place for our internal processing. Our implementation will accept the following configurable limits and ranges:

–max-seconds-behind=<seconds>: Max seconds behind for local heartbeat (there needs to be a frequent heartbeat detected from the local host itself as well as the master – or there’s a problem!)

–max-relay-behind=<seconds>: Max seconds for relay log last update

–max-master-behind=<seconds>: Max seconds for master log last update

–max-backup-behind=<seconds>: Max seconds a backup may run

–max-deshard-behind=<seconds>: Max seconds a deshard job may run

–max-last-seen=<seconds>: Max seconds since worker last seen

–min-delay-time=<seconds>: Minimum allowed replication transaction delay time

–max-delay-time=<seconds>: Maximum allowed replication transaction delay time

–max-recovery-time=<seconds>: Seconds after backup completes to allow variances

–max-trend-time=<seconds>: Maximum time to allow intercept to trend to recovery

The actual limits passed to our Nagios plugin during our staged roll out are currently:

–max-seconds-behind 60

–max-master-behind 1800

–max-relay-behind 28800

–max-backup-behind 64800

–max-deshard-behind 32400

–max-last-seen 60

–min-delay-time 5400

–max-delay-time 9000

–max-recovery-time 21600

–max-trend-time 9000

A database instance in good health with a 2 hour replication delay:

stuart blog6

A database instance currently running an active backup. If the backup was not running, this service would be in an alarm/critical state, as the replication delay line has grown to 16,408 seconds (far above our window limit of 9,000 seconds, and the target goal of 7,200 seconds):

stuart blog7.jpg

Once a backup finishes, the recovery window is entered. If recovery back to the desired delay line does not occur within the recovery window, the service will go critical:

stuart blog8

The complete list of delayed replication conditions trapped at IMVU is listed below, along with the service state (i.e. WARNING, CRITICAL):

  • WARNING: Replication delay intercept recovering – below minimum with slope: Replication delay is below the minimum desired value, but is moving towards recovery.
  • WARNING: Replication delay intercept recovering – above maximum with slope: Replication delay is above the maximum desired value, but is moving towards recovery.
  • WARNING: Replication delay intercept is out of bounds, currently in recovery window for backup which completed <seconds> seconds ago: Replication delay is above/below the desired range, and may not be moving towards recovery. State is held in warning to allow recovery post backup until the recovery window expires.
  • WARNING: Replication delay intercept is out of bounds, currently in recovery window for deshard which completed <seconds> seconds ago: Replication delay is above/below the desired range, and may not be moving towards recovery. State is held in warning to allow recovery post deshard until the recovery window expires.
  • WARNING: Replication delay intercept recovering – real value within range: Replication delay intercept is above or below the desired range, but the real value has already recovered. Replication delay intercept is trending towards recovery.
  • CRITICAL: Replication delay intercept is out of bounds, and slope does not indicate recovery: The replication delay is outside of the desired range, and is not moving towards recovery or is not moving towards recovery at the minimum desired rate (slope).
  • CRITICAL: Replication delay intercept recovery exceeded window: The replication delay intercept was recovering but exceeded the maximum recovery window time limit. Replication may be lagging or is not catching up in sufficient time.
  • WARNING: Active backup/deshard: An active backup or deshard job is running.

Naturally, other parameters necessary for healthy replication should continue to be checked as well. Replication, including delayed replication, will always fail if the replication state is stopped due to an error.

Caveats

No solution is completely perfect or without trade-offs, and delayed replication is no exception. Some considerations that should be taken into account prior to implementation:

  • If failing over to a delayed standby/slave, recovery time can be impacted. For instance, if auto-incrementing columns are used in MySQL and statement based or hybrid replication is being used then the standby must be caught up or data inconsistencies are likely. Make sure that the additional time to catch the host up is taken into account when estimating service restoration time.
  • The longer replication is delayed, the longer it will take for an instance to catch up.
  • If the application being targeted uses the delayed standby as a read slave, it is important to verify that the delay line doesn’t impact functionality or create edge/corner cases (this is not the case at IMVU, but is worth mentioning as it is not uncommon). Ideally, this should be validated by understanding the application’s logic or (even better) its code. A great example: A user is disabled in an application that actively queries read slaves for user information in a table. If the read slave is two hours behind, the user    might be able to log in until the slaves commit the transaction which disabled the account.
  • Delayed replication will not mitigate data loss unless the impacting event is caught within the delay window, and action is taken prior to the statements in question being executed. Clearly document that delayed replication only serves as a component of a comprehensive ecosystem.

Final Thoughts

Delayed replication is most useful to recover from errors made by administrative users, particularly since statements involving most DDL cannot be wrapped in a transaction. It is not, however, a magic bullet; it only provides protection if the offending transactions are identified and removed before making it through the delay line. The monitoring strategies for delayed replication are different (and decoupled) from standard MySQL replication monitoring, and the decoupled processes themselves must be monitored as well.

Continuous Monitoring: Real-time statistics for a thousand servers and the application they serve

September 26, 2012 3 comments
By Jon Watte, Technical Director, IMVU.com 
Image

At IMVU, we push code to production fifty times a day. Each time an engineer finishes a task, the code goes through a large battery of unit tests, and when it passes, we deploy it on our servers right away. This makes the feedback loop immediate: If something is wrong, we hear about it and can fix it while the context is still fresh in the mind of the engineer.

An important part of this process is the “immune system.” The immune system monitors the status of the entire application, and detects abrupt changes. If these abrupt changes are bad enough, and closely enough correlated with a recent code deployment, that code deployment is rolled back, and the engineer in question sent links to graphs and error logs to go look at to figure out the problem.

For a long time, we used rrdtool with scripts to scrape counter values out of memcached to capture data, and cacti to plot that data into graphs. This was an easy way get get started when IMVU was small, and it has scaled to the size we’re at now. Two years ago, the system started showing its age. A year ago, we decided to do something about it. The problems we wanted to solve were:

1) The system we had only collected data at 5 minute intervals. This is way too slow to quickly detect problems after a bad code push. Bad code pushes are rare, but we want them to impact customers as briefly as possible.

2) The system we had would aggregate data as “average” over time, to keep coarser data available for a longer time. But this means that we lose useful resolution. What was the swing of the data within each “bucket” of measurements? What was the min, and the max?

3) The retention times for the data were too short. To compare if the system is mis-behaving right now, or if it’s just normal high load for a week-end, we need accurate data from a week ago as a baseline.

4) The system that relied on metrics to be written into memcache, and then scraped back out into rrd files by cacti, was running out of steam, and we often had time intervals with missing data for many counters.

To solve these problems, we went looking for other counter management solutions. We tried a large number, wrote off a bunch, and then settled on “Graphite,” which our friends over at Etsy seemed to recommend highly. However, Graphite was still not quite right — it would still only allow a single aggregation function when aggregating metrics over time, and the built-in storage back-end had some performance problems, largely traced to the distribution model of “use NFS.”

So, we started writing our own back-end for the nice Graphite graphing front-end. We made the back-end fit into Graphite’s expectations, and exposed the different data from a single metric as separate sub-counters. For each data point in a graph, we could get the average, sample count, standard deviation, minimum, and maximum. Getting there required pretty heroic efforts, and pretty nasty hacks, though — the internals of Graphite simply weren’t made to support this use case. Also, Graphite used server-side rendering, which meant that just a few engineers keeping a dashboard of a dozen counters on their screen, refreshing every 10 seconds, would overload the machine collecting the metrics.

At some point, enough was enough, and we took the back-end we’d developed, and wrote our own front-end. This front-end is a HTML5 application using client-side JavaScript for rendering, thus offloading the metrics server. It also uses HTTP for data transport, thus lending itself well to various kinds of clients — including web caching if needed!

Finally, to solve the intermittent data problem, we made the system capable of forwarding incoming data in a graph. An agent can run on each server, and receive local data, which it then forwards to the master database. Should the agent connection go down, the data is buffered while the agent attempts to re-connect.

Today, we’re releasing this (both back-end and front-end) on GitHub to the open source world as our contribution to operations and engineering teams everywhere. If you have a large number of counters to track, and want richer data than a “simple” aggregation function per data point, I encourage you to have a look, starting with the wiki:

https://github.com/imvu-open/istatd/wiki

The back-end application is written in C++ with boost::asio for threading and networking, and currently keeps half a million counter files, each updated every 10 seconds, on a mid-range Dell server with raid5 SSD drives. Currently, there is build and packaging support for Ubuntu 10.04 LTS, although any reasonable UNIX with GCC should be supported. Give it a spin, and let us know what you think!

Categories: Engineering Tags:

Writing Resilient Unit Tests

By Andrew Wilcox

I happened to be looking at a unit test I wrote a couple months ago and was dismayed to realize it wasn’t working any longer:

test("don't launch the rocket in stormy weather", function () {
    setup_rocket_systems_controller();
    setup_fake_ground_based_internet();
    set_weather_to_stormy();
    equal(ready_to_launch(), false);
});

Today this test will always go green – regardless of whether the code launches the rocket in stormy weather or not.

My unit test worked when I first wrote it.  But my test was also fragile: it was easily broken in the normal course of code development.

Here’s the code being tested as it appears today:

function ready_to_launch() {
     var controller = connect_to_rocket_systems_controller();
     if (! controller) {
         return false;
     }

     if (controller.get_fuel_level() < needed_fuel()) {
         return false;
     }

     var internet = controller.connect_to_ground_based_internet();
     if (! internet) {
         return false;
     }

     if (internet.obtain_weather_report() == WEATHER_STORMY) {
         return false;
     }

     return true;
}

Of course we could argue whether this is the best coding style to use: whether each check could be split out into a separate function, or whether exceptions should be thrown instead of returning false, and so on.  But with working unit tests such improvements in design are easy to make.

Without unit tests (or with broken unit tests!) changing the code can easily break things without us noticing.  With my broken unit test, someone could be working on improving or enhancing the code, and they could make a small typo that broke the code which prevents launches in stormy weather, and they would think that everything was OK because — after all — there’s clearly a unit test that checks for that!

Why did my unit test break?  What happened was:

  • We noticed that launching our rockets without enough fuel often resulted in a crash;

  • An engineer carefully reviewed all the changes that would be needed to implement a fix, and wrote unit tests to check that the code would not launch the rocket when it didn’t have enough fuel;

  • A couple places in the code that needed to be updated were overlooked but caught by unit tests going red, and so were quickly found and fixed;

  • A few unit tests broke by going red because they themselves now needed to fuel the rocket in order to do their test, but they were also quickly found and fixed;

  • My unit test also now needed to fuel the rocket to do its test, but what it checked was just that the rocket wasn’t ready to launch… and so it was now going green not because of stormy weather but because it was failing to fuel the rocket!

At this point the code to check for stormy weather could be inadvertently broken by further code changes and no one might notice.  We could even delete the check for stormy weather entirely and my unit test would still go green!  Bit rot has set in… there’s impressive looking code to avoid launching the rocket when we’re not supposed to… and impressive looking unit tests to test that… and yet when stormy weather comes we could actually launch the rocket anyway.

There are two ways for a unit test to be broken, false-red tests and false-green tests:

  • false-red tests go red even if the code feature they’re testing is correct;

  • false-green tests go green even if the code they’re testing is incorrect.

Because we run all unit tests automatically before pushing to production, red tests are quickly found and fixed.  Thus any buggy unit tests that might be able to sneak into the code base are the false-green ones: the ones that say “fine!  everything’s fine!  no problem!”… even when the code is broken.

“Happy path” unit tests rarely break by going false-green.  A happy path test is one that checks that the desired result is achieved when all the necessary conditions are in place.  E.g., we launch our rocket, and it makes it to low Earth orbit, and it inserts into the correct orbit, and it matches velocity with the space station, and it docks without crashing, etc. etc.  Or, in the case of IMVU, a happy path test would be that customers can place orders when they have enough credits, the product is available, and so on.

Happy path unit tests rarely go false-green because of entropy: there are typically many more ways for something to fail than there are for something to succeed.  It’s possible that I might manage to break a test and break the code at the same time and have my changes conspire to get the test to go green anyway…

test("2 + 2", function () {
    equal(plus(2, 2), 5);
})

but it’s rather unlikely.

“Sad path” unit tests on the other hand check that things do in fact fail when they’re supposed to.  “The rocket can’t be launched in stormy weather” is one example.  Another example would be “customers can’t buy a virtual product when they don’t have enough credits”.

Sad path unit tests are often important for the integrity of the business.  Suppose the check for not being able to buy a product when the customer had insufficient credits wasn’t working.  We might not notice this — just using our application ourselves — because when we use our application we don’t generally try to buy things when we don’t have enough credits.

But though they are often quite important, these sad path unit tests are more easily broken as the code continues to be developed because there are many ways that things can fail — and if the test is just checking for “something failed” then lots of things might be able to do that.  For example, this unit test will go green if the code throws any exception:

test("names must be a string", function () {
    raises(function () { set_name(3.14159) });
});

and so it will be broken by any bug that manifests by throwing an exception.  In fact when I review existing unit tests that merely check if an exception has been thrown, I’ve found that they’ve become useless more often than not!

Thus extra care is needed when writing sad path tests.  It’s not enough to just check whether the code failed — we need to check whether it specifically failed for the reason that we’re testing for.

For this we need to do an analysis: are there any other code paths that could generate the same result as we’re testing for?  For example, it is better to check for the specific exception being thrown — but that could still be buggy if there’s more than one way that specific exception could be thrown.

In the rocket example, one option would be to instrument the code to make it more testable by having it return a string indicating why we’re not ready to launch:

function ready_to_launch() {
    var controller = connect_to_rocket_systems_controller();
    if (! controller) {
        return "the rocket systems controller is offline";
    }

    if (controller.get_fuel_level() < needed_fuel()) {
        return "there is insufficient fuel to launch";
    }

    var internet = controller.connect_to_ground_based_internet();
    if (! internet) {
        return "the ground based Internet service is not available";
    }

    if (internet.obtain_weather_report() == WEATHER_STORMY) {
        return "flight protocols do not permit launching in stormy weather";
    }

    return null;
}

Code will often need some enhancements to make it testable, though you may be able to find ways that make the testability requirement less intrusive.

And so if I had written my unit test at the time to check for the specific condition I was testing for:

test("don't launch the rocket in stormy weather", function () {
    ...
    set_weather_to_stormy();
    equal(
        ready_to_launch(),
        "flight protocols do not permit launching in stormy weather"
    );
});

then when the fuel requirement had been added later my test would have broken by going falsely red instead of going falsely green — and it would have been fixed.

My test would have been a more resilient unit test — one that over time would have had a better chance of being kept working as development proceeded.

Categories: Engineering

Distributed Redis Transactions

February 9, 2012 Leave a comment

By Eric Hohenstein, IMVU, Inc.

Preface

Four times a year, employees of IMVU (engineers and non-engineers) are encouraged to participate in what we like to call “Hack Week“. Hack week is when we are given the opportunity to work on anything that we want so long as there is the possibility that it could eventually achieve one or more IMVU business objectives. Within that constraint, we have total freedom. On the Monday following hack week we get to show off what we accomplished to the rest of the company. Some people choose to work on relatively small projects by themselves and others will organize and participate in larger projects with many people.

Our last hack week was November 28 through December 2, 2011. This is a description of the project that I worked on in conjunction with two other people, David Johnson and Peter Offenwanger, both software engineers at IMVU.

What we built

We built an ACID compliant distributed key value database implementing a small but critical subset of the redis command set and wire protocol:

GET
SET
DEL
WATCH
UNWATCH
MULTI
EXEC

The source is available on github and has been released as open-source under the MIT license.

The system should theoretically be able to scale to arbitrary limits since there is no singular bottleneck in the design. Clients connect to a homogeneous pool of client session hosts. The client sessions hash keys to buckets and each bucket maps to a separate erlang process and back-end redis storage instance. The number of possible buckets is limited only by the number of bits used to generate the hash (currently 2^128). The client sessions and the buckets communicate with the a homogeneous pool of transaction monitors. Only the cluster startup/monitoring process can currently limit scalability but this process can also be distributed if necessary though that should only be needed for a frighteningly large cluster.

The motivation for this investigation was to explore the performance limits of redis beyond what a single redis host can offer. Any solution based on standard redis has an upper bound on throughput without sacrificing consistency. Some applications of key value stores require very strong consistency (for instance a separate index view of a data set maintained in parallel with the data set). Redis is remarkably performant, especially given that it operates with just a single thread. However if an application requires strong data consistency and higher throughput than redis (or any other key value data store) can offer, the options with existing solutions are very limited. The redis team is in the process of adding a clustering solution but it will not support distributed transactions, only consistent hashing.

How we built it

The project started as an application I built to learn erlang shortly after the previous hack week in July. Before we started this last hack week the project consisted of about 700 lines of code that implemented the entire supported command set but only accessible from the erlang shell and without any persistence and without a transaction monitor. By the end of this last hack week we had around 1200 lines of code and a working prototype. The main pieces missing from the prototype are crash recovery and process monitoring.

How we benchmarked it

We spent a fair amount of time during hack week trying to reuse a standard redis benchmarking tool. This gave us good visibility into redis and dtm-redis performance on a single host with and without the appendfsync redis configuration. After hack week I spent some time building a custom benchmarking tool in C (included in the dtm-redis project) to work around some limitations of the standard redis benchmarking tool which include:

  • no support for transactions
  • no latency measurements, only throughput
  • no support for multiple listening hosts

This custom tool accepts a list of host:port parameters on the command line, a parameter for the number of clients per host, the number of seconds to run the test, and the type of benchmark to perform (get_set or trans). It produces output like the following:

ehohenstein@localhost:~/dtm-bench$ ./dtm-bench host1:6379,host2:6379,host3:6379,host4:6379 100 60 get_set
creating 100 clients connecting to each of 4 hosts
starting clients
clients running for 60 seconds
stopping clients
total requests:  17604943
total latency:   23993.386276540
average latency: 0.001363
max latency:     0.217846477
elapsed time:    60.007721772
requests/sec:    293377

 

When testing transactions, a “request” represented a transaction involving five round trips: WATCH, GET, MULTI, SET, EXEC. The average latency is calculated across the entire transaction though the max latency is calculated across each individual round trip. In at least one instance this resulted counter-intuitively in the measured max latency being lower than the average latency.

Most tests were performed with 100 clients per listening host. Experimentation showed that throughput grew up to that level and flattened out once that number of clients per host was reached.

The IMVU operations team was very helpful by setting up and configuring 4 hosts in the production cluster with redis installed to use for benchmarking. These hosts had dual six-core Xeon 2.66 GHz processors with hyper threading enabled and with spinning disks. When testing dtm-redis, 8, 16, or 32 buckets were configured, each with its own separate redis instance listening on 127.0.0.X.

Benchmark results

Outside of transactions, the system was able to easily beat standard redis throughput performance. Throughput appears to increase linearly with the number of hosts (at least with the number we used in the benchmark test). Latency of dtm-redis was mostly similar to standard redis but max latency appears to increase exponentially with the number of hosts which is troubling.

Transactions were tested in both synchronous I/O (durable) and non-synchronous I/O (non-durable) modes. The durable mode was not totally durable since when performing this test, the redis configuration was not altered to make it use the appendfsync persistence option. It was nevertheless a fair approximation since in this mode, dtm-redis was performing 2 synchronous writes to disk per transaction or write operation.

In non-durable mode, dtm-redis throughput again beat standard redis and had linear growth with the number of hosts. This time, average latency of dtm-redis was roughly constant and max latency again grew exponentially, exceeding redis by 20x with four hosts.

In durable mode, dtm-redis throughput was extremely low. When testing with just a single host and appendfsync persistence, both redis and dtm-redis allowed roughly 300 requests per second which is roughly 3 orders of magnitude below the throughput without appendfsync. This is not surprising since the disk would be the bottleneck in both cases. In durable mode, dtm-redis again had roughly linear growth in throughput with the number of hosts though only up to a maximum of 1925 transactions per second with four hosts. Average latency and max latency were both roughly constant though note that max latency reached more than 3 seconds in the 3 host test. This is especially alarming given that max latency was measured for individual round trips rather than the transaction as a whole.

Durable mode performance could likely have been greatly increased using a host with a SSD. The IMVU operations team didn’t have one available for us to use for benchmarking so we aren’t able to compare apples to apples. However, I noted on my development system, which does have an SSD, while working on the benchmark tool that I was able to get over 5000 transactions per second locally with only 1 CPU in a linux VM being shared by redis, dtm-redis, and the benchmark tool. Chances are that SSD would increase durable mode performance by a factor of at least 10x.

One thing that was interesting was that in durable mode the throughput, average latency, and max latency all improved with the number of clients per host. This is almost certainly a side effect of an optimization which was chosen in dtm-redis to batch together multiple I/O operations waiting in the binlog process queue.

Another thing that was very interesting is that when running the non-transaction and non-durable transaction benchmarks, the hosts under benchmark test spent almost all of their CPU time running erlang code. Each of the 8 redis processes had roughly 20-25% CPU utilization and erlang had 800-1000%.

Conclusions

This prototype was built with only 1200 lines of code. If we were to build a fully functional and robust system, it’s probably fair to estimate that it would require at least a 10x increase in the amount of code necessary. That being said, this wasn’t a giant project. I can’t say if this would provide enough value to IMVU to build, but it is at least within reason to imagine us building it.

The use of erlang for this application is somewhat questionable. It was useful as a learning tool to build this prototype in erlang. I suspect however that erlang is not efficient enough to match the kind of performance typically expected from redis. The difference in average latency and the unacceptably high max latency of dtm-redis together with the CPU utilization of dtm-redis would suggest that erlang is spending too much time copying data from process to process. It’s hard to be sure without building another prototype but I believe that C would be a better choice. Building this system in C would be significantly more code however and a corresponding difference in development and maintenance costs.

Of course in durable mode, erlang performance is really not an important factor. In this case, the disk is the limiting factor in performance. The scalability of this system (as currently defined) is actually based solely on the ability to distribute disk I/O. Adding a large number of disks would likely allow a significant performance increase but might also reduce the mean time to failure.

After having completed the prototype it was clear that the durable mode is likely not going to be useful due to the very low throughput and high latency. An alternate design that I would like to explore is one that would be more or less equivalent to the default redis mode where the data-set is periodically checkpointed. It should be possible to obtain the performance characteristics of dtm-redis seen in non-durable mode while maintaining (check-pointed) consistency by periodically synchronizing all bucket processes and having them each initiate a save operation. This will cause occasional high latency but the average latency should remain low.


Categories: Engineering

Gephi Plugins Developers Workshop at IMVU

by Paco Nathan

At IMVU, the Data team often works with large data sets which represent customer interactions across social graphs. This kind of data cannot be explored very well using relational databases, such as MySQL. Instead, we rely on other tools for analyzing social graphs.

One tool which we really like is called Gephi, an open source platform for visualization and analysis of graphs. Our team uses Gephi as an interactive UI for exploring relationships and patterns in graphs. We also use it for calculating social graph metrics and other statistics which help us to characterize large graphs. One favorite feature in Gephi is how readily it can be extended by writing plugins.

We’ve received much appreciated help and guidance from the Gephi developer community. Now we are glad to be able to host the first Gephi workshop about developing plugins. Bring your laptop, install the required software from sources provided, and work with mentors to learn how to write your own Gephi plugins! And get to network with other people in the Gephi developer community.

To RSVP, please click through the Meetup.com link below.

Title: Gephi Plugins Developers Workshop

Organizers: Mathieu Bastian and Paco Nathan

Date and Time: Thursday, October 6, 2011, 7:00-9:30pm

Location: IMVU, Inc.. 100 W. Evelyn Ave #110, Mountain View, CA

Cost: No cost

Food and drinks: Food, drinks, and wifi will be provided

Agenda: This is the first Gephi workshop dedicated to Gephi Plugins developers! Come and learn how to write your first Gephi plugins and ask questions.

The workshop will start with a 1-hour presentation of Gephi’s architecture and the different types of plugins that can be written with examples. Details about Gephi’s API, code examples and best practices will be presented in an interactive “live coding” way. The Gephi Toolkit will also be covered in details. The second part of the workshop will be dedicated to help individuals with their projects and answer questions. Depending on the audience’s needs we can discuss plugin design, how to code UI, databases, toolkit, data sources or any plug-in idea you have.

Gephi is modular software and can be extended with plugins. Plugins can add new features like layout, filters, metrics, data sources, etc. or modify existing features. Gephi is written in Java so anything that can be used in Java can be packaged as a Gephi plugin! Visit the Plugins Portal on the wiki and follow the tutorials.

Link to event on meetup.com: Gephi Plugins Developers Workshop

Silicon Valley Open Source Business Intelligence Meetup at IMVU

September 12, 2011 Leave a comment

By Tim Levine and Paco Nathan

At IMVU, our leaders are increasingly leveraging data to inform decisions. In response to this we have embarked on leveraging an off-the-shelf business intelligence solution to enable ready access to data.  Considering our needs, present and future, and the personality of the company we have deployed a commercial version of Pentaho.  Using this suite, we have been able to provide fresher data (daily refresh versus monthly), more interactive data (using ROLAP cubes), and newly processed data (30-day rolling of usage).
Our success has gotten to the point where we would like to contribute and draw from the larger community and so we will be hosting  a meetup of those interested in open source business intelligence.  To RSVP, please click through the meetup.com link below.

Title: 

Silicon Valley Open Source Business Intelligence Meetup

Organizer: 

Tim Levine and Paco Nathan

Date and Time:

Thursday, September 15, 2011
7:30-9:30pm

Location:

IMVU Corporate Headquarters
100 W. Evelyn Ave #110 , Mountain View, CA

Cost:
No cost

Food and drink:
Food and drinks will be provided

Agenda:

1 Food and networking
2 Introductions around the room, find burning questions
3 Tim Levine and Paco Nathan will present on how Pentaho fits into our technology stack to solve problems involving little and Big Data alike. Specifically how we use Pentaho to fetch, munge, and report data as well as bringing in R and hadoop in the AWS cloud to provide otherwise impossible insights.
4 Discussion of burning questions
5 Solicit ideas for next meeting, decide on frequency
6 Announcements

Link to event on meetup.com
 Silicon Valley Open Source Business Intelligence Meetup


Categories: Engineering

Partner Integrations

By Rusty Larner, IMVU

IMVU partners with several companies to provide extra services on our website that are not our core competencies, especially when it makes more sense to use others that already implement the service.  Each of these integration efforts is a new process, but we have found several practices that make them easier to accomplish and more likely to succeed.

Doing a technical review during the initial partner evaluation sounds obvious, but is extremely critical.  Not just at the marketing level, but having our engineers talk with theirs about the expectations we have on performance (connections per second, etc.) as well as the actual integration work helps reduce confusion before the effort starts.  Several times we have surprised our partners with either the usage we expect for a new feature, or how we want to integrate their services into our entire development cycle.  Sometimes the technical review at this stage determines that it won’t really be reasonable to do the integration – helping us to ‘fail fast’ instead of waiting until attempting the integration.

Continuing this communication on through the initial implementation effort is also important. The best way to accomplish this is having both sets of engineers co-located during the initial integration effort.  We have found that having the engineers from our partner work directly with our engineers helps cut down on the communication delays.  There are always questions during any integration, about API documentation or error messages or just adding logging to determine why a given request isn’t working – having engineers in the same room helps immensely.  The timing on this is also critical – we do not want to have someone fly out before we’ve started the integration work, but we also do not want to go too far down the rabbit hole before having an expert there to help.  Having flexible partners helps – during one integration process we ran into difficultly far earlier than expected – our east-coast partner flew out an engineer the next day to work with our team, and he stayed until we determined the problem.

Another part is being able to integrate tests against our external partners into our continuous integration process.  We have several solutions here – first, we implement a full ‘fake’ version of the partner’s API that we use during standard commits.  This allows our large suite of unit and functionality tests to not be dependent on external connections that would make them slower and sometimes intermittent.  We also have a separate set of tests that we run against the actual API.  These include tests that confirm our fake APIs work the same as the actual APIs.  These external tests are run every couple of hours, so we can detect if the partner’s APIs change.  The combination of these tests allow us to feel confident with our continuous integration efforts, even when doing things like changing our Music Store (which talks to our music partner).

Metrics and graphing are also a very important part of our process.  They let us quickly determine if there is a problem with the connection to our partner.  The faster you know there is a problem, the faster you can respond!  (A couple of times during my first years at IMVU, we only found out about problems through our forums – not the way to respond quickly to problems!  Plus, it becomes far harder to fix the problem when you don’t know exactly when it started…)  For these integrations, we keep several levels of metrics.  First, each point where we talk to an external customer gets metrics – for example, each music purchase attempt gets counted from our side.  But we also ask our partners to expose a service to report their own metrics to us.  This gives us better pointers to problems when they do occur – i.e. if there is a problem with the network connections between our servers and our partners’ servers.

The metrics and graphing are only a part of the continuing maintenance that we put in place.  Another major part is communication paths.  We create an internal mailing list that we ask our partners to send all email to – so when we internally change responsible individuals we don’t have to have all of our partners change their address books.  We also ask our partners to let us know as soon as possible whenever they detect a problem, then follow up later on to let us know how they solved the problem.  Where available, we’ve asked our partners to include our contact info for some of their monitoring alerts.  Our operations staff also has contact details and a call escalation path for our partner and solution instructions so they know how to handle any alerts that are triggered by our monitoring.

So, in review – for partners of IMVU, here is the checklist we (currently) follow:

* Initial engineering technical review before any contract is signed

* Co-located engineering support early in the integration schedule (usually within the first week or two)

* Regular testing calls being made to the partner’s APIs (including test purchases, etc.)

* Expose a web service containing real-time metrics important for the integration

* Call escalation path for operations if issues arise

* Continuous communication about scheduled maintenance, unplanned issues, or upcoming upgrades.

Buildbot and Intermittent Tests

January 19, 2011 13 comments

By Jon Watte, Technical Director at IMVU, Inc.

At IMVU, we do continuous deployment. You may have read previous blog entries describing how we ship code to production 50 times a day. Today, I’d like to talk a little bit about what we’ve learned as IMVU has grown as a company and engineering organization, and what we’re doing to support this growth.

The workflow for anyone who wants to check code or assets into our website looks something like this:



As part of developing a feature increment, we also build the automated tests that prove that the feature works as intended. Over time, as we develop more and more features, we also accumulate more and more tests. Each time we check in code or assets, that check-in is run through all of the tests that have been developed over the years — at last count, this is over 40,000 tests, with seven digits’ worth of individual test assertions (the exact number varied depending on how you count).

The great thing about this is that, if your feature accidentally breaks some part of the code that you didn’t think or know about, the buildbot will tell you. You have the ability to run any test you want in your local sandbox before you check in, and you’ll generally run the tests that you think may be relevant, but in a code base the size and complexity of all of IMVU, you can’t possibly know or run everything.We have an informal goal for our buildbot to run all the tests in 12 minutes or less. This means that we can build 5 times an hour. Some contributors show up for work 8 am (Eastern time); others work until late in the night (Pacific time), so there’s opportunity for well over 50 builds per day.

However, not all builds result in success — the whole reason for having a buildbot is to catch the mistakes that inevitably slip in. When a submitted change causes one or more tests to fail, we call that “making buildbot red,” and that change is not allowed out onto the live website. Instead, you as committer will back out those changes (we have scripts that make this easy and painless), and investigate the failure (buildbot saves the appropriate log files and other information) to fix it before you re-submit.

Unfortunately, tests, like code, are written by humans. This means that tests will have bugs, Some bugs will immediately be obvious — but others will sit silently, working 99% of the time (or, more likely, 99.999% of the time). Then, once in a blue moon, some timing issue or bad assumption will surface, and the test will spuriously fail. What’s worse is that, most often, running the test again will likely not fail, so tracking down the failure requires a fair amount of diligence and effort. When this happens in your local sandbox, it’s not so bad, but if one of these intermittent test failures happens on buildbot, it’s a bit of a problem, because nobody can deploy their code to production while buildbot is red. At a minimum, the test failure has to be investigated to the point that someone understands that it’s an intermittent failure, rather than caused by the code that you committed.

Common causes of intermittent tests may be things that are hard to get right for humans (like systems with sophisticated caching strategies), or things that are hard or expensive to control for in a testing environment (assuming some pre-existing data, or absence of data, in particular databases, for example), or just because of the underlying platform. For example, we test user visible web pages using a system that drives web browsers, and then we assert things about both the state of the web page within the browser, the requests that the browser has made, and even screen shots taken of certain pages. Unfortunately, web browsers will sometimes decide to crash for no discernible reason. There are other components that are not 100% reliable, either, including tests that have to integrate with external services over the internet, although I’ll refrain from going too deep into that particular discussion in this post.

We have been trying to flush out intermittent tests by running tests all night, every night. Because no code changes between the runs, any test that goes red during this nightly grind is intermittent. We will look at the status in the morning, and any test that went red is distributed to the team that seems likely to know the most about the particular failure, for fixing.

Still, intermittent tests have been one of the most annoying thorns in the side of this entire process. As we scale the organization, buildbot will be busy most of the day, generally building three to five different change sets as part of the same build. Any of those committers will be prevented from deploying their changes to production until at least the next build has completed. This can add half an hour to the build-tests-deploy cycle, which hashes the mellow of continuous deployment driven development goodness!

To illustrate, the worst week of our internal buildbot for 2010 looked like this:

Clearly, in an environment that values quick iteration and agile development, if you find yourself in that position, you want to do something about it. So we did!

The key insight is that an intermittent test only fails some of the time, but a real failure caused by new code will fail all of the time. The second insight was that we could gather the specific outcome of each test within each test run, to track metrics on tests over time. We started by just adding these measurements and tracking tests over time. Once we learned that we could predict an intermittent test based on two successive test runs, and that we get clear green-or-red status out of each test run, we re-structured the way that we run tests.

Because we have 40,000 tests to run, allocated over several thousand test implementation files, we split the entire batch of tests across a large number of virtual machines, hosted on a few really beefy server-class machines (raid-6 SSD, 48 GB RAM, dual-CPU six-core-Xeon type machines). Our first implementation spreads the build files across different machines, and uses measurements of how long each test file is to statically sort the files into reasonably equal run lengths. This worked for many years, but now we have better information!

We updated our test runners to first discover all tests that need to run, and put those into a database of available tests. Each build slave will then request one test at a time from this database (which we call the Test Distributor), run the test, and report back. We start by doling out the slowest tests, and then dole out faster and faster tests, to achieve almost 100% perfect balancing in runtime. Even better, though — we know whether a test fails, while the build is still running. Thus, we made the distributor re-distribute a test that is reported as red to some other build slave. If the test comes back as green, we then classify the test as intermittent, and do not make that one red test get in the way of declaring the change set we’re testing as successful. In addition, we remove the build slave that reported the intermittent from the pool of available builders, so that we can analyze the cause of the intermittency whenever convenient.

Does this work?
Oh, yes! Here is a screen shot from the buildbot availability status this week. You will see that the buildbot caught some real, red check-ins, but intermittent tests are not getting in the way anymore. Thursday, we had an entire day with 100% green availability!

Pretty nifty, huh? I’m glad that I work in an environment as dynamic as IMVU that really cares about keeping our productivity up. The more we can remove annoying roadblocks from the path between developers doing hard work, and them seeing their hard work pay off, the better we will do.

Categories: Engineering

REST and Games Don’t Mix

December 18, 2010 4 comments

By Jon Watte, Technical Director at IMVU, Inc.

In web development, the REST (REpresentational State Transfer) approach to services often works well to build scalable, robust systems. Unfortunately, the principles behind REST collide with some necessary requirements of games, and the end result is that many problems faced by multiplayer games cannot be solved using REST techniques.

One pillar of REST design is that all necessary data for servicing a request is included in the request, and a client imposes no load on a server when not within the context of a request. For example, a web server serves a file to you when you pass it a URL; how to serve that file may be governed by cookies sent by your web browser; but once the file is served, the server conceptually needs to do nothing else to satisfy the client. If you are building, say, a photo sharing site, then this makes a lot of sense. A request may upload a new photo for some particular user, identified through URL tokens or cookies. Another request may list photos that match some particular criteria. A third request may remove or otherwise edit metadata surrounding the photos that are uploaded, by uploading some JSON or XML document to go with the photo. All in all, each request is isolated, and the server needs to do nothing (other than store the photos on disk) between requests.

Compare to a game of chess: Each move in a game of chess isn’t a complete new document (or chess board), but instead a delta over some previous state. Moves are checked for validity before they are accepted, by being interpreted in the context of the board. You could, conceivably, remember the old board state, and have a player upload a new board state that represents the state after his move, and derive what his move was by running a diff to the board. Unfortunately, even that is not enough to properly implement the rules of chess, because certain moves (castling, en passant pawn capture) depend on historical state of the board. To work around this, we can include a full set of historical state with the board, as well as the actual board state, and still implement a chess game using REST techniques.

All that remains, then, is to figure out how your opponent learns about a move having taken place, and knows how to download the new board state to display it. This is something that REST techniques simply do not solve, because the infrastructure of REST depends on the direction of requests, the completeness of state transfer, and the cacheability of data. To get rid of cache problems in the chess game above, you’d have to also keep a “next move id” counter as part of the request URL, so that the state of the board, including history, is addressed by a separate URL for each new move.

Another problem with the implementation of the chess game as a REST service is that it transfers the entire game state, including the board, for each move. This uses more network bandwidth than necessary. For something simple, like a chess board, this doesn’t really matter that much. Broadband makes bandwidth cheap, and even a cell phone will transfer 50 kilobytes of data in less than a second these days. However, for more complex games, like a 3D role-playing game or a real-time military simulation game, the amount of data transferred per frame, and the rate at which frames need to be transferred, makes REST bandwidth requirements a real problem.

However, that’s not the worst part. Some games have rules that require secrecy. For example, in a game of Texas Hold’em poker, some state is known, such as what the community cards are and what the amounts of the pot and each player’s chip stack is. Other state, though, is hidden. I do not know what cards you have in your hand, and to avoid cheating, I should have no way of finding this out until and unless you reveal your hand at the end of the game. What cards each of us have in our hands is state that is shared between only us and the server. This means that a player cannot upload a “new game state” that includes the cards in each hand. Yet, to determine the winner, all the state is required. Additionally, the deck that cards are being dealt from is secret state that only the server knows. Thus, at a minimum, to apply REST to our poker game, we’d have to decompose the game into many separate states (each hand, the deck, the pot, each player’s chip stack, etc), and we’d have to enforce separate access control to each of these values. More troubling is the fact that the cards in a player’s hand are changed by the dealer, which means it’s an action someone else takes that affects what my state is. Again, REST does not support server to client push, nor does it generally support “active” state, so implementing even a simple game like Poker requires us to break the REST mold.

The World Wide Web is a large instance of largely REST-ful services. Various work-arounds exist to get around the limitations of the REST architecture, such as long-poll XMLHttpRequests from Javascript, chunked-encoding IFRAME script execution, etc. The main reason to use the Flash player or other browser plug-ins is to break the REST mold. Once games expand outside the simple rules of something like poker or chess, the contortions necessary to work around the limitations of REST become byzantine enough that the benefits of running within a browser may be outweighed by the costs, both in development, deployment and operation. You simply can’t describe a typical server instance of a game like World of Warcraft using a REST API and expect any normal consumer network or computer hardware to keep up.

Why does this matter? The World Wide Web is built on REST. New web standards that allow us to do more and more with the internet browser are driven by the same people who developed REST. As a case in point, the HTML5 standards umbrella does not address streaming media. You can “play” a sound or video URL, and the implementation of those URLs may optimize by not delivering the data until you need it, but the only way to “record” sound or video is to record into a static piece of data, which can later be uploaded. This means that voice or video chat cannot be done in a web browser without relying on third party plug-ins, such as the Flash player. Additionally, HTML5 describes a mechanism called Web Sockets to try to work around the server push limitations of REST, but that technology has run into significant implementation problems, and will not be part of the round of web browsers releasing in the first half of 2011.

It’s clear that people want to play games and do other interactive, bidirectional activities with their network-connected computers. The success of networked games and the success of even non-interactive games like FarmVille or Backyard Monsters leave this without a doubt. It would be great if those who understand interactive entertainment (such as game makers) could make friends with those who are currently driving the web, and find common ground to allow for a bidirectional, interactive, real-time world wide web of human communications. HTML6, anyone? :-)

Categories: Engineering Tags: ,
Follow

Get every new post delivered to your Inbox.

Join 38 other followers

%d bloggers like this: