Distributed Redis Transactions

By Eric Hohenstein, IMVU, Inc.

Preface

Four times a year, employees of IMVU (engineers and non-engineers) are encouraged to participate in what we like to call “Hack Week”. Hack week is when we are given the opportunity to work on anything that we want so long as there is the possibility that it could eventually achieve one or more IMVU business objectives. Within that constraint, we have total freedom. On the Monday following hack week we get to show off what we accomplished to the rest of the company. Some people choose to work on relatively small projects by themselves and others will organize and participate in larger projects with many people.

Our last hack week was November 28 through December 2, 2011. This is a description of the project that I worked on in conjunction with two other people, David Johnson and Peter Offenwanger, both software engineers at IMVU.

What we built

We built an ACID-compliant distributed key-value database implementing a small but critical subset of the redis command set and wire protocol:

GET
SET
DEL
WATCH
UNWATCH
MULTI
EXEC

The source is available on GitHub and has been released as open source under the MIT license.
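
Because dtm-redis implements the standard redis wire protocol for these commands, a stock redis client should be able to talk to it unmodified. As a rough illustration (the host, port, and the choice of the eredis client library are assumptions here, not something specified by the project), a basic session from the erlang shell might look like this:

%% Connect to a dtm-redis client session host (hypothetical address and port).
{ok, C} = eredis:start_link("localhost", 6379).
{ok, <<"OK">>} = eredis:q(C, ["SET", "foo", "bar"]).
{ok, <<"bar">>} = eredis:q(C, ["GET", "foo"]).
{ok, <<"1">>} = eredis:q(C, ["DEL", "foo"]).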

The system should theoretically be able to scale to arbitrary limits since there is no singular bottleneck in the design. Clients connect to a homogeneous pool of client session hosts. The client sessions hash keys to buckets, and each bucket maps to a separate erlang process and back-end redis storage instance. The number of possible buckets is limited only by the number of bits used to generate the hash (currently 2^128). The client sessions and the buckets communicate with a homogeneous pool of transaction monitors. Only the cluster startup/monitoring process can currently limit scalability, but this process could also be distributed if necessary, though that should only be needed for a frighteningly large cluster.
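
As a rough sketch of that key-to-bucket mapping (the md5 hash and the modulo step here are illustrative assumptions; the actual dtm-redis hashing may differ), the idea in erlang is simply:

%% Map a key to one of NumBuckets bucket processes using a 128-bit hash.
%% Each bucket index corresponds to an erlang process owning one redis instance.
bucket_for_key(Key, NumBuckets) ->
    <<Hash:128/unsigned>> = erlang:md5(Key),
    Hash rem NumBuckets.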

The motivation for this investigation was to explore the performance limits of redis beyond what a single redis host can offer. Any solution based on standard redis has an upper bound on throughput if consistency is not to be sacrificed. Some applications of key-value stores require very strong consistency (for instance, a separate index view of a data set maintained in parallel with the data set). Redis is remarkably performant, especially given that it operates with just a single thread. However, if an application requires strong data consistency and higher throughput than redis (or any other key-value data store) can offer, the options among existing solutions are very limited. The redis team is in the process of adding a clustering solution, but it will not support distributed transactions, only consistent hashing.

How we built it

The project started as an application I built to learn erlang shortly after the previous hack week in July. Before we started this last hack week, the project consisted of about 700 lines of code that implemented the entire supported command set, but it was only accessible from the erlang shell and had no persistence and no transaction monitor. By the end of this last hack week we had around 1200 lines of code and a working prototype. The main pieces missing from the prototype are crash recovery and process monitoring.

How we benchmarked it

We spent a fair amount of time during hack week trying to reuse a standard redis benchmarking tool. This gave us good visibility into redis and dtm-redis performance on a single host, with and without the appendfsync redis configuration. After hack week I spent some time building a custom benchmarking tool in C (included in the dtm-redis project) to work around some limitations of the standard redis benchmarking tool, which include:

  • no support for transactions
  • no latency measurements, only throughput
  • no support for multiple listening hosts

This custom tool accepts, on the command line, a comma-separated list of host:port pairs, the number of clients per host, the number of seconds to run the test, and the type of benchmark to perform (get_set or trans). It produces output like the following:

ehohenstein@localhost:~/dtm-bench$ ./dtm-bench host1:6379,host2:6379,host3:6379,host4:6379 100 60 get_set
creating 100 clients connecting to each of 4 hosts
starting clients
clients running for 60 seconds
stopping clients
total requests:  17604943
total latency:   23993.386276540
average latency: 0.001363
max latency:     0.217846477
elapsed time:    60.007721772
requests/sec:    293377


When testing transactions, a “request” represented a transaction involving five round trips: WATCH, GET, MULTI, SET, EXEC. The average latency was calculated across the entire transaction, while the max latency was calculated across each individual round trip. In at least one instance this resulted, counter-intuitively, in the measured max latency being lower than the average latency.
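
From a client's point of view, that per-request sequence is the usual redis optimistic-locking pattern (shown here with the eredis client purely as an illustration; the key and values are made up):

%% One benchmark "request": an optimistic transaction of five round trips.
%% C is a client connection as in the earlier example.
{ok, <<"OK">>}     = eredis:q(C, ["WATCH", "mykey"]),       %% begin watching the key
{ok, _Old}         = eredis:q(C, ["GET", "mykey"]),         %% read the current value
{ok, <<"OK">>}     = eredis:q(C, ["MULTI"]),                %% open the transaction
{ok, <<"QUEUED">>} = eredis:q(C, ["SET", "mykey", "new"]),  %% queue the write
{ok, _Result}      = eredis:q(C, ["EXEC"]).                 %% commit; aborted if mykey changed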

Most tests were performed with 100 clients per listening host. Experimentation showed that throughput grew with the number of clients up to roughly that level and flattened out beyond it.

The IMVU operations team was very helpful, setting up and configuring 4 hosts in the production cluster with redis installed for us to use for benchmarking. These hosts had dual six-core Xeon 2.66 GHz processors with hyper-threading enabled and spinning disks. When testing dtm-redis, 8, 16, or 32 buckets were configured, each with its own separate redis instance listening on 127.0.0.X.

Benchmark results

Outside of transactions, the system was easily able to beat standard redis throughput. Throughput appears to increase linearly with the number of hosts (at least up to the number we used in the benchmark tests). Latency of dtm-redis was mostly similar to that of standard redis, but max latency appears to increase exponentially with the number of hosts, which is troubling.

Transactions were tested in both synchronous I/O (durable) and non-synchronous I/O (non-durable) modes. The durable mode was not totally durable since, when performing this test, the redis configuration was not altered to use the appendfsync persistence option. It was nevertheless a fair approximation since, in this mode, dtm-redis was performing 2 synchronous writes to disk per transaction or write operation.
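
The cost of durable mode comes from forcing each binlog record to disk before acknowledging it. A minimal sketch of such a synchronous write in erlang (the function name, file name, and record format are assumptions, not the actual dtm-redis code):

%% Append one record to the binlog and force it to disk before returning.
%% {ok, Fd} = file:open("bucket.binlog", [append, raw, binary]).
log_durably(Fd, Record) ->
    ok = file:write(Fd, Record),
    ok = file:sync(Fd).    %% the fsync is what limits throughput on spinning disks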

In non-durable mode, dtm-redis throughput again beat standard redis and grew linearly with the number of hosts. This time, average latency of dtm-redis was roughly constant, and max latency again grew exponentially, exceeding that of redis by 20x with four hosts.

In durable mode, dtm-redis throughput was extremely low. When testing with just a single host and appendfsync persistence, both redis and dtm-redis allowed roughly 300 requests per second, which is roughly 3 orders of magnitude below the throughput without appendfsync. This is not surprising since the disk would be the bottleneck in both cases. In durable mode, dtm-redis again showed roughly linear growth in throughput with the number of hosts, though only up to a maximum of 1925 transactions per second with four hosts. Average latency and max latency were both roughly constant, though note that max latency reached more than 3 seconds in the 3-host test. This is especially alarming given that max latency was measured for individual round trips rather than for the transaction as a whole.

Durable mode performance could likely have been greatly increased using a host with an SSD. The IMVU operations team didn’t have one available for us to use for benchmarking, so we aren’t able to compare apples to apples. However, while working on the benchmark tool I noted that on my development system, which does have an SSD, I was able to get over 5000 transactions per second locally with only 1 CPU in a linux VM being shared by redis, dtm-redis, and the benchmark tool. Chances are that an SSD would increase durable mode performance by a factor of at least 10x.

One interesting thing was that in durable mode the throughput, average latency, and max latency all improved as the number of clients per host increased. This is almost certainly a side effect of an optimization chosen in dtm-redis to batch together multiple I/O operations waiting in the binlog process queue.
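
A rough sketch of that kind of batching in erlang (the message shapes and function name are invented for illustration; this is not the actual dtm-redis implementation) is to drain whatever write requests have accumulated in the binlog process mailbox and commit them with a single sync:

%% Collect every write currently queued in the mailbox, then write and fsync
%% them as one batch, replying to each waiting client afterwards.
flush_batch(Fd, Pending) ->
    receive
        {write, From, Record} ->
            flush_batch(Fd, [{From, Record} | Pending])
    after 0 ->
        ok = file:write(Fd, [R || {_From, R} <- lists:reverse(Pending)]),
        ok = file:sync(Fd),                              %% one disk sync for the whole batch
        [From ! {ok, written} || {From, _R} <- Pending],
        ok
    end.

The more clients there are, the more writes pile up while each sync is in flight and the larger the next batch becomes, which is consistent with throughput and latency both improving as clients were added.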

Another very interesting thing was that when running the non-transaction and non-durable transaction benchmarks, the hosts under test spent almost all of their CPU time running erlang code. Each of the 8 redis processes had roughly 20-25% CPU utilization while erlang had 800-1000%.

Conclusions

This prototype was built with only 1200 lines of code. If we were to build a fully functional and robust system, it’s probably fair to estimate that it would require at least 10x as much code. That being said, this wasn’t a giant project. I can’t say whether this would provide enough value to IMVU to build, but it is at least within reason to imagine us building it.

The use of erlang for this application is somewhat questionable. It was useful as a learning tool to build this prototype in erlang. I suspect, however, that erlang is not efficient enough to match the kind of performance typically expected from redis. The difference in average latency, the unacceptably high max latency of dtm-redis, and the CPU utilization of dtm-redis together suggest that erlang is spending too much time copying data from process to process. It’s hard to be sure without building another prototype, but I believe that C would be a better choice. Building this system in C would, however, require significantly more code, with a corresponding increase in development and maintenance costs.

Of course, in durable mode erlang performance is really not an important factor. In that case, the disk is the limiting factor in performance. The scalability of this system (as currently defined) is actually based solely on the ability to distribute disk I/O. Adding a large number of disks would likely allow a significant performance increase but might also reduce the mean time to failure.

After completing the prototype it was clear that the durable mode is likely not going to be useful due to its very low throughput and high latency. An alternate design that I would like to explore is one more or less equivalent to the default redis mode, in which the data set is periodically checkpointed. It should be possible to obtain the performance characteristics that dtm-redis showed in non-durable mode, while maintaining (check-pointed) consistency, by periodically synchronizing all bucket processes and having each of them initiate a save operation. This would cause occasional high latency, but the average latency should remain low.
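
As a very rough sketch of that idea (the coordinator, message shapes, and names are invented for illustration; none of this exists in the prototype), a checkpoint could quiesce every bucket process at a common point and have each one start a save on its back-end redis instance:

%% Hypothetical checkpoint coordinator for the proposed design.
checkpoint(Buckets) ->
    [B ! {quiesce, self()} || B <- Buckets],            %% pause all buckets
    [receive {quiesced, B} -> ok end || B <- Buckets],  %% wait until every bucket has paused
    [B ! begin_save || B <- Buckets],                   %% each bucket starts a save (e.g. BGSAVE)
    [B ! resume || B <- Buckets],                       %% resume; saves continue in the background
    ok.

Clients would only see elevated latency during the brief quiesce window, which matches the expectation of occasional high latency with a low average.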