Monitoring Delayed Replication, with a focus on MySQL
By Stuart Cianos, CISSP
Many (many) years ago, I had the opportunity to work with a variety of organizations targeting harm reduction for at risk populations in the San Francisco Bay Area. One of the great lessons learned from that experience is that the future is mostly indeterminate. In 1997-1998, severe storms ravaged our region and very few of us dealing with relief services (myself included) anticipated the total impact. It was only by sheer luck and coincidence that there was available shelter space to house an entire community of displaced families after automatic flood gates failed to open on a large stream following years of drought. We didn’t think it would happen to us until it was too late to be better prepared. If I win the lottery tomorrow, my future might change in ways that were unexpected. Conversely, if I were to accidentally delete all of the files (“rm -rf /”) on a mission critical computer system, my future might also change in unexpected ways.
I no longer worry about the future, I prepare for it. What follows is a powerful technique which can help mitigate risks around unanticipated data loss, even when that data loss is due to human error.
IMVU utilizes MySQL on a daily basis to drive our site and customer experience forward. As IMVU has grown and scaled, so have the number of database hosts and the importance of data maintained in our environment. With over 100,000,000 registered user accounts and hundreds of thousands of concurrent transactions at any given time, it’s of paramount importance that our practices embody the CIA triad: confidentiality, integrity and accessibility of information. Implementing delayed replication at the data tier can enhance business continuity and improve posture when recovering from events impacting data integrity and availability.
IMVU uses MogileFS to store user generated binary content (images, content creator products, etc.). MogileFS is a user-space fault tolerant, distributed file system. MogileFS stores filesystem metadata in a MySQL database (PostgreSQL is also supported, but not used in our environment). The system is designed to ensure that no single failure will result in data loss or service degradation.
In June of 2012, human error resulted in the MogileFS metadata table (“file”) being dropped from the primary database instance rather than a standby under maintenance. Without any way to relate requests for files back to the content on disk, MogileFS could no longer serve any requests resulting in customers being unable to access some major features of the IMVU social network and rich 3D experience. There was a slave database instance lagging behind its master due to write heavy operations, which allowed us to use it as a recovery source. We were very fortunate that this slave was lagging behind unintentionally; had it not been, the impact of this incident would have been much more severe.
Delayed replication is being implemented across IMVU’s cluster of database instances in order to protect against these types of accidental operations moving forward.
Data Availability and the CIA Triad
The CIA triad is a set of common attributes which should be applied to information management and security. Information on the CIA triad is available abundantly online, but a brief summary is below:
- Confidentiality: Prevention of information disclosure to unauthorized entities.
- Integrity: Prevention and detection of the unauthorized modification of data.
- Availability: Assurance that the information will be available when needed.
One question that comes up is what delayed replication has to do with information security… and my response is “a lot!” Delayed replication can help mitigate risks around data availability as well as data integrity, and the impact an incident will have on the bottom line.
What is Database Replication?
Database replication allows transactions on one database host to be replicated to one or more separate instances. There are many replication strategies, but the examples discussed will focus on a simple Master-Slave configuration throughout.
When the transaction “INSERT INTO foo VALUES (‘hello world’)” is committed to the master database instance, it is replicated on the slave and committed there as well. By having a slave or standby host available and relatively up to date, it is possible to recover very quickly from various hardware and software failures on the master by replacing it with the slave/standby host.
What is Delayed Replication?
Delayed replication is the technique of inserting a delay line into a database’s replication mechanism. Transactions will each be held in a first-in, first-out queue for two hours prior to committal on the slave host being delayed. For the purposes of this example, “mytable” is a mission critical set of data required for the application to function and contains 100 gigabytes of data. If the table “mytable” becomes unavailable for whatever reason (for example, a drop table statement was executed accidentally/unintentionally on the primary and it immediately replicated and got executed on the standby), customers will not have access to the service. In this example, fictitious business parameters will be defined to help us quantify possible benefits:
- Recovery time objective (RTO): 8 hours
- Recovery point objective (RPO): 4 hours
- Single loss expectancy is calculated using the standard formula: SLE=(Asset Value) * (Exposure Factor).
- Our exposure factor (a subjective percentage of functionality or impact): 100%, since no customers will be able to access the system.
- Timespan used to calculate the single loss expectancies in terms of hours is the time from failure to service availability.
- Value of customer transactions against “mytable”, per hour: $10,000 USD
- Single loss expectancy based on recovery time objective of 8 hours: (10,000 * 8) * (1.00) = $80,000.
- Total time to rebuild “mytable” from a backup: 20 hours
- Total time to fail over a database from master to standby/slave: 5 minutes
Given the above, a single loss expectancy can be calculated for “mytable” assuming a worst case scenario of ~20 hours. The realistic single loss expectancy (for the table, not the entire database server asset as a whole) may be calculated: (10,000 * 20) * (1.00) = $200,000. This is $120,000 loss above and beyond what management would have expected. Worse, the recovery time objective and recovery point objectives cannot be consistently met.
Adding a two hour delay line (well within the recovery point objective) can help meet business requirements and reduce single loss expectancy.
The above transaction seems pretty innocuous, and having a delay line enabled doesn’t make sense for some use cases such as read slaves that are expected to be relatively consistent with their master. So… why bother with a delay line at all?
The benefits become clear when looking at another example. What if the INSERT statement above is changed to DELETE FROM mytable (or even DROP TABLE mytable) due to a bug in code, human error or malicious intent? In an environment with immediate committal on all slaves, one must:
1. Hold their breath and:
◦ Hope that a slave can be stopped before the transaction propagates, dump the table, and restore to the impacted master (likely 8+ hours).
◦ Hope that a slave can be stopped before the transaction propagates and fail over to it.
2. Restore from backup (20 hours).
Hope must not be factored into sound business practices; option one is off the table. With a delay line enabled procedures may be implemented to create a recovery process:
The recovery process and benefits when using delayed replication can be documented, made repeatable and proven. So long as the problem is caught before the offending transaction makes it through the delay line, the recovery process is:
- T+0 minutes: Service degraded. Investigation into cause begins.
- T+5 minutes: Cause determined. Stop all replication to the delayed slave (on MySQL, don’t forget to stop the slave IO thread in addition to the SQL thread).
- T+35 minutes: Remove the pending transaction in the relay log, or, reset the relay log based on best judgment and the importance of data consistency.
- T+40 minutes: Promote the delayed standby to master.
- T+50 minutes: Validate that the system is functional.
- Service recovery declared.
- Recover the demoted master to a consistent state.
The total time to recovery during the above incident was 50 minutes, with an approximate loss of (10,000 * 0.833) * (1.00) = $8,333. The recovery time objective and recovery point objective have been met.
Delayed Replication and Seconds Behind Master:
MySQL 5.6 natively includes delayed replication as a new feature, and will be configurable via the CHANGE MASTER TO statement. MySQL 5.6 is not a generally available release, however, so is not deployed in production for most environments. Organizations running versions 5.1 and 5.5 can effectively implement delayed replication using the Percona-Toolkit utilities (formerly Maatkit) from Percona Software. The Percona Toolkit is an open source collection of helpful tools focused towards management of the MySQL database server. At IMVU, we have successfully implemented delayed replication using the percona-toolkit tools (more specifically, pt-slave-delay).
One side effect of delayed MySQL replication via an external process is that it is no longer possible to fully determine slave state using SHOW SLAVE STATUS. As the slave’s SQL_THREAD is periodically stopped and started, “Seconds behind master” will be NULL most of the time.
In order to mitigate the lack of information from MySQL’s internal replication state, a second utility is available through the percona-toolkit: pt-heartbeat. Pt-heartbeat writes a timestamped entry to a table periodically, creating a heartbeat.
The heartbeat transaction will be replicated, and delayed. By comparing the slave’s current time with the timestamp on the last committed heartbeat, we can approximate with fair certainty the number of seconds the slave has been delayed (or is lagging behind the master).
Monitoring the Solution, effectively:
Monitoring delayed replication becomes more complex than simply watching for heartbeats and comparing timestamps:
- Usually, the slave will be stopped during backups unless something like Percona Xtrabackup or Enterprise Backup is being used. The replication delay will increase for the duration of the backup, and then contract. It should be noted that IMVU is OK with taking backups from a two hour old data source given our business requirements. Always check with your organization’s business requirements; don’t assume!
- The replication delay will be affected by MySQL’s replication characteristics, including the fact that it is single threaded. High transaction volume can cause replication delays to increase and subsequently contract. The replication delay is variable and will not be consistent on a heavily loaded database.
- If the pt-heartbeat/pt-slave-delay fails to maintain the delay line and monitoring, the team must always be made aware.
- If the replication fails due to IO or SQL thread errors, the team must always be made aware.
- If the database replication or monitoring system fails for any other reason, the team must be made aware. Contingencies must be in place for failures in MySQL, as well as in code which monitors delayed replication.
Rather than looking at the difference in timestamps at a moment in time, IMVU took the approach of calculating the slope of the delay across a timespan. With the additional information, descriptive statistics are calculated:
- The slope of the delay line’s values over a time period: Allows determination of whether or not the delay line is stable, moving towards the desired value, or away from it and how quickly.
- The Y-intercept of the slope: Treated as the approximate current number of seconds behind, functionally equivalent to MySQL’s Seconds Behind Master.
- Correlation Coefficient: Allows determination of how well the current values on the trend line correlate as a series. If the correlation is 0, there is very little correlation meaning that the values are highly distributed over the given timespan. For values -1 < p < 0, there is correlation between values and the line is trending downward. For values 0 < p < 1, there is correlation between values and the delay line is trending upwards.
Additional information is helpful as well, and can be calculated based on data gathered from MySQL:
- The time when backups were kicked off.
- The number of seconds backups have been running (if applicable).
- The number of seconds the trend line has been in a degraded state due to: backups running, recovering after a backup, and trending towards recovery but not in a backup state.
If backups are being taken from the delayed standby/slave, the delay line will increase dramatically when replication is paused.
To prevent false alarms from paging our operations team, monitoring sensitivity must be decreased during backup windows. Once the backup finishes, it will take time for MySQL’s replication thread to catch up to the desired delay line. Again, sensitivity must be reduced during the recovery window.
If the delay line briefly dips below the configured value for 20 seconds but recovers 30 seconds later, the monitoring system should not page as this would be considered a false alarm at 3:00 in the morning. This is not a real time computing system/database platform, so there is no guarantee that the delay line will maintain at exactly the configured value… only a promise that it will maintain the delay line as close to the target as possible. On average, the delay line varies by up to 10 seconds when not under load and not under impact.
In a mission critical system, any type of exception to standard monitoring should have clearly defined parameters and values around the specific exception trigger, and the limits of the exception. In IMVU’s case, the limits around the exception are based on time. Every exception which results in reduced sensitivity has a corresponding limit to how long the service can remain in that state. If the limit is exceeded, the service goes into an alarm state for Nagios to process. Therefore, no external process may lessen sensitivity of the monitoring beyond configurable limits.
Delayed replication and monitoring in action!
Here is a real example of delayed replication running in a production environment (currently being staged across our cluster). It should be noted that additional exceptions and limits are in place for our internal processing. Our implementation will accept the following configurable limits and ranges:
–max-seconds-behind=<seconds>: Max seconds behind for local heartbeat (there needs to be a frequent heartbeat detected from the local host itself as well as the master – or there’s a problem!)
–max-relay-behind=<seconds>: Max seconds for relay log last update
–max-master-behind=<seconds>: Max seconds for master log last update
–max-backup-behind=<seconds>: Max seconds a backup may run
–max-deshard-behind=<seconds>: Max seconds a deshard job may run
–max-last-seen=<seconds>: Max seconds since worker last seen
–min-delay-time=<seconds>: Minimum allowed replication transaction delay time
–max-delay-time=<seconds>: Maximum allowed replication transaction delay time
–max-recovery-time=<seconds>: Seconds after backup completes to allow variances
–max-trend-time=<seconds>: Maximum time to allow intercept to trend to recovery
The actual limits passed to our Nagios plugin during our staged roll out are currently:
A database instance in good health with a 2 hour replication delay:
A database instance currently running an active backup. If the backup was not running, this service would be in an alarm/critical state, as the replication delay line has grown to 16,408 seconds (far above our window limit of 9,000 seconds, and the target goal of 7,200 seconds):
Once a backup finishes, the recovery window is entered. If recovery back to the desired delay line does not occur within the recovery window, the service will go critical:
The complete list of delayed replication conditions trapped at IMVU is listed below, along with the service state (i.e. WARNING, CRITICAL):
- WARNING: Replication delay intercept recovering – below minimum with slope: Replication delay is below the minimum desired value, but is moving towards recovery.
- WARNING: Replication delay intercept recovering – above maximum with slope: Replication delay is above the maximum desired value, but is moving towards recovery.
- WARNING: Replication delay intercept is out of bounds, currently in recovery window for backup which completed <seconds> seconds ago: Replication delay is above/below the desired range, and may not be moving towards recovery. State is held in warning to allow recovery post backup until the recovery window expires.
- WARNING: Replication delay intercept is out of bounds, currently in recovery window for deshard which completed <seconds> seconds ago: Replication delay is above/below the desired range, and may not be moving towards recovery. State is held in warning to allow recovery post deshard until the recovery window expires.
- WARNING: Replication delay intercept recovering – real value within range: Replication delay intercept is above or below the desired range, but the real value has already recovered. Replication delay intercept is trending towards recovery.
- CRITICAL: Replication delay intercept is out of bounds, and slope does not indicate recovery: The replication delay is outside of the desired range, and is not moving towards recovery or is not moving towards recovery at the minimum desired rate (slope).
- CRITICAL: Replication delay intercept recovery exceeded window: The replication delay intercept was recovering but exceeded the maximum recovery window time limit. Replication may be lagging or is not catching up in sufficient time.
- WARNING: Active backup/deshard: An active backup or deshard job is running.
Naturally, other parameters necessary for healthy replication should continue to be checked as well. Replication, including delayed replication, will always fail if the replication state is stopped due to an error.
No solution is completely perfect or without trade-offs, and delayed replication is no exception. Some considerations that should be taken into account prior to implementation:
- If failing over to a delayed standby/slave, recovery time can be impacted. For instance, if auto-incrementing columns are used in MySQL and statement based or hybrid replication is being used then the standby must be caught up or data inconsistencies are likely. Make sure that the additional time to catch the host up is taken into account when estimating service restoration time.
- The longer replication is delayed, the longer it will take for an instance to catch up.
- If the application being targeted uses the delayed standby as a read slave, it is important to verify that the delay line doesn’t impact functionality or create edge/corner cases (this is not the case at IMVU, but is worth mentioning as it is not uncommon). Ideally, this should be validated by understanding the application’s logic or (even better) its code. A great example: A user is disabled in an application that actively queries read slaves for user information in a table. If the read slave is two hours behind, the user might be able to log in until the slaves commit the transaction which disabled the account.
- Delayed replication will not mitigate data loss unless the impacting event is caught within the delay window, and action is taken prior to the statements in question being executed. Clearly document that delayed replication only serves as a component of a comprehensive ecosystem.
Delayed replication is most useful to recover from errors made by administrative users, particularly since statements involving most DDL cannot be wrapped in a transaction. It is not, however, a magic bullet; it only provides protection if the offending transactions are identified and removed before making it through the delay line. The monitoring strategies for delayed replication are different (and decoupled) from standard MySQL replication monitoring, and the decoupled processes themselves must be monitored as well.