Musings on IT, data management, whitewater rafting, and backpacking

Monday, October 25, 2010

A different angle on RTO, RPO, Backups and Restores

When people design IT backup and restore processes, they typically focus on Recovery Point Objective (RPO, how much recent data you can afford to lose, measured in time) and Recovery Time Objective (RTO, how long you can afford to take to restore service), with the goal of reducing both as much as the organization can afford.

Both objectives rest on two hidden assumptions:
  • You instantly know when you have a problem.
  • You instantly initiate recovery.
Those assumptions are not always true!

What are the implications of getting these assumptions wrong?

Take this scenario as an example:

You have an RPO of 8 hours, and an RTO of 8 hours.  So you carefully design a system that uses something like rsync to copy data off-site every 8 hours, over an expensive WAN link, to duplicate storage.  In case of catastrophic failure, you can redirect to the off-site storage while you rebuild your primary storage.  You've done some testing, and believe both your RPO and RTO can be met.

Later, you suffer silent, massive data corruption that continues for 10 hours before you recognize the problem.

You are in big trouble.  With replication running every 8 hours, at least one run happened during those 10 hours, so the corrupted data has already been copied off-site, and your RPO and RTO cannot be met.

Let's say your silent period is only 2 hours.  You still have about a 25% chance that your RPO and RTO cannot be met, because an off-site replication that runs every 8 hours has roughly a 2-in-8 chance of triggering during those two hours.
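
To put rough numbers on the two cases above, here is a minimal Python sketch.  It assumes replication fires on a fixed schedule and that corruption is equally likely to begin at any point in the cycle; under those assumptions the chance is simply the silent period divided by the replication interval, capped at 100%.  The function name and parameters are made up for illustration.

```python
def chance_corruption_replicated(silent_hours: float, interval_hours: float) -> float:
    """Chance that at least one scheduled replication runs during the silent period.

    Assumes a fixed replication interval and corruption that starts at a
    uniformly random point within a replication cycle (a simplification).
    """
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    if silent_hours < 0:
        raise ValueError("silent_hours must not be negative")
    # A silent period as long as the interval always spans a replication run.
    return min(1.0, silent_hours / interval_hours)


for silent in (10, 2):
    chance = chance_corruption_replicated(silent, interval_hours=8)
    print(f"{silent} hours silent: {chance:.0%}")
# 10 hours silent: 100%
# 2 hours silent: 25%
```

The same arithmetic applies if you count the hours until recovery is actually initiated, rather than only the hours until the problem is noticed.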

Let's say you instantly recognize your data corruption, but you can't initiate recovery for several hours.  Same problems.

How could you not notice data corruption, or not initiate recovery, in time?  Here are some examples:
  • Your system doesn't log enough information to warn of certain kinds of data corruption.
  • You aren't watching your system logs in real time for every possible type of data corruption.
  • You are watching your logs in real time, but don't recognize some particularly obscure error messages as data corruption.
  • You watch your logs 9 am to 5 pm Monday through Friday, but your off-site replication runs 24 hours per day.
  • You watch your logs 24 hours per day, but you can only initiate recovery 9 am to 5 pm Monday through Friday.
  • You watch your logs 24 hours per day, you can initiate recovery 24 hours per day -- but today several staffers are sick and nobody's available to watch logs or initiate recovery.
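
To get a feel for how large these gaps can be, here is a small Python sketch that finds the longest stretch in a repeating week when nobody is watching.  The schedule format (weekday, start hour, end hour) and the function name are assumptions made for illustration, and the windows are assumed not to overlap.

```python
HOURS_PER_WEEK = 7 * 24


def longest_unwatched_gap(windows):
    """Longest gap, in hours, between monitoring windows in a repeating week.

    windows: list of (weekday, start_hour, end_hour), with weekday 0 = Monday.
    Assumes each window fits within one day and windows do not overlap.
    """
    if not windows:
        raise ValueError("need at least one monitoring window")
    # Convert each window to (start, end) measured in hours since Monday 00:00.
    spans = sorted((day * 24 + start, day * 24 + end) for day, start, end in windows)
    longest = 0
    for i, (start, _) in enumerate(spans):
        previous_end = spans[i - 1][1]  # for i == 0 this wraps to the last window
        gap = (start - previous_end) % HOURS_PER_WEEK
        longest = max(longest, gap)
    return longest


# 9 am to 5 pm, Monday through Friday
business_hours = [(day, 9, 17) for day in range(5)]
print(longest_unwatched_gap(business_hours))  # 64 (Friday 5 pm to Monday 9 am)
```

With an 8-hour replication cycle, a 64-hour unwatched stretch leaves room for eight replication runs before anyone looks at a log.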
So what's the point?

We need to plan for delayed corruption recognition and delayed recovery initiation as carefully as we plan for RPO and RTO, if we really want to meet our RPO and RTO.

We need a couple more TLAs:
  • CRO -- Corruption Recognition Objective:  The time period between the actual corruption or loss of data, and the recognition that recovery must be initiated.
  • RIO -- Recovery Initiation Objective:  The time period between recognition that recovery must be initiated, and the actual initiation of the recovery process.
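
As a rough illustration of how these two intervals stack on top of the recovery work itself, here is a minimal Python sketch.  The `Incident` class, its field names, and the timestamps are all hypothetical; the point is only that the time from corruption to restored service is the sum of the recognition delay, the initiation delay, and the recovery duration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one incident; all field names are illustrative."""
    corrupted_at: datetime         # when data was actually corrupted or lost
    recognized_at: datetime        # when it was recognized that recovery is needed
    recovery_started_at: datetime  # when the recovery process was actually initiated
    service_restored_at: datetime  # when the recovery process finished

    def recognition_delay(self) -> timedelta:
        """The interval a CRO would put a limit on."""
        return self.recognized_at - self.corrupted_at

    def initiation_delay(self) -> timedelta:
        """The interval an RIO would put a limit on."""
        return self.recovery_started_at - self.recognized_at

    def recovery_duration(self) -> timedelta:
        """Time spent on the recovery work itself."""
        return self.service_restored_at - self.recovery_started_at

    def total_impact(self) -> timedelta:
        """Time from the original corruption until service is restored."""
        return self.service_restored_at - self.corrupted_at


# Made-up example: corruption at 1 am, noticed at 9 am, recovery started at 11 am.
incident = Incident(
    corrupted_at=datetime(2010, 10, 25, 1, 0),
    recognized_at=datetime(2010, 10, 25, 9, 0),
    recovery_started_at=datetime(2010, 10, 25, 11, 0),
    service_restored_at=datetime(2010, 10, 25, 18, 0),
)
rto = timedelta(hours=8)
print(incident.recovery_duration() <= rto)  # True: the recovery work took 7 hours
print(incident.total_impact())              # 17:00:00
```

In this made-up case the recovery work alone fits inside an 8-hour RTO, yet the business was affected for 17 hours.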
So, how can we improve CRO and RIO to minimize impacts on RPO and RTO?

Stay tuned for a future blog post.
