RTO stands for Recovery Time Objective. The RTO for a given application specifies the timeframe within which the business expects IT to restore that application after a disaster at the data center. What this definition leaves out is how RTO applies to our DR exercises and planning, and that gap is the subject here: RTO and its impact on testing.
This type of conversation with a customer happens a lot:
Customer: Can you run a test on my critical application to validate that it can be recovered? We have made some changes to the architecture and want to make sure that your documentation accomplishes what is required in the event of a disaster.
DR Manager: We can do that. Would you like this done as a one-off, or included in our annual DR exercise planned for the fall?
Customer: I do not want to wait until the fall. I cannot be sure you can restore the application unless you validate it now that the changes are in place.
DR Manager: We will get it tested in the next couple of weeks.
After a couple of weeks:
Customer: Was the test successful? Were you able to recover our application in the colocation?
DR Manager: Yes. It was recovered and validated.
Customer: Did it fall within the RTO?
DR Manager: Yes, it did.
Customer: Good. Then we are confident that it can be recovered within the RTO in the event of a disaster.
Hopefully everyone sees the folly in the customer's assumption. An application that meets its RTO in a one-off exercise has not been shown to meet that RTO in a true disaster, unless it is one of only a handful of applications that need to be recovered.
The truth is that most organizations have anywhere from several dozen to hundreds of applications that will need recovering. The number that fit our “Critical Application” definition, and that must be recovered within the first 24 hours to several days, could itself be several dozen. Does testing one application by itself in a one-off give you, the DR Manager, a solid belief that you can have it restored in that timeframe during a true disaster? I would guess that your answer is “not necessarily.” It always was for me.
The business, and hence our customers, do not necessarily understand that recovering an application in a one-off exercise is not the same as a true disaster recovery. In a real disaster we do not have two or three servers to restore; we have several hundred or even thousands, and the Critical Applications alone may account for several hundred of them. Therefore, the only way to give the business a definitive answer on whether an application can be recovered within its RTO is to include it in the larger exercise where all Critical Applications are restored at once.
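To make the difference concrete, here is a minimal sketch in Python. Every number in it is a hypothetical assumption chosen for illustration, not a measurement: 40 Critical Applications, 6 hours of hands-on restore work per application, 4 recovery teams working in parallel, and a 24-hour RTO. The function name completion_times is likewise invented for this example. It simply hands each application to whichever team frees up first.

```python
import heapq

def completion_times(restore_hours, teams):
    """Return the hour at which each application finishes restoring.

    Hypothetical model: each application goes to whichever recovery
    team becomes free first, and a team restores one app at a time.
    """
    free_at = [0.0] * teams  # hour at which each team is next available
    heapq.heapify(free_at)
    finished = []
    for hours in restore_hours:
        start = heapq.heappop(free_at)   # earliest available team
        end = start + hours
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished

RTO_HOURS = 24  # assumed RTO for every application in this sketch

# One-off exercise: a single application has the teams to itself.
one_off = completion_times([6], teams=4)

# True disaster: all 40 Critical Applications compete for the same 4 teams.
full = completion_times([6] * 40, teams=4)
met = sum(1 for t in full if t <= RTO_HOURS)

print(one_off[0], max(full), met)
```

Under these assumed numbers, the one-off restore finishes in 6 hours, while the full recovery takes 60 hours and only 16 of the 40 applications come back within the 24-hour RTO. Real environments have many more constraints (shared storage, network, dependencies between applications), so treat this as an illustration of the contention effect rather than a planning tool.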
This also calls into question why some companies perform quarterly Disaster Recovery testing on a subset of the Critical Applications. If your company has 40 Critical Applications and asks you to test ten of them every three months, it will never get clear results, because no single test puts the full recovery load on your people and infrastructure. The only way to determine whether you can meet the RTOs for all Critical Applications is to recover all of them during the same exercise.
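A rough back-of-the-envelope calculation shows how quarterly subsets can all pass while the full recovery would not. The figures are assumptions made up for this sketch (4 recovery teams, 6 hours per application, a 24-hour RTO), and last_finish_hour is a hypothetical helper, not part of any DR tool.

```python
import math

def last_finish_hour(app_count, teams, hours_per_app):
    """Hour at which the final application is restored.

    Assumes applications are restored in waves of `teams` at a time,
    with every application taking the same number of hours.
    """
    waves = math.ceil(app_count / teams)
    return waves * hours_per_app

RTO_HOURS = 24  # assumed RTO for this sketch

# Quarterly test: ten applications, restored in 3 waves.
quarterly = last_finish_hour(10, teams=4, hours_per_app=6)

# Full exercise: all 40 applications, restored in 10 waves.
annual = last_finish_hour(40, teams=4, hours_per_app=6)

print(quarterly, quarterly <= RTO_HOURS)
print(annual, annual <= RTO_HOURS)
```

With these assumed inputs, each quarterly test of ten finishes in 18 hours and reports success, yet recovering all 40 at once would take 60 hours. Four passing quarterly reports would say nothing about that gap.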
Help upper-level management and the business understand that testing one application is not the same as recovering 40, more than half of which have an RTO of 24 hours or less. Only then will they begin to understand why one annual exercise, with all of the Critical Applications recovered at once, is needed to validate whether RTOs can be met and where the process can be improved. This is what is meant by RTO and its impact on testing.