When we consider the importance of Disaster Recovery (DR) within an organization, the only true performance measure is a recovery after an actual disaster event in your data center. So, how do you tell if you are ready for that type of event? Does testing monthly, quarterly, or annually provide any real value in defining your readiness? Today I will discuss DR Testing Methodology as it applies to preparedness and to giving the business enough information to instill faith and trust in your ability to recover in the case of a true disaster.
So, what is the best interval to perform DR testing? Is it monthly, and for only modified applications, whereby you can validate that any changes made to the application or environment are accounted for in your DR environment? Is it quarterly, which allows a potentially greater number of applications to play into the exercise and gives you a better picture of inter-dependencies between applications and infrastructure? Or is it annually, whereby you can validate the overall recovery picture so that the business knows how well you perform against the requested Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs)?
Monthly tests are generally seen by the business as the most robust of the testing paradigms. However, what is difficult to quantify for the business is not only the amount of time spent performing each of these tests, but also the relatively low value attained through this method of testing. Depending on the amount of infrastructure needed to facilitate the recovery of an application, the ROI on one of these tests may be extremely low, and the effort takes resources away from other projects the business needs.
I am more of a proponent of quarterly validations than monthly ones because resources are not “wasted” performing multiple exercises. You could easily choose a quarter of your Critical applications each quarter and perform a recovery with testing for each of those apps. However, again, this does not provide an overall validation of the team’s ability to recover all Critical applications within the defined RTOs for those applications. This can only be accomplished when ALL of the Critical applications are recovered during one full exercise.
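The quarterly rotation described above can be sketched in a few lines of Python. This is only an illustration: the application names and the `quarterly_batches` helper are hypothetical, and a real schedule would likely group applications by shared infrastructure rather than alphabetically.

```python
def quarterly_batches(critical_apps, quarters=4):
    """Split the Critical applications round-robin into one batch per quarter."""
    batches = [[] for _ in range(quarters)]
    for index, app in enumerate(sorted(critical_apps)):
        batches[index % quarters].append(app)
    return batches


# Hypothetical application inventory, for illustration only.
apps = ["billing", "crm", "email", "erp",
        "hr-portal", "payroll", "warehouse", "web-store"]

schedule = quarterly_batches(apps)
for quarter, batch in enumerate(schedule, start=1):
    print(f"Q{quarter}: {', '.join(batch)}")
```

Each application is exercised once a year under this scheme, but never alongside all of the others, which is exactly the gap a full exercise closes.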
This is not to say that monthly and quarterly tests and validations cannot be done in concert with an annual exercise. The point is that, without a full-blown exercise to validate that RTOs can be met in the event of a true disaster, the other, smaller exercises are not really showing anything other than the importance a Technical Recovery Plan (TRP) plays in the overall DR Process.
So, now that we know that it is important to have an annual exercise, and maybe a couple of additional monthly or quarterly tests, what is important to test? Depending on your environment, this answer could range from end-to-end testing down to validating that an application is up and running and that the data within the corresponding databases has also been recovered.
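To make the RTO point concrete, here is a minimal sketch of how the results of an exercise might be compared against the requested RTOs. The application names, the RTO values, and the `rto_report` helper are all assumptions made up for the example, not figures from any real environment.

```python
from datetime import timedelta


def rto_report(rto_by_app, measured_by_app):
    """Label each Critical application as MET, MISSED, or NOT RECOVERED."""
    report = {}
    for app, rto in rto_by_app.items():
        measured = measured_by_app.get(app)
        if measured is None:
            # No recovery time was recorded, so there is no evidence either way.
            report[app] = "NOT RECOVERED"
        elif measured <= rto:
            report[app] = "MET"
        else:
            report[app] = "MISSED"
    return report


# Hypothetical RTOs requested by the business.
rtos = {
    "erp": timedelta(hours=4),
    "payroll": timedelta(hours=8),
    "crm": timedelta(hours=24),
}

# Hypothetical recovery times measured during an exercise.
measured = {
    "erp": timedelta(hours=3, minutes=30),
    "payroll": timedelta(hours=9),
}

report = rto_report(rtos, measured)
for app, status in report.items():
    print(f"{app}: {status}")
```

An application that was left out of the exercise shows up as NOT RECOVERED, which is the honest answer a partial test gives the business about that application.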
When I first started managing Disaster Recovery, I tried to determine which form of testing made the most sense. We had a primary data center that was backed up/replicated to our DR colocation several hundred miles away. Because some of the applications ran on legacy platforms and the guts of the application had not changed that much over the years, there was a concern around changing IP addresses on the recovered servers and the applications’ abilities to function properly. Therefore, a case was made to create a “bubble” environment for restoring the applications and providing testing. This all but ensures that end-to-end testing cannot be accomplished, mainly because you are required to have this “bubble” closed off to incoming and outgoing internet traffic.
I determined that our best practice was to validate the application could come up, the users could log in, data that was expected from the Disaster Declaration Date was available within the bubble, and some small validations could be completed. Functional testing was not to be considered a part of the exercise because we could not guarantee that the application did not need some sort of external integration or connectivity.
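The data check in particular lends itself to a simple calculation: compare the newest record recovered inside the bubble against the Disaster Declaration Date, and see whether the gap fits within the RPO. The sketch below assumes you can obtain the timestamp of the latest recovered record; the dates, the RPO, and both helper functions are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(declared_at, latest_recovered_record):
    """Time between the newest recovered record and the disaster declaration."""
    return declared_at - latest_recovered_record


def meets_rpo(declared_at, latest_recovered_record, rpo):
    """True when the data loss window is no larger than the RPO."""
    return data_loss_window(declared_at, latest_recovered_record) <= rpo


# Hypothetical values for illustration only.
declared_at = datetime(2024, 3, 1, 9, 0)
latest_record = datetime(2024, 3, 1, 8, 45)
rpo = timedelta(minutes=30)

window = data_loss_window(declared_at, latest_record)
print(f"Data loss window: {window}, RPO met: {meets_rpo(declared_at, latest_record, rpo)}")
```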
This is not the case if you have a hot/hot environment or failover. These types of environments allow for more robust testing and could include end-to-end testing. However, is that really the purpose of a DR exercise? In today’s IT environment, so much is done with middleware servers that provide access to these third-party integrations that validating everything really becomes overkill. If you were to have a disaster event and you were able to recover your middleware servers, you should have enough assurance from the middleware team that restoring that server in a new data center, i.e., your colocation, does not require a complete validation of all of the external integrations annually during your exercise. The processes that run on these servers should work the same regardless of the data center they are running in. The code has not changed, and if you have connectivity to the internet, they should still have all the URLs to connect to and should function as they do in your production environment.
So, you have done your homework on what is important, and now it is time to let the business know what will be tested. The business almost always wants to have complete, end-to-end functionality testing to validate that their applications will function as they currently do in the production environment. However, we already know, because infrastructure and coding changes are reflected in updates to the TRPs, that we can build up the servers and that any issues that arise with connectivity, communication, etc., can be easily mitigated, whether during an exercise or in the event of a true disaster. So, why go through complex test plans to validate that which we know, as DR Managers, will work if we were to bring everything back in the event of a true disaster?
This all takes me back to my article from last week, Technical Recovery Plan (TRP) Best Practices, when I mentioned the importance of architectural diagrams in the TRP process. If your organization has proper documentation to show each and every dependency and integration each application has in the overall environment, then the additional testing becomes superfluous. Bring up the infrastructure, make sure your databases, application and web servers are all restored and brought up in the right order, and then make sure your testers can log into the application and validate some data. That is what is needed to provide evidence that your applications are recoverable in the event of a true disaster.
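If the dependencies from your architectural diagrams are captured as data, the "right order" can be derived rather than remembered. The following is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the server names and the dependency map are hypothetical stand-ins for whatever your TRPs document.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each server lists what must be restored before it.
depends_on = {
    "db-server": set(),
    "app-server": {"db-server"},
    "web-server": {"app-server"},
    "middleware": {"db-server"},
}

# static_order() yields every server after all of its dependencies.
# It raises graphlib.CycleError if the map contains a circular dependency.
restore_order = list(TopologicalSorter(depends_on).static_order())

for step, server in enumerate(restore_order, start=1):
    print(f"{step}. restore {server}")
```

Servers with no dependency between them, such as the application server and the middleware server here, may come out in either order, so the result is a valid sequence rather than the only one.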
The hard part of this is the sell. You probably will not get buy-in from the business on this. However, if you can get backing from IT, and ultimately the CIO, that is your starting point. From there it is important to talk to the individuals within the organization who can understand the importance of validating that applications and infrastructure are recoverable without performing end-to-end testing. This could be individuals such as the COO or the CEO. They are the ones who should be determining what level of validation needs to be accomplished. They should be interested in why and how it is completed, and in what level of assurance it will provide them. The biggest piece of restoring a data center to your colocation after a disaster is the infrastructure. Showing that the DR Team can restore the infrastructure, start up the applications, provide the ability to log into those applications, and have the data available accounts for the largest part of the equation. Beyond that, end-to-end testing adds negligible value to the overall process.
Keep in mind that I am not advocating that you throw out functional testing if you currently incorporate it, especially if you have a hot/hot or failover environment between your primary data center and your colocation. What I am saying is that the added value of functional or end-to-end testing in your DR Testing Methodology should be considered minimal relative to the overall value of restoring infrastructure and bringing back applications and data for the business to validate.