Technical Recovery Plans (TRPs) are vital to any Disaster Recovery plan. These documents are the foundation for recovering your data center in the event of a true disaster or for testing out your plan and processes. But what should be included in the TRP document? Here we will discuss some of the Technical Recovery Plan Best Practices.
The importance of a TRP cannot be disputed. However, what goes into a TRP can vary widely depending on the organization, the DR Manager, the recovery philosophy, or even the environment that your data center and colocation provide. This article will discuss some of the areas that a DR Manager should consider when developing TRPs for their organization, and some things that would be considered Best Practice to include.
First off, every application in your landscape should have a TRP. This includes not only Critical applications (those that must be brought back because they are vital to the continuation of the business), but also all of the other applications that would need to be recovered at some point in time. While this could be a major undertaking, as organizations may run hundreds of applications in their data centers, it is a necessary practice to ensure recoverability of your data center.
So, now that we understand why we need a TRP for every application in our data center or technical environment, it is important to understand how to document the way each of these applications would be recovered in the event of a disaster or during a test. Several areas are listed below. Some may be important to you as the DR Manager; others you may feel either exist elsewhere or can easily be determined when the information is needed. However, be aware that having all necessary information in one place can greatly reduce recovery time for your applications, because you will not have to do the legwork of finding it at the time of a disaster or test.
One of the most important pieces of information is the application or system information. This would include the name (and any aliases) of the application, a description of its functionality, who owns the application on both the business and the IT side, how the application is accessed (such as a URL to the app), whether it is vendor hosted, whether it is vendor supported, and all vendor contact information if either of the previous two questions is answered “yes”. This information allows for easy communication with the teams using and supporting the application, as well as access to vendors if they are in any way involved.
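As an illustration, the application information above could be captured as a small structured record and checked for gaps before the TRP is published. This is only a sketch: the field names and the `missing_fields` helper are assumptions made for this example, not part of any standard TRP template.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ApplicationInfo:
    """Hypothetical record for the application section of a TRP."""
    name: str
    description: str
    business_owner: str
    it_owner: str
    access_url: Optional[str] = None
    aliases: List[str] = field(default_factory=list)
    vendor_hosted: bool = False
    vendor_supported: bool = False
    vendor_contact: Optional[str] = None


def missing_fields(app: ApplicationInfo) -> List[str]:
    """Return the names of fields that still need to be filled in."""
    required = ("name", "description", "business_owner", "it_owner")
    missing = [f for f in required if not getattr(app, f).strip()]
    # Vendor contact details are only required when a vendor is involved.
    if (app.vendor_hosted or app.vendor_supported) and not app.vendor_contact:
        missing.append("vendor_contact")
    return missing
```

A check like this can be run across every TRP in the landscape to find documents that name a vendor but give no way to reach them.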
Another important section deals with the Infrastructure and Application Development resources supporting each application. It should list several ways to reach each resource: at a minimum, multiple phone numbers (for use in the event of a disaster) and an email address. It should include at least the primary and backup contact for each of the infrastructure technologies the application uses, as well as the primary and secondary contact from the application development team. This ensures the DR Manager can reach a resource when someone is needed to assist with restoration of the application.
Another valuable area deals with how the application is restored. This section describes how the application is going to be recovered: failover, replication, or backups. It would include the steps to be performed for whichever of these three recovery types the application uses. Along with each step, list the person or team responsible and an estimate of the time to complete it. Additionally, note whether each step has prerequisites. So, if step 5 requires that step 3 be completed first, that should be noted in the document. This area should also describe how each of the recovery processes is validated and by whom. Finally, it is important to discuss how everything would be recovered back at the primary data center after it is made usable again, if that is what the organization requires.
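Recording prerequisites and time estimates per step also makes it possible to work out a valid execution order and a rough best-case recovery time. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the step numbers, teams, and durations are invented for illustration, and the calculation assumes that steps without a prerequisite relationship can run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical recovery steps:
# step number -> (responsible team, estimated minutes, prerequisite steps)
steps = {
    1: ("Storage team", 30, []),
    2: ("DBA team", 60, [1]),
    3: ("Network team", 20, []),
    4: ("App team", 45, [2, 3]),
    5: ("App team", 15, [4]),
}


def execution_order(steps):
    """Return the step numbers in an order that respects prerequisites."""
    graph = {number: prereqs for number, (_, _, prereqs) in steps.items()}
    # static_order() raises graphlib.CycleError if prerequisites are circular.
    return list(TopologicalSorter(graph).static_order())


def earliest_finish(steps):
    """Return the earliest finish time (in minutes) for every step."""
    finish = {}
    for number in execution_order(steps):
        _, minutes, prereqs = steps[number]
        start = max((finish[p] for p in prereqs), default=0)
        finish[number] = start + minutes
    return finish


if __name__ == "__main__":
    finish = earliest_finish(steps)
    print("Order:", execution_order(steps))
    print("Best-case recovery time (minutes):", max(finish.values()))
```

A circular prerequisite (step 3 waits on step 5, which waits on step 3) is exactly the kind of documentation error that is far cheaper to catch during a review than during a recovery.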
Another section that you would want to document is information about the servers, as well as the backups or replication for each piece of infrastructure the application requires. This should list the server name, the server type (App, DB, Web, etc.), whether the server is virtual, the OS on the server and any release levels that have been applied to it, and the corresponding server name at the recovery site (if it is different from the prod server name). It should also note whether the server is load-balanced and, if so, which servers it is load-balanced with, along with the recovery order or sequence so that your servers can be restored with minimal need for reboots.
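One way to picture the server section is as an inventory that can be sorted into recovery batches. The following is a minimal sketch; the `ServerEntry` fields mirror the items listed above, but the names and the idea of numbering the recovery order as integers are assumptions for this example.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import List, Optional


@dataclass
class ServerEntry:
    """Hypothetical row in the server section of a TRP."""
    name: str
    server_type: str          # e.g. "App", "DB", "Web"
    is_virtual: bool
    os_release: str
    recovery_order: int       # lower numbers are restored first
    recovery_site_name: Optional[str] = None
    load_balanced_with: List[str] = field(default_factory=list)

    def target_name(self) -> str:
        """Name to use at the recovery site (falls back to the prod name)."""
        return self.recovery_site_name or self.name


def recovery_batches(servers: List[ServerEntry]) -> List[List[str]]:
    """Group recovery-site server names by recovery order."""
    ordered = sorted(servers, key=lambda s: s.recovery_order)
    return [
        [s.target_name() for s in group]
        for _, group in groupby(ordered, key=lambda s: s.recovery_order)
    ]
```

Servers that share a recovery order land in the same batch, which suits load-balanced pairs that can be brought up together.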
Another vital part of this document would be the interfaces and dependencies that each application has on other elements in the IT landscape. All integrations should be listed so that, during recovery of each application, the integrations deemed Critical can also be recovered and validated before the application is turned over for validation and testing, and therefore before it is used in the recovered environment. Dependencies need to be restored before the application can function, so listing each of them is extremely important in the overall recovery process. Each dependency should also be marked as downstream, upstream, or bi-directional so that all elements can be validated.
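Recording the direction and criticality of each integration lets you derive a validation checklist rather than rebuilding one from memory. The sketch below assumes one common convention: an upstream system sends data to the application, and a downstream system receives data from it. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Direction(Enum):
    UPSTREAM = "upstream"            # the application receives from the partner
    DOWNSTREAM = "downstream"        # the application sends to the partner
    BIDIRECTIONAL = "bi-directional"


@dataclass
class Integration:
    partner: str
    direction: Direction
    critical: bool


def validation_checklist(integrations: List[Integration]) -> List[str]:
    """List the checks to run on Critical integrations before handover."""
    checks = []
    for item in integrations:
        if not item.critical:
            continue  # non-critical integrations can be validated later
        if item.direction in (Direction.UPSTREAM, Direction.BIDIRECTIONAL):
            checks.append(f"confirm data is received from {item.partner}")
        if item.direction in (Direction.DOWNSTREAM, Direction.BIDIRECTIONAL):
            checks.append(f"confirm data is sent to {item.partner}")
    return checks
```

A bi-directional integration produces two checks, one per direction, so neither side of the exchange is overlooked.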
One other piece of information that can greatly assist in the recovery of applications is an architectural diagram. This document should outline each integration in a visual format and show which servers are part of each integration. They say that a picture is worth a thousand words. In the event of a disaster, it is sometimes easier and less time-consuming to view an architectural diagram of the application, its components, dependencies, and integrations than to read the entire TRP, and you may come away with a better general understanding of the overall recovery plan for that application. I am not saying that you should bypass reading the document. However, getting a view of the necessary pieces before beginning the recovery can reduce overall recovery time.
The last piece of any document that you put into your DR environment should be a revision history. This is vital to ensure that the document is being revised and validated on a regular basis, and it also assigns ownership to the changes that are made. I would recommend more oversight than just this section; many document versioning tools allow multiple versions to be saved and let you see what has changed. One other practice to consider is to always obtain sign-off when updates are made. This will ensure that middle-level management is on board with the changes being made to the document and is aware of any changes to the environment. They should ultimately be responsible for the applications owned in their areas.
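If the revision history is kept in a structured form, a simple check can flag TRPs that have gone too long without review or that contain changes nobody signed off on. This is a sketch under assumptions: the `Revision` fields and the 365-day review interval are illustrative choices, not a recommendation from any standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Revision:
    """Hypothetical entry in a TRP revision history."""
    version: str
    changed_on: date
    author: str
    summary: str
    approved_by: Optional[str] = None  # None means no sign-off recorded


def review_findings(history: List[Revision], today: date,
                    max_age_days: int = 365) -> List[str]:
    """Return human-readable problems found in the revision history."""
    if not history:
        return ["no revision history recorded"]
    findings = []
    latest = max(history, key=lambda r: r.changed_on)
    if (today - latest.changed_on).days > max_age_days:
        findings.append(
            f"last revised {latest.changed_on.isoformat()}, "
            f"more than {max_age_days} days ago"
        )
    findings.extend(
        f"version {r.version} has no sign-off"
        for r in history
        if r.approved_by is None
    )
    return findings
```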
So, there you have it: some Best Practices for a Technical Recovery Plan. This is certainly not the end-all for TRPs. There are other inclusions that could be made depending on your environment or needs. I am going to attempt to include some templates of Technical Recovery Plans in the DR Document Templates section, and would love to have some of you send me a clean copy of your TRP template. That way, others can use pieces from those templates to create their own, or quickly get started on their DR Plan and processes without having to reinvent the proverbial wheel when it comes to TRPs. Please send those to dan@disasterrecoveryblog.com.