Disaster Recovery has one main purpose: to restore an organization's infrastructure and applications at the colocation facility. With that comes the need to rank applications, and the corresponding infrastructure, by the Recovery Time Objective (RTO) defined by the business. That ranking generally determines what should be restored first because it is the most critical to the business.
Now, let us say that your organization has placed an RTO of 24 hours on one of the applications in the data center. Many of the applications in the data center probably carry a similar designation, but are all of these applications really that critical? If they were, they would have all the latest technologies attached to them, making recovery within 24 hours readily achievable.
For discussion purposes, imagine one of the most critical applications in the data center for a moment. This application runs on Windows Server 2008 because the business has not provided money for upgrades to the application. The database is on SQL Server 2008 as well, since that is the version the application supports. Because this is such a highly critical application, and hence database, the database comes in at around 4 TB. To top it off, the Windows Server 2008 machines this application runs on are all physical servers, not virtualized.
Hopefully, you are beginning to get the picture. This application, which is supposed to be one of the most critical applications in the data center, lacks so much of what is needed to recover it within the given RTO. So, what can the DR Manager do to make sure that he or she is covered if the application cannot be recovered within the requested RTO?
One way to address this issue is to set up a system of DR Guarantees, almost like an SLA for Disaster Recovery. You, as the DR Manager, build a matrix of supported infrastructure and technologies at the highest level. That would include the current Windows Server versions, the current SQL Server versions, virtualization of servers and databases, and even database size in cases where the right replication capabilities between your data center and your colocation facility are missing.
You list every level of each of these technologies that exists in your data center and take points off when an application is not using the best and most recent ones. For instance, an application running its application and web servers on Windows Server 2008 automatically loses points for not being current. Beyond that, if a production server is no longer supported by its vendor, the application is automatically disqualified from being considered Critical, because recovering that hardware/OS combination becomes increasingly difficult.
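To make the matrix idea concrete, here is a minimal sketch of what such a matrix might look like in Python; the categories, versions, and support flags are illustrative assumptions, not a complete or authoritative list:

```python
# Hypothetical technology matrix: for each category, the levels present in
# the data center, ranked from most to least current, with a flag for
# vendor support. Versions and flags are illustrative only; replace them
# with what is actually deployed and supported in your environment.
TECH_MATRIX = {
    "server_os": [
        {"level": "Windows Server 2019", "vendor_supported": True},
        {"level": "Windows Server 2012 R2", "vendor_supported": True},
        {"level": "Windows Server 2008", "vendor_supported": False},
    ],
    "database": [
        {"level": "SQL Server 2019", "vendor_supported": True},
        {"level": "SQL Server 2012", "vendor_supported": True},
        {"level": "SQL Server 2008", "vendor_supported": False},
    ],
    "virtualization": [
        {"level": "virtualized", "vendor_supported": True},
        {"level": "physical", "vendor_supported": True},
    ],
}
```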
The plan, then, is to assign an SLA level to each of the RTO tiers you use to define Critical applications. Platinum would cover applications requiring recovery within, say, 24 hours. Your next RTO tier might be 48 hours; that would be Gold. Silver might be 72 hours, with Bronze coming in at 5 days. If you use only that number of tiers, everything else falls into non-Critical and would be considered Tin or Plastic. The idea is to give the business a new way of thinking about tier levels measured against the technologies employed for each application.
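As a minimal sketch, the tier names and RTO windows described above could be captured in something like the following; the values simply mirror the examples in this post, so adjust them to your own tiers:

```python
# Tiers ordered from most to least critical; names and RTO windows mirror
# the examples above and should be tailored to your organization.
TIERS = ["Platinum", "Gold", "Silver", "Bronze", "Tin"]

RTO_HOURS = {
    "Platinum": 24,   # recover within 24 hours
    "Gold": 48,
    "Silver": 72,
    "Bronze": 120,    # 5 days
    "Tin": None,      # non-Critical; no guaranteed recovery window
}
```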
So, everything starts out at the Platinum SLA DR level that the business has deemed Critical. Any technology listed in your matrix that is no longer supported by the vendor automatically drops the application to Tin or Plastic. Additionally, you drop the application by one SLA level for each item that is not your top technology. For example, the database server runs SQL Server 2012 and does not provide the level of replication needed to keep a primary, a secondary, and a remote tertiary server running so that the database can be recovered almost immediately at the colocation facility; the application therefore drops one level from wherever it currently stands. Let us also imagine that this database server holds a 4 TB database, so recovery from a backup may take 2 or 3 days; that drops the SLA another level. Is the application on virtualized servers? No? Then it goes down another level.
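To make the downgrade logic concrete, here is a minimal sketch in Python, assuming the TIERS list from the sketch above; the function name and its parameters are hypothetical and not part of any established tool:

```python
def dr_sla_tier(unsupported_tech: bool, deficiency_count: int) -> str:
    """Start at Platinum and downgrade per the rules described above.

    unsupported_tech: any component no longer supported by its vendor
                      drops the application straight to Tin.
    deficiency_count: number of matrix items below the top technology
                      (older SQL Server version, missing replication,
                      oversized database, not virtualized, ...);
                      each one drops the SLA by one level.
    """
    if unsupported_tech:
        return "Tin"
    index = min(deficiency_count, len(TIERS) - 1)
    return TIERS[index]

# The example above: SQL Server 2012 without the needed replication,
# a 4 TB database, and physical (non-virtualized) servers would drop
# the application three levels, from Platinum down to Bronze.
print(dr_sla_tier(unsupported_tech=False, deficiency_count=3))  # Bronze
```

The point of a small function like this is simply that the downgrade rules become explicit and repeatable, rather than being argued application by application.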
The idea behind this exercise is not to anger the business but to expose the shortfalls of applications that are not using the most current technologies. By performing this exercise for all of the Critical applications in your data center, you will have a better understanding of what is truly possible in the event of a disaster. When you go to the business owner of the application mentioned earlier, you can point out why it should not be considered Critical given the lack of investment in keeping it current. Hopefully this will prompt the business teams to think harder about which applications need money spent on updating the technologies they run on, so that if, for some horrible reason, you lose your data center, the most critical applications can be recovered in a timely fashion.
The matrix that you create will cover the hardware, server operating systems, virtualization techniques, replication, and backup architecture in place within your organization. Unfortunately, there is no one-size-fits-all answer when it comes to creating one of these matrices. I would be glad to assist you in defining one for your organization if you run into trouble with any of the points described in this blog post.