Disaster Recovery – RTO versus RTA

Disaster Recovery (DR) has many different aspects, from Recovery Time Objectives (RTO) to Recovery Point Objectives (RPO).  Individuals in both IT and the business seem to think that what you accomplish during one exercise defines how you will perform during the next.  Unfortunately, many factors influence preparedness and outcomes.

The business seems to look at Recovery Time Actual (RTA), or how quickly you were able to recover an application during an exercise, as its new starting point for what it expects going forward.  If your performance one year beats the RTO by 50%, can we safely assume that the application will be recovered in the same timeframe during the next exercise?

My response to the business would be that they should not have those expectations.  While I, as a DR Manager, may have those expectations, the business should look at the required recovery time of their applications and base the RTO on when the application must be back, not on past performance.

Things change.  Infrastructure changes year over year, the number of Critical applications rises or falls, the technology in the colocation ages and needs upgrading, and replication methodologies change.  Any of these changes can affect how quickly the team is able to recover applications during a DR exercise.  Having the business say “You were able to recover it in X hours last year” does not automatically make it so during the next exercise.

It is imperative for the DR Manager and team to establish the process for determining Criticality and follow it regardless of performance during the previous exercise.  Inform the business on what RTO really means and make sure there is little variance from year to year unless the application somehow takes on more significance in the organization over that timeframe.  Make sure that you adequately explain what it means to recover applications within the RTO window and set expectations for future exercises.

The main purpose of an RTO is to ensure that the business has its applications restored within the timeframe it defines as needing the application to be up and running.  Having the business always look at past performance and expect equal or better results going forward will only hurt the overall DR process and could potentially reflect poorly on the DR Manager and team.  Making sure that the business understands the true meaning of RTO versus RTA could greatly improve your rapport with the business as well as your ability to continue making improvements in the DR space to ensure that RTAs continue to drop.
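To make the distinction concrete, here is a minimal sketch, in Python, of how exercise results might be reported against the RTO rather than against the previous year's RTA.  The application name, hours, and function name are hypothetical and purely illustrative of the point above: beating an RTO in one exercise does not change the commitment the business should plan around.

```python
# Minimal sketch (hypothetical data): report each exercise against the agreed
# RTO, not against the previous year's RTA, so expectations stay anchored to
# the business requirement rather than to past performance.
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    application: str
    rto_hours: float  # business-defined Recovery Time Objective
    rta_hours: float  # Recovery Time Actual measured during the exercise


def rto_report(results: list) -> None:
    for r in results:
        status = "met" if r.rta_hours <= r.rto_hours else "missed"
        margin = abs(r.rto_hours - r.rta_hours)
        print(f"{r.application}: RTO {r.rto_hours}h, RTA {r.rta_hours}h "
              f"({status} RTO by {margin}h)")


# Beating a 24-hour RTO by 12 hours in one exercise does not reset the RTO.
rto_report([
    ExerciseResult("Order Processing", rto_hours=24, rta_hours=12),
    ExerciseResult("Order Processing", rto_hours=24, rta_hours=20),  # next year: still met
])
```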

The Argument for a DR Guarantee

Disaster Recovery has one main purpose: to restore all of the infrastructure and applications for an organization at the colocation.  With this comes the need to rank applications, and hence the corresponding infrastructure, based on the RTO, or Recovery Time Objective, defined by the business.  This generally helps to determine what should be restored first, as it is most critical to the business.

Now, let us say that your organization has placed an RTO of 24 hours on one of the applications in the data center.  Many of the applications in the data center probably have a similar designation; however, are these applications all really that critical?  If they were, then they should have all the latest technologies attached to them to make recovery within 24 hours easily obtainable.

For discussion purposes, let us imagine one of the most critical applications in the data center for a moment.  This application runs on a Windows 2008 server because the business has not provided money to upgrade the application.  The database is on SQL Server 2008 as well, since that is the version the application supports.  And because this is such a highly critical application, and hence database, the database comes in at around 4TB.  To top it off, the Windows 2008 servers that this application runs on are all physical servers, not virtualized.

Hopefully, you are beginning to get the picture.  This application, which is supposed to be one of the most critical applications in the data center, lacks so much when it comes to being able to recover it within the given RTO.  So, what can the DR Manager do to make sure that he/she is covered if the team is not able to recover this application within the requested RTO?

One way to address this issue is to set up a system of DR Guarantees, almost like an SLA for Disaster Recovery.  You, as the DR Manager, set up a matrix of supported infrastructure and technologies at the highest level: the current Windows Server version, the current SQL Server version, virtualization of servers and databases, and even database size where the right replication capabilities between your data center and your colocation are lacking.

You list all levels that exist in your data center for each of these infrastructure components and take points off when an application is not utilizing the best and most recent technologies.  For instance, an application running its application and web servers on Windows 2008 automatically loses points for not being current.  Beyond that, if a server that is being used in production is no longer supported by the provider, that automatically disqualifies the application from being considered Critical, as the likelihood of not being able to recover that hardware/OS becomes increasingly high.

So, the plan is to have an SLA level for each of the Tier levels that you use to define Critical applications.  Platinum would cover those applications requiring recovery within, say, 24 hours.  Your next RTO Tier may be set at 48 hours; this would be the Gold level.  Silver may be 72 hours, with Bronze coming in at 5 days.  If you only use that number of Tier levels, then the rest would fall into non-Critical and would be considered Tin or Plastic.  The idea is to define for the business a new way of thinking about Tier levels against the technologies that are employed for each of those applications.

So, every application the business has deemed Critical starts out at the Platinum SLA level.  Any technology listed in your matrix that is not supported by the vendor automatically drops the application to Tin or Plastic.  Additionally, you drop the application by one SLA level for each item that is not your top technology.  For example, the DB server is running on SQL Server 2012 and does not provide the level of replication needed to have a primary, secondary, and remote tertiary server running to ensure nearly immediate recovery of the database at the colocation; therefore, you drop the application SLA from wherever it currently stands by one level.  Let us also imagine that a 4TB database is associated with the DB server and that recovery from a backup may take 2 or 3 days; this drops the SLA for this application another level.  Is the application on virtualized servers?  No?  Then it goes down another level.
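For anyone who prefers to see the mechanics spelled out, below is a minimal sketch of how such a DR Guarantee matrix might be scored.  The tier names and the kinds of deductions come from the discussion above; the specific attribute names, the 2TB size threshold, and the example application are hypothetical and would need to be tailored to your own matrix.

```python
# Minimal sketch of a DR Guarantee scoring matrix (attribute names, thresholds,
# and point values are hypothetical).  Every Critical application starts at
# Platinum; vendor-unsupported components disqualify it outright, and each
# dated technology costs one SLA level.

TIERS = ["Platinum", "Gold", "Silver", "Bronze", "Tin"]  # 24h, 48h, 72h, 5 days, non-Critical


def dr_guarantee_tier(app: dict) -> str:
    # Hardware/OS no longer supported by the provider disqualifies the application.
    if not app["vendor_supported"]:
        return "Tin"

    level = 0  # index 0 = Platinum
    if app["os"] != "current":        # e.g. an old Windows Server release
        level += 1
    if app["db"] != "current":        # e.g. SQL Server without the needed replication
        level += 1
    if app["db_size_tb"] > 2:         # too large to restore from backup within the RTO
        level += 1
    if not app["virtualized"]:        # physical servers are slower to rebuild
        level += 1
    return TIERS[min(level, len(TIERS) - 1)]


# A hypothetical application with several of the deficiencies described above.
legacy_app = {"vendor_supported": True, "os": "dated", "db": "dated",
              "db_size_tb": 4, "virtualized": False}
print(dr_guarantee_tier(legacy_app))  # -> "Tin" (four levels below Platinum)
```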

The idea behind this exercise is not to anger the business but to outline the shortfalls of not having applications utilize the most current technologies.  By performing this exercise for all of the Critical applications in your data center, you will have a better understanding of what is truly possible in the event of a disaster.  If you go to the business owner of the application mentioned earlier, you would be able to point out why it should not be considered Critical because of the lack of investment in keeping it current.  Hopefully this will prompt more thought within the business teams about which applications need money spent to update the technologies they run on, so that if, for some horrible reason, you lose your data center, the most critical applications can be recovered in a timely fashion.

The matrix that you create will apply to the hardware, server OSs, virtualization techniques, replication, and backup architecture that you have in place within your organization.  Unfortunately, there is no one-size-fits-all approach when it comes to creating one of these matrices.  I would be glad to assist you in defining one for your organization if you run into trouble with any of the points described in this blog post.

RTO and its Impact on Testing

As we all know, RTO stands for Recovery Time Objective, and the RTO for a given application specifies the timeframe in which the business would like to see IT restore the application in the event of a disaster to the data center.  What is not included in this definition, however, is how it applies to our DR exercises and planning, and hence RTO and its impact on testing.

This type of conversation happens a lot with the customer:

Customer: Can you run a test on my critical application to validate that it can be recovered?  We have made some changes to the architecture and want to make sure that your documentation accomplishes what is required in the event of a disaster.

DR Manager: We can do that.  Would you like this done as a one-off, or included in our annual DR exercise planned for the fall?

Customer: I do not want to wait until the fall.  I cannot be sure you can restore the application unless you validate it now that the changes are in place.

DR Manager: We will get it tested in the next couple of weeks.

After a couple of weeks:

Customer: Was the test successful?  Were you able to recover our application in the colocation?

DR Manager: Yes.  It was recovered and validated.

Customer: Did it fall within the RTO?

DR Manager: Yes, it did.

Customer: Good.  Then we are confident that it can be recovered within the RTO in the event of a disaster.

Hopefully everyone sees the folly in having the customer assume that an application meeting its RTO in a one-off exercise somehow verifies that it will meet the RTO in the event of a true disaster, unless it is one of only a handful of applications that need to be recovered.

The truth is that most organizations have somewhere between several dozen and hundreds of applications that will need recovering.  Additionally, the number that fit our “Critical Application” definition and will need to be recovered in the first 24 hours to several days could be several dozen.  Does testing one application by itself in a one-off give you, the DR Manager, a solid belief that you can have it restored in that timeframe during a true disaster?  I would guess the answer to that question is not necessarily.  It always was for me.

The business, and hence customers, do not necessarily understand that recovering an application in a one-off exercise is not the same as a true disaster recovery.  We do not have two or three servers to restore; we have several hundred or thousands of servers to restore, and our Critical Applications alone may account for several hundred of those.  Therefore, the only way to give a definitive answer to the business on whether an application can be recovered within the RTO is to have it be part of the bigger exercise you perform where all Critical Applications are restored at once.

This also calls into question why some companies perform quarterly Disaster Recovery testing on a subset of their Critical Applications.  If your company has 40 Critical Applications and asks you to test ten of them every three months, you will never get clear results.  The only way to determine whether you can meet the RTOs for all Critical Applications is to recover all of them during the same exercise.

Help upper-level management and the business understand how testing one application is not the same as recovering 40, with more than half of those having an RTO of 24 hours or less.  Only then will they begin to understand why one annual exercise with all of the Critical Applications being recovered at once is needed to validate whether or not RTOs can be met and where improvements in the process can be made. This is what is meant by RTO and its impact on testing.

Definition of a Disaster

For some reason, your data center just went down.  It’s time to declare a Disaster.  Or is it?  Many individuals within an organization, most of whom are unfamiliar with Disaster Recovery, let alone the processes and policies you may have in place, are quick to decide that a Disaster should be declared.  But do they know what has actually occurred within the data center and its overall impact on bringing your infrastructure and systems back online?

All too often, individuals who do not have a full understanding of events, processes, or even the makeup of your colocation will proclaim to know what is best for the organization.  Let’s say you just had an Emergency Power Off (EPO) event that took down everything within your environment, from networking, to storage, to your Citrix farm, to all servers and hence, all applications.  Is it time to declare a disaster and restore at your colocation?

What if your colocation only has enough compute, RAM, and storage for recovery of your Critical applications?  Still think it’s a good idea to restore at your colocation?  What if the recovery time of your applications within the colocation is slated to take 3 days while your primary data center can be restored within 24 hours?  Still a good idea to restore elsewhere?

When writing your DR Policy, it is usually good practice to include a section on The Definition of a Disaster.  In it, you will want to include information such as the viability of your current data center.  In the case of an EPO, is it better to attempt to recover your primary data center, or should you start restoring to your colocation?  If the colocation provides full failover capabilities, then that may be the best decision.  However, if your colocation, as mentioned before, has a limited amount of resources and you won’t be able to restore all of your applications, then making sure that your primary data center can be recovered may be the best solution.
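As a rough illustration of the kind of decision criteria that section could capture, here is a minimal sketch in Python.  The inputs mirror the questions above (estimated primary recovery time, colocation recovery time, whether the colocation can host everything), but the function, its thresholds, and the example numbers are hypothetical, not a prescription.

```python
# Minimal sketch (illustrative thresholds): weigh restoring the primary data
# center against declaring a disaster and failing over to the colocation.
from typing import Optional


def should_declare_disaster(primary_recovery_hours: Optional[float],
                            colo_recovery_hours: float,
                            colo_hosts_all_apps: bool) -> bool:
    # If the primary data center is a total loss, failover is the only option.
    if primary_recovery_hours is None:
        return True
    # If the colocation can only host Critical applications, favor restoring
    # the primary unless it will be down dramatically longer than failover takes.
    if not colo_hosts_all_apps:
        return primary_recovery_hours > 2 * colo_recovery_hours
    # Otherwise, declare only if failover is genuinely faster.
    return colo_recovery_hours < primary_recovery_hours


# The EPO scenario above: primary back in ~24 hours, colocation recovery ~3 days,
# and the colocation sized only for Critical applications.
print(should_declare_disaster(24, 72, colo_hosts_all_apps=False))  # -> False
```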

Imagine having an EPO occurrence where, because of the environment, you were only able to restore Critical applications at your colocation.  If a decision had been made to declare a disaster, you would have had only your Critical applications back up and running, perhaps within a couple of days.  As soon as the Disaster was declared, your hosting team would have needed to order additional hardware to account for the other 70% of the applications in your environment.  That could take days, weeks, or more likely months until you were fully recovered.  At that point, you would have needed to begin some form of replication back to your primary data center, now acting as your colocation.  After you caught up, you might need to take another outage to restore back to your primary data center.

Imagine what everyone would be saying about the lack of Disaster Recovery planning on your part if this occurred.  Therefore, it is vital to include a section that outlines exactly what the definition of a disaster is.  If it is the viability of the data center, spell out what that means.  This could mean that you have redundant power and internet coming into the data center.  You could talk about having the correct HVAC conditions in your data center.  It could reference having enough compute power to bring up most of the systems.  Either way, having this documented will help to ensure that the decision to declare a Disaster is made correctly, following the DR Policy and everything that you have worked toward.

Who can declare a Disaster?

When something happens to your data center, is everyone in the company aware of who can declare a disaster? Many employees within an organization have a misconception about who has the authority to declare that a disaster has occurred such that your company’s DR policy and recovery of the data center should be put into action.

The information around who declares a disaster should be clearly defined within your DR Policy as well as the Business Continuity Management Policy.  But does having it stored within a policy provide enough guidance to the rest of the organization for a process to be followed?  Sometimes, through no fault of the DR Manager or the Business Continuity Manager, individuals, even senior or C-level leaders, may not know what is in the DR or BCM Policies.  This can occur when new people come into an organization and have not yet consumed that information, or when changes are made to the policies and everyone has not yet been updated.

Moreover, many organizations may not have a good method for delivering this message to everyone within the organization.  With that being said, the first order of business is to make sure that whoever can declare a disaster is clearly spelled out in both policies and that those individuals are made aware of the process.

Many times, when an issue comes up, such as a loss of a set of servers or an Emergency Power Off (EPO) event occurs where someone presses the red button in the data center by mistake, certain individuals will automatically call for a disaster to be declared.  Many of these individuals will not have knowledge about the process of declaring a disaster, nor what should go into determining whether a disaster should even be declared, but will remain vocal about having the declaration made.

These are the times when it is vital that a solid process is in place and that the DR and BCM policies have been updated to include this information.  Instead of having people within the organization attempt to get their way, thinking that it is the right way, being able to point them to these policies and processes can save large amounts of time and meetings spent discussing why you haven’t already declared a disaster, especially when you are probably not the one who can declare it.

One thing to remember throughout this, though, is that no matter who is responsible for declaring the disaster, it is essential that the DR team is a part of the process. Many organizations will decide the CEO, COO, or maybe the CIO are part of the team responsible for declaring the disaster.  However, without input from the DR professionals that were hired to help make those decisions, the declaration may be completely without merit. Take the EPO example; shutting off power to a data center does not necessarily constitute a disaster.  Therefore, including the DR professionals in the conversation can help steer those making the decision about declaring a disaster in the right direction.

I will talk more about creating a document on defining a disaster in a future post.  But, first, getting people to understand who can even declare the disaster is important within any organization employing a DR process.

DR Blog Ground Rules

The DR Blog is going to need some ground rules to be successful.  The intent is not to control anyone, but rather to make the site enjoyable and valuable for everyone.  The following items are imperative to achieving that.

  • First and foremost, this site is not about products, nor the endorsement of companies or any products they might provide that assist or function in the DR landscape.  I can honestly say that I have not evaluated any companies or products and have little to no control over what is purchased from an infrastructure perspective in the organization where I work.  I may have some notion of what works well, and I may write about a certain technology such as replication, HCI, etc., but understand that I will not endorse a product.  If you want to say something positive about a product or company in the discussion groups, that is fine, as long as it is not libelous.  Just know that I will not be posting my opinions there.
  • Be courteous.  Everyone has an opinion.  For the most part, none are better or more important than others.  Nobody here is an expert.  Some people may have been in the business of DR for more than 20 or 30 years.  That does not mean that they have all of the answers or that there is no room for improvement in their DR processes.  Share your information and opinions.  Meanwhile, realize that all of us are just trying to learn, take in information, and share our thoughts.  We all still have our day jobs.
  • If you have questions or concerns for me, please do not hesitate to send me an email.  If it is in regard to a topic, discussion, whatever, I will try to get you an answer.  I may or may not read every post in all of the discussion sections.  As the DR Blog grows, I may not have time to do all of that.  Remember, I still have a day job and a personal life also.  I will try to keep up on issues but cannot know everything that is going on.  I will, however, read and edit all authored blog posts from personal contributors as well as White Papers to make sure they are appropriate for our site.
  • I may, at some point, ask for moderators for our discussion groups.  That will come over time and will allow for us to have someone looking to make sure posts are asking questions, providing solutions, etc., without being confrontational.  Again, this blog is for all of us.  Helping to ensure that it is of value and is a safe place to ask any question about DR is extremely important to me.  There are no stupid questions.  There are probably going to be certain questions that get asked over and over.  As that occurs, we will create a FAQ post that people should read first.
  • Sponsors, please do not go into discussion groups to put your 2 cents in about technologies and products for the purpose of endorsing your product or putting down another organization or their products.  Again, we are looking to have a site that provides information.  Individuals on this site are here to gather information and feedback.  They are not here to have sales pitches jammed down their throats.  I will reserve the right to remove you and your sponsorship from this blog if you cannot follow these basic rules.

Those are probably enough ground rules to start with.  This site is about Disaster Recovery – principles, terminology, processes, training sites, and how we can make it better in our respective organizations.

Why a DR Blog?

After about 27 years in IT, I was at a crossroads.  I was in my early 50s, had spent my entire life in some form of technical job, from help desk to app dev, and found myself looking at my future.  Do I teach myself yet another programming language for the last few years of my career, or do I make the jump to the infrastructure side of the IT aisle and take on a new challenge: Disaster Recovery?

Well, I decided that this old dog had learned enough when it came to programming languages.  I traded in that hat for one resembling a Project Manager’s, leading DR training, test plan creation, and DR exercises.  How hard could it be, I asked myself?  I had been involved with DR exercises in the past from an app dev perspective, but now it was going to be mine to lead.

I took over a process that was severely flawed.  I did not know why or how it was flawed, but with a testing success rate of about 45% in the year before I took it over, I figured that this was not a position the organization really wanted to be in.  I started coming up with ideas on how to improve the process.  I thought they were good, but would others in the organization agree?

Our first exercise was a capability exercise.  Upper-level management wanted to make sure some of my ideas would prove fruitful in our DR program.  It was a success.  The full exercise toward the end of the year?  Another success.  And when I say success, I mean that we came in with all applications successfully recovered, with only a couple of issues found by the business testers.  This was easy, right?

Well, the one thing that I have learned over the years is that success is only as good as your last exercise.  People in the organization tend to question your methods, the importance of a DR Plan, or whether or not you are meeting minimum standards.  While I believe that I have developed a good process, others are always quick to judge and tell you that there are better ways to do things, regardless of their knowledge on the subject.

The Disaster Recovery Blog is the place where we can discuss Disaster Recovery.  Not a combination of Business Continuity/Resiliency AND Disaster Recovery, but rather Disaster Recovery and only Disaster Recovery.  There are a couple of sites out there where you might find a post about something that relates to our jobs, but they are usually mixed in with a heavy dose of Business Continuity. Also, most training courses for Disaster Recovery incorporate tons of information about BC, when we are probably looking for training on DR specific topics.  While I am not trying to diminish the value of BC, I just want a site where I can find DR specific topics.  This is the place.

The Disaster Recovery Blog will be a place where DR Managers and practitioners can come to find articles on, you guessed it, Disaster Recovery.  And, the main blog area will contain blogs by not only me, but by other DR Managers and practitioners in the industry.  People will send me their blogs that they would like published, and I will put them on the DR Blog site, with all accolades going to the author.  I want to create discussion groups so that we can discuss current topics, ask questions, or complain about something in the DR world. 

It will also have sponsorships at some point.  However, sponsorships are just that: someone paying to try to reach all of you with their products.  Just because someone has a link to their product site, or writes a white paper that I post in the White Paper section of this site, does not mean that I endorse any of these products.  I will not make it my business to endorse any product listed, nor any company that is allowed to be a sponsor of the site.  Since this site is going to be one of the only ones of its kind on the internet, I want to make sure that companies can sponsor the site and post information in the form of white papers, but without my endorsement of them.

Hopefully I made myself clear.  NO endorsements here.  You can discuss them in the discussion sections and give your opinions, but I am not going to chime in and provide my opinion.  That would not be professional, and is not the purpose of this blog.

I will also look for valuable feedback from DR Professionals.  That is who this blog is for.  I want this to be the site you visit on a weekly basis to find out what is new in the world of Disaster Recovery.  I want you to come and ask questions and carry on conversations about process, function, and what does and doesn’t work, but mostly to make yourself a better DR Manager or practitioner.  We can all add value and help each other do the best that we can to deliver a DR process that our organizations can be happy with.

Dan