The Tie That Binds Disaster Recovery and Business Continuity

Disaster Recovery (DR) and Business Continuity (BC) are very different areas within an organization. While some organizations have them under one group, others have done much to separate them. However, without BC, you really cannot have DR. There is a definitive tie that binds the two together.

BC can be defined as a set of plans, procedures, processes, and controls that allow an organization to operate despite some sort of interference, whether man-made or otherwise. The BC organization attempts to have processes set up to allow for the business to carry on despite something interfering with the normal day-to-day functioning of the organization.

But what does DR have to do with that? Well, first off, there is the disaster that impacts the Data Center of an organization. This disaster should be at least mentioned as a potential BC impact. However, the plans, processes, and controls that allow an organization to operate are generally owned by DR instead of BC for these types of situations. So, then, why do we think that these two different organizational groups are somehow tied together?

Good BC planning will take many things into account that will operationally assist DR planning. A good BC plan and process will ensure that such things as Critical Applications, RTOs, RPOs, etc. are properly addressed. During the Business Impact Analysis (BIA), the BC team will ask which applications the business deems Critical. They will document such things as RTO and RPO for all the applications within the organization so that the DR team has access to this information. This will allow the DR team to properly plan exercises, perform testing, and be ready for an unforeseen catastrophe that requires recovery of the Data Center at the colocation.

One thing to be careful of during the BIA process is that almost every business area will deem the software and applications they use to be Critical. The BC and DR teams should work together with upper-level management to define what the appropriate measures are for determining Criticality of applications within the organization. This could be monetary implications, reputational losses, statutory requirements, and other such items that are important to the organization. It could also include something that I referred to in a previous blog that outlined how to incorporate DR Guarantees into determining Criticality of applications in the IT landscape. Only after these measures are set for determining Criticality can the DR team adequately and accurately determine the plans and processes necessary to recover the IT components to carry on the business.
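To make that idea concrete, here is a minimal Python sketch of scoring Criticality against agreed measures. The criteria, weights, and threshold are illustrative assumptions, not a standard; any real values would have to come from your own upper-level management.

```python
# Hypothetical sketch: scoring application Criticality against measures agreed
# with upper-level management. Criteria names, weights, and the threshold are
# illustrative assumptions, not a prescribed standard.

CRITERIA_WEIGHTS = {
    "monetary_impact": 0.4,        # estimated revenue/cost impact of an outage
    "reputational_loss": 0.3,      # customer-facing visibility of the outage
    "statutory_requirement": 0.3,  # regulatory or contractual obligations
}

CRITICALITY_THRESHOLD = 3.0  # weighted scores at or above this are deemed Critical


def criticality_score(ratings: dict) -> float:
    """Combine 1-5 ratings for each criterion into a weighted score."""
    return sum(weight * ratings.get(criterion, 0)
               for criterion, weight in CRITERIA_WEIGHTS.items())


def is_critical(ratings: dict) -> bool:
    return criticality_score(ratings) >= CRITICALITY_THRESHOLD


# Example: ratings collected during the BIA for one hypothetical application.
claims_app = {"monetary_impact": 5, "reputational_loss": 4, "statutory_requirement": 3}
print(round(criticality_score(claims_app), 2), is_critical(claims_app))  # 4.1 True
```

The point of a sketch like this is not the math; it is that the same agreed measures get applied to every application, so the BIA conversation is about ratings rather than about whose application "feels" Critical.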

It is also vital for BC and DR to communicate on a regular basis to make sure that both entities within the organization are on the same page. Things can change in either area of the company, and when they do, there could be an impact on the other. Take, for example, the introduction of Minimum Control Standards within DR. This could mean DR asking BC to pose different questions on its behalf during the BIA process. It could require training individuals within the organization differently on expectations such as taking computer hardware home daily. The point is that changes should always be communicated between these two areas so that, if something happens within the organization, both BC and DR are ready to implement their plans and processes.

Hopefully a disaster never occurs where you are required to recover your data center. In most organizations, this is a last line of defense and one you hope never to have to execute. However, to be ready for any situation, it is imperative to understand the importance of the ties that bind BC and DR together.

Having the Correct DR Personnel Ready

I have been recovering from Covid-19. Not the way I wanted to spend the early weeks of 2021, coming on the heels of a disastrous 2020. However, having Covid did provide some time to think about Disaster Recovery and always having the right personnel available. It is of utmost importance to have the correct DR personnel ready if something happens to your data center.

When we plan DR exercises, we usually know who will be involved. Yes, sometimes we will run an impromptu exercise where people have not been told that they will need to be ready or who will participate. However, unless you work in an environment where that is possible, planning for unknown personnel issues rarely becomes part of the process.

Case in point: my getting Covid. If the organization I worked for had a disaster, would we be prepared for the reality that the DR Manager would not be available to run the recovery? That is a question many organizations need to ask themselves. Is there a backup DR Manager, or someone with enough experience in the DR recovery process to at least help get you through? What about other personnel?

This example goes to show why it is crucial to cross-train your DR resources. From the DR Manager to every technical resource required in a recovery, having a person who can perform the necessary steps to recover your data center is the most important tool at your disposal. If only one DBA is trained to recover the necessary databases in your DR colocation, and only that DBA knows which servers are used, what naming schemes are employed, and what a multi-node implementation requires, your entire recovery hinges on that one person being available.

The same holds true for networking, servers, storage, middleware, and application development, among many other areas. You name it: every part of your infrastructure and application environments is crucial to the overall recovery process. If your organization only has one application developer knowledgeable in setting up a certain critical application in the DR environment, you had better make sure that a backup is adequately trained to handle the work if the primary is not available. This could include such things as configurations and files necessary for the application to be recovered properly.
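One simple way to spot these gaps before an exercise, or a disaster, is to keep a skills matrix and flag any recovery task that only one person is trained to perform. A minimal sketch, using invented example data:

```python
# Hypothetical sketch: flag recovery skills covered by only one trained person.
# The skills matrix below is invented example data, not a recommended roster.

skills_matrix = {
    "Oracle DBA recovery": ["A. Nguyen"],
    "Network/VPN cutover": ["B. Ortiz", "C. Patel"],
    "Storage replication": ["D. Kim"],
    "Claims app configuration": ["E. Shaw", "F. Lee"],
}

# Any skill with fewer than two trained people is a single point of failure.
single_points_of_failure = {
    skill: people for skill, people in skills_matrix.items() if len(people) < 2
}

for skill, people in single_points_of_failure.items():
    print(f"Cross-training gap: '{skill}' relies solely on {people[0]}")
```

Reviewing a list like this after every personnel change is a cheap way to keep the cross-training conversation alive between exercises.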

Planning for a DR is more than just knowing what needs to be recovered. It requires having the necessary resources available no matter what the circumstances are. Cross-training can help alleviate many issues you may run into if resources have Covid, are injured in the disaster, or are not able to get access to communications or networks to perform their tasks. Having the correct DR personnel ready in the event of any disaster is of vital importance in setting up a quality DR plan.

What Does an Exercise Look Like During Covid?

Disaster Recovery exercises are what DR Managers and practitioners live for. It is the chance to show that all the planning that occurs over the course of months or years will allow us to restore our data centers and our applications in the event a true disaster ever occurs. We hold these exercises either monthly, quarterly, semi-annually, or annually to provide the organization with a feeling of security. But what does an exercise look like during Covid?

In the past, many organizations would gather their infrastructure resources together and sequester them to an off-site facility to perform a DR exercise. Back in the day, that probably meant having these resources at the colocation, so they had direct access to tapes, servers, storage, and many other necessary items. Over the past several years, it may have meant getting those same individuals together at a hotel or off-site facility in town where the exercise could be performed.

But what about a time in history when people should not be sequestered with other individuals? Covid ravaging the globe in 2020 has created the need to shift and adapt how we perform our DR exercises. We have turned what was once a strategic group event with the necessary engineers on-site into one where every aspect of the recovery is handled remotely.

What are the impacts on the overall exercise? I know that my preference has always been to have the infrastructure resources sequestered in the same facility and room for ease of communication. When a discussion needed to occur, a group of infrastructure resources, or maybe the entire team, could either talk in the room or retreat to another room in the facility to discuss items of importance. And while online meeting technology continues to improve, it still does not allow for the same quality of communication as those in-person meetings.

Many things are lost when in-person meetings, and therefore a DR exercise, are moved online through these meeting facilitation apps. Interruptions happen far more often than when people are in the room and can see the person talking and the others around them. While you can use meeting facilitation apps that allow you to see all the attendees, you are not able to read all the cues that you would in person. One of the biggest impacts is on the ability to hold those larger group discussions and accomplish what is needed as quickly as would happen in person.

I am in no way saying that you cannot run these discussions through meeting facilitation apps. I am saying that it could and probably will impact how well the group discussion happens. But are there other things to also consider when running a DR exercise completely remote?

What about having the ability to find or get in contact with a certain engineer quickly? If they were at the DR exercise facility, you would have a good understanding of where they were, whether they stepped out for a break of some kind or went on a short walk to clear their head. You have the reasonable expectation that they will return within a set timeframe.

When all those resources become remote, access to each of them becomes more of a challenge. Any person involved in the recovery process can turn off the meeting facilitation software and “step away” for a time. Yes, they probably have a cell phone, but does that guarantee that you, as the DR Manager, will be able to get them to respond? The responsibility shifts to having resources involved who will stay connected during the exercise no matter what.

What about faulty technology or communication issues for one of the resources? While it is much easier to guarantee accessibility and communication while everyone is in the same room, when each of these resources connects from their own network, there is a chance that one or more could be impacted by some sort of communication or technological issue. Planning needs to happen ahead of time to reduce the chance that this occurs during the exercise, or for that matter during a true disaster, and mitigation efforts should be predetermined in case it does. Whether this is some sort of backup technology or additional resources queued up to take over, it needs to be determined prior to any exercise and should become part of the DR Plan.

Many things can impact the successful completion of a DR exercise. And even more can impact a successful recovery during a true DR Event. Making sure that you are ready for any potential event will assist you in being prepared. So, do you know what your DR exercise will look like during Covid, and how you will prepare to handle the issues that arise?

What is the DR Manager’s Job During an Exercise?

The DR Manager wears many hats throughout the year. We are asked to put together Disaster Recovery (DR) Plans, Technical Recovery Plans (TRPs), Data Center Recovery Plans (DRPs), Business Impact Analysis (BIA) documents, as well as countless other documents. We need to know and have a good understanding of the technology employed in our data centers, the networking and connectivity bringing them together, the number of Critical Applications in the landscape, the various infrastructure both in the primary and DR data centers, as well as the capacity in the DR colocation. There are also a large number of other items that we are required to know, fill out, report on, and perform to ensure that we have put our organizations in a position to succeed in the event of a disaster. But do we all know what the DR Manager’s job is during an exercise?

The DR exercise is the culmination of months’ worth of planning, depending on whether you perform quarterly, semi-annual, or annual exercises. All the hard work has been done, or at least should have been done, to get you to this point. But, since every DR Manager and every exercise is different, are there certain things that we should all do to ensure not only a successful test, but also team members who want to assist with the exercise year after year?

The day of the exercise comes, and you are sequestered at a remote venue getting ready to call the exercise to a start. There are many things that should have been taken care of in the days and hours before you announce that the annual DR exercise is beginning. You have verified that the internet capacity and speed are sufficient for the number of people participating from this remote location. You have brought in a couple of monitors per person, a docking station for each of the various laptop models your company employs, and a keyboard and mouse for each person. Do not forget those power cables, extension cords, and any other pieces of technology you may need. Did you bring a router so that everyone can access a more secure environment?

Some of the other things that I always like to make sure I have accounted for are food and drink. Not the items that you will order for lunch and dinner, but rather snacks that people can reach for as they hit their Snickers moment. I have always reached out to every person who will be in our remote site to ensure that I have covered each of their wants and requests, whether that be diet or regular sodas, with and without caffeine, sugared items, salty items, even keto-friendly items. If you make sure that you are taking care of the resources you are counting on to make YOUR annual exercise a success, you will have individuals who are always willing to go the extra mile.

So, you start the exercise. You tell everyone that as of a certain time, the DR exercise has commenced. Now what do you do? The most important thing that a DR Manager can do during the exercise is remain calm.

More than likely, several issues will come up during the exercise. We have had exercises that should have been completed within 24 hours stretch to several days. The last thing that the people busily working on fixing issues need is a DR Manager who seems to panic or become concerned. You are their rock. You need to make sure that you do not get emotional and raise your voice. You should NEVER yell at anyone performing restorations, recoveries, builds, validations, etc. They need you to be the calming presence that helps get them through. Show that you have faith in them. They know their jobs. They do not need a DR Manager who thinks she/he knows more than the engineers performing the work they do every day. You are all there to learn. Let mistakes occur, but never, ever show that you think they are not doing their best to accomplish the goal.

There are several other functions that the DR Manager must accomplish during the exercise. Reporting on a regular schedule, ordering food, keeping track of what has been completed and who is responsible for the next steps, as well as opening lines of communication and bringing people together to mitigate issues are some of them. But, as I stressed in the previous paragraphs, the most important thing you will ever do during an exercise is to remain calm and have faith in the people entrusted to bring everything back and make you successful.

Remember that. You will never look good if they fail. And it is even worse if you are part of the reason that they failed. Or worse yet, if they intentionally failed because of the way you treated them. Show them the respect they deserve. They define your success, and that, in the end, is the DR Manager’s job during an exercise.

Are We Ready? A Look To The (Not So Distant?) Future

2020. What an odd year. We have seen some interesting things throughout time, but 2020 has brought something new to many of us. And it leads us to the often-asked question, Are We Ready?

As we are all acutely aware, 2020 has brought new problems to the table, probably ones that we never expected. Covid-19 has changed the way we look at our world. But should it also change the way we look at Disaster Recovery?

If you are like me, you attempt to plan for every possible disaster that could impact your data centers. Whether it be flooding, fire, tornadoes, hurricanes, or power outages, or something more extreme like a potential nuclear war or intentional internal destruction or sabotage, we all need to have plans in place for restoring our data centers to what they were prior to the disaster. But are we looking at everything?

With Covid-19, I have done more thinking about what could impact us in the future. Along with that, given the current social unrest as well as a heated political campaign for the presidency in the United States, are there other things that could impact our data centers in the coming years, or even months?

Take, for example, if the current president loses in his attempt to retain the presidency. Do we automatically assume that everything will remain as it is? Or does the potential exist for a civil conflict that turns our world upside down? If a pandemic can cause this much damage to everything we are accustomed to, imagine what a civil conflict might bring.

I am not trying to say that any of this actually has a decent chance of occurring. What I am saying is that we, as Disaster Recovery Managers and practitioners, must do everything in our power to make sure that we have thought of every type of disaster that could happen and that we have planned for each of those. If the chance is even extremely remote that any of these disasters could occur, don’t we owe it to our organizations to at least consider those possibilities and plan for them?

What could happen during a time of civil unrest in the United States? It is impossible to tell. However, what is clear is that opposing groups could very well attempt to take over infrastructure within the US so they could control certain aspects of the country. Do they take over or cut power to control certain infrastructure or cities? Do they take over buildings that are large and have various protections built into them?

It is nearly impossible to predict the future, though we do get glimpses of what may be coming as the world around us changes. Wildfires have been burning out of control and record temperatures continue to be set across the southwestern US. Again, I am not saying that this will necessarily continue. What I am trying to point out is that we cannot tell whether this could impact our data centers or recovery plans. While climatologists and other scientists believe it will continue to get worse, have we taken into account every possible outcome associated with these increases in temperature and fires burning out of control?

We may not think that any of these issues could impact our data centers. I would recommend that you think twice about it. Start thinking outside the box and put together a list of issues that could arise from any of these disasters possibly occurring. We owe it to our organizations to be prepared for any disaster. The ones that may take out our data centers and end our organizations may not be the ones that we have had to consider in the past. They may be something that we never would have considered impacting them. So, I ask you again, are we ready?

DR Testing Methodology

When we consider the importance of Disaster Recovery (DR) within an organization, the only true performance measure is a recovery after an actual disaster event in our data center. So, how do you tell if you are ready for that type of event? Does testing monthly, quarterly, or annually provide any real value in defining our readiness? Today I will discuss DR Testing Methodology as it applies to preparedness and providing the business with enough information to instill faith and trust in your ability to recover in the case of a true disaster.

So, what is the best interval to perform DR testing? Is it monthly, and only for modified applications, whereby you can validate that any changes made to the application or environment are accounted for in your DR environment? Is it quarterly, which allows a potentially greater number of applications to play into the exercise and gives you a better picture of interdependencies between applications and infrastructure? Or is it annually, whereby you can validate the overall recovery picture so that the business knows how well you perform against requested RTOs and RPOs?

Monthly tests are generally seen by the business as the most robust of the testing paradigms. However, what is difficult to quantify for the business is not only the amount of time that is spent performing each of these tests, but the relative lack of value that is attained through this method of testing. Depending on the amount of infrastructure needed to facilitate the recovery of an application, the ROI on one of these tests may be extremely low and take away from other projects the business has need for.

I am more a proponent of quarterly validations than monthly ones because they keep resources from being “wasted” on performing multiple exercises. You could easily choose a quarter of your Critical applications and perform a recovery with testing for each of those apps. However, again, this does not provide an overall validation of the team’s ability to recover all Critical applications within the defined RTOs for those applications. That can only be accomplished when ALL of the Critical applications are recovered during one full exercise.
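If you do run quarterly validations alongside the annual exercise, even a simple rotation schedule ensures each Critical application gets touched once a year. A minimal sketch, with an invented application list:

```python
# Hypothetical sketch: rotate Critical applications through quarterly validations
# so each app is exercised once per year. The application list is illustrative.

critical_apps = [
    "claims-system", "policy-admin", "billing", "customer-portal",
    "document-mgmt", "data-warehouse", "payroll", "agent-portal",
]

# Assign roughly a quarter of the Critical apps to each quarter of the year.
quarterly_schedule = {q: critical_apps[q - 1::4] for q in (1, 2, 3, 4)}

for quarter, apps in quarterly_schedule.items():
    print(f"Q{quarter}: {', '.join(apps)}")
# Q1: claims-system, document-mgmt
# Q2: policy-admin, data-warehouse
# ...and so on for Q3 and Q4.
```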

This is not to say that monthly and quarterly tests and validations cannot be done in concert with an annual exercise. The point is that, without a full-blown exercise to validate that RTOs can be met in the event of a true disaster, the other, smaller exercises are not really showing anything other than the importance a Technical Recovery Plan (TRP) plays in the overall DR process.

So, now that we know that it is important to have an annual exercise, and maybe either a couple of additional monthly or quarterly tests, what is important to test? Depending on your environment, this answer could range from end-to-end testing down to validating that an application is up and running and the data within the corresponding databases is additionally recovered.

When I first started managing Disaster Recovery, I tried to determine which form of testing made the most sense. We had a primary data center that was backed up/replicated to our DR colocation several hundred miles away. Because some of the applications ran on legacy platforms and the guts of the application had not changed that much over the years, there was a concern around changing IP addresses on the recovered servers and the applications’ abilities to function properly. Therefore, a case was made to create a “bubble” environment for restoring the applications and providing testing. This all but ensures that end-to-end testing cannot be accomplished, mainly because you are required to have this “bubble” closed off to incoming and outgoing internet traffic.

I determined that our best practice was to validate the application could come up, the users could log in, data that was expected from the Disaster Declaration Date was available within the bubble, and some small validations could be completed. Functional testing was not to be considered a part of the exercise because we could not guarantee that the application did not need some sort of external integration or connectivity.

This is not the case if you have a hot/hot or failover environment. These types of environments allow for more robust testing and could include end-to-end testing. However, is that really the purpose of a DR exercise? In today’s IT environment, so much is done with middleware servers that provide access to these third-party integrations that the need to validate everything really becomes overkill. If you were to have a disaster event and you were able to recover your middleware servers, you should have a level of assurance from the middleware team that restoring that server in a new data center, i.e. your colocation, would not require a complete validation of all of the external integrations annually during your exercise. The processes that run on these servers should work the same regardless of the data center they are running in. The code has not changed, and if you have connectivity to the internet, they should still have all the URLs to connect to and function as they do in your production environment.

So, you have done your homework on what is important, and now it is time to let the business know what will be tested. The business almost always wants complete, end-to-end functionality testing to validate that their applications will function as they currently do in the production environment. However, we already know, because infrastructure and coding changes flow into updates to the TRPs, that the servers can be built up and that any issues that arise with connectivity, communication, etc. can be easily mitigated, whether during an exercise or in the event of a true disaster. So, why go through complex test plans to validate what we know, as DR Managers, will work if we were to bring everything back in the event of a true disaster?

This all takes me back to my article from last week, Technical Recovery Plan (TRP) Best Practices, when I mentioned the importance of architectural diagrams in the TRP process. If your organization has proper documentation to show each and every dependency and integration each application has in the overall environment, then the additional testing becomes superfluous. Bring up the infrastructure, make sure your databases, application and web servers are all restored and brought up in the right order, and then make sure your testers can log into the application and validate some data. That is what is needed to provide evidence that your applications are recoverable in the event of a true disaster.

The hard part on this is the sell. You probably will not get buy-in from the business on this. However, if you can get backing from IT, and ultimately the CIO, that is your starting point. From there it is important to talk to the individuals within the organization who can understand the importance of validating that applications and infrastructure are recoverable without performing end-to-end testing. This could be individuals such as the COO or the CEO. They are the ones who should be determining what level of validation needs to be accomplished. They should be interested in the why and how it is completed, and what level of security it will provide them. The biggest piece of restoring a data center to your colocation after a disaster is the infrastructure. Showing that the DR Team can restore the infrastructure, start up the applications, provide the ability to log into those applications, and have the data available accounts for the largest part of the equation. Beyond that, the value of end-to-end testing is a negligible addition to the overall process.

Keep in mind that I am not advocating that you throw out functional testing if you currently incorporate it. Especially if you have a hot/hot or failover environment between your primary data center and your colocation. What I am saying is that the added value of functional or end-to-end testing in your DR Testing Methodology should be considered minimal to the overall value of restoring infrastructure and bringing back applications and data for the business to validate.

Technical Recovery Plan (TRP) Best Practices

Technical Recovery Plans (TRPs) are vital to any Disaster Recovery plan. These documents are the foundation for recovering your data center in the event of a true disaster or for testing out your plan and processes. But what should be included in the TRP document? Here we will discuss some of the Technical Recovery Plan Best Practices.

The importance of a TRP cannot be disputed. However, what goes into a TRP can be widely different based on the organization, DR Manager, recovery philosophy, or even the environment that your data center and colocation provide. This article will discuss some of the areas of concern that a DR Manager should consider when developing TRPs for their organizations, and some things that would be considered Best Practice to include.

First off, every application in your landscape should have a TRP. This includes not only Critical applications (those needing to be back because they are vital to the continuation of the business), but also all of the other applications that would need to be recovered at some point in time. While this could be a major undertaking, as organizations may have hundreds of applications running in their data centers, it is a necessary practice to ensure recoverability of your data center.

So, now that we understand why we need a TRP for every application in our data center or technical environment, it is important to understand how we should go about creating and documenting how each of these applications would be recovered in the event of a disaster, or used during a test. Several areas will be listed below; some may be important to you as the DR Manager, while others you may feel either exist elsewhere or can easily be determined when the information is needed. However, be aware that having all necessary information in one place can greatly reduce recovery time for your applications, as you would not have to do the legwork to find information at time of disaster or test.

One of the most important pieces of information deals with the application or system information. This includes such items as the name (and any aliases) of the application, a description of its functionality, who owns the application from both the business and the IT side, information about how the application is accessed (such as a URL to the app), whether it is vendor hosted, whether it is vendor supported, and all vendor contact information if either of the previous questions is answered “yes”. This information allows for easy communication with the teams using and supporting the application, as well as access to vendors if they are in any way involved.
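As an illustration of how that section might be captured in a structured form, here is a minimal sketch; the field names and sample values are assumptions, not a prescribed template:

```python
# Hypothetical sketch of the application/system information section of a TRP,
# captured as a structured record. Field names and values are illustrative.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TRPApplicationInfo:
    name: str
    aliases: List[str] = field(default_factory=list)
    description: str = ""
    business_owner: str = ""
    it_owner: str = ""
    access_url: Optional[str] = None
    vendor_hosted: bool = False
    vendor_supported: bool = False
    vendor_contacts: List[str] = field(default_factory=list)  # only if vendor hosted/supported


# Example record for an invented application.
claims = TRPApplicationInfo(
    name="Claims System",
    aliases=["CLM"],
    description="Claim intake and adjudication",
    business_owner="Claims Operations",
    it_owner="App Dev - Claims Team",
    access_url="https://claims.example.internal",
    vendor_supported=True,
    vendor_contacts=["vendor-support@example.com"],
)
```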

Another important section deals with Infrastructure and Application Development resources supporting each application. This should have various methods of access to each resource, with multiple phone numbers in the event of a disaster and email addresses at a minimum. This should include at least the primary and backup for each of the technologies employed for the application from an infrastructure standpoint as well as the primary and secondary contact from the application development team. This ensures the DR Manager can get to a resource when someone is needed to assist with restoration of the application.

Another valuable area deals with how the application is restored. This section describes how an application is going to be recovered: failover, replication, or backups. It should include the steps to be performed for whichever of these recovery types is used for the application. Along with this information would be the person or team responsible for each step as well as an estimate of time-to-complete for each step. Additionally, each step should note any prerequisites. So, if step 5 requires that step 3 be completed first, that should be noted in the document. This area should also describe how each of the recovery processes is validated and by whom. Finally, it is important to discuss how everything would be recovered back at the primary data center after it is made usable again, if that is what the organization requires.
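Because the steps carry prerequisites, they effectively form a dependency graph, and the recovery sequence can be checked automatically. A minimal sketch, with invented steps, owners, and estimates:

```python
# Hypothetical sketch: recovery steps with owners, time estimates, and
# prerequisites, ordered so no step starts before the steps it depends on.
# Step names, owners, and estimates are illustrative only.

from graphlib import TopologicalSorter  # standard library, Python 3.9+

steps = {
    1: {"task": "Restore database server", "owner": "DBA team", "est_hours": 4, "requires": []},
    2: {"task": "Restore application server", "owner": "Server team", "est_hours": 3, "requires": []},
    3: {"task": "Restore database from backup", "owner": "DBA team", "est_hours": 6, "requires": [1]},
    4: {"task": "Deploy application code/config", "owner": "App Dev", "est_hours": 2, "requires": [2]},
    5: {"task": "Start application services", "owner": "App Dev", "est_hours": 1, "requires": [3, 4]},
}

# Produce an order in which every prerequisite comes before the step that needs it.
order = TopologicalSorter({n: s["requires"] for n, s in steps.items()}).static_order()
for n in order:
    s = steps[n]
    print(f"Step {n}: {s['task']} ({s['owner']}, ~{s['est_hours']}h)")
```

Keeping the prerequisites as data like this also makes it easy to spot circular dependencies before they surprise you during an exercise.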

Another section that you would want to document is information about the servers, as well as backups or replication for each piece of infrastructure the application requires. This should list such items as the server name, server type (App, DB, Web, etc.), whether the server is virtual or not, the OS on the server and any release levels for that OS that have been applied, the corresponding server name at the recovery site (if it is different than the prod server name), whether the server is load-balanced, and if so, which servers it is load-balanced with, and the recovery order or sequence so that your servers can be restored with minimal need for reboots.
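Captured as data, a single server entry might look like the sketch below (names and values are invented), which also makes it easy to sort servers by recovery order:

```python
# Hypothetical sketch of one server entry in a TRP; field names are illustrative.

server_entry = {
    "server_name": "PRD-CLM-DB01",
    "server_type": "DB",              # App, DB, Web, etc.
    "virtual": True,
    "os": "RHEL 8.6",                 # OS plus applied release level
    "dr_server_name": "DR-CLM-DB01",  # name at the recovery site, if different
    "load_balanced": False,
    "load_balanced_with": [],
    "recovery_order": 1,              # lower numbers are restored first
}
```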

Another vital part of this document would be the interfaces and dependencies that each application has on other elements in the IT landscape. All integrations should be listed so that during recovery of each application, the recovery and validation of each of those integrations deemed Critical can also occur prior to turning an application over for validation and testing, and therefore prior to its use in the recovered environment. Dependencies obviously need to be restored prior to the application being functional, so listing each of these is extremely important in the overall recovery process. Dependencies should also mention if they are downstream, upstream, or bi-directional so that all elements can be validated.

One other piece of information that can greatly assist in the recovery of applications is an architectural diagram. This is a document that should outline each integration in a visual format, along with which servers are part of each integration. They say a picture is worth a thousand words; in the event of a disaster, it is sometimes easier and less time-consuming to view an architectural diagram of the application, its components, dependencies, and integrations, and come away with a better general understanding of the overall recovery plan for that application, than to read the entire TRP. I am not saying that you should bypass reading the document; however, getting a view of the necessary pieces to begin the recovery can reduce overall recovery time.

The last piece of any document that you put into your DR environment should be a revision history. This is vital to ensure that the document is being revised and validated on a regular basis, and it also assigns ownership to the changes that are made. I would recommend more oversight than just this section, and many document versioning tools allow multiple versions to be saved and provide the ability to see what has changed. One other practice to consider is to always receive sign-off when those updates are made. This will ensure that middle-level management is onboard with the changes being made to the document and is aware of any changes to the environment. They should ultimately be responsible for the applications owned in their areas.

So, there you have it: some Best Practices for a Technical Recovery Plan. This is certainly not the end-all for TRPs. There are other inclusions that could be made depending on your environment or needs. I am going to attempt to include some templates of Technical Recovery Plans in the DR Document Templates section, and I would love to have some of you send me a clean copy of your TRP template so that others can use pieces from several templates to create their own, or quickly get started in developing their DR Plan and processes without having to reinvent the proverbial wheel when it comes to TRPs. Please send those to dan@disasterrecoveryblog.com.

The Importance of Discussion and Differing Points of View

Well, the time has come. I have written nine posts on the Blog and am looking to expand it in additional directions. So far, you have all read my thoughts and opinions on topics that I feel are important for any DR professional. I have many more to write about, trust me. Disaster Recovery is ever-changing and encompasses many different approaches to accomplishing what you are here to do: protect your data center and its applications. However, it is time to promote discussion and differing points of view.

I will continue to post articles about once per month. I hope that will bring you back to the site to learn about the new topics I write about. Some may interest you, while others may not. The purpose is to write about what you want to know. However, there are other ways to get additional information across: guest authors and open discussions.

As I mentioned early on, this is something that is new to me. I am not a professional blogger. What I am is a long-time IT professional with a ton of information bottled up inside my head that I would love to provide to others to help them on their professional journey in Disaster Recovery. My background in support, application development, and infrastructure helped me succeed in Disaster Recovery. Not all DR professionals have a vast background in other areas of IT, and I am not saying that you cannot be successful at Disaster Recovery without one. But everyone has a story and differing knowledge that could help others in the industry succeed.

Therefore, I am asking you to visit the site and write some comments to the blogs that I have already written, providing both feedback and critical questions to start discussion on the topics. Go to the Forum section and start a conversation. If we can get daily questions on the site, and enough DR Professionals interested in reading and responding to those questions, everyone wins. You then know that your questions about a certain process or policy can be answered by other DR professionals and a quality discussion can take place.

When I began in Disaster Recovery, I searched the internet for something like this. I could not find it anywhere. As I mentioned in my post Why a DR Blog, the information that exists out there for DR Professionals is mainly written with Business Continuity in mind. Yes, the two processes go hand in hand, however, many DR Managers are looking for information that will help answer THEIR immediate questions.

My ask here is that each of you helps make this blog something that helps every visitor to the site. Write an article that you would like me to post and you will get credit in the byline. Start a discussion in the Forum. People coming to the site can answer those questions so that everyone can do a better job ensuring that your organization is ready, just in case something does happen, and an actual Disaster is called in your data center. Let us all share our knowledge and make it available to others who share a common goal, ensuring the recoverability of our data centers in the event of a Disaster.

Disaster Recovery Minimum Control Standards

It seems like everything within Information Technology (IT) requires some sort of Minimum Control Standards (MCS) to operate. Some organizations call them Minimum Control Requirements or maybe even Minimum Internal Control Standards. The point is to ask whether MCS should apply to Disaster Recovery, and if so, at what level?

For decades, organizations have tasked the DR Manager with putting together a plan and processes around restoring the data center in the event of a disaster. Most DR Managers have developed these plans and processes through the employment of common sense and previous experience performing recoveries or tests. To this day, little information exists on the internet around best practices for DR, valuable documentation on DR Plans, Technical Recovery Plans (TRPs) and the like. So, the question exists, is it beneficial to have a Minimum Control Standard document created within your organization so that you, as the DR Manager, meet some minimum standards?

So, where do MCS documents come from? There are a couple of places that I am aware of: from Corporate Risk Management, and from past practice within the DR Team. There are probably others, but these are the two that I will discuss today.

  • MCS requirements from Risk Management
    • Many organizations’ Risk Management departments believe they understand what should be done in the event of a disaster. Maybe your data center has failover, hot-hot, hot-warm, or some other type of replication and recovery process in place. When Risk Management makes determinations about MCS, oftentimes you end up with controls that may or may not be easily achievable.
    • The idea behind MCS is to have some standards in place that are minimums that must be achieved in either your plan, processes, testing, or other area of DR, to ensure that you are prepared in the event of a disaster. However, having a department within the organization (which may not understand Disaster Recovery) create these standards for you does not necessarily benefit the organization or the overall DR plan. Only if Risk Management is willing to work directly with the DR Manager does a valuable set of standards get created.
  • MCS Requirements from DR Team
    • The most valuable set of standards around Disaster Recovery will come directly from the DR Team. This is because they have the best understanding of the what, when, where, why, and how for restoring the data center, creating the plan and processes, performing testing, engaging the business and test plans, as well as many other topics. Therefore, I would always recommend that the initial framework for an MCS document come from the DR Manager and his/her team.
    • This document should encompass easily attained standards. These are minimums, not the best-case scenario the team would build if they had unlimited resources and could design their plan and processes around the ideal end-state. Any organization’s DR Team can have certain criteria that they would like to meet over the next month, year, 3 years, or 5 years. However, those are not Minimum Control Standards, and upper-level management needs to understand not only the difference, but what should go into developing those controls.

Now, having two different places where MCS documents can come from does not mean every organization should go out and create MCS documents just to have them. As I mentioned earlier, probably nobody in the organization knows more about Disaster Recovery than the DR Manager and the DR Team. Therefore, a discussion should occur to determine whether an MCS document actually needs to be created for your organization. An argument can be made for and against them. It may be too early in the life of DR within your organization to automatically jump to including MCS in your process.

So, think long and hard about MCS for DR. Discuss it with upper-level management, or take a stab at writing your own controls. From there, making sure that everyone is onboard with what you are trying to accomplish becomes vitally important. Nobody wants a rogue DR Manager. However, showing the initiative and coming up with some of these ideas and presenting them to management will not only look good for you, but will also help in bridging any gaps that exist between the DR Team and the rest of the organization.

Tiering Applications – How to decide what is important

One of the most important parts of a good DR (Disaster Recovery) plan is to understand the importance each application has within your organization. Only after you understand that can you begin to put each of the applications into Tier levels.

Tier levels are the various timeframes that a DR Manager creates to determine when certain applications should be recovered in the event of a disaster. Applications will have RTOs (Recovery Time Objectives) and RPOs (Recovery Point Objectives) determined by the business for every application in the landscape. However, to make all the applications recoverable within these RTOs and RPOs, the DR Manager needs to define attainable timeframes to accomplish this.

The first thing to make sure you understand is the definition of Critical within your organization. Critical applications are the ones the business needs back within a certain timeframe because the organization cannot operate without them for long. The organization may not fail the moment that timeframe passes, as there may be some buffer built in, but the RTO is the guideline for when the business expects to have that application back.

So, with the information determined around Critical applications and a timeframe defined by upper-level management, the DR Manager must then determine Tier levels for the applications that are deemed Critical. Let’s consider an organization that determines Critical applications need to be back within a 5-day timeframe. That would mean that any application deemed Critical would have an RTO that falls within that 5-day window. If it does not fall within the window, it should not be considered Critical.

So, we have a 5-day window in which to restore all the Critical applications. Where do we go from here? Should we just say that we will have everything back within those five days? No. The better way to do this is to break this 5-day window into shorter timeframes, or Tiers. The first Tier would be for the most critical applications within your environment. This could be a customer-facing application that is required for placing orders, or it could be a claim system that allows claim adjusters to enter claim information for customers. It could be any type of application that is important to driving the business.

Either way, you have your most critical of applications in the first Tier.  What should the timeframe be on restoring the most critical applications in the organization? Generally, people would say that this should be hours. Well, depending on the type of environment you have, and if you utilize a colocation that does not have failover capabilities, then this timeframe could be anywhere from a couple hours to 24 or even 48 hours depending on your situation.

The main thing here is to make sure that only the most critical of applications within your environment are included in the first Tier level of Critical applications. I wrote in a previous article (The Argument for a DR Guarantee) how the DR Manager will want to look at what the business has done to allow for these applications to be considered Critical.  Are they up to date with regards to OS, database size, or even virtualization of the servers? If not, then you should work to make sure that these applications are not considered Tier 1 applications, or possibly even considered Critical.

Now, let’s say you determine that you really only want 3 Tier levels: Tier 1 (Extremely Critical); Tier 2 (Remainder of Critical); Tier 3 (Remainder of applications). Using the 5-day Critical window, that would mean that everything Extremely Critical would be recovered within the 24-hour timeframe determined by you and the business. Tier 2 would be anything needing to be recovered within 5 days. And the remainder of the applications would come back sometime after that, maybe 20 days. The problem with this is that people will say that they cannot wait 5 days for their application, and during the BIA (Business Impact Analysis) phase of Business Continuity, they will say that their application is Extremely Critical and has to be recovered within 24 hours.

While it is all right to have the majority of your Critical applications recovered within the first Tier’s timeframe, it is not a good solution to allow the business to move things up just because the other Tier’s timeframe does not work for them. My recommendation, and the way that I defined Tier levels, was to base them on timeframes that made sense.

If I had 5 days to restore Critical applications in a colocation that did not have failover capabilities, I would define Tier 1 to be 24 hours. Tier 2 would be either 48 or 72 hours, depending on how many applications the business would try to “move” to Tier 1 if they felt that 72 hours was too long but would be fine having them restored within 48 hours. If that were the case and I needed Tier 2 to be 48 hours, I might have a Tier 3 at 72 hours, and my final Critical Tier at 5 days.
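Expressed as data, the Tier assignment becomes a simple lookup against those boundaries. A minimal sketch using the 24/48/72-hour and 5-day windows described above, with invented RTO values:

```python
# Hypothetical sketch: assign Critical applications to Tiers from their RTOs.
# Tier boundaries follow the 24h/48h/72h/5-day example above; the application
# RTO values are invented example data.

TIER_BOUNDARIES_HOURS = [(1, 24), (2, 48), (3, 72), (4, 120)]  # (Tier, max RTO in hours)


def assign_tier(rto_hours):
    """Return the Tier whose window the RTO falls within, or None if not Critical."""
    for tier, max_hours in TIER_BOUNDARIES_HOURS:
        if rto_hours <= max_hours:
            return tier
    return None  # RTO falls outside the 5-day Critical window


app_rtos = {"customer-portal": 12, "claims-system": 24, "billing": 60, "data-warehouse": 168}
for app, rto in app_rtos.items():
    print(f"{app} (RTO {rto}h) -> Tier {assign_tier(rto)}")
# customer-portal and claims-system land in Tier 1, billing in Tier 3,
# and data-warehouse falls outside the Critical window (Tier None).
```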

As you can see, it is important for you, as the DR Manager, to take all of this into consideration when defining Tier levels. Obviously, everyone within the business wants every one of their applications recovered as soon as possible. However, it is your job to ensure not only that you are keeping the business happy by aligning their applications to their importance within the organization, but also that you have Tier levels defined that make sense to the business and are attainable by IT in the event of a true disaster.