Sample Disaster Recovery Plan

Use this sample Disaster Recovery Plan for a fictional company, MedCoTech.

Kevin Sesock

3/23/2021, 13 min read

1. INTRODUCTION

1.1 PURPOSE

This MedCoTech Disaster Recovery Plan establishes procedures to recover MedCoTech’s Information Technology Systems following a disruption. The MedCoTech Disaster Recovery Plan is based on the NIST SP 800-34 Rev. 1 Contingency Plan template, modified for a private organization. [1] The following objectives have been established for this plan:

Maximize the effectiveness of contingency operations through an established plan that consists of the following phases:

  • Notification/Activation phase to detect and assess damage and to activate the plan

  • Recovery phase to restore temporary IT operations and recover damage done to the original system

  • Reconstitution phase to restore IT system processing capabilities to normal operations.

Identify the activities, resources, and procedures needed to carry out MedCoTech’s data processing requirements during prolonged interruptions to normal operations.

Assign responsibilities to designated MedCoTech personnel and provide guidance for recovering MedCoTech’s Information Technology Systems during prolonged periods of interruption to normal operations.

Ensure coordination with staff from partners, vendors, and other external points of contact who will participate in the contingency planning strategies.

1.2 APPLICABILITY

The MedCoTech Disaster Recovery Plan applies to the functions, operations, and resources necessary to restore and resume MedCoTech’s Line of Business operations as they are installed at the MidCon data center in Edmond, OK [2]. The MedCoTech Disaster Recovery Plan applies to MedCoTech and all other persons associated with MedCoTech as identified under Section 2.3, Responsibilities.

1.3 SCOPE

1.3.1 Planning Principles

Various scenarios were considered to form a basis for the plan, and multiple assumptions were made. The applicability of the plan is predicated on two key principles.

MedCoTech’s facility at the MidCon data center in Edmond, OK, is inaccessible; therefore, MedCoTech is unable to perform data processing for the organization.

A valid contract exists with a warm site (see Section 2.1) that designates the Rack 59 data center in Oklahoma City, OK, as MedCoTech’s alternate operating facility. [3]

  • MedCoTech will use the Rack 59 data center and workstation cubicle floor to recover functionality during an emergency situation that prevents access to the original facility.

  • The designated computer system at the alternate site has been configured to begin processing MedCoTech’s primary systems within 72 hours.

  • The alternate site will be used to continue MedCoTech’s recovery and processing throughout the period of disruption, until the return to normal operations.

1.3.2 Assumptions

Based on these principles, the following assumptions were used when developing the MedCoTech Disaster Recovery Plan:

MedCoTech’s core IT systems are inoperable at MedCoTech’s primary computer center and cannot be recovered within 48 hours.

Key MedCoTech IT personnel have been identified and trained in their emergency response and recovery roles; they are available to activate the MedCoTech Disaster Recovery Plan.

Preventive controls (e.g., generators, environmental controls, waterproof tarps, sprinkler systems, fire extinguishers, and fire department assistance) are fully operational at the time of the disaster.

Certain MedCoTech systems already exist in geographically diverse locations, as cloud services or in colocated facilities. These include the following:

  • Office 365 for email, departmental SharePoint intranets, and other communications tools.

  • Phone system head-end, capable of redirecting phones with support from MedCoTech’s VoIP phone service provider.

  • Employee payroll, HR management, etc., through Paycom.

  • Credit card processing services, through Heartland Payment Systems.

Computer center equipment, including components supporting the MedCoTech primary VMware cluster, is connected to an uninterruptible power supply (UPS) that provides 45 minutes to 1 hour of electricity during a power failure.

MedCoTech’s data center hardware and/or software at the MidCon data center are unavailable for at least 48 hours for any reason (natural disaster, connectivity problems, fire, flood, etc.).

Current backups of the application software and data are intact and available at the offsite storage facility or via cloud backup facilities.

The equipment, connections, and capabilities required to operate MedCoTech’s primary systems are available at MedCoTech’s racks at Rack 59 in Oklahoma City, OK.

Service agreements are maintained with Dell (hardware), the ERP software vendor (support), Cox Communications, and other relevant vendors to support the emergency recovery.

The MedCoTech Disaster Recovery Plan does not apply to the following situations:

Overall recovery and continuity of business operations. The Business Resumption Plan (BRP) and Continuity of Operations Plan (COOP) are appended to the plan.

Emergency evacuation of personnel. The Occupant Evacuation Plan (OEP) is appended to the plan.

Emergency Communication Plan. The Emergency Communication Plan is appended to the plan.

1.4 REFERENCES/REQUIREMENTS

This MedCoTech Disaster Recovery Plan complies with MedCoTech’s strategic plan as follows:

The organization shall develop a contingency planning capability to meet the needs of critical supporting operations in the event of a disruption extending beyond 72 hours. The procedures for execution of such a capability shall be documented in a formal contingency plan and shall be reviewed at least annually and updated as necessary. Personnel responsible for target systems shall be trained to execute contingency procedures. The plan, recovery capabilities, and personnel shall be tested to identify weaknesses of the capability at least annually.

The MedCoTech Disaster Recovery Plan also complies with federal, state, and local laws, regulations, and standards, as well as corporate and departmental policies, including (but not limited to):

The Computer Security Act of 1987

PCI-DSS

Health Insurance Portability and Accountability Act (45 C.F.R. Parts 160, 162, and 164)

MedCoTech Acceptable Use Policy and Network Security Policy

1.5 RECORD OF CHANGES

Modifications made to this plan since the last printing are as follows:

2. CONCEPT OF OPERATIONS

2.1 SYSTEM DESCRIPTION AND ARCHITECTURE

MedCoTech retains two large VMware server clusters at its primary data processing facility at the MidCon building in Edmond, OK, 6 miles from its primary office complex. MidCon provides a physical location with power, computer room air conditioning (CRAC), Internet egress, and physical security, as well as a secure cage housing MedCoTech’s two physical racks. Infrastructure located in these racks is the property of MedCoTech. Primary network routing is performed at the office complex and connected to the data processing facility at MidCon over Cox Communications Metro-E (for secure and high-speed routing) through the building’s primary and secondary Cox business-class fiber ingresses.

The first cluster is dedicated to non-production and non-critical systems, and the other is dedicated to MedCoTech’s primary production systems and data. In addition to Active Directory servers and a few other central IT administration systems, this primary VMware cluster contains 24 physical hosts and redundant SANs across a highly available Fibre Channel network, all supporting over 100 virtual hosts dedicated to MedCoTech’s primary line of business application servers, database servers, and support servers, such as file and print servers. These virtual hosts also include MedCoTech’s application and database servers, which provide core business processes, primarily supported in MedCoTech’s ERP and surrounding systems, covering such business processes as general ledger, supply chain management, procure to pay, order to cash, marketing, sales and CRM, research and development, product quality assurance, and others. This primary VMware cluster has a maximum capacity of 32 TB of data, 27 TB of which is currently allocated, and of which 6 TB is considered immediately mission critical for continuity of operations. This mission-critical data primarily revolves around the following:

ERP System data, to include:

  • General Ledger

  • Procure to Pay

  • Order to Cash

  • SCM

  • R&D

MedCoTech’s CRM system and other sales databases, including the online B2B sales ordering and fulfillment system.
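The capacity figures in this section can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the plan itself: the constant and function names are illustrative, and the 8 TB figure is the alternate-site SAN size stated later in Section 2.1.

```python
# Figures quoted in Section 2.1; all names here are illustrative only.
PRIMARY_CAPACITY_TB = 32     # maximum capacity of the primary cluster
PRIMARY_ALLOCATED_TB = 27    # currently allocated
MISSION_CRITICAL_TB = 6      # immediately mission-critical data
ALTERNATE_SAN_TB = 8         # limited SAN at the alternate site


def utilization(allocated_tb: float, capacity_tb: float) -> float:
    """Fraction of capacity that is currently allocated."""
    return allocated_tb / capacity_tb


def fits(data_tb: float, target_tb: float) -> bool:
    """True if a data set of data_tb fits on storage of target_tb."""
    return data_tb <= target_tb


if __name__ == "__main__":
    print(f"Primary utilization: {utilization(PRIMARY_ALLOCATED_TB, PRIMARY_CAPACITY_TB):.1%}")
    print("Mission-critical data fits at alternate site:",
          fits(MISSION_CRITICAL_TB, ALTERNATE_SAN_TB))
    print("All allocated data fits at alternate site:",
          fits(PRIMARY_ALLOCATED_TB, ALTERNATE_SAN_TB))
```

The check makes the planning constraint explicit: the mission-critical subset fits on the alternate-site SAN with 2 TB to spare, while the full allocated data set does not, which is why additional capacity must be procured during a prolonged outage.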

Because of the quantity of data, MedCoTech practices tiered on-site and off-site data backup. Nightly differential backups of databases and file shares are stored on-site at the MidCon data center (for data recovery for minor system problems only), copied nightly to a cloud backup repository using Veeam, and replicated to a Veeam appliance stored at the Rack 59 facility. Due to bandwidth restrictions, weekly full database and file share backups and monthly VM snapshots are physically delivered on external hard drive to MedCoTech’s secure storage vault at its bank, and are refreshed every 90 days on the Veeam appliance at Rack 59.
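The backup tiers described above can be written down as a small data structure, which makes it easier to see which copies survive the loss of the primary site. This is a sketch only: the `BackupTier` class, its field names, and the location labels are assumptions made for illustration, and the replication frequency to Rack 59 is recorded as nightly because the backups themselves are nightly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTier:
    content: str      # what is backed up
    frequency: str    # how often the copy is made or refreshed
    location: str     # where the copy is kept
    offsite: bool     # True if the copy is kept away from the MidCon data center


# Transcribed from the backup description in Section 2.1 (labels are illustrative).
TIERS = [
    BackupTier("differential (databases, file shares)", "nightly", "MidCon on-site", False),
    BackupTier("differential (databases, file shares)", "nightly", "Veeam cloud backup", True),
    BackupTier("differential (databases, file shares)", "nightly", "Veeam appliance at Rack 59", True),
    BackupTier("full (databases, file shares)", "weekly", "bank vault, external hard drive", True),
    BackupTier("VM snapshots", "monthly", "bank vault, external hard drive", True),
    BackupTier("full backups and VM snapshots", "every 90 days", "Veeam appliance at Rack 59", True),
]


def offsite_locations(tiers):
    """Distinct locations that remain available if the primary site is lost."""
    return sorted({t.location for t in tiers if t.offsite})


if __name__ == "__main__":
    for location in offsite_locations(TIERS):
        print(location)
```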

The equipment at MedCoTech’s half-rack at Rack 59 is limited to 18U. This environment contains a small Veeam backup appliance, a limited 8 TB SAN, and six 1U physical nodes capable of running a limited subset of virtual machines, supporting core business functions only. In the event of a disaster, MedCoTech maintains an agreement with Rack 59 allowing for rapid expansion to an additional full 42U rack in another area of the data center floor. The equipment for that rack will have to be rapidly procured in the event of a disaster. Because some equipment is already in place but full capacity is not, MedCoTech’s facility at Rack 59 is a warm site.

As stated elsewhere in this document, systems such as email, VoIP phones, SharePoint, and certain other systems are maintained in the cloud and are outside the scope of this document.

2.2 LINE OF SUCCESSION

MedCoTech sets forth an order of succession, in coordination with the order set forth by the Chief Executive Officer to ensure that decision-making authority for the MedCoTech Disaster Recovery Plan is uninterrupted. The Chief Information Officer (CIO) is responsible for ensuring the safety of personnel and the execution of procedures documented within this MedCoTech Disaster Recovery Plan. If the CIO is unable to function as the overall authority or chooses to delegate this responsibility to a successor, the Chief Information Security Officer (CISO) shall function as that authority. If the CISO is unable to function as the overall authority or chooses to delegate this responsibility to a successor, the Chief Technology Officer (CTO) shall function as that authority.
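The succession rule above amounts to walking an ordered list until an available role is found. A minimal sketch, assuming a simple set of role abbreviations; the function name and the handling of the case where all three roles are unavailable are illustrative, since the plan does not define that case:

```python
# Order of succession as stated in Section 2.2.
SUCCESSION = ["CIO", "CISO", "CTO"]


def acting_authority(unavailable):
    """Return the first role in the succession order that is neither
    unavailable nor delegating. Returns None if no listed role can act
    (a case the plan does not address; illustrative only)."""
    for role in SUCCESSION:
        if role not in unavailable:
            return role
    return None


if __name__ == "__main__":
    print(acting_authority(set()))            # CIO is available
    print(acting_authority({"CIO"}))          # falls to the CISO
    print(acting_authority({"CIO", "CISO"}))  # falls to the CTO
```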

2.3 RESPONSIBILITIES

The following teams have been developed and trained to respond to a contingency event affecting the IT system.

Members of the System Infrastructure Team include personnel who are also responsible for the daily operations and maintenance of MedCoTech’s VMware environment and line of business application and database systems. The System Infrastructure Team is responsible for recovery of the MedCoTech VMware environment, database recovery and restoration efforts, support of the Application Services Team, and any operating system support. The Manager of System Administration directs the System Infrastructure Team in restoration activities, and reports to the Chief Technology Officer.

Members of the Network & Telecommunications Team include personnel who are also responsible for the daily operations and maintenance of MedCoTech’s wired and wireless network infrastructure, VoIP systems, WAN interconnects, site-to-site VPN connections, etc. The MedCoTech Network & Telecommunications Team is responsible for recovery of any network infrastructure issues, as well as ensuring connectivity to the alternate site and ensuring continuity of connectivity to the office complex during the event. The Manager of the Network & Telecommunications Team directs the Network & Telecommunications Team in restoration activities, and reports to the Chief Technology Officer.

Members of the Application Services Team include personnel who are responsible for the daily support, management, and configuration of the Line of Business software supporting the organization, including the ERP systems, CRM, and sales systems. This includes software administrators, business system analysts, configuration specialists, and other staff who ensure the software operates effectively.

3. NOTIFICATION AND ACTIVATION PHASE

This phase addresses the initial actions taken to detect and assess damage inflicted by a disruption to MedCoTech’s systems. Based on the assessment of the event, the plan may be activated by the Chief Information Officer or their designee.

In an emergency, MedCoTech’s top priority is to preserve the health and safety of its staff before proceeding to the Notification and Activation procedures.

The notification sequence is listed below:

The first responder is to notify the Chief Information Security Officer. All known information must be relayed to the Chief Information Officer.

The Chief Information Security Officer notifies the Manager of the Network & System Security Team. The Manager of the Network & System Security Team notifies the Manager of the System Infrastructure Team to begin assessment procedures.

The Manager of the System Infrastructure Team is to notify team members and direct them to complete the assessment procedures outlined below to determine the extent of damage and estimated recovery time. If damage assessment cannot be performed locally because of unsafe conditions, the System Infrastructure Team is to follow the Alternate Assessment Procedure below.

Damage Assessment Procedures:

  1. Enter the MidCon facility and approach the cage only if it is safe to do so.

  2. Verify primary system capability – can any hardware, data, or systems be salvaged, or brought back online at the primary site and with minimal downtime?

  3. If not, can any hardware be salvaged and returned to service at the alternate site?

  4. If not, determine non-production VMware cluster capability: can the non-production cluster be utilized, using local backups, to serve in production capacity at the primary site?

  5. If not, can any hardware from the non-production VMware cluster be salvaged and returned to service at the alternate site?

  6. Determine availability and verify completeness of the latest backup at Rack 59.

  7. Determine availability and verify completeness of cloud-based backups in the Veeam tenant.

  8. Notify Dell Computer Support and request expedited hardware replacement at primary site, assuming hardware at primary site is destroyed or inoperable due to hardware failure.

  9. Collect photos and information for insurance claim.

  10. Report any and all data back to Manager of the System Infrastructure Team
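Steps 2 through 5 of the damage assessment form a cascade: each question is asked only if the previous answer was "no". The sketch below expresses that cascade in Python. The `Findings` class, its field names, and the wording of the returned recommendations are hypothetical, and the final fallback (relying on backups at the alternate site) is an assumption inferred from steps 6 and 7 rather than something the procedure states.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    primary_recoverable_on_site: bool   # step 2
    primary_hw_salvageable: bool        # step 3
    nonprod_usable_on_site: bool        # step 4
    nonprod_hw_salvageable: bool        # step 5


def salvage_path(f: Findings) -> str:
    """Walk assessment steps 2-5 in order and return the first option that applies."""
    if f.primary_recoverable_on_site:
        return "recover production at the primary site"
    if f.primary_hw_salvageable:
        return "move salvaged production hardware to the alternate site"
    if f.nonprod_usable_on_site:
        return "run production on the non-production cluster at the primary site"
    if f.nonprod_hw_salvageable:
        return "move salvaged non-production hardware to the alternate site"
    # Assumed fallback; the procedure itself only says to verify backups (steps 6-7).
    return "no salvage option; rely on backups at the alternate site"


if __name__ == "__main__":
    print(salvage_path(Findings(False, False, True, True)))
```

Encoding the order this way makes it harder to skip ahead to a more disruptive option while a less disruptive one is still viable.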

Alternate Assessment Procedures:

  1. Are operating systems online? Do systems respond to ping?

  2. Assess potential for physical damage – check local weather, news sources, information outlets, research possible causes for outage.

  3. Contact the account executive and local monitoring facility for the primary site. Request status and images from security cameras and video feeds.

  4. View MidCon facility from a safe vantage from the exterior (near the I-235 Service Road, for example) for obvious signs of external damage, smoke, debris, or water incursion.

  5. Report any and all data back to Manager of the System Infrastructure Team
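Step 1 of the alternate assessment asks whether systems are online. ICMP ping usually requires an external command or elevated privileges, so the sketch below substitutes a TCP connection attempt, which answers a similar question ("is anything listening on this host and port?") using only the Python standard library. The function names are illustrative, and any host names and ports passed in are assumptions to be supplied by the team; none are defined in this plan.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def survey(targets, timeout: float = 3.0) -> dict:
    """Probe each (host, port) pair and return a mapping of 'host:port' to reachability."""
    return {f"{host}:{port}": tcp_reachable(host, port, timeout) for host, port in targets}
```

A result of "unreachable" from the office complex does not distinguish a dead server from a broken network path, so the outcome should be read together with steps 2 through 4.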

  • When damage assessment has been completed, the Manager of the System Infrastructure Team is to notify the Manager of Network & System Security of the results.

  • The Manager of Network & System Security is to evaluate the results and determine whether the contingency plan is to be activated and if relocation is required.

  • Based on assessment results, the Manager of Network & System Security is to report the assessment results to civil emergency personnel (e.g., police, fire) as appropriate.

The Disaster Recovery Plan is to be activated if one or more of the following criteria are met:

  1. MedCoTech’s primary systems will be unavailable for more than 24 hours.

  2. Facility is damaged and will be unavailable for more than 24 hours.

  3. Connectivity to the primary data center is lost for more than 24 hours.

  4. Hardware failures or other equipment damage will prevent significant system usage for more than 48 hours.
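The four activation criteria reduce to threshold comparisons on estimated outage durations. A minimal sketch, assuming the damage assessment produces an estimate in hours for each category; the `Assessment` class and its field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Estimated outage durations in hours (0 means unaffected)."""
    primary_systems_outage_h: float = 0.0   # criterion 1
    facility_outage_h: float = 0.0          # criterion 2
    connectivity_outage_h: float = 0.0      # criterion 3
    hardware_outage_h: float = 0.0          # criterion 4


def should_activate(a: Assessment) -> bool:
    """True if one or more activation criteria are met.
    Criteria 1-3 use a 24-hour threshold; criterion 4 uses 48 hours."""
    return (
        a.primary_systems_outage_h > 24
        or a.facility_outage_h > 24
        or a.connectivity_outage_h > 24
        or a.hardware_outage_h > 48
    )


if __name__ == "__main__":
    print(should_activate(Assessment(connectivity_outage_h=30)))  # exceeds 24 hours
    print(should_activate(Assessment(hardware_outage_h=36)))      # under the 48-hour threshold
```

The comparisons are strict ("more than"), so an estimate of exactly 24 hours does not by itself trigger activation.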

If the plan is to be activated, the Manager of Network & System Security is to notify all Team Leaders and inform them of the details of the event and if relocation is required.

Upon notification from the Manager of Network & System Security, Team Leaders are to notify their respective teams. Team members are to be informed of all applicable information and prepared to respond and relocate if necessary.

The Manager of Network & System Security is to notify the offsite storage facility that a disaster has been declared and to ship the necessary materials (as determined by damage assessment) to Rack 59.

The Manager of Network & System Security is to notify Rack 59 that a contingency event has been declared and to prepare the facility for MedCoTech’s arrival.

The Manager of Network & System Security is to notify remaining personnel (via notification procedures) on the general status of the incident.

4. RECOVERY OPERATIONS

This section provides procedures for recovering the application at the alternate site, while other efforts are directed to repairing damage to the original system and capabilities.

The following procedures are for recovering MedCoTech’s systems at the alternate site. Procedures are outlined in order of priority. Each procedure should be executed in the sequence in which it is presented to maintain efficient operations.

Procure and add capacity to the existing alternate site VMware cluster. A limited amount of capacity exists at Rack 59 to run critical systems under reduced load for as long as 96 hours; after that, additional capacity must be added to support additional data, more compute power, and additional VMs.

System Infrastructure Team

  • Salvage any recoverable equipment from the original site and add it to the environment at the alternate site.

  • Procure additional hardware as needed to grow cluster to minimum capacity at the alternate site.

  • Contact Rack 59 and add additional rack space to meet minimums as necessary.

Network & Telecommunications Team

  • Contact Rack 59 to add additional Internet speed as necessary.

Reestablish Active Directory domain controllers at the alternate site. A read-only domain controller (RODC) exists for redundancy and speed purposes at the primary communications head-end at the corporate offices; even so, this step is a prerequisite for database and application servers due to authentication requirements.

System Infrastructure Team

  • Build and promote new Domain Controllers.

  • Restore domain from backup if necessary. Connect to existing domain at RODC.

Network & Telecommunications Team

  • Modify routing from primary site to alternate site.

  • Verify AD connectivity and re-establish authentication services to Network systems.

Reestablish the database servers at the alternate site and recover databases. Before application servers are reestablished and enabled, the database server software must be reinstalled and tested, and the databases themselves must be restored.

System Infrastructure Team

  • Restore database servers from VM snapshot.

  • Verify database server software installation.

  • Configure database server software.

  • Restore databases.

  • Roll back any failed database changes.

Application Services Team

  • Verify database recovery date and time and work with end-users on any data loss from rollbacks.

Network & Telecommunications Team

  • Verify connectivity to database backups and assist System Infrastructure Team with connectivity issues.

  • Replace firewall rules with the alternate-site rule set.

Reestablish ERP application servers and ensure connectivity.

System Infrastructure Team

  • Restore application servers from VM snapshot.

  • Verify web application server restored with VM

  • Reinstall ERP WAR files into application servers from backup.

  • Restore software load balancers from VM snapshot.

Application Services Team

  • Verify application components are working properly and provide system check-out.

Network & Telecommunications Team

  • Restore software-based application load balancers at alternate site.

  • Configure application load balancers at alternate site to point to new application servers.
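The recovery procedures above must run in dependency order: Active Directory before databases, and databases before the ERP application servers. One way to make that order explicit and machine-checkable is a topological sort, sketched below with Python's standard `graphlib` module (Python 3.9 or later). The step names are illustrative labels, and placing capacity expansion ahead of Active Directory is an assumption based on the order in which the procedures are listed.

```python
from graphlib import TopologicalSorter

# Each key is a recovery step; each value is the set of steps it depends on.
# Step names are illustrative labels for the four procedures in this section.
DEPENDS_ON = {
    "add_capacity": set(),
    "active_directory": {"add_capacity"},          # assumed from the listed order
    "database_servers": {"active_directory"},      # authentication requirement
    "erp_application_servers": {"active_directory", "database_servers"},
}


def recovery_order(depends_on):
    """Return the steps in an order that satisfies every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, step in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(position, step)
```

If further systems are added to the plan, declaring their dependencies here keeps the sequence valid, and `TopologicalSorter` raises `CycleError` if two steps are ever declared as depending on each other.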

5. RETURN TO NORMAL OPERATIONS

This section discusses activities necessary for restoring MedCoTech’s primary production cluster at MedCoTech’s original (MidCon) site or a new site. Depending on the nature of the damage and the time required for recovery, MedCoTech may elect to swap the primary and alternate facilities. In other words, to eliminate the need for restoration at the original facility, once equipment has been restored, Rack 59 would remain the primary site, with MidCon becoming the new alternate site. Otherwise, MedCoTech should follow the procedure below once the primary production cluster at the original site has been restored, and transition operations back to the primary site. The goal is to provide a seamless transition of operations from the alternate site to the primary site.

5.2 PLAN DEACTIVATION

Procedures should be outlined, per necessary team, to clean the alternate site of any equipment or other materials belonging to the organization, with a focus on handling sensitive information. Materials, equipment, and backup media should be properly packaged, labeled, and shipped to the appropriate location(s). Team members should be instructed to return to the original or new site.

System Infrastructure Team

  • Certify complete recovery at alternate site with original capacity.

Network & Telecommunications Team

  • Remove any BGP entries or routes for old site.

Network & System Security Team

  • Collect any hard drives or data storage from original site.

  • Certify data destruction and record inventory disposition.

  • Conduct After-Action Report.

6. PLAN EXERCISE

This Disaster Recovery Plan must undergo some form of exercise at least once per year. This shall include a minimum of the following:

  • Table-Top Exercise or Structured Walk-Through (no more than once every 3 years)

  • Parallel Test or Cutover (at least once every other year), including backup recoveries.

  • Full-interruption/full-scale test (at least once every 4 years).

Each recovery exercise must include an after-action report. Necessary modifications to the plan identified during these exercises must be recorded in Section 1.5, Record of Changes.

7. PLAN APPENDICES

The appendices included should be based on system and plan requirements.

  • Personnel Contact List

  • Vendor Contact List

  • Equipment and Specifications

  • Service Level Agreements and Memorandums of Understanding

  • IT Standard Operating Procedures

  • Business Impact Analysis

  • Related Contingency Plans

  • Emergency Management Plan

  • Occupant Evacuation Plan

  • Continuity of Operations Plan

8. REFERENCES

[1] M. Swanson, P. Bowen, A. Phillips, D. Gallup, and D. Lynes, “Contingency Planning Guide for Federal Information Systems,” National Institute of Standards and Technology, NIST Special Publication (SP) 800-34 Rev. 1, Nov. 2010. doi: https://doi.org/10.6028/NIST.SP.800-34r1.

[2] “Data Center Infrastructure,” MIDCON DATA CENTER SOLUTIONS. https://www.midcondcs.com/data-center-infrastructure (accessed Apr. 06, 2020).

[3] “Business Continuity & Workplace Recovery | RACK59 Colocation Services,” RACK59 Data Center. https://rack59.com/data-center-services/business-continuity/ (accessed Apr. 03, 2020).