Business Continuity and Disaster Recovery Planning

Contingency Planning

  • Information systems contingency planning refers to a coordinated strategy involving plans, procedures, and technical measures that enable the recovery of information systems, operations, and data after a disruption
  • Resilience is the ability of an organization to quickly adapt to and recover from any known or unknown changes to the environment
  • BCP and DRP are types of contingency planning
  • BCP and DRP help minimize financial impact during serious incidents by protecting tangible and intangible assets

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

BCP / DRP

  • Business Continuity Planning (BCP)
  • Preservation of business in the face of disruptions
  • Focuses on sustaining an organization’s mission/business processes during and after a disruption
  • BCP may be created for a single business unit or for the entire organization’s processes; may also be scoped for only functions deemed to be priorities
  • BCP is the responsibility of the security team since it provides availability
  • Disaster Recovery Planning (DRP)
  • DRP is concerned with restoring operability of disrupted IT systems, whereas BCP is concerned with keeping business processes available
  • DRP applies to major (usually physical) disruptions to service that deny access to the primary facility infrastructure for an extended period
  • DRP only addresses information system disruptions that require relocation to infrastructure at an alternate site

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

The Need for BCP

  • Natural disasters
  • Social unrest or terrorist attacks
  • BCP may often be triggered by an audit
  • Legislative/regulatory requirements
  • Equipment failure (such as disk crash)
  • Disruption of power supply or telecommunication
  • Application failure or corruption of database
  • Human error, sabotage or strike
  • Malicious Software (Viruses, Worms, Trojan horses) attacks
  • Hacking or other Internet attacks
  • Fire

Source: Introduction to Business Continuity Planning; SANS Institute InfoSec Reading Room

Standards

  • NFPA 1600
  • National Standard on Preparedness by the National Fire Protection Association
  • ISO 17799
  • Defense Security Service (DSS)
  • A division of the DoD
  • NIST
  • Standard of due care / best practice/good business practice

Enterprise wide Continuity Planning

Critical Success Factors for BCP Implementation

• Management support

  • Ensures that management will allocate resources for the project
  • It is the key driver of organizational change
  • Management awareness will steer the program and set priorities

• Accountability and responsibility

  • All departments/individuals know their role in incorporating BCM
  • A BCM team lead should oversee the overall process development and report to management on obstacles faced

• Integral part of information assurance management program

  • BCM is not separate from the organization’s overall IT management
  • Needs and allows continuous monitoring and improvement
  • BCM should be integrated into the total change management process

Source: Information Assurance Handbook: Effective Computer Security and Risk Management Strategies by Corey Schou and Steven Hernandez

BCP Process

Source: CISSP CBK

A. Project Initiation

  • BCP and DRP plan must be based on a clearly defined policy, which states:
  • Organization’s overall contingency objectives
  • Organizational framework
  • Resource requirements
  • Roles and responsibilities
  • Scope as applies to common platform types and organization functions
  • Training requirements
  • Exercise and testing schedules
  • Plan maintenance schedule
  • Minimum frequency of backups and storage of backup media

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

Project Initiation

  • Project scope development and planning
  • BCP vs. DRP
  • Crisis management planning
  • Continuous availability
  • Incident Command System (ICS)
  • Executive Management Support - CIO must support the contingency program and be included in the process to develop the program policy
  • Project scope and authorization
  • Continuity Planning Project Team formation

B. Current State Assessment

  • Understand Enterprise Strategy, Goals and Objectives
  • Business Impact Analysis
  • Threat analysis
  • Identify critical business functions
  • Third-party relationships
  • Assessment of current continuity planning process
  • Benchmarking or peer review

Business Impact Analysis (BIA)

  • The BIA correlates the system with the critical mission/business processes and services it supports, in order to characterize the consequences of a disruption
  • Three steps are typically involved in accomplishing the BIA:
  • Determine mission/business processes and recovery criticality
  • Identify resource requirements: realistic recovery efforts require a thorough evaluation of the resources required to resume mission/business processes as quickly as possible
  • Identify recovery priorities for system resources: system resources can be linked to the critical mission/business processes and functions they support, and priority levels can then be established for sequencing recovery activities and resources

Critical Business Functions

  • Impacts on business functions are analyzed in terms of availability, integrity and confidentiality
  • Availability (Time Sensitivity)
  • Recovery Time Objective (RTO) - the maximum amount of time that a system resource can remain unavailable before there is an unacceptable impact on other system resources, supported mission/business processes, and the MTD
  • A Plan of Action and Milestones for mitigation should be initiated if the RTO is not feasible
  • Maximum Tolerable Downtime (MTD) - the total amount of time the system owner is willing to accept for a mission/business process outage or disruption and includes all impact considerations
  • Maximum Allowable Downtime (MAD) - the total amount of time that the system can be unavailable before significant organizational impact will result
  • Data Integrity
  • Recovery Point Objective (RPO) - the point in time, prior to a disruption, to which data can be recovered after an outage
  • Critical business functions should be classified based on the determined impact
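The relationship between these objectives can be checked mechanically. The following is a minimal sketch, assuming all values are expressed in hours; the `RecoveryObjectives` structure, the `find_objective_gaps` function, and the payroll figures are hypothetical and are not taken from NIST SP 800-34.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    """Time-based objectives for one system, all in hours (hypothetical structure)."""
    mtd_hours: float              # Maximum Tolerable Downtime of the supported process
    rto_hours: float              # Recovery Time Objective of the system
    rpo_hours: float              # Recovery Point Objective (tolerable data-loss window)
    backup_interval_hours: float  # how often backups are actually taken


def find_objective_gaps(obj):
    """Return a list of findings; an empty list means no gap was detected."""
    findings = []
    if obj.rto_hours > obj.mtd_hours:
        findings.append("RTO exceeds MTD: recovery would finish after the tolerable outage window")
    if obj.backup_interval_hours > obj.rpo_hours:
        findings.append("Backup interval exceeds RPO: more data could be lost than is acceptable")
    return findings


# Hypothetical example: recovery fits within the MTD, but backups are too infrequent.
payroll = RecoveryObjectives(mtd_hours=72, rto_hours=48, rpo_hours=12, backup_interval_hours=24)
for finding in find_objective_gaps(payroll):
    print(finding)
```

A finding from a check like this is the kind of result that could feed the Plan of Action and Milestones mentioned above.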

Sample Business Impact Analysis (BIA)

Cost Balancing

  • The longer a disruption is allowed to continue, the more costly it can become to the organization
  • Conversely, the shorter the RTO, the more expensive the recovery solutions cost to implement
  • Plotting the cost balance points will show an optimal point between disruption and recovery costs
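The balance point can also be found numerically rather than by plotting. This is a rough sketch under stated assumptions: the candidate RTOs and all cost figures are invented for illustration, and `optimal_rto` is a hypothetical helper that picks the candidate with the lowest combined cost.

```python
def optimal_rto(candidates):
    """Pick the candidate RTO with the lowest combined cost.

    candidates: iterable of (rto_hours, disruption_cost, recovery_cost) tuples.
    """
    return min(candidates, key=lambda c: c[1] + c[2])


# Hypothetical figures: disruption cost grows as the RTO gets longer,
# while the recovery solution becomes cheaper to implement.
options = [
    (4, 10_000, 200_000),
    (24, 60_000, 80_000),
    (72, 180_000, 30_000),
    (168, 420_000, 10_000),
]

rto, disruption, recovery = optimal_rto(options)
print(f"Lowest total cost at RTO={rto}h: {disruption + recovery}")
# prints: Lowest total cost at RTO=24h: 140000
```

In practice the chosen RTO must still not exceed the MTD, so the cost optimum is a starting point for discussion and not a final answer.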

Critical Business Functions

  • Identification of critical business functions
  • Operational impact
  • Financial impact
  • Reputation or public image impact
  • Dependencies
  • BIA enables characterization of the system components, supported business processes, and interdependencies
  • Possible business impacts due to the unavailability of systems can be determined (RTO, MTD, etc.)
  • Then sequencing recovery of information system components can be finalized which will form the basis for developing contingency solutions

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
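One way to turn interdependencies and priorities into a recovery sequence is a dependency-aware ordering. The sketch below is illustrative only and is not part of the NIST guidance: the component names, the priority numbers (1 is most urgent), and the `recovery_sequence` helper are assumptions. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def recovery_sequence(dependencies, priority):
    """Order components so each is recovered only after everything it depends on.

    dependencies: dict mapping a component to the components it depends on.
    priority: dict mapping a component to a number; lower means more urgent.
    Components that become ready in the same round are ordered by priority.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda name: priority[name])
        for name in ready:
            order.append(name)
            sorter.done(name)
    return order


# Hypothetical system components from a BIA.
dependencies = {
    "network": [],
    "database": ["network"],
    "email": ["network"],
    "order_entry": ["database"],
}
priority = {"network": 1, "database": 1, "order_entry": 1, "email": 3}

print(recovery_sequence(dependencies, priority))
# prints: ['network', 'database', 'email', 'order_entry']
```

Because components are released in rounds, a low-priority component can still precede a high-priority one that is waiting on a dependency, as `email` does here.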

Third Party Relationships

  • Downstream liabilities
  • Who will be impacted if your business is interrupted?
  • Upstream impacts
  • What happens if a partner’s business is interrupted?
  • Enforce SLAs

Identify Preventive Controls

  • Some outage impacts identified in BIA may be mitigated or eliminated through preventive measures that deter, detect, and/or reduce impacts to the system
  • Where feasible and cost-effective, preventive methods are preferable to recovery methods. For example:
  • Appropriately sized uninterruptible power supplies (UPS)
  • Gasoline- or diesel-powered generators to provide long-term backup power
  • Air-conditioning systems with adequate excess capacity to prevent failure of certain components, such as a compressor
  • Fire detection and suppression systems
  • Heat-resistant and waterproof containers for backup media and vital non-electronic records
  • Offsite storage of backup media, non-electronic records, and system documentation
  • Frequent scheduled backups, including where the backups are stored (onsite or offsite) and how often they are recirculated and moved to storage

C. Development Phase

  • Develop and design recovery strategies
  • IT recovery
  • Business process recovery
  • Facilities recovery

BCP/DRP Development

Activation and Notification Phase

  • Defines initial actions taken once a system disruption or outage has been detected or appears to be imminent
  • Activation Criteria and Procedure - BC or DR plan should be activated if one or more of the activation criteria are met. Criteria may be based on:
  • Extent of any damage to the system
  • Criticality of the system to the organization’s mission
  • Expected duration of the outage lasting longer than the RTO
  • Notification Procedures - Describe the methods used to notify recovery personnel during business and non business hours. Notification methods can be:
  • Manual
  • Automatic
  • Outage Assessment - Assess the nature and extent of the disruption
  • Assessment should be completed as quickly as the given conditions permit
  • Outage Assessment Team is the first team notified of the disruption.

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems
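The activation rule described above (activate when one or more criteria are met) can be sketched as a simple check. The `OutageAssessment` fields and the `should_activate_plan` function are hypothetical; how an organization judges "extensive damage" or "mission critical" is left to its own policy.

```python
from dataclasses import dataclass


@dataclass
class OutageAssessment:
    """Result of an outage assessment (hypothetical fields)."""
    damage_is_extensive: bool
    system_is_mission_critical: bool
    expected_outage_hours: float
    rto_hours: float


def should_activate_plan(assessment):
    """Return True if at least one activation criterion is met."""
    criteria_met = [
        assessment.damage_is_extensive,
        assessment.system_is_mission_critical,
        assessment.expected_outage_hours > assessment.rto_hours,
    ]
    return any(criteria_met)


# Hypothetical example: minor damage, non-critical system, but the outage
# is expected to last longer than the RTO, so the plan is activated.
assessment = OutageAssessment(
    damage_is_extensive=False,
    system_is_mission_critical=False,
    expected_outage_hours=36,
    rto_hours=24,
)
print(should_activate_plan(assessment))
```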

Recovery Phase

  • Focuses on implementing recovery strategies to restore system capabilities, repair damage, and resume operational capabilities at the original or new alternate location
  • Sequence of Recovery Activities
  • Should reflect the system’s MTD to avoid significant impacts to related systems
  • Recovery Procedures
  • Should provide detailed procedures to restore the information system or components to a known state
  • Recovery procedures should be written in a straightforward, step-by-step style
  • Recovery Escalation and Notification
  • Effective escalation and notification procedures should define and describe the events, thresholds, or other types of triggers that are necessary for additional action
  • At the completion of the Recovery Phase, the information system will be functional and capable of performing the functions identified in the plan

Reconstitution Phase

  • Defines the actions taken to test and validate system capability and functionality
  • Concurrent Processing - running two systems concurrently until there is a level of assurance that the recovered system is operating properly
  • Validation Data Testing - testing and validating recovered data to ensure complete and current recovery
  • Validation Functionality Testing - verifying that all system functionality has been tested, and that normal operations can resume
  • Plan deactivation activities for returning to normal operations include:
  • Notifications – notifying users using predefined procedures that normal operations have resumed
  • Cleanup - cleaning up work space or dismantling any temporary recovery locations, restocking supplies, returning manuals or other documentation to their original locations, and readying the system for another contingency event
  • Offsite Data Storage - If offsite data storage is used, retrieved backup should be returned to its offsite data storage location
  • Data Backup - system should be fully backed up and a new copy of the current operational system stored for future recovery efforts
  • Event Documentation - All recovery and reconstitution events should be well documented for an after-action report with lessons learned

Source: NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems

Backup and Recovery

  • Backup and recovery methods and strategies are a means to restore system operations quickly and effectively following a service disruption
  • These should be integrated into the system architecture during the Development/Acquisition phase of the SDLC
  • Considerations for developing or comparing backup and recovery methods:
  • Cost
  • Maximum downtimes
  • Security
  • Recovery priorities
  • Integration with organization-level BCM plans

Recovery Time

Recovery Tier   Recovery Timeframe   Recovery Requirement
I               0-24 hours           Resources must be available in advance and implemented first
II              1-3 days             Resources must be available in advance
III             3-5 days             Resources must be identified and quickly available
IV              Other                Resources must be identified

Method to Prioritize Business Processes or IT Infrastructure Components
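The tiers above can be applied to a list of systems with a small lookup. This is a sketch under one explicit assumption: the tier boundaries overlap in the source (24 hours is also 1 day, and 3 days appears in two tiers), so the hypothetical `recovery_tier` function treats each upper bound as inclusive.

```python
def recovery_tier(rto_hours):
    """Map a recovery timeframe in hours to a recovery tier.

    Assumption: upper bounds are inclusive, so exactly 24 hours is Tier I,
    exactly 3 days is Tier II, and exactly 5 days is Tier III.
    """
    if rto_hours < 0:
        raise ValueError("rto_hours must not be negative")
    if rto_hours <= 24:
        return "I"
    if rto_hours <= 3 * 24:
        return "II"
    if rto_hours <= 5 * 24:
        return "III"
    return "IV"


# Hypothetical systems and their recovery timeframes in hours.
for system, hours in [("order entry", 4), ("email", 48), ("intranet wiki", 100), ("archive", 400)]:
    print(f"{system}: Tier {recovery_tier(hours)}")
```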

High Availability (HA) Processes

  • HA is a process where redundancy and failover processes are built into a system to maximize its uptime and availability
  • Goal of HA is to achieve an uptime of 99.999% or higher
  • HA can be expensive, and is not a viable option for many systems and should be considered only for systems that cannot tolerate downtime
  • HA systems cannot be a replacement for a solid backup strategy
  • HA processes need to be extended to an alternate location
  • Mechanisms such as block mirroring to an alternate site should be considered to provide redundancy and backup of system data outside of the system facility.
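To put an uptime target in concrete terms, it helps to convert the percentage into allowed downtime. The following sketch assumes a 365-day year; `allowed_downtime_minutes` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(uptime_percent):
    """Return the downtime per year, in minutes, permitted by an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

A "five nines" target (99.999%) leaves roughly 5.3 minutes of downtime per year, which is why HA is usually reserved for systems that cannot tolerate downtime.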

IT Recovery Strategies

  • Multiple Processing Sites
  • Mirrored Sites
  • Fully redundant with identical data and equipment as well as communication capabilities
  • Highest level of availability at highest cost
  • It ensures virtually 100% availability
  • Configuration management is a challenge
  • RTO of minutes to hours

IT Recovery Strategies

  • Mobile site/trailer
  • Self contained unit with IT and communications
  • RTO of 3-5 days
  • Hot site
  • Fully equipped data center and communications
  • RTO of a few minutes to hours
  • Warm site
  • Has some level of IT capabilities, but will have to be further equipped to take over IT operations
  • RTO of 5+ days
  • Cold site
  • A location capable of supporting IT operations, but with no equipment
  • RTO of 1-2 weeks at the minimum

Alternate IT Recovery Strategies

  • Virtual business partners
  • Similar to multiple sites, but alternate sites are hosted by business partners
  • Reciprocal or mutual aid agreements with an internal or external entity
  • Dedicated site owned or operated by the organization
  • Commercially leased facility

Backup Approaches

  • Electronic vaulting
  • Sending data directly to an alternate facility
  • Can be stored on disk or tape depending on RTO requirements
  • Remote journaling
  • Replicates data transactions in real time or near real time at a secondary processing site
  • Offsite storage
  • Storage Area Network
  • Database shadowing and mirroring

Backup Methods

  • Data integrity involves keeping data safe and accurate on the system’s primary storage devices
  • There are three common methods for performing system backups:
  • Full Backup - captures all files on the disk or within the folder selected for backup
  • Locating a particular file or group of files is simple
  • Time required to perform a full backup can be lengthy; might also lead to excessive, unnecessary media storage requirements
  • Differential Backup - stores files that were created or modified since the last full backup
  • Takes less time to complete than a full backup
  • Incremental Backup - captures files that were created or changed since the last backup
  • Affords more efficient use of storage media; backup times are reduced
  • Media from different backup operations may be required to recover a system from an incremental backup
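The difference between the three methods comes down to which reference point a file's modification time is compared against. The sketch below illustrates that selection logic only; it is not a backup tool. The `BackupType` enum, the `files_to_back_up` function, and the integer timestamps are assumptions made for illustration.

```python
from enum import Enum


class BackupType(Enum):
    FULL = "full"
    DIFFERENTIAL = "differential"
    INCREMENTAL = "incremental"


def files_to_back_up(file_mtimes, backup_type, last_full, last_backup):
    """Select the files a backup run would capture.

    file_mtimes: dict mapping a file path to its modification timestamp.
    last_full: timestamp of the most recent full backup.
    last_backup: timestamp of the most recent backup of any type.
    """
    if backup_type is BackupType.FULL:
        return sorted(file_mtimes)  # everything, regardless of modification time
    if backup_type is BackupType.DIFFERENTIAL:
        cutoff = last_full          # changed since the last full backup
    else:
        cutoff = last_backup        # changed since the last backup of any type
    return sorted(path for path, mtime in file_mtimes.items() if mtime > cutoff)


# Hypothetical timestamps: full backup at t=10, most recent backup at t=12.
files = {"report.doc": 5, "ledger.db": 11, "notes.txt": 13}
for kind in BackupType:
    print(kind.value, files_to_back_up(files, kind, last_full=10, last_backup=12))
```

Here the differential run captures `ledger.db` and `notes.txt`, while the incremental run captures only `notes.txt`. Restoring from incrementals therefore needs the last full backup plus every incremental taken since, which is why media from several backup operations may be required.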

Backup Locations

  • On-site
  • Near-site
  • Off-site

Communications

  • Emergency communication systems
  • Remote access may serve as an important contingency capability by providing access to organization-wide data for recovery teams or users from another location
  • Wireless (or WiFi) local area networks can serve as an effective contingency solution to restore network services following a wired LAN disruption
  • Business communications systems
  • Networks
  • Some of the ways to ensure communication availability are:
  • Redundant communications links
  • Redundant network service providers
  • Redundant network-connecting devices
  • Redundancy from NSP or Internet Service Provider (ISP)
  • Monitoring software can be installed to provide warning and troubleshoot network issues before users and other nodes notice problems.

D. Implementation

  • Initial walkthroughs of design
  • Implement design
  • Test
  • Monitor

Testing, Training and Exercises (TT&E)

  • Training - personnel are trained to fulfill their roles and responsibilities within the plan
  • Exercises – plans simulated to validate their content
  • Testing - systems and system components tested to ensure their operability in a disrupted environment

Testing

  • Design short and long term continuity and crisis management testing plans
  • Update plans as necessary and document
  • Test types
  • Checklist
  • Walkthrough (table top review)
  • Simulation
  • Parallel
  • Full-interruption

BCP Program Awareness and Training

  • Recovery strategy and procedures must be documented and made available
  • Recovery personnel should be familiar with their roles and be taught the skills necessary to be prepared for tests, exercises, and actual outage events
  • Training should be provided at least annually, and to the extent that personnel can execute their recovery roles without the aid of documentation
  • Leadership training – crisis management
  • Tech teams training – procedures and logistics
  • Part of onboarding training
  • Recovery personnel should be trained on the following plan elements:
  • Purpose of the plan
  • Cross-team coordination and communication
  • Reporting procedures
  • Security requirements
  • Team-specific processes
  • Individual responsibilities

BCP Program Exercises

  • An exercise is a simulation of an emergency designed to validate the viability of one or more aspects of the Business Continuity or Disaster Recovery plans
  • Exercises are scenario-driven
  • Types of exercises are:
  • Tabletop Exercises - Discussion-based exercises in which personnel discuss their roles during an emergency and their responses to a particular emergency situation
  • Functional Exercises - Personnel validate their operational readiness for emergencies by performing their duties in a simulated operational environment

Developing BCP/DRP culture

  • Personnel across the organization must be confident and competent with the BCP/DRP program
  • BCP must be aligned with organizational business objectives
  • Organizations must establish a BCM culture and integrate it into daily business operations with the support of the CRO and senior management.
  • Three techniques are involved in developing and establishing BCM culture within an organization:
  • Design and deliver an awareness campaign to create and promote BCM awareness and develop skills, knowledge, and commitment required to ensure a successful BCM practice.
  • Ensure the awareness campaign has achieved its goals and monitor BCM awareness for a longer term.
  • Perform an assessment on the current BCM awareness level against the management-targeted level.

Emergency Operations Center

  • A physical location to coordinate emergency response efforts
  • Virtual EOC
  • Helps in the case of a pandemic or globally dispersed key employees

E. Management of BCP/DRP

  • Program oversight
  • Continuity planning manager
  • Updating and maintenance of the plan - Changes in specific areas may require attention, for example, employee turnover, changes to organizational structure, changes to business processes, etc.
  • Regular practice of the plan
  • Validate plans by having everyone involved perform simulations of different scenarios
  • Frequency of exercises depends on the rate of changes made within the organization
  • Review the result of earlier exercises to ensure identified weaknesses have been addressed
  • Review BCP - An audit by internal or external auditors can highlight all key material weaknesses and issues

Resources