Operational Resiliency

By Bob Burns

Overview

Introduction

The criteria for operational resiliency provide the foundation for building and maintaining an organization’s business continuity program. Traditionally, business continuity planning covered the recovery of business functions and their supporting IT infrastructure. The demand for 24x7 high-availability services requires a change in the way we define business continuity. The new paradigm uses an operational resiliency model to define the business continuity framework supporting day-to-day operations. Traditional disaster recovery plans become a subset of the new business continuity process. This new process also leverages the internet by offering collaborative emergency management portals to share and communicate information.

Most operational disruptions can be prevented, and the rest mitigated to reduce their impact. The key is how an organization prepares itself and its partners to prevent unforeseen interruptions. The solution appears complex because too many organizations try to solve the problem at the wrong end of the process. Much like Total Quality Management and the Six Sigma Quality Process, operational resiliency has to be designed into every critical step and threaded throughout the organization at each critical hand-off point. The final step is how that information is communicated and shared to mitigate the overall situation.

Management can play an important role at the outset of every new project by requiring that an operational resiliency strategy is defined and that funding is sufficient to ensure that the procedures, infrastructure, and planning are in place to meet the recovery time objectives for each critical process. Collaboration between organizations is also essential to define the interdependencies between people, processes, and technology. These relationships can be mapped by “following the data” through each process needed to deliver internal and external customer products and services.

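
One way to picture “following the data” is as a walk over a map of which processes feed which. The sketch below is only an illustration under stated assumptions: the process names, the `DATA_FLOW` map, and the `upstream_dependencies` helper are all hypothetical and do not come from any particular continuity tool.

```python
from collections import deque

# Hypothetical data-flow map: each process lists the upstream processes
# or systems whose data it consumes. All names are illustrative only.
DATA_FLOW = {
    "customer_invoice": ["billing_system", "order_entry"],
    "billing_system": ["customer_database"],
    "order_entry": ["customer_database", "partner_inventory_feed"],
    "customer_database": [],
    "partner_inventory_feed": [],
}


def upstream_dependencies(deliverable, data_flow):
    """Return every process the deliverable depends on, directly or indirectly."""
    seen = set()
    queue = deque(data_flow.get(deliverable, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; also guards against circular flows
        seen.add(node)
        queue.extend(data_flow.get(node, []))
    return seen


deps = upstream_dependencies("customer_invoice", DATA_FLOW)
print(sorted(deps))
```

In this made-up example the invoice depends on an external partner feed two steps upstream, which is the kind of interdependency that is easy to miss when each organization plans in isolation.
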
Not all organizations have the resources or knowledge to assess risks, mitigate their impact, and develop business continuity plans for operational resiliency and disaster recovery. The scope of managing this initiative is substantial, especially as there are many interdependent processes that can affect the success of each organization.

The Operational Resiliency Process

Resiliency Management – This is the most important aspect of the Operational Resiliency Process. It provides the glue that allows organizations to meet their operational resiliency objectives while maintaining continuous, uninterrupted operations. If the process is not managed, each organization will develop plans that do not support the universal needs of the organization. These plans are often in different formats and do not interface with one another. The success of operational resiliency starts with the Board of Directors, senior management, and public officials. Only they can ensure that the proper funding, direction, standards, and resources are available to implement a realistic continuity of operations program.

Once the objectives are established, specialized expertise is needed to lay the framework for the entire company. This expertise is rare within most organizations and often requires unbiased third-party expertise. Caution should be exercised when using third parties who are selling hardware or recovery services. Their recommendations will include their “must have” products or services and ultimately will cost more than if business continuity and disaster recovery were part of the overall operational resiliency plan supporting day-to-day operations.

The management of resiliency information is a formidable task even when all other aspects of the program are working effectively.
This requires expert continuity management software to assist with mitigating risks and planning an overall integrated solution that can be securely accessed by an unlimited number of users and is protected against unforeseen interruptions. The software should be the repository for, or have direct access to, all critical information needed for rapid response, notification, and recovery of all key processes for each organization. The protection of the continuity management software and its data should be given the highest priority, and the software should be operated from an off-site facility with replication, with a target of 100% availability.

Operational Resiliency Assessment – One approach to understanding an organization’s operational resiliency is to perform a threats and vulnerabilities assessment followed by an operational impact analysis. The analysis should determine whether the business recovery time requirements for each critical application can be achieved by the technology infrastructure and operational procedures supporting each business function. The analysis should compare industry best practices from ITIL, COBIT, NSA, NIMS, and others against each operational, business, and IT function. The result should provide a gap analysis for critical risk areas including on-site and off-site facilities, business processes, security, networking, communications, applications, data center operations, storage management, disaster preparedness, process and documentation management, and incident response. The analysis should provide strategic and tactical improvement recommendations for all risk areas that jeopardize operational resiliency.

Resiliency Strategies – Each business function should have resiliency strategies that define each critical process, its criteria for operational continuity, and the dependent staff and IT resources needed to maintain the required continuity level.
This also includes the resiliency preparedness of internal and external organizations that provide resources and data needed to deliver customer services. Resiliency Strategies should typically define why, who, what, when, and where so previously identified risks and events can be mitigated. This translates to: why there is an interruption; who is responsible for responding; what they have to do; when they have to respond; and where they have to respond. This may appear overly simplified, and it is, but it is a great place to start because most organizations have not defined their critical processes from an operational resiliency perspective.

Business Continuity Planning – The Operational Resiliency Assessment and the Resiliency Strategies provide the basis for the Business Continuity Plan. Each business organization should define its critical resources, including critical applications and their recovery time objectives, as well as the need for data protection supporting each critical process. These criteria establish the priorities for IT to ensure that the supporting infrastructures are designed and maintained to comply with each business requirement.

The Business Continuity Planning performed by IT, commonly referred to as Disaster Recovery Planning, is often performed separately and not integrated with business resiliency planning initiatives. The separation of the business and IT initiatives for business continuity and disaster recovery planning is frequently the source of unsuccessful recovery initiatives. The lack of joint planning and collaboration often results in separate agendas that don’t coincide until annual testing uncovers the need for change. Operational resiliency fixes are then glued in at the back end of the process with limited success.

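
The comparison between the recovery time objectives set by the business and what IT can actually deliver can be sketched as a simple check. This is a minimal illustration, not a prescribed method: the `CriticalApplication` record, the `rto_gaps` helper, and all figures are hypothetical assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CriticalApplication:
    name: str
    required_rto_hours: float  # recovery time objective set by the business
    achievable_hours: float    # recovery time IT has demonstrated or estimates


def rto_gaps(applications):
    """Return (name, shortfall_hours) for applications that miss their RTO, worst first."""
    gaps = [
        (app.name, app.achievable_hours - app.required_rto_hours)
        for app in applications
        if app.achievable_hours > app.required_rto_hours
    ]
    return sorted(gaps, key=lambda item: item[1], reverse=True)


# Illustrative figures only; they do not come from the article.
apps = [
    CriticalApplication("order_entry", 4, 12),
    CriticalApplication("payroll", 72, 24),
    CriticalApplication("customer_portal", 1, 6),
]

for name, shortfall in rto_gaps(apps):
    print(f"{name}: exceeds RTO by {shortfall} hours")
```

Running a check like this jointly, with the business supplying the objectives and IT supplying the achievable times, is one way to surface mismatches before annual testing does.
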
An effective business or operational continuity plan includes all aspects of operational resiliency for each critical process for every operational condition, including catastrophic interruptions. The planning should include business and IT operations and cover a broad scope of topics including emergency management, disaster assessment, recovery, testing, and business resumption. Each functional area needs to define its resources, recovery teams, and tasks supporting each phase of the recovery process. Operational process documentation needs to be available and priorities established to minimize the impact of any unplanned event.

Validation – Once the business continuity plan is formalized, it is used for training and testing, and hopefully is never needed to support an actual event. During the training process, all members of the organization should be trained on its use and have an opportunity to review content from an operational resiliency perspective. After the feedback is received, testing scenarios need to be developed to ensure that the recovery time objectives can be achieved. During the Validation Phase every recovery process should be tested and exceptions noted. There are many variations during the Validation Phase, but it is important that each phase is successfully completed, each business function is tested, and the data is recovered without loss of critical information. If any portion of the testing fails, it needs to be rerun until success is achieved.

About the Author