Avoiding an IT Crisis

As data and applications have become critical to an organization's ability to operate, companies have invested heavily to protect these assets. One technology designed to protect both data and applications is replication. However, despite millions of dollars invested in them, replication solutions often fail in practice, when they are needed most. This paper analyzes the key points of failure and suggests how to avoid them.

Simply stated, replication involves creating a second copy of data as a means of protecting that data. Replication can be leveraged to ensure high availability of mission-critical business applications. This paper focuses on replication applied to the problem of delivering high availability. This scenario is one of the most challenging and complex uses of replication, but when properly implemented, yields high levels of continuity for mission-critical applications.

When used for high availability, replication is performed between two different locations, rather than within the same location. Using multiple locations reduces the risk associated with facility and regional disasters. As the problem focus is delivering high availability, replication is performed continuously rather than in batch mode. Continuous replication maintains the most consistent copy of data without the gaps that occur with a batch approach.

Replication is sometimes confused with clustering, which involves using two or more storage devices at the same location to simultaneously receive the same data stream. This provides two identical copies of the data and mitigates the risk of a disk failure. It does not address other issues associated with high availability. This paper does not discuss clustering or compare clustering to replication.

Replication Solution Overview
There are many ways to deploy a replication solution to safeguard against a company's most pressing protection risks. Solutions can be deployed locally or remotely, and can operate synchronously or asynchronously. The multitude of options allows a company to deploy an optimal solution based on its needs, but creates complexity in both deployment and manageability. This complexity is the root of many challenges associated with replication and often a catalyst for failure.

Data Protection Versus High Availability
Data protection is primarily focused on data integrity, ensuring that the data and all changes to that data are captured in a backup copy. Immediate access and continued use of the data are not top priorities. For example, it is better to ensure 100-percent data accuracy with medical records than it is to ensure 100-percent real-time access. Inaccurate prescription data may result in loss of life, while a delay in refilling a prescription will only result in a patient becoming annoyed.

High availability, on the other hand, is concerned with three attributes: data integrity, immediate access, and continued application use. A financial clearinghouse needs to ensure its data is accurate, as trade commitments are legally binding. Furthermore, trades need to execute in a timely fashion or else the clearinghouse incurs financial risks. Satisfying the additional need for data access and continued application use requires continuity-driven replication solutions to provide failover and failback capabilities.

A failover is an action initiated by IT personnel to move a secondary application environment from a backup role to a primary role. Usually a failover occurs when a problem develops that prevents the primary environment from functioning properly. The secondary environment starts providing end users with the services originally provided by the primary environment. The data that was replicated to the secondary environment is now put to use. Failback is similar to failover, but returns control to the primary environment once it is restored to a proper working state. These capabilities turn out to be far more complex and challenging than basic data replication.

Further complicating the technical challenges associated with failover and failback is that high availability solutions must hedge against regional disasters and outages.
This requires the deployment of the high availability replication solution in a geographically remote backup environment, ideally more than 20 miles away. Distance introduces bandwidth, latency, and cost constraints, factors that are negligible over a LAN.

Using Replication to Provide High Availability

Data Replication - the set of processes that collect and move data from a primary repository to a secondary repository. Data replication includes a process to watch a data repository for new and changed data, which it then duplicates and places in a replication queue. The replication queue stores and manages the pending data transactions, which are then replicated to the secondary data repository. A process monitors and verifies that each transaction arrived safely at the destination. The backup data provides a source safe from many issues that can affect the primary data source. On its own, this component is sufficient to protect data that does not feed applications requiring near-continuous uptime.

Failover - the set of processes that switch a secondary data source from its backup role to a primary role. Often, this is tightly integrated with other processes that redirect users to an alternative set of application servers. Depending on the application, further security processes and control logic are required. Assuming that all systems are synchronized, all software versions and revision levels are consistent, the systems have been tested, and personnel are standing by at both data centers, the failover can proceed smoothly. During a smooth failover process, users have near-continuous access to their critical applications with minimal interruption.

Failback - the set of processes that return operational control to the original system. In many instances, a company's primary environment is more robust or better situated to deliver superior end user performance, benefiting from greater bandwidth or a concentration of local users. For these performance or security reasons, a systems administrator will seek to fail back to the primary system as soon as possible once the primary system has been restored to a fully functional state.
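The data replication component defined above (capture changes, queue them, copy them to the secondary repository, verify arrival) can be sketched in a few lines of Python. This is only an illustrative model, not any vendor's implementation: the `Replicator` class, its method names, the in-memory dictionaries standing in for the two repositories, and the SHA-256 checksum used for verification are all assumptions made for this example. Changes are captured at write time rather than by a separate watcher process.

```python
import hashlib
from collections import deque


def checksum(value: str) -> str:
    """Return a digest used to verify that a change arrived intact."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


class Replicator:
    """Toy continuous replicator between two in-memory key/value stores."""

    def __init__(self):
        self.primary = {}      # stands in for the primary data repository
        self.secondary = {}    # stands in for the secondary (backup) repository
        self.queue = deque()   # replication queue of pending changes
        self.unverified = []   # keys whose replicated copy failed verification

    def write(self, key: str, value: str) -> None:
        """Apply a change to the primary and place a copy in the queue."""
        self.primary[key] = value
        self.queue.append((key, value, checksum(value)))

    def drain(self) -> int:
        """Replicate queued changes in order; return how many were verified."""
        verified = 0
        while self.queue:
            key, value, expected = self.queue.popleft()
            self.secondary[key] = value
            # Verify the copy at the destination against the source checksum.
            if checksum(self.secondary[key]) == expected:
                verified += 1
            else:
                self.unverified.append(key)
        return verified

    def in_sync(self) -> bool:
        """True when nothing is pending or unverified and both stores match."""
        return (not self.queue
                and not self.unverified
                and self.primary == self.secondary)


replicator = Replicator()
replicator.write("order-1", "pending")
replicator.write("order-1", "shipped")
replicator.drain()
```

A non-empty queue in this model corresponds to the data that would be missing from the secondary repository if the primary were lost at that moment, which is why checking something like `in_sync()` before a failover matters.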
Failback is similar in many respects to failover, but adds steps and complexity to account for resynchronizing the two systems. This involves recovering the primary system to a working state and reestablishing data integrity, because the two data sources have drifted out of sync since the initial failover.

Replication Shortcomings
First, installation alone can be a deterrent. Companies frequently have so many issues associated with installation or management of a complex solution that they abort their implementation and "shelve" the replication software.

Once installed, many companies find replication manageable. The bigger challenge, however, is encountered in the event of a large-scale systems failure or disaster, when attempts are made to activate replication solutions to maintain application availability and recover. Failover processes are difficult to test and don't always work as expected. Accounting for all permutations of failover requires extensive process mapping across the entire IT environment that supports the application. The expertise required to perform this mapping is difficult to find and rarely exists within one group in a company, or even throughout the entire company. Failure to adequately address all possible failover scenarios creates risk and possibly a false sense of security.

Top 5 Causes of Replication Failure
1. Secondary Environment Not Ready for Failover

Another factor impacting failover readiness in the secondary environment is that critical processes do not function properly. Most organizations do not have clear visibility into the readiness of their systems for failover, and are surprised by one of the following problems:
It is critical for a company to take the time necessary to develop processes to control the introduction and distribution of changes and updates to both environments. One undetected change can cause hours- or days-long delays in service resumption. It is equally important to monitor all critical processes that impact the readiness of the secondary environment, as early problem detection ensures system readiness when a situation demands use of the secondary system.

2. Manual Error in Failover Process

People are more prone to make mistakes during a crisis. The more complex and step-intensive the failover process, the more likely it is that mistakes will occur and result in failure. As an example, it takes 350 steps to fail over a 10-server Microsoft Exchange environment. Any single human error in that sequence will break the failover process. A few examples of mistakes include:
Often, these steps must be performed under pressure or over a remote connection. The steps required to remediate errors introduced during a failover are complex and often uncharted, so such errors should be avoided at all costs.

3. Experts Not Available During Crisis
As part of the process to develop and deploy a failover solution, it is important to establish a list of required skills and resources available for each skill. Full contact information and predetermined communications protocols need to be created, continuously updated, and readily available to all team members. These steps will aid recovery efforts and mitigate some personnel risks.

4. Failover Process Unable to Scale
5. Untested Failover Assumptions Don't Work
Some methods available to mitigate risk include incorporating "what-if" scenario planning sessions and "pre-mortems," a form of role playing that allows technical staff to identify untested failover scenarios and potential bottlenecks.

The Impact of Replication Failure
Replication failures have long recovery times. Complex recovery processes take time to implement. Restoring an entire environment requires significant skills. The availability of critical resources and personnel becomes a bottleneck. Often, there are periodic delays during the failover process because key skills are not available. If large amounts of data must be moved as part of a recovery effort, bandwidth constraints can further prolong the recovery time.

Replication failures can cause other problems. Replication failures can have widespread impact. Database corruption can occur, requiring substantial effort and time to restore a database to a functional state. Even when the databases are not corrupted, effort is required to determine how much data was lost during the failure and to recover the missing transactions.

Replication failures have a high cost. Replication is almost exclusively used for critical systems. Data in these systems is important enough to protect and therefore to recover. The time and effort required to recover the data and return it to a usable state is substantial. Replication failure forces application downtime, resulting in negative economic consequences: lost revenue, missed opportunities, degraded customer satisfaction, and declines in shareholder value. Any replication failure will prove costly.

To Succeed, Eliminate the Risks
About the Author