By Kelly Jones, PhD
As data and applications have become critical to an organization's
ability to operate, companies have invested heavily to protect these assets.
One technology designed to protect both data and applications is replication.
However, despite millions of dollars invested in them, replication solutions
often fail in practice, when they are needed most. This paper analyzes
the key points of failure and suggests how to avoid them.
Simply stated, replication involves creating a second copy
of data as a means of protecting that data. Replication can be leveraged
to ensure high availability of mission-critical business applications.
This paper focuses on replication applied to the problem of delivering
high availability. This scenario is one of the most challenging and complex
uses of replication but, when properly implemented, it yields high levels
of continuity for mission-critical applications.
When used for high availability, replication is performed
between two different locations, rather than within the same location.
Using multiple locations reduces the risk associated with facility and
regional disasters. As the problem focus is delivering high availability,
replication is performed continuously rather than in batch mode. Continuous
replication maintains the most consistent copy of data without the gaps
that occur with a batch approach.
Replication is sometimes confused with clustering, which
involves using two or more storage devices at the same location to simultaneously
receive the same data stream. This provides two identical copies of the
data and mitigates the risk of a disk failure. It does not address other
issues associated with high availability. This paper does not discuss
clustering or compare clustering to replication.
Replication Solution Overview
A company has many options for protecting data, including tape backups,
disk mirroring, electronic vaulting, and data replication. Data replication
is a powerful alternative to other techniques. When properly installed
and managed, data replication provides more than data protection: it enables
application availability.
There are many ways to deploy a replication solution to
safeguard against a company's most pressing protection risks. Solutions
can be deployed locally or remotely, and synchronously or asynchronously.
The multitude of options allows a company to deploy an optimal solution
based on its need, but creates complexity both in deployment and manageability.
This complexity is the root of many challenges associated with replication
and often a catalyst of failure.
Data Protection Versus High Availability
When evaluating replication options, it is important to understand the
specific reasons for the solution. Two primary reasons for implementing
a replication solution are data protection and high availability.
Data protection is primarily focused on data integrity,
ensuring that the data and all changes to that data are captured in a
backup copy. Immediate access and continued use of the data are not top
priorities. For example, it is better to ensure 100-percent data accuracy
with medical records than it is to ensure 100-percent real-time access.
Inaccurate prescription data may result in loss of life, while a delay
in refilling a prescription will only result in a patient becoming annoyed.
High availability, on the other hand, is concerned with
three attributes: data integrity, immediate access, and continued application
use. A financial clearinghouse needs to ensure its data is accurate, as
trade commitments are legally binding. Furthermore, trades need to execute
in a timely fashion or else the clearinghouse incurs financial risks.
Satisfying the additional need for data access and continued application
use requires continuity-driven replication solutions to provide failover
and failback capabilities.
A failover is an action initiated by IT personnel to move
a secondary application environment from a backup role to a primary role.
Usually a failover occurs when a problem develops that prevents the primary
environment from properly functioning. The secondary environment starts
providing end users with the services originally provided by the primary
environment. The data that was replicated to the secondary environment
is now put to use. Failback is similar to failover, but returns control
back to the primary environment once it is restored to a proper working
state. These capabilities turn out to be far more complex and challenging
than basic data replication.
Further complicating the technical challenges associated
with failover and failback is that high availability solutions must hedge
against regional disasters and outages. This requires the deployment of
the high availability replication solution in a geographically remote
backup environment, ideally more than 20 miles away. Distance introduces
the complexity of bandwidth, latency, and cost constraints, factors that
are negligible over a LAN.
Using Replication to Provide High Availability
Understanding replication and its prominent place in delivering high availability
for mission-critical applications requires a deeper look at the key elements
that enable high availability, particularly across a geographically diverse
or complex IT environment.
Data Replication - the set of processes that collect
and move data from one source to a second source. Data replication includes
a process to watch a data repository for new and changed data, which it
then duplicates and places in a replication queue. The replication queue
stores and manages the pending data transactions, which are then replicated
to the secondary data repository. A process monitors and verifies that
the transaction arrived safely at the destination. The backup data provides
a source safe from many issues that can affect the primary data source.
This component provides sufficient coverage to protect data sources when
the data is not supplied to applications that require near continuous
uptime.
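The watch, queue, replicate, and verify cycle described above can be
sketched in a few lines of Python. This is a minimal in-memory illustration
under stated assumptions, not any vendor's implementation: the class name
ReplicationQueue and its methods are hypothetical, and plain dictionaries
stand in for the primary and secondary data repositories.

```python
from collections import deque


class ReplicationQueue:
    """Toy model of the capture -> queue -> apply -> verify cycle."""

    def __init__(self):
        self.primary = {}        # stands in for the primary data repository
        self.secondary = {}      # stands in for the secondary repository
        self.pending = deque()   # replication queue of (key, value) changes

    def write(self, key, value):
        # Every change to the primary is captured and queued for replication.
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate_once(self):
        # Apply the oldest pending change to the secondary, then verify it.
        if not self.pending:
            return False
        key, value = self.pending[0]
        self.secondary[key] = value
        # In a real deployment this check would be an acknowledgment from
        # the remote site; here it is a simple read-back comparison.
        if self.secondary.get(key) != value:
            raise RuntimeError(f"verification failed for {key!r}")
        self.pending.popleft()   # dequeue only after verification succeeds
        return True

    def drain(self):
        # Replicate until the queue is empty; return the number of changes.
        count = 0
        while self.replicate_once():
            count += 1
        return count
```

Leaving a change in the queue until it has been verified means an
interrupted transfer is retried rather than silently lost, which is the
property the verification process exists to provide.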
Failover - the set of processes that switch a secondary
data source from its backup role to a primary role. Often, this is tightly
integrated with other processes that redirect users to an alternative
set of application servers. Depending on the application, further security
processes and control logic are required. Assuming that all systems are
synchronized, all software versions and revision levels are consistent,
the systems have been tested, and personnel are standing by at both data
centers, the failover can proceed smoothly. During a smooth failover process,
users have near-continuous access to their critical applications with
minimal interruption.
Failback - involves returning operational control
back to the original system. In many instances, a company's primary environment
is more robust or better situated to deliver superior end user performance,
benefiting from greater bandwidth or a concentration of local users. For
these performance or security reasons, a systems administrator will seek
to fail back to the primary system as soon as possible once the primary
system has been restored to a fully functional state. Failback is similar
in many aspects to failover, but adds steps and complexity
to account for resynchronizing the two systems. This involves recovering
the primary system to a working state and reestablishing data integrity,
as the two data sources have become out of sync since the initial failover.
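The resynchronization step can be illustrated with a small Python sketch.
It assumes, for illustration only, that each data source can be viewed as
a key/value mapping and that the secondary holds the authoritative copy
after the failover. The function names resync_plan and apply_resync are
hypothetical.

```python
def resync_plan(primary, secondary):
    """Work out what must change on the primary before failback.

    The secondary is treated as authoritative because it has been
    serving users since the failover.
    """
    # Records that are new or changed on the secondary must be copied back.
    to_copy = {
        key: value
        for key, value in secondary.items()
        if key not in primary or primary[key] != value
    }
    # Records that exist only on the primary were deleted after failover.
    to_delete = sorted(key for key in primary if key not in secondary)
    return to_copy, to_delete


def apply_resync(primary, to_copy, to_delete):
    # Bring the primary back in line with the secondary.
    for key in to_delete:
        del primary[key]
    primary.update(to_copy)
```

Only after the plan has been applied and the two copies match is it safe
to return control to the primary system.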
Replication Shortcomings
Although replication solutions promise risk mitigation, higher
availability, and data protection, the attempted use of these solutions
often results in frustration and wasted time and money for a number of
reasons.
First, installation alone can be a deterrent. Companies
frequently have so many issues associated with installation or management
of a complex solution that they abort their implementation and "shelve"
the replication software.
Once the software is installed, many companies find replication manageable.
The bigger challenge, however, is encountered in the event of a large-scale
systems failure or disaster when attempts are made to activate replication
solutions to maintain application availability and recover. Failover processes
are difficult to test and don't always work as expected.
Accounting for all permutations of failover requires extensive
process mapping across the entire IT environment that supports the application.
The expertise required to perform this mapping is difficult to find and
rarely exists within one group in a company, or even throughout the entire
company. Failure to adequately address all possible failover scenarios
creates risk and possibly a false sense of security.
Top 5 Causes of Replication Failure
An examination of the top five causes of replication failure provides
insight into the challenges a company faces when seeking to successfully
implement a replication solution, as well as a useful checklist for benchmarking
purposes.
1. Secondary Environment Not Ready for Failover
A company establishing a secondary environment to provide superior protection
of its most important applications and data must address many challenges.
One of the biggest challenges is maintaining two nearly identical environments.
All software, patches, and access levels need to be consistent. Over time,
many changes that are made to one system fail to get implemented on the
other system. These differences often result in a secondary environment
that is out of sync with the primary environment to the point where failover
cannot occur.
Another factor impacting failover readiness in the secondary
environment is that critical processes do not function properly. Most
organizations do not have clear visibility into the readiness of their
systems for failover, and are surprised by one of the following problems:
- Replication is not performing normally
- The replication queue is too large
- The secondary environment is not healthy
- Primary/secondary environment software and configurations are out
of sync
- Dependent systems are not designed for failover
It is critical for a company to take the time necessary
to develop processes to control the introduction and distribution of changes
and updates to both environments. One undetected change can cause hours-
or days-long delays in service resumption. It is equally important to
monitor all critical processes that impact the readiness of the secondary
environment, as early problem detection ensures system readiness when
a situation demands use of the secondary system.
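The readiness problems listed above lend themselves to an automated check.
The following Python sketch is illustrative only: the EnvironmentStatus
fields, the readiness_problems function, and the default queue threshold
of 1000 are assumptions made for this example, and a real monitor would
gather these values from the systems themselves.

```python
from dataclasses import dataclass


@dataclass
class EnvironmentStatus:
    replication_running: bool
    queue_depth: int               # changes waiting in the replication queue
    secondary_healthy: bool
    primary_versions: dict         # component name -> version on primary
    secondary_versions: dict       # component name -> version on secondary
    dependents_failover_ready: bool


def readiness_problems(status, max_queue_depth=1000):
    """Return a list of reasons the secondary is not ready for failover."""
    problems = []
    if not status.replication_running:
        problems.append("replication is not performing normally")
    if status.queue_depth > max_queue_depth:
        problems.append(f"replication queue too large ({status.queue_depth})")
    if not status.secondary_healthy:
        problems.append("secondary environment is not healthy")
    # Any component whose version differs, or that exists on one side only,
    # counts as configuration drift.
    names = set(status.primary_versions) | set(status.secondary_versions)
    drift = sorted(
        name for name in names
        if status.primary_versions.get(name)
        != status.secondary_versions.get(name)
    )
    if drift:
        problems.append("software out of sync: " + ", ".join(drift))
    if not status.dependents_failover_ready:
        problems.append("dependent systems are not designed for failover")
    return problems
```

An empty list means none of the five problems was detected. Running such
a check on a schedule gives the early detection described above.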
2. Manual Error in Failover Process
People are often the weak link in a failover process. Manual errors introduced
during a failover sequence can corrupt the entire process. This results
in a need to fix the problems resulting from the error and further requires
the entire process to be restarted.
People are more prone to make mistakes during a crisis.
The more complex and step-intensive the failover process, the more likely
mistakes will occur and result in failure. As an example, it takes 350
steps to fail over a 10-server Microsoft Exchange environment. Any single
human error in that sequence will break the failover process. A few examples
of mistakes include:
- Missed process steps
- Steps executed out of sequence
- Process initiated before dependent steps have fully executed
- Misjudged state of the primary or secondary environment
- Typing error
- Steps out-of-date for current software versions
Often, these steps must be performed under pressure or over
a remote connection. The steps required to remediate errors introduced
during a failover are complex, often uncharted, and should be avoided
at all cost.
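Several of the mistakes listed above (missed steps, steps executed out of
sequence, and steps started before their prerequisites have finished) are
ordering errors that a scripted runbook can rule out. The Python sketch
below is one possible approach, not a description of any product: the
run_failover function and the step names in the usage example are
hypothetical, and it relies on graphlib from the standard library
(Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_failover(steps, depends_on):
    """Run every step exactly once, in dependency order.

    steps:      step name -> callable returning True on success
    depends_on: step name -> set of step names that must finish first
    """
    unknown = {d for deps in depends_on.values() for d in deps} - set(steps)
    if unknown:
        raise ValueError(f"undefined prerequisite steps: {sorted(unknown)}")
    graph = {name: depends_on.get(name, set()) for name in steps}
    # static_order() yields each step only after all of its prerequisites.
    order = list(TopologicalSorter(graph).static_order())
    completed = []
    for name in order:
        if not steps[name]():
            # Stop at the first failure instead of continuing blindly.
            raise RuntimeError(f"step {name!r} failed after {completed}")
        completed.append(name)
    return completed
```

A hypothetical three-step runbook might be declared as follows:

```python
steps = {
    "redirect_users": lambda: True,
    "promote_secondary": lambda: True,
    "stop_replication": lambda: True,
}
depends_on = {
    "promote_secondary": {"stop_replication"},
    "redirect_users": {"promote_secondary"},
}
run_failover(steps, depends_on)
```

Because the order is derived from the declared dependencies, the order in
which the steps happen to be written down no longer matters.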
3. Experts Not Available During Crisis
Failover processes are dependent on expert staff who may not be available
during a crisis. The failover process touches upon all technical disciplines,
from hardware and operating systems, to applications and databases, to
networking and security. In a large organization, these disciplines are
highly specialized with different personnel responsible for each. If any
one person is unavailable, the failover process can break down. There
are many reasons people are not available, including:
- Occupied with other crisis efforts
- Physically unable to access facilities
- Can't be contacted during an emergency
- No internet access available to control failover
As part of the process to develop and deploy a failover
solution, it is important to establish a list of required skills and resources
available for each skill. Full contact information and predetermined communications
protocols need to be created, continuously updated, and readily available
to all team members. These steps will aid recovery efforts and mitigate
some personnel risks.
4. Failover Process Unable to Scale
In large organizations, the scale of a failover or recovery effort can
become a critical bottleneck. The technical staff is limited, often managing
large numbers of systems. A critical application failure or a facility
problem can result in dozens or more systems that require failover. Multi-server
failover is a serial, manual process. The technical staff comes under
tremendous pressure to rapidly restore service. Under these conditions,
scaling issues are likely, such as:
- One administrator can fail over only one server at a time
- An environment of 25 or more servers will take one administrator 10
or more hours to fail over or fail back
- Many mutually dependent systems will not work until the entire environment
has failed over
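The arithmetic behind these figures is simple to work through. Taking the
figures of 25 servers and 10 hours at face value gives roughly 24 minutes
per server. The Python sketch below estimates total time under the
simplifying assumptions that every server takes the same time and that
administrators can work on different servers independently; the function
name failover_hours is hypothetical.

```python
import math


def failover_hours(servers, minutes_per_server, administrators=1):
    """Estimate elapsed hours when each administrator handles one server
    at a time and all servers take equally long."""
    # Servers are processed in rounds, one server per administrator.
    rounds = math.ceil(servers / administrators)
    return rounds * minutes_per_server / 60


print(failover_hours(25, 24))      # one administrator: 10.0 hours
print(failover_hours(25, 24, 5))   # five administrators: 2.0 hours
```

Adding staff shortens the estimate only if the servers really can be
failed over independently. Where systems are mutually dependent, service
is not restored until the last server in the environment has completed.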
5. Untested Failover Assumptions Don't Work
Complex multi-server failover is often too sensitive to fully test. Without
testing every permutation of systems and failure causes, it's impossible
to know exactly what will happen during a real crisis. Issues that contribute
to failover breakdowns include:
- Large, complex environments can have many failure types and scenarios
- Multi-server failover involves constantly changing conditions
- Server-by-server failover is very different from a holistic failover
of the entire environment
- Different failures result in different behaviors
Some methods available to mitigate risk include incorporating
"what-if" scenario planning sessions and "pre-mortems," a form of role
playing that allows a technical staff to identify untested failover scenarios
and potential bottlenecks.
The Impact of Replication Failure
Understanding the impact of replication failure is essential to making
informed business decisions about replication investments. The following
provides a summary of the key issues associated with replication failure.
Replication failures have long recovery times. Complex
recovery processes take time to implement. Restoring an entire environment
requires significant skills. The availability of critical resources and
personnel becomes a bottleneck. Often, there are periodic delays during
the failover process as key skills are not available. If large amounts
of data must be moved as part of a recovery effort, bandwidth constraints
can further prolong the recovery time.
Replication failures can cause other problems. Replication
failures can have widespread impact. Database corruption can occur, requiring
substantial efforts and time to restore a database to a functional state.
Even when the databases are not corrupted, effort is required to determine
how much data was lost during the failure and to recover the missing transactions.
Replication failures have a high cost. Replication
is almost exclusively used for critical systems. Data in these systems
is important enough to protect and therefore to recover. The time and
effort required to recover the data, returning it to a usable state, is
substantial. Replication failure forces application downtime, resulting
in negative economic consequences: lost revenue, missed opportunities,
degraded customer satisfaction, and declines in shareholder value. Any
replication failure will prove costly.
To Succeed, Eliminate the Risks
Replication solutions represent a powerful way to protect a company's
most important data and applications. But replication solutions, particularly
those that are used to provide failover and failback for high availability
environments, are fraught with risks, often resulting in a low success
rate for companies attempting to implement these solutions. The key to
successfully implementing a replication solution is to understand all
the risks and eliminate as many as possible. Planners should start with
the risks identified above in the top 5 reasons for failure. Through careful
evaluation of approaches, they can select a solution that matches the
best approach for the company.
About the Author
Kelly Jones, PhD, is Vice President of Technology
at MessageOne, responsible for bringing new industry-leading replication
and failover technologies to the market. Dr. Jones joined MessageOne from
Evergreen Assurance, a provider of application availability and disaster
recovery software, where he served as Senior Vice President, Technology
Development and Client Operations. Prior to Evergreen Assurance, Jones
founded Panacya, a successful provider of systems management and monitoring
software. Jones earned his doctoral degree from Texas A&M University.
Dr. Jones can be reached at kelly_jones@messageone.com
or at (512) 652-4500.