For 2014 and the foreseeable future, trends in business continue to favor increased dependency on information technology. Truth be told, while some of the economic circumstances that accompanied the Great Recession have begun to recede, fewer staff continue to shoulder the workload once managed by a greater number of personnel – with automated systems providing the means to maintain acceptable productivity levels.
This situation translates into vulnerability. Even a comparatively brief interruption in IT services can have a calamitous impact on the business, underscoring the need for well-defined and tested continuity plans. However, at least one 2013 survey* of the state of business continuity in US and UK firms suggested that companies aren’t doing very much either to prevent avoidable disasters, or to prepare for events they can’t avoid.
According to the survey, planning is held back by a number of challenges, ranging from a lack of budget, executive support or time to the perceived inefficacy of planning in the face of fast-changing business processes, staff, and technology infrastructure. The only “trend” defined by the survey is that continuity planning itself is undergoing change – converging or merging with other resilience-related functions such as data governance or risk management in some companies, while settling in as a business unit-level responsibility rather than a centralized enterprise planning task in other firms.
This seems to dovetail with the stories we read in information technology trade press publications and in analyst literature. 2013 has seen a shift in the terminology used to describe future data centers, though not necessarily any meaningful improvement in the accompanying technologies. A few years ago, virtual servers and server hypervisors were all the rage. Then came a change in the rhetoric describing roughly the same technologies, but using the terms “public” and “private clouds” – arguably another airy bit of marketecture, substituting for practical architecture, describing various “infrastructure as a service” cobbles that, themselves, rely heavily on server virtualization. Today, fashionable tech folk speak in terms of “software-defined” networks, storage, data centers, etc.
Getting past the nonsensical nomenclature, all of these terms refer to the same idea: abstracting software functionality and management away from commodity hardware, then recentralizing functionality and management into a proprietary software layer, usually part of a stack of software advanced by a particular hypervisor vendor. Virtually all of these abstraction strategies promise significant reductions in physical hardware requirements, better utilization efficiency of the commodity hardware that is retained, improvements in availability through a combination of integral clustering-with-failover (aka high availability) capabilities, and OPEX cost savings in the form of fewer software licenses and/or IT staff.
It all sounds pretty good: low-cost IT services paid for on demand and obtained from third-party providers, so firms don’t need to build, staff and operate their own data centers. And it would be, if the claims were true. The problem is that the hype often doesn’t stand up to reality.
There are three realities that cannot be ignored as we look at contemporary technology trends and their ramifications for business continuity.
The first of these has to do with the core underlying technology for all of these concepts: server virtualization. While touted as the big fix for what ails infrastructure, virtualization doesn’t really “fix” anything, especially not at the hardware level, and may in fact hide burgeoning fault conditions from view until they result in an actual failure state.
While abstracting software-based services away from hardware might facilitate the more efficient allocation of those services to workload, improved service management does nothing to resolve the central problem that we all confront: a dearth of physical resource monitoring and management. To prevent disasters, we need management tools and knowledgeable personnel to monitor, groom and maintain the underlying physical plumbing of the infrastructure. Service management alone will not do the job.
Secondly, the reality of server virtualization schemes advanced today is that they do violence to storage infrastructure memes that have been in place for the last couple of decades. Companies have been eliminating isolated islands of infrastructure in part by consolidating all storage into Fibre Channel fabrics, sometimes called SANs. Now, virtual server purveyors insist that SANs are too inflexible and encourage a reverse course toward direct attachment of storage to each virtual server platform and the use of data mirroring and “replication” – mirroring over distance – to ensure that every server has a copy of the data required by any guest machine that might make a temporary home on that box. This movement of workload from server to server – so-called template cut and paste – is where the server hypervisor folk get their argument that they deliver high availability to IT application hosting. However, data mirroring and replication, almost always disk to disk, is the Achilles’ heel of the arrangement (and also a huge cost accelerator, because it dramatically increases the rate at which storage capacity demand grows).
Conventional array-to-array mirroring confronts many challenges well known to disaster recovery planners. For one, checking and verifying that a mirror or replica is working is so complex that checks are rarely performed. You need to quiesce the application whose data is being mirrored, flush the caches that hold data that has not yet been written to primary disk targets, replicate the primary disk target to the secondary or copy disk, then shut down the mirror/replication process long enough to check the number and state of files on the primary and replication targets. Once you have verified that the right data is being replicated and that the deltas (differences) between primary and replica files are within acceptable parameters, you need to cross your fingers and hope that everything restarts successfully. When it doesn’t, you might experience a career-limiting day.
Bottom line, mirrors are rarely checked. Moreover, most mirrors entail a lock-in to a particular vendor’s hardware kit. Hardware vendors see no advantage in enabling copying between their array and a competitor’s rig: their value-add software for replication and mirroring usually works only with the vendor’s own hardware. That drives up the cost of data protection significantly.
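To make the comparison step of that mirror check concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the quiesced primary and replica volumes are both mounted as ordinary directory trees; the names inventory, compare_mirror and max_deltas are hypothetical and do not come from any vendor's tooling. It covers only the "count and state of files" comparison, not the quiesce, cache flush or restart steps.

```python
import hashlib
from pathlib import Path


def inventory(root: Path) -> dict[str, tuple[int, str]]:
    """Map each file's relative path to (size in bytes, SHA-256 digest)."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = (path.stat().st_size, digest)
    return result


def compare_mirror(primary: Path, replica: Path, max_deltas: int = 0) -> dict:
    """Compare two directory trees and report how they differ.

    max_deltas is an assumed tolerance: the number of differing files
    a planner is willing to accept before declaring the mirror unhealthy.
    """
    src, dst = inventory(primary), inventory(replica)
    missing = sorted(set(src) - set(dst))   # on primary, absent from replica
    extra = sorted(set(dst) - set(src))     # on replica, absent from primary
    changed = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    deltas = len(missing) + len(extra) + len(changed)
    return {
        "primary_files": len(src),
        "replica_files": len(dst),
        "missing": missing,
        "extra": extra,
        "changed": changed,
        "within_tolerance": deltas <= max_deltas,
    }
```

Reading every file to hash it is slow on large volumes, which is part of why such checks tend to be skipped in practice; a real check would likely sample files or lean on array-level tooling instead.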
The third reality is that high availability clustering of virtual (or physical) servers requires several ingredients to work well. First, you need rock solid logic for failing over or transitioning workload from box A to box B. Under what circumstances will failover occur? Which host is the failover target of choice, and what happens if that host is unavailable or already maxed out in terms of resources? How will failback be accomplished once the situation is resolved? The question list can be quite lengthy. Get any of the answers wrong and failovers may not be successful, or they may become disruptive, occurring even when interruption conditions are absent – like a security alarm system on your home that sets itself off for no apparent reason.
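One of those questions, which host becomes the failover target and what happens when the preferred host is down or full, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any hypervisor's actual placement logic: the Host fields, the pick_failover_target function and the resource units are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    reachable: bool
    free_cpu: int      # spare virtual CPUs (assumed unit)
    free_mem_gb: int   # spare memory in GB (assumed unit)


def pick_failover_target(
    preferred: list[Host], need_cpu: int, need_mem_gb: int
) -> Optional[Host]:
    """Return the first host, in preference order, that is up and has room."""
    for host in preferred:
        if not host.reachable:
            continue  # preferred target is unavailable: try the next one
        if host.free_cpu >= need_cpu and host.free_mem_gb >= need_mem_gb:
            return host
    # No safe target exists: better to alert operators than to fail over blindly.
    return None
```

For example, given a preference list in which the first host is unreachable and the second lacks capacity, the function falls through to the third host; with no suitable host it returns None, which forces the planner to decide in advance what that outcome should trigger.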
Another reality of such an HA cluster strategy is that networks between clustered servers must work very well, both during a pre-disaster period when they are being used to replicate data and to transport each server’s “heartbeat” information, and when a disaster occurs and workload must be re-instantiated on the replica system. A common complaint about current hypervisors is that their local or “subnetwork” failover capability only works about 40% of the time. WAN-based failover (so-called “geo-clustering”) is extremely expensive (requiring the aforementioned duplicate gear and an ongoing WAN connection of sufficient bandwidth and throughput to support data replication) and usually requires the cobbling together of third-party remote replication software (think CA Replicator, NeverFail Group Neverfail, DoubleTake Software’s Doubletake, etc.) with server hypervisor wares.
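The dependence on heartbeat traffic is easy to see in a small sketch. The Python below is a hypothetical illustration, not the mechanism of any particular cluster product: HeartbeatMonitor, interval_s and missed_limit are assumed names, and timestamps are passed in explicitly so the logic can be followed without a real network.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per host and decide when a host looks dead."""

    def __init__(self, interval_s: float, missed_limit: int):
        self.interval_s = interval_s      # expected seconds between heartbeats
        self.missed_limit = missed_limit  # silent intervals tolerated before failover
        self.last_seen: dict[str, float] = {}

    def record(self, host: str, now: float) -> None:
        """Note that a heartbeat from host arrived at time now."""
        self.last_seen[host] = now

    def intervals_elapsed(self, host: str, now: float) -> int:
        """Whole heartbeat intervals that have passed since the last beat."""
        return int((now - self.last_seen[host]) // self.interval_s)

    def should_fail_over(self, host: str, now: float) -> bool:
        """True once the host has been silent for missed_limit intervals."""
        return self.intervals_elapsed(host, now) >= self.missed_limit
```

Note that the monitor cannot tell a dead server from a congested or broken link: if the network drops heartbeats for long enough, should_fail_over returns True for a perfectly healthy host. Setting missed_limit too low invites the false alarms described above, while setting it too high lengthens the outage before a genuine failover begins.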
High availability architecture doesn’t trump disaster recovery, despite the brochure-ware from the hypervisor vendors. HA has always been part of the spectrum of recovery strategies, not appropriate for all applications and data and usually more expensive than all other approaches.
These realities do not eliminate all of the advantages enumerated by the server virtualization crowd, but they must be taken into account when considering the claim that new technology tools are changing the rules of traditional business continuity. From what we are seeing in the field, the new technology tools aren’t changing much of anything from a continuity perspective, but they are making recovery in many cases a more daunting task.
About the Author
Jon Toigo is CEO of Toigo Partners International and a consultant who has aided over 100 companies in the development of their continuity programs. Feel free to contact him at firstname.lastname@example.org.