IT Complexity and the Risk of IT Failure

By Jon Toigo

For the past few years, the unabashed enthusiasm around clouds, virtualization and software-defined everything has led to some unfortunate side effects.  One is the suspension of disbelief around the vulnerability of all of this “new” architecture to unplanned interruption events.  Even today, it is not uncommon to read a trade press article or listen to a conference speaker claiming that disaster recovery is obsolete – replaced by the high availability architecture of the cloud era.  Nice if it were true, but it really isn’t.

A year or two back, when we were calling cloud-based computing infrastructure “Infrastructure as a Service” or IaaS, the National Institute of Standards and Technology (NIST) did a fair job of taking the hype down a few pegs.  They noted that it would be wonderful when servers were all virtualized and when networks and storage were all software-defined, and when administration and orchestration tools had been developed to the point where they could effectively automate many of the functions for provisioning hardware and software resources to workload.


FIGURE 1:  NIST model for virtualized/software-defined resource pools.  SOURCE:  The Data Management Institute LLC.  All Rights Reserved.

They were also careful to describe the reality of the situation.  Cloud actually did nothing to change the requirements for operating an enterprise-class data center.  Doing IaaS, Platform as a Service (PaaS) or Software as a Service (SaaS) required a number of layers of operations and management tasks – most of which were familiar to data center operations staff and management.


FIGURE 2:  Management and Operations tasks also need to be performed in the cloud data center.  SOURCE:  The Data Management Institute LLC.  All Rights Reserved.

And, of course, whether the data center was of the “private cloud” variety – servicing the needs of internal or corporate users solely – or a “public cloud” – providing services to numerous subscribers on a multi-tenant basis – there was still a set of overarching business management functions that needed attention.  NIST called them a “service delivery layer.”


FIGURE 3:  The service delivery layer is also required for cloud-based infrastructure delivery.  SOURCE:  The Data Management Institute LLC.  All Rights Reserved.

The point that NIST researchers were making was that not all that much had changed from a procedural or process perspective between legacy and cloud-based data centers.  Other commentators added that the most significant change was that new technology – mainly software stacks – was being introduced at a breakneck pace.  Whenever this has happened in IT history, unplanned interruption events (aka disasters) have generally become more likely, not less.

Part of the explanation, of course, is the “learning curve” confronting those who must deploy and administer the new technology.  Making new technology do what it says it does in the brochure is often a challenge.  Absent a pedigree or experiential record on which to base “best practices,” and absent any sort of agreed-upon definitions or standards, IT administrators must work their way through the technology by old-fashioned trial and error.

Today, after nearly 11 years of trying, server virtualization advocates have realized their vision of placing up to 75% of workload in virtual machines running on hypervisor software using about 29% of data center hosting hardware.  Non-virtualized workload, mainly transaction-oriented systems, continues to run on the other 71% of hardware, creating at least two distinctly separate targets for data protection and disaster recovery planning.

FIGURE 4:  Distribution of workload and computing platforms c. 2016 (a composite of several leading industry analysts).  SOURCE:  The Data Management Institute LLC.  All Rights Reserved.

The “gotcha” in this picture is the 75% of virtualized workload.  The latest surveys conducted by media firms suggest that companies are abandoning the single sourcing of hypervisor software technology.  They are diversifying their hypervisor software and using different hypervisors for different workload.  The result is the creation of yet more “silos” of hardware and software technology, especially when each hypervisor vendor has a preferred and proprietary stack of software-defined networking and software-defined storage beneath its hypervisor software “head.”

So, rather than reducing the complexity of infrastructure, new technologies for cloud-ification and software-defined infrastructure are actually adding to complexity – with potentially hazardous results.  A simple example is the inability to use software-defined storage created with one hypervisor vendor's technology to hold data from workload hosted by a different hypervisor.  We seem to have returned to many of the issues we had when Sun Microsystems and Microsoft could not share data (pre-SAMBA):  VMware will not allow data from Microsoft Hyper-V workload to be stored on VMware-controlled Virtual SAN, its flavor of software-defined storage.  By contrast, Microsoft says that its “clustered storage spaces” can store data from a VMware host, provided it (the VMware VMDK workload file) is first transformed via a software utility into a native Hyper-V workload file (a VHD).


FIGURE 5:  Storage sharing challenges in different hypervisor-controlled infrastructure.  No sharing VMware VSAN space with Microsoft Hyper-V workload.  Hyper-V storage – clustered storage spaces – can be shared, but only if the VMware workload is first converted to Hyper-V format.  SOURCE:  The Data Management Institute LLC.  All Rights Reserved.

Clearly, the technology silo-ing that is accompanying the march toward clouds is also adding complexity.  Within a single vendor’s software-defined storage technology, there tends to be a requirement for multiple nodes of storage (for high availability data replication) and also a requirement for “identicality” between nodes and between clustered server hosting environments.  Identicality is the hobgoblin of efficient recovery planning and a huge cost accelerator for business continuity.

For example, a hypervisor provides a software-defined storage (SDS) definition and software stack that is proprietary to the hypervisor vendor.  Not only can data not be replicated from the storage on this kit to the storage on a rival hypervisor vendor’s kit, it may also be difficult to replicate data between the storage nodes controlled by the same hypervisor if the components of each node are not identical.  This was traditionally a problem with monolithic legacy storage arrays, in which synchronous mirroring or asynchronous replication software functionality delivered on the array controller required an identical controller, running identical software, on identical storage media, in order to operate.  We are seeing the same thing with the SDS nodes defined by VMware, Microsoft, et al.  Only DataCore Software, and perhaps IBM with its SAN Volume Controller (if this is ever part of their SDS product offering), are currently hardware agnostic.

Another by-product of proprietary SDS, by the way, is dramatically accelerating rates of annualized storage capacity demand and cost.  In 2011, IDC said there were roughly 21.2 exabytes of external storage deployed worldwide and that capacity demand would grow by about 42% per year through 2016.  In 2014, the analyst said that the rate of capacity demand in highly virtualized environments was closer to 300% per year – owing in part to the preference among hypervisor vendors for an SDS topology requiring three identical storage nodes (minimum), each with identical components for local data replication and protection.  Gartner revised this estimate to 650% annual capacity demand growth, acknowledging that companies would field two or more identical platforms of three storage nodes each to achieve high availability.
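To get a feel for what those growth rates imply, here is a minimal sketch of the compounding arithmetic.  It rests on an assumption the analysts' figures do not spell out:  that “N% per year” means capacity multiplies by (1 + N/100) every year.  The function name growth_multiplier is hypothetical and used only for illustration.

```python
# Compound-growth sketch for the capacity figures quoted above.
# ASSUMPTION (not stated in the article): "N% growth per year" means
# capacity is multiplied by (1 + N/100) once per year.

def growth_multiplier(annual_growth_pct: float, years: int) -> float:
    """Return how many times larger capacity is after `years` of compounding."""
    return (1 + annual_growth_pct / 100) ** years

if __name__ == "__main__":
    base_eb = 21.2  # exabytes of external storage in 2011, per the IDC figure
    # ASSUMPTION: 2011 through 2016 is treated as five compounding periods.
    projected = base_eb * growth_multiplier(42, 5)
    print(f"42%/yr for 5 years: about {projected:.1f} EB")

    # Per-year multipliers implied by each quoted growth rate.
    for pct in (42, 300, 650):
        print(f"{pct}% per year -> x{growth_multiplier(pct, 1):.2f} each year")
```

Under that reading, the 42% rate takes 21.2 exabytes to roughly 122 exabytes over five years, while the 300% and 650% rates multiply capacity by 4 and 7.5 every single year.  The latter two apply to highly virtualized environments, not to the worldwide base, so they should not be applied to the 21.2 exabyte figure.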

Lack of technology pedigree and best practices, plus technology silo-ing, and identicality requirements are joined by one more factor that is increasing the complexity of IaaS environments – and by extension the risk of downtime.  The fourth problem is network dependency.

Truth be told, most public cloud services are well within the 80 kilometer radius believed by experienced disaster recovery planners to be the absolute minimum safe distance that a recovery site or off-premises backup facility should be from the primary data center or original data storage facility.  The reason is simple.  To obtain the 1 Gb per second link speed that most experts regard as minimally acceptable for delivering a “good user experience” of cloud-hosted applications, companies typically utilize high speed metropolitan area network facilities to connect their data center or user work areas to cloud service providers.  Network facilities like MPLS are available in most NFL cities and can deliver the networking bandwidth required at a reasonable cost.  To get 1 Gb/s speeds on a WAN, you need much more expensive facilities, like OC-192.

Why MPLS or OC-192?  Simple.  Moving 10 TB of data at OC-192 line rate (roughly 10 Gb/s) requires about 2.25 hours, and a little under a day at 1 Gb/s.  Moving the same quantity of data across a T-1/DS-1 facility like those typically used for Internet connectivity would require over 400 days.
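The arithmetic behind those transfer times can be sketched in a few lines.  This is an idealized estimate, not a measurement:  it assumes decimal terabytes, nominal line rates (1.544 Mb/s for T-1 and about 9.95 Gb/s for OC-192), and no protocol overhead, contention or retransmission.  The names LINE_RATES_BPS and transfer_seconds are hypothetical and used only for illustration.

```python
# Idealized transfer time: payload bits divided by nominal line rate.
# ASSUMPTIONS (not from the article): decimal terabytes (1 TB = 1e12 bytes),
# nominal line rates, and zero protocol overhead or congestion.

LINE_RATES_BPS = {
    "T-1/DS-1": 1.544e6,    # nominal T-1 rate, 1.544 Mb/s
    "1 Gb/s link": 1.0e9,   # the minimum link speed the article cites
    "OC-192": 9.95328e9,    # nominal SONET OC-192 line rate, about 10 Gb/s
}

def transfer_seconds(terabytes: float, rate_bps: float) -> float:
    """Seconds needed to move `terabytes` of data at `rate_bps` bits per second."""
    bits = terabytes * 1e12 * 8
    return bits / rate_bps

if __name__ == "__main__":
    for name, rate in LINE_RATES_BPS.items():
        secs = transfer_seconds(10.0, rate)
        print(f"{name:12s} {secs / 3600:12.2f} hours {secs / 86400:10.2f} days")
```

Under these assumptions, 10 TB takes a little over two hours at OC-192 rates and roughly 600 days on a T-1, which is consistent with the “over 400 days” figure.  Real-world throughput will be lower once overhead and congestion are counted.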

The problem is that, even with the availability of MPLS, the separation distance between the primary data center and the cloud is usually inadequate to protect against a disaster with a broad geographical footprint.  Hurricane Sandy pummeled Manhattan and forced many firms to abandon their data centers because of flooding and power outages.  The same storm caused water to encroach on a hot site facility (a commercial disaster recovery facility) in Philadelphia – 94 miles (or 151 kilometers) away.  Nervous high availability advocates observed that a storm with such a footprint was a once-in-200-year event, and reassured customers that most disasters – those that cause the most annual downtime – were logical and localized, not large-scale CNN-style events.  That might be true, but it came as no consolation to the Manhattan firms that suffered another massive weather disaster roughly nine months after the first 1-in-200-year event.

The point is that placing a redundant facility within 80 km of the primary data center does not necessarily provide much protection against a milieu-level disaster event.  Yet many companies go this route to save money and to keep distance-induced latency to a minimum.  The problem is that there is so much traffic today that network pipes are being saturated.  Even short-haul networks are being swamped by serialization, queuing and buffering delays that are collectively referred to as “jitter.”  So even with a MAN, companies are finding that challenges abound when failing over a hosting system with interdependent applications.  Invariably, the mirrored systems at the remote site or public cloud lag behind the primaries – by several transactions in the case of on-line transaction processing databases.  IBM and Duke University are doing some very public testing of this situation in an effort to surmount it.

Bottom line:  the issues raised in this article are only a high level summary of four of the leading issues that are adding complexity and risk in the cloud computing era.  Planners need to be aware of them and to develop workarounds as best they can.  Remember that virtualization, software-defined and cloud technologies are all in their comparative infancy, operating in a standards-free environment with vendors appearing on (and disappearing from) the horizon on a monthly basis.  Companies are taking a risk in adopting many of these technologies before they have been thoroughly vetted in the marketplace, so they need to maintain a clear-headed perspective of their exposures and take more (not fewer) steps to prevent avoidable disasters and to prepare strategies for coping with the vulnerabilities they cannot eliminate.

About the Author:  Jon Toigo is CEO of Toigo Partners International and a consultant who has aided over 100 companies in the development of their continuity programs.  Feel free to contact him at jtoigo@toigopartners.com.