If you want to send chills up the spine of cloud vendors and consumers alike, just say the word Nirvanix. Once declared a comer in the “new” world of infrastructure (storage, backup, archive...) as a service, Nirvanix garnered considerable venture capital ($70 million) and trade press. It was counted among the top three or four providers of on-demand storage capacity, alongside Google and Amazon Web Services offerings – a firm with all the right connections and a bright outlook going forward. Even IBM was said to be eyeing the company for possible acquisition.
Many firms used the service provided by Nirvanix, from NBC Universal and Fox News, to a host of other large and small company clients, until several tens of petabytes of data found their home in the Nirvanix cloud. The storage didn’t amass in the Nirvanix infrastructure all at once. It was uploaded over time and over WAN connections ranging from Internet speed links to somewhat bigger pipes. Once received, the data was spread out over many thousands of disk drives arranged in proprietary configurations that represented to Nirvanix its secret sauce for deriving profit from economies of scale.
Even to skeptics, the Nirvanix model looked profitable and practicable. More than any technology, that accounts for why so much data made its way into the service and why little heed was paid to one niggling concern that should have been foremost for all customers: how would you get the data back from Nirvanix if any of the hundreds of easily imagined disaster scenarios came to pass?
The consensus of onlookers is that Nirvanix’ failure proved to be a financial one. IBM bought a different firm, causing investors to rethink the vendor’s future. When money dried up, so did the company’s ability to continue operations. In an undignified end, the proprietors had to tell their clients that they had about two weeks to retrieve their data – and no way whatsoever to accomplish that task!
For one thing, there is no tape. Like most hard-disk-for-everything ideologues, Nirvanix made no provisions for tape storage in their kit. Most of its rivals share this “tape is dead” view, by the way – a point that must not escape those who are considering parking their data, or even their archives or backups, in the “great disk drive in the sky.”
For another, there is simply no public WAN service capable of delivering the speeds and feeds required to move a lot of data over any distance in anything like an acceptable timeframe – most certainly not within a week or two. Some clients had tens of petabytes of data at Nirvanix – that’s tens of thousands of terabytes. Moving just 10 terabytes over a T-1 link takes more than one year; doing so over OC-192 takes about four hours, assuming you can get access to an OC-192 and can manage the fees for the “if you have to ask the price you probably can’t afford it” WAN service.
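The arithmetic behind such estimates is easy to check. The sketch below divides payload size by nominal line rate; it assumes decimal terabytes and zero protocol overhead, neither of which comes from the figures quoted above, so the results are best-case floors rather than predictions. The function and table names are illustrative.

```python
# Best-case transfer time: payload bits divided by nominal line rate.
# Assumptions (illustrative): 1 TB = 10**12 bytes, no framing or protocol
# overhead, no retransmissions, and the whole link to yourself.

LINE_RATE_BPS = {
    "T-1": 1.544e6,        # nominal T-1 rate in bits per second
    "OC-3": 155.52e6,      # nominal OC-3 rate
    "OC-192": 9953.28e6,   # nominal OC-192 rate
}


def transfer_seconds(terabytes: float, bits_per_second: float) -> float:
    """Return the idealized number of seconds to move `terabytes` of data."""
    bits = terabytes * 1e12 * 8
    return bits / bits_per_second


def describe(seconds: float) -> str:
    """Format a duration as hours or days, whichever reads better."""
    if seconds >= 86400:
        return f"{seconds / 86400:,.1f} days"
    return f"{seconds / 3600:,.1f} hours"


if __name__ == "__main__":
    for payload_tb in (10, 1000):  # 10 TB and 1 PB
        for name, rate in LINE_RATE_BPS.items():
            print(f"{payload_tb:>5} TB over {name:<6}: "
                  f"{describe(transfer_seconds(payload_tb, rate))}")
```

Even at the full nominal OC-192 rate, a single petabyte needs more than nine days in this idealized model, and real links deliver less than their nominal rate once overhead and contention are counted.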
Getting data into a cloud doesn’t take very much bandwidth for most firms, and that accounts for the appeal to companies large or small. But many fail to think the problems of access, gets, and puts through to their logical conclusion. Here are some WAN basics that need to be closely considered when looking at WAN-centric data storage generally and “cloud-based” data protection or DR as a service offerings in particular.
First, to use WAN-based disk to disk replication as a data protection technique, it is important to ensure that there is adequate physical separation (aka distance) between the primary or production copy of data and the remote or safety copy of same. The separation must be sufficient to ensure that both copies are not compromised by the same disaster that impacts the primary.
Opinions vary regarding what the minimum safe distance should be. Following the tragedy of 9-11, some analysts considered another terrorist scenario – a dirty bomb whose radioactive effluent might contaminate a much broader geographical area and cause much greater damage than hijacked aircraft used as missiles. Their recommendation for the New York financial district was to build redundant data centers to shadow its mostly New Jersey-based IT at least 80 kilometers away. The 80 km distance was thought to be beyond the contamination zone of a low yield dirty bomb (conventional explosives strapped to radioactive material) on a windy day.
Surveys of DR planners around the time of Hurricane Katrina resulted in another rule of thumb: separate data copies by at least 100 kilometers, which was viewed as the diameter of numerous Category 4 and 5 hurricanes in recent years. (A mere Category 1 hurricane, the so-called “Superstorm” Sandy of 2012, had an impact area roughly 1,000 miles across.)
The point is that the safety copy of data needs to be replicated and staged at considerable distance from the original. So-called SONET ring networks or MPLS WANs, services which now bless most NFL cities with more affordable broadband pipes, simply do not offer the physical separation required to meet this distance requirement. That should set off an alarm bell somewhere in the planner’s head.
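A planner can test a candidate site pair against these rules of thumb with a great-circle distance calculation. The sketch below uses the standard haversine formula; the coordinates are approximate and chosen only for illustration, and the function names are hypothetical. Note that it measures straight-line separation, not the length of the WAN route between the sites.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Rules of thumb cited in the text above.
MIN_SEPARATION_KM = {"dirty bomb": 80.0, "hurricane": 100.0}


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def separation_report(primary, secondary):
    """Return the distance and which rules of thumb the site pair satisfies."""
    km = great_circle_km(*primary, *secondary)
    return km, {rule: km >= minimum for rule, minimum in MIN_SEPARATION_KM.items()}


if __name__ == "__main__":
    # Approximate, illustrative coordinates only.
    lower_manhattan = (40.71, -74.01)
    newark = (40.74, -74.17)
    philadelphia = (39.95, -75.17)
    for label, site in (("Newark", newark), ("Philadelphia", philadelphia)):
        km, verdict = separation_report(lower_manhattan, site)
        print(f"{label}: {km:.0f} km, meets rules: {verdict}")
```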
Second, deploying replication over an adequately dispersed WAN link introduces several technically non-trivial issues. First of these is latency.
According to Einstein, you can’t push data along a wire (or through a glass fiber) faster than the speed of light. So, some latency will accrue to data moving over distance. One rule of thumb holds that every 18 km of distance traversed by data is equivalent to one full traverse of a read-write head on a 3.5 inch disk drive platter. That’s a fancy way of saying that deltas – differences in the data at the source from the data at the target device – accrue the further that the data must travel. This difference in data states may be inconsequential or it may mean the difference between recovery and failure.
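The physical floor on latency can be estimated directly. The sketch below assumes that light in glass fiber travels at roughly two-thirds of its vacuum speed, about 200,000 km per second; that figure, and the simple model of one acknowledged write per round trip, are assumptions made for illustration and ignore switching, queuing and protocol delays entirely.

```python
# Best-case propagation delay over fiber, ignoring every other source of delay.
# Assumption: signal speed in glass fiber is roughly 200,000 km/s.
SPEED_IN_FIBER_KM_PER_S = 200_000.0


def one_way_delay_ms(route_km: float) -> float:
    """Milliseconds for a signal to cover `route_km` of fiber."""
    return route_km / SPEED_IN_FIBER_KM_PER_S * 1000.0


def round_trip_ms(route_km: float) -> float:
    """Milliseconds for a request and its acknowledgement to cross the link."""
    return 2.0 * one_way_delay_ms(route_km)


def max_serial_acked_writes_per_s(route_km: float) -> float:
    """Upper bound on writes per second if each waits for a remote acknowledgement."""
    return 1000.0 / round_trip_ms(route_km)


if __name__ == "__main__":
    for km in (80, 100, 1000, 4000):
        print(f"{km:>5} km route: {one_way_delay_ms(km):.2f} ms one way, "
              f"{round_trip_ms(km):.2f} ms round trip, "
              f"at most {max_serial_acked_writes_per_s(km):,.0f} serial acknowledged writes/s")
```

The route length that matters is the length of the actual cable path, which is usually longer than the map distance between the two sites.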
Truth be told, data does not move in a straight line between points A and B. Like roads, WANs are built around natural obstacles and tend to follow other terrain, including building risers, bridges and tunnels, and rail/subway/motorway rights of way. Rarely does signal go “as the crow flies,” a point that is demonstrated annually by college kids who participate in “IPoAC” (Internet Protocol over Avian Carrier) contests that pit WAN-based data transfers over distance against carrier pigeons used to move the same data over the same distance. Year after year, IPoAC contests in Africa, Europe and North America show the avian carriers to be faster.
In addition to latency, part of the explanation for this outcome is “jitter” – a set of factors that further impair or impede the speedy progress of data transfers over distance. Jitter is the result of a range of factors in a publicly shared WAN facility, from routing delays to queuing and sequencing delays. In most public WANs, protocols similar to “open shortest path first” in local area networks are applied to WAN operations. This means that the path preferred for your data is the one that touches the smallest number of switches rather than the path that represents the shortest physical distance between end points.
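The difference between the two notions of “shortest” can be shown on a toy network. In the sketch below, the topology, node names and link lengths are entirely hypothetical, and counting hops is a simplification: real routing protocols typically work from configurable link costs rather than a raw count of switches. A breadth-first search finds the path with the fewest hops, while Dijkstra's algorithm finds the path with the least total distance.

```python
import heapq
from collections import deque

# Hypothetical topology: node -> {neighbor: link length in km}. Not a real network.
LINKS = {
    "A": {"B": 300, "C": 60},
    "B": {"A": 300, "Z": 300},
    "C": {"A": 60, "D": 60},
    "D": {"C": 60, "E": 60},
    "E": {"D": 60, "Z": 60},
    "Z": {"B": 300, "E": 60},
}


def fewest_hops(src, dst):
    """Breadth-first search: the path that touches the fewest switches."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in LINKS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def shortest_distance(src, dst):
    """Dijkstra's algorithm: the path with the least total kilometers."""
    heap = [(0, [src])]
    settled = set()
    while heap:
        km, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return path
        if node in settled:
            continue
        settled.add(node)
        for nxt, length in LINKS[node].items():
            if nxt not in settled:
                heapq.heappush(heap, (km + length, path + [nxt]))
    return None


def path_km(path):
    """Total length of a path in kilometers."""
    return sum(LINKS[a][b] for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    for label, finder in (("fewest hops", fewest_hops), ("least distance", shortest_distance)):
        route = finder("A", "Z")
        print(f"{label}: {' -> '.join(route)} ({len(route) - 1} hops, {path_km(route)} km)")
```

In this made-up network the two-hop route is more than twice as long as the four-hop one, so a hop-minimizing choice carries the data over more cable.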
Moreover, a lot of public shared WANs are oversubscribed and poorly utilized. Carriers report “buffer bloat” in their switch routers as data packets are queued awaiting their turn on the pipe. Meanwhile, impatient applications interpret delays as lost packets and ask for packet resends, further adding to the problem.
It is also worth mentioning that the shared nature of WAN pipes, among WAN subscribers (customers) AND incumbent and competitive local exchange carriers (ILECs and CLECs), often adds more complexity to the problem. One client of mine, who must replicate about 36 GB nightly over approximately 100 miles – through a WAN link “owned” by NINE different carriers – doesn’t know from packet to packet whether the trip will take milliseconds or hours!
Additional bandwidth does not help with latency or jitter in WANs, nor does compression or de-duplication of payload. If you have ever been stuck in a traffic jam on a city highway, you have probably noticed that the diminutive Smart car is moving no faster than the big 18-wheeler – so it is with WANs. Even using the best technology (think Bridgeworks SANSlide and a few others) to pack the link efficiently with data, you still cannot escape the problem of latency.
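One way to see why a fatter pipe does not cure latency is the familiar window-over-round-trip bound for a single acknowledged stream: the sender can have only one window of data in flight per round trip. The sketch below assumes a single stream with a fixed 65,535-byte window and no window scaling; those values and the function name are illustrative assumptions, not measurements of any product or service.

```python
# Throughput ceiling for one acknowledged stream: window size / round-trip time.
# Assumptions (illustrative): a single stream, a fixed window, no packet loss.


def max_stream_mbps(window_bytes: int, rtt_ms: float, link_mbps: float) -> float:
    """Return the lesser of the link rate and the window-limited rate, in Mbps."""
    window_limited_mbps = window_bytes * 8 / (rtt_ms / 1000.0) / 1e6
    return min(link_mbps, window_limited_mbps)


if __name__ == "__main__":
    window = 65_535  # bytes; a classic unscaled window size
    for link in (100, 1_000, 10_000):       # link capacity in Mbps
        for rtt in (1, 10, 50):             # round-trip time in milliseconds
            rate = max_stream_mbps(window, rtt, link)
            print(f"link {link:>6} Mbps, RTT {rtt:>2} ms: about {rate:,.1f} Mbps per stream")
```

With a 50 ms round trip, the stream in this model tops out at about 10.5 Mbps whether the link is 100 Mbps or 10,000 Mbps.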
This brings us to a third, and very important, consideration regarding WAN-based replication at distance to another site (whether your own or a cloud service provider’s). WAN replication is about as difficult to test and validate as local disk mirroring. To determine the size of data deltas, it is necessary to quiesce applications that are generating data for replication, to flush whatever local memory caches may be holding data while waiting for write to primary disk (i.e., write this data to the disk completely), then replicate the data over distance to the remote site, then shut down the replication process. Finally, you must compare the data at both locations for consistency.
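The final comparison step can be sketched with checksums. The example below assumes that, once applications are quiesced and replication is stopped, both copies are visible as ordinary file trees; the function names and report keys are hypothetical. In practice the digests would more likely be computed at each site and only the digest lists exchanged, since hauling the data itself back over the WAN defeats the purpose.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of one file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(primary_root, replica_root):
    """Report files that are missing, unexpected, or different on the replica."""
    primary = tree_digests(primary_root)
    replica = tree_digests(replica_root)
    shared = primary.keys() & replica.keys()
    return {
        "missing_on_replica": sorted(primary.keys() - replica.keys()),
        "extra_on_replica": sorted(replica.keys() - primary.keys()),
        "content_differs": sorted(name for name in shared if primary[name] != replica[name]),
    }
```

An empty report from `compare_copies` means every file present on the primary has a byte-identical counterpart on the replica at the moment of the test; it says nothing about what happens once replication resumes.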
The above is a time consuming and potentially career ending activity (if you can’t restart the application, buffering and replication processes), so it is very rarely performed. Moreover, according to evidence provided by 21st Century Software, many companies avoid the test so completely that they find they have been replicating the wrong data – or even blank space – when they attempt to cut over to their remote mirror. (21st Century Software offers some software tools to help detect such issues.)
The good news is that there are some tools to help rectify the problems of test and validation, known as geo-clustering or stretch-clustering suites, including CA Technologies Replicator, Neverfail Group’s NeverFail, and DoubleTake Software’s DoubleTake. Another approach is to virtualize all of your storage with a storage hypervisor such as DataCore Software’s SANsymphony-V, then use the hypervisor’s replication facility to copy data across a WAN to a target infrastructure also virtualized with SANsymphony. The bad news is that you will probably need to turn off hardware-based replication services to use these software approaches. And, of course, none of these products addresses the WAN problems listed above: latency and jitter persist.
All of the above apply not just to site to site WAN-based data replication, but also to data transfers between your site and “clouds.” While cloud service providers may take every reasonable step to ensure that they are meeting and even exceeding their own Service Level Agreements, they do not own the wires that connect your site to theirs. That makes it exceedingly difficult for the cloud to promise a predictable service level with a straight face.
In the case of storage clouds, whether they are providing more production storage “elbow room” or off-site storage for archival data or backups, the twin problems of distance sufficiency (is the site far enough away to be out of harm’s way?) and distance latency (is the link between the sites introducing unmanageable data deltas?) persist. These must be addressed both from the standpoint of business survival in the event of a natural or man-made disaster that impacts the primary site, and from the standpoint of resiliency in the face of a disaster impacting the service provider.
Nirvanix had no way to return its customers’ data. Customers are now realizing that existing networks are insufficient to move petabytes of data back to owners within 10 working days, and there is no tape for mass portable storage with off-line transport. If you want to use a cloud storage service, you might want to consider one that leverages tape as the storage medium, such as the recently introduced D:ternity or its more pedigreed cousin well known in medical imaging circles, Permivault.
Such services do not eliminate the issues of WANs, but they do provide an alternate way to secure the return of your data when and if the cloud service fails.
About the Author
Jon Toigo is CEO of Toigo Partners International and a consultant who has aided over 100 companies in the development of their continuity programs. Feel free to contact him at firstname.lastname@example.org.