Depending on an SLA? See-ya Later, Alligator
Planning & Management
Written by Christopher Burton   

To some, an “SLA,” or service-level agreement, represents a formal contract between a customer and a supplier. To others, it signifies a guarantee that a service will always be available and operational.

To business continuity practitioners, however, the SLA should be viewed as an opportunity to apply familiar risk management principles in order to better protect an organization from financial, operational or reputational impairment. As customers come to expect 24/7 business operations and 100% uptime, the need for “disaster-resilient” SLAs is greater than ever before.

On Thursday, April 21, 2011, hundreds of the internet’s most popular websites all simultaneously failed. Millions of internet users were denied use of sites such as Foursquare, Quora, and even portions of the NY Times. The underlying issue was quickly identified: Amazon’s Elastic Compute Cloud (Amazon EC2) service had suffered a major failure despite Amazon’s assurances of geographic redundancy.

Setting the Groundwork for SLAs

With the spread of outsourced processes and technology, the term “SLA” has become a part of everyday business vocabulary. Still, it’s important to understand the basics prior to discussing the relationship between SLAs and business continuity/disaster recovery:

What Is It?

An SLA is the part of a service contract where key parameters of the service are formally defined. In practice, the SLA most commonly refers to a contracted delivery time, response time, or frequency of downtime. Many SLAs have penalties associated with failure to meet the requirements.

Why Is It Important?

SLAs record a common understanding about services, priorities, responsibilities, guarantees and penalties between a customer and a supplier. These oftentimes legally binding contracts provide assurance (albeit sometimes false) that a service will be available when needed.

What Is Force Majeure?

Literally meaning “superior force,” this common contract clause protects suppliers from unforeseeable circumstances that prevent them from fulfilling a contract. In other words, the clause excuses suppliers from liability if an outside event prevents them from performing their obligations under contract.

SLAs have remained largely unchanged since they became popular in the 1980s. An SLA should contain the following components: definition of service(s), performance measurement, problem management, customer duties, warranties, disaster recovery, and termination of agreement. The last two components, commonly taken for granted in today’s “always-on” environment, oftentimes escape an organization’s notice and contribute to unnecessary risk.

SLAs Aren’t Enough

Amazon’s cloud failure and many similar events are important reminders that outsourced processes and applications will sometimes fail – even when the SLA says otherwise. Here are three reasons SLAs are not enough to protect you from a business interruption:

1. “High availability,” especially when referring to cloud computing, is oftentimes confused with “disaster recovery.” High availability helps mitigate the risk of individual server, disk or equipment failure (typically, but not always, in the same physical location), but it does not by itself protect against a failure that takes down the entire service. As such, the decision to add disaster recovery arrangements to an SLA should be aligned to organizational risk appetite and availability requirements.

2. SLAs with wording like “100% uptime” imply that the possibility of downtime is near zero. However, without specific business continuity and disaster recovery arrangements in place (i.e., geographically dispersed sites that provide application and data availability/protection within defined recovery objectives), a 100% uptime commitment does not protect against catastrophic failures. Most organizations advertising a 100% uptime SLA are using “high availability,” not “disaster recovery.” Organizations must read the fine print in the SLA to understand the difference between perception and reality. For example, Amazon provides several “Availability Zones” (separate data centers) around the world, each of which commits to 99.95% uptime. Based on this reliability and Amazon’s recommendations for implementation, customers typically use two Availability Zones to ensure disaster recovery. While these zones are supposed to fail independently of each other without bringing the whole system down, just the opposite happened on April 21, 2011. Amazon’s failure to deliver on its availability promises affected numerous organizations that relied solely on Amazon to provide disaster recovery.

3. Most importantly, the majority of SLAs today include penalty clauses that reimburse a customer when a supplier fails to meet the requirements specified in the contract. While a penalty clause is important, as it gives suppliers an incentive to deliver on their SLA, the reimbursement (typically limited to one month of service fees) is often minuscule compared to the negative financial, operational and reputational impact the outage can have on an organization. For example, imagine hosted e-mail, ERP or CRM systems being unavailable for one week – is the prolonged impact of the outage equivalent to one month of service fees? Probably not.
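The arithmetic behind points 2 and 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a model of any supplier's actual SLA: the function names are hypothetical, the monthly fee and daily loss figures are invented for the example, and the redundancy calculation assumes the sites fail independently, which is exactly the assumption that did not hold on April 21, 2011.

```python
from typing import Iterable

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime an availability commitment still permits."""
    return (1.0 - availability) * period_hours


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of redundant sites.

    Valid ONLY if the sites fail independently of each other; a shared
    failure mode makes the real figure lower than this result.
    """
    probability_all_down = 1.0
    for availability in availabilities:
        probability_all_down *= (1.0 - availability)
    return 1.0 - probability_all_down


def credit_shortfall(monthly_fee: float, outage_days: float,
                     loss_per_day: float,
                     credit_cap_months: float = 1.0) -> float:
    """Business loss left uncovered after the maximum SLA credit is paid."""
    max_credit = monthly_fee * credit_cap_months
    return outage_days * loss_per_day - max_credit


if __name__ == "__main__":
    # 99.95% uptime is the figure quoted in the article.
    print(allowed_downtime_hours(0.9995))
    # Two zones at 99.95% each, assuming independent failures.
    print(combined_availability([0.9995, 0.9995]))
    # Hypothetical figures: $5,000 monthly fee, one-week outage,
    # $20,000 of business impact per day, credit capped at one month.
    print(credit_shortfall(5000.0, 7, 20000.0))
```

With these hypothetical inputs, a 99.95% commitment still permits roughly 4.4 hours of downtime per year, and a one-week outage leaves $135,000 of impact uncovered by a one-month credit. The combined figure for two zones looks reassuring on paper, but it is only as good as the independence assumption behind it.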

Companies like Reddit and Hootsuite, which relied solely on Amazon’s high-availability “cloud,” left their customers in the dark. Other companies, like Netflix, SimpleGeo and SmugMug, had third-party disaster recovery arrangements in place at the time of the cloud failure, which enabled them to continue to deliver their products and services without interruption (and without all of the negative press).


Addressing Supply Chain Availability Risk

Business continuity and disaster recovery practitioners, armed with their knowledge and understanding of risk management, have an opportunity to improve organizational resiliency and recovery through active involvement with supplier SLAs. They can take action in several ways:

1. Partner with internal procurement and purchasing owners, as an advocate and advisor, to drive change with existing and new suppliers.

2. Evaluate suppliers currently under contract using risk assessment and measurement techniques similar to those used to evaluate internal activities, with a focus on the impact of failure.

3. Consider establishing criticality tiers for suppliers. (See table.) These tiers group suppliers according to the impact of failure identified through risk assessments, that is, their direct impact on the organization’s ability to deliver its critical products and services. Supplier rankings will vary widely between organizations and industries based on their maturity, core outputs, and even regulatory requirements. The organization’s most important suppliers (tier one suppliers) receive the greatest scrutiny on their ability to deliver on their SLAs.

For tier one suppliers, a demonstration of their ability to operate from a disaster recovery site is the only way to truly confirm their ability to recover. Even then, many organizations will retain third parties that are ready and able to step in if the service cannot be restored. Because these suppliers are so important, business continuity and disaster recovery practitioners should spend roughly 80% of their supply chain continuity time on them.

For tier one and tier two suppliers, a simple and cost-effective survey may be used to gather information regarding their internal processes, planning, and attitude toward business continuity. (See table.)


4. Leverage existing processes and resources to improve the SLA process through the creation of a supplier continuity program. This program, involving both business continuity/disaster recovery and procurement/purchasing processes, helps ensure that contracting activities are aligned with the overall risk appetite and availability requirements of the organization.
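The tiering approach in step 3 can be sketched as a simple classification. This is a minimal sketch under assumptions that are not in the article: the 1-to-5 impact score, the cutoffs between tiers, the use of exactly three tiers, and all names (`Supplier`, `assign_tier`, `verification_steps`) are hypothetical. Only the verification activities for tier one and tier two come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Supplier:
    name: str
    impact_score: int  # assumed scale: 1 (negligible) to 5 (severe)


def assign_tier(impact_score: int) -> int:
    """Map an impact-of-failure score to a criticality tier (1 = most critical).

    The cutoffs are illustrative; each organization sets its own.
    """
    if not 1 <= impact_score <= 5:
        raise ValueError("impact_score must be between 1 and 5")
    if impact_score >= 4:
        return 1
    if impact_score == 3:
        return 2
    return 3


def verification_steps(tier: int) -> List[str]:
    """Verification activities per tier, as described in the article."""
    steps: List[str] = []
    if tier == 1:
        steps.append("demonstration of operation from a disaster recovery site")
    if tier in (1, 2):
        steps.append("business continuity survey")
    # The article names no activity for lower tiers, so none is assumed here.
    return steps


def group_by_tier(suppliers: List[Supplier]) -> Dict[int, List[str]]:
    """Group supplier names under their assigned tier."""
    tiers: Dict[int, List[str]] = {1: [], 2: [], 3: []}
    for supplier in suppliers:
        tiers[assign_tier(supplier.impact_score)].append(supplier.name)
    return tiers


if __name__ == "__main__":
    # Hypothetical suppliers and scores, for illustration only.
    portfolio = [
        Supplier("Hosted ERP provider", 5),
        Supplier("Payroll processor", 3),
        Supplier("Office supplies vendor", 1),
    ]
    for tier, names in group_by_tier(portfolio).items():
        print(tier, names, verification_steps(tier))
```

Keeping the scoring, the tier cutoffs, and the verification activities separate makes it easier for business continuity and procurement staff to adjust one without renegotiating the others.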

As consumers come to expect 24/7 business operations and 100% uptime, the need for disaster-resilient SLAs is greater than ever before. Business continuity and disaster recovery practitioners are perfectly positioned to leverage their knowledge of risk management in order to partner with procurement/purchasing activities and positively impact existing and future supplier contracts. In doing so, they can increase organizational resiliency by reducing the potential for negative financial, operational and reputational impact resulting from a supplier’s inability to deliver products or services.

About the Author
Christopher Burton is a senior consultant at Avalution Consulting, where he specializes in the development of business continuity programs and solutions for organizations in both the public and private sectors. Christopher is a Member of the Business Continuity Institute (MBCI) and member of Contingency Planners of Ohio. He serves on the Technical Committee to develop an American National Standards Institute (ANSI) Standard entitled, “Organizational Resilience Maturity Model – Phased Implementation.” In addition to serving as a consultant, Christopher is a frequent author and speaker. Christopher can be reached via phone at 866.533.0575 or via email at