The Amazon Outage in Perspective: Failure Is Inevitable, So Manage Risk | Opinions | ChannelWorld.in
The Amazon Outage in Perspective: Failure Is Inevitable, So Manage Risk

Added on Nov 08, 2012 by Bernard Golden

Bernard Golden is the vice president of Enterprise Solutions for enStratus Networks, a cloud management software company. He is the author of three books on virtualization and cloud computing, including Virtualization for Dummies. Follow Bernard Golden on Twitter @bernardgolden.

An endless stream of tweets and blog posts has noted, described and bewailed last week's Amazon Web Services outage. Some people characterized the outage as an indictment of public cloud computing in general. Others, some of whom work at other cloud providers, characterized it as indicative of AWS-specific shortcomings. Still others used the event as an opportunity to outline how users have to be sure to hammer home SLA penalty clauses during contract negotiations, just to ensure protection from outages.

Most of these responses reflect bias or the commenter's own agenda and fail to draw the proper lessons from this outage. More crucially, they fail to offer really useful advice or recommendations, preferring to proffer outmoded or alternative solutions that do not provide risk mitigation strategies appropriate for the new world of IT.

Analysis: Amazon Outage Started Small, Snowballed Into 12-Hour Event

The first thing to look at is what risk really is. Wikipedia calls it "the probable frequency and probable magnitude of future loss." In other words, risk can be ascertained by how often a problem occurs and how much that problem is likely to cost. Naturally, one has to evaluate how valuable mitigation efforts to address a risk are, given the cost of mitigation. Spending $1 to protect oneself against a $1,000 loss would seem to make sense, while spending $1,000 to protect oneself against a $1 loss is foolish. 
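To make that arithmetic concrete, here is a minimal sketch in Python. It assumes risk can be reduced to a single probability over some period multiplied by a single dollar loss, and that mitigation removes the risk entirely; both are simplifications. The function names and the 10% figure are hypothetical, chosen only for illustration.

```python
def expected_loss(probability: float, magnitude: float) -> float:
    """Expected loss over a period: chance of the event times its cost."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * magnitude


def mitigation_is_worthwhile(probability: float, magnitude: float,
                             mitigation_cost: float) -> bool:
    """True when mitigation costs less than the loss it is expected to avoid.

    Assumes, for simplicity, that the mitigation removes the risk entirely.
    """
    return mitigation_cost < expected_loss(probability, magnitude)


# Hypothetical figures: a 10% chance per year of the loss occurring.
print(mitigation_is_worthwhile(0.10, 1_000, 1))   # $1 to guard a $1,000 loss
print(mitigation_is_worthwhile(0.10, 1, 1_000))   # $1,000 to guard a $1 loss
```

The first call reports that the spend is justified and the second that it is not, which matches the intuition above. Real decisions would also have to weigh partial mitigation and uncertain estimates.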
Amazon Outages Show That Failure Is An Option

The question for users is whether this outage presents a large enough loss that continuing to use AWS is no longer justified (i.e., is too risky) and that other solutions should be pursued. Certainly there are now applications running on AWS that represent millions or even tens of millions of dollars of annual revenue, so this question is quite germane.

In terms of this specific outage, Amazon posted an explanation that describes it as a combination of some planned maintenance, a failure to update some internal configuration files and a programmatic memory leak. The result was poor availability of Amazon's Elastic Block Store (EBS) service.

Interestingly, the last large AWS outage was also an EBS failure. Even more interestingly, it had an entirely different cause, although human error was the trigger both times: in each case, someone misconfigured an EBS resource, which triggered an unexpected condition and resulted in a service outage.

Most interesting of all, AWS says users shouldn't be surprised by this occurrence. Amazon's No. 1 design principle: "everything fails all the time; design your application for failure and it will never fail."

Many people are outraged by this, feeling that a service provider should take responsibility for ensuring 100% (or at least "five nines") of service availability. Amazon's attitude, they imply, is irresponsible. The right solution, they say, is that users should look to a provider that is willing to take responsibility and provide a service that is truly reliable, made possible by use of so-called "enterprise-grade" hardware and software backstopped by ironclad change control.

There Is No "Right" Equipment, No Matter What Your SLA Says

There's only one problem: the solution proposed by commenters is outmoded, inappropriate and unsustainable.

First, it assumes that availability can be increased by use of enterprise-grade equipment. The fact is, every type of equipment fails, often at inconvenient times. Believing that availability can magically improve by simply using the "right" equipment is doomed to failure.

Resource failure is an unfortunate reality. The primary issue is what user organizations should really do to protect themselves from hardware failure. I view the "negotiate harder on the SLA" strategy as akin to "the beatings will continue until morale improves," meaning that it makes the SLA-demander feel better but is unlikely to result in any actual improvement.

Commentary: Cloud Computing and the Truth About SLAs

Many of the cloud providers commenting on the AWS outage propose this kind of solution. In my view, this demonstrates how poorly they understand this issue. Their hardware will fail, too. Those engaged in taunting a competitor when it experiences a service failure should remember that pride goes before a fall.

Second, ironclad change control processes are not actually going to reduce resource failure. This is because anything involving human interaction is subject to mistakes, which result in failure. It's instructive to note that both major AWS outages were the result not of hardware failure but of human error; specifically, human error that interacted with system design assumptions that failed to account for the type of error that occurred. And even organizations that are strongly ITIL-oriented experience human-caused problems.

Finally, the solutions proposed don't account for the world of the future. Every company is going to experience a massive increase in IT scale; believing that just putting in place rigid enough processes, with enough checks and balances, will reduce failure just doesn't recognize how inadequate that approach is for this new IT world. No IT organization (and no cloud provider) will be able to afford enough people (or enough enterprise-grade equipment) to pursue this type of solution.

Redundancy, Failover Have Been Best Practices For a Long Time

The true solution for resource failure has long been known: redundancy and failover. Instead of a single server, use two; if one goes down, it's possible to switch over to the second to keep an application running. It's just that, in the past, implementing redundancy was unaffordable except for a tiny percentage of truly mission-critical applications, given the cost of hardware and software.

The genius of cloud computing is that it offers the ability to address this redundancy easily and cheaply. Many users have designed their apps to be resilient in the face of individual resource failure and have protected themselves against it, unlike those who pursue the traditional solutions proffered by many commenters, which will inevitably result in an outage when the enterprise-grade equipment fails.
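The "use two servers and switch over" idea can be sketched in a few lines of Python. This is an illustration only, not how any particular application or AWS service implements failover: `primary` and `secondary` are hypothetical stand-ins for calls to two redundant servers, and a production system would add health checks, timeouts and narrower error handling.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def primary() -> str:
    # Hypothetical stand-in for a server that is currently down.
    raise ConnectionError("primary unavailable")


def secondary() -> str:
    # Hypothetical stand-in for a healthy standby server.
    return "served by secondary"


print(call_with_failover([primary, secondary]))
```

The caller never sees the primary's failure; it only sees an error if every replica is down, which is exactly the case that wider redundancy is meant to make rare.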

Perspective: Do Customers Share Blame in Amazon Outages?

More troubling are the infrequent failures that involve human error and result in more widespread service failure. In other words, it's not just one application's resources being unavailable, but a service being out for a large number of applications.

It's tempting to believe the problem is that Amazon just doesn't have good process or smart enough people working for it and that, if those aspects were addressed by it (or another provider), then these infrequent failures wouldn't occur.

This attitude is wrong. These corner case outages will continue, unfortunately. We are building a new model of computing (highly automated and vastly scaled, with rich functionality), and the industry is still learning how to operate and manage this new mode of computing. Inevitably, mistakes will occur. The mistakes are typically not simple errors but, rather, unusual conditions triggering unexpected events within the infrastructure. While cloud providers will do everything they can to prevent such situations, they will undoubtedly occur in the future.

In the End, It Comes Down To Risk

What is the solution for these infrequent yet widespread service outages? AWS recommends more extensive redundancy measures that span geographic regions. Given AWS scoping, that would protect against region-wide resource unavailability. There's only one problem. Implementing more expansive redundancy is complex and expensive, far more so than the simpler measures associated with resource redundancy.
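A rough way to see what spanning regions buys is to compute combined availability. The Python sketch below assumes that regions fail independently and that the application stays up as long as any one region is up; neither assumption is guaranteed in practice, and the 99.9% per-region figure is purely hypothetical, not a published AWS number.

```python
import math

HOURS_PER_YEAR = 24 * 365


def combined_availability(availabilities):
    """Availability of a service that is up while at least one replica is up.

    Assumes replicas fail independently of one another.
    """
    all_down = math.prod(1.0 - a for a in availabilities)
    return 1.0 - all_down


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figure: each region is available 99.9% of the time.
one_region = combined_availability([0.999])
two_regions = combined_availability([0.999, 0.999])
print(f"one region:  {expected_downtime_hours(one_region):.2f} hours/year down")
print(f"two regions: {expected_downtime_hours(two_regions):.4f} hours/year down")
```

Under these assumptions the second region cuts expected downtime dramatically, but the calculation says nothing about the engineering cost of keeping two regions in sync, which is the other half of the tradeoff.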

Tips: Mitigating the Risk of Cloud Services Failure: How to Avoid Getting Amazon-ed

This brings us back to the topic of risk. Remember, it's the probable frequency of a failure weighed against the probable magnitude of the loss associated with it. You have to evaluate how frequently you expect these less-frequent, larger-scale resource failures to occur and compare that to the cost of preventing them via design and operations. In some sense, one is evaluating the cost of careful design and operation vs. the cost of a more general failure.

Certainly the cost of the design and operation can be worked out, while many people prefer to avoid thinking of the cost of a more widespread failure that would take their application offline. However, as more large revenue applications move to AWS, failing to evaluate risk and implement appropriate failure-resistant measures will be imprudent.

Overall, it's not as though the possibility of these outages is unknown, or that the appropriate mitigation techniques are hard to discover. You should expect that cloud service providers (CSPs) will suffer general resource outages and not blame the provider in the event of such an outage. Instead, you should recognize that you made a decision without perhaps acknowledging the risk associated with it. Those who look at these outages and choose to do nothing more than damn the provider and demand perfection don't recognize how dangerous a game they are playing.
