The Amazon Outage in Perspective: Failure Is Inevitable, So Manage Risk | Opinions | ChannelWorld.in

The Amazon Outage in Perspective: Failure Is Inevitable, So Manage Risk

Added on Nov 08, 2012 by Bernard Golden

Bernard Golden is the vice president of Enterprise Solutions for enStratus Networks, a cloud management software company. He is the author of three books on virtualization and cloud computing, including Virtualization for Dummies. Follow Bernard Golden on Twitter @bernardgolden.

An endless stream of tweets and blog posts has noted, described and bewailed last week's Amazon Web Services outage. Some people characterized the outage as an indictment of public cloud computing in general. Others, some of whom work at other cloud providers, characterized it as indicative of AWS-specific shortcomings. Still others used the event as an opportunity to urge users to hammer home SLA penalty clauses during contract negotiations, just to ensure protection from outages.

Most of these responses reflect bias or the commenter's own agenda and fail to draw the proper lessons from this outage. More crucially, they fail to offer really useful advice or recommendations, preferring to proffer outmoded or alternative solutions that do not provide risk mitigation strategies appropriate for the new world of IT.

The first thing to look at is what risk really is. Wikipedia calls it "the probable frequency and probable magnitude of future loss." In other words, risk can be ascertained by how often a problem occurs and how much that problem is likely to cost. Naturally, one has to evaluate how valuable mitigation efforts to address a risk are, given the cost of mitigation. Spending $1 to protect oneself against a $1,000 loss would seem to make sense, while spending $1,000 to protect oneself against a $1 loss is foolish. 
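
To make that arithmetic concrete, here is a minimal sketch in Python. It treats risk as probable frequency multiplied by probable magnitude, which is a common simplification of the definition quoted above. The function names and the risk_reduction parameter are illustrative assumptions, and the dollar figures are simply the ones from the example in the text.

```python
def expected_annual_loss(failures_per_year: float, loss_per_failure: float) -> float:
    """Risk as probable frequency times probable magnitude of loss."""
    return failures_per_year * loss_per_failure


def mitigation_is_worthwhile(failures_per_year: float,
                             loss_per_failure: float,
                             mitigation_cost_per_year: float,
                             risk_reduction: float = 1.0) -> bool:
    """Return True when the loss avoided exceeds the cost of mitigation.

    risk_reduction is the assumed fraction of the expected loss that the
    mitigation removes (1.0 means it eliminates the loss entirely).
    """
    avoided = expected_annual_loss(failures_per_year, loss_per_failure) * risk_reduction
    return avoided > mitigation_cost_per_year


# Spending $1 to avoid a $1,000 loss expected once a year makes sense...
print(mitigation_is_worthwhile(1, 1_000, 1))   # True
# ...while spending $1,000 to avoid a $1 loss does not.
print(mitigation_is_worthwhile(1, 1, 1_000))   # False
```

Real decisions need estimates for both inputs, and the frequency of rare, large outages is the harder of the two to pin down.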

The question for users is whether this outage presents a large enough loss that continuing to use AWS is no longer justified (i.e., is too risky) and that other solutions should be pursued. Certainly there are now applications running on AWS that represent millions or even tens of millions of dollars of annual revenue, so this question is quite germane.

In terms of this specific outage, Amazon posted an explanation that describes it as a combination of some planned maintenance, a failure to update some internal configuration files and a programmatic memory leak. The result was poor availability of Amazon's Elastic Block Store (EBS) service.

Interestingly, the last large AWS outage was also an EBS failure. Even more interestingly, it had an entirely different cause, although human error was the trigger then as well. In both cases, someone misconfigured an EBS resource, which triggered an unexpected condition, resulting in a service outage.

Most interesting of all, AWS says users shouldn't be surprised by this occurrence. Amazon's No. 1 design principle: "everything fails all the time; design your application for failure and it will never fail."

Many people are outraged by this, feeling that a service provider should take responsibility for ensuring 100% (or at least "five nines") of service availability. Amazon's attitude, they imply, is irresponsible. The right solution, they say, is that users should look to a provider that is willing to take responsibility and provide a service that is truly reliable, made possible by use of so-called "enterprise-grade" hardware and software backstopped by ironclad change control.

There Is No "Right" Equipment, No Matter What Your SLA Says

There's only one problem: the solution proposed by commenters is outmoded, inappropriate and unsustainable.

First, it assumes that availability can be increased by use of enterprise-grade equipment. The fact is, every type of equipment fails, often at inconvenient times. Believing that availability can magically improve by simply using the "right" equipment is doomed to failure.

Resource failure is an unfortunate reality. The primary issue is what user organizations should actually do to protect themselves from hardware failure. I view the "negotiate harder on the SLA" strategy as akin to "the beatings will continue until morale improves," meaning that it makes the SLA-demander feel better but is unlikely to result in any actual improvement.

Many of the cloud providers commenting on the AWS outage propose this kind of solution. In my view, this demonstrates how poorly they understand the issue. Their hardware will fail, too. Those engaged in taunting a competitor when it experiences a service failure should remember that pride goes before a fall.

Second, ironclad change control processes are not actually going to reduce resource failure, because anything involving human interaction is subject to mistakes, which result in failure. It's instructive to note that neither major AWS outage was the result of hardware failure. Both resulted from human error: specifically, human error that interacted with system design assumptions that failed to account for the type of error that occurred. And even organizations that are strongly ITIL-oriented experience human-caused problems.

Finally, the solutions proposed don't account for the world of the future. Every company is going to experience a massive increase in IT scale. Believing that rigid enough processes, with enough checks and balances, will reduce failure ignores how inadequate that approach is for this new IT world. No IT organization (and no cloud provider) will be able to afford enough people (or enough enterprise-grade equipment) to pursue this type of solution.

Redundancy, Failover Have Been Best Practices For a Long Time

The true solution for resource failure has long been known: redundancy and failover. Instead of a single server, use two; if one goes down, it's possible to switch over to the second to keep an application running. It's just that, in the past, implementing redundancy was unaffordable except for a tiny percentage of truly mission-critical applications, given the cost of hardware and software.

The genius of cloud computing is that it offers the ability to address this redundancy easily and cheaply. Many users have designed their apps to be resilient in the face of individual resource failure and have protected themselves against it, unlike those who pursue the traditional solutions proffered by many commenters, which will inevitably result in an outage when the enterprise-grade equipment fails.
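
As a rough illustration of failover, the Python sketch below tries a list of redundant servers in order and returns the first successful response. Everything in it is hypothetical: the primary and secondary functions stand in for real servers, and call_with_failover is an illustrative name, not part of any AWS API.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure, then try the next replica
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def primary(request: str) -> str:
    # Hypothetical primary server that is currently down.
    raise ConnectionError("primary unavailable")


def secondary(request: str) -> str:
    # Hypothetical standby server that is healthy.
    return f"secondary handled {request!r}"


print(call_with_failover([primary, secondary], "GET /orders"))
# secondary handled 'GET /orders'
```

If each server is available 99% of the time and failures are independent, two servers are both down only 0.01 x 0.01 = 0.01% of the time, for a combined availability of 99.99%. Independence is a strong assumption, though: it is exactly what breaks down in a service-wide outage.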

More troubling are the infrequent failures involving human error, which result in more widespread service failure. In other words, it's not just one application's resources being unavailable, but a service being out for a large number of applications.

It's tempting to believe the problem is that Amazon just doesn't have good process or smart enough people working for it and that, if those aspects were addressed by it (or another provider), then these infrequent failures wouldn't occur.

This attitude is wrong. These corner case outages will continue, unfortunately. We are building a new model of computing (highly automated and vastly scaled, with rich functionality) and the industry is still learning how to operate and manage this new mode of computing. Inevitably, mistakes will occur. The mistakes are typically not simple errors but, rather, unusual conditions triggering unexpected events within the infrastructure. While cloud providers will do everything they can to prevent such situations, they will undoubtedly occur in the future.

In the End, It Comes Down To Risk

What is the solution for these infrequent yet widespread service outages? AWS recommends more extensive redundancy measures that span geographic regions. Given AWS scoping, that would protect against region-wide resource unavailability. There's only one problem. Implementing more expansive redundancy is complex and expensive, far more so than the simpler measures associated with resource redundancy.

This brings us back to the topic of risk. Remember, it's the probable frequency of a failure weighed against the probable magnitude of the loss associated with it. You have to estimate how often you expect these less-frequent, larger-scale resource failures to occur and compare the resulting loss to the cost of preventing them via design and operations. In some sense, one is evaluating the cost of careful design and operation vs. the cost of a more general failure.
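
One way to frame that comparison is to add each option's yearly design and operations cost to the loss it still leaves exposed. The Python sketch below does this for three options. Every number in it is invented purely for illustration, and the Option class and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    annual_cost: float        # assumed design and operations cost per year
    outages_per_year: float   # assumed outages that still reach the application
    loss_per_outage: float    # assumed revenue lost per outage

    def total_annual_cost(self) -> float:
        """Mitigation cost plus the expected loss that remains."""
        return self.annual_cost + self.outages_per_year * self.loss_per_outage


# Made-up figures, chosen only to show the shape of the trade-off.
options = [
    Option("no redundancy", 0, 2.0, 100_000),
    Option("redundancy within one region", 20_000, 0.5, 100_000),
    Option("redundancy across regions", 150_000, 0.05, 100_000),
]

# List the options from cheapest to most expensive overall.
for option in sorted(options, key=Option.total_annual_cost):
    print(f"{option.name}: ${option.total_annual_cost():,.0f}")
```

With these invented numbers, redundancy within one region comes out cheapest overall. Raise the loss per outage to the millions and the cross-region option wins instead, which is why the revenue riding on an application matters so much to the decision.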

The cost of design and operation can certainly be worked out, but many people prefer to avoid thinking about the cost of a more widespread failure that would take their application offline. However, as more large-revenue applications move to AWS, failing to evaluate risk and implement appropriate failure-resistant measures will be imprudent.

Overall, it's not as though the possibility of these outages is unknown, or that the appropriate mitigation techniques are hard to discover. You should expect that CSPs will suffer general resource outages and not blame the provider in the event of such an outage. Instead, you should recognize that you made a decision, perhaps without acknowledging the risk associated with it. Those who look at these outages and choose to do nothing more than damn the provider and demand perfection don't recognize how dangerous a game they are playing.
