The Amazon Outage in Perspective: Failure Is Inevitable, So Manage Risk | Opinions | ChannelWorld.in

The Amazon Outage in Perspective: Failure Is Inevitable, So Manage Risk

By Bernard Golden on Nov 08, 2012

Bernard Golden is the vice president of Enterprise Solutions for enStratus Networks, a cloud management software company. He is the author of three books on virtualization and cloud computing, including Virtualization for Dummies. Follow Bernard Golden on Twitter @bernardgolden.

An endless stream of tweets and blog posts has described and bewailed last week's Amazon Web Services outage. Some people characterized the outage as an indictment of public cloud computing in general. Others, some of whom work at other cloud providers, characterized it as indicative of AWS-specific shortcomings. Still others used the event as an opportunity to outline how users have to be sure to hammer home SLA penalty clauses during contract negotiations, just to ensure protection from outages.
Most of these responses reflect bias or the commenter's own agenda and fail to draw the proper lessons from this outage. More crucially, they fail to offer really useful advice or recommendations, preferring to proffer outmoded or alternative solutions that do not provide risk mitigation strategies appropriate for the new world of IT.

The first thing to look at is what risk really is. Wikipedia calls it "the probable frequency and probable magnitude of future loss." In other words, risk can be ascertained by how often a problem occurs and how much that problem is likely to cost. Naturally, one has to evaluate how valuable mitigation efforts to address a risk are, given the cost of mitigation. Spending $1 to protect oneself against a $1,000 loss would seem to make sense, while spending $1,000 to protect oneself against a $1 loss is foolish. 
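
To make that arithmetic concrete, here is a minimal Python sketch of the comparison. The function names, the `residual_fraction` parameter and the once-a-year frequency are illustrative assumptions, not figures taken from the outage.

```python
def expected_annual_loss(events_per_year: float, loss_per_event: float) -> float:
    """Risk as probable frequency multiplied by probable magnitude."""
    return events_per_year * loss_per_event


def mitigation_pays_off(
    events_per_year: float,
    loss_per_event: float,
    mitigation_cost_per_year: float,
    residual_fraction: float = 0.0,
) -> bool:
    """Return True if the yearly mitigation cost is below the expected loss it removes.

    residual_fraction is the share of the expected loss that remains after
    mitigation (0.0 means the mitigation removes the loss entirely).
    """
    before = expected_annual_loss(events_per_year, loss_per_event)
    after = before * residual_fraction
    return mitigation_cost_per_year < (before - after)


# The two cases from the text, assuming one event per year (hypothetical frequency).
cheap_protection = mitigation_pays_off(1, 1_000, 1)    # $1 against a $1,000 loss
costly_protection = mitigation_pays_off(1, 1, 1_000)   # $1,000 against a $1 loss
print(cheap_protection, costly_protection)
```

The hard part in practice is not the multiplication but estimating the frequency and the loss honestly; the sketch only shows how the two estimates combine.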

The question for users is whether this outage presents a large enough loss that continuing to use AWS is no longer justified (i.e., is too risky) and that other solutions should be pursued. Certainly there are now applications running on AWS that represent millions or even tens of millions of dollars of annual revenue, so this question is quite germane.

In terms of this specific outage, Amazon posted an explanation that describes it as a combination of some planned maintenance, a failure to update some internal configuration files and a programmatic memory leak. The result was poor availability of Amazon's Elastic Block Store (EBS) service.
Interestingly, the last large AWS outage was also an EBS failure. Even more interestingly, it had an entirely different cause, although human error was the trigger then as well. In both cases, someone misconfigured an EBS resource, which triggered an unexpected condition and resulted in a service outage.
Most interesting of all, AWS says users shouldn't be surprised by this occurrence. Amazon's No. 1 design principle: "everything fails all the time; design your application for failure and it will never fail."

Many people are outraged by this, feeling that a service provider should take responsibility for ensuring 100% (or at least "five nines") service availability. Amazon's attitude, they imply, is irresponsible. The right solution, they say, is for users to look to a provider that is willing to take responsibility and provide a service that is truly reliable, made possible by the use of so-called "enterprise-grade" hardware and software backstopped by ironclad change control.

There Is No "Right" Equipment, No Matter What Your SLA Says

There's only one problem: the solution proposed by commenters is outmoded, inappropriate and unsustainable.

First, it assumes that availability can be increased by use of enterprise-grade equipment. The fact is, every type of equipment fails, often at inconvenient times. Any strategy built on the belief that availability will magically improve simply by using the "right" equipment is doomed to failure.

Resource failure is an unfortunate reality. The primary issue is what user organizations should do, and realistically can do, to protect themselves from hardware failure. I view the "negotiate harder on the SLA" strategy as akin to "the beatings will continue until morale improves," meaning that it makes the SLA-demander feel better but is unlikely to result in any actual improvement.

Many of the cloud providers commenting on the AWS outage propose this kind of solution. In my view, this demonstrates how poorly they understand this issue. Their hardware will fail, too. Those engaged in taunting a competitor when it experiences a service failure should remember that pride goes before a fall.
Second, ironclad change control processes are not actually going to reduce resource failure. Anything involving human interaction is subject to mistakes, which result in failure. It's instructive to note that both major AWS outages were the result not of hardware failure but of human error: specifically, human error that interacted with system design assumptions that failed to account for the type of error that occurred. And even organizations that are strongly ITIL-oriented experience human-caused problems.
Finally, the solutions proposed don't account for the world of the future. Every company is going to experience a massive increase in IT scale. The belief that putting rigid enough processes in place, with enough checks and balances, will reduce failure ignores how inadequate that approach is for this new IT world. No IT organization (and no cloud provider) will be able to afford enough people (or enough enterprise-grade equipment) to pursue this type of solution.

Redundancy, Failover Have Been Best Practices For a Long Time

The true solution to resource failure has long been known: redundancy and failover. Instead of a single server, use two; if one goes down, it's possible to switch over to the second to keep an application running. It's just that, in the past, implementing redundancy was unaffordable except for a tiny percentage of truly mission-critical applications, given the cost of hardware and software.
The genius of cloud computing is that it offers the ability to address this redundancy easily and cheaply. Many users have designed their apps to be resilient in the face of individual resource failure and have protected themselves against it, unlike those who pursue the traditional solutions proffered by many commenters, which will inevitably result in an outage when the enterprise-grade equipment fails.
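
A rough way to see why a second server matters is to compute the availability of a redundant pair. The Python sketch below assumes that the copies fail independently and that failover is instantaneous; both are simplifying assumptions, and the 99% figure is purely hypothetical. The function names are mine.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when one healthy copy is enough.

    The group is down only when every copy is down at the same time.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of unavailability in a 365-day year."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical server that is up 99% of the time, alone and as a redundant pair.
for count in (1, 2):
    a = combined_availability(0.99, count)
    print(f"{count} server(s): {a:.4%} available, "
          f"{downtime_hours_per_year(a):.2f} hours down per year")
```

The independence assumption is exactly what breaks during a service-wide outage, when both copies depend on the same failing service. The formula therefore describes protection against individual resource failure only.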

More troubling are the infrequent failures that involve human error and result in more widespread service failure. In other words, it's not just one application's resources being unavailable, but a service being out for a large number of applications.

It's tempting to believe the problem is that Amazon just doesn't have good process or smart enough people working for it and that, if those aspects were addressed by it (or another provider), then these infrequent failures wouldn't occur.

This attitude is wrong. These corner case outages will continue, unfortunately. We are building a new model of computing (highly automated and vastly scaled, with rich functionality), and the industry is still learning how to operate and manage it. Inevitably, mistakes will occur. The mistakes are typically not simple errors but, rather, unusual conditions triggering unexpected events within the infrastructure. While cloud providers will do everything they can to prevent such situations, they will undoubtedly occur in the future.

In the End, It Comes Down To Risk

What is the solution for these infrequent yet widespread service outages? AWS recommends more extensive redundancy measures that span geographic regions. Given how AWS scopes its resources, that would protect against region-wide resource unavailability. There's only one problem. Implementing more expansive redundancy is complex and expensive, far more so than the simpler measures associated with resource redundancy.

This brings us back to the topic of risk. Remember, risk is the probable frequency of a failure weighed together with the probable magnitude of the resulting loss. You have to evaluate how frequently you expect these less-frequent, larger-scale resource failures to occur and compare that to the cost of preventing them via design and operations. In some sense, one is evaluating the cost of careful design and operation vs. the cost of a more general failure.
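
One way to frame that evaluation is to add each design's yearly cost to the expected loss it leaves behind and compare the totals. The Python sketch below does this for three designs; every name and number in it is hypothetical and chosen only to illustrate the comparison.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Option:
    name: str
    yearly_cost: float        # design and operations cost of this option
    outages_per_year: float   # expected frequency of an application outage
    loss_per_outage: float    # expected magnitude of each outage

    @property
    def total_expected_cost(self) -> float:
        """Mitigation cost plus the expected loss that remains."""
        return self.yearly_cost + self.outages_per_year * self.loss_per_outage


def cheapest(options: Iterable[Option]) -> Option:
    """Pick the option with the lowest total expected yearly cost."""
    return min(options, key=lambda o: o.total_expected_cost)


def build_options(loss_per_outage: float) -> list[Option]:
    # Hypothetical costs and frequencies, not measurements of any provider.
    return [
        Option("no redundancy", 0, 2.0, loss_per_outage),
        Option("redundancy within one region", 20_000, 0.5, loss_per_outage),
        Option("redundancy across regions", 120_000, 0.1, loss_per_outage),
    ]


for loss in (100_000, 1_000_000):
    best = cheapest(build_options(loss))
    print(f"loss per outage {loss}: {best.name}")
```

With these made-up figures, the simpler single-region design is cheapest when an outage costs 100,000, while the cross-region design wins once an outage costs 1,000,000. The result is only as good as the frequency and loss estimates that go in.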

Certainly the cost of the design and operation can be worked out, while many people prefer to avoid thinking about the cost of a more widespread failure that would take their application offline. However, as more large-revenue applications move to AWS, failing to evaluate risk and implement appropriate failure-resistant measures will be imprudent.

Overall, it's not as though the possibility of these outages is unknown, or that the appropriate mitigation techniques are hard to discover. You should expect that cloud service providers (CSPs) will suffer general resource outages and not blame the provider in the event of such an outage. Instead, you should recognize that you made a decision, perhaps without acknowledging the risk associated with it. Those who look at these outages and choose to do nothing more than damn the provider and demand perfection don't recognize how dangerous a game they are playing.
