Amazon Cloud Outage: 10 Lessons Learned
10:00 AM EST Wed. Apr. 27, 2011
Amazon's cloud outage rocked the IT world when its Elastic Compute Cloud (EC2) and Relational Database Service (RDS) went dark, knocking several high-profile customer sites offline and leaving others sluggish or disrupted.
Amazon is still searching for the root cause of the outage, but suggested that an issue with its Elastic Block Store (EBS) service in its Northern Virginia data center is to blame.
The Amazon outage angered customers, called into question the reliability of the cloud and drew criticism of Amazon's lack of communication around the outage. It prompted cloud solution providers to call Amazon's cloud outage a "cautionary tale" about cloud services.
But it also taught the industry several lessons. Here are 10 lessons learned from Amazon's cloud computing outage.
Cloud outages are going to happen. They're a reality, and if you can't stand the outage, get out of the cloud, many solution providers said.
"Outages, for better or worse, are part of the IT industry," said David Hoff, vice president of technology for Atlanta-based cloud solution provider Cloud Sherpas, later adding, "There are going to be outages at least for the foreseeable future, because the technology is so immature."
"Cloud computing will fail, and you need to plan for that failure," said Jeremy Przygode, CEO of Los Angeles-based cloud solution provider Stratalux.
Amazon's cloud outage affected the Northern Virginia availability zones. Amazon customers that leverage different availability zones and spread their cloud infrastructures around weren't nearly as impacted as the Amazon customers that put all their cloud eggs in one basket.
"It will highlight the need for a better and more comprehensive backup/disaster recovery plan. Today, most customers assume that there is one in place but that is not always the case," said Forrester Research analyst Vanessa Alvarez. "It will force customers to ask more questions, put a disaster recovery plan in place that includes more than one provider for example, or fail over into another region. Service providers are in a tough spot now to do what they should have been doing since the beginning, and that's educating their customers with all the options available to them."
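The failover idea described above can be illustrated with a minimal sketch. It assumes a prioritized list of zones and a caller-supplied health check; the zone names and the `pick_healthy_zone` helper are hypothetical and do not correspond to any Amazon API.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_zone(
    zones: Iterable[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first zone whose health check passes, or None if all fail."""
    for zone in zones:
        try:
            if is_healthy(zone):
                return zone
        except Exception:
            # A health check that raises is treated as an unhealthy zone.
            continue
    return None


# Hypothetical zone names, listed in order of preference.
PREFERRED_ZONES = ["zone-a", "zone-b", "other-region-zone-a"]

# Simulated outage: the primary zone is down.
DOWN_ZONES = {"zone-a"}

chosen = pick_healthy_zone(PREFERRED_ZONES, lambda z: z not in DOWN_ZONES)
print(chosen)
```

In a real deployment the health check would probe an actual service endpoint, and the data behind each zone would have to be replicated ahead of time; the sketch covers only the selection step.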
Amazon's cloud outage sent some customers scrambling to determine what kind of uptime they were guaranteed as part of an SLA, and what kind of compensation they'd be afforded for the downtime. The outage brought to light the need to understand SLAs before signing on the dotted line and inking a cloud commitment.
"I believe this outage will force cloud clients to step back and really educate themselves on what they are purchasing from cloud vendors," said Joseph Coyle, CTO of North America for global solution provider Capgemini. "The exposure here is that when leveraging the cloud, the buyer needs to fully understand the technology and the SLAs that each cloud provider offers. High availability and data center failover are offered at different levels. Clients need to fully understand what they are signing up for, but also what their tolerance is for each system or environment that is being migrated to the cloud."
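To make an uptime guarantee concrete, it can help to convert the percentage into permitted downtime. The short sketch below does that arithmetic; the percentages are illustrative only and are not taken from any provider's actual SLA, and `allowed_downtime_hours` is a hypothetical helper name.

```python
def allowed_downtime_hours(uptime_percent: float, period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted in a period under a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


# Illustrative uptime levels (assumptions, not quoted from any contract).
for pct in (99.0, 99.9, 99.95, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

How an SLA measures downtime, and over what period, varies by provider, so the contract terms still need to be read alongside numbers like these.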
The outage highlighted the importance of teaming up with providers and integrators that understand how to build and provision cloud solutions and are available to answer cloud support questions.
"Companies really need to partner with service providers/integrators who have experience architecting solutions in the cloud in order to minimize impact to their services during outages like those seen last week," said Jeremy Przygode, CEO of Stratalux, a Los Angeles-based cloud solution provider. "If companies simply think of IaaS services as virtualization of existing services without re-architecting their solutions then they will get into trouble as many did last week. Those companies who partnered with cloud specific service providers and/or integrators will weather the storm a lot better than those who don’t."
Michael Kirven, co-founder and principal of New York-based solution provider Bluewolf, said the outage showed support from a trusted advisor is key: "They need to partner with a firm they can get on the phone and talk through issues with."
Until now, many cloud users put applications and data into the cloud and had faith that it would just work and run. Solution providers said it's OK to trust the cloud, but that the era of blind trust is over.
"SLAs will be more important, and trust will still be there, but it won't be in the form of 'blind trust,'" said Tony Safoian, CEO of SADA Systems, a North Hollywood cloud solution provider.
"I hope it helps move customers away from blind trust. If anyone wants to run mission critical applications in the cloud or even in their own datacenters, they need to take some additional steps to ensure availability," added Paul Burns, president of cloud analyst firm Neovise.
Amazon's cloud outage has some competing cloud vendors licking their chops and waiting to scoop up angered customers who abandon the cloud. But some cloud experts cautioned customers to be wary of cloud charlatans looking to bamboozle unhappy Amazon customers.
"All current Amazon customers have greater availability than they had before they went to the cloud. There are less reliable providers and cloud charlatans out there. We suggest partnering with a knowledgeable cloud specialist or consultancy to help with selection and contractual requirements," said Bob Shinn, senior managing partner of cloud strategy for Grayslake, Ill.-based cloud consultancy Cloud Silver Lining.
Shinn continued: "Uneducated individuals who have an alternate agenda to the massive business value the cloud offers will hype the situation to create fear, uncertainty and doubt. Technologists who created 'pseudo clouds' will point to Amazon's outage as an excuse for slow adoption and possibly as an excuse for unrelated issues."
Leveraging cloud infrastructure doesn't absolve IT shops or solution providers of their duties to maintain and manage the cloud. It's not a 'set it and forget it' environment, as Amazon's cloud outage has proved.
"Just because I can move it into the cloud, that doesn't mean I can ignore it," said Michael Kirven, co-founder and principal for New York-based solution provider Bluewolf. "It still needs to be managed. It still needs to be maintained."
Jim Damoulakis, CTO of solution provider GlassHouse Technologies, Framingham, Mass., added: "You can't just simply write a check and your problems go away."
It may go against logic, but Amazon's cloud outage could actually be a boost to the cloud computing industry. By shining a light on what can go wrong, the outage can serve as a lesson in how to prevent it from happening again or how to adapt when an outage occurs.
"Over the long term I actually believe the Amazon outage and other publicized failures will make the industry stronger," said Paul Burns, president of cloud analyst firm Neovise. "The failures will provide some good lessons in disaster planning and designing applications for redundancy. I guess it is a paradox, but cloud computing can actually improve availability and DR if applications and management processes take advantage of it. Some big names like Netflix were able to stay up and running during this outage in large part due to their application architecture."
Cloud customers fall under the misconception that resiliency, backup, disaster recovery and other services are offered by cloud providers. While this is true in some cases, it is not in all cases. Amazon's cloud outage showed that when it comes to cloud providers and resiliency, it's best to assume nothing.
"When you use a cloud service, whether you are consuming an application (backup, CRM, email, etc), or just using raw compute or storage, how is that data being protected? A lot of companies assume that the provider is doing regular backups, storing data in geographically redundant locations or even have a hot site somewhere with a copy of your data. Here's a hint: ASSUME NOTHING. Your cloud provider isn't in charge of your disaster recovery plan, YOU ARE!" wrote Forrester analyst Rachel Dines in a blog post examining the resiliency of cloud providers in the wake of Amazon's outage.
Amazon came under criticism for its lack of communication during and following the cloud outage, and radio silence won't fly when customer infrastructures aren't running as promised.
"Amazon has been extremely quiet around how the failure occurred and how it will be avoided in the future," said Joseph Coyle, CTO for North America for global solution provider Capgemini. "We need to remember that this was not a system wide outage so although it is a major hit to Amazon, I believe that if they explain the issue and how to avoid it they can hold back the damage."
"They really need to communicate during the outage," Paul Burns, president of cloud analyst firm Neovise said. "There have been a lot of complaints about Amazon's lack of communication during this outage."