No One To Call: One Company’s Amazon Elastic Compute Cloud Outage Story


Stephen Foley's heart skips a beat when he gets e-mail messages from Amazon these days regarding the Amazon Elastic Compute Cloud service.

Just last Thursday, for example, Amazon Web Services sent an e-mail alert that read: "3:28 PM PDT We are investigating inaccessibility for a small number of instances in a single availability zone in the US-East-1 Region."

The May 12 incident did not affect him, but Foley says he has good reason to be nervous, given the outage that took down his company's iCyte Web-based research annotation software-as-a-service (SaaS) offering on April 21.

The founder and CEO of iCyte says an Amazon Elastic Compute Cloud outage took down his San Francisco-based iCyte service for about 38 hours, from April 21 at 1:41 a.m. PST to about 4 p.m. PST on April 22. He says a second server, which he used for personal purposes, was also hit by the outage. That server, Foley said, was down for over 100 hours and did not return until April 25. Fortunately, he says, the personal server did not hold mission-critical data.

"I am still left feeling nervous about the scale of this outage, the complexity of their system, and with anything so complex it could all come down like a pack of cards again," says Foley of the Amazon Elastic Compute Cloud.

Foley is now considering whether to move his business off Amazon. "My leaning at this stage is to evaluate the cloud companies, look at their service record and go with someone where we have confidence that if something goes wrong we will be well supported. While Amazon is promising to lift their game in this area, it is still unproven. For the sake of our business and our users I can't afford to take that risk."

Amazon would not comment specifically on the impact of the outage on the iCyte Web-based research annotation SaaS, but the company has issued a lengthy summary of the Amazon Elastic Compute Cloud service disruption, apologized to users, and given users of the affected Availability Zone a 10-day cloud services credit.

In its event summary, Amazon blamed the outage on an error made during a network configuration change and said it is making changes to prevent a similar outage. "We will audit our change process and increase the automation to prevent this mistake from happening in the future," said the Amazon Web Services team in the post-mortem. Amazon Web Services also promised that it "will be making a number of changes to prevent a cluster from getting into a re-mirroring storm in the future."

The Amazon Web Services team has also created a new Amazon Web Services Architecture Center that includes best practice guides and Webinars on hosting Web applications on Amazon Web Services, designing fault-tolerant applications on Amazon Web Services, and architecting in the cloud.
