Email this article   Print article 


Amazon Offers Explanations, Apologies For Dual Cloud Outages

By Andrew R Hickey
August 16, 2011    10:11 AM ET

Page 2 of 3

That outage was the second to hit Amazon's cloud services in as many days. On August 7, Amazon's cloud, along with Microsoft's Business Productivity Online Suite (BPOS), suffered a massive cloud outage in Dublin, Ireland that took down its EC2 and Relational Database Service (RDS) for several days. Originally thought to be caused by a lightning strike, investigators found that the power failure that caused the cloud outage started when a transformer exploded and caught fire, which cut the power to Amazon's Dublin data center.

In an extremely lengthy postmortem for the Dublin cloud outage, which lasted several days for some customers, Amazon said that backup generators should have kicked on once the power went out, but that did not happen due to a failure in a programmable logic controller (PLC), which was supposed to assure that the electrical phase synchronized between generators before their power is brought online. Amazon was forced to manually bring the generators online and after several hours restored enough power to re-establish connectivity.

Power supplies, however, quickly drained and Amazon lost power to almost all EC2 instances and 58 percent of the Elastic Block Store (EBS) volumes in that Availability Zone. That was followed by launch delays and API errors. Meanwhile, AWS uncovered an EBS software error that exacerbated the issues.

Amazon said that in the future it will take preventative action against similar outages by adding redundancy and more isolation for PLCs to avoid failure. Amazon also plans to address the resource saturation that affected API calls and reduce the time needed to recover stuck EBS volumes.

The pair of Amazon cloud outages came roughly four months after Amazon's April cloud outage, which took down Amazon cloud services for many customers for several days. Amazon said that outage was caused by a network traffic shift that was "executed incorrectly" and instead of routing traffic to the other router on the primary network, traffic was shifted to the lower-capacity redundant EBS network. Amazon said the issue caused EBS volumes in the Northern Virginia Availability Zone to become "stuck" in a "re-mirroring storm." That made the volumes unavailable and created latency and outages.

A week later, Amazon apologized to customers for the cloud outage and offered them a cloud credit.

NEXT: Amazon Vows To Improve Communication During Outages



<< Previous | 1 | 2 | 3 | Next >>

To continue reading this article, please download the free CRN Tech News app for your iPad or Windows 8 device.
Related: Videos | Slide Shows | Comments

SHARE THIS ARTICLE

More Cloud

Recent Articles

AVG CloudCare: 5 Cool Features For Remote Virus Protection

Whether managing one store or many, CloudCare is free until customers pay the fee.

Top 10 Things To Know About Cisco's Push To The Cloud

Cisco's message to partners at the 2013 Partner Summit in Boston was all about cloud. Executives called partners to embrace the cloud and assured partners Cisco is here to help.

6 Tips To Mapping Out Cloud Migrations

Rackspace provides key themes that partners and customers need to consider when planning their cloud migrations.

  More Slide Shows




Related Videos
Loading...