That outage was the second to hit Amazon's cloud services in as many days. On August 7, Amazon's cloud, along with Microsoft's Business Productivity Online Suite (BPOS), suffered a massive outage in Dublin, Ireland that took down Amazon's EC2 and Relational Database Service (RDS) for several days. The power failure behind the outage was originally thought to have been caused by a lightning strike, but investigators found that it started when a transformer exploded and caught fire, cutting power to Amazon's Dublin data center.
In a lengthy postmortem for the Dublin cloud outage, which lasted several days for some customers, Amazon said that backup generators should have kicked in once the power went out, but they failed to start because of a fault in a programmable logic controller (PLC), which is supposed to ensure that the generators' electrical phases are synchronized before their power is brought online. Amazon was forced to bring the generators online manually and, after several hours, restored enough power to re-establish connectivity.
Backup power supplies, however, quickly drained, and Amazon lost power to almost all EC2 instances and 58 percent of the Elastic Block Store (EBS) volumes in that Availability Zone. Launch delays and API errors followed. Meanwhile, AWS uncovered an EBS software error that exacerbated the issues.
Amazon said it will take preventive action against similar outages by adding redundancy and greater isolation for its PLCs. It also plans to address the resource saturation that affected API calls and to reduce the time needed to recover stuck EBS volumes.
The pair of outages came roughly four months after Amazon's April cloud outage, which took down Amazon cloud services for many customers for several days. Amazon said that outage was caused by a network traffic shift that was "executed incorrectly": instead of routing traffic to the other router on the primary network, it sent traffic to the lower-capacity, redundant EBS network. Amazon said the issue caused EBS volumes in the Northern Virginia Availability Zone to become "stuck" in a "re-mirroring storm," which made the volumes unavailable and created latency and outages.
A week later, Amazon apologized to customers for the cloud outage and offered them a cloud credit.
NEXT: Amazon Vows To Improve Communication During Outages