Amazon Offers Explanations, Apologies For Dual Cloud Outages
10:11 AM EST Tue. Aug. 16, 2011
Amazon Web Services (AWS) has detailed the causes behind a pair of cloud outages this month -- one in Dublin, Ireland, and one in the U.S. -- that knocked Amazon cloud services offline.
In the U.S., Amazon's Elastic Compute Cloud (EC2) went down around 10:25 p.m. Eastern on August 8 in Amazon's U.S. East Region. The cloud outage lasted roughly 30 minutes, but it took down the Web sites and services of many major Amazon cloud customers, including Netflix, Reddit and Foursquare.
According to Amazon's postmortem of the cloud outage, the downtime affected connectivity between three of its different Availability Zones and the Internet.
"The issue happened in the networks that connect our Availability Zones to the Internet," Amazon wrote on its AWS Service Health Dashboard. According to Amazon, all Availability Zones must have network connectivity to the Internet and to each other to let customer resources in one zone communicate with resources in other zones. Safeguards are in place to prevent network issues in one Availability Zone from affecting other zones. AWS uses a routing architecture it calls "north/south," in which northern routers sit at the border, facing the Internet, and southern routers are part of the individual zones. Southern routers are prevented from advertising Internet routes to southern routers in other zones and are prohibited from telling northern routers which routes to use, so routes propagate only from north to south.
In the recent AWS EC2 cloud outage, a southern router inside one Availability Zone briefly went into an incommunicative state and stopped exchanging route information with adjacent devices. The router then began advertising an unusable route to southern routers in other zones, violating the rule that routes flow only from north to south. That bad route was picked up and used by routers in other zones.
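The north/south rule described above can be illustrated with a small sketch. This is a toy model, not Amazon's implementation: the `Router` class, the `accepts_internet_route` function and the router and zone names are all hypothetical, and the model covers only the two prohibitions the postmortem describes. Anything the postmortem does not mention, such as advertisements between southern routers in the same zone, is left permitted here as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Router:
    name: str
    tier: str       # "north" = border router, "south" = router inside a zone
    zone: str = ""  # Availability Zone label; left empty for northern routers


def accepts_internet_route(sender: Router, receiver: Router) -> bool:
    """Hypothetical filter: should the receiver accept an Internet route from the sender?"""
    # Southern routers may not tell northern routers which routes to use.
    if sender.tier == "south" and receiver.tier == "north":
        return False
    # Southern routers may not advertise Internet routes to southern routers in another zone.
    if sender.tier == "south" and receiver.tier == "south" and sender.zone != receiver.zone:
        return False
    # Everything else (notably north to south) is allowed in this toy model.
    return True


border = Router("border-1", "north")
south_a = Router("south-a", "south", zone="zone-a")
south_b = Router("south-b", "south", zone="zone-b")

print(accepts_internet_route(border, south_a))   # north to south: True
print(accepts_internet_route(south_a, south_b))  # cross-zone south to south: False
print(accepts_internet_route(south_a, border))   # south to north: False
```

In this model, the advertisement that caused the outage corresponds to the second call: a southern router in one zone offering an Internet route to a southern router in another zone. A receiver that enforced the check would drop that route even if the sender misbehaved, which is roughly the two-sided protection Amazon says its mitigation provides.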
"Internet traffic from multiple Availability Zones in US East was immediately not routable out to the Internet through the border. We resolved the problem by removing the router from service," Amazon wrote.
Amazon was able to reproduce the failure two days later. It located a software bug in the router and confirmed that a protocol violation had occurred. No person had accessed the failed router, and no automated changes had been made to the device. Amazon quickly created a fix.
"We've developed a mitigation that can both prevent transmittal of a bad internet route and prevent another router from incorporating that route and using it. We've tested the mitigation thoroughly and are carefully deploying it throughout our network following our normal change and promotion procedures," Amazon wrote.
Amazon apologized for the outage and said it will continue to work to avoid future cloud outages.
"We apologize for any impact this event may have caused our customers," Amazon wrote. "We build with the tenet that even the most serious network failure should not impact more than one Availability Zone beyond the extremely short convergence times required by the network routing protocols. We will continue to work on eliminating any issues that could jeopardize that important tenet."
That outage was the second to hit Amazon's cloud services in as many days. On August 7, Amazon's cloud, along with Microsoft's Business Productivity Online Suite (BPOS), suffered a massive cloud outage in Dublin, Ireland, that took down Amazon's EC2 and Relational Database Service (RDS) for several days. The outage was originally thought to have been caused by a lightning strike, but investigators found that the power failure behind it started when a transformer exploded and caught fire, cutting power to Amazon's Dublin data center.
In an extremely lengthy postmortem for the Dublin cloud outage, which lasted several days for some customers, Amazon said that backup generators should have kicked in once the power went out. That did not happen because of a failure in a programmable logic controller (PLC), which is supposed to ensure that the electrical phase is synchronized between generators before their power is brought online. Amazon was forced to bring the generators online manually and, after several hours, restored enough power to re-establish connectivity.
Power supplies, however, quickly drained and Amazon lost power to almost all EC2 instances and 58 percent of the Elastic Block Store (EBS) volumes in that Availability Zone. That was followed by launch delays and API errors. Meanwhile, AWS uncovered an EBS software error that exacerbated the issues.
Amazon said that in the future it will take preventative action against similar outages by adding redundancy and more isolation for PLCs to avoid failure. Amazon also plans to address the resource saturation that affected API calls and reduce the time needed to recover stuck EBS volumes.
The pair of Amazon cloud outages came roughly four months after Amazon's April cloud outage, which took down Amazon cloud services for many customers for several days. Amazon said that outage was caused by a network traffic shift that was "executed incorrectly": instead of being routed to the other router on the primary network, traffic was shifted to the lower-capacity redundant EBS network. Amazon said the issue caused EBS volumes in the Northern Virginia Availability Zone to become "stuck" in a "re-mirroring storm." That made the volumes unavailable and created latency and outages.
A week later, Amazon apologized to customers for the cloud outage and offered them a cloud credit.
Amazon also vowed to improve its communication with customers during a massive service outage it suffered in the U.S. in April, an incident that cast a dark shadow over the cloud giant.
"Communication in situations like this is difficult," Amazon wrote. "Customers are understandably anxious about the timing for recovery and what they should do in the interim. We always prioritize getting affected customers back to health as soon as possible, and that was our top priority in this event, too. But, we know how important it is to communicate on the Service Health Dashboard and AWS Support mechanisms."
Amazon said it communicated more frequently during the Dublin outage than during prior bouts of downtime, but it acknowledged there is still room for improvement.
"First, we can accelerate the pace with which we staff up our support team to be even more responsive in the early hours of an event," Amazon wrote. "Second, we will do a better job of making it easier for customers (and AWS) to tell if their resources have been impacted. This will give customers (and AWS) important shared telemetry on what's happening to specific resources in the heat of the moment. We’ve been hard at work on developing tools to allow you to see via the APIs if your instances/volumes are impaired, and hope to have this to customers in the next few months. Third, as we were sending customers recovery snapshots, we could have been clearer and more instructive on how to run the recovery tools, and provided better detail on the recovery actions customers could have taken. We sometimes assume a certain familiarity with these tools that we should not."
For the Dublin cloud outage, Amazon said it will provide a 10-day credit equal to 100 percent of usage of the EBS volumes, EC2 instances and RDS instances affected. Customers who were affected by the EBS software bug will receive a 30-day credit along with access to Amazon's Premium Support Engineers via the AWS Support Center. The credits will be automatically applied to customers' next AWS bill, Amazon said. Microsoft also issued a credit to BPOS customers affected by the Dublin cloud outage.
"Last, but certainly not least, we want to apologize," Amazon wrote. "We know how critical our services are to our customers' businesses. We will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes."