Amazon Web Services (AWS) has detailed the causes behind a pair of cloud outages this month -- one in Dublin, Ireland, and one in the U.S. -- that knocked Amazon cloud services offline.
In the U.S., Amazon's Elastic Compute Cloud (EC2) went down around 10:25 p.m. Eastern on August 8 in Amazon's U.S. East Region. The cloud outage lasted roughly 30 minutes but took down the Web sites and services of many major Amazon cloud customers, including Netflix, Reddit and Foursquare.
According to Amazon's postmortem of the cloud outage, the downtime affected connectivity between three of its Availability Zones and the Internet.
"The issue happened in the networks that connect our Availability Zones to the Internet," Amazon wrote on its AWS Service Health Dashboard. According to Amazon, every Availability Zone must have network connectivity both to the Internet and to the other zones, so that customer resources in one zone can communicate with resources in the others. Safeguards are in place to keep network issues in one Availability Zone from impacting other zones. AWS uses a routing architecture it calls "north/south": northern routers sit at the border, facing the Internet, while southern routers belong to the individual zones. Southern routers are prevented from advertising Internet routes to southern routers in other zones, and are also prohibited from telling northern routers which routes to use, so routes propagate only from north to south.
In the recent AWS EC2 cloud outage, a southern router inside one Availability Zone briefly stopped communicating and ceased exchanging route information with adjacent devices. The router then began advertising an unusable route to southern routers in other zones, bypassing the restrictions on how routes are allowed to flow. Routers in other zones picked up and used that bad route.
"Internet traffic from multiple Availability Zones in US East was immediately not routable out to the Internet through the border. We resolved the problem by removing the router from service," Amazon wrote.
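The failure mode Amazon describes can be illustrated with a minimal sketch. This is a toy model under assumptions of my own (the `Router` class, its `role` and `faulty` flags, and the route table are all hypothetical), not Amazon's implementation: healthy southern routers are barred from advertising routes, a faulty one leaks an unusable Internet route to another zone, and taking it out of service lets the good route return.

```python
# Toy model (not Amazon's implementation) of the "north/south" rules:
# routes normally flow only from northern (border) routers down to
# southern (zone) routers; a faulty southern router bypasses the filter.

class Router:
    def __init__(self, name, role, faulty=False):
        self.name = name
        self.role = role       # "north" (border) or "south" (zone)
        self.faulty = faulty   # a faulty router ignores the filtering rules
        self.routes = {}       # destination -> name of advertising router

    def advertise(self, dest, neighbor):
        """Offer a route to a neighbor; healthy routers honor the rules."""
        # Southern routers may not advertise Internet routes to other
        # southern routers or push routes north -- unless they misbehave.
        if self.role == "south" and not self.faulty:
            return False
        neighbor.routes[dest] = self.name
        return True

border = Router("border-1", "north")
zone_a = Router("zone-a-1", "south", faulty=True)  # the misbehaving router
zone_b = Router("zone-b-1", "south")

border.advertise("internet", zone_a)
border.advertise("internet", zone_b)

# The faulty router leaks a route; zone_b now uses an unusable path.
zone_a.advertise("internet", zone_b)
assert zone_b.routes["internet"] == "zone-a-1"   # traffic is unroutable

# Remediation: remove the router from service and let routes reconverge.
zone_b.routes = {d: v for d, v in zone_b.routes.items() if v != zone_a.name}
border.advertise("internet", zone_b)
assert zone_b.routes["internet"] == "border-1"   # connectivity restored
```

In this model, as in Amazon's account, the filtering lives in the healthy routers around the faulty one, which is why a single misbehaving device could poison route tables in other zones until it was removed from service.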
Amazon was able to reproduce the failure two days later, located a software bug in the router and confirmed that a protocol violation had occurred. No human had accessed the failed router, and no automated changes had been made to the device. Amazon quickly created a fix.
"We've developed a mitigation that can both prevent transmittal of a bad internet route and prevent another router from incorporating that route and using it. We've tested the mitigation thoroughly and are carefully deploying it throughout our network following our normal change and promotion procedures," Amazon wrote.
Amazon apologized for the outage and said it will continue to work to avoid future cloud outages.
"We apologize for any impact this event may have caused our customers," Amazon wrote. "We build with the tenet that even the most serious network failure should not impact more than one Availability Zone beyond the extremely short convergence times required by the network routing protocols. We will continue to work on eliminating any issues that could jeopardize that important tenet."