Amazon Breaks Cloud Outage Silence With Apology, Credit
9:20 AM EST Fri. Apr. 29, 2011
Amazon Web Services broke more than a week of silence Friday with an apology to users who were rocked by last week's Amazon cloud outage. In an attempt to stop the bleeding, Amazon is also giving users of the affected Availability Zone a 10-day cloud services credit.
In its lengthy mea culpa, Amazon said it is sorry for the outage, which started early Thursday, April 21, and knocked a host of customers' Web sites offline or caused sluggish performance. In some cases, the Amazon cloud outage lasted several days.
"We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services," Amazon said. "As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes."
In its event summary, Amazon said the outage was sparked by an error made during a network configuration change. That error led to disruptions and service outages for its Elastic Compute Cloud (EC2) and Relational Database Service (RDS) customers leveraging Amazon's North Virginia data center. A network traffic shift, Amazon said, was "executed incorrectly" and instead of routing traffic to the other router on the primary network, traffic was shifted to the lower-capacity redundant Elastic Block Store (EBS) network. Amazon said the issue caused EBS volumes in the North Virginia Availability Zone to become "stuck" in a "re-mirroring storm." That made the volumes unavailable and created latency and outages.
Along with highlighting what caused the outage, Amazon is giving cloud customers that use EBS or run RDS database instances in the affected Availability Zone a 10-day credit equal to 100 percent of their usage of EBS volumes, EC2 instances and RDS database instances in that zone, whether or not they were affected by the cloud outage.
"For customers with an attached EBS volume or a running RDS database instance in the affected Availability Zone in the US East Region at the time of the disruption, regardless of whether their resources and application were impacted or not, we are going to provide a 10 day credit equal to 100% of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. These customers will not have to do anything in order to receive this credit, as it will be automatically applied to their next AWS bill. Customers can see whether they qualify for the service credit by logging into their AWS Account Activity page," Amazon said.
To prevent similar issues and cloud outages from occurring in the future, Amazon Web Services said it "will audit our change process and increase the automation." Additionally, Amazon said it has three separate protections to avoid a repeat: It has increased its capacity buffer; it will modify its retry logic in the EBS server nodes to prevent clusters from getting into a re-mirroring storm; and it is testing a fix intended to avoid EBS node failure.
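Amazon did not spell out here how the retry logic will change. As a rough illustration only, one common way to keep many nodes from retrying in lockstep is capped exponential backoff with random jitter. The sketch below assumes that technique; the function name `backoff_delays` and its parameters are hypothetical and are not taken from Amazon's summary.

```python
import random


def backoff_delays(max_retries, base=0.5, cap=60.0, rng=None):
    """Return the wait, in seconds, before each retry attempt.

    Hypothetical sketch, not Amazon's implementation. The ceiling doubles
    on every attempt until it reaches `cap`, and the actual wait is drawn
    uniformly from [0, ceiling] so that nodes spread their retries out
    instead of all retrying at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Print the wait before each of six retries for one simulated node.
    for attempt, delay in enumerate(backoff_delays(6), start=1):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

The cap keeps any single wait bounded, while the jitter is what stops a large group of nodes from hammering the same resource simultaneously.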
Amazon said it will also invest in increasing its visibility, control and automation to recover volumes in an EBS cluster, which would have saved significant time in the recovery process and would have enabled customers to more easily recover their applications in other Availability Zones in the Region.
Amazon said it also intends to make it easier for users to run across multiple Availability Zones to avoid future issues. Many of the customers hit hardest by the Amazon cloud outage used only the affected North Virginia Availability Zone and did not have failover into another zone. Amazon is putting measures in place to make it easier to use multiple Availability Zones and will host a number of free Webinars to offer customers and partners tips and best practices for architecting in the cloud.
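The failover idea can be sketched in a few lines: an application keeps an ordered list of zones and moves to the next one when a health check fails. This is a simplified illustration under stated assumptions, not an AWS API. The function `pick_zone`, the `is_healthy` callback and the zone names in the usage example are all hypothetical.

```python
def pick_zone(zones, is_healthy):
    """Return the first zone, in preference order, that passes the check.

    `zones` is an ordered list of zone names and `is_healthy` is a
    caller-supplied function that returns True when a zone is usable.
    Both are illustrative assumptions, not part of any AWS library.
    """
    for zone in zones:
        if is_healthy(zone):
            return zone
    raise RuntimeError("no healthy Availability Zone available")


if __name__ == "__main__":
    # Simulate the preferred zone being down and the second one healthy.
    preferred = ["us-east-1a", "us-east-1b"]
    print(pick_zone(preferred, lambda zone: zone != "us-east-1a"))
```

A customer running in only one zone has a one-item list, which is why a single-zone outage left those customers with nowhere to fail over.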
Amazon also addressed its lack of communication during the cloud outage, a source of contention for users.
"In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications," Amazon said. "We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what's going on, how long it will take to fix, and what we are doing so that it doesn't happen again."
Amazon said most of the AWS team, including its entire senior leadership team, was directly involved in resolving the outage. Amazon said it felt "focusing our efforts on a solution and not the problem" was the best way to go. Amazon said it updated customers when it had information that was accurate.
"That said, we think we can improve in this area," Amazon said. "We switched to more regular updates part of the way through this event and plan to continue with similar frequency of updates in the future. In addition, we are already working on how we can staff our developer support team more expansively in an event such as this, and organize to provide early and meaningful information, while still avoiding speculation."
Amazon continued: "We also can do a better job of making it easier for customers to tell if their resources have been impacted, and we are developing tools to allow you to see via the APIs if your instances are impaired."