To prevent similar issues and cloud outages from occurring in the future, Amazon Web Services said it "will audit our change process and increase the automation." Additionally, Amazon said it has now put three separate protections in place to avoid a repeat: It has increased its capacity buffer; it will modify its retry logic in the EBS server nodes to prevent clusters from getting into a re-mirroring storm; and it is testing a fix that will avoid EBS node failure.
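Amazon did not publish the details of its revised retry logic. As a rough illustration of the general idea, the sketch below shows one common way to keep many nodes from retrying in lockstep: capped exponential backoff with random jitter. It is a minimal example under stated assumptions, not Amazon's implementation; the names `backoff_ceiling` and `retry_with_backoff`, and the default values for `base`, `cap` and `max_attempts`, are hypothetical.

```python
import random
import time


def backoff_ceiling(attempt, base=1.0, cap=60.0):
    """Upper bound (in seconds) on the wait before retry number `attempt`.

    The bound doubles on each attempt and is capped at `cap`.
    Both defaults are illustrative assumptions.
    """
    return min(cap, base * (2 ** attempt))


def retry_with_backoff(operation, max_attempts=5, base=1.0, cap=60.0,
                       sleep=time.sleep, rng=random):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    After each failure, wait a random time between 0 and the current
    ceiling, so that many clients failing at once spread their retries
    out instead of hitting the service again at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(rng.uniform(0.0, backoff_ceiling(attempt, base, cap)))
```

The `sleep` and `rng` parameters exist only so the waiting behavior can be swapped out or inspected; a caller would normally pass just the operation to retry, for example `retry_with_backoff(find_replica_space)`, where `find_replica_space` is a hypothetical function that raises an exception when no capacity is available.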
Amazon said it will also invest in increasing its visibility, control and automation to recover volumes in an EBS cluster, which would have saved significant time in the recovery process and would have enabled customers to more easily recover their applications in other Availability Zones in the Region.
Amazon said it also intends to make it easier for users to leverage different and multiple Availability Zones to avoid future issues. Many of the customers hit hardest by the Amazon cloud outage relied only on the Northern Virginia Availability Zone and had no failover into another zone. Amazon is putting measures in place to make it easier to deploy across multiple Availability Zones and will host a number of free Webinars to offer customers and partners tips and best practices for architecting in the cloud.
Amazon also addressed its lack of communication during the cloud outage, a source of contention for users.
"In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications," Amazon said. "We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what's going on, how long it will take to fix, and what we are doing so that it doesn't happen again."
Amazon said most of the AWS team, including its entire senior leadership team, was directly involved in resolving the outage. The company said it felt "focusing our efforts on a solution and not the problem" was the best way to go, and that it updated customers when it had information that was accurate.
"That said, we think we can improve in this area," Amazon said. "We switched to more regular updates part of the way through this event and plan to continue with similar frequency of updates in the future. In addition, we are already working on how we can staff our developer support team more expansively in an event such as this, and organize to provide early and meaningful information, while still avoiding speculation."
Amazon continued: "We also can do a better job of making it easier for customers to tell if their resources have been impacted, and we are developing tools to allow you to see via the APIs if your instances are impaired."