Azure Cloud Outage: Microsoft Offers Customer Credits


Microsoft, in response to a Leap Year bug that knocked out its Azure cloud services last month, said it is issuing 33 percent credits to Azure customers and overhauling its cloud disaster recovery, testing and customer services.

Microsoft said all customers using Windows Azure Compute, Access Control, Service Bus and Caching will receive the credit for all of February, whether or not they were affected by the outage.

Bill Laing, Microsoft's corporate vice president for the Server and Cloud Division, wrote in a blog post that a software bug involving incorrect date/time values associated with Leap Year triggered automatic responses that shut down the regular exchange of "transfer certificates," which encrypt applications moving among virtual servers in the data center.
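The article does not spell out the exact date computation that failed, so the following is only a minimal sketch of how this class of Leap Year bug typically arises, not Microsoft's actual code. It assumes a hypothetical routine that computes a one-year expiry date by incrementing the year; the function names are illustrative. Note that 4 p.m. PST on Feb. 28 corresponds to midnight UTC on Feb. 29.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Hypothetical buggy approach: bump the year, keep month and day.

    Raises ValueError when start is Feb. 29, because the following
    year has no such day.
    """
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    """One possible fix: fall back to Feb. 28 when Feb. 29 does not exist."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb. 29 can fail here, so clamp to the last valid day.
        return start.replace(year=start.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive computation failed:", err)
    print("safe computation:", safe_one_year_later(leap_day))
```

Clamping to Feb. 28 is one design choice among several; rolling forward to March 1 or adding a fixed number of days would also avoid the invalid date.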

"We know that many of our customers were impacted by this event and we want to be transparent about what happened, what issues we found, how we plan to address these issues, and how we are learning from the incident to prevent a similar occurrence in the future," Laing wrote.

Laing said the bug was triggered at 4 p.m. PST, Feb. 28, starting a series of cascading server failures in Microsoft's data centers. Developers identified the bug at 6:38 p.m. PST. They disabled Azure service management functionality in all clusters worldwide at 6:55 p.m. PST to stop the failures from escalating.

Fixes were rolled out, and at 5:23 a.m. PST, Feb. 29, Microsoft said that service management had been restored to the majority of systems. However, the last services were not brought back online and Microsoft Azure was not pronounced completely healthy until 2:15 a.m. PST, March 1. Azure Storage and SQL Azure were not affected by the outage.

Going forward, the company will redesign the Windows Azure Dashboard, the interface that communicates the status of the cloud service to customers. More resources will be committed to making the dashboard robust enough that it is not overwhelmed by requests, as it was during the outage. Summaries also will be published more frequently and include more details.

In addition, bug testing will be improved, and more detailed controls will be developed to govern automatic shutdown of the service and to record failures more quickly.

Laing wrote that Microsoft is also re-evaluating customer support and staffing and taking steps to provide more information through more channels. The company is considering expanding its social media communications through Facebook and Twitter.

Azure customers reacted positively to Microsoft's upcoming changes.

"I am glad for the candidness in this disclosure and hope to see more of this from all of the cloud vendors for the future," wrote Magnus Mårtensson (noopman) on the Azure comment page.

"Your transparency is refreshing and certainly appreciated," Marc Dersen wrote. "I believe we can all learn from this incident, especially since both disruptions were the direct result of human errors."

David Geevaratne, president of New Signature, a Washington, D.C.-based Microsoft partner and provider of cloud and Office 365 services, applauded Microsoft in an interview for revising its Azure best practices, especially for improving testing.

"Partners and customers think testing and quality assurance are extremely important parts of any system, and I'm glad Microsoft is devoting more attention to them," he said.