Microsoft: Corrupted File Took Out Cloud Services

The cloud outage that took down several Microsoft Windows Live cloud services including Hotmail and SkyDrive, and also impacted Microsoft Office 365, was caused by a single corrupted file in Microsoft's DNS service, the software giant said this week.

In a blog post explaining the Sept. 8 cloud outage, which started just before 8 p.m. Pacific and lasted more than three hours, Arthur de Haan, Microsoft's vice president of Windows Live Test and Service Engineering, wrote that no data was lost or compromised during the outage and that Microsoft has taken steps to improve its services.

"So, what happened?" de Haan wrote. "A tool that helps balance network traffic was being updated and the update did not work correctly. As a result, configuration settings were corrupted, which caused a service disruption."

De Haan continued: "At 10:23 p.m. PDT we began to see service restoration. We confirmed that the incident was resolved by 11:35 p.m. PDT, although it took some time for the changes to replicate around the world and reach all our customers."

According to de Haan, Microsoft traced the cause to a corrupted file in its DNS service. The company explained that two rare conditions occurred at the same time, producing the file corruption.

"The first condition is related to how the load balancing devices in the DNS service respond to a malformed input string (i.e., the software was unable to parse an incorrectly constructed line in the configuration file)," de Haan wrote. "The second condition was related to how the configuration is synchronized across the DNS service to ensure all client requests return the same response regardless of the connection location of the client. Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service."
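The first condition de Haan describes, a parser that cannot handle a malformed configuration line, is commonly guarded against by validating an entire update before it is applied or replicated. The following is only a minimal sketch of that general idea, not Microsoft's implementation: the "hostname ip weight" record format and the function names parse_line, validate_config and apply_update are hypothetical and chosen purely for illustration.

```python
import ipaddress


def parse_line(line):
    """Parse one hypothetical 'hostname ip weight' record.

    Raises ValueError if the line is malformed.
    """
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    host, ip, weight = parts
    ipaddress.ip_address(ip)  # raises ValueError on an invalid address
    weight_value = int(weight)  # raises ValueError on a non-integer weight
    if weight_value < 0:
        raise ValueError(f"negative weight: {line!r}")
    return host, ip, weight_value


def validate_config(text):
    """Parse every record; fail on the first malformed line."""
    records = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        try:
            records.append(parse_line(line))
        except ValueError as exc:
            raise ValueError(f"line {number}: {exc}") from exc
    return records


def apply_update(current, new_text):
    """Return (config, accepted).

    A rejected update leaves the last known-good configuration in place,
    so a bad file is never handed on to other servers.
    """
    try:
        return validate_config(new_text), True
    except ValueError:
        return current, False
```

In this sketch, an update containing a single bad line is rejected as a whole and the previous configuration stays active, which is one way to keep a malformed file from being synchronized across a service.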

Once cloud service was restored, Microsoft identified two streams of work to drive specific service improvements around monitoring, problem identification and recovery, de Haan wrote. Additionally, Microsoft will focus on further hardening its DNS service to improve redundancy and fail-over capabilities. Microsoft said it will also develop an additional recovery process that will allow a specific property to fail over to restore service and then fail back when DNS service is restored. Microsoft will also review its recovery tools to see if other improvements are necessary to reduce the length of outages and the time it takes to resolve them.

De Haan added that Microsoft regrets the inconvenience caused by the outage.

The late night into early morning cloud outage marked the second major bout of downtime for Microsoft Office 365, which suffered its first outage last month, less than two months after Microsoft officially launched Office 365. Microsoft Office 365, Microsoft's suite of cloud applications, was launched as a new and more reliable alternative to Microsoft Business Productivity Online Suite (BPOS). In the months leading up to Office 365's release, BPOS frequently battled cloud outage demons.

The most recent outage also wasn't the first time Microsoft Windows Live services, like Hotmail, were taken out. In December 2010 and into January 2011, a load balancing issue deleted the inboxes and other messages of more than 17,000 Microsoft Windows Live Hotmail cloud e-mail users, a massive hiccup that persisted for more than four days for some users.

Along with Microsoft, other major cloud vendors have been plagued by cloud outages. Amazon Web Services has suffered two colossal outages this year. Meanwhile, Microsoft rival Google has also battled cloud downtime; most recently, its Google Docs service was knocked out by a memory management bug.

Despite the seemingly frequent outages, several cloud and industry experts have said that fear of cloud outages and downtime is no reason to avoid cloud services.