Microsoft Sorry For E-Mail-Killing BPOS Cloud Outages

Microsoft apologized for a string of outages last week in its Business Productivity Online Services (BPOS) cloud computing suite that caused massive e-mail delays for BPOS users.

"I'd like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused," Microsoft Corporate Vice President of Microsoft Online Services David Thompson said in a blog post detailing the BPOS cloud outages. "We know that e-mail is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider."

Trouble started around 12:30 p.m. Eastern on Tuesday, May 10, when malformed e-mail traffic caused a problem in one of the hub components of the BPOS-S Exchange service, Thompson said. Exchange features a built-in capability to handle malformed traffic, he added, but it "encountered an obscure case" in which that capability did not work correctly, creating a backlog of e-mail. By 3 p.m. Eastern, the malformed traffic was isolated and the mail queues were cleared, but not before customers suffered delays of between six and nine hours. Microsoft created a short-term fix for the issue and went to work on a remedy.

Then, at 12:10 p.m. Thursday, May 12, malformed e-mail traffic was again detected. Microsoft fixed it by 1:03 p.m., but not before users suffered e-mail delays of up to 45 minutes. And at 2:35 p.m. Eastern Thursday, a second, related issue was detected that caused e-mail to become stuck in some users' outboxes. In that incident, more than 1.5 million e-mail messages were queued and awaiting delivery, Thompson wrote. Microsoft fixed that issue by 3:04 p.m. Eastern, and the backlog of e-mail messages was 90 percent clear by 7:12 p.m. Eastern, though some users experienced delays of as much as three hours.

To make matters worse, Microsoft BPOS experienced a failure in Domain Name Service (DNS) hosting around 3 a.m. Eastern on Thursday that prevented users from accessing Outlook Web Access hosted in the Americas and partially impacted some functionality of Microsoft Outlook and Microsoft Exchange ActiveSync devices. Microsoft diagnosed and fixed the problem in the DNS servers and restored service by 7:52 a.m. Eastern.

Microsoft's BPOS cloud collapse follows a recent string of high-profile cloud outages.

Last month, Amazon Web Services suffered a massive outage that took many of its customers offline for several days. Amazon later said the outage, which affected its Northern Virginia data center, was caused by a "re-mirroring storm" in its Elastic Block Store (EBS) service. During the outage, Amazon was criticized for its lack of communication. More than a week after the outage first occurred, Amazon issued an apology to cloud customers and offered many of them a 10-day credit for the trouble.

Amazon's cloud outage was followed by a pair of outages suffered by VMware's new open-source PaaS play, Cloud Foundry, which late last month suffered two separate incidents of prolonged downtime.

That same week, Yahoo Mail also suffered an outage that lasted several hours.

Last week's Microsoft BPOS cloud outage was also not the first for the Microsoft cloud service.

In January 2010, Microsoft Online Services users in North America were met with intermittent access to services, including BPOS. According to Microsoft, some users served by a North American data center were affected. In that instance, monitoring alerted Microsoft to a possible issue, and troubleshooting traced the intermittent access to a problem with network infrastructure. Microsoft found the root cause and took the steps necessary to fix the issue. It also reached out to affected business customers and offered them a credit.

Microsoft BPOS suffered another string of cloud outages in August 2010 and September 2010, which prompted the software giant to launch an Online Service Health Dashboard, an online tool where customers can obtain up-to-date information on the status of BPOS apps as well as a 35-day status history of service performance and availability.

In the blog post last week, Microsoft's Thompson was apologetic for the Microsoft BPOS cloud outage and said Microsoft should have clued in customers and partners immediately after the first BPOS cloud outage was noticed.

"As a result of Tuesday's incident, we feel we could have communicated earlier and been more specific," Thompson wrote. "Effective today, we updated our communications procedures to be more extensive and timely. We understand that it is critical for our customers to be as fully informed as possible during service impacting events. We will continue to improve the timeliness and specificity of our communications."

The BPOS cloud issues only affected customers served from Microsoft's Americas data center, and Thompson said the incidents were specific to BPOS and unrelated to Microsoft Office 365 or any other Microsoft cloud services. Thompson said Microsoft will provide a full post mortem in coming days and will also offer updates on how its SLAs were impacted. Microsoft said it will issue a services credit to impacted customers.

"As I've said before, all of us in the BPOS team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business -- that's not acceptable," Thompson said. "I want to assure you that we are investing the time and resources required to ensure we are living up to your -- and our own -- expectations for a quality service experience every day."