Microsoft 365 Nine-Hour-Plus Outage Is Resolved; Software Giant Blames ‘Elevated Service Load’ During Maintenance
The software giant says the outage was ‘caused by elevated service load resulting from reduced capacity during maintenance for a subset of North America hosted infrastructure.’
Microsoft told administrators in a message that a Microsoft 365 outage lasting nine hours and 22 minutes had been resolved as of 12 a.m. ET on Jan. 23.
“We’ve confirmed that the affected infrastructure has returned to a healthy state and is operating as expected. We’ll continue to closely monitor the remediation actions taken and make any necessary adjustments to maintain stability,” said Microsoft in a further status update at 1:27 a.m. ET.
Downdetector logged 12,380 reports of an outage in Microsoft’s Outlook email service as of 3:15 p.m. ET on Jan. 22; 15,745 reports of an outage in the Microsoft 365 suite of cloud applications as of 3:17 p.m. ET; and 2,246 reports for the Microsoft Store as of 3:29 p.m. ET.
The outage affected a number of Microsoft 365 services including Outlook, Exchange Online, and searching within SharePoint Online, Microsoft OneDrive and Microsoft Teams.
The outage also impacted access to the service portals for Microsoft Purview, Microsoft Defender XDR and the Microsoft 365 admin center.
CRN reached out to Microsoft for additional comment on the outage but had not heard back at press time.
Microsoft said in an admin center update that the outage was “caused by elevated service load resulting from reduced capacity during maintenance for a subset of North America hosted infrastructure.”
Furthermore, Microsoft noted that during “ongoing efforts to rebalance traffic” it introduced a “targeted load balancing configuration change intended to expedite the recovery process, which incidentally introduced additional traffic imbalances associated with persistent impact for a portion of the affected infrastructure.”
David Stinner, president of US itek, a Buffalo, N.Y.-based MSP, said it appears that Microsoft did not have enough capacity on its backup system while doing maintenance on its main system.
“It looks like the backup system was overloaded, and it brought the system down while they were still doing maintenance on the main system,” he said. “That is why it took so many hours to get back up and running. If your primary system is down for maintenance and your backup system fails due to capacity issues, then it is going to take a while to get your primary system back up and running.”
Stinner said the Microsoft 365 outage, which impacted his company and his customers, was the “most stress-free outage we have ever experienced” because his company communicated with customers through its ITSupportPanel application.
Stinner said it is absolutely critical that MSPs proactively communicate with customers during an outage. “In the past our phones would have lit up and our after-hours calls to the OnCall techs would have been crazy just because people would be in the dark about the outage,” he said.
“More MSPs need to get a Customer Experience [CX] platform like we have with Invarosoft’s ITSupportPanel to be able to communicate and focus on the human side of professional services, not just the tech side,” he said.