A Deeper Dive Into Microsoft’s Latest Cloud Outage
A software “code issue” resulted in a five-hour Monday evening outage of Microsoft 365 and some Azure Cloud services.
Microsoft customers started reporting on Downdetector.com at 5:21 pm Monday that they could not access their cloud-based apps. Within an hour, more than 18,000 posts documenting those problems had flooded the website, which tracks cloud outages.
Microsoft said the issue impacted users from 5:25 pm EST to 10:25 pm EST.
Here’s what you need to know about what happened, what services were affected, and how it was resolved.
On its status page, Microsoft said customers encountered errors attempting to authenticate logins to Microsoft 365, Azure, Dynamics 365, and custom applications using its Azure Active Directory single sign-on service.
Only users who were not already signed in saw those authentication request failures.
The preliminary analysis found “a combination of three separate and unrelated issues” behind the problem: a code defect in a service update; a tooling error in the Azure AD safe deployment system that impacted regional scoping; and a code defect in Azure AD’s rollback mechanism, which delayed an attempt to revert the service update.
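Microsoft has not published how its safe deployment system works, so the following is only a generic sketch of the idea behind regionally scoped rollouts: push an update to a small ring of regions, check health, and revert everything touched if a check fails. The function name `staged_rollout`, the ring names, and the `deploy`, `healthy` and `rollback` callbacks are all hypothetical and are not drawn from Azure AD.

```python
from typing import Callable, Dict, List


def staged_rollout(rings: List[List[str]],
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   rollback: Callable[[str], None]) -> Dict[str, object]:
    """Deploy ring by ring; if any region in a ring is unhealthy, revert all updated regions.

    Hypothetical illustration only. Not Microsoft's tooling.
    """
    updated: List[str] = []
    for ring in rings:
        # Scope the change to the regions in the current ring only.
        for region in ring:
            deploy(region)
            updated.append(region)

        # Gate: do not widen the rollout unless every region in the ring is healthy.
        failed = [region for region in ring if not healthy(region)]
        if failed:
            reverted = list(reversed(updated))
            for region in reverted:
                rollback(region)  # newest deployment is reverted first
            return {"completed": False, "failed": failed, "rolled_back": reverted}

    return {"completed": True, "failed": [], "rolled_back": []}
```

In this sketch, a fault in the ring scoping would let a defective update reach every region at once, and a fault in the `rollback` step would delay recovery. Those are the kinds of failures the preliminary analysis describes, though the real mechanisms may differ.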
The outage mostly affected customers in the Americas because the problem was “exacerbated by load,” Microsoft said, though other regions may have also seen disruptions.
In an email update sent to administrators impacted by the outage, Microsoft said: “a code issue caused a portion of our infrastructure to experience delays processing authentication requests, which prevented users from being able to access multiple M365 services.”
Microsoft is “reviewing our code” to understand what caused applications to “stop processing authentication requests in a timely fashion.”
The cloud giant promised a post-incident report within five business days.
What Services Were Affected?
Users couldn’t access multiple Microsoft 365 services that needed their identities verified by Azure Active Directory, including Outlook email, Microsoft Teams and Teams Live Events collaboration services, and Office.com.
Power Platform and Dynamics 365 properties were also impacted by the outage.
How Was Azure IaaS Impacted?
An Azure status update reported that a “subset of customers in the Azure Public and Azure Government clouds may have encountered errors performing authentication operations for a number of Microsoft or Azure services, including access to the Azure Portals.”
The Azure issues lasted from 5:25 pm EST Monday to 8:23 pm EST Monday.
Microsoft said of the Azure service outage that a “recent configuration change impacted a backend storage layer, which caused latency to authentication requests.” The configuration was rolled back to “mitigate the issue.”
Services that “still experience residual impact will receive separate portal communications,” Microsoft said, promising a full post-incident report within 72 hours.
What Was Fixed?
Microsoft said its monitoring systems automatically detected the problems within a minute of initial impact, and engineering teams immediately began troubleshooting.
“Impact was variable based on regional load patterns and we immediately scaled out the services to help process the increased volume as a result of authentication retries due to the issue,” Microsoft said.
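Microsoft's reference to “authentication retries” points to a familiar pattern: clients that retry failed sign-ins immediately can multiply the load on a service that is already struggling. One common mitigation on the client side is exponential backoff with random jitter. The sketch below is a generic illustration, not a description of how Microsoft's clients behave; `AuthUnavailable`, `backoff_delay`, `call_with_backoff` and the default timings are hypothetical.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class AuthUnavailable(Exception):
    """Hypothetical error standing in for a failed or timed-out sign-in request."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a 'full jitter' delay in seconds for a zero-based retry attempt."""
    source = rng or random
    # The ceiling doubles with each attempt but never exceeds the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # A random point below the ceiling spreads clients out in time.
    return source.uniform(0.0, ceiling)


def call_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                      base: float = 0.5, cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Run operation, waiting a jittered, growing delay between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except AuthUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
    raise RuntimeError("max_attempts must be at least 1")
```

Spacing retries this way trades a slower recovery for an individual user against a lower peak of repeated requests hitting the service at once.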
After a successful rollback, most customers confirmed full recovery by 8:23 pm EST on Monday.
“Our engineers are engaged and monitoring the system to help ensure it continues to operate within normal parameters,” Microsoft reported on its status page.
A senior executive for one of Microsoft’s top partners, who did not want to be identified, said it appears that a Microsoft software developer made a code change that brought down Office 365 and Azure.
“It’s amazing to me that a change in code could cause a platform as big as Azure to go down,” the executive said. “It sounds like someone wrote some code that was merged into a production environment and it broke authentication. That’s ridiculous. If you can’t get into email or documents for five hours, it’s pretty bad.”
Microsoft is going to need to do a deep dive to determine how someone could deploy a software code change that causes a five-hour outage, that partner told CRN.
“Everyone expects outages, a hiccup here or there is understandable,” the partner said. “But this appears to be a faulty source control software policy issue. They would presumably be in a source control/DevOps environment that should have prevented this.”
The outage could have a ripple effect in the sales trenches, the executive said, noting that large companies with mission-critical applications often use such outages as an excuse to avoid public cloud.
“There are a lot of frozen middle accounts that hang onto an issue like this and it causes another three year evaluation cycle. In industries like oil and gas and financial services they hang on to something like this. It has a snowball effect.”
Microsoft’s cloud services experienced a few major outages about six months ago, just as the coronavirus crisis was driving more users to the cloud.
On March 3, a six-hour outage struck the U.S. East data center for Microsoft’s Azure cloud, limiting the availability of Azure cloud services for some North American customers. A few days later Microsoft disclosed that a cooling system failure was to blame.
Almost two weeks later, on March 16, Microsoft Teams suffered an outage lasting two hours in Europe as a surge of new users turned to the collaboration platform amid the onset of the coronavirus crisis, stressing its capacity.
And then a little more than a week later, Microsoft confirmed that a series of March outages impacting European customers was caused by strains placed on several cloud services by the COVID-19 pandemic.
Developers were uniquely impacted in that incident, as the first casualty on March 24 was Azure Pipelines, a continuous delivery service used by DevOps teams.