Microsoft’s Eight-Hour Azure Outage: 5 Things We’ve Learned So Far
‘Anytime these happen … everyone realizes what services run on what platforms,’ Zac Paulson, vice president of technology for ABM Technology Group, said.
Microsoft has released more details on the eight-plus-hour Azure outage it experienced this week that affected the performance of a variety of the tech giant’s products and services–not to mention at least one airline, an airport and a telecommunications giant.
The Redmond, Wash.-based vendor on Wednesday published a preliminary post-incident review (PIR) describing the incident and the steps it is taking to prevent the issue from happening again.
The outage appeared less disruptive than an Amazon Web Services issue the prior week that caused potentially hundreds of millions of dollars in losses.
[RELATED: Amazon’s Outage Root Cause, $581M Loss Potential And ‘Apology:’ 5 AWS Outage Takeaways]
Microsoft Azure Outage
Zac Paulson, vice president of technology for Fargo, N.D.-based ABM Technology Group–a member of CRN’s 2025 MSP 500–told CRN in an interview that the outage affected Azure and Microsoft 365 products and services the Microsoft solution provider leverages.
The outage rendered inaccessible some vendor portals ABM uses, with Paulson figuring the portals were hosted in Azure.
But software vendors are embracing multi-cloud environments to get the advantages of each cloud vendor and even to have a failover should one cloud go down. Against that backdrop, the AWS outage was more eye-opening to Paulson: Some of the products and services he and his clients use still went down, even though ABM specializes in Microsoft products and its customers are mostly Microsoft users.
“We just waited it out,” Paulson said. “Anytime these happen … everyone realizes what services run on what platforms.”
Here’s more on what went down Wednesday with the Microsoft Azure outage.
Eight-Plus-Hour Outage
The Azure outage started at 15:45 UTC Wednesday and ended at 0:05 UTC Thursday, according to Microsoft’s preliminary report.
The Azure Front Door (AFD) cloud content delivery network (CDN) and security service was the focus of the issues.
AFD’s issues resulted in latencies, timeouts and errors for a variety of Microsoft products and services including Azure Active Directory B2C, Azure Databricks, Azure Healthcare APIs, Azure Portal, Azure SQL Database, Azure Virtual Desktop (AVD), Container Registry, Microsoft Copilot for Security, parts of Microsoft Entra ID, Microsoft Defender External Attack Surface Management, Microsoft Purview, Microsoft Sentinel Threat Intelligence and Video Indexer.
Outside of Microsoft’s own products, other companies that said they experienced issues at the time of the outage include Kroger and Alaska Airlines. The Independent reported the Scottish Parliament suspended voting during the outage.
Inadvertent Tenant Configuration Blamed
Microsoft put the blame for the outage on “an inadvertent tenant configuration change within” AFD.
The change triggered disruptions to Microsoft services and customer applications dependent on AFD for global content delivery, according to the vendor. Microsoft started investigating issues at 16:15 UTC Wednesday.
AFD nodes failed to load properly with the invalid or inconsistent configuration state, hitting downstream services. The impact became amplified by imbalanced traffic distribution across “healthy nodes” as the “unhealthy nodes” dropped out of the global pool. Even partially healthy regions saw intermittent availability, according to Microsoft. The vendor failed the Azure portal away from AFD at 17:26 UTC.
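To illustrate why losing nodes from a shared pool amplifies impact, here is a minimal Python sketch. It assumes, purely for illustration, that a fixed amount of traffic is spread evenly across whatever nodes remain healthy. Microsoft's report describes the real distribution as imbalanced, so this is a simplification, and the function names, node counts and request rates are hypothetical rather than figures from the report.

```python
def load_per_healthy_node(total_rps: float, nodes: int, unhealthy: int) -> float:
    """Requests per second each healthy node must absorb.

    Assumes an even split of traffic, which is a simplification.
    """
    healthy = nodes - unhealthy
    if healthy <= 0:
        raise ValueError("no healthy nodes left to serve traffic")
    return total_rps / healthy


def amplification(nodes: int, unhealthy: int) -> float:
    """Factor by which per-node load grows relative to a fully healthy pool."""
    healthy = nodes - unhealthy
    if healthy <= 0:
        raise ValueError("no healthy nodes left to serve traffic")
    return nodes / healthy


# Hypothetical pool: 100 nodes sharing 10,000 requests per second.
baseline = load_per_healthy_node(10_000, nodes=100, unhealthy=0)
degraded = load_per_healthy_node(10_000, nodes=100, unhealthy=50)
factor = amplification(nodes=100, unhealthy=50)
```

With these hypothetical figures, losing half the pool doubles the load on each remaining node, which is one way otherwise healthy nodes can end up struggling as well.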
Microsoft blocked further configuration changes to stop dissemination of the faulty state. It deployed the “last known good” configuration across its global fleet in a phased recovery to stabilize the system, restore scale and prevent the issue from recurring. Deployment of the last known good configuration started at 17:40 UTC, and manual node recovery started at 18:45 UTC.
Next Steps, Prior AFD Issue
Microsoft declared the outage over at 00:05 UTC Thursday with confirmation that customers saw successful mitigation of the AFD issue.
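The timestamps Microsoft reported can be turned into elapsed times with a short Python sketch. The times come from the preliminary report as cited in this article. The calendar date in the code is a placeholder assumption, since only the differences between times matter, and the event labels and helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Placeholder date (an assumption): only the gaps between times are meaningful.
DAY = datetime(2025, 1, 1, tzinfo=timezone.utc)


def at(hhmm: str, next_day: bool = False) -> datetime:
    """Build a UTC timestamp from an HH:MM string on the placeholder day."""
    hours, minutes = map(int, hhmm.split(":"))
    return DAY + timedelta(days=int(next_day), hours=hours, minutes=minutes)


# Times in UTC, as reported by Microsoft; the last one falls on Thursday.
EVENTS = [
    ("outage starts", at("15:45")),
    ("investigation starts", at("16:15")),
    ("Azure portal failed away from AFD", at("17:26")),
    ("last known good deployment starts", at("17:40")),
    ("manual node recovery starts", at("18:45")),
    ("mitigation confirmed", at("00:05", next_day=True)),
]


def elapsed_minutes(events):
    """Minutes elapsed from the first event to each event."""
    start = events[0][1]
    return {label: int((when - start).total_seconds() // 60) for label, when in events}


offsets = elapsed_minutes(EVENTS)
for label, minutes in offsets.items():
    print(f"+{minutes // 60}h{minutes % 60:02d}m  {label}")
```

By this arithmetic the outage lasted 8 hours and 20 minutes, and deployment of the last known good configuration began just under two hours after it started.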
Microsoft has reviewed its safeguards and implemented additional validation and rollback controls to prevent a similar issue from happening in the future, according to the preliminary report.
A software defect allowed the faulty tenant configuration deployment to get around safety validations already in place. The vendor expects a more detailed report in less than 14 days.
Interestingly, Microsoft saw a prior AFD issue on Oct. 9 that led to latency and timeouts across Africa, Europe, Asia Pacific and the Middle East. The issue hit customers starting at 7:50 UTC and ended at 16:00 UTC that same day.
The culprit in this case was the cleanup of tenants with erroneous metadata generated by a particular sequence of profile update operations, a bug previously unknown to Microsoft. The cleanup started 20 minutes ahead of customer impact, and Microsoft’s bypassing of its protection system inadvertently allowed the erroneous metadata to reach later stages, crashing the data plane service and disrupting edge sites in Europe and Africa.
The error caused interruption for 26 percent of AFD data plane infrastructure resources in these regions, according to Microsoft.
It was not immediately clear whether any of the changes Microsoft made after the Oct. 9 incident had any bearing on the latest AFD outage. At the time, Microsoft said it had hardened its standard operating procedures to ensure that the configuration protection system is not bypassed for any operation.
The vendor said it should finish improvements to Azure Portal failover systems from AFD by December.
Solution Providers React
ABM’s Paulson was far from the only solution provider hit by the outage.
Wayne Roye, CEO of New York-based Microsoft solution provider Troinet, told CRN in an interview that the outage affected Microsoft tools he uses internally and impacted developer work, with Troinet’s entire dev environment running on Azure.
The outage goes to show that even the best-regarded systems in the market don’t necessarily have a 100 percent uptime guarantee, Roye said. He often seeks to educate clients on business continuity plans and third-party infrastructure in case of mass outages and other incidents.
“We have seatbelts, but it is not a guarantee to save you in an accident,” Roye said. “A lot of companies will reevaluate their infrastructure design for redundancy.”
John Snyder, CEO of Durham, N.C.-based Microsoft solution provider Net Friends, told CRN in an interview that he saw single sign-on (SSO) disrupted, blocking sales and marketing teams from logging into HubSpot.
“It impacted just about everything,” Snyder said. Although the outage wasn’t “crippling,” it still “made for a weird day of several team members discovering that a tool they thought wouldn’t be impacted was actually impacted due to our dependency on Microsoft for authentication.”
Cloud Still Dominant
The reality of outages every now and then with the major cloud vendors didn’t dampen their quarterly financial results reported this week, nor did any analysts bring up the outages on the earnings calls.
Microsoft continues to have Azure application programming interface (API) exclusivity with ChatGPT maker OpenAI after the two organizations published some details of their latest agreement. OpenAI has even contracted an incremental $250 billion of Azure services.
Microsoft’s intelligent cloud (IC) segment saw $30.9 billion in revenue for the quarter, up 27 percent ignoring foreign exchange. Azure and other cloud services saw a 39 percent increase in revenue year on year.
Amazon’s AWS sales increased 20 percent year on year to $33 billion in its latest quarter. The business now has a $132 billion annualized revenue run rate. That growth rate was also the highest in 11 quarters, marking growth not seen since 2022 and an acceleration of 270 basis points over last quarter.