Microsoft 365 Nine-Hour-Plus Outage: 5 Things To Know

The IT channel will be watching for the impact of compute-heavy AI workloads on the cloud in 2026.

This week, Microsoft became the latest vendor to experience a massive outage that affected multiple products and services. And although the technology giant didn’t tie the failure to anything related to artificial intelligence, the growing strain of compute-intensive business workloads will have the channel watching for more frequent outages with deeper impact on customers.

Illustrating the widespread effects of this latest outage by Redmond, Wash.-based Microsoft–which has one of the largest ecosystems in the channel at about 500,000 partners–Downdetector logged 12,380 reports of an outage in Microsoft’s Outlook email service as of 3:15 p.m. ET on Jan. 22.

The outage detection website also logged 15,745 reports of an outage in the Microsoft 365 suite of cloud applications as of 3:17 p.m. ET and 2,246 reports for the Microsoft Store as of 3:29 p.m. ET.



CRN has reached out to Microsoft for comment now that the issue has been resolved.

David Stinner, president of US itek, a Buffalo, N.Y.-based member of CRN’s 2025 MSP 500, credited his company’s use of the Invarosoft ITSupportPanel application for effective communication with customers during the outage, making it the “most stress-free outage we have ever experienced.”

Stinner said it is absolutely critical that MSPs proactively communicate with customers during an outage. “In the past our phones would have lit up and our after-hours calls to the OnCall techs would have been crazy just because people would be in the dark about the outage,” he said.

In its annual Outage Analysis Report, released in May, the Uptime Institute noted that “soaring demand for AI is straining existing infrastructure designs — especially around power and cooling — while electricity grid limitations and global trade tensions introduce new uncertainty in supply chains and expansion plans.”

With the generative AI era now in its fourth year and more projects leaving the experimental phase for production, IT professionals will have to stay vigilant on the impact to cloud reliability and stability.

Here’s more of what you need to know about the nine-plus-hour Microsoft 365 outage that happened this week.

Multiple Microsoft Products Hit

The outage affected a number of Microsoft 365 services, including Outlook, Exchange Online, and searching within SharePoint Online, Microsoft OneDrive and Microsoft Teams.

The outage also impacted accessing service portals for Microsoft Purview, Microsoft Defender XDR and the Microsoft 365 admin center.

During the outage, Outlook users received a “451 4.3.2 temporary server issue” error message when attempting to send or receive email. Users did not have the ability to send and receive email through Exchange Online, including notification emails from Microsoft Viva Engage, according to the vendor.
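The “451 4.3.2” string follows standard SMTP conventions: a three-digit reply code (451) followed by an enhanced status code (4.3.2). A leading 4 marks a transient failure that a sending server would normally retry, while a leading 5 marks a permanent one. Here is a minimal Python sketch of that distinction. The helper name and the second sample reply are illustrative only and are not drawn from Microsoft's tooling.

```python
import re

# Matches an SMTP reply such as "451 4.3.2 temporary server issue":
# a three-digit reply code, an optional enhanced status code, then free text.
REPLY_PATTERN = re.compile(r"^(\d{3})(?:[ -](\d\.\d{1,3}\.\d{1,3}))?\s*(.*)$")

KINDS = {
    "2": "success",
    "3": "intermediate",
    "4": "transient failure",
    "5": "permanent failure",
}


def classify_smtp_reply(line: str) -> dict:
    """Split an SMTP reply line and label it by its first digit (hypothetical helper)."""
    match = REPLY_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not an SMTP reply line: {line!r}")
    code, enhanced, text = match.groups()
    return {
        "code": int(code),
        "enhanced": enhanced,  # None when the server sent no enhanced code
        "text": text,
        "kind": KINDS.get(code[0], "unknown"),
        # Only transient (4xx) failures are worth retrying later.
        "retry": code[0] == "4",
    }


if __name__ == "__main__":
    print(classify_smtp_reply("451 4.3.2 temporary server issue"))
    print(classify_smtp_reply("550 5.1.1 mailbox unavailable"))  # illustrative reply
```

Because the error is in the 4xx class, a standards-compliant sending server would typically queue the message and try again later rather than bounce it immediately.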

Other issues that cropped up include an inability to send and receive subscription email through Microsoft Fabric, collect message traces, search within SharePoint Online and Microsoft OneDrive, and create chats, meetings, teams or channels or add members in Microsoft Teams.

Teams users also had trouble receiving presence and location information. Fabric users were unable to apply and manage sensitivity labels or perform interactive operations on reports and artifacts with sensitivity labels, according to Microsoft.

Recovery Lagged Issue Resolution

As with past cloud outages with other vendors, even after Microsoft fixed the issues, recovery efforts by its users to return to a normal state took additional time.

The technology giant acknowledged the outage at 2:37 p.m. Eastern Thursday, posting to X that it is “investigating a potential issue impacting multiple Microsoft 365 services.” The vendor “identified a portion of service infrastructure in North America that is not processing traffic as expected” at 3:17 p.m.

Microsoft confirmed in a post on X at 4:14 p.m. ET that it “restored the affected infrastructure to a (healthy) state” but “further load balancing is required to mitigate impact.”

At 5:21 p.m. Microsoft said it was “rebalancing traffic across all affected infrastructure to ensure the environment enters into a balanced state … as quickly as possible.” This approach allowed the vendor to “identify any additional actions needed for recovery.”

The company reported “residual imbalances across the environment” at 7:02 p.m. At 12:33 a.m. on Jan. 23, it said it had “restored access to the affected services” and that mail flow was stable.

At that time, Microsoft still saw a “small number of remaining affected services” without full service stability. The company declared impact from the event “resolved” at 1:29 p.m. Eastern. In another X post at 8:20 a.m., Microsoft told users experiencing residual issues that “clearing local DNS caches or temporarily lowering DNS TTL values may help ensure a quicker remediation.”
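The timeline above can be turned into elapsed times with a few lines of Python. This is a sketch under two assumptions: that every timestamp Microsoft posted is in Eastern time, and that the first post fell on Thursday, Jan. 22, 2026, as the article's dates imply. The event labels are paraphrases, not Microsoft's wording.

```python
from datetime import datetime

# Timestamps as reported in the article, all assumed to be Eastern time.
# Naive datetimes are fine here because every value shares one time zone.
EVENTS = [
    ("acknowledged on X", datetime(2026, 1, 22, 14, 37)),
    ("faulty infrastructure identified", datetime(2026, 1, 22, 15, 17)),
    ("infrastructure restored to healthy state", datetime(2026, 1, 22, 16, 14)),
    ("traffic rebalancing under way", datetime(2026, 1, 22, 17, 21)),
    ("residual imbalances reported", datetime(2026, 1, 22, 19, 2)),
    ("access and mail flow restored", datetime(2026, 1, 23, 0, 33)),
]


def elapsed_since_start(events):
    """Return (label, 'HhMMm') pairs measured from the first event."""
    start = events[0][1]
    rows = []
    for label, when in events:
        minutes = int((when - start).total_seconds() // 60)
        rows.append((label, f"{minutes // 60}h{minutes % 60:02d}m"))
    return rows


if __name__ == "__main__":
    for label, elapsed in elapsed_since_start(EVENTS):
        print(f"{elapsed:>6}  {label}")
```

Measured from the 2:37 p.m. acknowledgment, the 12:33 a.m. restoration of access lands 9 hours and 56 minutes later, which is consistent with the “nine-plus-hour” description.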

Followed Other Issues In January

This was not Microsoft’s first outage of 2026, with the vendor handling access issues with Teams, Outlook and other M365 services on Wednesday, a Copilot issue on Jan. 15 plus an Azure outage earlier in the month.

Microsoft acknowledged the Wednesday M365 issues at 12:11 p.m. Eastern in a post on X. The company called access issues “resolved” at 1:29 p.m. and blamed “a third-party network issue,” noting “that the Microsoft service environment remained healthy.”

On Jan. 15, the vendor experienced issues with its Microsoft Copilot artificial intelligence application in North America. Microsoft acknowledged the issue at 7:42 p.m. Pacific and called it “resolved” at 8:24 p.m. The vendor blamed the issue on a configuration change to the service and reverted the change to resolve impact.

As for the January Azure incident, between 17:50 UTC on Jan. 10 and 1:23 UTC on Jan. 11, users in the West US 2 region, located in Washington state, saw “intermittent connectivity issues, timeouts, increased error rates, or delays when performing operations on affected resources,” according to Microsoft’s preliminary post-incident review (PIR).

The company blamed the disruption on “a power interruption affecting infrastructure within a single Availability Zone (AZ01) within the West US 2 region, which resulted in some infrastructure being temporarily unavailable.” Microsoft recovered compute and storage infrastructure by 19:51 UTC Jan. 10, but residual impact to newly created virtual machines and VMs updated during the impact timeframe in this AZ remained until 1:23 UTC Jan. 11.

The outage affected Azure Cache for Redis, Azure Cosmos DB, Azure Data Explorer, Azure Database for PostgreSQL, Azure Databricks, Azure Synapse Analytics, Azure Service Bus, Azure SQL Database, Azure Storage and other Azure services.

As a result of the incident, Microsoft said it will improve alerting for specific rack infrastructure impacted by such data center issues and improve standard operating procedures, troubleshooting guides and escalation workflows to reduce the time to mitigate residual impact.

The company also pledged to improve automated recovery of SLB infrastructure services impacted following localized infrastructure interruptions.

Microsoft also recommended that users leverage AZs to run services across physically separate locations within an Azure region for greater resiliency to data center level failures and consider evaluating the reliability of applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review.
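The case for spreading a service across availability zones can be illustrated with simple probability. The sketch below assumes that zone failures are independent and that the service stays up as long as at least one zone is up. Both are simplifications, and the 99.9 percent per-zone figure is hypothetical, not an Azure SLA.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Availability of a service that survives as long as one zone is up.

    Assumes zone failures are statistically independent, which real
    incidents (shared networking, control planes) can violate.
    """
    if not 0.0 <= per_zone <= 1.0:
        raise ValueError("per_zone must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return 1.0 - (1.0 - per_zone) ** zones


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime in minutes for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    for zones in (1, 2, 3):
        a = combined_availability(0.999, zones)  # 0.999 is a hypothetical figure
        print(f"{zones} zone(s): {a:.7%} available, "
              f"{downtime_minutes_per_year(a):.2f} min/year down")
```

Under these assumptions a single zone at 99.9 percent implies roughly 525.6 minutes of downtime a year, and each added zone shrinks that figure sharply. Correlated failures would erode the gain in practice.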

‘Reduced Capacity During Maintenance’ Receives Blame

Microsoft said in an admin center update that the outage was “caused by elevated service load resulting from reduced capacity during maintenance for a subset of North America hosted infrastructure.”

Furthermore, Microsoft noted that during “ongoing efforts to rebalance traffic” it introduced a “targeted load balancing configuration change intended to expedite the recovery process, which incidentally introduced additional traffic imbalances associated with persistent impact for a portion of the affected infrastructure.”

US itek’s David Stinner said it appears that Microsoft did not have enough capacity on its backup system while doing maintenance on its main system.

“It looks like the backup system was overloaded, and it brought the system down while they were still doing maintenance on the main system,” he said. “That is why it took so many hours to get back up and running. If your primary system is down for maintenance and your backup system fails due to capacity issues, then it is going to take a while to get your primary system back up and running.”
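Stinner's reading can be made concrete with a headroom calculation. The sketch below uses made-up numbers: the node count, per-node capacity and total load are all hypothetical, since Microsoft has not published such figures. It shows how taking part of a pool offline for maintenance can push the remaining servers past their capacity.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical pool of identical servers behind a load balancer."""
    nodes: int                # total servers in the pool
    capacity_per_node: float  # requests/second one server can handle
    total_load: float         # requests/second arriving at the pool

    def utilization(self, nodes_offline: int = 0) -> float:
        """Fraction of remaining capacity consumed with some nodes offline."""
        active = self.nodes - nodes_offline
        if active <= 0:
            raise ValueError("no nodes left to serve traffic")
        return self.total_load / (active * self.capacity_per_node)

    def max_nodes_offline(self, ceiling: float = 1.0) -> int:
        """Largest number of nodes that can be offline while utilization stays at or under the ceiling."""
        offline = 0
        while offline + 1 < self.nodes and self.utilization(offline + 1) <= ceiling:
            offline += 1
        return offline


if __name__ == "__main__":
    # Illustrative values only.
    pool = Pool(nodes=10, capacity_per_node=1000.0, total_load=7000.0)
    for offline in (0, 2, 4):
        print(f"{offline} offline -> {pool.utilization(offline):.0%} utilized")
    print("safe to take offline:", pool.max_nodes_offline())
```

With these illustrative values, the pool runs at 70 percent with every node online but exceeds 100 percent once four nodes are out. That is the kind of squeeze Microsoft's “elevated service load resulting from reduced capacity during maintenance” describes.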


Cloud’s Still King

While the frequency of cloud outages might lead some in IT to consider repatriating workloads to on-premises infrastructure, it’s important to remember the continued benefits of cloud adoption, especially with cloud being a popular way to deliver AI products and services.

Even in the May Uptime Institute report that warned about AI’s effects on infrastructure, the organization noted that outages have been becoming less frequent and less severe relative to the rapid growth of digital infrastructure over the past several years, with industry progress in risk management and reliability.

The institute found that 53 percent of data center operators reported an outage in the past three years, a notable drop from the 60 percent of 2022, 69 percent of 2021 and 78 percent of 2020. The rate came close to the 55 percent logged in 2023.

For severity, 9 percent of reported incidents in 2024 were classified as serious or severe, the lowest level recorded by Uptime to date.

The organization noted that AWS Outposts, Azure Arc and other hybrid cloud platforms plus VMware Cloud Foundation and other management frameworks can help enterprises improve cloud integration with on-premises infrastructure.

Lydia Leong, a distinguished vice president and analyst with Gartner, said in a November post that repatriating cloud workloads on premises or moving from hyperscale providers to smaller sovereign clouds won’t eliminate outage risk.

Modern cloud-native apps should distribute workloads across multiple availability zones and be ready to fail over quickly to another region when needed, for example. Single clouds tend to deliver better uptime and simpler operations compared to juggling multiple providers with different processes.

Businesses that can’t have any downtime for certain functions might want to consider keeping manual workarounds ready to go if a primary system fails, which can satisfy regulators while keeping IT budgets down. Leong even dismissed multicloud use unless demanded by regulators.

“Gartner research shows that pursuing multicloud resilience can cost more than it saves, introducing technical complexity without truly eliminating systemic risk,” she said.