The 10 Biggest Cloud Outages Of 2025 (So Far)
Outages with Google Cloud, Microsoft 365 and SentinelOne were among the biggest disruptors for channel partners so far this year.
Configuration changes. Software updates. Cyberattacks.
These are just some of the causes of the biggest cloud outages the channel has experienced so far this year, hitting tech giants including Microsoft and Google, plus channel-specific companies such as Conduent and Ingram Micro.
Some initial data on 2025 outages appears to show that IT workers have more on their plate this year compared to 2024. Global network outages experienced modest growth from April to May, increasing 2 percent to 1,843 incidents, according to a June report by Cisco’s ThousandEyes division.
[RELATED: The 10 Hottest Cloud Computing Startups Of 2025 (So Far)]
2025 Cloud Outages
In 2024, the April-to-May increase was more significant but with a lower volume of incidents–rising 20 percent to 822 outages. The prior year saw an even more dramatic seasonal surge of 27 percent to 1,304 incidents, according to ThousandEyes.
Still, the muted growth in 2025 led ThousandEyes to suggest “either improved global network stability or more effective seasonal planning.”
Read on for more information on some of the biggest cloud outages the channel has seen so far in 2025. And see some of CRN’s other looks at 2025 so far, including lists of the hottest collaboration tools, data storage startups and AI tools.
January Conduent Outage
Conduent–No. 29 on CRN’s 2025 Solution Provider 500–had a rough start to the year with a major service outage caused by a cyberattack.
Florham Park, N.J.-based Conduent—whose systems are used to enable government services such as child support payments and food assistance—saw an outage that affected some support payments and benefits in the U.S.
On a quarterly earnings call Conduent held in May, CFO Giles Goodburn told analysts the company incurred “$3 million and accrued $22 million of non-recurring expenses in the first quarter related to the event based on potential notification requirements,” but that amount was not a “material financial impact to our operations,” according to a call transcript.
CEO Cliff Skelton told analysts the company saw “virtually no operational impact” from the event and was “in some cases, only down for a couple of hours.” A regulatory filing Conduent made in April, however, said that in some cases, affected systems weren’t restored to normal operations for days.
The company was examining how many exfiltrated records, if any, contained protected data. “That's a very complex process that we're underway with now with our clients,” he said, according to the transcript. “And so, yes, the event is behind us. All the aspects of what happened is behind us. All the protections and the vulnerabilities, if there were any, were mitigated.”
February Asana Outages
Asana, an enterprise work management platform vendor that launched a new partner program earlier this year, saw twin outages Feb. 5 and Feb. 6.
First, at 21:05 UTC Feb. 5, “a configuration change caused a large increase in server logs, overloading logging infrastructure and causing server restarts,” according to the company’s report on the incident.
Other servers overloaded and failures cascaded. Asana reverted the configuration change, and the application recovered by around 21:30 UTC. The company said it “removed the source of logs which resulted in overload” and is “improving monitoring and resilience of the system which triggered failure.”
At 15:37 UTC the next day, another configuration change caused servers to crash, resulting in complete downtime for 20 minutes, according to an Asana report. The company achieved full recovery at 16:30 UTC after rolling back the configuration. To prevent a repeat of the incident, the vendor updated “networking components to avoid this form of cascade failure” and moved “to staged configuration rollouts to identify this class of issue without causing downtime for customer traffic.”
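Asana hasn’t published the details of its staged rollouts, but the general pattern is well established: apply a configuration change to a small slice of the fleet first, watch health signals, and only then widen the change. The Python sketch below is a minimal illustration of that pattern under assumed names; the Server class, the stage fractions and the error-rate threshold are hypothetical, not Asana’s implementation.

```python
import time

# Hypothetical illustration of a staged (canary-style) configuration rollout:
# apply the change to a small slice of servers, verify health, then widen.
# None of these names come from Asana's report.

STAGES = [0.01, 0.10, 0.50, 1.00]  # fraction of the fleet covered per stage


class Server:
    """Stand-in for a real host; tracks only what the sketch needs."""

    def __init__(self):
        self.config = None

    def load_config(self, config):
        self.config = config

    def error_rate(self):
        return 0.0  # a real check would read monitoring data


def apply_config(servers, config):
    for server in servers:
        server.load_config(config)


def healthy(servers, error_budget=0.001):
    # Real checks might include crash-loop detection and log-volume ceilings.
    return all(s.error_rate() < error_budget for s in servers)


def staged_rollout(fleet, new_config, old_config, soak_seconds=300):
    done = 0
    for fraction in STAGES:
        target = int(len(fleet) * fraction)
        apply_config(fleet[done:target], new_config)
        time.sleep(soak_seconds)  # let metrics settle before judging the stage
        if not healthy(fleet[:target]):
            apply_config(fleet[:target], old_config)  # roll back what was touched
            return False
        done = target
    return True


if __name__ == "__main__":
    fleet = [Server() for _ in range(100)]
    print("rollout succeeded:",
          staged_rollout(fleet, {"log_level": "info"}, {}, soak_seconds=0))
```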
ThousandEyes’ report on the incident said that any configuration adjustment, upgrade or change in the disparate and distributed components working together to deliver a service can have a devastating impact on the service’s overall functional performance.
IT operations (ITOps) teams need “a holistic perspective of the entire service delivery chain, including all components and dependencies” so they “can quickly ascertain the problem or fault domain and take mitigation steps, which may include rolling back configurations, among other actions.”
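ThousandEyes is describing full-stack visibility as a product capability, but even a rudimentary probe can attribute a failure to a fault domain by walking the delivery chain in order. The standard-library Python sketch below checks the DNS, TCP and HTTP layers for a hypothetical endpoint; app.example.com and its /health path are assumptions for illustration, not anything ThousandEyes prescribes.

```python
import socket
import urllib.request

# Small sketch of checking each layer of a service delivery chain in order
# (DNS, then TCP, then HTTP) so a failure can be attributed to a fault domain.
# The host, port and health path are illustrative assumptions.

HOST, PORT, URL = "app.example.com", 443, "https://app.example.com/health"


def locate_fault():
    try:
        ip = socket.gethostbyname(HOST)  # DNS layer
    except socket.gaierror:
        return "DNS resolution failed"
    try:
        socket.create_connection((ip, PORT), timeout=5).close()  # network layer
    except OSError:
        return "TCP connection failed"
    try:
        status = urllib.request.urlopen(URL, timeout=5).status  # application layer
    except Exception as exc:
        return f"HTTP request failed: {exc}"
    return "healthy" if status == 200 else f"backend returned {status}"


if __name__ == "__main__":
    print(locate_fault())
```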
February Jitters By Atlassian’s Jira
Atlassian saw back-to-back inaccessibility issues in February for its Jira project management and issue-tracking software.
On Feb. 3, the Australia-based vendor started investigating slowness and unavailability at 10:15 UTC, according to an incident report. The vendor marked the incident resolved at 12:52 UTC.
Then on Feb. 5, Atlassian started investigating intermittent errors in several Jira products at 10:44 UTC. The vendor identified the root cause and mitigated the problem within about 25 minutes, but the issue wasn’t fully resolved, with service returning to normal, until 21:44 UTC Feb. 6.
In a ThousandEyes report on the outage, the Cisco division said that “services appeared reachable, with no significant network issues observed, indicating that the problem likely resided in the backend.”
“While no direct correlation has been established and we can’t say for sure whether this issue was configuration-related, it is common for issues like these to come in waves,” according to ThousandEyes. “Following a mitigation or rollback, subsequent attempts to implement an internal change or patch can often lead to similar issues cropping up.”
Atlassian is part of CRN’s 2025 Partner Program Guide and has more than 700 partners worldwide, according to the vendor.
February Slack Outage
Salesforce collaboration application Slack saw a pair of issues on Feb. 26 and Feb. 27.
On the first day, from 6:45 a.m. Pacific to 4:13 p.m., “a large percentage of Slack users experienced issues with various features including sending and receiving messages, using workflows, loading channels or threads, and logging into Slack,” according to a Slack report on the incident. “These features may have been degraded or in some cases fully unusable.”
Slack blamed the issue on “a maintenance action in one of our database systems, which, combined with a latent defect in our caching system, caused an overload of heavy traffic to the database” — making half of the instances relying on the database unavailable.
At the peak of the outage, more than 3,000 users reported to Downdetector that they couldn't access the platform, according to CBS.
Even after the database systems problem was resolved, users of the Slack Events application programming interface (API) saw continued issues until 8:30 a.m. Pacific Feb. 27, according to a separate report by the company. Custom applications, integrations and bots stopped working as expected for some users.
Slack blamed the mitigation measures used for the database issue. “To help us address the initial incident, we paused the Events API job queue and rate-limited a non-critical API endpoint that was generating an outsized load. While the queue was paused, incoming Events API requests were placed in a backlog. Once we stabilized the database tier and restored critical Slack functionality, we re-enabled the Events API queue.”
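Slack didn’t share code, but the mitigation it describes, pausing a job queue, backlogging incoming events and then draining the backlog at a controlled pace once the backend is healthy, is a common pattern. The sketch below is a minimal, hypothetical illustration; the EventQueue class and its drain rate are assumptions, not Slack’s implementation.

```python
import time
from collections import deque

# Minimal sketch of the pattern Slack describes: pause a job queue during an
# incident, hold incoming events in a backlog, then drain the backlog at a
# capped rate once the backend is healthy again. The class name and numbers
# are illustrative assumptions, not Slack's implementation.


class EventQueue:
    def __init__(self, drain_rate_per_sec=50):
        self.backlog = deque()
        self.paused = False
        self.drain_rate = drain_rate_per_sec

    def submit(self, event):
        if self.paused:
            self.backlog.append(event)  # hold events instead of hitting the backend
        else:
            self.process(event)

    def process(self, event):
        print(f"delivering {event}")  # stand-in for the real event handler

    def resume(self):
        self.paused = False
        while self.backlog:  # drain the backlog without overwhelming the backend
            for _ in range(min(self.drain_rate, len(self.backlog))):
                self.process(self.backlog.popleft())
            if self.backlog:
                time.sleep(1)


if __name__ == "__main__":
    queue = EventQueue(drain_rate_per_sec=2)
    queue.paused = True  # incident response: stop sending load downstream
    for i in range(5):
        queue.submit(f"event-{i}")
    queue.resume()  # backend is healthy again; replay the backlog
```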
Cisco’s ThousandEyes said in its report on the episode that users need “to monitor the performance of your entire service delivery chain, so that you can quickly pinpoint the specific fault domain when a problem pops up.”
“With this understanding, you can take the right steps to mitigate the user impact and resolve the issue,” according to the report. “These steps may include switching to a backup system or taking other mitigation actions on your end. And in some cases, you may discern that it’s best to simply wait for the issue to resolve itself.”
March Microsoft 365 Outage
At 1:34 p.m. Pacific March 1, Microsoft published an alert to X–formerly known as Twitter–saying that an issue had left users unable to access Outlook features and services.
Outage reports for Outlook peaked at about 35,000 in the U.S. on Downdetector, according to UPI. About 25,000 Microsoft 365 subscribers in the U.S. reported outages shortly after the Outlook reports started.
At 2:48 p.m. the same day, Microsoft posted to X that “a majority of impacted services are recovering following our change” and posted at 4:02 p.m. that “following our reversion of the problematic code change, we’ve monitored service telemetry and worked with previously impacted users to confirm that service is restored.”
April Zoom Outage
Zoom saw a two-hour outage on April 16, first reporting the issue at 11:25 a.m. Pacific.
Downdetector reports by affected users reached about 67,000 by noon Pacific, according to Reuters.
The communications platform vendor blamed “a server block by GoDaddy Registry” due to “a communication error between Zoom’s domain registrar, Markmonitor, and GoDaddy Registry, which resulted in GoDaddy Registry mistakenly shutting down the zoom.us domain,” according to the vendor’s report on the incident.
“Any start, join, or schedule meetings actions were unable to be completed successfully since the requests required a DNS lookup, which could not be completed,” according to the report.
To prevent the issue from happening again, GoDaddy and Markmonitor “put in place a registry lock that restricts server block commands from being placed on the zoom.us domain.”
ThousandEyes’ report on the outage noted that “authoritative nameservers for zoom.us, which are hosted by AWS Route53, were reachable, available, and seemingly configured correctly for the duration of the outage, and they returned records for zoom.us if queried directly.”
“However, because of the missing NS records at the TLD level, DNS clients were not pointed to the Route53 authoritative” nameservers, according to the report. “This highlights the importance of monitoring not only your own nameservers, but the public DNS infrastructure, as well.”
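The distinction ThousandEyes draws, healthy authoritative nameservers but a missing delegation at the TLD, can be checked by querying a TLD nameserver for the domain’s NS records directly instead of relying on a recursive resolver. The sketch below uses the open-source dnspython library to do that for zoom.us; it is a simplified illustration, not a Zoom or ThousandEyes tool.

```python
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

# Simplified delegation check using the dnspython library: ask a TLD
# nameserver directly whether it still hands out NS records for the domain,
# independent of whether the domain's own authoritative servers are healthy.
# Illustrative only; not a Zoom or ThousandEyes tool.

DOMAIN = "zoom.us"
TLD = "us."

# Find a nameserver for the .us TLD, then resolve it to an IP address.
tld_ns_name = dns.resolver.resolve(TLD, "NS")[0].target
tld_ns_ip = dns.resolver.resolve(tld_ns_name, "A")[0].address

# Ask that TLD server (non-recursively) for the domain's NS delegation.
query = dns.message.make_query(DOMAIN, "NS")
response = dns.query.udp(query, tld_ns_ip, timeout=5)

# A normal referral carries the delegation in the authority (or answer) section.
delegations = [rrset for section in (response.answer, response.authority)
               for rrset in section if rrset.rdtype == dns.rdatatype.NS]

if delegations:
    print(f"{DOMAIN} is delegated to:",
          [str(ns) for rrset in delegations for ns in rrset])
else:
    print(f"No NS delegation for {DOMAIN} at the TLD; resolution would fail")
```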
May SentinelOne Outage
In May, SentinelOne saw a global platform outage that prevented access to its widely used consoles.
In the cybersecurity vendor’s report on the issue, SentinelOne said “a software flaw in an outgoing infrastructure control system triggered an automatic function that removed critical network routes” at 13:37 UTC May 29.
SentinelOne engineering confirmed at 20:05 UTC that day “that the manual restoration of all routes was completed and began validating customer console access,” with a subsequent post to the customer and partner portals saying that console access was restored.
The entire data ingestion backlog was burned down by 10:00 UTC May 30, according to the report.
June IBM Login Issues
IBM has experienced several episodes this year in which users had trouble logging in to its cloud platform, two of them occurring in June.
On June 2, the Armonk, N.Y.-based cloud, mainframe and AI vendor started investigating the problem at 15:24 UTC and marked the incident resolved at 23:12 UTC, according to a report on the incident. Users couldn’t log in to IBM Cloud through the console, command-line interface (CLI) or API.
Users also couldn’t manage or provision cloud resources, authenticate identities or access the support portal for opening or viewing support cases. Impacted services ranged from the Watsonx AI platform to Cloud Object Storage and the Netezza Performance Server.
The issue happened again on June 4. IBM started investigating at 11:13 UTC and marked the login issue resolved by 14:45 UTC.
IBM Cloud users previously saw login issues on May 20. IBM started investigating at 15:56 UTC and marked the incident resolved at 17:40 UTC.
June Google Cloud, Cloudflare Outages
On June 12, a Google Cloud outage took out a variety of popular websites and applications including Spotify and Discord.
Google’s report on the incident puts the start time at 10:51 a.m. Pacific and the end time at 6:18 p.m. the same day. The issue traces back to a new feature added to Service Control, the core binary in the check system that ensures application programming interface (API) requests are authorized and carry the appropriate policies before reaching endpoints.
In larger regions, such as us-central1, which includes Iowa, Service Control task restarts overloaded the infrastructure. The issue took almost three hours to fully resolve in that region, and Google throttled task creation to minimize the infrastructure impact.
Moving forward, Google plans to modularize Service Control’s architecture to isolate the functionality and “fail open”–that is, default to an accessible state if a future failure happens, according to the vendor.
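Google hasn’t published the planned Service Control changes, but the “fail open” pattern it describes is easy to illustrate: if the policy-check subsystem itself errors out, the request is allowed through rather than all traffic being rejected. The sketch below shows that trade-off with hypothetical names; none of it is Google’s code.

```python
# Illustrative sketch of a "fail open" policy check, the pattern Google says
# Service Control will adopt: if the policy-check subsystem itself fails, the
# request is served without the check rather than all traffic being rejected.
# Names are hypothetical; this is not Google's implementation.


class PolicyBackendError(Exception):
    """Raised when the policy/quota subsystem cannot be reached."""


def check_policy(request):
    # Stand-in for the real policy and quota lookup; simulate a failure here.
    raise PolicyBackendError("policy metadata unavailable")


def authorize(request, fail_open=True):
    try:
        return check_policy(request)
    except PolicyBackendError:
        if fail_open:
            return True  # degrade gracefully: serve the request without the check
        raise  # fail closed: reject traffic when the check itself is broken


if __name__ == "__main__":
    print("request allowed:", authorize({"path": "/v1/widgets"}))
```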
Cloudflare, meanwhile, said in its own report on the incident that its Workers KV key-value data store saw the failure of underlying storage infrastructure that is backed in part by Google Cloud.
Workers KV “is a critical dependency for many Cloudflare products and relied upon for configuration, authentication and asset delivery across the affected services,” according to the vendor.
The incident started at 17:52 UTC June 12, and the impact ended at 20:28 UTC the same day. Moving forward, Cloudflare plans to eliminate singular dependencies on third-party storage infrastructure to improve recovery, according to the vendor.
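Cloudflare hasn’t detailed how it will remove the dependency, but the usual remedy is to read and write through more than one storage backend so that a single provider outage degrades the service instead of breaking it. The sketch below is a hypothetical illustration of that idea, with invented class names, not Cloudflare’s architecture.

```python
# Hypothetical sketch of avoiding a single storage dependency, the class of
# change Cloudflare describes: try a primary backend and fall back to a
# replica on another provider if the primary is unavailable. The class and
# method names are invented, not Cloudflare's architecture.


class StorageUnavailable(Exception):
    pass


class Backend:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self._data = {}

    def get(self, key):
        if not self.healthy:
            raise StorageUnavailable(self.name)
        return self._data.get(key)

    def put(self, key, value):
        if not self.healthy:
            raise StorageUnavailable(self.name)
        self._data[key] = value


class ReplicatedKV:
    """Read from the first backend that answers; write to every healthy one."""

    def __init__(self, backends):
        self.backends = backends

    def get(self, key):
        for backend in self.backends:
            try:
                return backend.get(key)
            except StorageUnavailable:
                continue  # that provider is down; try the next one
        raise StorageUnavailable("all backends")

    def put(self, key, value):
        for backend in self.backends:
            try:
                backend.put(key, value)
            except StorageUnavailable:
                pass  # best effort; a real system would queue the write for repair


if __name__ == "__main__":
    primary = Backend("provider-a", healthy=False)  # simulate the provider outage
    secondary = Backend("provider-b")
    kv = ReplicatedKV([primary, secondary])
    kv.put("config:site", "v42")
    print(kv.get("config:site"))  # still served, via the replica
```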
In a report on the incident, Cisco’s ThousandEyes division said that the lesson to users is “dependency chains are often longer than you think, and an issue can often manifest differently at various points in the chain, ranging from partial service failure to total service failure, depending on the impacted architecture.”
Even cloud-agnostic vendors have some dependencies on major cloud providers, meaning that users without direct relationships to Google and other hyperscalers can still feel the pain of an outage.
“Internet infrastructure, while robust in many ways, has evolved some single points of failure that can cascade far beyond their original scope,” according to the report. “IT teams must have deep visibility across their full service delivery chain to proactively identify potential issues and their source—and comprehensive backup plans in place to mitigate impacts on users when outages do happen.”
July Ingram Micro Outage
The aftermath of one of the most recent major outages of the year so far continues to unfold.
On Wednesday, IT distribution giant Ingram Micro said that it can once again process and ship orders received electronically across all of its business regions, ending a nearly weeklong outage.
The outage, which Irvine, Calif.-based Ingram Micro later acknowledged was the result of a ransomware attack, reportedly began July 3. When the distributor identified ransomware on certain internal systems, it took those systems offline as part of its mitigation measures.
The company also launched an investigation with the assistance of leading cybersecurity experts and notified law enforcement.