Unexpected Downtime In An Unexpected Year
The coronavirus crisis has been an extraordinary test for the cloud, both of its ability to deliver the service capabilities needed to keep business operations humming through quarantines and work-from-home mandates and of its ability to maintain availability amid massive surges in load.
The cloud has mostly met the moment, and service providers across the stack have largely been rewarded with adoption by new customers and the revenue and stock valuation increases they bring.
But as experts had expected, there have been hiccups and failures along the way.
While 2020 saw no truly catastrophic outages of the kind lasting more than a few hours, businesses have at times been frustrated and stymied by unexpected downtime.
Here is CRN’s list of the 10 most significant cloud outages in a year of unprecedented operational distractions, one that we will likely look back on as an inflection point for the technology.
Microsoft Azure, March 3
A six-hour outage, starting at 9:30 a.m. ET, struck the U.S. East data center for Microsoft’s Azure cloud, limiting the availability of Azure cloud services for some North American customers.
A few days later, Microsoft disclosed that a cooling system failure was to blame. Malfunctioning building automation controls caused a reduction in airflow, and the subsequent temperature spikes throughout the data center hampered the performance of network devices, rendering compute and storage instances inaccessible.
Microsoft ultimately reset the cooling system controllers, and once the temperature fell, engineers power-cycled hardware to resume services.
Microsoft Azure, March 24-26
Microsoft confirmed that a series of March outages affecting European customers was caused by strains the COVID-19 pandemic placed on several cloud services.
Developers were uniquely impacted, as the first casualty on March 24 was Azure Pipelines, a continuous delivery service used by DevOps teams. For the next few days, software development pipelines experienced significant delays.
“This incident was caused by VM capacity constraints arising from the global health pandemic that led to increased machine reimage times and then increased wait times for available agents,” Microsoft later explained.
By the end of the week, Microsoft accepted blame for not promptly addressing the failure.
“On the first day, when the impact was most severe, we didn’t acknowledge the incident for approximately five hours, which is substantially worse than our target of 10 minutes,” Engineering Director Chad Kimes said.
Google Cloud Platform, March 26
Google users started reporting problems accessing several cloud services just after 11 a.m. on March 26.
Many tweeted that they encountered HTTP 500 and 502 errors. A 500 status code means a request failed because of an internal server error; a 502 denotes a bad gateway.
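For readers unfamiliar with those codes, the short sketch below looks them up in Python's standard `http.HTTPStatus` table. It is only an illustration of what the two status codes mean; the helper name `describe` is hypothetical, and nothing here reflects how Google's services generated the errors.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return a readable label for an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)
    # Codes in the 500-599 range signal a failure on the server side,
    # not a mistake in the client's request.
    side = "server-side" if 500 <= code <= 599 else "not server-side"
    return f"{status.value} {status.phrase} ({side})"


for code in (500, 502):
    print(describe(code))
```

Both codes fall in the 5xx class, which is why users could do little beyond waiting for the provider to fix the problem.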
Google ultimately described the outage as having to do with its “infrastructure components.”
Google customers on the Eastern seaboard seemed most impacted, according to Downdetector, which offers real-time status and outage information for service providers.
GitHub, April 21
GitHub, the source code repository owned by Microsoft, saw multiple outages near the end of April.
GitHub services first struggled for more than an hour on April 21. The next day, two back-to-back outages again stalled the work of developers who rely on the platform, and another outage the following day affected multiple GitHub services for more than an hour.
Git Operations, API requests, pull requests and other functionality that software engineers rely on as part of their day-to-day work were degraded. Developers went to Twitter to criticize Microsoft for a lack of transparency as the rolling outages continued through the week.
IBM Cloud, June 9
IBM blamed a third-party networking failure for a serious cloud outage that brought many Big Blue customers, including some popular websites, to a sudden halt.
The CEO of one IBM Business Partner told CRN that customers across the U.S. lost access to their environments, status screens and consoles, and that they had “no sense of what was happening.”
“It affected everything,” he said. “The whole environment was down.”
The IBM Cloud status page, which also was briefly down during the Tuesday disruption, reported a slew of issues that were resolved after 6:30 p.m. ET.
“The network operations team adjusted routing policies to fix an issue introduced by a 3rd party provider and this resolved the incident,” the IBM status page explained.
Cloudflare, July 17
A Cloudflare outage apparently caused by a malfunctioning router on the CDN provider’s global backbone network brought down a slew of web services across many parts of the world.
Cloudflare quickly rerouted operations from a dozen data centers to get affected customers back online as complaints flooded Twitter. The outage appears to have lasted less than a half-hour.
Cloudflare later posted on its status page: “This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and are monitoring systems for stability now.”
Salesforce, Aug. 11
After hiccups in mitigating a virtual server problem, Salesforce resolved a disruption that took some customers offline for almost four hours.
The service disruption started at 11:54 a.m. Pacific Time, affecting users hosted on Salesforce’s NA89 instance, according to Salesforce’s status page and notifications sent to customers. The instance—one of nearly 200 for North America—runs in data centers in Phoenix and Washington, D.C., with transactions replicated across those locations for redundancy.
By 1:10 p.m., the CRM giant reported a successful site switch to reroute traffic after its team identified the likely problem as a power outage affecting network switches routing traffic.
But problems continued, as “the NA89 instance then went into a brief period of performance degradation” for the next two minutes, the status page reported. That degradation also impacted Salesforce Live Agent, a tool on the platform for real-time communications with website users.
At 3:23 p.m., Salesforce reported: “We have identified a potential cause of the Live Agent issue and are urgently working on implementing a fix to resolve the impact to customers.” All was well two minutes later, ending a three-hour, 43-minute outage.
Zoom, Aug. 24
Zoom experienced a partial outage on the morning of Aug. 24 that prevented users from accessing their meetings and video webinars.
The San Jose, Calif.-based company, whose cloud-based online videoconferencing platform has become a linchpin during the new work-from-home era forced by the coronavirus pandemic, acknowledged at 8:51 a.m. ET that it was receiving reports of users being unable to visit the Zoom website (Zoom.us) and unable to start and join Zoom Meetings and Webinars.
An hour later, Zoom said it had identified the issue and was working to resolve it.
The company also had to mitigate problems with the web portal and web client for its website.
Other Zoom capabilities, including Zoom Phone, Chat, a conference room connector, cloud recording, meeting telephony services and the Zoom developer platform, remained operational throughout the incident.
Microsoft 365 and Azure, Sept. 28
A problem with Azure Active Directory locked users from across the U.S. out of their Microsoft Office 365 accounts, halting many businesses in their tracks that Monday afternoon.
The five-hour outage impacted Microsoft 365 and some Azure cloud services from 5:25 p.m. ET to 10:25 p.m. ET.
On its status page, Microsoft said customers encountered errors attempting to authenticate logins to Microsoft 365, Azure, Dynamics 365, and custom applications using its Azure Active Directory single sign-on service. Only users not already signed in saw those authentication request failures.
Microsoft’s preliminary analysis found “a combination of three separate and unrelated issues” behind the problem: a code defect in a service update; a tooling error in the Azure AD safe deployment system that impacted regional scoping; and a code defect in Azure AD’s rollback mechanism, which delayed an attempt to revert the service update.
The outage mostly affected customers in the Americas because the problem was “exacerbated by load,” Microsoft said, although other regions may have also seen disruptions.
Microsoft Office 365, Oct. 7
Microsoft Teams, Outlook, SharePoint Online, OneDrive for Business, and Outlook.com all saw degraded functionality after Microsoft attempted to update its network infrastructure on Oct. 7.
The third Office 365 outage in less than two weeks began that Wednesday afternoon at about 2:10 p.m. ET, according to Microsoft. User reports of Office 365 problems on downdetector.com spiked at about 2:26 p.m. ET.
At 2:48 p.m. ET, the Microsoft 365 Status account on Twitter acknowledged the outage.
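The gaps between those three timestamps can be worked out with a few lines of Python. This is a rough sketch only: the times are the approximate ones reported above, all treated as Eastern Time on Oct. 7, 2020, and the variable and function names are illustrative rather than anything Microsoft or Downdetector uses.

```python
from datetime import datetime

DAY = "2020-10-07"  # the Wednesday of the outage, per the article


def at(hhmm: str) -> datetime:
    """Build a same-day timestamp from a 24-hour HH:MM string."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Approximate times reported in the article (all Eastern Time).
onset = at("14:10")          # outage began, according to Microsoft
reports_spike = at("14:26")  # user reports spiked on downdetector.com
acknowledged = at("14:48")   # Microsoft 365 Status acknowledged on Twitter

print("Onset to report spike:", minutes_between(onset, reports_spike), "min")
print("Onset to acknowledgment:", minutes_between(onset, acknowledged), "min")
```

On these figures, user reports spiked roughly 16 minutes after the outage began, and the public acknowledgment came roughly 38 minutes after it began.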
Microsoft later said on its status page: “Further investigation has confirmed that a recent update to network infrastructure resulted in impact to Microsoft 365 services. Our telemetry indicates continued recovery within the environment following the reversion of the update.”
Microsoft Teams was among the first services to fully recover, while Exchange Online and Outlook.com took longer.