The 10 Biggest Cloud Outages Of 2022 (So Far)
Cloud outages so far this year have affected companies including Apple, Microsoft and Google.
Apple iCloud, Microsoft Azure and Google Cloud are among the technology vendors to experience major cloud outages in 2022 so far.
The issues vendors have experienced range from cut fiber cables to changes in coding to an air system shutdown in a data center.
Experts have told CRN that the era of cloud computing has resulted in more frequent outages but with less severity.
Click through the slideshow for more details about each outage.
Here are some of the biggest cloud outages to hit the globe in 2022 so far:
January Apple Outage
Apple’s iCloud cloud storage service and other services experienced an outage in January, impacting some users across the globe, according to outage tracking website Downdetector.
There were 866 outages reported as of 8 p.m. ET., according to Downdetector. Other Apple services were also suffering from an outage, including the Apple Store, the App Store and Apple support.
The Cupertino, Calif.-based tech giant’s own system status page showed an ongoing issue with iCloud backup, iCloud storage upgrades, iMessage, iCloud mail and photos. Some users are also unable to log in to their iCloud account. It’s unknown how widespread the issues are.
Two January IBM Outages
IBM Cloud had a tough start to the year with an issue with its classic infrastructure network, which “provides connectivity across a global footprint of over 60 IBM Cloud data centers and 28 points of presence (PoPs),” according to the tech giant.
Armonk, N.Y.-based IBM started investigating the issue on Jan. 2 and resolved it in about five hours, according to a report. IBM Cloud services users in the Dallas area were affected.
The next day, an issue with IBM’s virtual private cloud offering lasted for about an hour, according to the company. The issue affected users in Washington, D.C.; Japan; London; Dallas; Toronto; Germany and other areas, according to an IBM report.
February Slack Outage
In February, Slack users were hit with a major outage, the collaboration tools company confirmed on a status page on Slack’s website.
At 11:24 a.m. Pacific Time on the day, Slack – a Salesforce subsidiary – announced the issue resolved.
“We‘ve resolved the issue, and all impacted customers should now be able to access Slack,” according to a post on Slack’s status page. “You may need to reload Slack (Cmd/Ctrl + Shift + R) to see the fix on your end. If that doesn‘t work, try clearing cache (Help > Troubleshooting > Clear Cache and Restart from the app menu). Thanks for bearing with us and we apologize for the disruption to your work day!”
Slack’s first message on the issue came at 6:25 a.m. Pacific. ”We’re investigating the issue where Slack is not loading for some users,” according to the message. “We’re looking into the cause and will provide more information as soon as it‘s available.”
Almost 11,000 reports of a Slack outage were logged on Downdetector at 6:19 a.m. Tuesday.
March Google Cloud Outage
On March 8, users of Google’s Traffic Director tool experienced “elevated service errors for 2 hours and 35 minutes,” according to the cloud giant. Services such as Spotify and Discord were hit by the outage.
“A change to the Traffic Director code that processes the configuration was updated,” according to a post by Mountain View, Calif.-based Google. “The code change assumed that the configuration data format migration was fully completed. In fact, the data migration had not completed.
The post continued: “It would inadvertently delete the configurations which caused the downstream clients to lose their programming and deconfigure the data plane.”
March Apple Outage
In March, several major Apple services appeared down, including its App Store, Apple Maps, Apple TV and many other key products.
Apple’s corporate and retail systems were down as well, preventing corporate employees from working at home. Apple told staff that the outage stemmed from domain name system (DNS) problems.
Apple confirmed outages on its system status updates, and said that 15 services were down for “some users.” Reports began coming in from users just after 1 p.m. Eastern time on the day, and Apple showed services were still down at 2:30 p.m.
In a Twitter message from its customer support account, the Cupertino, Calif.-based tech giant said that it was aware of the outage and that it was “working to get it resolved as quickly as possible.” By 5 p.m. Eastern that day, all services had been restored.
April Atlassian Outage
An Atlassian outage started on April 5, with some customers restoring services by April 8 and the rest waiting until April 18, according to the company.
The cloud tools provider, which has offices in Australia and San Francisco, said the outage was due to a “communication gap” between teams working on deleting a standalone legacy application and “insufficient system warnings.”
“Although this was a major incident, no customer lost more than five minutes of data,” according to the company, whose most notable products include Jira and Trello. “In addition, over 99.6% of our customers and users continued to use our cloud products without any disruption during the restoration activities.”
To prevent the issue in the future, the company has plans for universal “soft deletes” across all systems; adding more customers to its automatic restoration program for multi-site, multi-product deletion events; and creating a large-scale incident communications playbook.
May Mimecast Outage
Mimecast, the U.K.-based provider of cloud cybersecurity services, said in May that a “major power outage” caused its North American grid to suffer problems causing delays and “degraded service” for customers.
“We can confirm we experienced a major power outage in one of our US data centers that impacted all power supply sources, including the backup generators, which in turn caused a cascading issue and resulted in degraded performance,” Mimecast said at the time on its status page.
“Midway through our Disaster Recovery Process, the power was restored which complicated our ability to recover. As of now, services are available and we are processing email but customers may still experience delays until the backlog clears. We sincerely apologize for the extended outage and the disruption caused,” it continued.
Cut Cable Downs Google Cloud In June
On June 7, two fiber cuts in Google Cloud’s Middle East network “affected the end-to-end path for several submarine cables, reducing capacity for many telecom and technology companies, including Google,” according to the tech giant.
Although the brunt of the outage hit users outside the U.S., users of Google’s Virtual Private Cloud requiring cross-regional traffic among the northamerica-northeast (Canada), northamerica-southeast and all us-east regions (which includes South Carolina, Northern Virginia and Ohio) “would have experienced packet loss of up to 50%” for about two hours, then “elevated network latency” for another two hours, according to a Google Cloud report.
“To our customers that were impacted during this outage, we sincerely apologize,” according to the report. “We are conducting an internal investigation and are taking steps to improve our
June Microsoft Azure And M365 Online Outages
On June 7, customers had trouble connecting to resources hosted in the East U.S. 2 region, located in Virginia, according to Microsoft. The issue lasted for about 12 hours and should not have affected customers with always-available or zone-redundant services.
The Redmond, Wash.-based tech giant blamed the outage on “an unplanned power oscillation in one of our datacenters within one of our Availability Zones in the East US 2 region,” according to a Microsoft report.
It continued: “Components of our redundant power system created unexpected electrical transients, which resulted in the Air Handling Units (AHUs) detecting a potential fault, and therefore shutting themselves down pending a manual reset.”
The outage affected Application Insights, Log Analytics, Managed Identity Service, Media Services and NetApp Files, according to the report.
Microsoft is working on ways to “improve our tooling and processes to flag anomalies more quickly” and “fine-tuning our alerting to inform onsite datacenter operators more comprehensively,” according to the report.
The company is also “developing a plan for fault injection testing relevant critical environment systems, in partnership with our industry partners, to be even more proactive in identifying and remediating potential risks” and “expanding how many Azure services support Availability Zones, so that customers can opt for automatic replication and/or architect their own resiliency across services.”
On June 21, Microsoft tweeted that it was investigating delays and connection issues with Exchange Online. About two hours later, the company tweeted that it “determined multiple Microsoft 365 services are experiencing delays, connection and search issues,” responding by rerouting traffic.
About nine hours later, Microsoft tweeted that “rerouting traffic combined with targeted infrastructure restarts has successfully restored service access and functionality.”
June Cloudflare Outage
An accidental outage at Cloudflare in June caused major disruptions across large swaths of the Internet, reportedly hitting popular sites such as Discord, Shopify, Grindr, Fitbit and Peloton.
The San Francisco-based vendor, which offers security and performance services for cloud deployments, said the problem was the result of “our error” and was fixed within about an hour and 15 minutes.
In a blog post, Cloudflare said the outage in the early hours of Tuesday affected traffic in 19 of its data centers.
“Unfortunately, these 19 locations handle a significant proportion of our global traffic,” the company said. “A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.”
The company concluded in the introduction of its blog post: “We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.”