The 10 Biggest Cloud Outages of 2015

A More-Perfect Cloud

There's a paradoxical dynamic taking place in the cloud—while outages have become less common, their impact is more widespread and damaging than ever.

Uptime has been improving, and will inevitably continue to do so, as cloud technologies mature and providers gain operational experience.

It's that incremental, steady progress toward greater reliability that's giving enterprises, governments and academic institutions the confidence to migrate mission-critical workloads whole hog to the cloud.

Which is why, even though some of the outages below weren't as catastrophic as those on past incarnations of this list, they created more problems, at times halting operations of entire government agencies and shutting down commerce for scores of high-tech businesses.

No provider is perfect, but those that host the bulk of the world's workloads are most worthy of scrutiny, which is why tech giants like AWS, Microsoft, Google, and Apple are so prominent on this list.


Verizon Cloud, January 10 and 11

While a cloud provider's worst fear is a prolonged unplanned outage, Verizon Communications stunned customers by taking its cloud offline for some 40 hours over a weekend to carry out a comprehensive, scheduled system maintenance project.

One reason for the upgrade of its cloud infrastructure, ironically, was to prevent future outages.

While many customers were peeved their provider intentionally cut their cloud service, some took solace knowing Verizon spent those 40 hours adding seamless upgrade capabilities that would enable future upgrades to be executed on live systems without disruptions, or even the need to reboot servers.

Google Compute Engine, February 18 and 19

Multiple zones of Google's IaaS offering went down just before midnight. After about an hour of downtime, service for most affected customers returned around 1 a.m. the next morning.

While some connectivity issues lasted almost three hours, there were roughly 40 minutes during which most outbound data packets sent by Google Compute Engine virtual machines were simply being dropped.

Google said the problem was "unacceptable" and apologized to users who were affected.

About three weeks later, in a similar event, another network error brought down Google's IaaS cloud by clamping off outbound traffic. Some users lost service for up to 45 minutes.
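
For customers, the first sign of an incident like this is usually their own egress traffic going nowhere. As a rough illustration (not anything Google provides), here's a minimal Python sketch of an outbound-connectivity probe a virtual machine could run against endpoints you control; the hostnames, ports and threshold are placeholders:

```python
# Minimal outbound-connectivity probe. The targets below are hypothetical
# endpoints you control; alert when most egress attempts fail, which was
# the symptom described during the February incidents.
import socket
import time

PROBE_TARGETS = [("example.com", 443), ("example.org", 443)]  # placeholders
FAILURE_THRESHOLD = 0.5  # alert if more than half the probes fail

def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:
    results = [probe(host, port) for host, port in PROBE_TARGETS]
    failure_rate = 1 - sum(results) / len(results)
    if failure_rate > FAILURE_THRESHOLD:
        print(f"{time.ctime()}: outbound connectivity degraded "
              f"({failure_rate:.0%} of probes failing)")
    time.sleep(60)
```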

Apple iCloud, March 11

Millions of people around the world couldn't buy digital music, books, or apps for almost 12 hours. Thankfully, most of them survived.

Apple, in its apology, blamed an internal DNS error for taking down its iTunes and App Store services. Some iCloud email accounts were also briefly affected.
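
When a provider blames DNS, administrators can often confirm it from their own machines. The standard-library Python sketch below distinguishes a name-resolution failure from an unreachable service; the hostname is purely illustrative, not an official diagnostic endpoint:

```python
# Quick check to tell a DNS failure apart from an outage of the service
# itself, using only the standard library. The hostname is illustrative.
import socket

HOSTNAME = "itunes.apple.com"  # illustrative, not a sanctioned test endpoint

try:
    addresses = {info[4][0] for info in socket.getaddrinfo(HOSTNAME, 443)}
    print(f"DNS resolves {HOSTNAME} to: {', '.join(sorted(addresses))}")
except socket.gaierror as exc:
    # A resolution error here points at DNS, not the service behind it.
    print(f"DNS lookup for {HOSTNAME} failed: {exc}")
```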

Microsoft Azure, March 16

Two of Microsoft's Azure public cloud services went down for more than two hours for customers in the central U.S., due to what the software giant described as a "network infrastructure issue."

The outages, which began just after 1 p.m. CDT, affected customers of Microsoft's Azure Virtual Machines (Infrastructure-as-a-Service) and Azure Cloud Services (Platform-as-a-Service) offerings, the Redmond, Wash.-based vendor reported on its Azure Status webpage.

Microsoft described the issue as a "partial service interruption" and said the service had been restored to full availability by 3:19 p.m. CDT.

Microsoft Azure, March 17

Microsoft's public cloud hadn't even been humming along for a full 24 hours before a second outage in as many days took down virtual machines, websites and other cloud services, this time affecting a denser concentration of customers on the East Coast.

Microsoft reported the problem on its Azure status page as starting at 1:30 p.m. EDT. The second-largest public cloud provider in the world informed customers that the service disruption was rooted in a problem with storage systems.

Apple iCloud, May 20

Eleven Apple services, including email, suffered a seven-hour outage. Some went down entirely, while others were just working really, really slowly.

Disrupted services included iCloud Drive, Photos, Documents, Find My iPhone, Back to My Mac, iCloud Backup, iCloud Keychain, iCloud Mail, iMovie Theater and iWork for iCloud Beta.

According to Apple's system status page, some 40 percent of the world's 500 million iCloud users were affected.

Amazon Web Services, August 10

Amazon Web Services, the world's largest public cloud provider, suffered a rare outage in the early morning hours of August 10, a service disruption that brought down many popular websites.

The problem seemed to originate at an AWS data center in northern Virginia, where the AWS status page listed a range of errors.

Amazon reported "increased error rates" for its Elastic Compute Cloud, EC2, and "elevated errors" for its Simple Storage Service, known as S3, between 12:08 and 3:40 a.m. PDT.

Partner accounts and reports on Twitter said many customers of those two AWS workhorse services were left in the lurch during those hours.
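
For workloads that can tolerate a delay, the standard defensive pattern during "elevated error rates" is to retry transient failures with exponential backoff. Below is a minimal sketch using boto3; the bucket, key and list of retryable error codes are illustrative assumptions, not an AWS-prescribed recipe:

```python
# A minimal retry-with-backoff sketch for riding out elevated S3 error
# rates. Bucket and key names are placeholders; boto3 must be installed
# and configured with credentials.
import random
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def get_object_with_backoff(bucket, key, max_attempts=5):
    """Fetch an S3 object, retrying transient errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return s3.get_object(Bucket=bucket, Key=key)
        except ClientError as exc:
            code = exc.response.get("Error", {}).get("Code", "")
            # Retry only errors that look transient; this list is illustrative.
            if code not in ("SlowDown", "InternalError", "ServiceUnavailable"):
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus noise.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"giving up on s3://{bucket}/{key} after {max_attempts} attempts")

# Usage (placeholder names):
# body = get_object_with_backoff("my-bucket", "reports/latest.csv")["Body"].read()
```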

Google Compute Engine, August 13 to August 17

On a Thursday morning in Belgium, an electrical storm shot four lightning bolts into the power grid abutting Google's ultra-energy-efficient data center near the town of St. Ghislain.

Those successive surges seem to have set off a series of technical events and failures that ultimately resulted in some I/O errors.

According to Google, data was lost on only a tiny fraction of the persistent disks serving Google Compute Engine instances.

While Google said just about every stray bit of data was eventually recovered and restored, data centers are supposed to insulate servers, and customer data, from high-voltage surges, such as those caused by lightning.

In this case, an ultra-efficient energy architecture coupled with an epic storm seems to be to blame.

Google Compute Engine, November 23

Google's networking engineers tried to activate an additional peering link to a European carrier. The peer's network signaled that it could handle routing a surprisingly broad array of traffic, but that turned out not to be the case.

The line quickly saturated and the connecting network dropped most of the packets routed to Eastern Europe and the Middle East from the affected Western European data center.

Compute Engine couldn't communicate with those regions of the world for 70 minutes, between 11:55 a.m. and 1:05 p.m. PST.

The region's traffic volume decreased by 13 percent during the outage, according to Google.

Microsoft Office 365, December 6

An outage within Microsoft's Azure infrastructure brought down Office 365 in Western Europe for a large part of the afternoon.

Many users, most of them in the U.K., couldn't access their email, documents and other files used by Microsoft's cloud-based productivity tools. Some had intermittent problems for as long as four hours.

Microsoft later said an Active Directory configuration error caused the outage.