The 10 Biggest Cloud Outages of 2017 (So Far)
Enterprise customers increasingly appreciate that while no public cloud provider is perfect, neither is any alternative form of IT infrastructure. So instead of re-evaluating their embrace of the public cloud model after an outage, they're often more interested in understanding the root causes and feeling confident that whatever problems arose were adequately remediated.
Whether an outage impacts enterprise workloads or popular consumer applications, users also want to see that providers are both transparent and accepting of blame. It's the real-time response and post-mortem actions that often prove the difference between losing and keeping customers.
The bigger the provider, the higher the standard they're held to for reporting and remediation. Whether the outage was caused by a technical glitch, human error or malicious attack, customers want an honest assessment and explanation of the remedies put in place to ensure it won't happen again.
Here are 10 of the outages so far this year that sparked such discussions.
IBM, January 26
IBM's cloud credibility took a hit at the start of the year when a management portal used by customers to access its Bluemix cloud infrastructure (formerly branded SoftLayer) went down for several hours.
While no underlying infrastructure actually failed, users were frustrated to find they couldn't manage their applications or add or remove the cloud resources powering their workloads.
IBM said the problem was intermittent and stemmed from a botched update to the interface.
GitLab, January 31
GitLab's popular online code repository, GitLab.com, suffered an 18-hour service outage that ultimately couldn't be fully remediated. The problem arose when an employee removed a database directory from the wrong database server during maintenance procedures.
Some customer production data was ultimately lost, including modifications to projects, comments, and accounts.
"Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts," the company said in a post-mortem.
In an apology to users, GitLab's CEO said "losing production data is unacceptable."
Instapaper, February 9
A file size limit for a MySQL database on Amazon's RDS service sparked an extended outage at Instapaper, a Pinterest property.
The online bookmarking site later reported that its engineers never even knew of the RDS limit of 2 TB for databases created before April 2014, and were given no warnings by the AWS service that the table storing its "bookmarks" was about to exceed it.
After being down for more than a day, Instapaper's service was revived with limited access to archived material while engineers worked to restore the rest of the database. Four days later Instapaper completed a full recovery.
Facebook, February 24
For almost three long, painful hours, some users across the world were locked out of Facebook and worried their accounts had been hijacked.
The social media giant later explained that functionality meant to guard against hackers inadvertently sent users to a recovery screen that gave the impression someone else had logged into their accounts. Affected users were prevented from immediately logging back in.
Facebook confirmed no actual security breach had occurred.
It was the second time that week Facebook had problems. Days earlier, some people reported they couldn't see their news feeds.
AWS, February 28
This was the outage that shook the industry.
An Amazon Web Services engineer trying to debug an S3 storage system in the provider's Virginia data center mistyped a command, and much of the Internet – including many enterprise platforms like Slack, Quora and Trello – was down for four hours.
The post-mortem said the employee was using "an established playbook," and intended to pull down a small number of servers that hosted subsystems for the billing process. Instead, the accidental command resulted in a far broader swath of servers being taken offline, including one subsystem necessary to serve specific requests for data storage functions and another allocating new storage.
The outage from a provider that owns roughly a third of the global cloud market reignited debate on the risks of public cloud.
Microsoft Azure, March 16
Storage availability issues plagued Microsoft's Azure public cloud for more than eight hours, mostly affecting customers in the Eastern U.S.
Some users had trouble provisioning new storage or accessing existing resources in the region. A Microsoft engineering team later identified the culprit as a storage cluster that lost power and became unavailable.
In addition to that problem, Microsoft also listed on the Azure status page a software error affecting storage provisioning across multiple services for longer than an hour.
Microsoft Office 365, March 21
Several Microsoft business and consumer cloud services, including Office 365 storage and email services, became inaccessible due to problems authenticating users.
The widespread outage prevented customers from accessing OneDrive storage, Skype collaboration, Outlook email, and consumer products such as Xbox Live.
Lululemon on IBM, May 22
The popular yoga-gear retailer's website going down became a big deal when the company's CEO turned the spotlight on IBM's managed cloud services.
Lululemon's chief executive, Laurent Potdevin, appeared on CNBC and placed blame for lost e-commerce sales squarely in the lap of Big Blue. He said his team worked on the problem for 36 straight hours, and he had already talked to IBM CEO Ginni Rometty to express dissatisfaction.
"We're looking at our options," Potdevin said about a possible defection from IBM's cloud.
Microsoft Skype, June 19
Microsoft Skype users, mostly in Europe, endured connectivity problems due to an apparent Distributed Denial of Service attack.
Skype users started complaining about hours of downtime on June 19. The issues continued into the next day, with users losing connectivity and having trouble exchanging messages on the communications platform.
While Microsoft did not immediately confirm reports of a DDoS attack, a hacker group called CyberTeam claimed responsibility in a tweet.
Apple iCloud, June 28
Multiple social media feeds reported availability problems with Apple's iCloud Backup service. Apple's systems status page said iCloud Backup was down for less than 1 percent of users.
The problem, in which those affected could not restore iOS devices from previous backups, lasted for at least 36 hours. While the restore process would hang without completion, there was no problem initiating new backups of devices to protect data.