The 10 Biggest Cloud Outages Of 2011
10:00 AM EST Thu. Dec. 22, 2011
One of the most valuable lessons the IT industry learned in 2011 is that the cloud is not invincible. It's like any other IT system: It can crash. It can go down. It can be knocked offline by a lightning strike. While the cloud has proven itself (oftentimes) more reliable than its on-premises counterparts, it's not flawless. That was demonstrated repeatedly over the course of the year by a string of high-profile cloud outages. Here, we take a look at the 10 biggest cloud outages of 2011.
In September, Google Docs and a handful of other Google cloud offerings went offline for roughly an hour, making Google Document Lists, Google Documents, Google Drawings and Google Apps Scripts inaccessible for the majority of Google Apps users. Google quickly got the cloud services back up and running and said that the downtime was caused by a change that had been designed to improve real-time collaboration within the document list. That change, Google said, exposed a memory bug that caused the hour-long lull. Google Docs suffered another outage in October.
VMware's Cloud Foundry development platform was racked by a pair of blackouts in the same week, on April 25 and April 26. While still in beta at the time, the open-source Cloud Foundry service was knocked out of commission by a power outage that affected a storage cabinet power supply on April 25 around 5:45 a.m. The following day, around 10:15 a.m., an engineer who was developing an early detection plan to prevent outages like the one the previous day knocked Cloud Foundry offline with an errant keyboard tap, which took out all load balancers, routers and firewalls; caused a partial outage to portions of the internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.
Yahoo Mail, the search company's massive cloud-based e-mail service, went down on April 28. Yahoo could not say exactly how many users were impacted when its hugely popular e-mail service was down for several hours, but Yahoo estimated that more than 1 million of Yahoo Mail's more than 250 million users were affected. Yahoo never really said what caused the problem, but said no e-mail data was lost or at risk during the disruption.
Google's widely popular cloud e-mail service Gmail suffered a massive outage in late February 2011 that wiped out thousands of Gmail inboxes. Gmail users awoke to find that messages, folders and other data had vanished from their Gmail accounts. At its peak, the outage affected roughly 150,000 Gmail users. In the days that followed, Google apologized for the outage, calling it a "scare." Google said a software bug that was introduced by a storage update had caused the downtime. Gmail was back to full service within a few days.
An August lightning strike in Dublin took down Microsoft Business Productivity Online Suite (BPOS) for several hours. According to Microsoft, the four hours of downtime were the result of "a widespread power outage in Dublin" that "caused connectivity issues." In response to the outage, Microsoft offered its cloud services customers a credit for the downtime they suffered.
Microsoft wasn't the only cloud company clobbered by the Dublin lightning strike in August. Amazon, too, was taken down during the dramatic lightning storm. According to Amazon, the Dublin outage took down its Elastic Compute Cloud (EC2) and Relational Database Service (RDS) for several days. Investigators found that a power failure caused the cloud outage, and that the power failure was sparked by a transformer exploding and catching fire. Lightning was thought to have caused the transformer explosion, but that was never fully confirmed. Amazon said that backup generators failed to switch on once power went out and that it had to bring the generators online manually to re-establish connectivity, which took several hours.
In a brief yet impactful August cloud outage, a host of Amazon customers suffered bouts of downtime that lasted roughly 30 minutes. Several of Amazon's marquee customers, including Netflix, Quora, Reddit and Foursquare, were knocked offline and took to social networks and the Web to keep customers abreast of developments. According to Amazon, the August outage was caused by connectivity issues between three of its Availability Zones and the Internet. The outage affected Amazon Web Services Elastic Compute Cloud (EC2) and Amazon's Relational Database Service (RDS).
Just a couple of months after its official launch, Microsoft Office 365, Microsoft's cloud productivity suite, suffered its first full-fledged cloud outage. The company blamed "a network connectivity issue" for the outage that hit several of its online services in August, including Office 365, Dynamics CRM Online and Windows Live SkyDrive. All told, the outage lasted for roughly six hours and affected services hosted in one of Microsoft's North American data centers.
Between May 10 and May 13, Microsoft Business Productivity Online Suite (BPOS) suffered a string of cloud outages that caused lengthy cloud e-mail delays for BPOS users. Trouble started around 12:30 p.m., Tuesday, May 10, when the BPOS-S Exchange service experienced an issue with one of the hub components due to malformed e-mail traffic on the service. Microsoft said Exchange features a built-in capability to handle malformed traffic but "encountered an obscure case" where that also didn't work correctly, creating a backlog of e-mail. The issue caused delays of six to nine hours. Then, on May 13, more issues caused e-mail delays, resulting in more than 1.5 million e-mail messages getting stuck and awaiting delivery. Microsoft fixed that issue by 3:04 p.m. and all e-mails were cleared within a few hours. That May stretch was the beginning of a rough month for BPOS, which saw four outages in roughly a month's time.
Amazon's cloud outage that started April 21 was the granddaddy of cloud outages for 2011, drawing major attention to the cloud's sometimes fragile state and also showcasing that when the going gets tough, communication is key. The catastrophic outage knocked some of Amazon's biggest customers offline for several days.
If that wasn't bad enough, the massive cloud vendor remained silent as customers struggled to get back up and running. After nearly a week without a peep, Amazon released a highly technical, long-winded manifesto chronicling what happened (Amazon blamed a "re-mirroring storm") and offered customers an apology and a cloud credit. The incident was seen by many as a lesson in how not to handle an outage.