10 Notable Cloud Outages And What Caused Them

Putting your applications and information into the cloud carries some risk. While most cloud vendors and providers have created ways to protect data and users on the security side, sometimes the cloud goes down, and when it does it often isn't pretty. Although cloud computing is still in its infancy, many companies and consumers rely on it to cheaply and easily store data, develop applications and perform any number of tasks. They also trust that when they fire up their computers and hop online, the cloud and that data will be there for them.





We've dug through the archives and found 10 examples of what happens when the cloud and the data aren't there: when the cloud is down.

On Feb. 17, EMC's Atmos Online was unavailable for an unknown amount of time. Atmos Online is the cloud-based storage offering that is part of EMC's Atmos Infrastructure. Users attempting to log on to the Web-based EMC Atmos Online service were met with this greeting: "EMC Atmos Online Temporarily Down For Maintenance: The web site is currently unavailable and will be back up shortly. We apologize for any inconvenience and thank you for your patience."



In a statement, EMC said the Atmos Online outage was caused by maintenance issues, but did not elaborate.



"We are servicing Atmos Online for maintenance, which happens from time to time as we are still in beta mode for the online solution. With that, all current beta customers still have access to their data. We will be done with maintenance shortly," EMC said in a statement e-mailed to Channelweb.com. EMC did not divulge how long the outage lasted, how many Atmos Online users were affected or what type of maintenance knocked it offline.



The following day, the site was back up and EMC said: "We're upgrading and doing maintenance on the beta management portal. The data path has been available throughout this process."

On January 28, Microsoft Online Services users in North America were met with intermittent access to services, including Microsoft Business Productivity Online Standard Suite (BPOS).

According to Microsoft, some users served by a North American data center were affected. Here's what went down, according to a blog post from Microsoft: Monitoring alerted Microsoft to a possible issue; troubleshooting found there was a problem with network infrastructure resulting in intermittent access for customers.

In response to the incident, Microsoft said it found the root cause and took the steps necessary to remediate the issue.

Additionally, Microsoft reached out to affected business customers and offered them a credit if they were impacted.

"We understand that any disruption in service may result in a disruption to your business," said Michael Ziock, senior director, Business Productivity Online Service Operations for The Microsoft Online Services Team in a posting on the Microsoft Online Services Team Blog.

Microsoft did not say how long the intermittent access lasted or exactly how many customers were affected.

Subscribers to Apple's Web-based MobileMe service had two hours of downtime due to "scheduled maintenance" in August 2008. Users, however, sang a different tune, saying outages were more frequent and more regular than Apple let on.



MobileMe is Apple's service that lets users access e-mail, calendars and other data from any machine via the cloud, whether it's an iPhone, smartphone, PC or Mac, for an annual service charge.



Since then, there have been several more reports of MobileMe outages lasting three hours or more.

It lasted less than an hour, but affected nearly 1 million users. In January 2009, Salesforce.com, the grandfather of cloud computing, took a 40-minute hit during the height of the work day, leaving users without access to applications and data needed for customer information and transactions. As customers tried to access their accounts, they either couldn't access the site at all or were met with an error message.

In December 2009, Amazon Web Services' Elastic Compute Cloud (EC2) cloud computing platform was struck by a power failure in a Virginia data center and quickly hit by a second power failure in the redundant system, knocking down the EC2 cloud. The outage affected one of AWS' East Coast availability zones for roughly five hours.



Before that, Amazon EC2 went down in June 2009 for an extended period due to similar power issues caused by a lightning strike.

In this case, the early birds didn't get the worms. Early adopters of Microsoft's Windows Azure cloud computing offering faced a roughly 22-hour outage in March 2009. The outage meant users' applications weren't available. The users affected were using a test release of the Azure service, which Microsoft began offering as a pay service earlier this year.



In a blog post about the incident, Microsoft said the outage was sparked during a routine OS upgrade -- on Friday, March 13, no less. The deployment service within Windows Azure slowed down due to networking issues, causing a large number of servers to time out and fail. Basically, any application running only a single instance went down when its server went down. But few apps running multiple instances were affected, prompting Microsoft to recommend that application owners deploy apps with multiple instances of each role.
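
To see why the multiple-instance advice matters, here is a minimal Python sketch of the idea, not Microsoft's actual deployment logic. The app names, server names and the `app_is_up` helper are all hypothetical and chosen only for illustration. It assumes an app stays reachable as long as at least one of its instances sits on a server that has not failed.

```python
def app_is_up(instance_hosts, failed_hosts):
    """Return True if at least one instance runs on a server that has not failed."""
    return any(host not in failed_hosts for host in instance_hosts)


# Hypothetical placement of app instances on servers (illustrative names only).
apps = {
    "single-instance-app": ["server-a"],
    "multi-instance-app": ["server-a", "server-b"],
}

# Suppose one server times out and fails during an upgrade.
failed = {"server-a"}

status = {name: app_is_up(hosts, failed) for name, hosts in apps.items()}

for name, up in status.items():
    print(f"{name}: {'up' if up else 'down'}")
```

In this toy scenario the single-instance app goes down with its server while the two-instance app keeps running. Extra instances only help when they land on different servers, and an app still fails if every server hosting it goes down at once.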

Rackspace's cloud was knocked down on June 29, 2009 and again on July 7, 2009 due to power interruptions at its Dallas-Fort Worth data center. The hosting and cloud provider posted frequent updates to its blog. Overall, the outages lasted for several hours and prompted Rackspace to revisit its power interruption procedures. The outages, Rackspace said at the time, were "the result of a range of power infrastructure issues."



Several Rackspace-hosted Web sites also suffered an outage in December 2009 when the same data center suffered network-related problems.

Amazon Simple Storage Service (S3) was taken down for two to three hours after the site received too many authentication requests in February 2008. In a posting explaining the outage, Amazon wrote that the elevated levels of authentication requests started from one location around 3:30 a.m. Pacific Time. Then, around 4 a.m., several other users increased their volume of authentication calls, pushing the service over maximum capacity. The issue was resolved around 7 a.m.



In July 2008, Amazon S3 was hit by another wave of outages, one lasting as long as eight hours.





In both instances, several services using Amazon S3, such as Twitter, were taken down in the outages.

On Sept. 24, 2009, Google's widely popular Gmail e-mail platform conked out for roughly two hours, blocking users from e-mails and contact lists and mucking up Google's Chat and Auto Complete tools. Gmail users logging into the cloud-based e-mail service were met with a message explaining that Gmail "is temporarily unable to access your contacts. You may experience issues while this persists."



Google called the outage a "big deal," but said it only affected a "small subset of users."



The Sept. 24 outage was the second that month and one of many outages suffered by Google's cloud-based offerings. Google has chalked up the outages to various causes including routing errors and server maintenance.

Probably the most public of cloud outages was the now-infamous T-Mobile Sidekick outage of October 2009. A number of T-Mobile customers and Sidekick users lost personal data, such as contacts, calendar items and other data hosted in the cloud. The data was lost due to a failure in the Microsoft Danger servers that provide the Sidekick service. To make matters worse, Microsoft soon after said that most of the data had been restored and also started offering Sidekick data recovery tools. Meanwhile, T-Mobile said a number of affected Sidekick users were without their cloud-hosted data. To make up for the folly, T-Mobile offered impacted Sidekick users a $100 T-Mobile gift card and a month of free data service.