The 10 Biggest Cloud Outages of 2013 (So Far)
10:00 AM EST Wed. Jul. 17, 2013
Nobody is perfect. And the same applies to companies. But as customers become increasingly dependent upon the cloud for applications and access to their data, perfection is exactly what those customers demand. Here are a few examples where the wheels of perfection fell off the cart, often leaving users with little else to do but express their rage on the Internet.
Online retail therapy in the post-holiday season was briefly impacted by an Amazon.com outage that lasted approximately one hour on Jan. 31. The primary effects seemed to be isolated to the website's home page, leading to extensive speculation that the site may have been subjected to a distributed denial-of-service attack, but such speculation was not confirmed. Nonetheless, the outage demonstrated the extremely high value of uptime to services such as Amazon. Analysts calculated that one hour of interrupted service may have translated to $5 million in lost revenue.
Apple's iCloud suffered an April 23 outage that impacted a variety of services, including sign-on, email, GameCenter and iTunes. Most business-critical applications did not seem to be affected. Users attempting to access the impacted services mostly experienced failures in authentication. Most services were restored within several hours, but Apple users had already reported a number of small-scale glitches earlier in April.
And when a cloud goes down, it's a sure bet channel partners are getting calls. "When things like that happen, we typically get a lot of calls," Jim McCool, director of operations at Chantilly, Va.-based CWPS, a systems integrator and custom cloud services channel partner, told CRN in April about the outage. "In fact, they pretty much call us for everything. But a lot of times, we are the ones who end up calling the customers because we do a lot of proactive monitoring of the systems. So sometimes we are aware of the outage before they are."
CenturyLink, a Monroe, La.-based multinational communications company, suffered a widespread outage that affected CenturyLink customers in at least 20 states on May 7. An issue with a core router was determined to be the source of the problem, although specific details were not revealed. The outage caused the company's switchboard to be flooded with calls, thereby making it more difficult for customers to get information.
Outages such as this one often generate phone calls to channel partners from customers who are looking for workarounds, as well as any specific information as to when service will be restored. "We did receive a few calls about this issue," Garth Brown, president of Seattle-based solution provider Semaphore Corporation, told CRN after the May 7 outage. "I'm glad that our people really did not get hit too badly, but apparently the situation was a lot worse in other places."
While Dropbox can be a highly useful tool for accessing documents from any device, uptime is the key to delivering on that value proposition. On Jan. 10, Dropbox became the first major company in 2013 to suffer a substantial loss of service. The service interruption, which lasted in excess of 15 hours, was caused by a synchronization issue between the client software and the servers. During the outage, the company's estimates for restoration were substantially understated, which tended to increase the angst expressed on the Internet by frustrated users who were unable to access their documents.
To keep its users up to date on the situation, Dropbox turned to Twitter, regularly sending tweets. "Creating/joining shared folders, and creating shareable links to files, also affected. We appreciate your patience as we resolve this issue," said Dropbox's operations team in a Jan. 10 tweet.
As Google encourages users to become more dependent on Google Drive, Google Docs and Gmail, service interruptions can have an even more profound effect on the Mountain View, Calif.-based company's users. Such was the case on April 17 when a relatively short-lived glitch demonstrated the challenges of high-percentage uptime for all three of Google's services. According to Google, the outage occurred in its Gmail cloud email service, which affected the other two services. A login configuration flaw that caused server overloads was believed to be at least part of the problem. Of its 425 million users, "less than 0.0007" were affected, the company said. After Google reported the problem, services were back up and running within about an hour. But problems continued in the following days, leading to part two of Google's service outage woes ...
... Or should we call it parts two, three and four? In March of this year, the Google Drive storage service suffered three outages in a single week. Primary issues, which began on March 18, involved isolated software glitches that subsequently resulted in larger problems. As much as one-third of the customer base was impacted, leading to a virtual hue-and-cry across the Internet. On March 19, there was a two-hour outage, followed by an even lengthier loss of service on March 20. Google was pretty tight-lipped about the sources of the three outages, but users say the service has been relatively stable in recent weeks.
Microsoft's online service reputation took a hit on March 14 when Hotmail and Outlook.com both suffered a loss of service that lasted nearly 16 hours. Around the same time, issues involving the stability of documents stored in Microsoft SkyDrive were also discovered, but those problems were more quickly rectified. It was later reported that the mail problem stemmed from a firmware update that caused the company's servers to overheat.
"This is an update that had been done successfully previously, but failed in this specific instance in an unexpected way," wrote Arthur de Haan, vice president of test and service engineering in Microsoft's Windows Services unit, in a blog post. "This failure resulted in a rapid and substantial temperature spike in the datacenter. This spike was significant enough ... that it caused our safeguards to come in to place for a large number of servers in this part of the datacenter."
Service was restored incrementally between March 14 and 15, with most mailboxes running again before midnight.
An upgrade intended to enhance cloud stability and performance turned out to be the temporary undoing of that stability when the SCORM cloud crashed for a period of about three hours on March 14. SCORM, which is part of Rustici Software, is a set of technical standards to promote interoperability for e-learning software products. An error with the update caused a cascading effect that ultimately impacted multiple availability zones across the company's Amazon services. "We made changes to how SCORM Cloud handles caching in order to increase system stability and performance," wrote Joe Donnelly, customer support manager at Rustici Software, in a SCORM support forum. "Due to a mistake in the rollout of this change, we experienced import failures on one of our Amazon servers. This caused a series of cascading failures due to excessive CPU load and revealed instability in our use of multiple availability zones across Amazon's web service."
Major Australian telecommunications service provider Telstra reportedly suffered a large-scale, day-long outage to its high-end cloud computing platform in late March. A spokesperson confirmed the outage the following week in a statement to Aussie media. "Last week we had an intermittent service outage on our cloud platform that affected a small number, around 20, of our business customers," the spokesperson said, according to Australian technology news site Delimiter. The source of the problem appeared to be a storage layer failure within the company's Melbourne data center that knocked a number of key customers off service for an extended period of time. "The issue started on Monday 25 March when we identified a failure in the data storage equipment that supported the customers that we affected," the spokesperson said. "When the failure was identified we immediately engaged our storage partner and started restoring services." The company is reported to be in the midst of an $800 million expansion project to support its cloud infrastructure and related marketing activities.
The Microsoft Azure Cloud suffered a worldwide service interruption on Feb. 22 that impacted secure traffic for almost a full day. People on the Internet reported either complete loss of service or extremely slow service across much of the Azure portfolio during the time period. The Azure storage service was believed to be most significantly affected. An expired SSL certificate was determined to be the source of the problem. Non-secured HTTP connectivity remained available. According to Kaspersky's Threatpost blog, Redmond posted a Feb. 23 message on its Windows Azure Service Dashboard announcing the outage. "Storage is currently experiencing a worldwide outage impacting HTTPS operations (SSL traffic) due to an expired certificate," the post read. In addition to a mea culpa, the company said in a Feb. 24 post to its Windows Azure blog that it would be issuing credits to affected customers. "Given the scope of the outage," wrote Steven Martin, general manager of Windows Azure business and operations, "we will proactively provide credits to impacted customers in accordance with our SLA."
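For readers who manage their own HTTPS endpoints, an expired certificate is one of the few outage causes that can be watched for in advance. The following is a minimal Python sketch of such a check, not a description of how Microsoft monitors Azure. The function names, the host in the usage note and the sample expiry date are all hypothetical and chosen only for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return days remaining before a certificate expires.

    `not_after` uses the format Python's ssl module reports in
    getpeercert()["notAfter"], e.g. "Feb 22 00:00:00 2013 GMT".
    A negative result means the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate a server presents.

    Hypothetical helper: with the default context, the handshake itself
    raises ssl.SSLCertVerificationError if the certificate is already
    expired, so this is only useful as an early warning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Illustrative, made-up expiry date (not the actual Azure certificate's).
sample_now = datetime(2013, 2, 21, tzinfo=timezone.utc)
remaining = days_until_expiry("Feb 22 00:00:00 2013 GMT", sample_now)
```

In practice, a scheduled job could call `days_until_expiry(fetch_not_after("example.com"), datetime.now(timezone.utc))` and raise an alert when the result drops below a chosen threshold, such as 30 days.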