Public Cloud Pitfalls: Microsoft Azure Storage Cluster Loses Power, Puts Spotlight On Private, Hybrid Cloud Advantages

Solution providers said "storage availability" issues that affected Microsoft's Azure public cloud for eight hours and 10 minutes are once again spotlighting the benefits of private and hybrid cloud.

Microsoft said its engineering team identified a "storage cluster that lost power and became unavailable" as the preliminary cause of the storage issues, which affected the Eastern United States.

The Azure public cloud issues, which affected storage availability from 5:50 p.m. EDT on March 15 to 2:00 a.m. on March 16, were first reported by VentureBeat.
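The length of the outage can be checked against the UTC timestamps Microsoft published for the incident, 21:50 UTC on March 15 to 06:00 UTC on March 16. The short Python sketch below is illustrative only; it assumes U.S. Eastern time was on daylight time (UTC-4) on those dates, which is why the fixed offset is hard-coded rather than looked up.

```python
from datetime import datetime, timedelta, timezone

# Outage window as published on the Azure status page (UTC).
start_utc = datetime(2017, 3, 15, 21, 50, tzinfo=timezone.utc)
end_utc = datetime(2017, 3, 16, 6, 0, tzinfo=timezone.utc)

# Assumption: Eastern time was on daylight time (UTC-4) on these dates.
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4), "EDT")

# Total length of the window, plus both endpoints in local Eastern time.
duration = end_utc - start_utc
start_local = start_utc.astimezone(EASTERN_DAYLIGHT)
end_local = end_utc.astimezone(EASTERN_DAYLIGHT)

print(duration)                                   # 8:10:00
print(start_local.strftime("%Y-%m-%d %H:%M %Z"))  # 2017-03-15 17:50 EDT
print(end_local.strftime("%Y-%m-%d %H:%M %Z"))    # 2017-03-16 02:00 EDT
```

The result matches the eight hours and 10 minutes cited by solution providers.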

"Starting at 21:50 UTC (5:50 p.m. EDT) on 15 Mar 2017 to 06:00 (2:00 a.m. EDT) on 16 Mar 2017, due to an incident in East US affecting Storage, customers and service dependent on Storage may have experienced difficulties provisioning new resources or accessing their existing resources in the region," said Microsoft in a post titled "Storage Availability In East US" on an Azure status page update. "Engineering confirmed that Azure services that experienced downstream impact included Virtual Machines, Azure Media Services, Application Insights, Azure Logic Apps, Azure Data Factory, Azure Site Recovery, Azure Cache, Azure Search, Azure Service Bus, Azure Event Hubs, Azure SQL Database, API Management and Azure Stream Analytics."

Jamie Shepard, senior vice president for health care and strategy at Lumenate, No. 152 on the 2016 CRN SP500, said he was not surprised by the Azure storage issues, which once again point to the need for customers to adopt a hybrid computing model.

"You have to have a hybrid model right now until the cloud becomes fully fault tolerant," he said. "These are data centers. At the end of the day whether it is Microsoft or Google, they are still running data centers. They are bound to the limits of technology even as they are taking advantage of technology advancements. It's really hard to manage that."

Shepard said the Azure failure is unacceptable for customers using a cloud to provision and manage infrastructure. "Most customers are leveraging cloud platforms to help in bursting workloads from on premise to the cloud and that requires the ability to have a dynamic and available infrastructure so that they can quickly spin up resources during times when on premise workloads are not available," he said. "For someone like Azure to not be able to provide basic, core cloud services such as provisioning new resources and accessing existing resources is like a data center just going down."

Besides the storage availability issue in the East US, Microsoft also listed on the Azure update page a storage provisioning issue that affected multiple services for one hour and 18 minutes. In that case, Microsoft said engineers identified a "software error" as the cause of the storage management issues. "Engineers have observed around 50 percent of success rate during the impacted window," said Microsoft. "Most customers would have succeeded upon retries."
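Microsoft's remark about retries can be put in rough numbers. The Python sketch below assumes, purely for illustration, that each attempt succeeded independently at the roughly 50 percent rate engineers observed; Microsoft did not describe the failures that way, and the function name `success_within` is hypothetical. Under that assumption, two attempts would succeed 75 percent of the time and three attempts 87.5 percent.

```python
def success_within(attempts: int, per_attempt_success: float = 0.5) -> float:
    """Probability that at least one of `attempts` independent tries succeeds.

    Assumption: every attempt succeeds with the same probability and the
    attempts do not influence one another.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # Chance that every attempt fails, subtracted from 1.
    return 1.0 - (1.0 - per_attempt_success) ** attempts


# Print the cumulative chance of success after 1, 2, 3 and 5 attempts.
for n in (1, 2, 3, 5):
    print(f"{n} attempt(s): {success_within(n):.1%}")
```

If failures during the window were correlated, for instance if the same requests failed repeatedly, the real odds would have been lower than this estimate.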

In a post on the storage provisioning issue, Microsoft said it lasted from 6:42 p.m. EST to 8 p.m. EST. "This incident is limited to service management operations, and existing Storage resources were not impacted. Virtual Machines or Cloud Services customers may have experienced failures when attempting to provision resources," said Microsoft in an Azure status page update. "Storage customers would have been unable to provision new Storage resources or perform service management operations on existing resources. Azure Search customers may have been unable to create, scale, or delete services. Azure Monitor customers may have been unable to turn on diagnostic settings for resources. Azure Site Recovery customers may have experienced replication failures. API Management service activation in South India may have experienced a failure. Azure Batch customers will have been unable to provision new resources. During this time all existing Azure Batch pools would have scheduled tasks as normal. EventHub customers using a service called 'Archive' may have experienced failures. Customers using Visual Studio Team Services Build will have experienced failures. Azure Portal may have been unable to access storage account management operations and would have been unable to deploy new accounts."

CRN reached out to Microsoft for comment on the Azure storage issues but had not heard back at press time.

Raymond Tuchman, CEO of Experis Technology Group, a fast-growing Potomac, Md.-based Hewlett Packard Enterprise private cloud powerhouse – and a company that has its own 80,000 square-foot cloud services data center – said he sees more customers open to private cloud-hybrid cloud because of recent public cloud issues like the Azure storage issues and the Amazon Web Services (AWS) outage 17 days ago.

"No system is infallible whether it is public cloud or private cloud," said Tuchman. "The big problem is if a public cloud has a problem it can affect hundreds of thousands or millions of people. If you have an issue on a private cloud it only affects you. It really comes down to whether you want to have control of your own environment with your own rules and regulations or do you want to operate with someone else's rules and regulations."

Tuchman said the Azure storage cluster losing power is in sharp contrast to the recent AWS outage, which lasted just short of four hours. "[What happened with] Azure is a problem that any system could have experienced," he said. "AWS looked to me like a quality assurance issue with someone typing in something wrong affecting production systems."

AWS, in fact, apologized for its outage, which was sparked by an AWS team member entering a bad command during the debugging of an S3 billing system, and disclosed "several changes," including adding safeguards that would prevent an "incorrect input from triggering a similar event in the future."

Tuchman said some customers, in the drive to push the technology envelope and be more competitive, are developing software so fast that they have lost sight of basic quality assurance procedures. "You have to balance just how fast you need to go to remain competitive with quality assurance to make sure everything works properly," he said. "If somebody does something in a public cloud and there is not quality assurance that everything is going to work 100 percent it affects hundreds of thousands or millions of people."