VMware Joins Cloud Outage Party With Cloud Foundry Blackout

VMware's Cloud Foundry development platform was rocked by two different blackouts last week, a pair of interruptions likely overshadowed by the fallout from Amazon's cloud outage the week prior.

While still in beta, VMware's open source Cloud Foundry service suffered two bouts of downtime last Monday and Tuesday (April 25 and April 26), the company said in an analysis of the cloud outages. The platform-as-a-service (PaaS), announced by VMware last month, gives developers a platform to build and host Web applications.

In the first incident, which was detected at 5:45 a.m. April 25, VMware blamed the downtime on a power outage, saying a storage cabinet power supply went down, knocking Cloud Foundry out of commission for nearly 10 hours. During the outage, VMware said, applications stayed online, but developers couldn't log in to the Cloud Foundry PaaS or build new applications.

"Existing applications were not impacted by this event and continued to operate normally," Dekel Tankel of VMware's Cloud Foundry team wrote in the post-outage analysis. "The folks most impacted by this event were the developers who received their access credentials the night before. They could not log in until 3:30 p.m. when the system health and storage connectivity was fully restored to 100 percent availability."

Tankel wrote that while the power outage is not considered a normal event, "it is something that can and will happen from time to time." Typically, the system would be redundant at both the hardware and software layers to soften the blow, but "in this case, our software, our monitoring systems, and our operational practices were not in sync," he wrote.

The second spate of downtime, at 10:15 a.m. April 26, occurred while VMware was developing an early detection plan to prevent incidents like the previous day's Cloud Foundry outage.

"One of the action items from the previous day's partial outage was to develop a full operational playbook for early detection, prevention, and restoration should our systems fail to properly handle any sort of intermittent loss of connectivity to storage," Tankel wrote of the April 26 Cloud Foundry fumble. "At 8 a.m. this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed."

Tankel continued: "Unfortunately, at 10:15 a.m. PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

During the second Cloud Foundry outage, applications and system components continued to run, Tankel wrote, but because the front-end network was down, Cloud Foundry staffers were the only people who knew the system was up. The front-end infrastructure was back to full operation by 11:30 a.m.

The VMware Cloud Foundry double whammy came just days after cloud pioneer Amazon Web Services struggled with a massive cloud outage that knocked a host of customers' Web sites offline. Amazon's cloud outage, which started Thursday, April 21 and lasted several days for some customers, stemmed from an Elastic Block Store (EBS) problem triggered by a network configuration change in its Northern Virginia Availability Zone. The change caused EBS volumes to become "stuck" in a "re-mirroring storm," which made the volumes unavailable and created latency and outages.

Amazon remained tight-lipped throughout and after the outage, prompting many industry watchers and cloud solution providers to call the Amazon cloud outage a "cautionary tale" showing that cloud computing environments are prone to the same types of interruptions as their on-premises counterparts. Amazon's cloud outage brought to light several lessons about cloud computing.

Last Friday, more than a week after the initial downtime, Amazon broke its silence on the cloud outage, issued an apology and an explanation of what happened, and offered users a 10-day credit. Amazon also said it is putting mechanisms and procedures into place to guarantee failover and ensure a similar outage does not topple its cloud infrastructure in the future.

VMware now joins the ranks of cloud providers that have suffered notable outages in the past year. Unlike Amazon, however, VMware apologized and accepted fault for its Cloud Foundry outages shortly after operations were restored.

"We take full responsibility for these issues and apologize to our users who were impacted by them," Tankel wrote. "We can and will do better, having already learned from these incidents. We greatly appreciate your patience as we improve our service and the underlying technology, while building capacity to deal with the extraordinary level of demand that we are experiencing."