VMware Joins Cloud Outage Party With Cloud Foundry Blackout


VMware's Cloud Foundry development platform was rocked by two different blackouts last week, a pair of interruptions likely overshadowed by the fallout from Amazon's cloud outage the week prior.

While still in beta, VMware's open source Cloud Foundry service suffered two bouts of downtime last Monday and Tuesday (April 25 and April 26), the company said in an analysis of the cloud outages. The platform-as-a-service (PaaS), announced by VMware last month, gives developers a platform to build and host Web applications.

In the first incident, which was detected at 5:45 a.m. April 25, VMware blamed the downtime on a power outage, saying a storage cabinet power supply went down, knocking Cloud Foundry out of commission for nearly 10 hours. During the outage, VMware said, applications were online, but developers couldn't log into the Cloud Foundry PaaS or build new applications.

"Existing applications were not impacted by this event and continued to operate normally," Dekel Tankel of VMware's Cloud Foundry team wrote in the post-outage analysis. "The folks most impacted by this event were the developers who received their access credentials the night before. They could not log in until 3:30 p.m. when the system health and storage connectivity was fully restored to 100 percent availability."

Tankel wrote that while the power outage is not considered a normal event, "it is something that can and will happen from time to time." Typically, the system would be redundant at both the hardware and software layers to soften the blow, but "in this case, our software, our monitoring systems, and our operational practices were not in sync," he wrote.

The second spate of downtime, which began at 10:15 a.m. April 26, occurred while VMware was developing an early detection plan to prevent incidents like the previous day's Cloud Foundry outage.

"One of the action items from the previous day's partial outage was to develop a full operational playbook for early detection, prevention, and restoration should our systems fail to properly handle any sort of intermittent loss of connectivity to storage," Tankel wrote of the April 26 Cloud Foundry fumble. "At 8 a.m. this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed."

Tankel continued: "Unfortunately, at 10:15 a.m. PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

During the second Cloud Foundry outage, applications and system components continued to run, Tankel wrote, but because the front-end network was down, Cloud Foundry staffers were the only people who knew the system was up. The front-end infrastructure was back to full operation by 11:30 a.m.
