Google Apologizes, Offers Credits To Customers After 18-Minute Cloud Infrastructure Service Outage


Google on Wednesday apologized to customers for an outage that took down its Compute Engine cloud Infrastructure-as-a-Service offering earlier this week, and it provided a detailed description of what caused it.
 
The outage took place Monday evening Pacific time and affected Google Compute Engine instances and VPN service in all regions, Benjamin Treynor Sloss, the Google vice president of engineering in charge of keeping the vendor's infrastructure up and running, said in a post to the Google Cloud Status website.
 
Although the outage lasted only 18 minutes, Sloss said in the post that Google is taking it very seriously. He said Google is offering service credits equal to 10 percent of customers' monthly Google Compute Engine charges and 25 percent of their monthly VPN charges.
 
"We recognize the severity of this outage, and we apologize to all of our customers for allowing it to occur," Sloss said in the post. "As of this writing, the root cause of the outage is fully understood and [Google Compute Engine] is not at risk of a recurrence." 
 
While every cloud vendor has dealt with outages, this one is notable because Google has a reputation for designing some of the most massively scalable and resilient systems on the planet.
 
Simon Margolis, cloud platform lead at SADA Systems, a Los Angeles-based Google partner, said the vendor is also well known for doing exhaustive analyses of outages. 
 
"Google advocates the need to quickly triage, diagnose, resolve and, most importantly, report on site reliability issues," said Margolis. "This is primarily for the sake of bettering the site reliability engineering community as a whole by learning from each other's mistakes.
 
"This has been a longtime philosophy at Google, as their postmortems always include great detail as to the cause, resolution and prevention of a given issue without pointing fingers or assigning blame," Margolis said.
 
Sloss said in the post that the outage was caused by two separate, previously unknown bugs in Google's network configuration management software.
 
Problems began when Google engineers removed an unused IP block (a group of Internet addresses used for Compute Engine virtual machines and other services) from the network configuration, which is usually a routine task, according to Sloss.
 