Google Apologizes, Offers Credits To Customers After 18-Minute Cloud Infrastructure Service Outage
Google apologized to customers Wednesday for an outage that took down its Compute Engine cloud Infrastructure-as-a-Service earlier this week, while also providing a detailed description of what caused it.
The outage took place Monday evening Pacific time and affected Google Compute Engine instances and VPN service in all regions, Benjamin Treynor Sloss, the Google vice president of engineering in charge of keeping the vendor's infrastructure up and running, said in a blog post.
While the outage lasted only 18 minutes, Sloss said in the post that Google is taking it very seriously. He said Google is offering service credits equal to 10 percent of customers' monthly Google Compute Engine charges and 25 percent of their monthly VPN charges.
"We recognize the severity of this outage, and we apologize to all of our customers for allowing it to occur," Sloss said in the post. "As of this writing, the root cause of the outage is fully understood and [Google Compute Engine] is not at risk of a recurrence."
While all vendors have dealt with cloud outages, this one is notable because Google has a reputation for designing some of the most massively scalable and resilient systems on the planet.
Simon Margolis, cloud platform lead at SADA Systems, a Los Angeles-based Google partner, said the vendor is also well known for doing exhaustive analyses of outages.
"Google advocates the need to quickly triage, diagnose, resolve and, most importantly, report site reliability issues," said Margolis. "This is primarily for the sake of bettering the site reliability engineering community as a whole by learning from each other's mistakes.
"This has been a longtime philosophy at Google, as their postmortems always include great detail as to the cause, resolution and prevention of a given issue without pointing fingers or assigning blame," Margolis said.
Sloss said in the post that the outage was caused by two separate, previously unknown software bugs in Google's network configuration management software.
Problems began when Google engineers removed an unused IP block -- or group of Internet addresses used for Compute Engine virtual machines and other services -- from its network configuration, which is usually a routine task, according to Sloss.
But in this case, the management software that automatically propagates network configuration changes across Google's network flagged an inconsistency in the new configuration. And when the management software tried to roll back to the existing configuration, the first bug prevented it from doing so, Sloss explained.
Although Google had a safeguard in place to prevent the spread of faulty configurations, a second bug in the management software caused that to fail, too, according to Sloss.
"With no sites left announcing GCE IP blocks, inbound traffic from the Internet to GCE dropped quickly," which led to the outage, said Sloss in the post. Google engineers were able to stabilize service by manually reverting to the last working configuration, he said.
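The failure chain Sloss describes can be sketched in miniature. The code below is a purely illustrative simulation, not Google's actual tooling: the function names, config shape, and site model are all hypothetical. It shows how a configuration missing its IP blocks spreads to every site once both the rollback (bug 1) and the canary-style safeguard (bug 2) fail, and how manually pushing the last known-good configuration restores service.

```python
def validate(config):
    # Hypothetical consistency check: a usable configuration must
    # announce at least one GCE IP block.
    return bool(config.get("gce_ip_blocks"))

def propagate(sites, new_config, last_good_config):
    """Push new_config to every site, modeling both hypothetical bugs.

    Returns True if at least one site is still announcing GCE IP blocks
    afterward (i.e., inbound traffic can still reach GCE).
    """
    candidate = new_config
    if not validate(candidate):
        # Intended behavior: roll back to last_good_config.
        # Bug 1 (simulated): the rollback fails, so the inconsistent
        # configuration is kept as the one to propagate.
        pass  # candidate = last_good_config  <- never executed

    # Intended behavior: a safeguard rejects a faulty configuration
    # before it spreads. Bug 2 (simulated): the check passes regardless.
    safeguard_ok = True  # should have been validate(candidate)
    if safeguard_ok:
        for site in sites:
            site["announced_blocks"] = list(candidate.get("gce_ip_blocks", []))

    # Outage condition: no site left announcing GCE IP blocks.
    return any(site["announced_blocks"] for site in sites)

# Three sites, all initially announcing a (made-up) GCE IP block.
sites = [{"announced_blocks": ["10.0.0.0/8"]} for _ in range(3)]
last_good = {"gce_ip_blocks": ["10.0.0.0/8"]}
bad = {"gce_ip_blocks": []}  # the removal left no blocks to announce

reachable = propagate(sites, bad, last_good)       # False: traffic drops
restored = propagate(sites, last_good, last_good)  # True: manual revert fixes it
```

With both simulated bugs in place, the bad push empties every site's announcements and `reachable` comes back `False`; only the manual push of the last working configuration, mirroring what Google's engineers did, brings the sites back.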
Google is already planning "14 distinct engineering changes" aimed at preventing, detecting and mitigating similar problems with Google Compute Engine, and Sloss said in the post that the number is expected to rise as the team consults with engineers elsewhere at Google.
The outage comes as Google is making a concerted push to catch up with Amazon Web Services and Microsoft Azure in the public cloud market. Google has a long way to go, however, as it has 4 percent of the market compared with 31 percent for AWS and 9 percent for Microsoft, according to figures from Synergy Research.
"It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services, and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform," said Sloss in the post.