Did Google's Energy-Efficient Battery Backups Put European Cloud Data At Risk?

Giving Google tips on operating data centers is a little like giving LeBron James layup lessons.

But even the greats sometimes make errors, especially when exceptionally rare occurrences throw off their game. Those errors might not cost their team the championship, but they're certainly worthy of review.

A week ago, an electrical storm in Belgium shot four lightning bolts into the power grid feeding Google's ultra-energy-efficient data center near the town of St. Ghislain, causing data loss on a tiny fraction of the persistent disks serving Google Compute Engine instances, the Internet giant's IaaS cloud.

[Related: 5 Reasons Why Google Data Centers Are Running Like Giant, Efficient Computers]

While Google said just about every stray bit of data was eventually recovered and restored, data centers are supposed to be designed to insulate the customer data flowing across their servers from exactly the kind of high-voltage surges lightning causes.

So what exactly went wrong in Belgium?

Understanding the specific nature of the failure is tricky because, as with so many things Google, its data center operations are veiled in secrecy.

But Brett Haines, operations director for Atlantic.net -- a Florida-based former VAR turned cloud operator that recently opened its first European data center in London -- shared with CRN his assessment of possible failure points and how cloud customers can mitigate their risk from similar scenarios.

"It seems like their backup systems didn't catch the load of the data center. They just lost power and their battery couldn't sustain it," Haines said of the incident in Belgium, based on his reading of Google's root cause analysis.

Data centers have multiple layers of defense against electrical surges, he explained.

Among them is a Transient Voltage Surge Suppressor (TVSS) -- a surge protector.

That device should isolate computing equipment from an electrical spike coming off the utility, sacrificing itself to protect the far-more-expensive downstream gear. But the TVSS is a one-and-done device.

An early lightning bolt could have taken out the TVSS at the perimeter of the utility feed line. Subsequent high-voltage surges from lightning most likely damaged conductors and electrical distribution equipment, cutting power to the facility.

For such scenarios, data centers typically are equipped with multiple backup generators. But it takes a generator several seconds to come online and sync with the equipment it powers -- an eternity for data floating in RAM or cache that's reading from and writing to persistent disk storage.

As the generator fires up, a UPS acts as an intermediary supply -- a battery system typically sized to maintain power to critical systems for at least 15 minutes, an electrical engineer told CRN.
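As a rough illustration of how that sizing works, consider the back-of-the-envelope calculation below. The load, efficiency and discharge figures are illustrative assumptions for this sketch, not numbers from Google's facility:

```python
# Back-of-the-envelope UPS sizing -- all figures are illustrative
# assumptions, not numbers from Google's facility.

def ups_battery_capacity_kwh(load_kw: float,
                             runtime_minutes: float = 15.0,
                             inverter_efficiency: float = 0.92,
                             usable_depth_of_discharge: float = 0.8) -> float:
    """Estimate the battery capacity (kWh) needed to carry a load for a runtime.

    Energy delivered is load * time; divide by inverter efficiency (DC-to-AC
    conversion losses) and by usable depth of discharge (most battery
    chemistries can't be drained to zero without damage).
    """
    energy_needed_kwh = load_kw * (runtime_minutes / 60.0)
    return energy_needed_kwh / (inverter_efficiency * usable_depth_of_discharge)


# Hypothetical 500 kW critical load held for 15 minutes:
print(f"{ups_battery_capacity_kwh(500.0):.0f} kWh")  # roughly 170 kWh
```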

The battery backup is where the fault in the Belgian outage almost certainly lies, Haines told CRN.

Google made a bit of a stir in 2009 by doing something unorthodox with this technology.

At the Google Data Center Efficiency Summit that year, Google engineers touted their world-beating power-efficiency achievements. They revealed a patent they had filed a year earlier for custom server racks with their own backup batteries -- Google had essentially shifted the UPS directly into the server cabinet, improving overall efficiency by forgoing a larger centralized backup system.

If that architecture was deployed at the facility in Belgium, the nature of the recent failure suggests some of those localized battery backups couldn't sustain their server loads until the emergency generators kicked in, resulting in data loss, Haines told CRN.

Google suggested as much in the analysis on its status page: "Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain."
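Google hasn't described its internal storage stack, but the underlying vulnerability is a familiar one: data an application has "written" can still be sitting in volatile memory -- an application buffer or the operating system's page cache -- until it is explicitly flushed, and a power cut in that window loses it. A minimal Python sketch of the difference:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data so it survives a sudden power loss.

    A plain write() may leave the bytes in the OS page cache (RAM);
    flush() drains Python's userspace buffer, and os.fsync() asks the
    kernel to push the data all the way down to the storage device.
    """
    with open(path, "wb") as f:
        f.write(data)         # may only reach buffers and the page cache
        f.flush()             # empty the application-level buffer
        os.fsync(f.fileno())  # force the kernel to commit to stable storage

durable_write("/tmp/example.dat", b"recently written data")
```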

Google is not elaborating on what could have caused those batteries to become depleted.

"Some of those batteries could have been drained, either because they were older or the prior lightning strikes might have damaged them, rendering them ineffective," Haines said, qualifying his statement with, "Google keeps everything close to the [vest], so it's hard to know how they have data centers set up."

Either way, it shouldn't have happened, he said.

"As long as you set up correctly and have these fail-safes in place, it's not going to be that big of a concern. But, obviously, something went wrong there, and their fail-safes didn't work for them," Haines told CRN.

While engineering proper fail-safe electrical systems falls on the operator, customers are responsible for implementing good data-protection practices, he added.

"If data is important enough to put in a data center and is mission-critical, then it's important enough to be in two data centers," Haines told CRN. "You should think about distributing workloads across several locations."

Google offered a similar sentiment in its incident report: "GCE instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters. Customers who need maximum availability should be prepared to switch their operations to another GCE zone."
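One concrete way a Compute Engine customer could act on that advice is to snapshot persistent disks on a schedule, since snapshots are not tied to a single zone and can be restored into a new disk elsewhere. The sketch below shells out to the gcloud CLI; the disk and zone names are placeholders:

```python
import datetime
import subprocess

# Placeholder names -- substitute your own disk and zone.
DISK = "my-app-disk"
ZONE = "europe-west1-b"

def snapshot_disk(disk: str, zone: str) -> str:
    """Snapshot a persistent disk with the gcloud CLI.

    Snapshots aren't confined to one zone, so they can later be restored
    into a new disk in a different zone if the original zone goes down.
    """
    name = f"{disk}-{datetime.datetime.utcnow():%Y%m%d-%H%M%S}"
    subprocess.run(
        ["gcloud", "compute", "disks", "snapshot", disk,
         "--zone", zone, "--snapshot-names", name],
        check=True,
    )
    return name

print("Created snapshot:", snapshot_disk(DISK, ZONE))
```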

Haines said the best approach is replicating in real time between two sites -- the so-called hot/hot method, where both locations are operational and receiving data. But a hot/cold method, where one location simply backs up the other, is a more affordable approach that also works.
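In application terms, the difference comes down to when writes reach the second site. The toy Python sketch below is illustrative only; real replication systems handle conflicts, ordering and failover that this ignores:

```python
class HotHotReplicator:
    """Toy hot/hot setup: both sites are live and receive every write."""

    def __init__(self, site_a: dict, site_b: dict):
        self.site_a = site_a
        self.site_b = site_b

    def write(self, key: str, value: bytes) -> None:
        # Every write lands in both locations in real time, so losing
        # either site loses no committed data.
        self.site_a[key] = value
        self.site_b[key] = value


class HotColdReplicator:
    """Toy hot/cold setup: one live site, one backup refreshed on a schedule."""

    def __init__(self, primary: dict, backup: dict):
        self.primary = primary
        self.backup = backup

    def write(self, key: str, value: bytes) -> None:
        self.primary[key] = value   # only the primary serves traffic

    def sync_backup(self) -> None:
        # Run periodically (say, from a cron job); anything written since
        # the last sync is lost if the primary site disappears.
        self.backup.update(self.primary)
```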

What's most important is that the user has a disaster-recovery plan, "because anything can happen at any time at any place," Haines told CRN.

Time will ultimately reveal the actual nature of the failure at Google's facility in Belgium, Haines said. Given how much Google spends on infrastructure, it's unlikely the company simply skimped on batteries.

One thing Google got right: "This outage is wholly Google's responsibility," the Internet giant wrote on its status page.

PUBLISHED AUG. 21, 2015