United Airlines, NYSE Outages Reveal Poor Redundancy Architecture, Insufficient Testing
A string of high-profile outages Wednesday morning that grounded United Airlines flights and halted trading at the New York Stock Exchange made clear that major institutions are failing to properly implement high-availability systems, according to one solution provider with expertise in mission-critical infrastructure.
Matt Gerber, CEO of Digital Fortress, a cloud services and data center operator based in Seattle, told CRN that for those failures to have occurred, the systems must have had shortcomings in built-in redundancy, fault tolerance and instantaneous failover capabilities, and they likely were not properly tested.
"You have to cycle through different scenarios when testing," Gerber, who once ran a leading disaster recovery business serving the financial services market, told CRN. "From what you saw today, different things fail at different places and different times."
Authorities dismissed fears of a coordinated cyberattack on United Airlines, the NYSE and The Wall Street Journal -- which had problems with its website -- leaving more mundane explanations for their IT failures.
United blamed faulty network connectivity caused by a router malfunction for grounding its global fleet for almost two hours, starting around 8:30 a.m. Eastern time.
Soon after, at around 11:30 a.m. Eastern, an "internal technical issue" forced trading to cease at the world's most important stock exchange. The exchange resumed operations at 3:10 p.m.
A NYSE spokesperson later said, "The root cause was determined to be a configuration issue."
To cap off the trifecta, the homepage of The Wall Street Journal started displaying a 504 Gateway Timeout error just before noon, meaning a gateway or proxy server was not getting a timely response from the server behind it. Some insiders speculated that massive traffic caused by the NYSE outage crashed the newspaper's website.
While the outages renewed national attention on cyberattacks and on the vulnerabilities of major transactional computing systems, "the biggest lesson learned here is that you can't plan for every disaster, and as a result of that, you have to be very rigorous and fluid in how you do disaster recovery planning and testing," Gerber said.
IT administrators must implement progressive and dynamic testing regimens, evaluating different parts of their systems in different ways and at different times, said Gerber, who has no first-hand knowledge of the root problems at the stock exchange and the airline, since their underlying infrastructure and operations are not made public.
For instance, United Airlines passengers might have been spared much grief if the carrier's IT staff had pulled down various routers during testing to see what happened.
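As a rough illustration of that kind of test, the sketch below takes each router offline in turn in a simulated network and reports which ones cut a front-end system off from its back end. The topology and every name in it (`checkin`, `router_a`, `router_b`, `core`, `reservations`) are hypothetical; nothing here reflects United's actual network, which is not public.

```python
from collections import deque

def reachable(links, down, src, dst):
    """Return True if dst can be reached from src once the nodes in `down` are removed."""
    if src in down or dst in down:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in links.get(node, ()):
            if nxt not in down and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(links, routers, src, dst):
    """Pull each router down in turn; list those whose loss cuts src off from dst."""
    return [r for r in routers if not reachable(links, {r}, src, dst)]

# Hypothetical topology (illustrative names only): two edge routers, one core router.
LINKS = {
    "checkin": ["router_a", "router_b"],
    "router_a": ["core"],
    "router_b": ["core"],
    "core": ["reservations"],
}
ROUTERS = ["router_a", "router_b", "core"]

if __name__ == "__main__":
    print(single_points_of_failure(LINKS, ROUTERS, "checkin", "reservations"))
```

In this made-up layout the two edge routers back each other up, but the lone core router does not have a twin, so the test flags it. A real exercise would run against live or staged equipment rather than a model, and would vary which components fail and when.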
"You do have the capability to put systems in that are truly high-availability and have multiple layers of redundancy, but there's a cost tradeoff with that," Gerber said.
That's why financial institutions typically do a business impact analysis of any potential failures.
"They need to look at each and every aspect of the system and assess impact to customers of that piece of the system going down and then assign a criticality to it. And then based on that criticality, that's how you spend your money," Gerber said.
In the banking world, there are five levels of criticality, each assigned a maximum allowable downtime.
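A tiering scheme of that sort can be pictured as a simple lookup table. The sketch below is only an illustration: the article gives no downtime figures for the five levels, so the minute values are invented placeholders, and the top tier is assumed to allow zero downtime. The sample outage uses the roughly 11:30 a.m. to 3:10 p.m. window reported for the NYSE.

```python
from datetime import datetime

# Hypothetical tier table. The five levels come from the article; the minute
# values are placeholders, and tier 1 = 0 is an assumption about the top tier.
MAX_DOWNTIME_MINUTES = {1: 0, 2: 15, 3: 60, 4: 240, 5: 1440}

def outage_minutes(start, end, fmt="%H:%M"):
    """Length of a same-day outage in whole minutes, from 24-hour clock strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

def exceeds_allowance(tier, minutes):
    """True if an outage of `minutes` is longer than the tier's allowed downtime."""
    return minutes > MAX_DOWNTIME_MINUTES[tier]

if __name__ == "__main__":
    nyse = outage_minutes("11:30", "15:10")  # approximate times from the reports
    print(nyse, exceeds_allowance(1, nyse))
```

Under these assumed numbers, the same outage that would be tolerable for a low-criticality back-office system is a breach for anything placed in the top tier.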
A real-time stock exchange would be in the highest tier, with no acceptable downtime. That means every component should be fully redundant, with instantaneous failover protocols.
"There was clearly a flaw in the schema if that were the case. Clearly something broke," Gerber said.
The FBI and Homeland Security both said they have ruled out an attack and concluded the outages were unrelated.
But for those conspiratorial types looking for evidence of a sinister plot, Tuesday's tweet from the hacker collective Anonymous certainly provided some fodder:
Wonder if tomorrow is going to be bad for Wall Street.... we can only hope.
PUBLISHED JULY 8, 2015