How NYSE Could Have Avoided Embarrassing Outage

The New York Stock Exchange has released a post-mortem of Wednesday's outage, explaining that a software update was the culprit that halted trading for three hours.

The rollout of new software was in preparation for what the exchange said was an industrywide test of the SIP timestamp requirement.

The faulty software was loaded onto a production server and immediately caused communication problems with gateway servers that traders were using to access the system. The gateways had not been configured to be compatible with the updated release, according to an NYSE spokesperson.
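The failure mode described here is a classic version mismatch: the gateways were still configured for the old release while the upgraded trading unit expected the new one. Purely as an illustration of that idea (the NYSE has not published its protocol, so every class, method and version number below is hypothetical), a connect-time compatibility check that rejects mismatched releases might look like this:

```python
# Hypothetical sketch of a connect-time compatibility check between a
# customer gateway and a trading unit. All names and version numbers
# are illustrative; they are not taken from NYSE's actual systems.

class IncompatibleReleaseError(Exception):
    """Raised when a gateway and a trading unit cannot interoperate."""


class TradingUnit:
    # Gateway releases this trading unit knows how to talk to.
    SUPPORTED_GATEWAY_RELEASES = {"5.2", "5.3"}

    def __init__(self, release: str):
        self.release = release

    def accept_connection(self, gateway_release: str) -> None:
        # Refuse the session up front rather than failing mid-stream,
        # which is the kind of mismatch the post-mortem describes.
        if gateway_release not in self.SUPPORTED_GATEWAY_RELEASES:
            raise IncompatibleReleaseError(
                f"gateway release {gateway_release} is not compatible "
                f"with trading unit release {self.release}"
            )


unit = TradingUnit(release="5.3")
unit.accept_connection("5.3")    # compatible: session proceeds
# unit.accept_connection("5.1")  # would raise IncompatibleReleaseError
```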

[Related: United Airlines, NYSE Outages Reveal Poor Redundancy Architecture, Insufficient Testing]


Problems surfaced before the market opened that morning and escalated over the next few hours before trading was shut down at the NYSE and its affiliate NYSE MKT, formerly known as the American Stock Exchange.

"As customers began connecting after 7 ... on Wednesday morning, there were communication issues between customer gateways and the trading unit with the new release," according to the spokesperson.

A data center infrastructure expert told CRN that while such problems sometimes are unavoidable, stronger testing protocols might have prevented a lot of embarrassment.

Lief Morin, president of Key Information Systems, a data center operator based in Los Angeles, said there are lessons to be learned, primarily: Extensively test software updates and new technology before deploying them anywhere near a production environment.

"There's systems and procedures. We call it: build, test, run. We have separate sets of systems that do all three of those things. You build, then test for performance, reliability and upgradability, then there's a separate set of architectures you run on."

The stock exchange outage occurred almost simultaneously with system failures at United Airlines that grounded the carrier's global fleet and at The Wall Street Journal that brought down the home page of the newspaper's website. Authorities said the three outages were not connected, nor were they any form of coordinated cyberattack.

Morin said Wednesday's "interesting trifecta" serves "to remind us that we're heavily dependent on technology, and on rare occasion now, it fails. And the three high-profile things at the same time, it's a stark reminder of that, and you've got to be vigilant about it."

He said he wasn't surprised those systems didn't immediately fail over to redundant backups without service interruption.

"We do want to believe that you can create a system or set of systems that are ultimately inviolable, but it's not the case," Morin told CRN. "It's not possible. I think that's what we're seeing here in lots of different ways."

But proper testing and quality assurance, while not foolproof, are absolutely imperative, he said.

"I can tell you there are some lessons to learn from that end, and I'm pretty sure they're going to learn them," Morin said. "There's layers of resiliency and redundancy that you build in, and then, once it's in production, it's never a static, monolithic thing. I guarantee they have this process in place, but it looks like something fell through the cracks," he said of United and the New York Stock Exchange.

The NYSE spokesperson said the exchange followed standard protocol when its IT staff deployed the new software on a single trading unit.

Once problems became apparent, even before the market opened at 9:30 am, the gateways were updated with the correct version of the software to restore compatibility.

"However, the update to the gateways caused additional communication issues between the gateways and trading units, which began to manifest themselves midmorning," the spokesperson said.

Customers continued reporting unusual behavior, prompting the decision to halt activity at 11:32 am.

The exchange ultimately rebooted all customer gateways and failed over to backup trading units at its Mahwah, N.J., data center so trading could resume.
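The recovery the exchange describes, restarting the customer gateways and cutting over to backup trading units, amounts to a coordinated failover. As a rough sketch only (the component names, methods and one-to-one gateway-to-backup pairing below are assumptions, not NYSE's published runbook):

```python
# Hypothetical sketch of the recovery sequence the exchange describes:
# drain the suspect primaries, reboot every customer gateway, then
# reconnect each gateway to a backup trading unit. Illustrative only.

class TradingUnit:
    def __init__(self, name: str):
        self.name = name
        self.in_service = True

    def drain(self) -> None:
        # Stop accepting new gateway sessions on the suspect unit.
        self.in_service = False


class Gateway:
    def __init__(self, name: str):
        self.name = name
        self.connected_to = None

    def reboot(self) -> None:
        # A restart clears any stale session state from the bad release.
        self.connected_to = None

    def connect(self, unit: TradingUnit) -> None:
        self.connected_to = unit


def fail_over(gateways, primary_units, backup_units):
    # Take the suspect primaries out of service first so nothing
    # reconnects to them mid-recovery.
    for unit in primary_units:
        unit.drain()
    # Then restart each gateway and point it at a backup unit.
    for gw, backup in zip(gateways, backup_units):
        gw.reboot()
        gw.connect(backup)
```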

Rob Rae, vice president of business development at Datto, a cloud backup service, told CRN the true cause of the NYSE outage was likely something that could impact many more businesses: a lack of investment in IT.

"This was preventable and it appears they didn’t plan appropriately for the outage," Rae told CRN.

Rae said Americans often blame hackers and terrorists for outages when they should be pointing fingers closer to home.

"The NYSE is atop an ever-growing list of businesses who don't devote enough attention to their IT departments," Rae said. "This should never happen. Realistically, we are defaulting to terrorism as the cause of this 'glitch' because the real reason is even scarier, and this outage was probably avoidable."

PUBLISHED JULY 9, 2015