Lessons Learned: Pentium Flaws Aid Intel In Sandy Bridge Chipset Recall

Intel is receiving mostly positive feedback for its handling of the Sandy Bridge chipset recall, and a big reason for that may be the chip maker's past missteps in dealing with high-profile design errors and recalls.

The world's largest chip maker was once known for making critical mistakes related to product flaws, and one episode in particular 17 years ago proved to be a crucial learning experience for Intel that helped it avoid a potentially embarrassing repeat.

In 1994, a few missing entries in Intel's lookup table for its divide operation algorithm led to an error in the P5 Pentium's floating point unit, producing inaccurate results in approximately 1 in 9 billion floating point divides. Intel referred to it as a "slight reduction in precision." That's a long way of saying that the chips would occasionally cause incorrect math results in very rare cases.

At first, Intel appeared to ignore the problem and downplay the significance of the error. Later, the company offered to replace processors -- but only for those Pentium users who could prove that their systems had been affected by the FDIV flaw. In other words, simply having the flawed product wasn't enough to get you a refund; you had to have the product break before getting your money back. Imagine Ford and Firestone requiring customers to actually crash their Explorers before recalling the vehicles.

Predictably, Intel's reaction to the floating point error triggered a public outcry that quickly built momentum. Eventually, the chip giant offered to replace all flawed Pentium processors on request and cover the costs associated with its Pentium FDIV replacement program.

On Jan. 31, just a few weeks after launching the much-anticipated Sandy Bridge platform at CES 2011 in Las Vegas, Intel announced a design error in its Cougar Point chipset, affecting systems built on its Core i5 and i7 Sandy Bridge processors. Intel said it had learned -- after the chipsets began shipping -- that Cougar Point's SATA ports would degrade over time, adversely affecting SATA hard drives in some Sandy Bridge systems.

According to Intel Chief Financial Officer Stacy Smith, all Intel needed to do to fix it was make a change to a metal layer in the SATA ports. Smith said Intel's engineers identified the problem, corrected it, and began production on replacement chipsets all within a matter of days.

Tim Ulmen, principal at Midwest IT Solutions Group, a Wichita, Kan.-based system builder, said the Cougar Point issue is reminiscent of the Intel Pentium FDIV bug from 1994.

"Albeit the error occurred rarely, it was a design flaw and for a brief period of time Intel skirted the issue and set a policy of replacing CPUs only in cases where the customer could prove the flaw affected them," Ulmen said. "I can vividly remember those days as I was involved with selling tier one computers at a retail level. Most of my customers tended to discount the error because of its infrequent nature but I did have some customers who insisted on a replacement CPU."

Meanwhile, Todd Swank, vice president of marketing at system builder Nor-Tech, said he recalls the Pentium FDIV error being a major issue -- but one that's very different from the Sandy Bridge situation.

"[The Pentium FDIV flaw] happened after so many customers had already had systems in hand and were already relying on them," Swank said. "Plus, that was a difficult problem to detect and customers feared that their fundamental data could be flawed and they didn’t even know about it. The recent Sandy Bridge episode was caught so early that very few customers were actually affected. It also involved a very obvious failure should a customer’s SATA drive suddenly stop working, nothing as subtle as the original Pentium flaw."

The massive fallout from the Pentium FDIV bug became a cautionary tale about how companies, given the evolution of the Internet and the media as a whole, must handle such unforeseen circumstances without seeming arrogant, secretive or indifferent.

Much of the negative blowback resulted from Intel's apparent mishandling of public perception. Thomas Nicely, then a professor of mathematics at Lynchburg College, independently discovered and disclosed the Pentium FDIV flaw in October 1994. Nicely brought the flaw to the attention of Intel; amazingly, the chip maker told Nicely that it had been aware of the FDIV bug as far back as May or June. Rather than make a public announcement about the Pentium flaw, Intel kept the findings under wraps.

On his personal Web site, Nicely wrote that overall Intel seemed completely unprepared for something like the FDIV flaw. "Intel's initial failure to publicize the problem…was in retrospect a mistake" that alienated its OEM partners and most valued customers, Nicely wrote. He added that Intel's failure to warn its own tech support staff was "even more baffling."

For several weeks in 1994, Intel behaved as though there was no problem at all. Alexander Wolfe of EE Times broke the Pentium error story on November 7 after discovering Nicely's findings on a CompuServe message board. Much like the recent Sandy Bridge recall, Wolfe's story included not only initial news of a problem, but also Intel's immediate assurance that it had discovered the anomaly through its own random testing.

Steve Smith, then a Pentium engineering manager, now vice president and director of PC Client Operations and Enabling at Intel, told Wolfe fixing the flawed Pentiums required a mask change to its floating-point unit and an update to its programmable-logic array. "This doesn't even qualify as an errata," an Intel spokesperson told EE Times. "We fixed it in a subsequent shipping."

Nevertheless, IBM stopped shipping Pentium-based PCs despite Intel's objections, and media attention grew as news outlets such as CNN picked up the story two weeks after EE Times first broke it. Intel told The New York Times that it had indeed discovered the FDIV bug in June and had already corrected the design error -- yet flawed chips were still in the market and had yet to be recalled. As a result, Intel experienced a significant decline in its stock price, and later, eight product liability lawsuits and two shareholder lawsuits were filed against Intel.

"That was messy for Intel. Very messy," said Steve Brown, vice president of sales and business development at BlueHawk Networks, a San Jose, Calif.-based system builder. "Pentium was at the core of what they were all about at the time. It was a PC chip that hit their top bread-and-butter as a chip company."

But the problem worsened. Even after Intel established its FDIV replacement program, it did not allow resellers to participate, citing the end user's prerogative. "This program is meant for end users of working systems, who are concerned about the impact of the flaw on their specific programs and applications," Intel wrote on its support site. "It is the individual decision of the end user to determine if the flaw is affecting their application accuracy."

Intel required that end users replace the processor themselves and told customers who reported that replacement CPUs were not working that the system should work if the right chip and installation procedure were used. Intel also said it would not offer upgrades "under any circumstances." According to technology writer Vince Emery, who penned an extended account of Intel's Pentium woes soon afterward, Intel remained silent while the public grew restless and began to believe that Intel did not stand behind its products.

In contrast to the Pentium FDIV episode, when revealing the Cougar Point error earlier this year, Intel took immediate action even though the SATA port flaw would manifest itself only years down the road, even for the most extreme users. Steve Smith said Intel would continue to ship Sandy Bridge processors and that the 32-nm process remained robust. He added confidently that Intel would temporarily hold shipments of Cougar Point chipsets and expected to resume shipments quickly. At the time, Smith said shipments would resume in April at the latest and possibly in March. Smith's statements rang true; when Intel resumed shipments in February, it underscored the simplicity of the fix and projected the appearance of confidence.

Intel was also quick to assure its OEM and reseller partners downstream that it would support them and assured customers that it would resolve the Cougar Point problem. This was a stark contrast to the company's reaction in 1994, when Intel exposed itself to accusations of arrogance by dismissing even the possibility that it had made a mistake with Pentium.

Even after independent parties reverse-engineered the Pentium's floating-point division and created a model that predicted when an error would occur, Intel continued to downplay the FDIV flaw and denied that it occurred often enough to justify a recall. By December, the problem had grown beyond Intel's control; just days before Christmas the company announced a full recall and replacement for all flawed Pentium chips, which would result in a pre-tax charge of $475 million for Intel.

Ironically, Nicely himself questioned the need for a full Pentium recall, given the high cost to Intel and the low number of users that would be directly affected by the FDIV error. "Certainly, it was appropriate from a public relations standpoint," Nicely wrote of the recall on his Web site. "My own feeling is that a great many people overestimated the importance, impact, and peril of the flaw."

But if Intel in November had replaced the defective chips regardless of whether users could prove that their applications were affected, Emery wrote, the company could have avoided the enormous backlash and brand damage that followed in December. As it was, Intel's response to the FDIV bug became a bigger issue than the actual chip flaw. "Not disclosing the information made Intel appear to hide a sinister secret," Emery wrote. "It sent the message to customers that Intel was not trustworthy. Disclosing the flaw upon discovery would have created only minor news."

"Minor" may be a relative term here; after all, the Cougar Point flaw became a huge headline even though Intel jumped out in front of the story and self-reported the flaw. But the episode quickly faded after a couple of weeks, whereas the Pentium FDIV flaw snowballed into a larger, more damaging story for Intel that lasted a couple of months. As the saying goes, the cover-up became worse than the crime.

Instead, even when Intel finally acknowledged the problem in November, then-Intel CEO Dr. Andrew Grove -- who later wrote that the FDIV bug experience marked a major strategic inflection point for the chip maker -- addressed the issue in an e-mail rather than a press conference or conference call with media members.

Intel's handling of the Sandy Bridge chipset problem, on the other hand, started with immediately alerting all concerned parties via conference call, and reflected an understanding that the medium is often as important as the message.

The chip maker also presumably learned from IBM's proactive response -- issuing a press statement and suspending the shipment of Pentium-based PCs without pointing out the rarity of the flaw -- that it's important to work with one's OEM partners when an unforeseen issue arises. Even when Intel earlier this month chose to resume shipping the Intel 6 series chipset, Intel said it decided to lift the hold on shipments after extensive talks with its OEM partners on the subject.

In addition, Intel and its OEM partners learned to keep specific brand names out of the conversation when announcing errors, delays, or recalls. Intel spent a significant amount of time and capital on promoting its Pentium brand, which, during the uproar over the FDIV bug, became a running joke and, in some cases, an outright insult (remember the joke about how many Pentium engineers it takes to screw in a light bulb?).

During the conference call for the Sandy Bridge chipset recall, both Stacy Smith and Steve Smith repeatedly used the term "Cougar Point" rather than Sandy Bridge, and were quick to point out that the Sandy Bridge chip itself had not been affected. Partners like Lenovo have since launched Sandy Bridge-based systems under the slightly unwieldy moniker "second-generation Intel Core i3, i5, i7 processors" rather than invoke a name widely associated with a specific mistake.

Fixing the error when it occurs is crucial to minimizing damage, but paying attention to the impact on the brand is also important. Michael Oh, president of Boston, Mass.-based Apple partner Tech Superpowers, was in the industry in 1994 and remembers how impactful the Pentium FDIV error was for Intel's brand. "From a technical standpoint, I never ran into a client that actually had the issue," he said. "Theoretically it was something that could happen, but most people weren't doing spreadsheets to the twentieth degree. The primary problem was from a brand standpoint, and that was substantial partially because [the FDIV error] got out into the market."

Even if Intel's new Sandy Bridge platform had suffered extensive damage to its reputation because of the Cougar Point flaw, it would not have had quite the same impact on Intel's overall business as the widespread disparagement of Pentium did. "Sandy Bridge is different; Intel isn't depending on a single product line as it used to," Oh said. "The impact on the users, as far as I can tell, is pretty minimal. It isn't necessarily that they dodged a bullet -- the problem was publicized enough that Intel had to do some scrambling -- but it seems to be a problem that only arises in the long-term."

Despite some lessons learned from the Pentium P5 FDIV flaw, Intel would stumble again years later. In the summer of 2000, Dr. Thomas Pabst of Tom's Hardware Guide reported instability with Intel's new 1.13 GHz Pentium III chip that was so severe he couldn't sufficiently test the product. While he initially attributed the instability to a microcode update, after performing further tests and conferring with other PC hardware reviewers Pabst came to the conclusion that the Pentium III itself was flawed.

Pabst brought his findings to the attention of Intel, which downplayed the results and claimed that Tom's Hardware's test motherboards and cooling system were to blame for the chip's erratic behavior. When Pabst continued to press the issue, Intel took Tom's Hardware "out of the loop" and refused to brief the site on its forthcoming Pentium 4 technology. Essentially, Intel blackballed Tom's Hardware for reporting what was later confirmed to be true: the 1.13 GHz Pentium III model was flawed. Intel finally admitted the problem weeks after Pabst first brought the issue to light.

Ironically, Intel ran into more trouble with Pentium 4 just weeks later as the company delayed the microprocessor's launch because of chipset issues. The Pentium 4 chipset glitch persisted after launch and adversely affected both the product's introduction and the company's reputation.

The 1.13 GHz Pentium III error wasn't as serious as the Pentium FDIV episode, in part because Intel had become a larger force in the IT industry and had established Pentium as a successful and popular product after 1994. However, the Pentium III issue gave Intel critics further evidence that the company was arrogant and unwilling to accept responsibility for a mistake.

Intel's reaction to problems such as these has obviously changed over the last 11 years. Rather than deflecting media inquiries and partner concerns, the chip giant has begun to tackle product issues head on, and the means for dealing with the issues -- both in terms of communications and logistics -- have also drastically changed. In short, the evolution of the Internet has made prolonged silences not only highly inadvisable but also virtually impossible.

Following the FDIV bug in 1994, Intel did not offer Internet ordering capability as part of its replacement program, nor did it provide customer support online until the problem came to light, either of which would be unthinkable today. In 1994, Intel's Steve Smith referred to Professor Nicely as "the most extreme user," perhaps not realizing at the time that Nicely could simply disseminate Intel's unofficial confirmation of the problem to hundreds of other "extreme" users over the Internet in a matter of hours via e-mail and primitive message boards. (Intel did not provide comment for this story.)

The Internet enabled communication between enthusiasts in any niche, which we take for granted today -- but Intel learned the hard way. Back then, as Emery points out, it took some time before news of the issue went from technology-oriented news media such as EE Times to more mainstream, financial or industry-oriented news media such as CNN. Today, there is less of a distinction between mainstream and technology media, and the time it takes for a major headline to move from a niche-focused outlet to national news coverage is closer to two hours, not two weeks.

"I certainly think times are different," Oh said. "If you look at the world of social media today, and the rapid transfer of information where people find out about a flaw like this within hours, and compare that to 1994, which is effectively before mainstream Internet, most people then would have read about it in the paper or Time magazine and it was easier to have plausible deniability in a situation like that. Nowadays, Intel's handling may have been indicative of learning from what happened in '94, but mostly as a modern technology company they have to fess up to this stuff immediately."

The nature of semiconductor technology has changed over the years as well, as Intel and its rivals have become increasingly concerned with reliability, power efficiency and integration. Intel in 1994 warned customers in its FDIV replacement program that many systems are designed for one speed only and attempting to upgrade to a functional program could damage the processor. Customers testing their systems for the error using instructions posted by engineers at The University of Kansas had to enter an elaborate formula into Windows Calculator, rather than rely on the kind of testing software widely available today.
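For readers curious what such a hand-entered check looks like, here is a minimal sketch in Python. It rests on an assumption: the article does not quote the formula from the University of Kansas instructions, so the sketch uses the operand pair most widely circulated during the episode, 4195835 and 3145727. Dividing the first by the second and multiplying back should return the original number, give or take ordinary rounding error; flawed Pentiums were widely reported to come back off by 256. The function names and the tolerance are illustrative choices, not anything Intel or the university published.

```python
# Illustrative FDIV self-check. The operand pair is the widely circulated
# 1994 test case; it is an assumption that this matches the formula the
# article refers to, which is not quoted in the text.
X = 4195835.0
Y = 3145727.0


def fdiv_residual(x: float = X, y: float = Y) -> float:
    """Return x - (x / y) * y, which should be essentially zero."""
    return x - (x / y) * y


def looks_flawed(residual: float, tolerance: float = 1.0) -> bool:
    """Flag a residual far larger than ordinary rounding error.

    The tolerance of 1.0 is a hypothetical threshold chosen for clarity.
    """
    return abs(residual) > tolerance


if __name__ == "__main__":
    r = fdiv_residual()
    print("residual:", r)
    print("divider looks flawed:", looks_flawed(r))
```

On any current machine the residual should be zero or vanishingly small, so the check reports no flaw; the point is only to show how little arithmetic the test actually involved.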

As the ability to check these results expanded to more and more people, Intel made the cautious, proactive decision to reveal the Cougar Point design error quickly. Despite the fact that it turned out to be much less serious than the FDIV bug (a chipset flaw versus an actual design error on the processor itself), it necessitated swift action. Steve Smith on Feb. 1 said fewer than five percent of Cougar Point SATA ports may degrade over a few years inside notebooks. He added that the chipset had passed several stages of rigorous testing, and that only OEMs testing their systems at high temperatures discovered it.

"It certainly is a serious issue," Oh said. "If this had gone into the market, with thousands of units sold, and had the performance of the machines gone down over time because of a bug, that would have had a major financial impact -- product recalls, financial lawsuits. I don't want to understate the potential train wreck this could have been for Intel, but at the end of the day they've avoided it -- so far, at least."

In 1994, it took months for Intel to manufacture corrected Pentium chips. A decade-and-a-half of dominance later, Intel has the fabrication facilities and the capital to make a swifter -- and evidently immediate -- transition to producing corrected chipsets.

Intel said that only previous-generation SATA ports for Cougar Point were affected by the error and that despite some supply line disruption, Intel's roadmap had not changed and the company did not expect the issue to affect full-year revenues. Although it remains to be seen if Intel avoids permanent financial damage, the resumption of Sandy Bridge shipments a week later is a good sign.

Brown, for one, believes the Sandy Bridge chipset recall is very different -- both in terms of technology and the people involved in the decision-making process -- and that Intel itself is also different.

"Today’s issue is far less important, especially given Intel’s emergence as a leader in the industry, and all the things they do besides their core business," Brown said. "Intel has learned from what happened with Pentium in 1994, and they certainly did address it. At the same time, compared to 1994, there is a new breed of people now that don’t necessarily know what happened 17 years ago and are taking a different tact and a different method today."