In an exclusive interview with ChannelWeb, AMD Executive Vice President of the Computing Products Group Mario Rivas discusses how the planned ramp-up of the company's Opteron and Phenom quad-cores failed and what AMD is doing to fix problems. Here are edited excerpts.
When did AMD discover the glitch on Barcelona and Phenom? What did you do when you found out there was a problem?
First, let me give you a little bit of background here. No. 1, we've been saying all along, it's unfortunate that the problems we are causing are coming across like we don't care about the channel. We've been getting quite a few phone calls from our partners in the channel asking, 'Have you changed your mind?' The answer is no. For me, some members of my family work within the channel, so it gets a little bit personal here. But, no, we are definitely committed. And everybody's frustrated that we're not able to ship Opteron processors in the volumes that we had committed to do. Nobody's more frustrated than I am, because there's demand in the market. It's a very complex, maybe the most complex x86 processor ever designed. And we encountered unexpected challenges when we ramped the part and we worked with partners.
The particular bug that we're talking about, we started just like the other ones, as an observation. It was not until mid-November that it actually moved from just being an observation in a testing environment to being a more serious bug. Once you get the bug, we tried to do BIOS workarounds, we looked for board modifications, even in some instances, for some patches we could do that would not degrade performance. And again, we were successful 90 percent of the time.
You have to understand that we do not immediately tell the world that we have an observation or a particular bug. So once we tried however many different validations of this as we tried to do, then we move into the question of, does this affect a real-world application, or is this a theoretical problem that will never occur? And we have a group of people that have ample experience in the server industry in this case, who made the call that yeah, we think this can happen in real life. When we reached that point, it was a Friday and we started notifying the customers on Monday on that particular case.
We still are shipping Phenoms in quantities. We are taking good care of the channel, as far as I know. If you're hearing negative feedback on the Phenom, I'd like to hear about it.
No, it hasn't been about Phenom, it's been about the Opterons.
Right. I think this particular errata really affects server work, where the lack of integrity of data in the caches does have a big impact, a possible big impact. But it doesn't affect the client in the same manner. We were able to identify a workaround for the desktop product that would not have been suitable for several of the server workloads, of the quality and performance we need for the server-workstation market.
On the workaround on the Opterons, the Tech Report source talks about a 10 percent to 20 percent performance hit. Is that true, is that something you can address?
Yes, 5 to 20 percent is something we did see on our fix, which is unacceptable for the server space. It depends on the application. In some workloads you will only see 5 percent, in some 20. And then you talk to each customer individually and you give them the option to go forward or not, and in some cases a few customers have chosen to go forward, you know, even with the degradation.
On some of the HPC clusters, the workaround does not affect performance?
No, it always affects performance. It's just that for the particular load, our architecture scales very well. So when you have a several thousand-unit node, the probability of this happening is much smaller. And it depends on the operating system, too. Now, I don't want you to take this as though we're saying everything is okay. No, this is definitely not the ideal situation we were envisioning as late as the beginning of November.