Why Nvidia Chose AMD EPYC For Its New DGX A100 AI System

'To keep the GPUs in our system supplied with data, we needed a fast CPU with as many cores and PCI lanes as possible,' an Nvidia executive says of the company's decision to choose AMD over Intel for its new DGX A100 deep learning system.


Nvidia said it chose AMD's latest EPYC server processors over Intel Xeon for its new DGX A100 deep learning system because it needed to squeeze as much juice as possible from its new A100 GPUs in order to realize their generational leap in performance.

The Santa Clara, Calif.-based company announced the A100 and the DGX A100, the star vehicle for the new data center GPU that combines inference and training acceleration, on Thursday. While the new GPU and system capabilities were the main attraction, the DGX A100's choice of CPU marked a notable departure for the GPU powerhouse.

Nvidia had previously relied on Intel's Xeon processors to provide the CPU compute horsepower for its first two DGX systems, but that changed with the DGX A100, which sports two 64-core AMD EPYC 7742 processors.

Charlie Boyle, vice president and general manager of DGX Systems at Nvidia, told CRN that the decision to go with one of the top processors in AMD's second-generation EPYC Rome lineup came down to speed, core count and throughput.

"We always start our DGX designs around getting the most out of our GPUs. Our new Nvidia A100 GPUs that we use in the DGX A100 deliver a tremendous leap in performance and capabilities," he said in a statement. "To keep the GPUs in our system supplied with data, we needed a fast CPU with as many cores and PCI lanes as possible. The AMD CPUs we use have 64 cores each, lots of PCI lanes, and support PCIe Gen4."

Beyond the high performance and core counts offered by the AMD EPYC 7742, the other key was the CPU's support of PCIe 4.0, which offers twice the per-lane bandwidth of PCIe 3.0. Intel's second-generation Xeon Scalable processors, on the other hand, only support PCIe 3.0.

"The DGX A100 is the first accelerated system to be all PCIe Gen4, which doubles the bandwidth from PCIe Gen3. All of our IO in the system is Gen4: GPUs, Mellanox CX6 NICs, AMD CPUs, and the NVMe drives we use to stream AI data," Boyle said.
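For readers who want to see where the "doubles the bandwidth" figure comes from, the arithmetic can be sketched in a few lines of Python. This is a rough back-of-the-envelope calculation, not anything Nvidia provided: it uses the published per-lane signaling rates (8 GT/s for Gen3, 16 GT/s for Gen4) and the 128b/130b line encoding both generations share, and it assumes a x16 link purely for illustration. The function and variable names are made up for this example, and real-world throughput will be lower once protocol overhead is counted.

```python
# Per-lane raw signaling rates in gigatransfers per second (GT/s).
GT_PER_S = {"gen3": 8.0, "gen4": 16.0}

# Both generations use 128b/130b encoding: 128 payload bits per 130 bits sent.
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gbytes(gen, lanes=16):
    """Approximate bandwidth in GB/s for one direction of a PCIe link.

    Ignores packet and protocol overhead, so treat it as an upper bound.
    The x16 default is an assumption chosen for illustration.
    """
    gigabits_per_s = GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes
    return gigabits_per_s / 8  # 8 bits per byte


gen3 = link_bandwidth_gbytes("gen3")
gen4 = link_bandwidth_gbytes("gen4")

print(f"PCIe Gen3 x16: {gen3:.2f} GB/s per direction")
print(f"PCIe Gen4 x16: {gen4:.2f} GB/s per direction")
print(f"Gen4 / Gen3:   {gen4 / gen3:.1f}x")
```

Because the encoding is the same in both generations, the ratio depends only on the signaling rate, which is why the doubling holds regardless of how many lanes a device uses.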

However, despite the change in CPU vendor between the first two DGX systems and the new DGX A100, customers won't notice a difference, except for improved performance, according to Boyle.

"At the system level, we do the software engineering work to make the differences between CPU architectures invisible to customers," he said. "Our customers can take applications that ran on our previous generation DGX systems and run them on the new DGX A100 without making any changes – the applications just run faster."

Eliot Eshelman, vice president of strategic accounts and HPC initiatives at Microway, a Plymouth, Mass.-based high-performance computing system builder that works with Intel, AMD and Nvidia, said he wasn't surprised by Nvidia's decision to go with AMD EPYC for the new DGX system because it's all about reducing the number of bottlenecks.

"Intel is still on PCIe Gen3, so it's half the bandwidth of AMD. I feel like it's kind of a no-brainer," he said.

An executive at a system builder that also works with Intel, AMD and Nvidia said AMD EPYC's support for eight-channel memory also gave the chipmaker a leg up over Intel, which only supports up to six memory channels in its socketable Xeon Scalable processors.

"The eight-channel memory and the PCIe 4.0 and the high core counts all gave AMD an advantage in the selection process. By default, it has to," he said. "That's why AMD EPYC Rome has been very competitive in certain segments. It's not just that they're priced reasonable, which they are. But they give you significant advantages in core, IO and memory addressing."
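The memory-channel point can be put in rough numbers the same way. The sketch below is illustrative only: it uses the standard rule of thumb that peak theoretical bandwidth is channels times transfer rate times bus width, with a 64-bit (8-byte) DDR4 channel. The 3,200 MT/s memory speed is an assumption applied to both configurations so that only the channel count differs; it is not a figure from either vendor quoted here, and the function name is hypothetical.

```python
def peak_memory_bandwidth_gbytes(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s for one CPU socket.

    bus_bytes=8 reflects a 64-bit DDR4 channel. Sustained bandwidth in
    practice is lower than this theoretical peak.
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000


# Assumption for illustration: the same memory speed on both platforms,
# so the comparison isolates the effect of channel count.
ASSUMED_MT_PER_S = 3200

eight_channel = peak_memory_bandwidth_gbytes(8, ASSUMED_MT_PER_S)
six_channel = peak_memory_bandwidth_gbytes(6, ASSUMED_MT_PER_S)

print(f"Eight channels: {eight_channel:.1f} GB/s")
print(f"Six channels:   {six_channel:.1f} GB/s")
print(f"Advantage:      {eight_channel / six_channel:.2f}x")
```

At equal memory speeds, two extra channels work out to about a one-third increase in peak bandwidth per socket; any difference in supported memory speed between the two platforms would shift that figure.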

But with Intel planning to launch its 10-nanometer Ice Lake server processors later this year, the system builder executive said he expects Intel to leapfrog AMD, at least for a while.

"It's a leapfrog game, and then they'll potentially be back in the league for a little bit," he said.