Intel’s Gaudi 3 AI Chip Targets Nvidia H100, H200; Scales To 8,000-Chip Clusters

At the Intel Vision event, the semiconductor giant reveals several details of its upcoming Gaudi 3 AI chip, including competitive comparisons against Nvidia’s H100 and H200 GPUs, the release schedule, specifications and OEM support.

Intel said its upcoming Gaudi 3 AI accelerator chip can best Nvidia’s powerful H100 GPU for training large language models and offer similar or, in some cases, better performance than the rival’s memory-rich H200 for large language model inferencing.

At the Intel Vision event Tuesday, the semiconductor giant unveiled the performance comparisons and several other details for Gaudi 3, including OEM support and a reference architecture that relies on Ethernet to scale server clusters to more than 8,000 chips.

The Santa Clara, Calif.-based company is hoping Gaudi 3, the successor to Intel’s Gaudi 2 from 2022, will give it a major boost in competition against Nvidia. The rival has dominated the AI computing space with its GPUs and recently unveiled plans to launch much more powerful follow-ups to the H100 and H200 later this year using its new Blackwell architecture.

Intel said air-cooled and liquid-cooled versions of the Gaudi 3 accelerator card will start sampling with customers in the first and second quarters of this year, respectively. It then plans to launch the air-cooled version in the third quarter and the liquid-cooled version in the fourth quarter.

At launch, Gaudi 3 will be supported by Dell Technologies, Hewlett Packard Enterprise, Lenovo and Supermicro, which are expected to receive the chip by the second quarter. The chip will also be made available in the Intel Developer Cloud platform for testing and development.

While Intel is targeting Nvidia’s H100 that debuted in 2022 and the recently launched H200 with Gaudi 3, the company will have to contend with the fact that Nvidia could start rolling out much more powerful GPUs around the same time as Gaudi 3.

This is part of an accelerated road map Nvidia announced last fall to stay ahead of rivals, underscoring one of the major challenges Intel faces in trying to gain market share. The chipmaker is also facing competition from AMD and cloud service providers designing their own AI chips, like Amazon Web Services and Microsoft Azure, as well as a slew of startups.

But Intel executives emphasized that customers are looking for alternatives to Nvidia’s powerful and popular chips and said they are building a comprehensive AI strategy with open, scalable systems that will be attractive for enterprises across all AI segments.

“When we talk to customers, they are clearly asking us for choice, for competitive price-per-performance, for time-to-value and for an open ecosystem for AI solutions and deployments,” said Jeni Barovian, general manager of AI solutions strategy and product management within Intel’s Data Center and AI Group, in a briefing with journalists and analysts.

“And so we have a commitment here at Intel in a number of different areas: to streamline the developer workflow for AI, to simplify AI infrastructure, to provide scalable systems and to accelerate AI workloads with our foundational silicon, software and systems,” she said.

As for Nvidia’s upcoming Blackwell GPUs, Intel plans to measure the performance of the rival chips once they launch, but the company expects Gaudi 3 “to be highly competitive” due to its pricing as well as other considerations like the chipmaker’s reliance on industry-standard Ethernet and Gaudi 3’s open integrated network on chip, according to Intel’s Das Kamhout.

“We believe it's a strong offering that provides that level of choice at the right [total cost of ownership] and the power efficiency that our customers are demanding,” said Kamhout, who is a vice president and senior principal engineer in the Data Center and AI Group.

Gaudi 3 Specs, Improvements Over Gaudi 2

Compared with Gaudi 2, Gaudi 3 will provide double the AI compute performance using the 8-bit floating point (FP8) format and quadruple the performance using the 16-bit BFLOAT16 format, according to Intel. It will also offer two times greater network bandwidth and 1.5 times greater memory bandwidth, the company added.

There are a variety of factors that make these improvements possible while using the same underlying architecture as the predecessor chip.

Eitan Medina, COO of Intel’s Habana Labs unit that designs the Gaudi chips, said Gaudi 3 is made up of two identical dies packaged together with eight high-bandwidth memory (HBM) banks. These dies are manufactured using a 5-nanometer manufacturing process in contrast with a 7nm process that has been used for Gaudi 2.

Each of the two dies has four matrix math engines, 32 fifth-generation Tensor Processor Cores and 48 MB of SRAM, and they’re connected using a “very high-bandwidth interface that from a software point of view, allows [Gaudi 3] to act as a single device,” according to Medina.

This gives Gaudi 3 a total of eight matrix math engines, 64 Tensor Processor Cores and 96 MB of SRAM. The chip also has a total HBM capacity of 128 GB and memory bandwidth of 3.7 TBps. It comes with support for 16 lanes of PCIe 5.0 connectivity as well.

What helps make Gaudi 3 unique, according to Medina, is the fact that the chip integrates 24 200-Gigabit Ethernet (GbE) network interface ports. These ports take advantage of a technology called RDMA over Converged Ethernet (RoCE), which enables remote direct memory access over an Ethernet network. This, in turn, allows up to eight Gaudi 3 chips to communicate with each other within a single server as well as across multiple nodes.
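
To put the port count in perspective, the short Python sketch below multiplies out the raw line rate of those 24 ports. It is back-of-the-envelope arithmetic, not a figure supplied by Intel: it assumes every port runs at its full 200 Gbps in one direction and ignores Ethernet and RoCE protocol overhead. The function name is hypothetical.

```python
# Figures quoted in the article: 24 integrated ports at 200 Gbps each.
PORTS_PER_CHIP = 24
GBITS_PER_PORT = 200


def aggregate_ethernet_bandwidth(ports: int = PORTS_PER_CHIP,
                                 gbits_per_port: int = GBITS_PER_PORT) -> dict:
    """Raw one-direction line rate across all ports of a single chip.

    Assumption: every port runs at full speed at the same time and
    protocol overhead is ignored, so this is an upper bound.
    """
    total_gbits = ports * gbits_per_port
    return {
        "gbits_per_s": total_gbits,
        "tbits_per_s": total_gbits / 1_000,
        "gbytes_per_s": total_gbits / 8,  # 8 bits per byte
    }


if __name__ == "__main__":
    bw = aggregate_ethernet_bandwidth()
    print(f"{bw['gbits_per_s']} Gbps = {bw['tbits_per_s']} Tbps "
          f"= {bw['gbytes_per_s']} GBps per chip, per direction")
```

Under those assumptions, each chip's ports add up to 4.8 Tbps, or 600 GBps, in each direction.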

“Depending on what your workload needs are for inference, fine-tuning or training, you [can] build clusters at any size. [You] form a single node with just eight accelerators. And using Ethernet, you can build racks and complete clusters with literally thousands of Gaudis. This is what our customers are doing today,” Medina said.

Intel plans to make Gaudi 3 available in two form factors: an accelerator card that is compliant with the Open Compute Project’s Accelerator Module (OAM) design specification and a dual-slot, full-height PCIe add-in card that offers the same peak FP8 performance of 1,835 teraflops.

The PCIe version of Gaudi 3 will have a thermal design power of 600 watts while the OAM version will go up to 900 watts with air cooling and up to 1.2 kilowatts with liquid cooling.

Gaudi 3 Scales To 1,024-Node Clusters

Intel has developed reference architectures for Gaudi 3 that start with a single node connecting eight chips and scale all the way to a 1,024-node cluster consisting of 8,192 chips.

These multi-node clusters are necessary for training large AI models faster and enabling lower latency and high throughput for the inferencing of such models.

Enabled by the OAM-compliant universal baseboard developed by Intel, the single node can achieve 14.7 petaflops of FP8 compute performance, and it sports 1,024 GB of memory and 8.4 TBps of networking bandwidth, according to Intel. A 64-node cluster brings that up to 940 petaflops, 65.5 TB of memory and 76.8 TBps of networking bandwidth.

Scaling to 512 nodes in a cluster provides 7.5 exaflops, 524.3 TB of memory capacity and 614 TBps of networking bandwidth. Bringing that to 1,024 nodes enables 15 exaflops, 1 PB of memory capacity and 1.2 PBps of network bandwidth.
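
The compute and memory totals follow from multiplying the per-chip figures by the chip count, as the minimal Python sketch below illustrates. It assumes peak FP8 throughput and HBM capacity add up linearly and that units are decimal (1 TB = 1,000 GB), so results may differ from the quoted numbers by rounding. Networking bandwidth is left out because the article gives no per-chip breakdown for it. The function and constant names are hypothetical.

```python
# Per-chip and per-node figures quoted in the article.
CHIPS_PER_NODE = 8
PEAK_FP8_TFLOPS_PER_CHIP = 1_835
HBM_GB_PER_CHIP = 128


def cluster_totals(nodes: int) -> dict:
    """Chip count, peak FP8 petaflops and HBM terabytes for a cluster.

    Assumption: peak figures scale linearly with chip count, which
    describes theoretical capacity rather than delivered performance.
    """
    chips = nodes * CHIPS_PER_NODE
    return {
        "chips": chips,
        "fp8_petaflops": chips * PEAK_FP8_TFLOPS_PER_CHIP / 1_000,
        "hbm_terabytes": chips * HBM_GB_PER_CHIP / 1_000,
    }


if __name__ == "__main__":
    for nodes in (1, 64, 512, 1024):
        t = cluster_totals(nodes)
        print(f"{nodes:>5} nodes: {t['chips']:>5} chips, "
              f"{t['fp8_petaflops']:>9,.1f} petaflops FP8, "
              f"{t['hbm_terabytes']:>7,.1f} TB HBM")
```

For the largest configuration, this works out to 8,192 chips, roughly 15,000 petaflops (15 exaflops) of peak FP8 compute and about 1,049 TB (roughly 1 PB) of memory.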

“Using Ethernet as a standard interface and working across the ecosystem with our OEM partners, we literally are able to support customers building implementations of AI at any size,” Medina said.

Gaudi 3 Comparisons To H100, H200

Compared with Nvidia’s H100, Gaudi 3 enables 70 percent faster training time for the 13-billion-parameter Llama 2 model, 50 percent faster for the 7-billion-parameter Llama 2 model and 40 percent faster for the 175-billion-parameter GPT-3 model, according to Intel.

The chipmaker also compared Gaudi 3 to Nvidia’s H200, which significantly increases the HBM capacity to 141 GB from the H100’s 80 GB, higher than Gaudi 3’s 128-GB capacity.

For large language model inferencing, which Intel measured by tokens per second, it said Gaudi 3 is on par with the H200 for the 7B and 70B Llama models.

For the 180-billion-parameter Falcon model, however, while Gaudi 3 is on par with the H200 when it comes to 128-token output sequences, the chip gains an edge when those output sequences grow to 2,048 tokens. This translates into a 2.3-fold performance advantage when using 128 tokens as input and a 3.8-fold advantage when using 2,048 tokens as input.

When it comes to power efficiency, Gaudi 3 also has its advantages, according to Intel.

Compared with the H100, Gaudi 3 offers power efficiency that is on par or slightly worse when it comes to inferencing Llama and Falcon models with 128-token outputs, but the chip gains an advantage when those outputs are grown to 2,048 tokens. This advantage, measured by tokens per second per accelerator card per watt, translates into a 20 percent to 40 percent boost for 7B Llama, a 2.1-fold to 2.2-fold boost for 70B Llama, and a 90 percent to 2.3-fold increase for Falcon 180B.
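
The efficiency metric and the mix of "percent" and "fold" figures can be made concrete with a short Python sketch. It reflects one reading of "tokens per second per accelerator card per watt," namely total throughput divided by card count and then by per-card power. The function names are hypothetical, and none of this represents Intel's measurement methodology or results.

```python
def efficiency(tokens_per_second: float, cards: int, watts_per_card: float) -> float:
    """Tokens per second per accelerator card per watt (one interpretation)."""
    return tokens_per_second / cards / watts_per_card


def describe_ratio(ratio: float) -> str:
    """Phrase an advantage ratio as the article does.

    Ratios below 2 become a percentage boost (1.2 -> "20 percent");
    ratios of 2 or more become a multiple (2.1 -> "2.1-fold").
    """
    if ratio >= 2:
        return f"{ratio:.1f}-fold"
    return f"{(ratio - 1) * 100:.0f} percent"


if __name__ == "__main__":
    # The ratios below mirror the ranges quoted in the article.
    for ratio in (1.2, 1.4, 1.9, 2.1, 2.3):
        print(f"{ratio}x -> {describe_ratio(ratio)}")
```

Read this way, a "90 percent to 2.3-fold" increase corresponds to a ratio between 1.9 and 2.3 times the H100's tokens per second per watt.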

“Our customers are telling us that what they really find limiting is actually their ability even to get power to the data center. Price-performance is important. But power-performance is also extremely important,” Medina said.

All of these comparisons were made at single- or multi-node levels rather than using a single accelerator chip, and Intel said it made the comparisons using recently published data from Nvidia.

Gaudi 3 Supported By ‘End-To-End AI Software Stack’

While competitive performance and efficiency are table stakes in the fight against Nvidia, Intel knows that software enablement is equally important to compete.

That’s why the company said it has developed an “end-to-end AI software stack” that includes everything from the firmware, libraries and drivers to the models, frameworks and tools needed to develop AI applications that run on Intel’s Gaudi chips.

Gaudi chips support a wide range of models, including Llama, Stable Diffusion, BERT, Mixtral, GPT-2, CodeLlama and Falcon. They also support frameworks like PyTorch and PyTorch Lightning as well as libraries like Hugging Face and MosaicML. Supported orchestration platforms include Kubernetes and Red Hat OpenShift. Machine learning tools supported by Gaudi chips include the company’s own cnvrg.io as well as TensorBoard and Ray.