AMD: Instinct MI350 GPUs Use Memory Edge To Best Nvidia’s Fastest AI Chips
At its Advancing AI event, the chip designer says the 288-GB HBM3e capacity of the forthcoming Instinct MI355X and MI350X data center GPUs helps the processors provide better or similar AI performance compared to Nvidia’s Blackwell-based AI chips.
AMD said its forthcoming Instinct MI350 series GPUs provide greater memory capacity and better or similar AI performance compared to Nvidia’s fastest Blackwell-based chips, which the company called “significantly more expensive.”
Set to launch in the third quarter, the MI355X and MI350X GPUs will be supported by two dozen OEM and cloud partners, including Dell Technologies, Hewlett Packard Enterprise, Cisco Systems, Oracle and Supermicro.
The company said it plans to announce more OEM and cloud partners in the future.
“The list continues to grow as we continue to bring on significant new partners,” said Andrew Dieckmann, corporate vice president and general manager of data center GPU at AMD, in a Wednesday briefing with journalists and analysts.
The chip designer revealed the MI350 series GPUs and its rack-scale solutions on Thursday at its Advancing AI 2025 event in San Jose, Calif. The company is also expected to provide the first details of the MI400 series that is set for release next year.
The Santa Clara, Calif.-based company has made significant investments to develop data center GPUs and associated products that can go head-to-head with the fastest chips and systems from Nvidia, which earned more than double the revenue of AMD and Intel combined for the first quarter, according to a recent CRN analysis.
“We are delivering on a relentless annual innovation cadence with these products,” Dieckmann said.
MI355X, MI350X Specs And Performance Claims
Featuring 185 billion transistors, the MI350 series is built using TSMC’s 3-nanometer manufacturing process.
The MI355X and MI350X both feature 288 GB of HBM3e memory, which is higher than the 256-GB capacity of AMD’s MI325X and roughly 60 percent higher than the capacity of Nvidia’s B200 GPU and GB200 Superchip, according to the company. AMD said this capacity allows a single GPU to run an AI model with up to 520 billion parameters. The memory bandwidth for both GPUs is 8 TBps, which it said is the same as the B200 and GB200.
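As a rough sanity check on that 520-billion-parameter figure (an illustrative calculation, not AMD’s published methodology), 4-bit weights take half a byte per parameter, so the weights of such a model fit within 288 GB with room to spare:

```python
# Back-of-the-envelope check on AMD's 520B-parameter claim.
# Illustrative assumptions only; AMD did not publish its math, and this
# ignores activation memory and KV cache.
HBM_CAPACITY_GB = 288        # MI355X / MI350X HBM3e capacity
PARAMS_BILLION = 520         # model size AMD says a single GPU can hold

weights_fp4_gb = PARAMS_BILLION * 0.5   # 4-bit weights: 0.5 byte/param
weights_fp8_gb = PARAMS_BILLION * 1.0   # 8-bit weights: 1 byte/param

print(f"FP4 weights: {weights_fp4_gb:.0f} GB (fits in {HBM_CAPACITY_GB} GB)")
print(f"FP8 weights: {weights_fp8_gb:.0f} GB (does not fit on one GPU)")
```

The FP4 case works out to 260 GB of weights, leaving roughly 28 GB for activations and KV cache, which suggests the claim assumes 4-bit weights.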
The MI355X—which has a thermal design power of up to 1,400 watts and is targeted for liquid-cooled servers—can provide up to 20 petaflops of peak 6-bit floating point (FP6) and 4-bit floating point (FP4) performance.
AMD claimed the MI355X’s FP6 performance is double that of the GB200 and more than double that of the B200. Its FP4 performance, meanwhile, matches the GB200’s and is 10 percent faster than the B200’s, according to the company.
The MI355X can also perform 10 petaflops of peak 8-bit floating point (FP8), which AMD said is on par with the GB200 but 10 percent faster than the B200; five petaflops of peak 16-bit floating point (FP16), which it said is on par with the GB200 but 10 percent faster than the B200; and 79 teraflops of 64-bit floating point (FP64), which it said is double that of the GB200 and B200.
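A pattern worth noting in those claimed peaks (the observation is ours, not AMD’s): apart from FP64, which runs on a separate vector pipeline, each halving of precision doubles throughput, and FP6 runs at the full FP4 rate:

```python
# AMD's claimed MI355X peak matrix throughput by precision, in petaflops.
peak_pflops = {"FP16": 5.0, "FP8": 10.0, "FP6": 20.0, "FP4": 20.0}

# Halving precision doubles throughput on this part...
assert peak_pflops["FP8"] == 2 * peak_pflops["FP16"]
assert peak_pflops["FP4"] == 2 * peak_pflops["FP8"]
# ...and FP6 is the exception, matching the FP4 rate rather than
# landing between FP8 and FP4.
assert peak_pflops["FP6"] == peak_pflops["FP4"]
```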
The MI350X, on the other hand, can perform up to 18.4 petaflops of peak FP4 and FP6 math with a thermal design power of up to 1,000 watts, which the company said makes it suited for both air- and liquid-cooled servers.
AMD said the MI355X “delivers the highest inference throughput” for large models, with the GPU providing roughly 20 percent better performance for the DeepSeek R1 model and approximately 30 percent better performance for a 405-billion-parameter Llama 3.1 model than the B200.
Compared to the GB200, the company said the MI355X is on par for the same 405-billion-parameter Llama 3.1 model.
The company did not provide performance comparisons to Nvidia’s Blackwell Ultra-based B300 and GB300 chips, which are slated for release later this year.
AMD: Higher Tokens Per Dollar A ‘Key Value Proposition’
Dieckmann noted that AMD achieved the performance advantages using open-source frameworks like SGLang and vLLM, in contrast to Nvidia’s TensorRT-LLM framework, which he characterized as proprietary even though Nvidia has published it on GitHub under the open-source Apache License 2.0.
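For readers unfamiliar with the frameworks Dieckmann cited, vLLM is driven from a few lines of Python. A minimal sketch follows; the model and sampling settings are placeholders, not AMD’s benchmark configuration, which was not disclosed:

```python
# Minimal vLLM inference sketch; illustrative only, not AMD's benchmark
# setup. vLLM supports AMD Instinct GPUs via ROCm builds.
from vllm import LLM, SamplingParams

# Placeholder model; AMD's comparisons used Llama 3.1 405B and DeepSeek R1.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize HBM3e in one sentence."], params)

for out in outputs:
    print(out.outputs[0].text)
```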
The MI355X’s inference advantage over the B200 allows the GPU to provide up to 40 percent more tokens per dollar, which Dieckmann called a “key value proposition” against Nvidia.
“We have very strong performance at economically advantaged pricing with our customers, which delivers a significantly cheaper cost of inferencing,” he said.
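AMD did not break out the pricing behind the 40 percent figure, but the underlying arithmetic is simple. In the hedged illustration below, both ratios are hypothetical, chosen only to show how such a number could arise:

```python
# Hypothetical tokens-per-dollar illustration; AMD disclosed neither its
# pricing nor Nvidia's, so both ratios below are assumptions.
def tokens_per_dollar_gain(throughput_ratio: float, price_ratio: float) -> float:
    """Relative tokens per dollar vs. a competitor.

    throughput_ratio: MI355X tokens/sec divided by B200 tokens/sec.
    price_ratio: MI355X system price divided by B200 system price.
    """
    return throughput_ratio / price_ratio

# E.g., ~30% more throughput (the Llama 3.1 405B claim above) at ~93% of
# the competitor's price works out to roughly 1.4x tokens per dollar.
print(f"{tokens_per_dollar_gain(1.3, 0.93):.2f}x")  # -> 1.40x
```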
As for training, AMD said the MI355X is roughly on par with the B200 for 70-billion-parameter and 8-billion-parameter Llama 3 models.
But for fine-tuning, the company said the GPU is roughly 10 percent faster than the B200 for the 70-billion-parameter Llama 2 model and approximately 13 percent faster than the GB200 for the same model.
Compared to the MI300X that debuted in late 2023, the company said the MI355X has significantly higher performance across a broad range of AI inference use cases, running 4.2 times faster for an AI agent and chatbot, 2.9 times faster for content generation, 3.8 times faster for summarization and 2.6 times faster for conversational AI. All of those figures are based on a 405-billion-parameter Llama 3.1 model.
AMD claimed that the MI355X is also roughly 3 times faster for the DeepSeek R1 model, 3.2 times faster for the 70-billion-parameter Llama 3.3 model and 3.3 times faster for the Llama 4 Maverick model—all for inference.
Rack-Scale Solutions To Push 2.6 Exaflops Of AI Compute
The company said the MI350 GPUs will be paired with its fifth-generation EPYC CPUs and Pollara NICs for rack-scale solutions.
The highest-performance rack solution will require direct liquid cooling and consist of 128 MI355X GPUs and 36 TB of HBM3e memory to perform 2.6 exaflops of FP4. A second liquid-cooled option will consist of 96 GPUs and 27 TB of HBM3e to perform 2 exaflops of FP4.
The air-cooled rack solution, on the other hand, will consist of 64 MI350X GPUs and 18 TB of HBM3e to perform 1.2 exaflops of FP4.
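Those rack totals roll up directly from the per-GPU figures cited earlier, as a quick calculation shows. (AMD’s memory totals work out evenly if 1 TB is treated as 1,024 GB.)

```python
# Roll-up of AMD's claimed rack configurations from per-GPU figures.
racks = [
    # (GPU count, peak FP4 petaflops per GPU, cooling)
    (128, 20.0, "liquid"),   # MI355X flagship rack
    (96,  20.0, "liquid"),   # smaller MI355X rack
    (64,  18.4, "air"),      # MI350X rack
]

for gpus, fp4_pflops, cooling in racks:
    hbm_tb = gpus * 288 / 1024          # 288 GB/GPU; 1 TB = 1,024 GB here
    fp4_ef = gpus * fp4_pflops / 1000   # petaflops -> exaflops
    print(f"{gpus} GPUs ({cooling}-cooled): {hbm_tb:.0f} TB HBM3e, "
          f"{fp4_ef:.2f} EF FP4")
# Prints 36 TB / 2.56 EF, 27 TB / 1.92 EF and 18 TB / 1.18 EF, which AMD
# rounds to 2.6, 2 and 1.2 exaflops respectively.
```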
AMD did not say during the Wednesday briefing how much power these rack solutions will require.