Nvidia Says New Software Will Double LLM Inference Speed On H100 GPU

The AI chip giant says the open-source software library, TensorRT-LLM, will double the H100’s performance for running inference on leading large language models when it comes out next month. Nvidia plans to integrate the software, which is available in early access, into its Nvidia NeMo framework as part of the Nvidia AI Enterprise software suite.


Nvidia said it plans to release new open-source software that will significantly speed up live applications running on large language models powered by its GPUs, including the flagship H100 accelerator.

The Santa Clara, Calif.-based AI chip giant said on Friday that the software library, TensorRT-LLM, will double the H100’s performance for running inference on leading large language models (LLMs) when it comes out next month. Nvidia plans to integrate the software, which is available in early access, into its Nvidia NeMo LLM framework as part of the Nvidia AI Enterprise software suite.

[Related: Nvidia CEO Explains How AI Chips Could Save Future Data Centers Lots Of Money]


The chip designer announced TensorRT-LLM as Nvidia seeks to maintain its dominance of the fast-growing AI computing market, which helped the company double its revenue year over year in its most recent financial quarter.

“We’ve doubled the performance by using the latest techniques, the latest schedulers and incorporating the latest optimizations and kernels,” said Ian Buck, vice president of hyperscale and high-performance computing at Nvidia, in a briefing with journalists. “Those techniques improve performance, not just by increasing efficiency but also optimizing the algorithm end-to-end.”

Nvidia teased TensorRT-LLM last month as part of the recently announced VMware Private AI Foundation platform, which will let VMware customers use their proprietary data to build custom LLMs and run generative AI apps using Nvidia AI Enterprise on VMware Cloud Foundation.

Buck said TensorRT-LLM will support several Nvidia GPUs beyond the H100, including its previous flagship data center accelerator, the A100, as well as the L4, L40, L40S and the forthcoming Grace Hopper Superchip, which combines an H100 GPU with its 72-core Grace CPU.

As CRN reported earlier this week, high demand for the H100 and A100 driven by generative AI development has led to long lead times for many companies, which prompted Lenovo’s top server sales executive to tell partners to consider alternatives if they don’t need to run the largest of the LLMs.

“We have a wide suite of products available for our customers to dial in and build the right infrastructure for all the different modalities where they are on their AI journey,” Buck said.

How TensorRT-LLM Speeds Up Nvidia GPUs

Nvidia said it worked closely with several major AI ecosystem players—including Facebook parent company Meta and MosaicML, the generative AI platform vendor recently acquired by Databricks—on the LLM inference optimizations that went into the open-source TensorRT-LLM.

“Everyone can get the benefit of getting the best possible performance out of Hopper and, of course, other data center GPUs for large language model inference,” Buck said.

TensorRT-LLM optimizes LLM inference performance on Nvidia GPUs in four ways, according to Buck.

The first is through the inclusion of ready-to-run, state-of-the-art, inference-optimized versions of the latest LLMs such as GPT-3, Llama, Falcon 180B and BLOOM. The software also includes the latest open-source AI kernels that introduce cutting-edge techniques for running LLMs.

“As people develop new large language models, these kernels can be reused to continue to optimize and improve performance and build new models. Of course, as the community implements new techniques, we’ll continue to place them or they’ll place them into this open source repository,” Buck said.

The second element of TensorRT-LLM is a software library that allows inference versions of LLMs to automatically run in parallel across multiple GPUs and multiple GPU servers, connected through Nvidia’s NVLink and InfiniBand interconnects, respectively.

“In the past developers had to, to get the best possible performance, take a large language model and manually split it up across multiple GPUs in a server or across multiple servers and manage that explicitly. No more,” Buck said.

“TensorRT-LLM encapsulates all that technology, all that learning, all that experience from Nvidia engineering and working with the community into a single library, so we can automatically scale large language models across multiple GPUs and multiple servers,” he added.

Buck said multi-GPU, multi-node computation is necessary for the largest LLMs because models with 175 billion parameters or more are too big to fit on a single GPU, even the H100.
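Buck’s point about model size can be checked with simple arithmetic. The sketch below (assuming FP16 weights, 80 GB of memory per H100, and ignoring activation and KV-cache memory, which make the real requirement larger) shows why a 175-billion-parameter model must be sharded across several GPUs:

```python
import math

# Assumptions: 2 bytes per weight (FP16) and 80 GB per H100.
PARAMS = 175e9
BYTES_PER_PARAM = 2
H100_MEMORY_GB = 80

# Weights alone: 175e9 params * 2 bytes = 350 GB.
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# Minimum GPUs just to hold the weights (real deployments need more,
# for activations, KV cache and communication buffers).
gpus_needed = math.ceil(weights_gb / H100_MEMORY_GB)

print(f"{weights_gb:.0f} GB of weights -> at least {gpus_needed} H100s")
```

This is why the library’s automatic tensor- and pipeline-parallel sharding matters: the weights of a 175B model cannot physically reside in one GPU’s memory.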

The third element that improves LLM inference performance is what Nvidia calls in-flight batching, a new scheduler that “allows work to enter the GPU and exit the GPU independent of other tasks,” Buck said.

The Nvidia executive said this feature is important for LLMs because there is a high amount of variability in the length and complexity of prompts from end-users, which can range from a simple question to a request to produce an entire document.

Without in-flight batching, a GPU system processes one batch of work at a time, so requests that finish early must wait for the longest-running request in the batch to complete, which slows down batch processing and lowers GPU utilization.

“With TensorRT-LLM and in-flight batching, work can enter and leave the batch independently and asynchronously to keep the GPU 100 percent occupied. This all happens automatically inside the TensorRT-LLM runtime system, and that dramatically improves H100 efficiency,” Buck said.
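A toy Python model of the two scheduling strategies illustrates the difference. This is not Nvidia’s implementation; the request lengths, batch size and one-token-per-step loop are invented for illustration. Static batching pays for the longest request in every batch, while in-flight batching backfills freed slots immediately:

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Each batch occupies the GPU until its LONGEST request finishes."""
    steps, queue = 0, deque(lengths)
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(batch)  # short requests wait for the longest one
    return steps

def inflight_batching(lengths, batch_size):
    """Finished requests exit and queued requests enter on every step."""
    steps, queue, active = 0, deque(lengths), []
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())  # backfill freed slots
        steps += 1                          # one generation step for the batch
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

# Mixed short and long requests, as in real chat traffic.
lengths = [1, 8, 2, 8, 1, 8, 2, 8]
print(static_batching(lengths, 4), inflight_batching(lengths, 4))
```

With this invented workload the in-flight scheduler finishes the same requests in fewer GPU steps, which is the utilization gain Buck describes.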

The last critical aspect of TensorRT-LLM is that it’s optimized to take advantage of the H100’s Transformer Engine, which automatically converts LLMs that have been trained in a 16-bit floating point format to an 8-bit floating point (FP8) format that takes up less space in the GPU memory.
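The memory saving from an 8-bit format is easy to see in a sketch. The snippet below uses a simple per-tensor scaled int8 representation as a stand-in, because NumPy has no FP8 type; Hopper’s Transformer Engine actually uses FP8 with dynamic scaling, so treat this only as an analogue of the halved footprint:

```python
import numpy as np

# A 4096 x 4096 weight matrix, roughly one linear layer of a large model.
w = np.random.randn(4096, 4096).astype(np.float16)

# Per-tensor symmetric scaling: map the observed range onto 8 bits.
# (Simplification -- the Transformer Engine uses FP8, not plain int8.)
scale = np.abs(w).max() / 127.0
w8 = np.round(w / scale).astype(np.int8)   # 1 byte per weight instead of 2
w_restored = w8.astype(np.float16) * scale # approximate reconstruction

print(f"FP16: {w.nbytes / 1e6:.2f} MB -> 8-bit: {w8.nbytes / 1e6:.2f} MB")
```

Halving every weight tensor both doubles how much model fits per GPU and reduces the memory bandwidth consumed per generated token.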

How H100 With TensorRT-LLM Performs

In two charts shared by Nvidia, the company demonstrated that the TensorRT-LLM optimizations allow the H100 to provide significantly higher performance for popular LLMs.

For the GPT-J 6B LLM, Nvidia showed that an H100 enabled with TensorRT-LLM can perform inference two times faster than a regular H100 and eight times faster than the previous-generation A100.

For Meta’s Llama 2 LLM, the company showed the optimized H100 running nearly 77 percent faster than the vanilla H100 and 4.6 times faster than the A100.

Buck said the performance gains translate into improved power efficiency, with the H100 using the same power to complete twice as many tasks as before thanks to TensorRT-LLM.

“Energy efficiency is an end-to-end optimization. It comes from hardware. It comes from scheduling. It comes from algorithms. And it comes from, of course, new models. And so for the end-to-end solution stack, TensorRT is an incredibly important part of that story,” he said.