
Nvidia's Ian Buck: A100 GPU Will 'Future-Proof' Data Centers For AI

'By having one infrastructure that can be both used for training at scale as well as inference for scale out at the same time, it not only protects the investment, but it makes it future-proof as things move around,' says Buck, Nvidia's head of accelerated computing, in an interview with CRN.


What from an architectural level allowed Nvidia to create a GPU that is very good at both inference and training?

What Ampere represents is the combination of multiple generations of the Tensor Core on the training side, as well as our most recent work in the Turing architecture, which came out of studying and understanding inference. Today, the Turing T4 GPU is a fantastic inference GPU. The Tensor Core in Turing is the second-generation Tensor Core: we took the Volta Tensor Core and advanced it for Turing to add inference capabilities.

Inference is a different kind of workload than training. In training, you're taking a neural network and teaching it a problem by running massive amounts of data through it. Because of the nature of the training problem, you typically do it in 32-bit floating-point calculations; you need to keep the full range of the data. In some cases, we can optimize it to 16 bits, FP16. That's state of the art in training. For inference, which is the deployment of AI, you can go a step further, because you're only doing the forward pass of the AI calculation, and the state of the art there is doing it in eight-bit integer. In some cases, you can even go to four-bit integer arithmetic. That's eight zeros and ones versus 16 and 32.

I say that simply. To make all that work takes, of course, a massive amount of software and numerical statistical work, as well as understanding how to manage precision and performance to get things to be state of the art: both in accuracy and time to convergence for training, and in lowest possible latency for inference.
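As a rough illustration of the precision trade-off Buck describes, the sketch below quantizes 32-bit activations to eight-bit integers using simple symmetric linear quantization. This is a toy example, not Nvidia's implementation; real inference stacks use far more sophisticated calibration, but it shows the basic bargain of a 4x smaller format in exchange for a small, bounded rounding error.

```python
import numpy as np

# Toy sketch (assumed symmetric linear quantization, not Nvidia's scheme):
# values that fit comfortably in 32-bit float are mapped to 8-bit integers
# for inference, trading a small rounding error for 4x less storage.
rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.0, size=1000).astype(np.float32)

# Map the observed range onto the signed 8-bit range [-127, 127].
scale = np.abs(activations).max() / 127.0
quantized = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)

# Dequantize to compare against the original 32-bit values.
dequantized = quantized.astype(np.float32) * scale
max_error = np.abs(activations - dequantized).max()

print(f"storage: {activations.nbytes} bytes (FP32) vs {quantized.nbytes} bytes (INT8)")
print(f"worst-case rounding error: {max_error:.5f} (scale = {scale:.5f})")
```

Because the scale is set from the largest observed magnitude, no value is clipped here, and the worst-case error stays below half of one quantization step.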

With the third-generation Tensor Core, we combined the learnings from both Volta and Turing to build a great Tensor Core that is capable of doing both with amazing performance, and to advance things forward. In fact, we took it to the next level by inventing a new numerical format for doing the calculations, called Tensor Float 32, and then we pulled forward all the INT8 capabilities of Turing and made them even faster. So A100 can deliver over a petaop of INT8 inference performance.

The third thing that makes it great for inference as well as training is its support for sparsity. Not every connection in the neural network contributes to the final result, and it's often difficult to know which ones do and which ones don't. But as weights [parameters that determine the effectiveness of an input] get close to zero, we can zero them out and optimize the calculation. Those optimizations can be randomly scattered through the neural network, but they can happen quite frequently, so it's not easy to refactor the computation to just ignore whole parts of it. What the third-generation Tensor Core has is hardware support for sparsity: it can optimize away those weights that don't actually contribute to the result, without having to refactor the whole computation. It keeps the network structure the same. No one wants to create a new network. It works excellently in the inference use case, and it can also be leveraged to increase training performance.
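The hardware sparsity Buck describes is, per Nvidia's Ampere materials, fine-grained 2:4 structured sparsity: within every block of four weights, two are zeroed, so the tensor's shape is unchanged but half the multiplies can be skipped. The sketch below is a toy version of the pruning step only (the hardware acceleration itself obviously can't be shown in NumPy), zeroing the two smallest-magnitude weights in each block of four:

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Toy 2:4 structured pruning: zero the 2 smallest-magnitude
    entries in every block of 4, keeping the tensor shape intact."""
    blocks = weights.reshape(-1, 4)
    pruned = np.zeros_like(blocks)
    # argsort by magnitude, ascending; the last 2 indices per block
    # are the largest-magnitude weights, which we keep.
    keep = np.argsort(np.abs(blocks), axis=1)[:, 2:]
    np.put_along_axis(pruned, keep,
                      np.take_along_axis(blocks, keep, axis=1), axis=1)
    return pruned.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_sparse = prune_2_of_4(w)
# Exactly half the weights survive; the layer's shape is unchanged,
# which is why no network restructuring is needed.
print(int((w_sparse != 0).sum()), "of", w.size, "weights kept")
```

This mirrors the point in the answer above: the pruning pattern is regular enough for hardware to exploit, yet the network's structure stays the same, so no one has to design a new network.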

