Nvidia's Ian Buck: A100 GPU Will 'Future-Proof' Data Centers For AI

'By having one infrastructure that can be both used for training at scale as well as inference for scale out at the same time, it not only protects the investment, but it makes it future-proof as things move around,' says Buck, Nvidia's head of accelerated computing, in an interview with CRN.

'Game Changing For The Data Center'

Nvidia executive Ian Buck said the chipmaker's new A100 GPU will future-proof data centers for artificial intelligence workloads thanks to the processor's unified training and inference capabilities that will pave the way for large-scale infrastructure investments.

Buck, general manager and vice president of accelerated computing at Nvidia and the creator of the company's CUDA parallel computing platform, told CRN that the A100, revealed in mid-May, continues the company's efforts to democratize AI. Key to those efforts is Nvidia's claim that five DGX A100 systems, the new AI system equipped with eight A100s, can perform the same level of training and inference work as 50 DGX-1 and 600 CPU systems at a tenth of the cost.

"By having one infrastructure that can be both used for training at scale as well as inference for scale out at the same time, it not only protects the investment, but it makes it future-proof as things move around, as networks change — you can configure your data center in any way possible well after you've purchased and physically built it," he said in a recent interview.

This is made possible in part by the A100's new multi-instance GPU feature, allowing the A100 to be partitioned into as many as seven separate GPU instances that can perform work in parallel. Alternatively, eight of the GPUs can be linked with Nvidia's third-generation NVLink interconnect to act as one giant GPU in the DGX A100 or a server using Nvidia's HGX A100 board.

But key to the A100's ability to future-proof data centers is the way it unifies training and inference capabilities into one chip, combining what was previously available in two separate GPUs, the V100 for training and the T4 for inference. By combining these capabilities, Buck said, the A100 can significantly increase the flexibility and utilization of data centers by shifting workloads on the fly.

"Ampere's flexibility to be both an amazing training as well as inference GPU makes it really game changing for the data center," he said.

Nvidia's channel partners will be critical to driving sales and integrating systems for the A100, according to Buck. That work will be done through Nvidia's new DGX A100 and HGX-based servers coming from Atos, Dell Technologies, Gigabyte, Hewlett Packard Enterprise, Lenovo and Supermicro, among others.

"Only through our partners can they help design systems and configure and get them understood and consumed and deployed for the problems that our customers are trying to solve in the broader scheme of things," Buck said.

In his interview with CRN, Buck talked about how Nvidia was able to build a GPU that combines inference and training capabilities, why Kubernetes is important for the future of data centers, how he thinks the A100 will shake up the AI accelerator market and how the coronavirus pandemic is impacting demand for AI. What follows is an edited transcript.

What from an architectural level allowed Nvidia to create a GPU that is very good at both inference and training?

What Ampere represents is the combination of multiple generations of the Tensor Core on the training side as well as our most recent work in Turing architecture that we did for studying and understanding inference. Today, the Turing T4 GPU is a fantastic inference GPU. And the Tensor Core that's in Turing, the second-generation Tensor Core, we took the Volta Tensor Core and advanced it for Turing to add inference capabilities. What inference needs, it's a different kind of workload than training. Training, you're taking a neural network, and you're teaching it a problem by running massive amounts of data through. Because of the nature of the training problem, you typically do it in 32-bit floating point calculations. You need to keep the full range of the data. In some cases, we can optimize it to 16 bits, FP16. That's state of the art in training. For inference, which is the deployment of AI, you can go a step further, because you're only doing the forward pass of the AI calculation, and the state of the art there is doing it in eight-bit integer. In some cases, you can even go with four bits in integer arithmetic. That's eight bits, so eight zeros and ones versus 16 and 32. I say that simply. To make all that work is, of course, a massive amount of software and numerical statistical work as well as understanding how to manage precision as well as performance to get things to be state of the art, both in accuracy and time to convergence for training and lowest possible latency for inference.
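To make the precision trade-off Buck describes concrete, here is a minimal sketch of how 32-bit floating point weights can be mapped to eight-bit integers for inference. It assumes a simple symmetric, per-tensor scheme; the function names and sample weights are hypothetical, and it is not a description of how Nvidia's software stack actually performs quantization.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    # The scale converts the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the eight-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.41, -0.65]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in eight bits instead of 32, at the cost of a small rounding error. Managing that error is the precision work Buck refers to.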

With the third-generation Tensor Core, we combined the learnings from both Volta and Turing to bring them together to build a great Tensor Core that is capable of doing both with amazing performance and advance things forward. In fact, we took it to the next level by inventing a new numerical format for doing the calculations, called Tensor Float 32, and then we pulled forward all the INT8 capabilities of Turing and made them even faster. So A100 can deliver over a petaop of INT8 inference performance. The third thing that makes it great for inference as well as training is its support for sparsity. Not every connection in the neural network contributes to the final result, and it's often difficult to know which ones do and which ones don't. But as weights [a parameter that determines the effectiveness of an input] go close to zero, we can zero them out and optimize the calculation. Now those optimizations can be randomly scattered through the neural network, but they can happen quite frequently. So it's not easy to recalculate the computation to just ignore whole parts of it. What the third-generation Tensor Core has is hardware support for sparsity. It can optimize those weights, which actually don't contribute to the result, without having to refactor the whole computation. It keeps the network structure the same. No one wants to create a new network. It works excellently in the inference use case and can be leveraged to increase training performance.
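The idea of zeroing out weights that barely contribute can be illustrated with a short sketch. Buck does not spell out the pruning pattern in the interview; the example below assumes a 2-of-4 pattern, in which the two smallest-magnitude weights in every group of four are set to zero, and the function names and sample values are hypothetical.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each group of four (assumed pattern)."""
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of four")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def dot(a, b):
    """Plain dot product, standing in for one neuron's weighted sum."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical weights: several are already close to zero.
weights = [0.9, 0.01, -0.02, -0.7, 0.03, 0.5, -0.6, 0.02]
inputs = [1.0] * 8

sparse = prune_2_of_4(weights)
dense_out = dot(weights, inputs)
sparse_out = dot(sparse, inputs)
print(sparse, dense_out, sparse_out)
```

The list keeps its length and layout, which mirrors the point that the network structure stays the same. Half of the multiplications now involve a zero and can be skipped, while the output shifts only slightly because the removed weights were small.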

The ability to do inference and training: is that just going to appeal to customers who want to do both, or is it going to also appeal to customers who want one or the other?

You have to think about building these data centers. These are multi-million dollar, potentially hundreds of millions of dollar investments. You don't do that on a whim, on the fly. And when you build up an infrastructure, you'd like to make sure that its utility is maximized and well into the future. By having one infrastructure that can be both used for training at scale as well as inference for scale out at the same time, it not only protects the investment, but makes it future-proof as things move around and as networks change. You can configure your data center in any way possible well after you've purchased and physically built it. So Ampere's flexibility to be both an amazing training as well as inference GPU makes it really game changing for the data center, because, as you know, it changes every six months. There's always a new network; it's going to change the workload demands, but they can rely on Ampere to carry them into that future for both the training and inference capabilities.

If an organization needs to make an investment for training first, with the A100, they already have that inference capability when they get to the point of deployment.

Exactly. And they can move it around. It can be used for real-time production inference use cases where latency matters. Or as those needs and that infrastructure shift, they can shift that infrastructure to training capabilities, which is ideal and allows them to really lean in on investing and building up very large-scale infrastructure.

How long did Nvidia know that it needed to get to this point of having these inference and training capabilities within one GPU?

It's a good question. I have to think back now. Utilization is always top of mind. People always want to make sure that their infrastructure, as they deploy it, is well utilized and well capitalized. One of the great features of Ampere, one of the reasons why we created the multi-instance GPU capability, is that they can automatically configure one GPU for the exact amount of inference performance they need for a particular use case, so they may need only a single MIG instance, which is about the performance of a V100. But for some of these larger networks where they want to go to the next level of conversational AI, where the models, as you saw, are getting huge for natural language understanding, they can scale up and provide bigger GPUs, all the way up to the capabilities of a single A100, which is probably more than anyone needs right now for real-time inference. But I say that, given the fact that we just launched Ampere, I'm sure someone will make a model that is capable because now that capability exists. So utility is critical for data centers, and making sure that they can be flexible and protect the investment that they're making, and they know they can trust it and do that with Ampere. That was a really important message that we felt was compelling and also is important for data centers as we were seeing the rise of T4 and the rise of Volta at the same time, and managing those two capacities, as the different workloads created demand.

I think we've thought of AI accelerators as being good at training or inference but not both. What kind of impact do you expect the A100 to have on the market for AI accelerators as a whole?

I think we've raised the bar. You think about what Nvidia does: we're a full-stack company. Remember, we launched Volta three years ago, yet we've increased its performance many times over by understanding full stack, understanding networks and working with people that are inventing AI to make it run really well on our platforms. Now we have a whole new foundation to continue to expand upon.

And I'm sure the numbers that we're showing are amazing now. I can only wait to see what they look like in two or three years' time as we continue to advance that full stack. And people are now capable of building next-generation AI with the incredible power, 20X more power, that Ampere gives to them to do that next level of innovation. It's a wonderful synergy we have of being able to help the researchers, including our own Nvidia researchers, develop the next generation of AI capabilities. And by getting Ampere in their hands, they can move faster and make it smarter, larger, more intelligent and do new things. Ampere is yet another huge jump but yet one of many jumps to advance AI down the path. We're very excited to get it into the hands of all these AI developers and researchers and see what they can do with it.

Because of that change in the economics of how you can do inferencing and training with the A100, do you expect it to have somewhat of a democratization effect on which organizations are able to do AI?

Two questions there. One is the cost. I think with every generation, going back from Maxwell, Pascal to Volta to Turing to the original DGX to the DGX Volta to DGX A100, it dramatically decreases the costs of AI. And that comes through not just the hardware, but also the systems that we build, the software that we optimize and just working with everyone to make it run incredibly fast. I think ResNet-50 trains in literally a minute or something like that. It's quite silly that that's how fast it can be. So we dramatically bring down costs with every generation, and that will drive a transition. There's a clear reason why people should build their next data center with Ampere, with A100, as Jensen showed in the keynote: it's one-tenth the cost. That is obviously a reason why we have such a huge demand for [A100].

Second, on the democratization: As we bring down the cost, we extend the reach. We extend the reach already by making every GPU we ship CUDA capable and capable of doing AI. But as we bring more of that computing capability to our systems, and as we activate it through all of our cloud partners, what you should be able to rent for $1 an hour just got 20 times faster. Now the exact pricing, we'll have to wait till the cloud providers announce all that, but it's bringing some of that modern AI, and the benefit you see for the data center gets a mirror copy reflected in the cloud rental ability: what people can rent with just a credit card. So we're also bringing more compute capability, 20 times more compute capability, in instances that they can just rent, which obviously is important for a lot of the academic community and others who have a different economic model.

How is the pandemic changing that demand for AI? I know there's a lot of research that needs to be accelerated right now for very important purposes. For the overall landscape, what are you seeing?

It's funny. I was talking to someone about this earlier in the week, and some of the supercomputing centers, they're actually having a problem, because they're seeing spikes in demand. Apparently researchers aren't going to meetings anymore. Or they spend more time on other things. They're at home submitting jobs to supercomputers to do their work. From a data center perspective, you can see a difference, but it's actually been quite healthy in the sense that people are able to do their work and have been able to submit their jobs, wherever they are. They don't need to be in the data center to do their work. And that continues.

Certainly the interest and demand for AI continues to be important for home assistants, for teleconferencing, for social media. These are all areas where you and I interact every day and even more so now that we're home. And those same services hit the data center, hit the cloud, and, of course, need AI to process the data and make decisions, whether they be recommenders or content filtering and such things. Things have been fine from that standpoint with COVID. There are obviously other impacted businesses and industries. But overall, from a data center standpoint, we've been fine.

Does Nvidia anticipate how a post-COVID-19 "new normal" in society could impact demand for AI?

We'll have to see how that plays out. I'm not going to predict the future of the pandemic per se. The focus right now has been about helping find maybe not a cure but at least a treatment to help de-risk the impacts of COVID. Certainly we're seeing that, and you saw it with the White House COVID response, which Nvidia has joined: finding drug candidates that can help understand the COVID-19 virus better, understand the different mechanics of how the virus infects a cell and understand which drugs could be used or deployed to help mitigate the effects if not cure it. We simply need to create a treatment that can make COVID less fatal, and we can all feel a little bit better about going out of our homes.

To do that, we just need to understand the dynamics of how the virus works and how the literally billions of drug candidates, or substructures of drugs, could be used to interact with it. It takes a very long time, but Nvidia is no stranger to this kind of simulation. This is a molecular dynamics simulation, and one of the first tenets of our GPU computing strategy was actually to help the research community with molecular dynamics simulations in applications like Gromacs and Amber, which were largely pioneered by the academic community and have been GPU accelerated for 15 years now. We've been working with supercomputing sites around the world to help accelerate and apply the technology to further optimize some of the applications. And there's a whole bunch of work going on in that area. So right now, that's the focus. Certainly, you could see a future where we can learn from this and build a smarter, early warning system to understand and detect these kinds of epidemics earlier, perhaps deploy them in smart city scenarios.

What should Nvidia partners be focused on in terms of the added value that they provide on top of Nvidia-based systems?

Nvidia is a pure technology company. For a company that, on one hand, does amazing video game technology, and the other side is helping find the drug that's going to help mitigate and manage COVID, by definition, you're a technology company. What our partners do is help connect that technology in our products to that customer's problem, wherever they are.

We don't have that reach actually. Only through our partners can they help design systems and configure and get them understood and consumed and deployed for the problems that our customers are trying to solve in the broader scheme of things. So our partners are incredibly important for doing that, for connecting the technology that we build, not just the GPUs but the systems, all the software stacks that we have, all of our technology that we bring to market, even the applications that we host in our NGC Container Registry, connecting all those pieces and putting them together correctly for that customer for what they care about. And it's computing. So for how you translate computing into a particular customer's problem, we rely on our partners, who we train and work with and explain our technology to, many times over, so they can understand how to connect it with that customer's problem and make sure they value it and inevitably, obviously, purchase and deploy it, so we can help support them on whatever particular problem that customer has.

Without our partners, it'd be a totally different world. They connect it to our customers' problems, whether it be in higher education and research to supercomputing, which obviously has its own different challenges, to consumer cloud internet service company, a startup or a healthcare, pharmaceutical company, medical device company, a city that's trying to deploy a smart city [system]. Many things go into the actual deployment, configuration, purchasing, management [and] software stack to make that customer successful. We rely on our partners to connect the dots and to make sure that they're getting the best possible products from Nvidia to support whatever they're doing.

I wanted to ask about a phrase that CEO Jensen Huang used when the Mellanox acquisition closed. He was saying, "AI is now driving an architecture change from hyper-converged infrastructures to accelerated-disaggregated infrastructure." In your own words, what does that mean, and how applicable is that to the data center market as a whole?

I think about Kubernetes when he says that. What Kubernetes did was basically disaggregate the data center by running multiple microservices where they need to run inside of the data center, on the right parts of the data center infrastructure that's optimized for that particular microservice, and then chaining it all together with a Helm chart. The beautiful thing that Kubernetes did: It disaggregated the data center so you didn't have to run an entire service on a single node but can run it across an entire data center with storage and I/O-based nodes versus accelerated nodes and make it all work. Now to make that disaggregation work, you obviously have to connect multiple parts of the infrastructure with great networking, with great connectivity, so that you can achieve that disaggregation, that efficiency.

Can you bring it back to the A100? How does A100 figure into that new kind of infrastructure?

The good part of that is that you can rely on a singular A100 infrastructure to serve both the training portions as well as the inference portions of that disaggregation story, so you don't have to maintain or necessarily deploy T4 servers inside of your data centers as well as V100 training [servers]. You can have a singular A100 infrastructure and maximize that total utilization because you've combined those two workloads with a singular infrastructure, which is very exciting. This is not saying T4 doesn't continue to have value. Certainly, there are some use cases where you may only need one T4 server, or even outside of the data center I still expect T4 to be around, but A100's value in this regard for disaggregated data centers is excellent.

Does this mean that T4 won't have a successor?

We'll talk about new products at a later date. Right now, we're extremely excited about A100.

I want to ask about the EGX A100 because that looks like it's the first product that integrates Mellanox and Nvidia technologies on the same board. Can you explain what it is and what the market is for that?

It's targeted at the edge. Think about smart devices that need to deploy AI, whether they be smart speakers or telco use cases with 5G or robotics use cases where the latency is so low, we can actually potentially manage one robot through a connected cloud, where we need to do the AI deployment at the edge. Well, you need three things. First, you need an accelerator because modern AI, in order to deliver state-of-the-art models with the lowest possible latency, needs to run on a GPU. That's known. Second, we need great connectivity. With the packets and phrases coming in off the network, we're measured by turnaround time. We need to run the accelerator and get it back out to the network with great network connectivity. The third thing you need is security. This will be used for understanding when I'm talking to a smart speaker, so it's what I am talking to in my own home. It may be used for healthcare use cases containing my medical information or a patient's medical information. So it needs to have those three things: great acceleration, great connectivity and great security.

We combine all that into the EGX A100. It has an A100 GPU. It has the latest and greatest Mellanox NICs, the ConnectX-6 Lx. And through that we can actually do a trusted security root of trust on that device, so that the networks that you're running are trusted and secure [while] the data and connection links are fast and [have] hardware-accelerated security, so that they continue to deliver amazing performance with full security enabled. That all comes together in the EGX A100, and we've been working on this one with our partners for a while now, and [in Nvidia's recent presentation, we showed] all the different partners we've worked with across the different industries where they have these needs. It will support all of our major software stacks from smart city to healthcare to telco to conversational AI to robotics. And then, of course, how it gets connected up and how it gets bought and deployed through the OEM system builders or through the cloud.

CEO Jensen Huang has said that there will be demand for AI throughout the entire data center market. When is that going to happen?

To understand that perspective, you should just think about what AI is capable of doing. At the highest level, AI can write software from data. It can look at a set of data and come up with a model that can predict an outcome, which is actually writing software. If you apply that template to the things data centers do for understanding and managing data, it can be applied in almost any use case where you need to understand data. Even cases where you want to optimize what data you're going to look at.

We see that proliferation happening, certainly as users interact with the cloud: that is user data that needs to be understood. We could use it to transcribe this call. When we upload video or pictures, it can understand that content, who to share it with, who not to share it with, or whether it should be shared at all, and understand its security and whether or not it's actually you. These modalities, these use cases are not unique to YouTube or Twitter. They're what most data centers actually do with their own use cases in their own domain. The challenge is we just have to get this technology in the hands of people that can apply it to their particular domain, and that's been a big mission for Nvidia.

AI was first deployed at scale by the hyperscalers because they have the talent and capacity to figure it out for the first time. A lot of these use cases are well enough understood that they can be deployed and add value to the rest of the market, whether it be the financial world, the retail world or the telecommunications world, etc. That's one of the reasons why you see us inventing SDKs like Jarvis for conversational AI. Jarvis is a complete conversational AI SDK. It comes with a whole bunch of pre-trained models: for speech recognition; for language understanding, so after you've understood what was said, you can understand what it meant; for chatbot technology, so you can actually come up with an answer; and for text-to-speech, so a computer can talk back to you in a human-indistinguishable voice but generated by computer.

We actually make all those models available on NGC. What we have, that's not enough, because we're not going to know anything about financial trading information, so we provide you with retraining kits: how to retrain a BERT [model], so you can understand financial documents like a 10-K or earnings report. You can apply it to whatever output you want that makes sense for your use case. And then of course, we also make a software stack that's pre-configured and optimized to deploy on GPUs as well, so you can very quickly, with one line of code, take your trained model and deploy it across a Kubernetes cluster. So that's us democratizing AI. We create these SDKs like Jarvis for conversational AI, like Merlin for recommender systems, and we open source them, we make them free on NGC, and we give you all the tools so that developers can apply them to their domain-specific use cases.