How Penguin Computing Is Fighting COVID-19 With Hybrid HPC

'We have several researchers that have joined in and are utilizing that environment, and at the moment, we're doing that at no cost for COVID-19 research,' Penguin Computing President Sid Mair says of the system integrator's cloud HPC service, which complements its on-premise offerings.

Supporting Research On-Premise And In The Cloud

Penguin Computing President Sid Mair said the company is using its high-performance computing prowess on-premise and in the cloud to help researchers tackle the novel coronavirus.

The Fremont, Calif.-based company this week announced it is working with AMD to upgrade the U.S. Department of Energy's Corona supercomputer — the name is a coincidence — with the chipmaker's Radeon Instinct MI50 GPUs to accelerate coronavirus research. But that's not the only way the system integrator is looking to help researchers study and understand the virus.

[Related: How PC Builder Maingear Pivoted To Building Ventilators In A Month]

In an interview with CRN, Mair said the company is in multiple discussions about additional opportunities for researchers to use Penguin Computing's HPC capabilities to study the virus and COVID-19, the disease it causes. But the company is also using its own internal capabilities, an HPC cloud service called Penguin Computing On Demand, to deploy compute resources when researchers don't have the time or money to stand up new on-premise HPC clusters for research.

"We have several researchers that have joined in and are utilizing that environment, and at the moment, we're doing that at no cost for COVID-19 research, even though it is a production commercial system that we currently sell high-performance computing compute cycles on today," he said.

HPC is seen as a critical tool in accelerating the discovery of drugs and vaccines for COVID-19, as demonstrated by the recent formation of the White House-led COVID-19 High Performance Computing Consortium, which counts chipmakers AMD and Nvidia as well as OEMs and cloud service providers like Hewlett Packard Enterprise and Microsoft as members. The effort is also receiving support from Folding@Home, a distributed computing application that lets anyone with a PC or server contribute.

This strategy of utilizing both on-premise and cloud servers to deliver HPC capabilities is referred to by some experts as "hybrid HPC," which Mair said allows researchers to offload compute jobs into the cloud when there aren't enough resources to deploy new on-premise servers.

"They can't upgrade quick enough in order to continue to do their research, so being able to walk in and move their workflow over into an HPC environment that works and acts and implements just like they would do it on-premise but they're doing it in the cloud is becoming very, very beneficial to our researchers," he said.

William Wu, vice president of marketing and product management at Penguin Computing, said Penguin Computing is also planning to expand its offerings for researchers doing anything related to COVID-19, which could include running simulations to understand the impact of easing stay-at-home restrictions.

"We do intend to roll out something much more broader to allow anybody that is doing anything related to COVID, either directly or indirectly, to take advantage of what we're offering," he said.

In Mair's interview with CRN, he discussed how Penguin Computing's new GPU upgrade deal with AMD for the Corona supercomputer came together, how the company protects its employees during server upgrades, why GPUs are important for accelerating COVID-19 research and whether the pandemic is shifting the demand between on-premise and cloud HPC solutions.

What follows is an edited transcript.

How did Penguin Computing get involved in the Corona GPU upgrade opportunity?

Penguin has been involved from the start. Corona was procured a little over a year and a half ago or so to do research and provide an open research platform for outside researchers from the [U.S. Department of Energy] to be able to get work done on a large-scale, GPU-accelerated infrastructure. [Lawrence] Livermore [National Laboratory] procured that through our CTS-1 contract that we have from the [DOE's National Nuclear Security Administration] for what they call commodity compute capabilities for high-performance computing. So that's all the production environments that sit in the tri-labs — [this] is what the NNSA CTS-1 contract does.

The original project was a joint project, where AMD, ourselves, Mellanox (the interconnect vendor) and Livermore put together this compute system, Corona. Each of us donated a specific piece of the implementation, so it wasn't just, for instance, a pure procurement on the part of the DOE. There was an agreement among all of us to provide resources and capabilities, basically a donation in kind, in building the entire environment. And AMD was a significant part of that at the time.

And so when COVID-19 came around, there was a pretty significant spike in the ability for [DOE] to use that. A lot of the biological or biometric codes, as well as artificial intelligence codes, utilize GPUs more than your traditional physics codes do in a compute environment. So Corona was a really good fit, because we had a very good combination of GPU-accelerated capabilities along with general high-performance computing based on standard compute architectures. This was done [based on a] need. They wanted to provide this resource to the research community.

And so AMD, through open discussions between the DOE, Lawrence Livermore and ourselves, said, "Look, we would like to go ahead and provide these GPUs." At the time when we did the original Corona, half the machine was just pure CPU-based compute, and half the machine was GPU-accelerated, with the intent of upgrading the rest of the machine for GPU acceleration at a later time when budget was available. So this ended up being a donation rather than having to worry about a budget. I really think AMD stepped up and had the means to be able to provide those GPUs, so it's a great combination.

Was it AMD that reached out to Penguin about this or Livermore?

I can't necessarily answer that 100 percent because they reached out to my team that covers the account. It was almost simultaneous: Livermore reaching out and then AMD reaching out an hour later. It was a very, very close time period.

We are very commonly on joint calls with many of our suppliers, the major suppliers of those types of compute resources. We are always on three- or four-way Zoom-type calls where we're discussing with the DOE how we can improve or provide specific computing capabilities for their needs. And very, very typically, one of the CPU or GPU suppliers in the industry is a part of that — or sometimes other accelerator suppliers or interconnect suppliers, so it varies pretty often. But it's very common for us to have those kinds of conversations.

What makes these GPUs important for research purposes?

Two reasons. One: In biomedical research, they've been working with GPU-accelerated codes for a number of years, so it's a fairly well-established usage of compute resources. And what GPUs allow you to do is move highly parallelized capabilities into a computing environment — a GPU basically has many, many cores built into it in order to run compute. And so there are certain codes that work very well with GPUs, and then there are other codes that work very well with more of a CPU base. And so these GPUs help accelerate those types of codes that match a lot of the bio research that's going on.
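As a rough illustration of the data-parallel pattern Mair describes, where the same arithmetic runs independently across many elements and so maps onto a GPU's many cores, here is a minimal sketch in Python. The pairwise-distance example and the CuPy fallback are illustrative assumptions, not Penguin Computing's or Livermore's code.

import numpy as np

try:
    import cupy as cp  # GPU-backed, NumPy-compatible arrays, if a GPU stack is present
    xp = cp
except ImportError:
    xp = np            # fall back to the CPU so the sketch still runs anywhere

# Pairwise atom distances, a common inner loop in molecular-simulation-style codes.
n_atoms = 4096
coords = xp.random.rand(n_atoms, 3).astype(xp.float32)

# Broadcasting evaluates all n_atoms x n_atoms distances in one data-parallel pass;
# on a GPU node the same expression is spread across thousands of GPU cores.
diff = coords[:, None, :] - coords[None, :, :]
distances = xp.sqrt((diff ** 2).sum(axis=-1))

print(float(distances.mean()))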

The second piece is GPUs are very, very good at deep learning situations for artificial intelligence. What AI typically does is help you narrow down your choices on what you want to look at in a higher-intensity compute environment for these types of codes. You may look at patterns going on, and when you narrow down to a few of those patterns, then you can actually do the longer, compute-intensive activity on that narrower set of things that you're trying to look at, to come to an answer quicker.

The second thing AI gives you the ability to do is look at certain types of patterns or movement. So if you're going to look at, for instance, the dispersion of a contagion across a population, AI is really good at looking at those kinds of patterns and determining where your next hotspot could be, or where you could possibly do things like open up the economy due to changes in what's been happening within the spread of the virus. AI can also be used to narrow down treatments and/or vaccine candidates by narrowing down the number of proteins you may want to look at for a vaccine. So there's a variety of actual activities that go on with these kinds of compute environments.
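The "narrow with AI first, then spend the heavy compute on the shortlist" workflow Mair outlines can be sketched in a few lines of Python. Everything here is hypothetical: the candidate names, the cheap scoring function and the expensive simulation stub only stand in for the shape of such a pipeline.

import numpy as np

rng = np.random.default_rng(0)

def cheap_ai_score(candidate):
    # Stand-in for a fast trained model (for example, a binding-affinity predictor).
    return rng.random()

def expensive_simulation(candidate):
    # Stand-in for the long, compute-intensive HPC job you only want to run
    # on the most promising candidates.
    return f"simulated {candidate}"

candidates = [f"compound_{i}" for i in range(10_000)]

# Rank everything with the cheap model, then keep only the top of the list.
ranked = sorted(candidates, key=cheap_ai_score, reverse=True)
shortlist = ranked[:20]

# Spend the scarce GPU/CPU hours only on the shortlist.
results = [expensive_simulation(c) for c in shortlist]
print(len(results), "expensive runs instead of", len(candidates))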

What does Penguin need to do to upgrade the system with these GPUs?

In this case, there isn't a lot of extra integration work. Almost all the integration was done up front, and the machines have already been installed and in place for quite a period of time. So this is just upgrading that capability within each individual node.

Now, obviously, if you think about it, you've got upwards of a hundred nodes that you have to go upgrade, and you have to put four GPUs into each one of them and go through that entire upgrade process. That process typically involves de-installation, testing and putting things back together. It's not horribly complex compared to actually building the original computing environment.

Given the need to have social distancing and things like that, how does your company go about upgrading and setting up the system with the new GPUs?

I will tell you about the safety precautions we take in a general installation process and not talk about the specific installation process at Livermore.

So in general, what we do is we practice all of the social distancing guidelines for every professional services and installation services employee that participates — whether it's within our own installations of the cloud environments we provide, called Penguin On Demand, or the on-premise [systems] like Corona and others throughout the United States and the world. They all have their own sets of [personal protective equipment], and we ensure that they have plenty of stock for the items that need to be replaced on a periodic basis. And they're required to wear the PPE and maintain a six-foot social distance.

Now, does that mean that two of our employees cannot socially distance in order to pull a node out that takes two people? Obviously, you can't keep a six-foot distance when you've got two people pulling out a node. However, they are wearing PPE, including face masks, sometimes face shields and obviously gloves. And then, depending on the environment, typically some form of a smock, which was already in place because of static electricity and things like that.

Because of all the research that's happening with regards to COVID-19, are you seeing any other need for either new systems or upgrades for other organizations?

I think there will be. There is some today. And we're having some discussions that I can't really disclose about additional COVID-19 research opportunities, and I'm hoping that there'll be a few more press announcements in a few weeks to talk about some of those.

One of the things we're doing today is, we've already offered up our Penguin On Demand environment. We have a number of systems totaling about 25,000 cores available through our cloud-based HPC environment, so we've offered that environment up for COVID-19 research if people want access to it. So we have several researchers that have joined in and are utilizing that environment, and at the moment, we're doing that at no cost for COVID-19 research, even though it is a production commercial system that we currently sell high-performance computing compute cycles on today.

In relation to that, how do you balance the needs of the business for Penguin with the needs of these organizations to have affordable or even free hardware to do this kind of research?

That's a really good question, because obviously a lot of it has to do with economics. Especially as a public company, we need to provide value to our shareholders. One of the ways we do this with Penguin On Demand is, we have certain customers, for instance weather customers, that are computing worldwide weather three, four, five times a day. Those are scheduled computing events, and the systems those customers need in order to run their own business are set aside so they're always available.

So if COVID-19 researchers are coming in to do work, they will be able to submit it to a queue, but that queue may not necessarily run that instant. It may run an hour later, two hours later or overnight the next night. It just depends on when the cycles are open and available, and we have the tools in our environment to automatically queue those up and determine when [to do it] based on the size of the job they have to do and the number of compute cycles they need and everything else. We can basically predict that and determine when best to run.
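A minimal sketch in Python of the queueing behavior Mair describes: reserved, scheduled customer jobs keep their slots, and donated COVID-19 research jobs are fitted into whatever cycles remain or deferred until cycles open up. The job names, core counts and priority scheme are assumptions for illustration, not Penguin On Demand's actual scheduler.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                       # 0 = reserved customer run, 1 = donated research
    submit_time: int                    # earlier submissions win ties
    name: str = field(compare=False)
    cores: int = field(compare=False)

def schedule(jobs, free_cores):
    """Start jobs in priority order if they fit; defer the rest until cycles open up."""
    heap = list(jobs)
    heapq.heapify(heap)
    started, deferred = [], []
    while heap:
        job = heapq.heappop(heap)
        if job.cores <= free_cores:
            free_cores -= job.cores
            started.append(job.name)
        else:
            deferred.append(job.name)   # e.g. runs an hour later or overnight
    return started, deferred

jobs = [
    Job(0, 1, name="weather_run_06z", cores=12000),     # scheduled commercial workload
    Job(1, 2, name="covid_docking_sweep", cores=8000),  # donated research workload
    Job(1, 3, name="covid_epi_model", cores=9000),
]
print(schedule(jobs, free_cores=25000))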

Do you see the pandemic impacting the balance in demand between cloud-based computing versus on-premise computing for HPC?

I think the jury's still out on that a little bit. When you build an on-premise HPC environment, in general, once you have the on-premise environment put together, you have a limited number of resources that are managing that on a daily basis. So you can practice social distancing and everything else in general for that environment itself. However, there's also a huge economic benefit for the cloud and for those researchers that want access to things and don't have the resources to do it. For instance, if you're doing cloud-based HPC, you're eliminating all that — the management side of it, the infrastructure, the cost of the power — all of those things get absorbed into the service that we provide. And then you're only paying for the compute cycles that you need in order to get your job done.

I believe we're the only high-speed, large, scale-out HPC capability in the marketplace in the cloud. There are many that do virtualized things, but to provide one high-speed interconnected HPC environment that looks and feels just like someone would use on their on-premise system, I believe we really are about the only company that does that.

What's happening today for the government systems, research systems, university systems, higher education systems and medical research systems, [the research is] eating up all their on-premise cycles very quickly. They can't upgrade quick enough in order to continue to do their research. So being able to walk in and move their workflow over into an HPC environment that works and acts and implements just like they would do it on-premise but they're doing it in the cloud is becoming very, very beneficial to our researchers. Because they can almost instantly move their codes over to our environment and not take three, four, five, six weeks to try and port it to a system that uses virtualization rather than a true high-performance computing software stack or build environment.

How typical is it for the DOE or another government agency to be concerned about accelerating the computing they are already doing?

High-performance computing is an essential capability that is just embedded in everything today, from the financial world to the biomedical world to anything for oil and gas, seismic [activity], weather, you name it. And, of course, on the government side, national security, etc. The exascale program, sponsored through the DOE primarily, is meant to maintain a competitive edge for the United States in many aspects of science and technology. I think they're always looking for ways in which they can continue to grow in their computing capabilities.

Now, computing capabilities are a lot different today than they were even 10 years ago, when most of your computing was concentrated on physics-based computing. Today, it's heavily concentrated in three major areas. The first is artificial intelligence, which is the big buzzword nowadays, and that's machine learning, whether it's more of a training environment or inference computing. The second is high-performance computing, which is primarily physics-based, but not always today. And the last one is data analytics and how you understand data. In reality, it's very gray between all three of those environments today. All three require the ability to create high-performance computing capabilities, whether it is floating-point capabilities or other capabilities to be able to move or manipulate data.