How Penguin Computing Helped Design Meta’s New AI Supercomputer

Thierry Pellegrino, the executive who oversees Penguin Computing at parent company Smart Global Holdings, talks to CRN about how the system builder designed and integrated the new AI supercomputer for Facebook parent company Meta.

What It Takes to Build The World’s ‘Fastest AI Supercomputer’

Any company with the same level of financial resources as Facebook parent company Meta could buy the same parts and systems Meta is using for its new AI supercomputer.

But connecting hundreds — and eventually thousands — of Nvidia’s DGX A100 systems, along with storage systems, takes the right kind of system integrator to design and construct a high-performance computing cluster that meets the company’s requirements.

[Related: Lisa Su On How AMD Is Building ‘Workload-Optimized’ CPUs]

In the case of Meta’s new AI Research SuperCluster, the Menlo Park, Calif.-based tech giant turned to Penguin Computing to design and integrate the new AI supercomputer, which Meta said will become the world’s fastest later this year when the full cluster of 16,000 Nvidia A100 GPUs goes online.

In an interview with CRN, Thierry Pellegrino, the executive who oversees Fremont, Calif.-based Penguin Computing at parent company Smart Global Holdings, said the new AI supercomputer was the result of an “embedded collaboration” between Meta and Penguin Computing that began more than four years ago with a previous-generation AI supercomputer that used Nvidia’s older V100 GPUs.

“When you operate such a cluster, you have to understand the requirements around security, around performance, around availability and around all the unique aspects of a company like Meta that need to be taken into consideration. That’s been our value to Meta,” said Pellegrino, whose title is president and senior vice president of Smart Global Holdings’ Intelligent Platform Solutions.

In his interview with CRN, Pellegrino, who previously ran Dell EMC’s HPC business, talked about how Penguin Computing designed and integrated Meta’s AI Research SuperCluster, which includes storage systems made by Pure Storage and Penguin Computing itself. He also discussed how the company overcame storage and networking bottlenecks and whether the new AI supercomputer is a sign of things to come in the larger enterprise world. The transcript was lightly edited for clarity.

Penguin Computing is Meta’s architecture and managed services partner. Can you talk about what kind of work that entails with the new AI Research SuperCluster?

Let me walk you down memory lane just a little bit, because I’d like to give context. We’ve been a provider to Meta for many years. Actually, some of the early workstations or servers that Mark Zuckerberg used [for Facebook] were Penguin Computing-branded, and we’ve got some pictures that show it, so that is how it all started.

But it really started with their [first-generation cluster] back in 2017. And it’s mentioned in their public announcements: Out of a request for information of over a dozen responders, we were selected as the key and unique partner for the architecture, design and deployment of their [first] AI research cluster. And the reason why we got selected was really because we had a very different approach to solving for the ask. A lot of other manufacturers will think about what you need as building blocks without getting into the details of the application. We spent more time — like we knew through our HPC background, 20 years of it — understanding the workload and the bottlenecks that they had experienced. And from there, we made a pretty innovative proposal to them that piqued their interest. So we got the nod back in 2017. And that started the partnership.

In the last four years, we built up more than just a seller-buyer relationship. It was a very embedded collaboration on the [first] cluster, where we looked at the expansion from day one. We built in a lot of automation together as a joint team, a lot of coding for the deployment, utilization and management of the cluster — and all this with a keen eye to uptime, performance and [return on investment] for Meta. Because this type of resource is a sizeable investment, and if you don’t carefully design the architecture, you can over-design in areas where you don’t bring any value to your end users. And [the first cluster] was really for both companies a good success story. I wanted to tell you the history, because it didn’t happen yesterday. It was four years in the making.

But then come to today and the announcement, and let me explain to you a little bit more what we do: We became very aware of their need based on their workload. And for [the AI Research SuperCluster], we designed the whole cluster with Meta, which included several components, and Penguin Computing is really integrating components from the industry. We integrate a lot of components that already exist from partners like Nvidia or Pure Storage, but also others in the industry.

In the case of Meta, those were the two that were selected, plus some of our own elements for the cache storage. But that was because it seemed to be the right component for Meta. So we designed the whole solution for them. We made sure that we performance-tested against their requirements, and that’s an important part: The requirements were understood through the four years at [the first cluster], and we were able to get through this performance assessment fairly rapidly. Then once we had that, we had to assemble all the components, build them in our factories, deliver them, deploy them [and] just the very mundane aspect of cabling, making sure all the networking is done right, the setup is right.

But really what’s unique about this deployment is the scale: deploying that many nodes and that big of a storage footprint. I would say a lot of companies in the industry can do it at a smaller scale, even at a reasonable scale. My experience over the past few years has demonstrated that there are a lot of players in that space. But getting to that scale, integrated with a team that has really important requirements, and delivering it in the short period of time that we did, working hand in hand with the operations team, is not a simple task.

When you operate such a cluster, you have to understand the requirements around security, around performance, around availability and around all the unique aspects of a company like Meta that need to be taken into consideration. That’s been our value to Meta. And as we look at how this is evolving over time, as the announcement described, we’re looking at the next phase of this Research SuperCluster, which will be even more scaled compared to what we already deployed for this phase one.

Did Meta know early on that it wanted to go with Nvidia’s DGX system? Or did Meta reach that conclusion after evaluating all of its options?

I think in any discussions that we have with Meta as a customer, it’s really a collaboration. There’s a lot of knowledge that we share between both teams. And of course, it’s not just a hardware discussion. It’s also a software discussion. It’s an application support discussion. So there are a lot of factors that come into play, and that was the case when we decided on DGX as the node of choice for this deployment.

What was it about the DGX that fulfilled their needs?

I mean, the DGX is a good integrated platform. It has the density that you would want for the type of accelerator [the A100 GPU] that we had settled on. And it’s a fully integrated box. So you don’t necessarily just look at the hardware piece. You’ve got validation across a lot of different workloads, and that comes into play. A lot of times, I think on a deployment like this, people focus on the hardware that makes up the system, but for this type of at-scale deployment for AI activities, you’ve got to go look beyond just the hardware, the gear that’s underlying. So in that sense, the DGX is a really well-integrated element for a node like this, and that was the right choice for this deployment.

Looking at what their performance requirements were, how did you figure out how many DGX systems and what kind of storage systems they needed?

It’s several steps. The first step is determining the right elements in the architecture. And we do a lot of compare and contrast. We circle through a lot of different vendor options. We spend a lot of time also understanding what options each vendor provides, the knobs to turn. And we usually narrow it down fairly rapidly to a handful that could be contenders for what we’re trying to build.

Then we quickly get into assembling a subset of the system. And we did that with [the first cluster], and we did that again with RSC. And then, because we have the knowledge of the workload that’s going to be run, we run either a synthetic version of that workload or the workload itself in order to get a baseline performance assessment. And then having been in the HPC space for 20 years and leveraging that into HPC and AI now, we have a good understanding of how things scale.

If you’re driving a racecar, if you’ve driven a lot of Grand Prix [races] in Formula One, for example, you have a lot of data that gets you smarter as to what you can expect out of a real-time operation of a cluster in a large-scale environment. So we look at the scaling factor. And I won’t hide it from you: We eventually ran the full benchmark on the full cluster, and the results were even better than we had anticipated. So it takes a lot of knowledge and experience to get to that scale. And there are a lot of really smart people who have done that for over 20 years at Penguin Computing, and we rely on that in order to make the right assessment.
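
As a rough illustration of the scaling assessment Pellegrino describes, projecting full-cluster performance from a baseline measured on a subset of nodes, here is a minimal sketch. The node counts, baseline throughput and efficiency model are hypothetical assumptions for illustration only, not Meta’s or Penguin Computing’s actual figures or methodology.

```python
import math

def projected_throughput(baseline_tflops_per_node: float,
                         subset_nodes: int,
                         target_nodes: int,
                         efficiency_per_doubling: float = 0.97) -> float:
    """Estimate aggregate throughput at target_nodes, assuming a fixed
    efficiency loss each time the node count doubles (a simple first-order
    model of communication overhead). All inputs are illustrative."""
    doublings = math.log2(target_nodes / subset_nodes)
    scaling_efficiency = efficiency_per_doubling ** doublings
    return baseline_tflops_per_node * target_nodes * scaling_efficiency

# Example: a 32-node pilot measured at 4.5 TFLOPS per node on the target
# workload, extrapolated to a 760-node deployment (numbers made up).
estimate = projected_throughput(4.5, subset_nodes=32, target_nodes=760)
print(f"Projected aggregate throughput: {estimate:,.0f} TFLOPS")
```

The key assumption is treating communication overhead as a per-doubling efficiency loss, a common first-order model; a real assessment would validate that curve against measurements at intermediate scales, and, as Pellegrino notes, against a full benchmark on the finished cluster.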

With that many DGX systems in a cluster, are there any challenges with connecting them together and ensuring that there are as few bottlenecks as possible?

First, InfiniBand fabrics have limitations as to how many targets they can register in the initial deployment of that InfiniBand spec. So you start pushing into creative ways of solving for that many addresses in order to connect everything the right way. You really tap into the high end of Nvidia’s portfolio for network connectivity — that’s the Mellanox portfolio that they acquired a couple of years ago. When I was describing what we did: connecting the different nodes, doing the architecture for the topology and removing bottlenecks, but also removing single points of failure in the network. You don’t want to be cut off from a majority of your nodes if you have a failure in one of the cables or in one of the transceivers. So you’ve got a lot of considerations that go into the design upfront.

And then just the careful deployment: There are a lot of cables to connect, label, test. And like I said, even the addressing of every single endpoint is challenging. So going back to the theme of “at-scale,” you start pushing the envelope on a lot of the different technologies that you need to integrate for the systems. And by the way, we’re using commonly available equipment. You get into the exascale space, you can allow yourself a lot more flexibility and design custom silicon or come up with new specifications for the interconnect, etc., etc. Here it’s commercially available, high-end and optimized for the workload, and each component is carefully picked in order to deliver the performance expected.
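
The failure-domain concern Pellegrino raises, that losing a single cable or transceiver should not cut you off from a large share of nodes, can be checked mechanically against a topology description. The following is a hypothetical sketch of that kind of check on a toy two-tier fabric; it is not the RSC’s actual topology or Penguin Computing’s tooling.

```python
from collections import defaultdict, deque

def reachable(adjacency, start, removed_edge):
    """Return the set of nodes reachable from `start` via BFS when one
    edge (given as a frozenset of its two endpoints) is removed."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if frozenset((node, nbr)) == removed_edge or nbr in seen:
                continue
            seen.add(nbr)
            queue.append(nbr)
    return seen

# Toy fabric: 8 compute nodes split across 2 leaf switches, each leaf
# uplinked to both spine switches for redundancy (hypothetical topology).
edges = [(f"node{i}", f"leaf{i // 4}") for i in range(8)]
edges += [(leaf, spine) for leaf in ("leaf0", "leaf1")
          for spine in ("spine0", "spine1")]

adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

compute = {f"node{i}" for i in range(8)}
for a, b in edges:
    survivors = reachable(adjacency, "spine0", frozenset((a, b))) & compute
    stranded = len(compute) - len(survivors)
    assert stranded <= 1, f"Losing link {a}-{b} strands {stranded} nodes"
print("No single link failure isolates more than one compute node.")
```

The redundant leaf-spine uplinks in the toy example are what keep any single link failure from stranding more than the node directly attached to it, which is the property the design work Pellegrino describes is meant to guarantee at much larger scale.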

What were the challenges in ensuring that the storage was as fast and responsive as possible for the requirements that Meta had?

You’ve got to understand the number of accesses the workload makes to your bulk storage and your cache storage, and the ratio that you will need between them. The access rate: how many times you go hit those storage tiers. The performance: because you’ve got a pipe, and you’ve got to understand whether the pipe going to your bulk and cache storage is big enough for what can be consumed within the nodes and utilized by the software that’s leveraging that resource. And you’ve got to understand the life of the system to pick the right components.

And all this was a pretty big equation to go solve, and it was more trial and error than knowing, “OK, this is the one.” And we tried a lot of options on the bulk storage side and the cache storage side, and this one was a combination that worked well. And it’s also, in a way, learnings from the previous deployment with [the first cluster]: You get to live with a certain architecture with the previous cluster. You learn over four years how things behave and evolve, and then you make a decision based on that learning. But yeah, this is what the optimal configuration was for the point in time when the decision was made. Things evolve, but this one works out pretty well.
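
To make the ratio-and-pipe reasoning concrete, here is a back-of-the-envelope sketch of how aggregate read demand might be split between a cache tier and a bulk tier. The node count, per-node bandwidth and cache-hit ratio are hypothetical assumptions, not Meta’s actual requirements.

```python
def required_tier_bandwidth(num_nodes: int,
                            read_gbps_per_node: float,
                            cache_hit_ratio: float) -> dict:
    """Split aggregate read demand between a cache tier and a bulk tier,
    given the fraction of accesses the cache is assumed to absorb.
    All inputs are illustrative."""
    total = num_nodes * read_gbps_per_node
    return {
        "total_GB_per_s": total,
        "cache_GB_per_s": total * cache_hit_ratio,
        "bulk_GB_per_s": total * (1.0 - cache_hit_ratio),
    }

# Example: 760 nodes each reading training data at ~2 GB/s, with a cache
# tier assumed to absorb 80 percent of accesses (numbers made up).
demand = required_tier_bandwidth(760, read_gbps_per_node=2.0, cache_hit_ratio=0.8)
for tier, gbps in demand.items():
    print(f"{tier}: {gbps:,.0f} GB/s")
```

Sizing the tiers this way only works if the assumed hit ratio reflects the real workload, which is why Pellegrino emphasizes trial and error and the learnings carried over from the earlier cluster.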

What kind of software work did Penguin Computing do with Meta on this project?

I’ll speak about one thing that’s really unique. The operation of the cluster has a lot of requirements, and I can’t speak to the details, but imagine in our world, so in IT in general, you’ve heard the term DevOps: You have your development team coding, automating things, and then you have the operation team putting [on] a lot of constraints.

In the world of IT, it’s common: This IT deployment needs to follow all those requirements. And you start from the top of your priorities, and then you code, automate, make sure you have the policies and everything set up right for those first requirements, and then you go down that list.

And at the end, you have your first instantiation, [a minimum viable product], everything works, and then you iterate on that. So that concept of continuously evolving the life of the cluster to the changing needs of the operations is a little bit new to the world of HPC and AI. And just having the ability to go through this in a very strong partnership, like what we’ve built with Meta, has been, quite frankly, a learning experience, but something that now we’re really excited about continuing to do with them.
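
One way to picture the requirements-driven, iterative approach Pellegrino describes is to encode operational requirements as prioritized, automated checks that each iteration of the cluster configuration must pass. The sketch below is purely illustrative; the check names and configuration keys are hypothetical and do not reflect Meta’s or Penguin Computing’s actual policies or tooling.

```python
from typing import Callable

# Each check: (priority, name, test function over a configuration dict).
# Check names and config keys are hypothetical placeholders.
CHECKS: list[tuple[int, str, Callable[[dict], bool]]] = [
    (1, "management network isolated", lambda c: c.get("mgmt_vlan_isolated", False)),
    (2, "node images signed",          lambda c: c.get("image_signing", False)),
    (3, "telemetry exporters enabled", lambda c: c.get("telemetry", False)),
]

def evaluate(config: dict) -> list[str]:
    """Run checks in priority order and report the ones that still fail."""
    failures = []
    for priority, name, test in sorted(CHECKS, key=lambda c: c[0]):
        if not test(config):
            failures.append(f"P{priority}: {name}")
    return failures

# First iteration: a minimum viable configuration, then iterate until clean.
config = {"mgmt_vlan_isolated": True, "image_signing": False, "telemetry": True}
print(evaluate(config))  # -> ['P2: node images signed']
```

Each iteration either clears another check or reveals the next gap to automate, mirroring the “start at the top of your priorities and go down that list” process described above.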

What does this new AI supercomputer deployment say about what kind of work Penguin Computing is doing and where it’s going for the future?

I think for Penguin, we’re continuing to build up on our knowledge. As much as we could, we leveraged our knowledge of 20 years to go and architect and deploy the system with Meta. But we learned also through this process. Not a lot of companies get the luxury to deploy at scale like we just did. So overall, Penguin is getting smarter about AI and HPC. For [the Intelligent Platform Solutions group], which is my business unit, it’s great because we’re also integrating some of that knowledge into how we want to get some AI on the edge with the Penguin Edge division that we have. And for [Smart Global Holdings], it’s just great to have one of the business units with such a known and respected customer and partner as Meta. I think the whole vision of the metaverse is an amazing one, and we couldn’t be more proud to be a partner of choice for Meta in that endeavor.

As one of the biggest hyperscalers in the world, Meta clearly has significant compute needs that are fairly unique, but do you expect there to be a greater need for AI supercomputers like this within the larger enterprise world in the future?

Time will tell, but there are other areas where there are large deployments: academia, and in some of the government areas you’ll also see large deployments. So you can see that the world is really going to build more of those resources, because they drive value not just for the sake of science. It’s tied to businesses evolving. They are more digital and are going in the direction of AI being a big part of their business, so I would expect more of this. Our friends at Hyperion Research expect really strong growth in the world of HPC and AI, and what Meta is doing is just one example.