Facebook Parent Meta Taps Nvidia GPUs For ‘Fastest AI Supercomputer’

With 16,000 GPUs to support its ‘metaverse’ aspirations, Meta’s new AI Research SuperCluster will consist of 2,000 Nvidia DGX A100 systems, which will make it the ‘largest customer installation of Nvidia DGX A100 systems’ when it’s fully deployed later this year, according to Nvidia.

Facebook parent company Meta is building what it said will become the world’s “fastest AI supercomputer” later this year with the largest deployment of Nvidia’s DGX A100 systems to date.

The Menlo Park, Calif.-based tech giant said that its new AI Research SuperCluster, revealed Monday, will use a total of 16,000 Nvidia A100 GPUs to train gigantic AI models for a variety of purposes related to the company’s buzzy “metaverse” aspirations when the cluster is fully built out in mid-2022.

With 16,000 GPUs, Meta’s supercomputer will consist of 2,000 Nvidia DGX A100 systems, each of which contains eight A100 GPUs. Nvidia said this will make the AI Research SuperCluster the “largest customer installation of Nvidia DGX A100 systems” when it is fully deployed.

The AI Research SuperCluster is already operational and currently consists of 760 Nvidia DGX A100 systems for a total of 6,080 GPUs, which Meta said is 20 times faster for running computer vision applications and three times faster for training large natural language processing models compared to an earlier cluster with 22,000 Nvidia V100 GPUs.
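The GPU totals follow directly from the eight-GPU DGX A100 configuration. As a quick sanity check on the figures above, here is a minimal Python sketch; the function and variable names are illustrative only and do not come from Meta or Nvidia.

```python
# Illustrative arithmetic only: names here are hypothetical, figures are from the article.
GPUS_PER_DGX_A100 = 8  # each DGX A100 system contains eight A100 GPUs


def total_gpus(systems: int, gpus_per_system: int = GPUS_PER_DGX_A100) -> int:
    """Return the GPU count for a given number of DGX A100 systems."""
    return systems * gpus_per_system


current_gpus = total_gpus(760)     # systems operational today
planned_gpus = total_gpus(2_000)   # systems in the full build-out
fraction_built = current_gpus / planned_gpus

print(current_gpus, planned_gpus, f"{fraction_built:.0%}")
```

By this count, the cluster as it stands holds a bit more than a third of the GPUs planned for the full deployment.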

These DGX A100 systems are connected over an Nvidia Quantum 200 Gb/s InfiniBand fabric, which Meta said will make it “one of the largest such networks” when the full cluster is completed.

As for storage, the AI Research SuperCluster has 175 petabytes of Pure Storage FlashArray systems, 46 petabytes of cache storage using Penguin Computing’s Altus systems and 10 petabytes of Pure Storage FlashBlade systems. Collectively, the storage system will be capable of delivering 16 TB/s of training data, and Meta plans to scale it up to a total capacity of 1 exabyte, which is equal to 1,000 petabytes.
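To put the storage figures in perspective, the short Python sketch below adds up the three tiers and compares the sum with the 1-exabyte target. The dictionary keys and variable names are illustrative, and the sketch assumes the three capacities can simply be summed, which the article does not state outright.

```python
# Illustrative arithmetic only: capacities in petabytes, as reported in the article.
STORAGE_PB = {
    "Pure Storage FlashArray": 175,
    "cache storage (Penguin Computing)": 46,
    "Pure Storage FlashBlade": 10,
}
PB_PER_EXABYTE = 1_000  # 1 exabyte = 1,000 petabytes

total_pb = sum(STORAGE_PB.values())
growth_factor = PB_PER_EXABYTE / total_pb  # scale-up needed to reach 1 exabyte

print(f"{total_pb} PB today, about {growth_factor:.2f}x to reach 1 EB")
```

On those assumptions, today's 231 petabytes would need to grow a little more than fourfold to reach the planned exabyte.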

Penguin Computing, a Fremont, Calif.-based provider of high-performance computing systems, is helping deploy the cluster as Meta’s architecture and managed services partner.

When the full cluster is built out, Meta said the AI Research SuperCluster will be capable of performing nearly 5 exaflops of mixed-precision compute, or nearly 5 quintillion calculations per second.
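A rough back-of-the-envelope check shows where a figure of that size could come from. The Python sketch below divides the headline number across 16,000 GPUs, then works forward from an assumed per-GPU peak of 312 teraflops, which is Nvidia's published peak rating for dense FP16/BF16 Tensor Core math on the A100. That per-GPU figure is an assumption brought in for illustration, not something Meta cites, and peak ratings are not sustained training throughput.

```python
# Illustrative arithmetic only. ASSUMED_A100_PEAK_TFLOPS is an assumption
# (Nvidia's published dense FP16/BF16 Tensor Core peak), not a figure from Meta.
EXA = 10**18
TERA = 10**12

CLUSTER_FLOPS = 5 * EXA          # "nearly 5 exaflops," per Meta
TOTAL_GPUS = 16_000              # full build-out
ASSUMED_A100_PEAK_TFLOPS = 312   # assumed per-GPU peak

# Per-GPU throughput implied by the headline figure.
implied_tflops_per_gpu = CLUSTER_FLOPS / TOTAL_GPUS / TERA

# Aggregate peak if every GPU ran at the assumed rating.
aggregate_exaflops = TOTAL_GPUS * ASSUMED_A100_PEAK_TFLOPS * TERA / EXA

print(f"implied per GPU: {implied_tflops_per_gpu} TFLOPS")
print(f"aggregate at assumed peak: {aggregate_exaflops:.3f} exaflops")
```

The two numbers line up closely, which suggests the "nearly 5 exaflops" claim is a theoretical aggregate peak rather than a measured benchmark result.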

In a blog post, Meta employees Kevin Lee and Shubho Sengupta said this unprecedented level of AI computing capability is necessary to train large, adaptable and complex models — some of which will have more than a trillion parameters and data sets as large as an exabyte — for advanced AI applications like computer vision, natural language processing and speech recognition.

These applications will play a central role in Meta’s goal of building the metaverse, a term taken from author Neal Stephenson’s dystopian sci-fi novel “Snow Crash” that Meta is using to describe a computing platform for a 3-D virtual environment where people can gather for a range of activities.

Lee and Sengupta said the AI Research SuperCluster will help Meta “build entirely new AI systems” to support its metaverse dreams, including a system that can translate voices in real time to large groups of people who speak different languages. It will also be used to train new AI models that can “seamlessly analyze text, images and video together” as well as develop new augmented reality tools.

“We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services, but also to enable completely new user experiences, especially in the metaverse,” Lee and Sengupta wrote.

The new Nvidia-powered supercomputer was announced as the GPU maker faces increased competition from rivals Intel and AMD this year.

Intel has significantly ramped up its AI efforts over the past few years, including its $2 billion acquisition of AI chip startup Habana Labs in 2019. The semiconductor giant is making a major push into the discrete GPU market in 2022, which will include its new Ponte Vecchio GPU. That chip will power the U.S. Department of Energy’s Aurora supercomputer, which is expected to go online later this year.

AMD, on the other hand, launched a new data center GPU line last fall, the Instinct MI200 series, which the chipmaker said is up to 4.9 times faster for HPC applications and up to 20 percent faster for AI applications compared to the 400-watt SXM version of Nvidia’s flagship A100 GPU.