Chip Startup Graphcore Takes On Nvidia A100 With New IPU

Graphcore claims its new Colossus MK2 is the ‘world’s most complex processor,’ packing more transistors in a smaller die than Nvidia’s A100 while outpacing the GPU by several times in performance and throughput. ‘Graphcore may now be first in line to challenge NVIDIA for data center AI, at least for large-scale training,’ an analyst says.


Artificial intelligence chip startup Graphcore is taking on Nvidia with what it is calling the “world’s most complex processor,” saying it can outpace the GPU juggernaut’s new A100.

The Bristol, U.K.-based startup’s new intelligence processing unit, the Colossus MK2, was announced Wednesday and packs 59.4 billion transistors on an 823 mm² die, a few billion transistors more than Nvidia’s A100 in a slightly smaller package. The chip relies on the same 7-nanometer manufacturing process from chip foundry TSMC that is used by AMD and Nvidia in their latest processors.

[Related: The 10 Hottest AI Chip Startups Of 2020 (So Far)]

At launch, the new IPU will be supported by a $32,450 purpose-built 1U system called the M2000 that is equipped with four MK2 IPUs, making it capable of one petaflop, or one quadrillion calculations per second, according to Graphcore. The M2000 also comes with the startup’s ultra-low-latency IPU-Fabric, which has a throughput of 2.8 Tbps and can support scale-out deployments of up to 64,000 IPUs. The system can also work with existing data center infrastructure.
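
Taken at face value, the article’s figures imply the following per-chip and maximum scale-out numbers (a back-of-the-envelope sketch; the constants are Graphcore’s claims, not independent measurements, and the variable names are ours):

```python
# Graphcore's claimed figures, from the article
IPUS_PER_M2000 = 4
M2000_PETAFLOPS = 1.0   # one petaflop per 1U M2000 system
MAX_IPUS = 64_000       # claimed IPU-Fabric scale-out limit

# Derived values
petaflops_per_ipu = M2000_PETAFLOPS / IPUS_PER_M2000   # 0.25 PFLOPS per MK2
max_cluster_petaflops = petaflops_per_ipu * MAX_IPUS   # 16,000 PFLOPS (16 exaflops)

print(petaflops_per_ipu)       # 0.25
print(max_cluster_petaflops)   # 16000.0
```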

Graphcore said eight M2000s, which would cost a total of $259,600, can outperform Nvidia’s $199,999 DGX A100 system, which packs eight A100 GPUs, by more than 12 times in FP32 compute and more than three times in AI compute, while providing more than 10 times the memory capacity at 3.6 TB.

Graphcore also claimed that on the EfficientNet-B4 image classification training algorithm, that same $259,600 cluster of eight M2000s can perform the same amount of work as 16 DGX A100 systems, which would cost a total of more than $3 million.
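
The price comparisons above check out with simple arithmetic (a sketch using only the article’s list prices; variable names are ours):

```python
# List prices cited in the article
M2000_PRICE = 32_450       # Graphcore M2000, 1U, four MK2 IPUs
DGX_A100_PRICE = 199_999   # Nvidia DGX A100, eight A100 GPUs

eight_m2000s = 8 * M2000_PRICE      # the $259,600 cluster in both claims
sixteen_dgx = 16 * DGX_A100_PRICE   # the EfficientNet-B4 comparison point

print(eight_m2000s)  # 259600
print(sixteen_dgx)   # 3199984, i.e. "more than $3 million"
```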

The MK2 IPU itself consists of 1,472 independent processor cores, 8,832 separate parallel threads and 900 MB of on-board ultra-high-speed RAM, the latter of which allows the processor to hold large AI models inside the memory. Compared with the original MK1 IPU, Graphcore said the MK2 offers a 9.3-fold improvement in BERT-Large training performance, an 8.5-fold improvement in BERT-3Layer inference performance and a 7.4-fold improvement in EfficientNet-B3 training performance.
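
Those spec figures break down neatly per core (a sketch derived only from the numbers above; variable names are ours):

```python
# MK2 spec figures from the article
CORES = 1_472
THREADS = 8_832
ONCHIP_RAM_MB = 900

threads_per_core = THREADS // CORES              # 6 hardware threads per core
ram_per_core_kb = ONCHIP_RAM_MB * 1024 / CORES   # ~626 KB of on-chip SRAM per core

print(threads_per_core)         # 6
print(round(ram_per_core_kb))   # 626
```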

Nigel Toon, a semiconductor veteran who is Graphcore’s CEO, said the MK2 has similar flexibility to Nvidia’s A100 when it comes to doing both training and inference workloads, but the IPU can do such work faster because it doesn’t rely on large batches of data like a GPU does.

“What we’re seeing is much, much lower latency when you’re doing inference,” he said. “So let’s say you’re doing something like [natural language processing], we’re typically seeing one-tenth the latency, so you get the answer much more quickly.”

That speed differential translates into cost savings for customers, Toon said.

“The fact that we’ve got these different IPUs inside the IPU machine, each of those could be doing different inference jobs,” he said. “So again, you could use them all together for training, and you can use multiple IPU machines for training, and then you can split them up to use the individual IPUs to support different customers with different inference workloads.”

The startup’s Poplar software is key, according to Toon, to the MK2’s memory bandwidth and density advantages over Nvidia’s A100. The MK2’s high-speed memory can deliver up to 180 Tbps of bandwidth, and, when combined with attached DRAM, supports up to 450 GB of capacity.

“Because we have so much memory inside the processor, our Poplar software is able to ensure that the processor always has the data it needs, it’s able to pre-load the data that it will need in the next phases of compute, and it can bring that data from an attached DRAM memory,” he said. “And all of that is invisible to the user, and it’s completely transparent.”

Toon said the IPU was designed from the start to support sparsity, a technique Nvidia also uses that allows processors to run models far more efficiently. Without such support, he argued, models become bloated.

“You end up having to build models that are much, much larger than they really need to be,” he said. “And you end up using a lot more compute than you need to.”

Early adopters of MK2 systems include the University of Oxford, which has reported “dramatic performance gains”; the U.S. Department of Energy’s Lawrence Berkeley National Laboratory, which had already been using the MK1 through Microsoft Azure; and J.P. Morgan, which is evaluating the startup’s chips for natural language processing and speech recognition.

Graphcore is selling its IPUs through cloud service providers, OEM partners and channel partners, and the startup plans to launch a formal channel program in September, according to Toon.

“Our approach here is to offer an IPU machine as an OEM product, as a white label product that people can sell as their own product,” he said. “The real benefit of this is that they don’t have to create specialized servers with different configurations.”

One of Graphcore’s first channel partners is French systems integrator Atos, which is planning to sell M2000 systems and IPU-POD systems to large European labs and institutions.

“We are already planning with European early customers to build out an IPU cluster for their AI research projects,” Arnaud Bertrand, senior vice president and head of strategy, innovations and research and development at Atos, said in a statement. “The new IPU architecture can enable a more efficient way to run AI workloads, which fits the Atos decarbonization initiative, and we are delighted to be working with a European AI semiconductor company to realize this future together.”

Karl Freund, a senior analyst at Moor Insights & Strategy, said while he’s impressed by the MK2’s performance, he thinks the IPU’s scalability is “perhaps its greatest feature,” with the processor’s fabric allowing large scale-out deployments and the M2000 supporting plug-and-play infrastructure.

“With this new product, Graphcore may now be first in line to challenge NVIDIA for data center AI, at least for large-scale training,” he said.