
Industry Players Race to Scale Up the AI Data Center


Thursday, September 25, 2025

The high computational demands of AI workloads are driving the need to interconnect GPUs/AI accelerators into clusters that function as a single unit to increase performance efficiency.

Nvidia is the leader in AI accelerators, and its NVLink interconnect has become the most common interface for them. Until recently, however, NVLink was a proprietary technology available only on Nvidia-based platforms.

NVLink is used to connect Nvidia CPUs and GPUs. IBM incorporated it into its Power8 and Power9 CPUs, but the option drew little interest, and IBM transitioned to PCIe for the Power10 generation.

Earlier this year, Nvidia opened the NVLink interconnect through NVLink Fusion, a program that lets other semiconductor platforms use NVLink under a licensing agreement.

Meanwhile, the rest of the industry has been extending the capabilities of PCIe and Ethernet to compete with NVLink, while also forming the Ultra Accelerator Link (UALink) Consortium to offer an alternative. The key questions are which of these interconnects will survive, and whether any of them will displace NVLink.

A little networking history

There are two key terms in data center networking: scale-up and scale-out. Historically, scale-up meant scaling the resources within a server chassis, while scale-out referred to creating clusters by connecting multiple servers.

However, with the transition to accelerated computing, scale-up now means connecting resources within a rack, and possibly beyond it, so that they act as a single system. Scale-out refers to connecting these compute resources into clusters across the data center and even between data centers.

It is important to note that Nvidia just introduced a third key networking term, scale-across, for linking separate data centers. It is scale-up networking, however, that has become critical to allowing AI resources across multiple servers to act as a single compute unit.

Another historical note is that the tech industry commonly develops new interconnects for new generations of technology. The reason is that the capabilities of the latest semiconductors and the demands of new applications often outpace standards bodies such as the Institute of Electrical and Electronics Engineers (IEEE), which governs the Ethernet standards, and the PCI Special Interest Group (PCI-SIG), which governs the PCIe standards; these bodies cannot respond quickly enough. However, only a few of these unique interconnects survive over time.

Proprietary interconnects are nothing new

The shift to connecting compute resources has long driven the industry toward proprietary solutions. Intel developed its QuickPath Interconnect (QPI) and later its Ultra Path Interconnect (UPI) for CPU-to-CPU links and Xe Link as a GPU fabric, while opting for PCIe as the CPU-to-GPU interconnect. AMD developed HyperTransport as a CPU-to-CPU interconnect and later Infinity Fabric to connect all AMD compute resources in a system. Nvidia introduced NVLink as a GPU fabric and extended it to CPUs.

With the increasing availability of new AI solutions, including custom processors and AI accelerators from hyperscale cloud service providers, the industry is seeking a unified interconnect.

Success or failure in scale-up networking

The most common options for scale-up networking include PCIe, Scale-Up Ethernet (SUE), UALink, and NVLink.

While each interconnect has its benefits, its limitations are likely to determine its ultimate success or failure in scale-up networking, especially in AI applications where performance is critical.

PCIe is an industry standard, but its protocol overhead gives it lower performance and longer latency than the other scale-up interconnects.

SUE is based on Ethernet standards but uses a modified protocol stack to reduce latency. While it offers high scalability and a roadmap to very high performance, SUE still has higher latency than the other interconnects and, like PCIe, does not provide memory coherence. SUE also requires new switch chips; currently, only Broadcom’s Tomahawk Ultra supports it for scale-up networking.

By: DocMemory