While there’s been a lot of discussion about the compute requirements of AI, less attention has been paid to the network. However, AI is forcing organizations to rethink their network infrastructures.
Traditional networks are unable to support the requirements of AI workloads. Without high-bandwidth connectivity, organizations will struggle with poor GPU utilization and inadequate performance of GPU-powered systems. Given the cost of GPUs and other hardware accelerators (xPUs), a high-performance network is critical, and the demands are increasing rapidly.
Speed isn’t the only consideration. Networks must provide extremely low latency with no packet loss and be able to scale to connect hundreds or thousands of xPUs.
Organizations have several choices when it comes to upgrading their networks: Ethernet, InfiniBand and proprietary technologies, such as NVLink. While many organizations are utilizing high-speed Ethernet, the other options also deserve consideration.
Organizations are deploying AI clusters with tens of thousands of xPUs to train large AI models, and operators report that each xPU requires 1 Tbps of network bandwidth. Dell’Oro Group estimates that AI clusters quadruple in size every two years. The scale of the networking demands is mind-boggling.
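A quick back-of-envelope calculation shows just how large those numbers get. This sketch assumes a 30,000-xPU cluster (a stand-in for "tens of thousands") and applies the per-xPU bandwidth figure and the Dell’Oro growth rate cited above:

```python
# Back-of-envelope estimate of AI cluster network demands, using the
# figures above: tens of thousands of xPUs, ~1 Tbps per xPU, and
# cluster size quadrupling every two years (Dell'Oro Group).
# XPUS is an assumed cluster size, not a figure from the article.

XPUS = 30_000            # assumed cluster size ("tens of thousands")
TBPS_PER_XPU = 1         # reported per-xPU bandwidth requirement

aggregate_tbps = XPUS * TBPS_PER_XPU
print(f"Aggregate bandwidth today: {aggregate_tbps:,} Tbps")

# Projected cluster size if it quadruples every two years:
for years in (2, 4, 6):
    projected = XPUS * 4 ** (years // 2)
    print(f"In {years} years: {projected:,} xPUs "
          f"({projected * TBPS_PER_XPU:,} Tbps aggregate)")
```

Even at today’s assumed size, the aggregate demand is tens of petabits per second; at the projected growth rate it reaches nearly two million xPUs within six years.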
The nodes in an AI cluster must work together as one vast and powerful entity, which requires high-speed communication between the xPUs. One slow connection can impede the performance of the entire cluster. After finishing their assigned tasks, xPUs may sit idle, waiting for data from other xPUs. The result is wasted compute resources and poor scaling efficiency.
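The straggler effect described above can be modeled in a few lines. In a synchronous training step, every xPU must wait for the slowest transfer to finish before the next step begins, so one degraded link gates the whole cluster. The link speeds and per-step data volume below are illustrative assumptions:

```python
# Toy model of the straggler effect: in a synchronous step, the
# cluster advances only when the slowest transfer completes, so a
# single slow link drags down every xPU. Figures are assumptions.

link_gbps = [400] * 7 + [100]   # one link running at a quarter speed
step_data_gb = 50               # assumed data exchanged per xPU per step

# Each step completes only when the slowest transfer finishes.
step_time = max(step_data_gb * 8 / g for g in link_gbps)   # seconds
ideal_time = step_data_gb * 8 / 400                        # all links healthy

print(f"Step time: {step_time:.1f}s vs ideal {ideal_time:.1f}s "
      f"-> {ideal_time / step_time:.0%} efficiency")
```

In this toy model, a single link at one-quarter speed cuts the effective efficiency of all eight nodes to 25 percent, even though seven of the eight links are healthy.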
According to research by Meta, early AI applications spent about a third of their time waiting on the network, resulting in hundreds of millions of dollars in wasted resources. A high-performance, low-latency network directly contributes to more efficient AI development and the ability to run more complex models.
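Meta’s one-third figure translates directly into dollars. This rough sketch applies that fraction to an assumed cluster size and hourly accelerator cost (both hypothetical) to show how quickly network wait time adds up:

```python
# Rough cost model based on Meta's observation that early AI jobs
# spent about a third of their time waiting on the network.
# Cluster size and hourly accelerator cost are assumptions.

XPUS = 16_000                 # assumed cluster size
COST_PER_XPU_HOUR = 3.00      # assumed $/hour per accelerator
NETWORK_WAIT_FRACTION = 1 / 3 # time spent idle, waiting on the network

def wasted_cost(hours: float) -> float:
    """Dollars of accelerator time lost to network waits over `hours`."""
    return XPUS * COST_PER_XPU_HOUR * hours * NETWORK_WAIT_FRACTION

# A month of continuous training:
print(f"${wasted_cost(30 * 24):,.0f} wasted per month")
```

Under these assumptions, a single month of training burns more than $11 million of accelerator time on network waits, which is why the payback on a faster fabric can be so rapid.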
When AI first came on the scene, organizations deployed various networking technologies to meet performance demands. However, Ethernet is becoming the preferred choice because it is familiar, cost-effective and highly evolved and has an open ecosystem of products. Organizations are using 800G Ethernet for AI model training and inferencing, and 400G Ethernet for ingesting training data.
Remote Direct Memory Access (RDMA) enables a network interface card (NIC) in one node to access memory in another node directly, bypassing the CPU and operating system. This results in significantly reduced latency, higher bandwidth and lower CPU overhead compared to traditional network communication methods such as TCP/IP sockets. RDMA over Converged Ethernet (RoCE) is a protocol that implements RDMA on Ethernet networks.
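The zero-copy idea at the heart of RDMA can be illustrated in miniature. The following is ordinary Python, not an RDMA API: slicing a buffer into `bytes` duplicates the payload, while a `memoryview` exposes a window onto the same memory with no copy, much as RDMA lets a NIC read and write remote memory directly instead of staging copies through the CPU and kernel:

```python
# Analogy only -- plain Python, not an RDMA library. It contrasts a
# copying data path (bytes) with a zero-copy one (memoryview), which
# is the distinction RDMA exploits at the network level.

buf = bytearray(64 * 1024 * 1024)   # 64 MB buffer, zero-initialized

copied = bytes(buf)                  # copy path: duplicates all 64 MB
view = memoryview(buf)               # zero-copy path: no duplication

buf[0] = 0xFF                        # mutate the original buffer
print(copied[0])   # 0   -- the copy is a stale snapshot
print(view[0])     # 255 -- the view sees the live buffer
```

Every copy in the first path costs CPU cycles and memory bandwidth; at cluster scale, eliminating those copies is where RDMA and RoCE earn their latency advantage.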
The Ultra Ethernet Consortium is developing the Ultra Ethernet Transport (UET) specification, which updates RDMA to meet the demands of AI and high-performance computing (HPC). UET products are expected to be available from late 2025 into early 2026, with some early “Ultra Ethernet-ready” components already released.
InfiniBand is another option. Designed for fast server-to-server and server-to-storage communication, InfiniBand enables direct, hardware-driven data transfers to maximize performance and minimize latency. It has a switched fabric architecture for efficient routing and utilizes RDMA to enable “zero-copy” data transfers.
NVLink is NVIDIA’s proprietary, high-speed interconnect that creates peer-to-peer links between multiple GPUs and/or CPUs. It offers significantly higher bandwidth than conventional PCIe and allows connected GPUs to access one another’s memory as a shared pool. When coupled with NVIDIA’s NVSwitch technology, it can create a large-scale, all-to-all communication fabric across multiple nodes.
Technologent’s networking experts can explain these and other options in detail, and offer recommendations based on your existing infrastructure and planned AI and HPC initiatives. We can then help you plan and execute a network modernization strategy that takes your organization into the future. Contact one of our consultants to get started.