The Basics of High-Performance Networking
When it comes to the traditional pillars of data center infrastructure (i.e., compute, storage, network), the network is unique in that it has no inherent value on its own. While standalone servers can still compute things and storage hold things, the network is only as useful as the information it transports.
The same logic applies to high-performance networking (HPN), though the downside of a dysfunctional HPN can be a little more significant. If the data does not arrive reliably, in order and on time, it will often cause upstream and downstream disruptions well beyond the scope of the basic transport. The reason? "High performance" is significantly more complex than a mere function of bandwidth and application-specific integrated circuit (ASIC) latency.
Networkers know that the demarcation point in a traditional environment is where a packet hits the wire on one side and is handed off to the network interface card (NIC) on the other. It's simple, concise, straightforward to troubleshoot (or at least to pass the buck and say "not a network problem"), and it comprises only a tiny fraction of the whole picture.
This article will cover the components of HPN, the transport technologies available, and the "special sauce" that makes things actually go fast.
The components of "high performance"
Latency
NIC-to-NIC latency passing through a single switch ASIC usually clocks in somewhere in the 100-300 nanosecond range (this varies by chip and platform). Kernel-to-kernel latency, holding all other things equal, is around 50 microseconds. What's the implication? Well over 95 percent of the latency is happening inside the server, not on the network.
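To put those numbers in context, here is a minimal back-of-the-envelope sketch in Python. It uses the representative figures quoted above (a 300-nanosecond switch hop and roughly 50 microseconds kernel to kernel) rather than measured values:

```python
# Rough latency budget: where does the time go?
# Figures are the representative values quoted above, not measurements.
switch_asic_ns = 300          # single switch-ASIC hop, upper end of the 100-300 ns range
kernel_to_kernel_ns = 50_000  # ~50 microseconds end to end

network_share = switch_asic_ns / kernel_to_kernel_ns
host_share = 1 - network_share

print(f"Network (switch ASIC) share: {network_share:.1%}")  # ~0.6%
print(f"Inside-the-server share:     {host_share:.1%}")     # ~99.4%
```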
Understanding why requires a look inside the server itself.
A highly simplified packet walk (i.e., "a day in the life of a packet") starts when the CPU pulls two 8-byte operands from its registers and produces an 8-byte result, a single 64-bit flop. From there, our intrepid little data nugget must traverse the L1 (super-fast), L2 (very fast) and L3 (fast) caches, main memory (not fast at all), and the PCIe bus before it reaches the NIC for transmission to another device.
*Note that the applications, sockets and protocol drivers all live in main memory.*
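The rough shape of that path can be sketched as a latency ladder. The figures below are illustrative, order-of-magnitude numbers commonly cited for modern server hardware; they are assumptions for the sake of the sketch, not values from this article, and they vary widely by CPU, memory and PCIe generation:

```python
# Illustrative, order-of-magnitude access latencies along the packet's path.
# These are assumptions for illustration; real numbers vary by CPU generation,
# DIMM speed and PCIe generation.
path_ns = {
    "CPU register / 64-bit flop": 0.3,
    "L1 cache":                   1,
    "L2 cache":                   4,
    "L3 cache":                   30,
    "Main memory (DRAM)":         100,
    "PCIe hop to the NIC":        1_000,
}

for stage, ns in path_ns.items():
    print(f"{stage:<28} ~{ns:>7,.1f} ns")

# The applications, sockets and protocol drivers all sit in the main-memory
# tier, and a packet touches that tier many times before it reaches the wire.
```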
Bandwidth
Bandwidth demands are constantly increasing. Table stakes for HPN are currently 400 Gbps, with a leap to 800-1,600 Gbps expected over the next two years. Bandwidth is a constantly moving target, and any investment should be understood to have a limited shelf life at the top of the food chain before what is considered ultra-fast becomes ho-hum normal.
That said, while the engineering and physics of how this bandwidth is delivered are quite complex, and the size of the pipe is very relevant, it is not the bandwidth itself that makes a network "high performance."
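One way to see why raw bandwidth alone does not make a network "high performance" is serialization delay, the time it takes simply to clock a packet onto the wire. A quick sketch, assuming a 4 KB payload purely for illustration:

```python
# Serialization delay for a single 4 KB frame at various link rates.
# This is only the time to clock bits onto the wire; it ignores propagation,
# queuing and, most importantly, the time spent inside the host.
frame_bytes = 4096
for gbps in (100, 400, 800, 1600):
    seconds = (frame_bytes * 8) / (gbps * 1e9)
    print(f"{gbps:>5} Gbps: {seconds * 1e9:6.1f} ns per 4 KB frame")
```

Quadrupling the link rate saves a few hundred nanoseconds per frame at most; it does nothing about the tens of microseconds spent inside the server.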
What makes an app go fast if not latency and bandwidth?
RDMA
Remote Direct Memory Access (RDMA) is an I/O bypass technology implemented on the NICs of participating compute devices. RDMA leverages zero-copy networking: the NIC reads directly from the main memory of one device and writes directly into the main memory of another, completely bypassing the kernel's socket and transport buffers (and the copies through them) that conventional traffic incurs.
RDMA is multidisciplinary, requiring an understanding of application, network, compute and storage technologies. Once RDMA is invoked, a network can realize 90-96 percent decreases in latency (end-to-end, not NIC-to-NIC).
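Using the figures already quoted in this article (roughly 50 microseconds kernel-to-kernel without acceleration, and a 90-96 percent end-to-end reduction once RDMA is in play), the improvement works out as follows:

```python
# End-to-end latency improvement from RDMA, using this article's own figures.
baseline_us = 50.0  # ~50 microseconds kernel-to-kernel, unaccelerated
for reduction in (0.90, 0.96):
    accelerated_us = baseline_us * (1 - reduction)
    print(f"{reduction:.0%} reduction -> ~{accelerated_us:.0f} microseconds end to end")
# Prints roughly 5 and 2 microseconds, i.e. single-digit-microsecond territory.
```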
There are, however, some drawbacks to RDMA. Most notably, it has very strict quality and reliability requirements: data must be delivered reliably, in order, with no drops or jitter, all the time. The standard data center networking "best effort" SLA is insufficient. This is where the "high performance" part of HPN comes into play.
What is high-performance networking then?
The question of defining HPN isn't a function of latency or bandwidth, though both are factors. High performance is really determined by how you transport your RDMA (or the next really big, important application). Much in the same way networks had to evolve to accommodate voice over IP, Fibre Channel over Ethernet (FCoE) and dozens of other "sensitive applications," modern networks will have to adapt to support today's high-end compute and storage functions.
Presently, there are two dominant industry-standard methods for RDMA transport:
InfiniBand
InfiniBand (IB) is a communications standard that has native support for RDMA. The latency and bandwidth of InfiniBand ASICs are roughly equivalent to ethernet performance. In fact, IP over IB has the same latency as unaccelerated IP over ethernet. The physical hardware consists of Host Channel Adapters (HCA — basically a NIC) and InfiniBand switches. It's a controller-based architecture (via a subnet manager) that is non-routable, meaning that failure domains are the same size as the entire fabric. The subnet manager runs in software on one of the switches.
RoCEv2 (RDMA over Converged Ethernet v2)
Pronounced "rocky," RoCE is a similar standard that leverages ethernet with a few tweaks. It also consists of HCAs and switches but relies upon HCA point-to-point negotiation for reliable delivery instead of the IB-based controller. Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) parameters must be configured on the switches to recognize and prioritize RDMA-related traffic.
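The exact commands are vendor-specific, but conceptually the switch configuration boils down to a handful of knobs: which priority (traffic class) carries RoCE, whether PFC is enabled for that class, and the ECN thresholds at which congestion gets signaled. The sketch below is a vendor-neutral illustration of those knobs; the field names and values are assumptions for illustration, not any particular vendor's CLI, API or defaults:

```python
# Vendor-neutral illustration of the lossless-class parameters a RoCEv2
# fabric typically needs. Field names and values are illustrative only;
# actual configuration is done in each switch vendor's own CLI or API.
roce_lossless_class = {
    "traffic_class": 3,           # priority commonly dedicated to RoCE traffic
    "pfc_enabled": True,          # pause only this class, never the whole port
    "ecn": {
        "enabled": True,
        "min_threshold_kb": 150,  # start marking packets above this queue depth
        "max_threshold_kb": 1500, # mark aggressively as the queue approaches this
        "marking_probability": 0.1,
    },
    "cnp_traffic_class": 6,       # priority used for returning congestion notifications
}

def validate(cfg: dict) -> None:
    """Sanity-check the sketch: ECN thresholds must be ordered sensibly."""
    ecn = cfg["ecn"]
    assert ecn["min_threshold_kb"] < ecn["max_threshold_kb"]
    assert 0.0 < ecn["marking_probability"] <= 1.0

validate(roce_lossless_class)
print("RoCE lossless class parameters:", roce_lossless_class)
```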
The choice of which to embrace lies somewhere between engineering preference and deeply held religious beliefs.
InfiniBand versus RoCE considerations
Here is a quick comparison:
Flexibility
- AI InfiniBand fabrics are purpose-built and can support one workload type — AI.
- If a client's data center consists of a limited number of AI workloads continuously running with similar performance profiles, then InfiniBand might be a good fit.
- High-performance ethernet is more versatile.
- If a client's data center is cloud-focused (hybrid or cloud-ready) or consists of a wide ecosystem of mixed-requirement applications, then ethernet/RoCE may be a better choice.
- Migrations from InfiniBand to ethernet can be challenging.
Scalability
- As AI/ML workloads grow, the size of farms will, too.
- If current growth trends continue, today's NVIDIA DGX SuperPOD cluster (127 nodes/1,016 GPUs) will need to be 20x larger in two years.
- The InfiniBand specification allows for very large fabrics, but with complications vis-à-vis large failure domains.
- InfiniBand has no Layer 3 option.
When to use each?
Best practices leverage InfiniBand for the most demanding and performance-hungry applications (i.e., not just for any AI model, but for really big AI models). The other 95 percent of use cases can run quite comfortably on the much more versatile ethernet option.
A 2023 study conducted by Meta compared pre-training times for their Llama 2 AI model. Two identical clusters were built: one on InfiniBand and one on RoCEv2. What they found was that, below 2,000 GPUs, comparative performance was very close. At current rack densities and pricing, 2,000 GPUs would require approximately 63 cabinets, cost roughly $100 million, and take over a year to build.
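The cabinet count follows from simple density arithmetic. A quick sketch, assuming 8 GPUs per node and roughly 4 GPU nodes per cabinet (illustrative density assumptions, not figures from the study):

```python
import math

# Back-of-the-envelope sizing for a 2,000-GPU cluster.
# GPU-per-node and node-per-cabinet densities are illustrative assumptions;
# real densities depend on server model, power and cooling budgets.
gpus = 2000
gpus_per_node = 8      # typical of current 8-GPU training servers
nodes_per_cabinet = 4  # power and cooling often limit dense GPU nodes per rack

nodes = gpus / gpus_per_node                   # 250 nodes
cabinets = math.ceil(nodes / nodes_per_cabinet)  # 63 cabinets (round up, no half racks)
print(f"{gpus} GPUs -> {nodes:.0f} nodes -> ~{cabinets} cabinets")
```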
The general industry consensus is that — while there will always be a place for InfiniBand — ethernet-based solutions will evolve to satisfy the bulk of market requirements.
WWT's AI Proving Ground can help
The Meta study is an excellent example of "science that needs to be verified." As mentioned at the beginning of this article, high-performance networking cannot be tested in a vacuum. It requires real data bouncing between real servers and storage to give its performance any contextual meaning.
WWT has more than 10 years of experience designing and implementing big data and AI/ML solutions. In late 2023, we announced a three-year, $500 million investment in a state-of-the-art AI Proving Ground inside our Advanced Technology Center (ATC). This composable lab environment is designed to accelerate the ability of clients to answer pressing questions about AI infrastructure, design and performance with hands-on access to the latest AI hardware, software and reference architectures — all in a secure, scalable and transparent manner.
For example, if a client wanted to compare LLM training times between a Cisco RoCE/Dell GPU/NetApp storage mix against an equivalent NVIDIA InfiniBand/NVIDIA GPU/Pure Storage mix (see graphic below) — the AI Proving Ground is the only lab on the planet where these on-demand mixes of hardware are available.