The Battle of AI Networking: Ethernet vs InfiniBand
Introduction
Comparing Ethernet vs InfiniBand is like the introduction to a prize fight, with the prize being market share in the $20 billion AI networking market.
That said, it's less pugilism and more fine details, although it could be argued that "sweet science" applies to both.
InfiniBand was created to address Ethernet's shortcomings (lossy, stochastic and slow). Over time, however, the overall performance/reliability gap has substantially narrowed; with some tweaks, Ethernet can push data with the same bandwidth, latency and reliability as InfiniBand. While the ultra-high-performance domain (perhaps the top 3-5 percent of the total market) still belongs to InfiniBand, the vast majority of current InfiniBand deployments can actually be handled by Ethernet.
Regardless of the changes in performance profiles, directly comparing Ethernet and InfiniBand is challenging. It's not even apples-to-oranges; it's comparing apples to wheelbarrows. In some ways, they're identical; in others, radically different. The stakes for the primary use cases (generative and inference AI) are high from both an economic and a strategic perspective, though, so it's important to get it right.
As mentioned in a previous article, Basics of High-Performance Networking, a network's value is derived not from the transport itself but from how it connects compute and storage. When it comes to high performance, it boils down to a single question: How do you transport your RDMA?
However, the performance of a system leveraging RDMA is a function of the type of storage, type of compute, enhancements to each and how they're configured.
In recent proofs of concept (POCs) hosted in WWT's labs, engineering a true apples-to-apples Ethernet/IB comparison has meant duplicating a complex InfiniBand infrastructure on Ethernet, hop-by-hop, optic-by-optic, nerd-knob by nerd-knob. The environment was so customized that the results were largely only relevant to that exact build and its configuration. So, while we could absolutely say that Ethernet/RoCE was faster than InfiniBand, it only held true for those specific environments and the circumstances we tested.
Ethernet vs InfiniBand
Comparing the two "by the numbers," with attention to their differentiating factors:
| | ETHERNET | INFINIBAND |
| --- | --- | --- |
| Max Bandwidth | 800 Gbps | 800 Gbps |
| MTU | 9216 bytes (NOTE: RDMA is optimized for 4096 bytes, so larger frames will not necessarily result in enhanced performance) | 4096 bytes |
| Layer 3 Support | Yes | No |
| Delivery | Best Effort, enhanced to lossless | Lossless |
| Load Balancing | Hash Values | Deterministic (NCCL) |
| RDMA Support | RoCEv2 | Native |
| Enhancements | | |
| Pros | | |
| Cons | | |
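To ground the Load Balancing row above: on Ethernet, each flow is typically pinned to an uplink by hashing its headers, so concurrent RDMA flows can land on the same link. The Python sketch below is purely illustrative (the hash function, addresses and flow values are assumptions, not any switch vendor's implementation); it shows how a 5-tuple hash maps flows to uplinks and why a handful of large flows can collide.

```python
# Illustrative sketch of hash-based ECMP path selection (not a vendor implementation).
# A flow's 5-tuple is hashed once, so every packet of that flow follows the same uplink.
import hashlib
from collections import Counter

def ecmp_uplink(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                proto: int, num_uplinks: int) -> int:
    """Pick an uplink index from the flow 5-tuple (real ASICs use CRC/XOR; SHA-1 here)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks

# Four hypothetical RoCEv2 flows (UDP destination port 4791) spread across 4 uplinks:
flows = [("10.0.0.1", "10.0.1.1", 49152 + i, 4791, 17) for i in range(4)]
placement = Counter(ecmp_uplink(*f, num_uplinks=4) for f in flows)
print(placement)  # e.g. two flows may hash to the same uplink while another sits idle
```

When the colliding flows are long-lived "elephant" RDMA transfers, that shared uplink becomes the bottleneck even though aggregate capacity is available, which is the core load-balancing difference versus a deterministically scheduled fabric.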
The question remains: how to test them in a way that broadly applies?
Test
WWT recently conducted a series of independent tests designed to eliminate all variables except for network transport. The raw metrics in these tests were expected to be worse than other publicly available numbers precisely because many performance-optimizing features were disabled to position the network transport as the central component.
While reflective of smaller-scale rail-optimized and rail-only designs, these tests were intended to compare the performance profile of RoCEv2 and its enabling features (PFC, ECN) against InfiniBand's natively scheduled fabric, holding all other variables equal. That's it.
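For readers less familiar with the Ethernet side of that comparison, the sketch below illustrates the WRED-style ECN marking behavior that RoCEv2 congestion control builds on; the thresholds, probability curve and function name are hypothetical illustrations, not the switch settings used in these tests.

```python
# Conceptual WRED-style ECN marking as used with RoCEv2 (thresholds are hypothetical).
# Below k_min nothing is marked; between k_min and k_max the marking probability ramps
# linearly; above k_max every packet is marked so senders back off before PFC must fire.
import random

def ecn_mark(queue_depth_kb: float, k_min: float = 150.0, k_max: float = 1500.0,
             p_max: float = 0.1) -> bool:
    """Return True if this packet should be marked CE (Congestion Experienced)."""
    if queue_depth_kb <= k_min:
        return False
    if queue_depth_kb >= k_max:
        return True
    p = p_max * (queue_depth_kb - k_min) / (k_max - k_min)
    return random.random() < p

for depth in (100, 500, 1200, 2000):
    print(depth, ecn_mark(depth))
```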
Equipment
| HARDWARE | FUNCTION |
| --- | --- |
| (2) 8-Way Compute Nodes | Compute |
| H100 GPU | Accelerator |
| NVIDIA Quantum 9700 NDR | Network (InfiniBand) |
| Arista 7060DX5-64S | Network (Ethernet) |
| Cisco Nexus 9332D-GX2B | Network (Ethernet) |
Setup
Phase 1
For Phase 1, a single-switch network was deployed, representing the ideal minimum-variable scenario.
Methodology
Testing made use of industry-standard MLCommons benchmarks, specifically the MLPerf Training and MLPerf Inference: Datacenter problem sets. These enabled an apples-to-apples analysis of how network transport affects generative and inference AI performance.
- Each selected benchmark test was run for each network solution and OEM, with the end results compared as an average
- Individual OEM results are masked to avoid complications of "whose network is better"
- Ethernet was minimally optimized, with only basic PFC and ECN switch configurations used in accordance with industry best practices.
- Performance-enhancing features on the compute node (notably NVLink) were disabled
- The intent was to force all GPU-GPU traffic out of the server and onto the network. Performance optimized? No. However, it allowed us to observe exclusively how the network contributed to the performance and then directly compare the differences.
- NCCL was modified between the IB and Ethernet tests to whitelist the compute NICs (a requirement for Ethernet functionality); an illustrative configuration sketch follows this list
- The same physical optical cables were used for all Ethernet tests
- The same physical third-party optics were leveraged across systems
- Storage was local to the compute node
In short, every variable not related to network transport was removed.
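As a concrete (and hedged) illustration of the NVLink-disable and NIC-whitelist steps above, the snippet below shows how a test harness might launch the standard nccl-tests all_reduce_perf benchmark with NCCL environment variables of this kind; the interface and HCA names are placeholders, and this is not WWT's actual harness or configuration.

```python
# Illustrative launcher (not WWT's test harness): environment variables of this kind
# force GPU-to-GPU traffic off NVLink/P2P paths and onto the whitelisted network NICs.
import os
import subprocess

env = dict(os.environ)
env.update({
    "NCCL_P2P_DISABLE": "1",          # keep traffic off NVLink/PCIe peer-to-peer
    "NCCL_IB_HCA": "mlx5_0,mlx5_1",   # placeholder NIC/HCA whitelist (RoCE or IB ports)
    "NCCL_SOCKET_IFNAME": "eth0",     # placeholder bootstrap interface
    "NCCL_IB_GID_INDEX": "3",         # RoCEv2 GID index; not needed for native InfiniBand
})

# nccl-tests all_reduce_perf: 8 GPUs, message sizes swept from 8 B to 8 GB
subprocess.run(["./all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", "8"],
               env=env, check=True)
```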
Results
| BENCHMARK | MODEL | ETHERNET | INFINIBAND | ETH/IB RATIO |
| --- | --- | --- | --- | --- |
| MLPerf Training | BERT-Large | 10,886 s | 10,951 s | 0.9977 |
| MLPerf Inference | LLAMA2-70B-99.9 | 52.362 s | 52.003 s | 1.0166 |
Performance ratios were expressed in terms of Ethernet / InfiniBand (i.e., a longer Ethernet completion time will be reflected as a ratio greater than 1).
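To make the aggregation explicit, the short sketch below computes that ratio from per-OEM completion times; the numbers are hypothetical, since the individual OEM results are masked.

```python
# Hypothetical completion times in seconds; the actual per-OEM results are masked.
eth_runs = {"oem_a": 10880.0, "oem_b": 10892.0}
ib_runs = {"oem_a": 10948.0, "oem_b": 10954.0}

eth_avg = sum(eth_runs.values()) / len(eth_runs)
ib_avg = sum(ib_runs.values()) / len(ib_runs)

ratio = eth_avg / ib_avg  # > 1 means Ethernet took longer than InfiniBand
print(f"ETH avg {eth_avg:.1f} s, IB avg {ib_avg:.1f} s, ETH/IB ratio {ratio:.4f}")
```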
Observations
- Across generative tests and OEMs, the performance delta between InfiniBand and Ethernet was statistically insignificant (less than 0.03 percent)
- Ethernet was faster than InfiniBand's best time in three out of nine generative tests (although the margin was only a few seconds)
- In inference tests, Ethernet averaged roughly 1.7 percent slower (an ETH/IB ratio of 1.0166)
Conclusions
- In the evaluations discussed above, InfiniBand and unoptimized Ethernet are statistically neck-and-neck.
- It is understood that performance differentials will emerge in larger networks, but it has been observed in other laboratory environments that the performance gap is generally under 5 percent.
- The introduction of current and pending optimization features (e.g., Ultra Ethernet) is expected to substantially improve Ethernet performance.
- In larger, more complex multivariate tests that weren't part of this particular evaluation (i.e., the "bespoke" customer POCs run in WWT's Advanced Technology Center), Ethernet has been observed to sometimes outperform InfiniBand by a sizeable margin, especially when there was packet-size variance and multiple AI workloads shared the same fabric.
- In published case studies of large-cluster performance on Ethernet (e.g., Meta's LLAMA2 training on a 2,000-GPU Ethernet cluster and LLAMA3 training on a 24,000-GPU Ethernet cluster), performance between Ethernet and IB was at parity.
Connecting the dots between small-scale tests, complex multivariate POCs, BasePod production environments and industry case studies:
WWT views Ethernet as a wholly viable alternative to InfiniBand for most generative and inference use cases.
Caveat
Complications encountered in larger clusters but not addressed by this test include (but are not limited to):
- Elephant Flows
- Multiple workload / "Noisy Neighbor" resource contention
- Transient Oversubscription
- Incast Oversubscription (a rough illustration follows this list)
- Imperfect Load-Balancing
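As a rough, back-of-the-envelope illustration of the incast item above (the link speed, sender count and buffer size are hypothetical), the sketch below shows how quickly an egress buffer can fill when many senders converge on a single port and no PFC/ECN intervention occurs.

```python
# Back-of-the-envelope incast model (hypothetical speeds, counts and buffer size).
# N senders at 400 Gbps converge on one 400 Gbps egress port; the excess lands in the buffer.
LINK_GBPS = 400
SENDERS = 8
BUFFER_MB = 64                                 # buffer available to the egress port

offered_gbps = SENDERS * LINK_GBPS             # 3,200 Gbps offered
excess_gbps = offered_gbps - LINK_GBPS         # 2,800 Gbps beyond line rate
fill_rate_mb_per_us = excess_gbps / 8 / 1000   # Gbit/s -> MB per microsecond
time_to_fill_us = BUFFER_MB / fill_rate_mb_per_us
print(f"Buffer fills in ~{time_to_fill_us:.0f} us without PFC/ECN intervention")
```

At these assumed numbers the buffer is exhausted in well under a millisecond, which is why lossless behavior (PFC) and early congestion signaling (ECN) matter so much more as cluster scale grows.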
A thorough investigation of use cases and the technology mix that will best support them is strongly recommended prior to deployment.
How WWT can help
We have answered the fundamental question, but that does not mean that we have answered every question.
A complex ecosystem of GPUs, DPUs, storage and specific use cases still needs to be tested. As such, future tests will be run over a more conventional non-blocking spine/leaf architecture with multiple 8-way compute nodes.
In iterative tests, Ethernet-enhancing features (including Ultra Ethernet modifications, ECMP entropy improvements, flowlets, packet spraying, network and NIC reordering, etc.) will be examined.
World Wide Technology has over 10 years of experience in the design and implementation of Big Data and AI/ML solutions. In late 2023, WWT announced a three-year, $500 million investment in the creation of a unique AI Proving Ground (AIPG). The AIPG provides an ecosystem of best-of-breed hardware, software and architecture where customers can answer pressing questions in AI infrastructure and design. If a customer wants to gauge LLM training times with Cisco RoCE, AMD GPUs and NetApp storage (for example) against an equivalent NVIDIA InfiniBand/NVIDIA GPU/Pure Storage mix, this is the only lab on the planet where these on-demand mixes of hardware are available.
References
Data Center AI Networking, 650 Group (2024) https://650group.com/press-releases/data-center-ai-networking-to-surge-to-nearly-20b-in-2025-according-to-650-group/
The Basics of High Performance Networking, WWT (2024) https://www.wwt.com/article/the-basics-of-high-performance-networking
MLCommons (2024) http://www.mlcommons.org
Meta AI, Meta (2024) http://ai.meta.com
AI/ML Datacenter Networking Blueprint, Cisco (2024) https://www.cisco.com/c/en/us/td/docs/dcn/whitepapers/cisco-data-center-networking-blueprint-for-ai-ml-applications.html
AI Networking, Arista (2024) https://www.arista.com/assets/data/pdf/Whitepapers/AI-Network-WP.pdf