Introduction

Comparing Ethernet vs InfiniBand is like the introduction to a prize fight, with the prize being market share in the $20 billion AI networking market.

That said, it's less pugilism and more fine details, although it could be argued that "sweet science" applies to both.

InfiniBand was created to address Ethernet's shortcomings (lossy, stochastic and slow).  Over time, however, the overall performance/reliability gap has substantially narrowed; with some tweaks, Ethernet can push data with the same bandwidth, latency and reliability as InfiniBand. While the ultra-high-performance domain (perhaps top 3-5 percent of the total market?) still belongs to InfiniBand, the vast majority of current InfiniBand deployments can actually be handled by Ethernet.

Regardless of the changes in performance profiles, directly comparing Ethernet and InfiniBand is challenging. It's not even apples-to-oranges; it's comparing apples to wheelbarrows. In some ways, they're identical; in others, radically different. The stakes for the primary use case (both generative and inference AI) are high from both an economic and strategic perspective, though, so it's important to get it right.

As mentioned in a previous article, Basics of High-Performance Networking, a network's value is derived not from the transport itself but from how it connects compute and storage. When it comes to high performance, it boils down to a single question: How do you transport your RDMA?

However, the performance of a system leveraging RDMA is a function of the type of storage, type of compute, enhancements to each and how they're configured. 

In recent proofs of concept (POCs) hosted in WWT's labs, engineering a true apples-to-apples Ethernet/IB comparison has meant duplicating a complex InfiniBand infrastructure on Ethernet, hop-by-hop, optic-by-optic, nerd-knob by nerd-knob. The environment was so customized that the results were largely only relevant to that exact build and its configuration. So, while we could absolutely say that Ethernet/RoCE was faster than InfiniBand, it only held true for those specific environments and the circumstances we tested.

Ethernet vs InfiniBand

Comparing the two "by the numbers," with attention to their differentiating factors:

Max Bandwidth
  • Ethernet: 800 Gbps
  • InfiniBand: 800 Gbps

MTU
  • Ethernet: 9216 bytes (NOTE: RDMA is optimized for 4096-byte payloads, so larger frames will not necessarily result in enhanced performance; see the sketch following this table)
  • InfiniBand: 4096 bytes

Layer 3 Support
  • Ethernet: Yes
  • InfiniBand: No

Delivery
  • Ethernet: Best effort, enhanced to lossless
  • InfiniBand: Lossless

Load Balancing
  • Ethernet: Hash values
  • InfiniBand: Deterministic (NCCL)

RDMA Support
  • Ethernet: RoCEv2
  • InfiniBand: Native

Enhancements
  • Ethernet: Dynamic Load Balancing, Weighted ECMP, VOQ, Disaggregated Scheduled Fabric (DSF), Adaptive Routing, EtherLink, Performance Isolation, DDP
  • InfiniBand: Adaptive Routing, SHARP

Pros
  • Ethernet: Handles multi-workload fabrics (i.e., several different AIs with varying requirements); easily adapted skillset for existing network engineers
  • InfiniBand: Simple to install; self-optimizing

Cons
  • Ethernet: At present, requires a few QoS modifications to optimize performance
  • InfiniBand: Rare skillset; operationally difficult to support when something goes wrong
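To illustrate the MTU note above: because the RDMA path MTU tops out at 4096 bytes, a jumbo Ethernet MTU mostly buys headroom for headers rather than larger payloads. The arithmetic below is a back-of-the-envelope sketch using standard RoCEv2 header sizes (no VLAN tag, preamble or inter-packet gap), not a measurement from the lab.

```python
# Back-of-the-envelope wire efficiency for a RoCEv2 packet carrying a
# 4096-byte RDMA payload. Header sizes are the standard stack; VLAN tags
# or additional encapsulation would add a few bytes more.
payload = 4096   # RDMA path MTU: maximum payload per packet
eth     = 14     # Ethernet header
ipv4    = 20     # IPv4 header
udp     = 8      # UDP header (RoCEv2 rides over UDP port 4791)
bth     = 12     # InfiniBand Base Transport Header
icrc    = 4      # invariant CRC
fcs     = 4      # Ethernet frame check sequence

frame = payload + eth + ipv4 + udp + bth + icrc + fcs
print(f"Bytes on the wire: {frame}")               # 4158
print(f"Payload efficiency: {payload/frame:.1%}")  # ~98.5%
# A 9216-byte switch MTU cannot push this figure higher, because the RDMA
# payload per packet is still capped at 4096 bytes.
```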

The question remains: how to test them in a way that broadly applies?

Test

WWT recently conducted a series of independent tests designed to eliminate all variables except for network transport. The raw metrics in these tests were expected to be worse than other publicly available numbers precisely because many performance-optimizing features were disabled to position the network transport as the central component. 

While reflective of smaller-scale rail-optimized and rail-only designs, the intent of these tests was to compare the performance profile of RoCEv2 and its enabling features (PFC, ECN) against InfiniBand's natively scheduled fabric, holding all other variables equal. That's it.
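For readers less familiar with those enabling features, the toy model below shows the basic feedback loop they provide: once an egress queue crosses a marking threshold, ECN-marked packets cause the RoCEv2 sender to back off, keeping the queue shallow without drops, while PFC remains the last-resort pause mechanism. All values are invented for illustration; they are not the DCQCN parameters or switch thresholds used in these tests.

```python
# Simplified, illustrative model of ECN-driven rate control on a RoCEv2 flow.
# Abstract units; not the lab's actual configuration.
LINK_CAPACITY  = 100   # amount the egress port drains per tick
MARK_THRESHOLD = 200   # queue depth at which ECN marking begins

rate, queue = 140, 0   # sender initially offers more than the link can drain

for tick in range(12):
    queue = max(0, queue + rate - LINK_CAPACITY)
    if queue > MARK_THRESHOLD:
        rate = int(rate * 0.7)        # marked packets: sender backs off
    else:
        rate = min(140, rate + 5)     # no marks: sender probes back up
    print(f"tick {tick:2d}  queue={queue:4d}  offered rate={rate:4d}")
```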

Equipment

HARDWARE                    FUNCTION
(2) 8-Way Compute Nodes     Compute
H100 GPU                    Accelerator
NVIDIA Quantum 9700 NDR     Network (InfiniBand)
Arista 7060DX5-64S          Network (Ethernet)
Cisco Nexus 9332D-GX2B      Network (Ethernet)

Setup

Phase 1

For Phase 1, a single-switch network was deployed, representing the ideal minimum-variable scenario. 

Figure 1: Phase 1 Topology

Methodology

Testing made use of industry-standard MLCommons benchmarks, specifically the MLPerf Training and MLPerf Inference: Datacenter problem sets. These enabled an apples-to-apples analysis of how network transport affects generative and inference AI performance.

  • Each selected benchmark test was run for each network solution and OEM, with the end results compared as an average
  • Individual OEM results are masked to avoid complications of "whose network is better"
  • Ethernet was minimally optimized, with only basic PFC and ECN switch configurations used in accordance with industry best practices
  • Performance-enhancing features on the compute node (notably NVLink) were disabled
    • The intent was to force all GPU-GPU traffic out of the server and onto the network. Performance optimized? No. However, it allowed us to observe exclusively how the network contributed to performance and then directly compare the differences.
  • NCCL was modified between IB and Ethernet tests to whitelist compute NICs (a requirement for Ethernet functionality; see the configuration sketch below)
  • The same physical optical cables were used for all Ethernet tests
  • The same physical third-party optics were leveraged across systems 
  • Storage was local to the compute node

In short, every variable not related to network transport was removed. 
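For reference, the sketch below shows the kind of NCCL environment settings these bullets describe. The variable names are real NCCL knobs, but the HCA and interface names are placeholders, and the exact set of values used in the lab may have differed.

```python
import os

# Hypothetical NCCL settings of the kind described above; device and
# interface names are placeholders, not the lab's actual values.

# Push GPU-to-GPU traffic off NVLink/PCIe peer-to-peer and shared memory
# so that every transfer leaves the server through the NIC.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"

# Whitelist the compute NICs NCCL may use (the Ethernet-side requirement
# mentioned above).
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"

# RoCEv2 specifics: select the GID index that maps to the routable RoCEv2
# address and the interfaces used for bootstrap traffic.
os.environ["NCCL_IB_GID_INDEX"] = "3"
os.environ["NCCL_SOCKET_IFNAME"] = "eth1,eth2"

# The benchmark is then launched from this environment (e.g., via torchrun).
```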

Results

BENCHMARK           MODEL              ETHERNET     INFINIBAND   ETH/IB RATIO
MLPerf Training     BERT-Large         10,886 s     10,951 s     0.9977
MLPerf Inference    LLAMA2-70B-99.9    52.362 s     52.003 s     1.0166

Performance ratios were expressed in terms of Ethernet / InfiniBand (i.e., a longer Ethernet completion time will be reflected as a ratio greater than 1).
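Restating that convention with invented numbers (not values from the table above):

```python
def eth_ib_ratio(ethernet_seconds: float, infiniband_seconds: float) -> float:
    """Ratio > 1 means the Ethernet run took longer (was slower) than InfiniBand."""
    return ethernet_seconds / infiniband_seconds

# Illustrative values only.
print(eth_ib_ratio(101.66, 100.0))  # 1.0166 -> Ethernet ~1.66% slower
print(eth_ib_ratio(99.77, 100.0))   # 0.9977 -> Ethernet ~0.23% faster
```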

Observations

  • Across generative tests and OEMs, the performance delta between InfiniBand and Ethernet was statistically insignificant (less than 0.03 percent)
  • Ethernet was faster than InfiniBand's best time in three out of nine generative tests (although the margin was only by a few seconds)
  • In inference tests, Ethernet averaged 1.66 percent slower (an ETH/IB ratio of 1.0166)

Conclusions

  • In the evaluations discussed above, InfiniBand and unoptimized Ethernet are statistically neck-and-neck. 
  • It is understood that performance differentials will emerge in larger networks, but it has been observed in other laboratory environments that the performance gap is generally under 5 percent.
  • Introduction of current and pending optimization features (e.g., Ultra Ethernet) will substantially improve Ethernet performance.
  • In larger, more complex multivariate tests that weren't part of this particular evaluation (i.e., the "bespoke" customer POCs run in WWT's Advanced Technology Center), Ethernet has been observed to sometimes outperform InfiniBand by a sizeable margin, especially when there was packet size variance and multiple AIs sharing the same fabric.
  • In published case studies of large-cluster performance on Ethernet (e.g., Meta's LLAMA2 training on a 2,000-GPU Ethernet cluster and LLAMA3 training on a 24,000-GPU Ethernet cluster), performance between Ethernet and IB was at parity.

 


Connecting the dots between small-scale tests, complex multivariate POCs, BasePod production environments and industry case studies:

WWT views Ethernet as a wholly viable alternative to InfiniBand for most generative and inference use cases.

Caveat

Complications encountered in larger clusters that were not addressed by this test include (but are not limited to):

  • Elephant Flows
  • Multiple workload / "Noisy Neighbor" resource contention
  • Transient Oversubscription
  • Incast Oversubscription
  • Imperfect Load-Balancing

A thorough investigation of use cases and the technology mix that will best support them is strongly recommended prior to deployment.

How WWT can help

We have answered the fundamental question, but that does not mean that we have answered every question. 

A complex ecosystem of GPU, DPU, storage and specific use cases still needs to be tested. As such, future tests will be run over a more conventional spine/leaf non-blocking architecture with multiple 8-way compute nodes.

In iterative tests, Ethernet-enhancing features (including Ultra Ethernet modifications, ECMP entropy improvements, flowlets, packet spraying, network and NIC reordering, etc.) will be examined.
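To make one of those knobs concrete: classic ECMP hashes a flow's 5-tuple once and pins every packet of that flow to a single uplink, which is why a handful of long-lived RDMA flows can load paths unevenly. The toy model below (invented hash, addresses and uplink count) shows how adding entropy, for example by varying the UDP source port per flowlet or per queue pair, spreads the same endpoints across multiple paths.

```python
import hashlib

UPLINKS = 4  # toy fabric with four equal-cost paths

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Toy ECMP: hash the 5-tuple and map the flow onto one uplink."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % UPLINKS

# A single RoCEv2 flow (fixed 5-tuple, UDP destination port 4791) always
# lands on the same uplink, no matter how much data it carries.
print(pick_uplink("10.0.0.1", "10.0.0.2", 50000, 4791))

# Entropy-style spreading: varying the UDP source port per flowlet or
# queue pair lets the same pair of endpoints use several uplinks.
print({pick_uplink("10.0.0.1", "10.0.0.2", 50000 + i, 4791) for i in range(16)})
```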

 

Figure 2: Phase 2 Topology

World Wide Technology has over 10 years of experience in the design and implementation of Big Data and AI/ML solutions. In late 2023, WWT announced a three-year, $500 million investment in the creation of a unique AI Proving Ground (AIPG). The AIPG provides an ecosystem of best-of-breed hardware, software and architecture where customers can answer pressing questions in AI infrastructure and design. If a customer wants to gauge LLM training times with Cisco RoCE, AMD GPU and NetApp storage (for example) against an equivalent NVIDIA InfiniBand/NVIDIA GPU/Pure Storage mix, this is the only lab on the planet where these on-demand mixes of hardware are available.

 

Figure 3: AIPG Logical

References

Data Center AI Networking, 650 Group (2024) https://650group.com/press-releases/data-center-ai-networking-to-surge-to-nearly-20b-in-2025-according-to-650-group/

The Basics of High Performance Networking, WWT (2024) https://www.wwt.com/article/the-basics-of-high-performance-networking

MLCommons (2024) http://www.mlcommons.org

Meta AI, Meta (2024) http://ai.meta.com

AI/ML Datacenter Networking Blueprint, Cisco (2024) https://www.cisco.com/c/en/us/td/docs/dcn/whitepapers/cisco-data-center-networking-blueprint-for-ai-ml-applications.html

AI Networking, Arista (2024) https://www.arista.com/assets/data/pdf/Whitepapers/AI-Network-WP.pdf
