Introduction

Comparing Ethernet vs InfiniBand is like the introduction to a prize fight, with the prize being market share in the $20 billion AI networking market.

That said, it's less pugilism and more fine details, although it could be argued that "sweet science" applies to both.

InfiniBand was created to address Ethernet's shortcomings (lossy, stochastic and slow).  Over time, however, the overall performance/reliability gap has substantially narrowed; with some tweaks, Ethernet can push data with the same bandwidth, latency and reliability as InfiniBand. While the ultra-high-performance domain (perhaps top 3-5 percent of the total market?) still belongs to InfiniBand, the vast majority of current InfiniBand deployments can actually be handled by Ethernet.

Regardless of the changes in performance profiles, directly comparing Ethernet and InfiniBand is challenging. It's not even apples-to-oranges; it's comparing apples to wheelbarrows. In some ways, they're identical; in others, radically different. The stakes for the primary use case (both generative and inference AI) are high from both an economic and strategic perspective, though, so it's important to get it right.

As mentioned in a previous article, Basics of High-Performance Networking, a network's value is derived not from the transport itself but from how it connects compute and storage. When it comes to high performance, it boils down to a single question: How do you transport your RDMA?

However, the performance of a system leveraging RDMA is a function of the type of storage, type of compute, enhancements to each and how they're configured. 

In recent proofs of concept (POCs) hosted in WWT's labs, engineering a true apples-to-apples Ethernet/IB comparison has meant duplicating a complex InfiniBand infrastructure on Ethernet, hop-by-hop, optic-by-optic, nerd-knob by nerd-knob. The environment was so customized that the results were largely only relevant to that exact build and its configuration. So, while we could absolutely say that Ethernet/RoCE was faster than InfiniBand, it only held true for those specific environments and the circumstances we tested.

Ethernet vs InfiniBand

Comparing the two "by the numbers," with attention to their differentiating factors:

Max Bandwidth
  • Ethernet: 800 Gbps
  • InfiniBand: 800 Gbps

MTU
  • Ethernet: 9216 bytes (NOTE: RDMA is optimized for 4096-byte payloads, so larger frames will not necessarily result in enhanced performance; see the sketch following this table)
  • InfiniBand: 4096 bytes

Layer 3 Support
  • Ethernet: Yes
  • InfiniBand: No

Delivery
  • Ethernet: Best effort, enhanced to lossless
  • InfiniBand: Lossless

Load Balancing
  • Ethernet: Hash values
  • InfiniBand: Deterministic (NCCL)

RDMA Support
  • Ethernet: RoCEv2
  • InfiniBand: Native

Enhancements
  • Ethernet: Dynamic Load Balancing, Weighted ECMP, VOQ, Disaggregated Scheduled Fabric (DSF), Adaptive Routing, EtherLink, Performance Isolation, DDP
  • InfiniBand: Adaptive Routing, SHARP

Pros
  • Ethernet: Handles multi-workload fabrics (i.e., several different AIs with varying requirements); easily adapted skillset for existing network engineers
  • InfiniBand: Simple to install; self-optimizing

Cons
  • Ethernet: At present, requires a few QoS modifications to optimize performance
  • InfiniBand: Rare skillset; operationally difficult to support when something goes wrong
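To illustrate the MTU note above: because the RDMA path MTU tops out at 4096 bytes, a jumbo Ethernet MTU mostly buys headroom for headers rather than larger payloads. The arithmetic below is a back-of-the-envelope sketch using standard RoCEv2 header sizes (no VLAN tag, preamble or inter-packet gap), not a measurement from the lab.

```python
# Back-of-the-envelope wire efficiency for a RoCEv2 packet carrying a
# 4096-byte RDMA payload. Header sizes are the standard stack; VLAN tags
# or additional encapsulation would add a few bytes more.
payload = 4096   # RDMA path MTU: maximum payload per packet
eth     = 14     # Ethernet header
ipv4    = 20     # IPv4 header
udp     = 8      # UDP header (RoCEv2 rides over UDP port 4791)
bth     = 12     # InfiniBand Base Transport Header
icrc    = 4      # invariant CRC
fcs     = 4      # Ethernet frame check sequence

frame = payload + eth + ipv4 + udp + bth + icrc + fcs
print(f"Bytes on the wire: {frame}")               # 4158
print(f"Payload efficiency: {payload/frame:.1%}")  # ~98.5%
# A 9216-byte switch MTU cannot push this figure higher, because the RDMA
# payload per packet is still capped at 4096 bytes.
```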

The question remains: how to test them in a way that broadly applies?

Test

WWT recently conducted a series of independent tests designed to eliminate all variables except for network transport. The raw metrics in these tests were expected to be worse than other publicly available numbers precisely because many performance-optimizing features were disabled to position the network transport as the central component. 

While reflective of smaller-scale rail-optimized and rail-only designs, the intent of these tests was to compare the performance profile of RoCEv2 and its enabling features (PFC, ECN) against InfiniBand's natively scheduled fabric, holding all other variables equal. That's it.
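For readers less familiar with those enabling features, the toy model below shows the basic feedback loop they provide: once an egress queue crosses a marking threshold, ECN-marked packets cause the RoCEv2 sender to back off, keeping the queue shallow without drops, while PFC remains the last-resort pause mechanism. All values are invented for illustration; they are not the DCQCN parameters or switch thresholds used in these tests.

```python
# Simplified, illustrative model of ECN-driven rate control on a RoCEv2 flow.
# Abstract units; not the lab's actual configuration.
LINK_CAPACITY  = 100   # amount the egress port drains per tick
MARK_THRESHOLD = 200   # queue depth at which ECN marking begins

rate, queue = 140, 0   # sender initially offers more than the link can drain

for tick in range(12):
    queue = max(0, queue + rate - LINK_CAPACITY)
    if queue > MARK_THRESHOLD:
        rate = int(rate * 0.7)        # marked packets: sender backs off
    else:
        rate = min(140, rate + 5)     # no marks: sender probes back up
    print(f"tick {tick:2d}  queue={queue:4d}  offered rate={rate:4d}")
```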

Equipment

HARDWARE                    FUNCTION
(2) 8-Way Compute Nodes     Compute
H100 GPU                    Accelerator
NVIDIA Quantum 9700 NDR     Network (InfiniBand)
Arista 7060DX5-64S          Network (Ethernet)
Cisco Nexus 9332D-GX2B      Network (Ethernet)

Setup

Phase 1

For Phase 1, a single-switch network was deployed, representing the ideal minimum-variable scenario. 

Figure 1: Phase 1 Topology

Methodology

Testing made use of industry-standard MLCommons benchmarks, specifically the MLPerf Training and MLPerf Inference: Datacenter problem sets. These enabled an apples-to-apples analysis of how network transport affects generative and inference AI performance.

  • Each selected benchmark test was run for each network solution and OEM, with the end results compared as an average
  • Individual OEM results are masked to avoid complications of "whose network is better"
  • Ethernet was minimally optimized, with only basic PFC and ECN switch configurations used in accordance with industry best practices
  • Performance-enhancing features on the compute node (notably NVLink) were disabled
    • The intent was to force all GPU-GPU traffic out of the server and onto the network. Performance optimized? No. However, it allowed us to observe exclusively how the network contributed to performance and then directly compare the differences.
  • NCCL was modified between IB and Ethernet tests to whitelist compute NICs (a requirement for Ethernet functionality; see the configuration sketch below)
  • The same physical optical cables were used for all Ethernet tests
  • The same physical third-party optics were leveraged across systems 
  • Storage was local to the compute node

In short, every variable not related to network transport was removed. 
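For reference, the sketch below shows the kind of NCCL environment settings these bullets describe. The variable names are real NCCL knobs, but the HCA and interface names are placeholders, and the exact set of values used in the lab may have differed.

```python
import os

# Hypothetical NCCL settings of the kind described above; device and
# interface names are placeholders, not the lab's actual values.

# Push GPU-to-GPU traffic off NVLink/PCIe peer-to-peer and shared memory
# so that every transfer leaves the server through the NIC.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"

# Whitelist the compute NICs NCCL may use (the Ethernet-side requirement
# mentioned above).
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"

# RoCEv2 specifics: select the GID index that maps to the routable RoCEv2
# address and the interfaces used for bootstrap traffic.
os.environ["NCCL_IB_GID_INDEX"] = "3"
os.environ["NCCL_SOCKET_IFNAME"] = "eth1,eth2"

# The benchmark is then launched from this environment (e.g., via torchrun).
```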

Results

BENCHMARK           MODEL              ETHERNET     INFINIBAND   ETH/IB RATIO
MLPerf Training     BERT-Large         10,886 s     10,951 s     0.9977
MLPerf Inference    LLAMA2-70B-99.9    52.362 s     52.003 s     1.0166

Performance ratios were expressed in terms of Ethernet / InfiniBand (i.e., a longer Ethernet completion time will be reflected as a ratio greater than 1).
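Restating that convention with invented numbers (not values from the table above):

```python
def eth_ib_ratio(ethernet_seconds: float, infiniband_seconds: float) -> float:
    """Ratio > 1 means the Ethernet run took longer (was slower) than InfiniBand."""
    return ethernet_seconds / infiniband_seconds

# Illustrative values only.
print(eth_ib_ratio(101.66, 100.0))  # 1.0166 -> Ethernet ~1.66% slower
print(eth_ib_ratio(99.77, 100.0))   # 0.9977 -> Ethernet ~0.23% faster
```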

Observations

  • Across generative tests and OEMs, the performance delta between InfiniBand and Ethernet was statistically insignificant (less than 0.03 percent)
  • Ethernet was faster than InfiniBand's best time in three out of nine generative tests (although the margin was only by a few seconds)
  • In inference tests, Ethernet averaged 1.66 percent slower (an ETH/IB ratio of 1.0166)

Conclusions

  • In the evaluations discussed above, InfiniBand and unoptimized Ethernet are statistically neck-and-neck. 
  • It is understood that performance differentials will emerge in larger networks, but it has been observed in other laboratory environments that the performance gap is generally under 5 percent.
  • Introduction of current and pending optimization features (e.g., Ultra Ethernet) will substantially improve Ethernet performance.
  • In larger, more complex multivariate tests that weren't part of this particular evaluation (i.e., the "bespoke" customer POCs run in WWT's Advanced Technology Center), Ethernet has been observed to sometimes outperform InfiniBand by a sizeable margin, especially when there was packet size variance and multiple AIs sharing the same fabric.
  • In published case studies of large-cluster performance on Ethernet (e.g., Meta's LLAMA2 training on a 2,000-GPU Ethernet cluster and LLAMA3 training on a 24,000-GPU Ethernet cluster), performance between Ethernet and IB was at parity.

 


Connecting the dots between small-scale tests, complex multivariate POCs, BasePod production environments and industry case studies:

WWT views Ethernet as a wholly viable alternative to InfiniBand for most generative and inference use cases.

Caveat

Complications encountered in larger clusters that were not addressed by this test include (but are not limited to):

  • Elephant Flows
  • Multiple workload / "Noisy Neighbor" resource contention
  • Transient Oversubscription
  • Incast Oversubscription
  • Imperfect Load-Balancing

A thorough investigation of use cases and the technology mix that will best support them is strongly recommended prior to deployment.

How WWT can help

We have answered the fundamental question, but that does not mean that we have answered every question. 

A complex ecosystem of GPU, DPU, storage and specific use cases still needs to be tested. As such, future tests will be run over a more conventional spine/leaf non-blocking architecture with multiple 8-way compute nodes.

In iterative tests, Ethernet-enhancing features (including Ultra Ethernet modifications, ECMP entropy improvements, flowlets, packet spraying, network and NIC reordering, etc.) will be examined.
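To make one of those knobs concrete: classic ECMP hashes a flow's 5-tuple once and pins every packet of that flow to a single uplink, which is why a handful of long-lived RDMA flows can load paths unevenly. The toy model below (invented hash, addresses and uplink count) shows how adding entropy, for example by varying the UDP source port per flowlet or per queue pair, spreads the same endpoints across multiple paths.

```python
import hashlib

UPLINKS = 4  # toy fabric with four equal-cost paths

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Toy ECMP: hash the 5-tuple and map the flow onto one uplink."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % UPLINKS

# A single RoCEv2 flow (fixed 5-tuple, UDP destination port 4791) always
# lands on the same uplink, no matter how much data it carries.
print(pick_uplink("10.0.0.1", "10.0.0.2", 50000, 4791))

# Entropy-style spreading: varying the UDP source port per flowlet or
# queue pair lets the same pair of endpoints use several uplinks.
print({pick_uplink("10.0.0.1", "10.0.0.2", 50000 + i, 4791) for i in range(16)})
```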

 

Figure 2: Phase 2 Topology

World Wide Technology has over 10 years of experience in the design and implementation of Big Data and AI/ML solutions. In late 2023, WWT announced a three-year, $500 million investment in the creation of a unique AI Proving Ground (AIPG). The AIPG provides an ecosystem of best-of-breed hardware, software and architecture where customers can answer pressing questions in AI infrastructure and design. If a customer wants to gauge LLM training times with Cisco RoCE, AMD GPU and NetApp storage (for example) against an equivalent NVIDIA InfiniBand/NVIDIA GPU/Pure Storage mix, this is the only lab on the planet where these on-demand mixes of hardware are available.

 

Figure 3: AIPG Logical

References

Data Center AI Networking, 650 Group (2024) https://650group.com/press-releases/data-center-ai-networking-to-surge-to-nearly-20b-in-2025-according-to-650-group/

The Basics of High Performance Networking, WWT (2024) https://www.wwt.com/article/the-basics-of-high-performance-networking

MLCommons (2024) http://www.mlcommons.org

Meta AI, Meta (2024) http://ai.meta.com

AI/ML Datacenter Networking Blueprint, Cisco (2024) https://www.cisco.com/c/en/us/td/docs/dcn/whitepapers/cisco-data-center-networking-blueprint-for-ai-ml-applications.html

AI Networking, Arista (2024) https://www.arista.com/assets/data/pdf/Whitepapers/AI-Network-WP.pdf
