Introduction to Arista's AI/ML GPU Networking Solution
AI workloads are characterized by their significant data and computational intensity. A typical AI training job involves billions of parameters and large sparse matrix operations distributed across an array of processors – CPUs, GPUs, or TPUs. These processors perform intensive computation and then exchange data with their peers. After the exchange, peer data is merged with local data, triggering another round of processing. Within this iterative compute-exchange cycle, an estimated 20-50 percent of a job's time is spent on inter-network communication, so network bottlenecks have a pronounced impact on Job Completion Time (JCT).
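To make that JCT sensitivity concrete, here is a minimal back-of-the-envelope model (a simplification that ignores any overlap between compute and communication): if 30 percent of a job's time is spent on the network and congestion halves the effective network throughput, the job finishes roughly 30 percent later.

```python
# Back-of-the-envelope JCT sensitivity: when a fraction of job time is spent in
# inter-GPU communication, any network slowdown inflates Job Completion Time.
def jct_with_network_slowdown(baseline_jct_hours, comm_fraction, slowdown_factor):
    """comm_fraction: share of JCT spent communicating (e.g. 0.2-0.5).
    slowdown_factor: how much slower the communication phase runs (e.g. 2.0)."""
    compute = baseline_jct_hours * (1 - comm_fraction)
    comm = baseline_jct_hours * comm_fraction * slowdown_factor
    return compute + comm

# A 100-hour job that spends 30% of its time on the network and hits a 2x
# communication slowdown finishes in 130 hours -- a 30% longer JCT.
print(jct_with_network_slowdown(100, 0.30, 2.0))  # 130.0
```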
To minimize JCT, data center networks must provide high-bandwidth, low-latency, scalable connectivity for GPU servers. Traditional network architectures based on hierarchical switching and oversubscription cannot meet these requirements, especially for large-scale AI training workloads that involve massive data transfers and synchronization among GPU servers. Investment in new network designs and technologies is therefore needed to enable efficient and cost-effective AI training in data centers.
We will examine how the world's biggest networking companies build their validated AI/ML designs.
As GPUs get more powerful and AI/ML training becomes increasingly critical to the business, intra-node (single-node) GPU PCIe speeds will ramp up quickly. Today, a top-of-the-line NVIDIA GPU can easily burst to 400Gbps on the NIC during synchronization. Once PCIe 6 and 7 become mainstream, GPUs will push 800Gbps, so GPU node networks will differ significantly from traditional Ethernet as most network engineers know it. Look for 800Gbps and 1.6Tbps fabrics in the next two years, along with other innovations to deliver the lowest latency and reduce Job Completion Time (JCT).
Please note we have purposely kept specific OEM model numbers out of the discussion. Given the breakneck pace of OEM 400 and 800Gbps development, some switches may be obsolete within 12 months for the fastest GPU-to-GPU communication. All OEM designs use spine/leaf topologies and support RoCE and advanced queueing that combines ECN and PFC to create a DCQCN-based, non-blocking, lossless fabric. Please get in touch with WWT sales for help planning your validated OEM solution so we can guide you through the latest OEM solutions we have validated in our new state-of-the-art WWT AI Labs!
Arista AI/ML networks
Modern AI applications need a high-bandwidth, lossless, low-latency, scalable, multi-tenant network that can interconnect hundreds to thousands of GPUs at 100Gbps, 400Gbps, 800Gbps, and beyond. With support for Data Center Quantized Congestion Notification (DCQCN), Priority Quality of Service (QoS), and adjustable buffer allocation schemes, Arista's EOS Network Operating System provides all the necessary tools to deliver a premium lossless, high-bandwidth, low-latency network.
Through its support of DCQCN, EOS provides an end-to-end congestion control scheme that combines Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) to support RDMA over Ethernet. Configuring appropriate PFC and ECN thresholds can be challenging without visibility into network traffic and buffer utilization. Arista EOS® (Extensible Operating System) offers in-depth visibility into workload traffic patterns through its AI Analyzer and Latency Analyzer features.
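As a rough illustration of why those thresholds matter, the sketch below converts a target queuing-latency budget into an ECN queue-depth threshold for a given port speed. This is a simplified rule of thumb, not Arista guidance; in practice the values should be derived from AI Analyzer and Latency Analyzer observations, and the per-platform EOS CLI syntax for applying them is not shown here.

```python
# Rough ECN threshold estimate: convert a target queuing-latency budget into a
# queue depth in kilobytes for a given port speed. Illustrative only; real DCQCN
# tuning should be driven by observed traffic and buffer utilization.
def ecn_threshold_kbytes(port_speed_gbps, target_latency_us):
    bytes_per_us = port_speed_gbps * 1e9 / 8 / 1e6   # bytes drained per microsecond
    return bytes_per_us * target_latency_us / 1024    # queue depth in KB

# Example: on a 400G port, a 10 us queuing budget corresponds to roughly 488 KB,
# a plausible starting point for an ECN marking threshold before PFC engages.
print(round(ecn_threshold_kbytes(400, 10)))  # ~488
```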
Arista's AI Analyzer
Arista's AI Analyzer monitors interface traffic counters at microsecond intervals, while the Latency Analyzer tracks interface congestion and queuing latency with real-time reporting. Together, AI Analyzer and Latency Analyzer correlate application performance with network utilization and congestion events, allowing PFC and ECN values to be tuned to best suit the application's requirements. EOS uses real-time traffic utilization of the network links to balance flows uniformly across them, avoiding network hotspots. EOS also offers source-interface-based hashing to prevent traffic slowdowns in non-oversubscribed networks: traffic flows arriving on host interfaces can be hashed directly to designated uplinks, avoiding traffic fan-in and collisions.
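For readers who want a lightweight approximation of this kind of visibility, the sketch below polls egress byte counters over the EOS eAPI (JSON-RPC over HTTPS) and derives interface utilization. The hostname, credentials, and JSON field names are assumptions to adapt to your environment; AI Analyzer and LANZ provide far finer-grained, buffer-aware data than a simple counter poll can.

```python
# Minimal sketch: poll egress byte counters over EOS eAPI at short intervals and
# derive per-interface utilization. Field names such as "outOctets" are assumed
# and may differ across EOS versions.
import time
import requests

EAPI_URL = "https://leaf1.example.com/command-api"   # hypothetical switch
AUTH = ("admin", "password")                          # replace with real credentials

def run_cmds(cmds):
    payload = {"jsonrpc": "2.0", "method": "runCmds", "id": 1,
               "params": {"version": 1, "cmds": cmds, "format": "json"}}
    resp = requests.post(EAPI_URL, json=payload, auth=AUTH, verify=False)
    return resp.json()["result"]

def egress_utilization(interface, speed_gbps, interval_s=1.0):
    def out_octets():
        counters = run_cmds(["show interfaces counters"])[0]
        return counters["interfaces"][interface]["outOctets"]   # field name assumed
    first = out_octets()
    time.sleep(interval_s)
    delta_bits = (out_octets() - first) * 8
    return 100.0 * (delta_bits / interval_s) / (speed_gbps * 1e9)

print(f"Ethernet1 egress: {egress_utilization('Ethernet1', 400):.1f}% utilized")
```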
Arista AI Agent
The Arista AI Agent is a pivotal system component that facilitates seamless communication and configuration coordination between the network and the host to enhance AI clusters. It is an EOS-based agent from Arista that extends the capabilities of EOS on Arista switches to NICs and servers, providing centralized control and visibility for an AI Data Center. The agent, either running on an NVIDIA BlueField-3 SuperNIC or on the server itself, enables the network switch's EOS to manage, track, and troubleshoot server network issues, ensuring consistent network setup and quality of service. The synergy between Arista's top-tier networking solutions and NVIDIA's computing platforms, including SuperNICs, fosters a harmonized AI Data Center environment. Extending EOS to host machines via remote AI agents is poised to address the pressing challenge of scaling AI clusters by providing a single management point for overseeing AI system health and performance, so AI clusters can be managed more efficiently as a unified entity. Arista's goal is to boost communication efficiency across the network and GPU fabrics, accelerating job completion through synchronized orchestration and monitoring of NVIDIA's computing resources and Arista's networking infrastructure.
The latest demonstration of this technology showcases the Arista EOS-based remote AI agent's ability to manage an interdependent AI cluster as a singular, integrated solution. With EOS operational on the network, it can now reach out to servers or SuperNICs through remote AI agents, allowing for immediate detection and documentation of performance issues or failures, facilitating quick isolation, and minimizing impact. By extending EOS to SuperNICs and servers, the remote AI agent enhances the coordinated optimization of quality of service throughout the AI Data Center, aiming to decrease the time it takes to complete jobs. Customer trials are expected in 2H 2024.
Arista's Etherlink
Arista's Etherlink collection offers high-performance, standards-compliant Ethernet solutions enhanced for AI networking. It boasts advanced load balancing and congestion management tools, such as the RDMA Aware QoS, which ensures consistent packet delivery for RoCE-compatible NICs. Additionally, the Etherlink suite presents the AI Analyzer, which streamlines cluster deployment, enhances operational consistency and grants comprehensive insights through AVA machine learning integration. The Etherlink functionalities extend over a wide array of 800G systems and line cards built on the Arista EOSⓇ platform and are designed to be forward-compatible with UEC technologies.
Platforms
The bandwidth and scale requirements for AI networks will vary from customer to customer and application to application. One size does not fit all. By leveraging industry-leading Ethernet chips such as Tomahawk 5 and Jericho3-AI, Arista provides the ideal accelerator-agnostic solution for AI clusters of any shape or size, outperforming proprietary technologies and providing flexible options for fixed, modular, and distributed switching platforms.
Arista AI Leafs
The Arista leaf switches deliver high-density 200G and 400G server-facing ports optimized for Hyperscale Cloud, Artificial Intelligence, and Machine Learning environments and for high network radix. The interconnecting spines are modular systems built on a 25.6Tbps high-capacity packet processor for data-intensive workloads requiring consistently low latency. They offer a flexible choice of industry-standard interfaces along with significant improvements in power consumption and system density.
Arista AI Spine
The Arista 7800R3 Spine Series of purpose-built modular switches uses Broadcom Jericho2C+ chips to deliver the industry's highest performance. It scales to 460 Tbps of system throughput to meet the needs of the largest-scale data centers and high-performance computing networks. The Arista Spine Series delivers a non-blocking switching capacity that enables dramatically faster and simpler network designs for data centers while lowering capital and operational expenses.
The spines have key characteristics that make them an ideal platform for AI networking. Virtual Output Queuing (VoQ): a distributed scheduling mechanism within the switch ensures fairness among traffic flows contending for access to a congested output port. A credit request/grant loop is used, and packets are queued in physical buffers on ingress packet processors within VoQs until the egress packet scheduler issues a credit grant for a given input packet, which reduces bottlenecks and hotspots.
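The toy model below illustrates that credit request/grant idea in simplified form: packets wait in per-ingress VOQs and are released only when the egress scheduler grants credit, so contending ingress line cards share a congested output port fairly. All class and variable names are hypothetical, and real schedulers also handle cell segmentation and priorities.

```python
# Toy model of Virtual Output Queuing: packets wait in per-(ingress, egress) queues
# and are released only when the egress scheduler grants credit, so a congested
# output port is shared fairly across ingress packet processors.
from collections import deque

class EgressScheduler:
    """Grants one credit per cycle, rotating fairly across contending VOQs."""
    def __init__(self):
        self.next_voq = 0

    def grant(self, voqs):
        for offset in range(len(voqs)):
            idx = (self.next_voq + offset) % len(voqs)
            if voqs[idx]:
                self.next_voq = idx + 1        # resume after the VOQ just served
                return voqs[idx].popleft()
        return None

# Two ingress line cards each queue traffic toward the same congested egress port.
voq_from_lc1 = deque(["lc1-pkt1", "lc1-pkt2"])
voq_from_lc2 = deque(["lc2-pkt1"])
sched = EgressScheduler()
for cycle in range(3):
    print(f"cycle {cycle}: credit granted to {sched.grant([voq_from_lc1, voq_from_lc2])}")
# cycle 0: lc1-pkt1, cycle 1: lc2-pkt1, cycle 2: lc1-pkt2 -- each ingress gets a fair share.
```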
The Arista 7800R3 Series represents the third generation of Arista's R-Series universal spine switches, designed for the next wave of large-scale virtualized and cloud networks. These modular switches boast high-density 400G and 800G ports, combined with internet-scale table sizes and extensive Layer 2 and Layer 3 features. They are engineered to support a wide array of advanced network monitoring and virtualization capabilities, ensuring robust investment protection and a seamless migration path for network infrastructure.
Arista's AI/ML-capable switches use a cell-based fabric: a cell-based fabric takes every packet, breaks it into evenly sized cells, and "sprays" them evenly across all fabric modules. This spraying action has several positive attributes, making for a very efficient internal switching fabric with an even flow balance to each forwarding engine. Cell-based fabrics are considered 100% efficient, irrespective of the traffic pattern.
The cell fabric's spraying behavior equips it to handle varying speeds easily. It operates independently of front-panel connection speeds, allowing seamless integration of 200G, 400G, and 800G without issue. Its design also prevents the 'flow collision' issues typical of Ethernet fabrics: since traffic is distributed across all available paths, internal congestion is avoided, making it particularly adept at handling the substantial 'elephant flow' traffic of AI/ML applications.
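A minimal sketch of the spraying idea, with illustrative cell size and plane count: each packet is segmented into fixed-size cells that are distributed round-robin across the fabric planes, so even a single elephant flow loads every plane equally.

```python
# Minimal sketch of cell spraying: segment each packet into fixed-size cells and
# distribute them round-robin across all fabric planes, so every plane carries an
# even share regardless of flow sizes (no elephant flow pinned to one path).
from itertools import cycle

CELL_SIZE = 256          # bytes per cell (illustrative; real cell sizes differ)
NUM_FABRIC_PLANES = 4    # illustrative plane count

def spray(packet_bytes, plane_iter):
    cells = [packet_bytes[i:i + CELL_SIZE] for i in range(0, len(packet_bytes), CELL_SIZE)]
    return [(next(plane_iter), cell) for cell in cells]

planes = cycle(range(NUM_FABRIC_PLANES))
elephant = bytes(4096)   # one large packet
load = [0] * NUM_FABRIC_PLANES
for plane, cell in spray(elephant, planes):
    load[plane] += len(cell)
print(load)              # every plane carries an equal 1024 bytes of this packet
```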
Density: The Arista Spine Series switches using the 7800R3 Broadcom Jericho2C+ line cards are available in a choice of 4-, 8-, 12-, and 16-slot systems that support a rich range of line cards providing high-density 100G and 400G with a selection of forwarding table scales. At the largest system scale, the 16-slot Arista 7816R3 scales to 460 Tbps and enables 576 x 400G ports in a 32 RU front-to-rear power-efficient form factor, providing industry-leading performance and density without compromising on features and functionality.
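A quick sanity check of those headline numbers, assuming the system-throughput figure counts both directions of each port (as full-duplex capacity figures typically do):

```python
# 576 x 400G ports is 230.4 Tbps in each direction, i.e. roughly the quoted
# 460 Tbps of system throughput when counted full duplex.
ports, speed_gbps = 576, 400
print(ports * speed_gbps / 1000)        # 230.4 Tbps per direction
print(2 * ports * speed_gbps / 1000)    # 460.8 Tbps full duplex
```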
Design based on AI app size
Spine/leaf IP fabrics have proven to provide consistent performance that scales to support extreme 'East-West' traffic patterns. Customers have built small to large data center cloud networks using IP/Ethernet to support modern application and network requirements. Historically, AI/ML applications could coexist in the IP fabric with other applications. However, due to the significant growth in AI/ML applications and the complexity of adopting special-purpose GPUs, DPUs, and TPUs, we recommend designing a dedicated network for these applications. This allows operators to tune the network to better handle the unique traffic patterns that come with modern AI/ML workloads.
Small AI applications
A pair of Arista leaf switches with 64 x 400G or 128 x 200G ports can effectively interconnect GPUs across a few racks. In this design, each GPU can communicate with all other GPUs in a non-blocking configuration at predictably low latency. This option requires minimal tuning, simplifying operations and management. Growth is supported by adding Arista spines and more leafs.
Moderate AI applications
A pair of Arista chassis-based switches supporting 576 x 400G ports can act as a simple, out-of-the-box AI interconnect to support moderate-sized AI applications. Because this design provides a consistent, single hop between end hosts, it drives down latency and power requirements. With their cell-based, non-blocking VOQ architecture, the Arista spines enable an extensive, non-blocking, lossless network without any configuration or tuning. A single-hop solution means ECN and PFC configurations are required only on the host-facing ports, allowing GPUs to send and receive line-rate data at all times.
Large AI applications
For large-scale AI applications requiring tens of thousands of GPUs to be connected in data centers, Ethernet becomes the most viable option. Arista's Universal Leaf and Spine design offers the simplest, most flexible, and most scalable architecture to support AI workloads at data center scale. This design allows more than 18,000 x 400G end hosts to be interconnected while keeping latency predictable and low. In such a design, Arista EOS' intelligent load-balancing capabilities, which consider real-time network traffic utilization to distribute traffic flows uniformly, can be leveraged to avoid flow collisions. Arista EOS' advanced telemetry options, like AI Analyzer and Latency Analyzer, make it simple for network operators to determine optimal PFC and ECN configuration thresholds, allowing GPUs to exchange line-rate throughput across the network while preventing packet drops.
The Universal Leaf and Spine design provides an ideal solution for AI models requiring a few hundred GPUs. It offers the flexibility to scale out to tens of thousands of GPUs in the future with consistent performance.
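The sketch below shows the sizing arithmetic behind that scale figure for a non-blocking (1:1) two-tier leaf/spine design, assuming 64-port 400G leafs and 576-port spines as illustrative port counts.

```python
# Two-tier leaf/spine sizing: with a 1:1 (non-blocking) ratio, half of each leaf's
# ports face GPUs and half face spines; each spine port terminates one leaf uplink.
def max_400g_hosts(leaf_ports=64, spine_ports=576):
    host_ports_per_leaf = leaf_ports // 2        # 1:1 host:uplink ratio
    max_leaves = spine_ports                     # one uplink from every leaf to each spine
    return max_leaves * host_ports_per_leaf

# 64-port 400G leaves behind 576-port spines support 576 * 32 = 18,432 hosts,
# which lines up with the "more than 18,000 x 400G end hosts" figure above.
print(max_400g_hosts())   # 18432
```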
Storage for AI networks
The amount of data used by AI has grown exponentially as businesses attempt to improve the accuracy of their models. In the training phase, large datasets are needed to improve model accuracy, so organizations are forced to manage massive data collections, starting at petabytes. This puts considerable strain on the network that handles data transfer between the GPUs and the storage nodes. A dedicated storage network is recommended to keep expensive, in-demand GPUs from idling while they wait for data due to network bottlenecks. Most GPUs support a direct data path between their memory and remote storage using RDMA and NVMe-oF, allowing efficient data movement.
Using Arista's Cloud Vision Portal (CVP) and AI Analyzer to build and monitor your AI network
Arista Cloud Vision consists of the following components:
- CVP – Cloud Vision Portal is GUI-driven access to the entire network's provisioning, state, and orchestration.
- Cloud Vision Builder is an add-on under CVP that automatically creates our entire EVPN configuration and applies it to the fabric. We can also add and remove VLANs and VRFs and perform other day-two operations.
- Cloud Vision IPAM is an add-on to CVP for managing the IP addressing used in the underlay, such as SVI addresses and point-to-point (P-P) links.
- CVX – Cloud Vision Exchange is an integration point for the underlay and network services
- CloudVision WiFi – a user- and application-driven AI/ML platform for provisioning, automation, and management of WiFi devices
Arista's CVP provides an automated way to provision changes to an Arista device, such as for EVPN, routing/switching, IPsec, and MPLS. At the core of this automation is the use of configlets. Configlets can be as simple as the basic management commands needed to make a device reachable or as complicated as an entire VXLAN EVPN fabric.
Configlets can be applied to or removed from a device, and new configlets can be added to provide additional features and functionality. Once a configlet is added to or removed from a device in the Network Provisioning window, CVP compares the device's running configuration to the desired configuration (the combination of one or many configlets applied to the device). If the desired configuration differs from the running configuration, a task is created and the device is marked in orange with a "T" in the Network Provisioning view. A change control is then created to apply the desired configuration.
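Conceptually, that comparison works like the sketch below: the desired configuration is the ordered combination of the configlets attached to a device, and any difference from the running configuration produces a pending task. The function names are illustrative; this is a model of the behavior described above, not the CVP API.

```python
# Conceptual model of CVP's task logic: combine attached configlets into the
# desired config, diff it against the running config, and flag a pending task
# (the orange "T") when they differ. Names and configs here are illustrative.
import difflib

def designed_config(configlets):
    """Combine attached configlets, in order, into the device's intended config."""
    return "\n".join(configlets)

def needs_task(running_config, configlets):
    desired = designed_config(configlets)
    diff = list(difflib.unified_diff(running_config.splitlines(),
                                     desired.splitlines(), lineterm=""))
    return (len(diff) > 0), diff

running = "hostname leaf1\ninterface Ethernet1\n   no shutdown"
configlets = ["hostname leaf1", "interface Ethernet1\n   mtu 9214\n   no shutdown"]
pending, changes = needs_task(running, configlets)
print("task created" if pending else "in sync")
```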
Arista AI Analyzer tool
Arista's new AI Analyzer tool allows users to examine switch flows to determine how well ECMP load balancing is working. The user can then modify the queues and controls for ECMP.
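As a simplified illustration of the kind of signal involved, the sketch below computes how far each ECMP uplink deviates from a perfectly even share of traffic. The byte counts are made up; in practice they would come from the switch (for example via eAPI or AI Analyzer itself).

```python
# Simple ECMP imbalance metric: compare traffic carried by each uplink against a
# perfectly even split. Counter values below are illustrative only.
def ecmp_imbalance(bytes_per_uplink):
    total = sum(bytes_per_uplink)
    ideal = total / len(bytes_per_uplink)
    return max(abs(b - ideal) / ideal for b in bytes_per_uplink)

# Two elephant flows hashed onto the same uplink show up as a large deviation.
uplinks = {"Ethernet49/1": 9.6e12, "Ethernet50/1": 2.1e12,
           "Ethernet51/1": 2.2e12, "Ethernet52/1": 2.0e12}
print(f"worst-case deviation from even: {ecmp_imbalance(list(uplinks.values())):.0%}")
```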
LANZ Data Collection
Arista Latency Analyzer (LANZ) is an integrated feature of EOS. LANZ provides precise real-time monitoring of microburst and congestion events before they impact applications, with the ability to identify the sources and capture affected traffic for analysis. Advanced analytics features include buffer monitoring with configurable thresholds, in-band path and latency monitoring, event-driven trace packets, and granular time stamping.
With LANZ, the network operations teams and administrators will have more visibility than ever to determine if 'microbursts' are occurring. With sub-millisecond reporting intervals, congestion can be detected, and application-layer messages can be sent faster than some products can forward a packet.
To ensure maximum versatility in data consumption, LANZ presents data in various open standard formats for both real-time and historical usage:
- CLI Output: Instantaneous and continuous congestion data is accessible via the switch Command Line Interface (CLI), facilitating quick and easy analysis by network administrators.
- Syslog Messaging: LANZ generates Syslog messages upon exceeding queue thresholds, enabling automatic alerts for congestion events.
- CSV Format: Congestion data is stored in CSV format for historical trending and third-party analysis, allowing storage on flash, USB, SSD, or external file systems (FTP, TFTP, NFS); a parsing sketch follows this list.
- Congestion Data Stream: Real-time visibility of congestion events is provided for proactive monitoring and potential capacity management. Congestion events can be streamed in real-time to external third-party monitoring tools using the industry-standard Google Protocol Buffers (GPB) format.
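As an example of consuming the CSV export mentioned above, the sketch below flags interfaces whose reported queue depth crosses a threshold. The column names are assumptions about the export layout and should be adjusted to match the actual LANZ CSV produced by your EOS release.

```python
# Minimal sketch: scan a LANZ CSV export and flag queue-depth records above an
# alerting threshold. Column names ("timestamp", "interface", "queueLength") are
# assumed; verify them against the file produced by your switches.
import csv

QUEUE_ALERT_BYTES = 500_000   # illustrative alerting threshold

def congested_ports(csv_path):
    hits = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            if int(row["queueLength"]) > QUEUE_ALERT_BYTES:        # assumed column
                hits.append((row["timestamp"], row["interface"]))  # assumed columns
    return hits

for ts, intf in congested_ports("lanz_congestion.csv"):
    print(f"{ts}: microburst on {intf}")
```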
Conclusion
Arista's IP/Ethernet switches are at the forefront of powering AI/ML workloads, offering an optimal solution for GPU and Storage interconnects. The surge in AI applications necessitates a standardized transport like Ethernet to forge an energy-efficient interconnect while simplifying the complexities of scaling and administration inherent in traditional methods. Implementing an IP/Ethernet framework with Arista's high-caliber switches elevates application throughput and streamlines network management. Integrating the 7800R3 AI spine & 7060 AI leaf switches and EOS innovations presents a superior option for contemporary AI applications.