Understanding Data Center Quantized Congestion Notification (DCQCN)
Introduction
In modern data centers, especially those dedicated to high-performance computing (HPC) and AI clusters, swift data throughput with minimal delay is paramount. Traditional TCP/IP stacks fall short of meeting these demands due to their significant CPU overhead. RoCEv2 (RDMA over Converged Ethernet version 2) addresses this. Unlike InfiniBand, RoCEv2 integrates seamlessly with existing Ethernet infrastructures, offering a cost-effective solution while enhancing flexibility.
RoCEv2 establishes a lossless network environment essential for the robust performance of HPC and AI clusters. This is achieved by incorporating advanced features like Priority Flow Control (PFC) and Explicit Congestion Notification (ECN), which are unified in the DCQCN protocol. PFC manages data flow at the interface level, issuing Pause Frames to halt packet transmission when buffer thresholds are exceeded, preventing data overflow. However, PFC's limitations are evident as persistent Pause Frames can lead to network congestion and packet loss, adversely affecting network and application performance. This is due to head-of-line blocking, which creates bottlenecked 'victim flows' and the 'parking lot problem,' which results in unfair bandwidth distribution.
ECN, on the other hand, serves as a more granular end-to-end congestion management technique, signaling the sender to reduce data transmission rates upon detecting potential congestion. DCQCN leverages ECN's preemptive capabilities to mitigate congestion before PFC activation becomes necessary. The strategic operation of DCQCN hinges on a delicate balance: it must prevent premature PFC activation, thereby allowing ECN to effectively manage network congestion through early detection and sender rate adjustment. This balance is crucial for maintaining an efficient, high-performance network that supports the demanding workloads of HPC and AI applications.
RDMA over TCP is designed to leverage TCP's inherent reliability to ensure data is transferred without loss. Conversely, a lossless network utilizing RoCEv2 depends on flow control mechanisms to maintain its losslessness. The flow control methods discussed herein predominantly pertain to those implemented in RoCEv2 networks.
For Ethernet to be lossless, three key metrics must be optimized: low latency, zero packet loss, and high throughput. The principles behind the implementation are as follows:
- Server-side processing latency is diminished by RDMA technology, which also boosts the efficiency of computing and storage while lowering CPU usage. Nonetheless, this technology introduces challenges, notably exacerbating network congestion.
- Network congestion leads to two critical issues: it prolongs network processing time and results in service packet loss. This loss, the need for retransmission, and further delayed services can significantly impede computing and storage performance.
- Employing ECN and PFC addresses these congestion-related issues by preventing packet loss and the associated delays in retransmission, thus enhancing computing and storage operations. However, excessive PFC pauses can decrease network throughput and potentially trigger PFC deadlocks.
To sum up, the cornerstone of attaining "low latency," "zero packet loss," and "high throughput" in RDMA networks lies in the adept and careful application of flow control mechanisms. Be aware that setting up ECN and PFC to support the lowest Job Completion Times (JCTs) requires careful monitoring of all the queues on all of the switches to find and fix hot spots in the network. A few poorly performing ports, with their hot spots left unrepaired, can lead to PFC storms that shut down all traffic. Monitoring and managing the queues on a per-GPU-run basis is critical until the various jobs can be baselined with the proper ECN and PFC configurations.
ECN queuing behavior
We will visually represent two GPU nodes (hosts), each with 8 GPUs, connected via a single High-Performance Network (HPN) switch. On the left side, NICs 1, 2, and 3 map to GPUs 1, 2, and 3 on Node 1. On the right side, NIC 9 maps to GPU 9 on Node 2. Four GPUs in our cluster are partitioned for a small job, and they need to be connected via the HPN.
During synchronization events, GPUs 1, 2, and 3 on Node 1 need all their data synced with GPU 9 on Node 2 via the Global Send. As described in previous articles, these flows are long-lived and can saturate an entire 400 Gbps link. It is important to note that both ECN and PFC must be configured not only on all the network devices in the path but also on all the NICs.
WRED Min not exceeded
- NICs 1, 2, and 3 send long-lived elephant flows to GPU 9 via NIC 9.
- Since NICs 1,2,3 participate as injectors or Reaction Points (RP), they mark the ECN field with a 0x10, indicating they participate in ECN.
- Note that the queue depth shown is below the WRED Min threshold, so no packets are ECN-marked (the ECN codepoint values used here are sketched after this list).
- Since there is no congestion on Eth 1/9, the switch does not change the ECN field and passes the flows to GPU9 via NIC 9.
- Packets are sent to NIC 9 with the ECN bit still set to 0x10.
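As a point of reference, the 0x10 and 0x11 values used throughout this article correspond to the two-bit ECN field in the IP header defined by RFC 3168. Below is a minimal Python sketch of those codepoints and the checks a device conceptually performs; the helper function names are illustrative only.

```python
# ECN codepoints from the two-bit ECN field in the IPv4/IPv6 header (RFC 3168).
# The article's 0x10 / 0x11 notation corresponds to the binary values 10 and 11.
NOT_ECT = 0b00  # sender does not support ECN
ECT_1   = 0b01  # ECN-Capable Transport (1)
ECT_0   = 0b10  # ECN-Capable Transport (0) -- what the sending NICs set
CE      = 0b11  # Congestion Experienced    -- what a congested switch rewrites

def is_ecn_capable(ecn_bits: int) -> bool:
    """A packet may be CE-marked only if the sender declared ECN capability."""
    return ecn_bits in (ECT_0, ECT_1)

def congestion_experienced(ecn_bits: int) -> bool:
    """True when a switch along the path has signaled congestion."""
    return ecn_bits == CE
```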
Priority queue above WRED Min threshold
Traffic congestion happens on the switch when the bandwidth arriving from the GPUs exceeds the capacity of the switch port going to Node 2, oversubscribing port Eth 1/9. As a result, buffer usage on the switch begins to accumulate. Once the buffer hits the WRED minimum threshold, the switch signals congestion by marking the ECN field of a random subset of packets with 0x11; the marking probability rises from 0 to 100% as the queue depth climbs from WRED Min toward WRED Max, showing there is a bottleneck in the data flow (a sketch of this marking behavior follows the list below).
- The synchronization flows from GPU 1,2,3 have now caused the buffer on Eth 1/9 to exceed the WRED Min threshold.
- The QOS configuration on the switch will now remark several, but not all, packets going to GPU 9 from GPUs 1, 2, and 3 with ECN = 0x11 (Congestion Experienced).
- Remarked packets are then sent to NIC 9. In this example, only packets from NIC 1 are marked.
- NIC 9 receives and processes the packets with ECN = 0x11. Since the switch marks only some packets, CNPs are sent only to the senders of marked packets. Consequently, the sending NIC will receive numerous Congestion Notification Packets (CNPs), prompting it to significantly lower its transmission rate to the destination, as the NIC's algorithm dictates. This action helps alleviate congestion, allowing the buffer to begin emptying. Following this, the traffic throughput should incrementally recover until further congestion indications emerge. Should congestion persist and buffer usage exceed the Weighted Random Early Detection (WRED) maximum threshold, the switch will mark all packets with congestion notification bits.
- NIC 9 is now the Notification Point (NP) and sends Congestion Notification packets (CNPs) back to the Injector or RP (in this case, NIC 1).
- CNPs are sent to NIC 1.
- NIC 1 receives the CNP packets and slows down its transmission rate.
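The following is a minimal sketch of the WRED-style ECN marking just described, assuming a simple linear ramp in marking probability between the WRED Min and WRED Max thresholds. Real implementations typically operate on an averaged queue depth with a configurable maximum mark probability, so treat this as illustrative rather than a vendor implementation.

```python
import random

def ecn_mark_probability(queue_depth: int, wred_min: int, wred_max: int) -> float:
    """Linear WRED-style ECN marking probability for a priority queue.

    Below WRED Min nothing is marked; between Min and Max the probability
    ramps from 0% to 100%; at or above WRED Max every packet is marked.
    """
    if queue_depth < wred_min:
        return 0.0
    if queue_depth >= wred_max:
        return 1.0
    return (queue_depth - wred_min) / (wred_max - wred_min)

def maybe_mark_ce(ecn_bits: int, queue_depth: int, wred_min: int, wred_max: int) -> int:
    """Rewrite ECT(0)/ECT(1) to CE (0x11) with the WRED probability; leave Not-ECT alone."""
    ECT_0, ECT_1, CE = 0b10, 0b01, 0b11
    capable = ecn_bits in (ECT_0, ECT_1)
    if capable and random.random() < ecn_mark_probability(queue_depth, wred_min, wred_max):
        return CE
    return ecn_bits
```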
Queue above WRED Max
- The synchronization flows from GPU 1,2,3 have now caused the buffer on Eth 1/9 going to GPU 9 to exceed the WRED max threshold.
- The QOS configuration on the switch will now remark all packets going to GPU 9 from GPUs 1, 2, and 3 with ECN = 0x11 (Congestion Experienced).
- Remarked packets are then sent to NIC 9.
- NIC 9 receives and processes the packets with ECN=0x11.
- NIC 9 is now the Notification Point (NP) and sends Congestion Notification packets (CNPs) back to all of the Injectors or RP (NICs 1,2,3).
- CNPs are sent to NICs 1, 2, and 3.
- NICs 1, 2, and 3 receive the CNP packets and slow down their transmission rates (a sketch of this notification behavior follows this list).
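As a rough illustration of the Notification Point role, the sketch below shows a receiving NIC turning CE-marked (0x11) arrivals into CNPs sent back to each Reaction Point. The pacing interval is an assumption based on the published DCQCN design, which limits CNPs to roughly one per flow per 50 microseconds; the class and callback names are hypothetical.

```python
import time

# At most one CNP per flow per interval; used here purely as an illustrative default.
CNP_INTERVAL_S = 50e-6

class NotificationPoint:
    """Receiver-side (NP) sketch: turn CE-marked arrivals into paced CNPs."""

    def __init__(self):
        self._last_cnp = {}  # flow id -> timestamp of the last CNP sent

    def on_packet(self, flow_id: str, ecn_bits: int, send_cnp) -> None:
        CE = 0b11
        if ecn_bits != CE:
            return  # unmarked traffic generates no feedback
        now = time.monotonic()
        # Pace CNPs so a burst of CE marks does not flood the sending NIC.
        if now - self._last_cnp.get(flow_id, 0.0) >= CNP_INTERVAL_S:
            self._last_cnp[flow_id] = now
            send_cnp(flow_id)  # CNP travels back to the Reaction Point (sender NIC)
```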
Queue returns below the WRED Min threshold
NICs 1, 2, and 3 in our GPU node have reduced their send rates, the queue has dropped below the WRED Min point, and the switch has stopped marking packets. NIC 9 no longer sends CNP packets to the RP NICs 1, 2, and 3, which begin ramping back up to full speed.
- The priority queue buffer drops below the WRED Min threshold.
- The switch stops rewriting the ECN field to 0x11.
- The packets leave the switch unchanged.
It's important to note that this process is dynamic, automatically slowing the traffic rate to keep the priority queues below WRED Max. Careful tuning of the WRED Max and WRED Min thresholds during training runs is essential to prevent hot spots in the switch port priority queues. Monitoring these priority queues and adjusting the thresholds is critical to keeping Job Completion Times (JCTs) low.
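What "monitoring these priority queues" could look like in practice is sketched below, assuming queue-depth samples have already been collected per switch port (for example, via streaming telemetry) for one GPU run. The sample format, threshold units, and function name are assumptions for illustration.

```python
from collections import defaultdict

def find_hot_spots(samples, wred_min_bytes: int, xoff_bytes: int):
    """Flag switch ports whose priority-queue depth keeps crossing the thresholds.

    `samples` is assumed to be an iterable of (switch, port, queue_depth_bytes)
    tuples collected from telemetry over a single training step or GPU run.
    """
    ecn_hits = defaultdict(int)  # samples spent above WRED Min (ECN marking active)
    pfc_hits = defaultdict(int)  # samples spent above xOFF (PFC pauses being sent)
    for switch, port, depth in samples:
        key = (switch, port)
        if depth >= xoff_bytes:
            pfc_hits[key] += 1
        elif depth >= wred_min_bytes:
            ecn_hits[key] += 1
    # Ports that repeatedly reach xOFF are the first candidates for threshold,
    # placement, or job-configuration tuning.
    return {
        "ecn_hot_spots": sorted(ecn_hits, key=ecn_hits.get, reverse=True),
        "pfc_hot_spots": sorted(pfc_hits, key=pfc_hits.get, reverse=True),
    }
```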
PFC queuing behavior
PFC works differently than ECN in that it does not re-mark packets when the xOFF threshold is exceeded. However, like ECN, PFC sets up high and low thresholds (xOFF and xON) in the same priority queue for the marked RDMA traffic.
xOFF threshold not exceeded
- The xOFF threshold sits higher in the priority queue buffer; it marks the buffer utilization point that triggers the creation and dispatch of a PFC pause frame back to the traffic's source.
- Once the buffer begins to empty and dips beneath the xON threshold, pause frames cease to be issued to the senders. At this point the system considers the congestion cleared, and the senders can start transmitting again.
- The buffer headroom is the space between the point where pause frames are sent and the point where traffic is dropped from the priority queue. PFC must be configured so the queue never reaches the drop point (a sketch of this xOFF/xON hysteresis follows this list).
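Below is a minimal sketch of that xOFF/xON hysteresis, assuming per-priority accounting on a single buffer and illustrative byte thresholds; the class, callbacks, and the way headroom is modeled are assumptions, not a switch ASIC implementation.

```python
class PfcPriorityQueue:
    """Sketch of PFC pause/resume hysteresis for one priority queue (illustrative)."""

    def __init__(self, xoff_bytes: int, xon_bytes: int, headroom_bytes: int):
        assert xon_bytes < xoff_bytes, "xON must sit below xOFF"
        self.xoff = xoff_bytes
        self.xon = xon_bytes
        self.drop_point = xoff_bytes + headroom_bytes  # buffer exhausted beyond this
        self.paused = False

    def on_depth_change(self, depth_bytes: int, send_pause, send_resume) -> None:
        if depth_bytes >= self.drop_point:
            raise RuntimeError("headroom exhausted: packets would be dropped")
        if not self.paused and depth_bytes >= self.xoff:
            self.paused = True
            send_pause()   # PFC pause frame toward the device sending the traffic
        elif self.paused and depth_bytes <= self.xon:
            self.paused = False
            send_resume()  # pause released; the sender may transmit again
```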
xOFF exceeded
- xOFF queue threshold exceeded on Eth 1/9
- In PFC, the switch becomes the notification point (NP)
- The switch sends PFC pause frames to all of the senders feeding the congested queue (NICs or switches).
- NICs 1, 2, and 3 receive the PFC pause frames and pause transmission, allowing the queues to empty.
Queue below the xON threshold
- The pause frames stop once the priority queue dips below the xON threshold.
- The switch stops sending PFC pause frames to the NICs and switches it had paused.
Combining ECN and PFC (Data Center Quantized Congestion Notification DCQCN)
At the SIGCOMM 2015 event, Microsoft introduced DCQCN, marking a significant advancement in congestion control research. Before this, RDMA devices depended solely on the PFC back pressure mechanism for regulating point-to-point speeds, lacking network card support for comprehensive flow control. DCQCN, drawing from QCN and DCTCP technologies, established an end-to-end congestion control scheme for RDMA networks. Its foundation lies in ECN marking, ensuring smooth integration with current Ethernet systems. DCQCN's strategy involves three key stages: congestion signaling by the forwarding switch (CP) via ECN marks, rate adjustment by the sender (RP), and feedback from the receiver (NP) through CNP protocol messages. It boasts a rapid recovery to baseline speeds in five update cycles and an additional phase for swift acceleration, enabling quick adaptation to optimal rates, even from slower speeds. DCQCN's adjustable parameters facilitate robust end-to-end congestion management, maintaining high throughput and minimal latency.
Individually, ECN and PFC effectively handle congestion; however, their combined efforts enhance efficiency. ECN proactively addresses congestion, with PFC safeguarding against traffic loss when buffer usage peaks. Together, they form the backbone of the Data Center Quantized Congestion Notification system, optimizing congestion management and fostering the development of lossless Ethernet networks.
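To make the rate-adjustment stage more concrete, here is a simplified Python sketch of the Reaction Point behavior along the lines of the published DCQCN algorithm: an estimate alpha of the marked-packet fraction, a multiplicative rate cut on each CNP, and recovery that first averages back toward the pre-cut target rate before increasing beyond it. It omits the byte-counter trigger, the hyper-increase stage, and hardware rate limits, and the parameter values are defaults from the paper or purely illustrative.

```python
G = 1 / 256      # alpha gain per update (a DCQCN default)
F = 5            # fast-recovery steps before additive increase begins
R_AI = 5e6       # additive increase step in bits per second (illustrative)

class ReactionPoint:
    """Simplified DCQCN sender-side (RP) rate control for a single flow."""

    def __init__(self, line_rate_bps: float):
        self.rc = line_rate_bps   # current sending rate
        self.rt = line_rate_bps   # target rate remembered before the last cut
        self.alpha = 1.0          # estimate of the fraction of marked packets
        self.increase_steps = 0

    def on_cnp(self) -> None:
        """Multiplicative decrease when a CNP arrives from the Notification Point."""
        self.rt = self.rc
        self.rc *= (1 - self.alpha / 2)
        self.alpha = (1 - G) * self.alpha + G
        self.increase_steps = 0

    def on_alpha_timer(self) -> None:
        """Decay alpha while no CNPs are arriving."""
        self.alpha = (1 - G) * self.alpha

    def on_increase_timer(self) -> None:
        """Fast recovery toward the pre-cut target, then additive increase beyond it."""
        self.increase_steps += 1
        if self.increase_steps > F:
            self.rt += R_AI
        self.rc = (self.rt + self.rc) / 2
```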
A look at the priority queue thresholds for DCQCN
The following diagram combines the ECN (WRED) and PFC max and min thresholds to maintain the fastest throughput through the switch while providing a lossless fabric. Careful tuning is critical for all queues to minimize Job Completion Times. Some traffic may need to be marked with ECN earlier to maintain the fastest JCTs. Testing and baselining will need to be done to optimize the fabric and prevent hot spots and dropped traffic.
- The WRED Min threshold is where we start re-marking ECN to 0x11 for some traffic. The resulting CNPs, when returned to the sender NIC, cause it to slow down.
- The WRED Max threshold is where we remark all packets in the priority queue to ECN = 0x11. The resulting CNPs, when returned to the sender NICs, cause them to slow down.
- The xON threshold is the level the queue depth must drain below, once the xOFF threshold has been exceeded, to stop PFC pause packets.
- The xOFF threshold is where the switches send PFC pause frames to their upstream neighbors (switches or NICs). It's critical to note that the switches send the PFC frames, whereas the NICs send the CNP frames.
- The buffer headroom is the queue space left between crossing the xOFF threshold, triggering PFCs, and filling the buffer, at which point we drop packets.
- Drop is where the buffer is full and we drop RDMA packets out of the priority queue (a small check of this threshold ordering follows the list).
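The sketch below captures one common ordering of these thresholds that is consistent with the diagram and with DCQCN's intent that ECN engages before PFC. The function name, units, and the specific assertions are assumptions to illustrate the relationships, not vendor guidance.

```python
def validate_dcqcn_thresholds(wred_min: int, wred_max: int, xon: int, xoff: int,
                              headroom: int, buffer_size: int) -> None:
    """Sanity-check one common ordering of the per-queue thresholds (all in bytes)."""
    assert 0 < wred_min < wred_max, "ECN marking must ramp between WRED Min and WRED Max"
    assert wred_max <= xoff, "all packets should be CE-marked before PFC pauses fire"
    assert xon < xoff, "xON must sit below xOFF so pauses stop only after draining"
    assert xoff + headroom <= buffer_size, "headroom must fit below the drop point"
```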
A practical example of DCQCN in action
GPU Synchronization traffic starting
Our example shows that the priority queue buffers on Leafs 1-2, 3-4, and 5-6, and on the spines, are all below the first threshold in orange (WRED Min). GPU synchronization traffic flows ramp up from GPU nodes 1-4 and nodes 5-8, flowing only to node 9 because that is how the users have configured the GPU cluster. Traffic will flow normally; however, even though the network is designed as nonblocking, we still need to rely on DCQCN to provide the lossless behavior as the buffers fill.
- Priority queue buffer threshold diagram for reference
- Spine buffer below WRED Min (Orange Threshold)
- Leaf 1-2 buffers below WRED Min
- Leaf 3-4 buffers below WRED Min
- Leaf 5-6 buffers below WRED Min
GPU Synchronization traffic exceeds the WRED Min threshold on Leaf 5
Our synchronization traffic has ramped up, and we are filling up the buffers in our switches. GPU nodes 1-4 and 5-8 send traffic only to GPU node 9. Remember, the switches will rewrite the ECN bits to 0x11 and forward the packets, and the receiving NICs (Notification Points) are then responsible for sending CNPs to the sending NICs (Reaction Points) to slow them down.
- Priority queue buffer threshold diagram for reference
- Priority Egress buffer queue for connection to Host 9 exceeds the WRED min threshold.
- WRED randomly marks a percentage of the packets, with the marking probability scaling from 0% at WRED Min to 100% at WRED Max based on queue depth. Due to the nature of the WRED algorithm, not all of the senders' packets are marked 0x11 (Congestion Experienced) while the queue rises between the two thresholds.
- Because some of the received packets are marked 0x11, the NIC on Host 9 sends Congestion Notification Packets (CNPs) to the senders whose packets had the ECN bits modified to 0x11. (Not all senders' packets may be marked with ECN 0x11.)
- CNPs are forwarded to the spine.
- CNPs are forwarded from node 9 NIC to nodes 1-4 that had the ECN bit modified by Leaf 5
- CNPs are forwarded from node 9 NIC to nodes 5-8 that had the ECN bit modified by Leaf 5
Hosts that have received CNPs will slow down their transmission speeds to help relieve the buffer filling on Leaf 5. It is important to remember that ECNs marked at 0x11 trigger CNPs, and the NIC is responsible for sending the CNPs to the sending hosts.
GPU Synchronization traffic exceeds the WRED Max threshold on Leaf 5.
- Priority queue buffer threshold diagram for reference.
- Priority Egress buffer queue for leaf 5 exceeds the WRED max threshold.
- WRED marks every packet 0x11 (Congestion Experienced)
- Seeing that all the received packets are marked 0x11, the NIC on Host 9 sends Congestion Notification Packets (CNPs) to all of the senders whose packets were marked. (All sending nodes' packets are ECN-marked because the WRED Max threshold has been passed.)
- CNPs are forwarded to the spine.
- CNPs are forwarded to all Hosts 1-4 that had the ECN bit modified
- CNPs are forwarded to all Hosts 5-8 that had the ECN bit modified
All sending hosts receive CNPs and will slow down their transmission speeds to help relieve the buffer filling on Leaf 5.
GPU Synchronization traffic exceeds the xOFF threshold on Leaf 5.
- Priority queue buffer threshold diagram for reference.
- Priority Egress buffer queue for connection to Host 9 exceeds the xOFF threshold (PFCs generated)
- WRED still marks every packet 0x11 (Congestion Experienced).
- Seeing that the received packets are still marked 0x11, the NIC on Host 9 sends Congestion Notification Packets (CNPs) to the senders. (All sending hosts' packets are ECN-marked because the WRED Max threshold has been passed.)
- Priority Flow Control (PFC) packets are sent from Leaf 5 to its upstream sending neighbor (Spine 1).
- Spine 1 receives the PFC packets and pauses sending traffic to Leaf 5 to prevent the buffer from overrunning. The Buffer Headroom is used to avoid dropping traffic. Note that this has solved the Leaf 5 buffer issue, and it is starting to drain; however, now the buffers on Spine 1 begin to fill with traffic destined for node 9 due to paused traffic.
- With traffic paused at Spine 1, CE-marked packets stop reaching Host 9, so CNPs stop and the CNP timers expire. The traffic rate from Hosts 1-4 increases again, further exacerbating the buffer issues on Spine 1.
- Likewise, CNPs stop for Hosts 5-8, and their traffic rate increases again, further exacerbating the buffer issues on Spine 1.
It is important to note that PFCs are sent by the switches, whereas CNPs are sent by the receiving NICs.
GPU Synchronization traffic exceeds the xOFF threshold on Spine 1 and Leaf 5.
- Priority queue buffer threshold diagram for reference.
- Leaf 5 buffer still exceeds the xOFF threshold and sends PFCs to Spine 1 to pause frames.
- Spine 1's egress buffer has now exceeded the xOFF threshold, so it sends PFC frames to Leaf 1 and Leaf 3, the neighbors sending it the congesting traffic.
- PFCs are sent to Leaf 1 to pause traffic.
- PFCs are sent to Leaf 3 to pause traffic.
- CNPs stop due to Spine 1 pausing traffic and CNP timers expiring. The traffic rate increases from Hosts 1-4, but Spine 1 has paused Leaf 1, so buffers start to fill on Leaf 1.
- CNPs stop due to Spine 1 pausing traffic and CNP timers expiring. The traffic rate increases from Hosts 5-8, but Spine 1 has paused Leaf 3, so buffers start to fill on Leaf 3.
GPU Synchronization traffic exceeds the xOFF threshold on Leaf 1 and Leaf 3
- Priority queue buffer threshold diagram for reference.
- Leaf 5 buffer exceeds the xOFF threshold and sends PFCs to Spine 1 to pause traffic.
- Spine 1's egress buffer exceeds the xOFF threshold, sending PFC frames to Leafs 1 and 3, the sources of the congesting traffic.
- The Leaf 1 and Leaf 3 buffers exceed the xOFF threshold, and both leafs send PFCs to their hosts to pause traffic.
- PFCs are sent to Hosts 1-4 to pause traffic.
- PFCs are sent to hosts 5-8 to pause traffic.
- Traffic is paused on Host 1-4.
- Traffic is paused on Host 5-8.
GPU Synchronization traffic recovery
- Priority queue buffer threshold diagram for reference.
- The leaf 5 buffer empties first, passing the xON threshold, and stops the pause frames to Spine 1. The spine can now start sending to Leaf 5.
- Spine 1's buffer empties, passing the xON threshold, and stops the pause frames to Leafs 1 and 3. Leafs 1 and 3 can now start sending traffic again.
- Leaf 1 buffer empties, passing the xON threshold, and stops the pause frames for Hosts 1-4. Traffic is no longer paused, and it starts its rate increase algorithm.
- The Leaf 3 buffer empties, passing the xON threshold, and stops the pause frames for Hosts 5-8. Traffic is no longer paused, and it starts its rate increase algorithm.
It's important to note that during the recovery, as queues empty below the xON threshold, traffic received from the previously paused buffers will still be marked 0x11 until the queues drain below the WRED Min threshold.
Conclusion
ECN and PFC are required to provide the lossless fabric needed for GPU-to-GPU communication during AI/ML training runs. The information presented here is essential to understand, as tuning the buffer thresholds may be necessary to alleviate hot spots and lower JCTs. Several OEMs provide baseline WRED and PFC thresholds for their switches for AI/ML workloads, which differ depending on the switch ASIC used. Please work with your OEM or WWT to get the latest baseline configuration for queue thresholds, and realize that the fabric will require careful tuning of the buffer thresholds to eliminate hot spots. ECN (WRED) thresholds should be the first ones monitored and tuned, with the understanding that PFC is more of a fail-safe to prevent dropped packets. This is a simple example with only a handful of GPU nodes; managing the queues becomes critical when there are dozens or even hundreds of GPU nodes.