Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control

jwitsoe · December 14, 2021, 10:10pm

Originally published at: https://developer.nvidia.com/blog/scaling-zero-touch-roce-technology-with-round-trip-time-congestion-control/

The new NVIDIA RTTCC congestion control algorithm for ZTR delivers RoCE performance at scale, without special switch infrastructure configuration.

user135095 · January 16, 2022, 1:34am

Thanks for this write-up. I tried ZTR-RTTCC today between a Windows machine and a Linux machine, with the Linux machine running a ConnectX-6 Dx card. I enabled ZTR-RTTCC with the mlxconfig -d /dev/mst/mt4125_pciconf0 -y s ROCE_CC_LEGACY_DCQCN=0 command, and rping + nd_rping work with my Windows 10 machine (which is using a ConnectX-4 NIC). While continuously running the rping + nd_rping commands, I did not see any traffic over my NIC on the Windows Task Manager Performance tab (the graph showing traffic over the NIC did not move at all).

But then, when I put in the command that was used on the blog: mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x4.0:4=15" -y
I noticed that when trying the ping again, I saw the traffic on the Task Manager Performance tab of my Windows machine: about 5MB/s for the continuous nd_rping. Previously this traffic was not visible at all, as RDMA bypasses the host.

Is this expected behavior with ZTR-RTTCC?
What is the command to undo the mlxreg command that was provided in the article (just to do more testing)?
What is the best way to check for errors?

AvivB · January 19, 2022, 8:07am

Hi @user135095

ZTR-RTTCC requires support on both sides (the notification point and the reaction point) to work. There’s a sync mechanism through RDMA-CM, but the mlxreg command supplied in this article forces ZTR-RTTCC usage to simplify testing.
The RTT packets are QP1 MADs and are handled in HW by ConnectX6-DX+ devices. Since ConnectX4-LX doesn’t support these packets, they are forwarded to SW, and this is likely what you see in the networking performance tab.
To disable the force mode you can use the following command:
sudo mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=1,0x4.0:4=15" -y
(or just use mlxfwreset to reset the device)
The best way is to check that the application performance is as expected. We usually also monitor the counters for dropped/out-of-sequence packets, which indicate loss; congestion control should keep these to a minimum.

Regards,
Aviv

user135095 · January 22, 2022, 4:56am

Hi @AvivB

Thank you for your reply, much appreciated. I did get it working on another system between a Linux server (ConnectX-6 DX card, 100G) and a Windows PC (ConnectX-4 EN card, specifically MCX416A-BCAT).
A few more questions that aren’t clear yet:

Will the ConnectX-4 EN (specifically, MCX416A-BCAT) work fully with ZTR? I’ve enabled it in Windows using mlx5cmd, and it shows that AdpRetrans is Enabled. I expect it would work then, but I can’t fully confirm it from the Windows side.
Is there a list of NVIDIA NICs that support the QP1 MAD packets?
What command on Windows + Linux do you recommend to test for dropped / out-of-sequence packets to test ZTR properly?

Thank you for your help and explanation.

AvivB · January 23, 2022, 10:41am

Hi again,
ZTR contains several sub-features, and ConnectX4-EN does support adaptive retransmission.
Newer devices support more (for example, ConnectX5+ devices support tx-window).
ZTR-CC is the new congestion control algorithm and is only supported in ConnectX6-DX+ devices. Part of ZTR-CC is RTT measurement, and only these devices support handling the RTT packets. Other devices will forward these packets to SW, where they will be ignored.

The counters for drops/OOS are exposed in Windows through perfmon and in Linux under /sys/class/infiniband.
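
On the Linux side, the sysfs counters mentioned above can be sampled with a small shell function. This is only a sketch under stated assumptions: the function name print_roce_counters is hypothetical, the path layout <device>/ports/<port>/hw_counters/ is the one the mlx5 driver commonly uses, and the counter names in the loop may differ or be absent depending on driver and firmware version, so missing files are simply skipped.

```shell
#!/bin/sh
# Print selected loss-related RoCE counters for one device port.
# Assumptions (not from the thread): the mlx5 sysfs layout
#   <root>/<device>/ports/<port>/hw_counters/<counter>
# and the counter names listed in the loop below.
print_roce_counters() {
    dev="$1"                              # e.g. mlx5_0
    port="${2:-1}"                        # port number, defaults to 1
    root="${3:-/sys/class/infiniband}"    # overridable, e.g. for testing
    dir="$root/$dev/ports/$port/hw_counters"

    if [ ! -d "$dir" ]; then
        echo "no hw_counters directory: $dir" >&2
        return 1
    fi

    for name in out_of_sequence packet_seq_err out_of_buffer \
                local_ack_timeout_err roce_adp_retrans; do
        # Skip counters that this device or driver does not expose.
        [ -r "$dir/$name" ] || continue
        printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
    done
    return 0
}
```

Running print_roce_counters mlx5_0 1 before and after a test and comparing the two outputs shows whether drops or out-of-sequence packets increased during the run.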

user135095 · January 25, 2022, 9:08am

Hi Aviv, thanks for the clarification.

Just to confirm - ConnectX-4 does support ZTR, but not ZTR-CC. I have enabled ZTR on my ConnectX-4 NIC, and have enabled ZTR-CC on my ConnectX-6 NIC. Currently, RDMA works just fine between those two cards. This should mean that regular ZTR is working but ZTR-CC is not, correct? I followed the WinOF-2 documentation, but there isn’t a way to verify that ZTR is doing its work, although I do believe ZTR must be functioning correctly.

What would be really helpful is if the WinOF-2 and MLNX EN / OFED docs were clearer and went more in-depth about ZTR and ZTR-CC - it’s really exciting technology that I’m glad NVIDIA is working on.

EDIT: I’m also getting this error when trying to run this command:

root@system~# mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set “0x0.0:8=1,0x4.0:4=15” -y
-E- Argument: “0x0 is invalid.
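
The error text echoes the argument as “0x0, with a typographic opening quote attached, which suggests that the curly quotes around the --set value were pasted literally and reached mlxreg as part of the argument. A minimal sketch for cleaning a pasted command follows; the helper name normalize_quotes is hypothetical, and it only rewrites the two typographic double-quote characters.

```shell
#!/bin/sh
# Replace typographic double quotes (U+201C, U+201D) with plain ASCII
# quotes in a pasted command line. Each character is matched as a literal
# string, so this does not depend on the locale's multibyte handling.
normalize_quotes() {
    printf '%s\n' "$1" | sed -e 's/“/"/g' -e 's/”/"/g'
}

# Example (prints the cleaned text, does not run anything):
#   normalize_quotes '--set “0x0.0:8=1,0x4.0:4=15” -y'
```

Retyping the --set value with plain double quotes, as in the command quoted earlier in the thread, should have the same effect.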

taekyounghan · April 14, 2022, 8:42am

Hi @AvivB

Thank you for sharing the wonderful experience.
I am a student in NXC lab at Seoul National University, Korea and my research interest is RDMA-based congestion control.
We are considering purchasing ConnectX-6 DX for test bed configuration. Before that, I’d like to ask you a few questions based on this post. It would be very helpful if you could answer.

Can you explain the congestion control programmability of the ConnectX-6 DX? It was difficult to find this information in documents other than firmware updates. Could you briefly explain through which route, and in which programming language, the congestion control algorithm in ConnectX-6 DX can be changed and updated?
To the best of my knowledge, there are a variety of RoCE-based congestion control algorithms (e.g., TIMELY, HPCC, Swift) in addition to ZTR-RTTCC and DCQCN. Some of those algorithms were implemented as NIC+FPGA designs due to insufficient NIC programmability. Do you think these algorithms can be implemented on top of ConnectX-6 DX?
I have a question about RTT measurement. As far as I understand, in ZTR-RTTCC, RTT is not measured for all data packets, but only for RTT packets. Is it possible to program the ConnectX-6 DX to measure RTT for each data packet on the sender? Also, if you have any interesting experience (e.g., noise) regarding the precision or accuracy of RTT measurement, I would really appreciate it if you could share it with me.

Best regards,
Taekyoung

sam_nan576 · October 26, 2023, 2:19am

Hi @AvivB ,
Thanks for the article. I am going to use ZTR-RTTCC along with RDMA-CM and ECE (Enhanced Connection Establishment). I have one question about the congestion control setting: does RDMA-CM automatically apply ZTR-RTTCC to the whole NIC, or only to individual QPs? Thanks again.

empire4th · June 28, 2024, 3:48am

Hello,

Does WinOF-2 support RDMA-CM and ECE?

I have two CX6-DX 25G cards. One side is Windows Server 2025, and the other side is Windows 11 Pro for Workstations. RoCEv2 is working fine, but I want to try ZTR-RTTCC in a Windows environment.

Waiting for your reply.

Thanks.

AvivB · July 15, 2024, 6:42am

Hi, WinOF2 does support ECE.
We recommend using the latest LTS (FW version XX.39.XXXX, WinOF version 23.10.26252) or a later GA release to get ZTRCC enabled by default via ECE mechanisms.

empire4th · July 15, 2024, 9:05am

Thank you!

I’ll try it.

314485689 · November 24, 2024, 5:29pm

Hello, is ZTRCC supported on CX7?

AvivB · November 25, 2024, 6:30pm

Hi, ZTRCC is supported on ConnectX7, ConnectX8, and Bluefield3 as well.

ricky14 · December 6, 2024, 6:22pm

Hi, is there an option to turn off ZTR for CX7? Or is this now deprecated?
$ sudo mlxconfig -d mlx5_0 -y s ROCE_CC_LEGACY_DCQCN=0

Device #1:

Device type: ConnectX7
Name: MCX755106AS-HEA_Ax
Description: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled
Device: mlx5_0

Configurations: Next Boot New
-E- The Device doesn’t support ROCE_CC_LEGACY_DCQCN parameter

weicheng · December 9, 2024, 8:54am

Hi, please use the commands below to disable ZTR_RTTCC per port.
sudo mlxreg -d /dev/mst/mt4129_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x0.16:8=1,0x4.0:4=0" -y
sudo mlxreg -d /dev/mst/mt4129_pciconf0.1 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x0.16:8=1,0x4.0:4=0" -y
