Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control

Originally published at: Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control | NVIDIA Developer Blog

The new NVIDIA RTTCC congestion control algorithm for ZTR delivers RoCE performance at scale, without special switch infrastructure configuration.

Thanks for this write-up. I tried ZTR-RTTCC today between a Windows machine and a Linux machine, with the Linux machine running a ConnectX-6 Dx card. I enabled ZTR-RTTCC with the command mlxconfig -d /dev/mst/mt4125_pciconf0 -y s ROCE_CC_LEGACY_DCQCN=0, and rping + nd_rping works with my Windows 10 machine (which uses a ConnectX-4 NIC). While continuously running the rping + nd_rping commands, I saw no traffic on the NIC in the Performance tab of Windows Task Manager (the NIC traffic graph did not move at all).

But then I ran the command from the blog post: mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x4.0:4=15" -y
When I tried the ping again, traffic appeared in the Task Manager Performance tab on my Windows machine: about 5 MB/s for the continuous nd_rping. Previously it was not visible at all, since RDMA bypasses the host.

  1. Is this expected behavior with ZTR-RTTCC?
  2. What is the command to undo the mlxreg command that was provided in the article (just to do more testing)?
  3. What is the best way to check for errors?

Hi @user135095

  1. ZTR-RTTCC requires support on both sides (the notification point and the reaction point) to work. There’s a sync mechanism through RDMA-CM, but the mlxreg command supplied in this article forces ZTR-RTTCC usage to simplify testing.
    The RTT packets are QP1-MADs and are handled in HW by ConnectX6-DX+ devices. Since ConnectX4-LX doesn’t support these packets, they are forwarded to SW, and this is likely what you see in the networking performance tab.
  2. To disable the force mode you can use the following command:
    sudo mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=1,0x4.0:4=15" -y
    (or just use mlxfwreset to reset the device)
  3. The best way is to check that the application performance is as expected. We usually also monitor the counters for dropped/out-of-sequence packets, which indicate loss; congestion control should keep them to a minimum.
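
For readers comparing the two --set strings in this thread, here is a minimal Python sketch that splits such a string into its fields. It assumes the raw-register form is ADDRESS.BIT_OFFSET:BIT_SIZE=VALUE, which is an assumption to check against the mlxreg help output; the names FieldWrite and parse_set_string are hypothetical and not part of any NVIDIA tool.

```python
import re
from typing import NamedTuple


class FieldWrite(NamedTuple):
    """One field of a raw --set string (hypothetical helper type)."""
    address: int     # address part before the dot
    bit_offset: int  # first bit of the field
    bit_size: int    # width of the field in bits
    value: int       # value written to the field


# Assumed layout of one item: ADDRESS.BIT_OFFSET:BIT_SIZE=VALUE
_FIELD_RE = re.compile(
    r"^(0x[0-9a-fA-F]+|\d+)\.(\d+):(\d+)=(0x[0-9a-fA-F]+|\d+)$"
)


def parse_set_string(spec: str) -> list[FieldWrite]:
    """Split a comma-separated --set string into FieldWrite tuples."""
    fields = []
    for item in spec.split(","):
        match = _FIELD_RE.match(item.strip())
        if match is None:
            raise ValueError(f"unrecognized field spec: {item!r}")
        address, bit_offset, bit_size, value = (
            int(group, 0) for group in match.groups()
        )
        if value >= (1 << bit_size):
            raise ValueError(f"value {value} does not fit in {bit_size} bits")
        fields.append(FieldWrite(address, bit_offset, bit_size, value))
    return fields


forced = parse_set_string("0x0.0:8=2,0x4.0:4=15")    # string from the article
unforced = parse_set_string("0x0.0:8=1,0x4.0:4=15")  # string from this reply
changed = [(a, b) for a, b in zip(forced, unforced) if a != b]
```

Under that assumption, `changed` holds a single pair: only the first 8-bit field differs between the two commands (2 in the article, 1 here), and the 4-bit field set to 15 is the same in both.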

Regards,
Aviv

Hi @AvivB

Thank you for your reply, much appreciated. I did get it working on another system between a Linux server (ConnectX-6 DX card, 100G) and a Windows PC (ConnectX-4 EN card, specifically MCX416A-BCAT).
A few more questions that aren’t clear yet:

  1. Will the ConnectX-4 EN (specifically, MCX416A-BCAT) work fully with ZTR? I’ve enabled it in Windows using mlx5cmd, and it shows that AdpRetrans is Enabled. I expect it to work, but I can’t fully confirm that from the Windows side.
  2. Is there a list of NVIDIA NICs that support the QP1-MAD packets?
  3. Which commands on Windows and Linux do you recommend for checking dropped / out-of-sequence packets, so that I can test ZTR properly?

Thank you for your help and explanation.

Hi again,
ZTR contains several sub-features, and ConnectX4-EN does support adaptive retransmission.
Newer devices support more (for example, ConnectX5+ devices support tx-window).
ZTR-CC is the new congestion control algorithm and is only supported on ConnectX6-DX+ devices. Part of ZTR-CC is RTT measurement, and only these devices can handle the RTT packets. Other devices forward these packets to SW, where they are ignored.

The counters for drops/OOS are exposed in Windows through perfmon and in Linux under /sys/class/infiniband.
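
On the Linux side, a minimal Python sketch for watching those counters could look like the following. It only assumes that a counter directory contains one file per counter holding a single integer. The device name mlx5_0, port 1, the hw_counters subdirectory, and the counter names in the comments are assumptions that depend on the driver and setup, and read_counters and counter_deltas are hypothetical helper names.

```python
from pathlib import Path


def read_counters(counter_dir):
    """Return {counter_name: value} for every integer-valued file in a directory."""
    counters = {}
    for entry in Path(counter_dir).iterdir():
        if not entry.is_file():
            continue
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip files that are unreadable or do not hold a plain integer.
            continue
    return counters


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Example usage (path is an assumption; adjust device and port to your system):
#   path = "/sys/class/infiniband/mlx5_0/ports/1/hw_counters"
#   before = read_counters(path)
#   ... run the RDMA workload ...
#   print(counter_deltas(before, read_counters(path)))
# Counters such as out_of_sequence or packet_seq_err increasing during the run
# would suggest loss or reordering.
```

Taking a snapshot before and after a test run keeps the output short, since only counters that moved are reported.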

Hi Aviv, thanks for the clarification.

Just to confirm: ConnectX-4 does support ZTR, but not ZTR-CC. I have enabled ZTR on my ConnectX-4 NIC and ZTR-CC on my ConnectX-6 NIC. Currently, RDMA works just fine between those two cards. This should mean that regular ZTR is working but ZTR-CC is not, correct? I followed the WinOF-2 documentation, but it gives no way to verify that ZTR is actually doing its work, although I do believe ZTR is functioning correctly.

It would be really helpful if the WinOF-2 and MLNX EN / OFED docs were clearer and went into more depth about ZTR and ZTR-CC. It’s really exciting technology that I’m glad NVIDIA is working on.

EDIT: I’m also getting this error when I run this command:

root@system~# mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set “0x0.0:8=1,0x4.0:4=15” -y
-E- Argument: “0x0 is invalid.
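
The rejected argument in the error starts with a typographic quote (“), which suggests the command was pasted with curly quotes that the shell does not treat as quoting characters. A minimal Python sketch of that effect, using shlex as a stand-in for shell word splitting (normalize_quotes is a hypothetical helper, and this is an assumption about the cause, not a confirmed diagnosis):

```python
import shlex

# Map typographic quotes to their ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(command: str) -> str:
    """Replace curly quotes in a pasted command with plain ASCII quotes."""
    return command.translate(_QUOTE_MAP)


pasted = (
    "mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 "
    "--set \u201c0x0.0:8=1,0x4.0:4=15\u201d -y"
)

# With curly quotes, the quotes stay inside the argument passed to --set.
raw_arg = shlex.split(pasted)[-2]

# With ASCII quotes, the quotes are consumed and only the field list remains.
clean_arg = shlex.split(normalize_quotes(pasted))[-2]
```

Here `raw_arg` still begins with “ while `clean_arg` is the bare field list, so retyping the quotes as plain " in the terminal is worth trying before anything else.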

Hi @AvivB

Thank you for sharing the wonderful experience.
I am a student in NXC lab at Seoul National University, Korea and my research interest is RDMA-based congestion control.
We are considering purchasing ConnectX-6 DX for test bed configuration. Before that, I’d like to ask you a few questions based on this post. It would be very helpful if you could answer.

  1. Can you explain the congestion control programmability of the ConnectX-6 DX? It was difficult to find this information in any documents other than the firmware update notes. Could you briefly explain through which interface and in which programming language the congestion control algorithm on the ConnectX-6 DX can be changed and updated?

  2. To the best of my knowledge, there are a variety of RoCE-based congestion control algorithms (e.g., TIMELY, HPCC, Swift) in addition to ZTR-RTTCC and DCQCN. Some of them were implemented as NIC+FPGA designs because of insufficient NIC programmability. Do you think these algorithms can be implemented on top of the ConnectX-6 DX?

  3. I have a question about RTT measurement. As far as I understand, ZTR-RTTCC does not measure RTT for every data packet, but only for dedicated RTT packets. Is it possible to program the ConnectX-6 DX to measure RTT for each data packet at the sender? Also, if you have any interesting experience (e.g., with noise) regarding the precision or accuracy of RTT measurement, I would really appreciate it if you could share it.

Best regards,
Taekyoung