Error checking lossy RoCE acceleration state

Hello,

I am having trouble running jobs on my RoCE mini cluster, and it was suggested that I enable lossy RoCE acceleration as described at https://community.mellanox.com/s/article/How-to-Enable-Disable-Lossy-RoCE-Accelerations. When I try to follow the instructions, I can list the registers, which include ROCE_ACCL, but everything goes downhill from there. Each of these four commands:

sudo mlxreg -d 5e:00.0 --reg_name ROCE_ACCL --get

sudo mlxreg -d 5e:00.0 --reg_name ROCE_ACCL --get

sudo mlxreg -d /dev/mst/mt4121_pciconf0 --reg_name ROCE_ACCL --get

sudo mlxreg -d /dev/mst/mt4121_pciconf0.1 --reg_name ROCE_ACCL --get

returns the error:

Sending access register…

-E- Failed to send access register: ME_ICMD_OPERATIONAL_ERROR

I cannot do much with mlxreg besides listing registers, and I have seen other error messages as well. However, I think the first step is to understand what ME_ICMD_OPERATIONAL_ERROR means and why it is happening.

Thanks.

Hello Arturo,

Thank you for posting your inquiry on the NVIDIA Networking Community.

Based on the information provided, this error message appears when the driver is not loaded, or not loaded properly.

See the example output below, taken from a node running FreeBSD with a ConnectX-6 configured in Ethernet mode.

With driver loaded:

mlxreg -d `mst status | awk '{print$1}' | tail -n +3` --reg_name ROCE_ACCL --get

Sending access register…

Field Name | Data

====================================================

roce_adp_retrans_field_select | 0x00000001

roce_tx_window_field_select | 0x00000001

roce_slow_restart_field_select | 0x00000001

roce_slow_restart_idle_field_select | 0x00000001

roce_adp_retrans_en | 0x00000000

roce_tx_window_en | 0x00000000

roce_slow_restart_en | 0x00000000

roce_slow_restart_idle_en | 0x00000000

====================================================
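A dump like the one above can also be checked mechanically. The following is only a minimal sketch: the function name roce_accl_enabled is hypothetical, and it assumes the "Field Name | Data" layout shown in this thread. It prints the *_en fields whose value is nonzero.

```shell
# Hypothetical helper (not part of the Mellanox tools): reads an mlxreg-style
# "Field Name | Data" dump on stdin and prints every *_en field that is
# set to a nonzero value.
roce_accl_enabled() {
    awk -F'|' '
        # Only consider rows with a data column whose field name ends in "_en".
        NF >= 2 && $1 ~ /_en[ \t]*$/ {
            name = $1
            value = $2
            gsub(/[ \t]/, "", name)
            gsub(/[ \t\r]/, "", value)
            # Any value other than all zeros counts as enabled.
            if (value !~ /^0x0+$/) print name
        }
    '
}

# Possible usage (placeholder device, same flags as in this thread):
#   mlxreg -d <device> --reg_name ROCE_ACCL --get | roce_accl_enabled
```

With the dump shown above, where every *_en field is 0x00000000, the function prints nothing.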

Example output with driver unloaded:

kldunload mlx5en

kldunload mlx5ib

kldstat

Id Refs Address Size Name

1 28 0xffffffff80200000 227b0a0 kernel

3 1 0xffffffff824b7000 63fd0 mlx5.ko

4 2 0xffffffff8251b000 39a8 mlxfw.ko

5 2 0xffffffff8251f000 4790 xz.ko

6 4 0xffffffff82524000 2da38 linuxkpi.ko

7 1 0xffffffff82552000 3b98 dcons.ko

9 1 0xffffffff8258b000 a3538 ibcore.ko

10 1 0xffffffff82912000 1a20 fdescfs.ko

11 1 0xffffffff82914000 2698 intpm.ko

12 1 0xffffffff82917000 b40 smbus.ko

13 1 0xffffffff82918000 1860 uhid.ko

14 1 0xffffffff8291a000 2908 ums.ko

15 1 0xffffffff8291d000 46f0 autofs.ko

kldunload mlx5

mlxreg -d `mst status | awk '{print$1}' | tail -n +3` --reg_name ROCE_ACCL --get

Sending access register…

-E- Failed to send access register: ME_ICMD_OPERATIONAL_ERROR

With the driver loaded, enabling lossy RoCE accelerations:

kldload mlx5 mlx5ib mlx5en && /etc/netstart

mlxreg -d `mst status | awk '{print$1}' | tail -n +3` --reg_name ROCE_ACCL --set "roce_adp_retrans_en=0x1,roce_tx_window_en=0x1,roce_slow_restart_en=0x1"

You are about to send access register: ROCE_ACCL with the following data:

Field Name | Data

====================================================

roce_adp_retrans_field_select | 0x00000001

roce_tx_window_field_select | 0x00000001

roce_slow_restart_field_select | 0x00000001

roce_slow_restart_idle_field_select | 0x00000001

roce_adp_retrans_en | 0x00000001

roce_tx_window_en | 0x00000001

roce_slow_restart_en | 0x00000001

roce_slow_restart_idle_en | 0x00000000

====================================================

Do you want to continue ? (y/n) [n] :
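The --set argument is a comma-separated list of field=value pairs, so it can be assembled in the shell to avoid typing mistakes. This is only a sketch, and the helper name roce_accl_set_arg is hypothetical; the field names are the ones shown in the dump above.

```shell
# Hypothetical helper: joins "field=value" pairs with commas, producing a
# string in the form passed to mlxreg --set in this thread.
# Usage: roce_accl_set_arg VALUE FIELD [FIELD...]
roce_accl_set_arg() {
    _value=$1
    shift
    _arg=""
    for _field in "$@"; do
        # Prepend a comma only when the string already has an entry.
        _arg="${_arg:+$_arg,}${_field}=${_value}"
    done
    printf '%s\n' "$_arg"
}

# Example: build the argument for the three accelerations enabled above.
roce_accl_set_arg 0x1 roce_adp_retrans_en roce_tx_window_en roce_slow_restart_en
```

The resulting string can then be passed, in plain double quotes, as the value of --set.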

Also make sure the adapter is configured for Ethernet.

The instructions also work on Linux; just make sure that the driver is loaded properly.
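One way to confirm that the driver is loaded before calling mlxreg is to look for the modules in the module listing. The sketch below is an illustration only: the function name modules_loaded is hypothetical, and it assumes the usual listing layouts (module name in the last column with a .ko suffix for kldstat on FreeBSD, module name in the first column for lsmod on Linux).

```shell
# Hypothetical helper: reads a module listing (kldstat or lsmod output) on
# stdin and succeeds only if every module named as an argument is present.
modules_loaded() {
    _listing=$(cat)
    for _mod in "$@"; do
        printf '%s\n' "$_listing" | awk -v m="$_mod" '
            # lsmod: name is the first column; kldstat: last column, "<name>.ko".
            $1 == m || $NF == m ".ko" { found = 1 }
            END { exit (found ? 0 : 1) }
        ' || { echo "missing: $_mod"; return 1; }
    done
    return 0
}

# Possible usage:
#   FreeBSD: kldstat | modules_loaded mlx5 mlx5en mlx5ib
#   Linux:   lsmod   | modules_loaded mlx5_core mlx5_ib
```

If a module is reported as missing, load it first and then retry the mlxreg command.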

Thank you and regards,

~NVIDIA Networking Technical Support

Hi Martijn,

I have installed and reinstalled the drivers several times, and the same problem always occurs. I found an earlier post, https://community.mellanox.com/s/article/understanding-rocev2-congestion-management, which makes me wonder whether I still need to install a congestion manager, or whether the congestion manager could be responsible for the error.

Thanks,

Arturo