ib_write_bw fails when sending 65536 bytes with the default RC connection type

Hi all,
I am testing my cluster’s RDMA, but the "ib_write_bw -a" command always fails when it tries to send 65536 or more bytes. ib_write_bw succeeds if I set the connection type to UC. I have tried connecting my machines directly with a cable, but the problem remains the same. Can anyone help me troubleshoot?

The log when I am using RC:

command: ib_write_bw 10.10.10.192 -a

new post send flow is not supported, falling back to ibv_post_send

                RDMA_Write BW Test

Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: Unsupported
ibv_wr* API : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 2
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet

local address: LID 0000 QPN 0x011e PSN 0x78d963 RKey 0x009f65 VAddr 0x007f8104725000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:191
remote address: LID 0000 QPN 0x011f PSN 0xe8e408 RKey 0x00b392 VAddr 0x007f23a8c33000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:192

#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1499.933000 != 2452.957000. CPU Frequency is not max.
2 5000 4.27 4.26 2.230961
Conflicting CPU frequency values detected: 1499.550000 != 2952.146000. CPU Frequency is not max.
4 5000 9.61 9.46 2.480504
Conflicting CPU frequency values detected: 1499.656000 != 2997.671000. CPU Frequency is not max.
8 5000 20.04 19.67 2.578615
Conflicting CPU frequency values detected: 1499.770000 != 1463.734000. CPU Frequency is not max.
16 5000 40.12 39.24 2.571379
Conflicting CPU frequency values detected: 1499.778000 != 2998.152000. CPU Frequency is not max.
32 5000 80.03 78.54 2.573588
Conflicting CPU frequency values detected: 1499.958000 != 2998.873000. CPU Frequency is not max.
64 5000 160.62 157.78 2.585021
Conflicting CPU frequency values detected: 1499.806000 != 2999.000000. CPU Frequency is not max.
128 5000 320.68 308.64 2.528346
Conflicting CPU frequency values detected: 1499.747000 != 2999.080000. CPU Frequency is not max.
256 5000 637.45 600.40 2.459247
Conflicting CPU frequency values detected: 1499.620000 != 2999.557000. CPU Frequency is not max.
512 5000 1262.81 1193.57 2.444440
Conflicting CPU frequency values detected: 1499.639000 != 2999.262000. CPU Frequency is not max.
1024 5000 2516.95 2370.48 2.427374
Conflicting CPU frequency values detected: 1499.607000 != 2999.490000. CPU Frequency is not max.
2048 5000 5126.39 4789.65 2.452302
Conflicting CPU frequency values detected: 1499.728000 != 2999.404000. CPU Frequency is not max.
4096 5000 6460.28 6175.59 1.580951
Conflicting CPU frequency values detected: 1499.713000 != 2999.890000. CPU Frequency is not max.
8192 5000 6528.70 117.22 0.015004
Conflicting CPU frequency values detected: 1499.861000 != 2999.505000. CPU Frequency is not max.
16384 5000 6573.54 6411.19 0.410316
Conflicting CPU frequency values detected: 1499.650000 != 2999.793000. CPU Frequency is not max.
32768 5000 6605.99 6311.28 0.201961
Conflicting CPU frequency values detected: 1499.738000 != 2999.748000. CPU Frequency is not max.
65536 5000 6610.41 4672.71 0.074763
mlx5: ubuntu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 0800011e 38803bd3
Completion with error at client
Failed status 10: wr_id 0 syndrom 0x88
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
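
For readers trying to interpret the "Failed status 10" line above: perftest prints the numeric value of the libibverbs `ibv_wc_status` enum, and 10 corresponds to `IBV_WC_REM_ACCESS_ERR` (remote access error). The following is a minimal sketch that decodes such a line. The helper name `decode_failed_status` is hypothetical, and the table only covers the first enum values; it identifies the error class but does not explain the root cause in this setup.

```python
import re

# First values of the ibv_wc_status enum from <infiniband/verbs.h>.
# Only a subset is listed here; later values are omitted.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# perftest spells the field "syndrom" in its output, so the pattern does too.
FAILED_LINE = re.compile(
    r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)"
)


def decode_failed_status(line):
    """Hypothetical helper: return (status_name, wr_id, vendor_syndrome) or None."""
    match = FAILED_LINE.search(line)
    if match is None:
        return None
    status = int(match.group(1))
    name = IBV_WC_STATUS.get(status, "unknown status %d" % status)
    return name, int(match.group(2)), int(match.group(3), 16)


if __name__ == "__main__":
    print(decode_failed_status("Failed status 10: wr_id 0 syndrom 0x88"))
```

The vendor syndrome (0x88 here) is device specific and is left as a raw number.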

The log when I am using UC:

command: ib_write_bw 10.10.10.192 -a -c UC

new post send flow is not supported, falling back to ibv_post_send

                RDMA_Write BW Test

Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : UC Using SRQ : OFF
PCIe relax order: Unsupported
ibv_wr* API : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 2
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet

local address: LID 0000 QPN 0x011f PSN 0x8f0e32 RKey 0x00ab9e VAddr 0x007f00516a4000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:191
remote address: LID 0000 QPN 0x0120 PSN 0xc68df6 RKey 0x00d1af VAddr 0x007f0708fe9000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:192

#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1498.490000 != 2998.944000. CPU Frequency is not max.
2 5000 5.18 5.15 2.700739
Conflicting CPU frequency values detected: 1499.797000 != 2998.669000. CPU Frequency is not max.
4 5000 10.40 10.39 2.722819
Conflicting CPU frequency values detected: 1499.773000 != 2999.242000. CPU Frequency is not max.
8 5000 20.60 20.59 2.698614
Conflicting CPU frequency values detected: 1499.855000 != 2999.768000. CPU Frequency is not max.
16 5000 41.09 40.97 2.684706
Conflicting CPU frequency values detected: 1499.675000 != 2999.804000. CPU Frequency is not max.
32 5000 82.63 82.53 2.704304
Conflicting CPU frequency values detected: 1499.691000 != 2999.799000. CPU Frequency is not max.
64 5000 164.67 164.51 2.695322
Conflicting CPU frequency values detected: 1499.719000 != 2999.961000. CPU Frequency is not max.
128 5000 329.03 327.07 2.679348
Conflicting CPU frequency values detected: 1499.632000 != 2999.812000. CPU Frequency is not max.
256 5000 655.71 655.59 2.685285
Conflicting CPU frequency values detected: 1499.493000 != 2999.830000. CPU Frequency is not max.
512 5000 1311.43 1310.63 2.684175
Conflicting CPU frequency values detected: 1499.574000 != 2999.977000. CPU Frequency is not max.
1024 5000 2641.77 2625.21 2.688217
Conflicting CPU frequency values detected: 1499.959000 != 3000.028000. CPU Frequency is not max.
2048 5000 5405.42 5392.40 2.760908
Conflicting CPU frequency values detected: 1499.783000 != 3000.012000. CPU Frequency is not max.
4096 5000 6499.66 6496.46 1.663094
Conflicting CPU frequency values detected: 1499.827000 != 3000.029000. CPU Frequency is not max.
8192 5000 6561.57 6560.75 0.839776
Conflicting CPU frequency values detected: 1499.691000 != 2999.835000. CPU Frequency is not max.
16384 5000 6608.71 6608.62 0.422952
Conflicting CPU frequency values detected: 1499.778000 != 2999.715000. CPU Frequency is not max.
32768 5000 6637.26 6636.79 0.212377
Conflicting CPU frequency values detected: 1482.963000 != 2999.785000. CPU Frequency is not max.
65536 5000 6653.24 6653.12 0.106450
Conflicting CPU frequency values detected: 1496.927000 != 1402.720000. CPU Frequency is not max.
131072 5000 6660.94 6660.88 0.053287
Conflicting CPU frequency values detected: 1496.259000 != 1388.171000. CPU Frequency is not max.
262144 5000 6663.07 6662.99 0.026652
Conflicting CPU frequency values detected: 1499.458000 != 2998.130000. CPU Frequency is not max.
524288 5000 6665.31 6665.30 0.013331
Conflicting CPU frequency values detected: 1499.326000 != 2999.340000. CPU Frequency is not max.
1048576 5000 6665.85 6665.84 0.006666
Conflicting CPU frequency values detected: 1498.296000 != 1797.506000. CPU Frequency is not max.
2097152 5000 6666.04 6666.04 0.003333
Conflicting CPU frequency values detected: 1498.496000 != 2414.338000. CPU Frequency is not max.
4194304 5000 6666.70 6666.70 0.001667
Conflicting CPU frequency values detected: 1498.592000 != 3000.033000. CPU Frequency is not max.
8388608 5000 6668.30 6668.30 0.000834
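
Comparing the two logs above by eye is tedious because of the interleaved CPU frequency warnings. As a rough sketch, the result rows can be extracted and checked for sizes where the average bandwidth falls well below the peak, which in the RC log already happens at 8192 bytes, before the failure at 65536. The function names and the 0.8 threshold are assumptions chosen for illustration, not part of perftest.

```python
import re

# A result row has exactly five numeric fields:
# #bytes  #iterations  BW peak  BW average  MsgRate
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")

# A few rows copied from the RC log, with one warning line left in.
RC_SAMPLE = """\
Conflicting CPU frequency values detected: 1499.713000 != 2999.890000. CPU Frequency is not max.
4096 5000 6460.28 6175.59 1.580951
8192 5000 6528.70 117.22 0.015004
16384 5000 6573.54 6411.19 0.410316
65536 5000 6610.41 4672.71 0.074763
"""


def parse_rows(log_text):
    """Return (bytes, iterations, peak, average, msg_rate) for each result row."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            size, iters, peak, avg, rate = match.groups()
            rows.append((int(size), int(iters), float(peak), float(avg), float(rate)))
    return rows


def suspicious_rows(rows, min_ratio=0.8):
    """Rows whose average/peak ratio is below min_ratio (threshold is an assumption)."""
    return [row for row in rows if row[2] > 0 and row[3] / row[2] < min_ratio]


if __name__ == "__main__":
    for row in suspicious_rows(parse_rows(RC_SAMPLE)):
        print("size %d: average %.2f vs peak %.2f MB/sec" % (row[0], row[3], row[2]))
```

Warning lines and the completion-error hex dump do not match the five-field pattern, so they are skipped.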

Hi user46,

Thank you for contacting Nvidia community support.

Could you please share the following information about your setup:

  1. Driver version?
  2. perftest version?
  3. Which cards?
  4. Which firmware version?

Looking forward to your reply.

Thank you and regards,

Nvidia support

Thanks for your reply,
The driver version is 525.60.11 and the CUDA version is 12.0. My GPUs are A6000s. The network card is a Mellanox Technologies MT27800 Family [ConnectX-5].
The perftest version is perftest-4.5-0.17, which I downloaded from this GitHub link.
I am not sure what you mean by “firmware”. Could you tell me how to check it?

Hi User46,

Thank you for the information provided.

The firmware version can be retrieved with this command:
ethtool -i <interface name>
You can get the interface name with the command “ifconfig” or “ip a”.

Looking forward to your reply.

Nvidia support

Hi ypetrov,
I ran the ethtool command and got the following output:

driver: mlx5_core
version: 5.0-0
firmware-version: 16.27.2008 (MT_0000000010)
expansion-rom-version:
bus-info: 0000:23:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

It shows that the firmware version is 16.27.2008.
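
When the same check has to be repeated on several nodes, the `ethtool -i` output above can be parsed rather than read by eye. This is a minimal sketch that works on the text of the output; the function name `parse_ethtool_info` is hypothetical, and running `ethtool -i` itself (for example through `subprocess`) is left out.

```python
# Output of "ethtool -i" as posted above.
ETHTOOL_OUTPUT = """\
driver: mlx5_core
version: 5.0-0
firmware-version: 16.27.2008 (MT_0000000010)
expansion-rom-version:
bus-info: 0000:23:00.0
supports-statistics: yes
"""


def parse_ethtool_info(text):
    """Hypothetical helper: turn "key: value" lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, because values such as the
        # PCI bus address contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    info = parse_ethtool_info(ETHTOOL_OUTPUT)
    # The firmware field carries the board ID in parentheses after the version.
    print(info["driver"], info["firmware-version"].split()[0])
```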

Hi user46,

Based on the information you provided, you are using the inbox driver.
Please update to the latest MOFED driver and the latest firmware available for this adapter, according to the OS you are using, and then retest your setup. You can download it from here:

In case you still encounter any issues, you can contact enterprisesupport@nvidia.com and create a support case according to your entitlement.

Thank you and regards,

Nvidia support

Hi ypetrov,
Thanks for your reply. I followed your instructions and installed the latest driver (MLNX_OFED_LINUX-5.9-0.5.6.0-ubuntu18.04-x86_64). However, the IB network interface went down after the driver was installed, and I can no longer find it with the ifconfig command. The issue persists even after rebooting several times.
Here is the output of some IB-related commands after the driver update:

command: ibstat
output:
CA 'mlx5_0'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.27.2008
    Hardware version: 0
    Node GUID: 0x043f720300cbb392
    System image GUID: 0x043f720300cbb392
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x063f72fffecbb392
        Link layer: Ethernet

command: ibdev2netdev
output:
mlx5_0 port 1 ==> enp35s0np0 (Down)

command: ibv_devinfo
output:
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver: 16.27.2008
    node_guid: 043f:7203:00cb:b392
    sys_image_guid: 043f:7203:00cb:b392
    vendor_id: 0x02c9
    vendor_part_id: 4119
    hw_ver: 0x0
    board_id: MT_0000000010
    phys_port_cnt: 1
        port: 1
            state: PORT_DOWN (1)
            max_mtu: 4096 (5)
            active_mtu: 1024 (3)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: Ethernet

It looks like the IB port is down. How can I bring the port up?
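
As a starting point for checking this on other nodes, the `ibdev2netdev` line above can be parsed to list the interfaces reported as down. This is only a sketch: the function names are hypothetical, and the suggested `ip link set dev <interface> up` command is a common first step for an administratively down Ethernet interface, not something this thread confirms as the fix for the "Physical state: Disabled" condition shown here.

```python
import re

# One line per port, e.g. "mlx5_0 port 1 ==> enp35s0np0 (Down)".
MAPPING = re.compile(r"^(\S+) port (\d+) ==> (\S+) \((\w+)\)$")


def down_interfaces(ibdev2netdev_output):
    """Return (rdma_device, port, netdev) for every mapping reported as Down."""
    result = []
    for line in ibdev2netdev_output.splitlines():
        match = MAPPING.match(line.strip())
        if match and match.group(4).lower() == "down":
            result.append((match.group(1), int(match.group(2)), match.group(3)))
    return result


def suggested_commands(ibdev2netdev_output):
    """Build, but do not run, an 'ip link' command per down interface."""
    return [
        "ip link set dev %s up" % netdev
        for _, _, netdev in down_interfaces(ibdev2netdev_output)
    ]


if __name__ == "__main__":
    sample = "mlx5_0 port 1 ==> enp35s0np0 (Down)"
    for command in suggested_commands(sample):
        print(command)
```

The commands are only printed so they can be reviewed before being run with root privileges.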