Hi all,
I am testing RDMA on my cluster, but the "ib_write_bw -a" command always fails when it tries to send 65536 or more bytes. ib_write_bw succeeds if I set the connection type to UC. I have also tried connecting the two machines directly with a cable, but the problem persists. Can anyone help me troubleshoot?
The log when using RC:
command: ib_write_bw 10.10.10.192 -a
new post send flow is not supported, falling back to ibv_post_send
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: Unsupported
ibv_wr* API : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 2
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0000 QPN 0x011e PSN 0x78d963 RKey 0x009f65 VAddr 0x007f8104725000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:191
remote address: LID 0000 QPN 0x011f PSN 0xe8e408 RKey 0x00b392 VAddr 0x007f23a8c33000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:192
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1499.933000 != 2452.957000. CPU Frequency is not max.
2 5000 4.27 4.26 2.230961
Conflicting CPU frequency values detected: 1499.550000 != 2952.146000. CPU Frequency is not max.
4 5000 9.61 9.46 2.480504
Conflicting CPU frequency values detected: 1499.656000 != 2997.671000. CPU Frequency is not max.
8 5000 20.04 19.67 2.578615
Conflicting CPU frequency values detected: 1499.770000 != 1463.734000. CPU Frequency is not max.
16 5000 40.12 39.24 2.571379
Conflicting CPU frequency values detected: 1499.778000 != 2998.152000. CPU Frequency is not max.
32 5000 80.03 78.54 2.573588
Conflicting CPU frequency values detected: 1499.958000 != 2998.873000. CPU Frequency is not max.
64 5000 160.62 157.78 2.585021
Conflicting CPU frequency values detected: 1499.806000 != 2999.000000. CPU Frequency is not max.
128 5000 320.68 308.64 2.528346
Conflicting CPU frequency values detected: 1499.747000 != 2999.080000. CPU Frequency is not max.
256 5000 637.45 600.40 2.459247
Conflicting CPU frequency values detected: 1499.620000 != 2999.557000. CPU Frequency is not max.
512 5000 1262.81 1193.57 2.444440
Conflicting CPU frequency values detected: 1499.639000 != 2999.262000. CPU Frequency is not max.
1024 5000 2516.95 2370.48 2.427374
Conflicting CPU frequency values detected: 1499.607000 != 2999.490000. CPU Frequency is not max.
2048 5000 5126.39 4789.65 2.452302
Conflicting CPU frequency values detected: 1499.728000 != 2999.404000. CPU Frequency is not max.
4096 5000 6460.28 6175.59 1.580951
Conflicting CPU frequency values detected: 1499.713000 != 2999.890000. CPU Frequency is not max.
8192 5000 6528.70 117.22 0.015004
Conflicting CPU frequency values detected: 1499.861000 != 2999.505000. CPU Frequency is not max.
16384 5000 6573.54 6411.19 0.410316
Conflicting CPU frequency values detected: 1499.650000 != 2999.793000. CPU Frequency is not max.
32768 5000 6605.99 6311.28 0.201961
Conflicting CPU frequency values detected: 1499.738000 != 2999.748000. CPU Frequency is not max.
65536 5000 6610.41 4672.71 0.074763
mlx5: ubuntu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 0800011e 38803bd3
Completion with error at client
Failed status 10: wr_id 0 syndrom 0x88
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
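For reference, the "Failed status 10" line can be decoded against the ibv_wc_status enum declared in libibverbs' verbs.h. The following is a minimal sketch, assuming the standard enum ordering from rdma-core; the helper name wc_status_name is made up for illustration and only the first 14 codes are listed.

```python
# Work-completion status names in the order they are declared in the
# ibv_wc_status enum of libibverbs (verbs.h), so the list index is the
# numeric status. Only the first 14 entries are included in this sketch.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",
    "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",
    "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",
    "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",
    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",
    "IBV_WC_RNR_RETRY_EXC_ERR",
]


def wc_status_name(code: int) -> str:
    """Hypothetical helper: map a numeric completion status to its enum name."""
    if 0 <= code < len(IBV_WC_STATUS):
        return IBV_WC_STATUS[code]
    return f"unknown status {code}"


print(wc_status_name(10))  # status reported in the RC log above
```

Status 10 corresponds to IBV_WC_REM_ACCESS_ERR, meaning the remote side reported an access error for the RDMA write. This only names the error; it does not by itself explain why it first appears at 65536 bytes.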
The log when using UC:
command: ib_write_bw 10.10.10.192 -a -c UC
new post send flow is not supported, falling back to ibv_post_send
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : UC Using SRQ : OFF
PCIe relax order: Unsupported
ibv_wr* API : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 2
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0000 QPN 0x011f PSN 0x8f0e32 RKey 0x00ab9e VAddr 0x007f00516a4000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:191
remote address: LID 0000 QPN 0x0120 PSN 0xc68df6 RKey 0x00d1af VAddr 0x007f0708fe9000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:192
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1498.490000 != 2998.944000. CPU Frequency is not max.
2 5000 5.18 5.15 2.700739
Conflicting CPU frequency values detected: 1499.797000 != 2998.669000. CPU Frequency is not max.
4 5000 10.40 10.39 2.722819
Conflicting CPU frequency values detected: 1499.773000 != 2999.242000. CPU Frequency is not max.
8 5000 20.60 20.59 2.698614
Conflicting CPU frequency values detected: 1499.855000 != 2999.768000. CPU Frequency is not max.
16 5000 41.09 40.97 2.684706
Conflicting CPU frequency values detected: 1499.675000 != 2999.804000. CPU Frequency is not max.
32 5000 82.63 82.53 2.704304
Conflicting CPU frequency values detected: 1499.691000 != 2999.799000. CPU Frequency is not max.
64 5000 164.67 164.51 2.695322
Conflicting CPU frequency values detected: 1499.719000 != 2999.961000. CPU Frequency is not max.
128 5000 329.03 327.07 2.679348
Conflicting CPU frequency values detected: 1499.632000 != 2999.812000. CPU Frequency is not max.
256 5000 655.71 655.59 2.685285
Conflicting CPU frequency values detected: 1499.493000 != 2999.830000. CPU Frequency is not max.
512 5000 1311.43 1310.63 2.684175
Conflicting CPU frequency values detected: 1499.574000 != 2999.977000. CPU Frequency is not max.
1024 5000 2641.77 2625.21 2.688217
Conflicting CPU frequency values detected: 1499.959000 != 3000.028000. CPU Frequency is not max.
2048 5000 5405.42 5392.40 2.760908
Conflicting CPU frequency values detected: 1499.783000 != 3000.012000. CPU Frequency is not max.
4096 5000 6499.66 6496.46 1.663094
Conflicting CPU frequency values detected: 1499.827000 != 3000.029000. CPU Frequency is not max.
8192 5000 6561.57 6560.75 0.839776
Conflicting CPU frequency values detected: 1499.691000 != 2999.835000. CPU Frequency is not max.
16384 5000 6608.71 6608.62 0.422952
Conflicting CPU frequency values detected: 1499.778000 != 2999.715000. CPU Frequency is not max.
32768 5000 6637.26 6636.79 0.212377
Conflicting CPU frequency values detected: 1482.963000 != 2999.785000. CPU Frequency is not max.
65536 5000 6653.24 6653.12 0.106450
Conflicting CPU frequency values detected: 1496.927000 != 1402.720000. CPU Frequency is not max.
131072 5000 6660.94 6660.88 0.053287
Conflicting CPU frequency values detected: 1496.259000 != 1388.171000. CPU Frequency is not max.
262144 5000 6663.07 6662.99 0.026652
Conflicting CPU frequency values detected: 1499.458000 != 2998.130000. CPU Frequency is not max.
524288 5000 6665.31 6665.30 0.013331
Conflicting CPU frequency values detected: 1499.326000 != 2999.340000. CPU Frequency is not max.
1048576 5000 6665.85 6665.84 0.006666
Conflicting CPU frequency values detected: 1498.296000 != 1797.506000. CPU Frequency is not max.
2097152 5000 6666.04 6666.04 0.003333
Conflicting CPU frequency values detected: 1498.496000 != 2414.338000. CPU Frequency is not max.
4194304 5000 6666.70 6666.70 0.001667
Conflicting CPU frequency values detected: 1498.592000 != 3000.033000. CPU Frequency is not max.
8388608 5000 6668.30 6668.30 0.000834
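When comparing the two logs, it can help to pick out rows where the average bandwidth falls well below the peak. The sketch below parses result rows in the five-column layout shown above and flags such rows. The function name suspicious_rows and the 0.8 threshold are arbitrary assumptions for illustration, not part of perftest.

```python
import re

# One perftest result row: size, iterations, peak BW, average BW, message rate.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")


def suspicious_rows(log: str, ratio: float = 0.8):
    """Return (size, peak, avg) for rows whose average BW is below ratio * peak.

    The ratio is an arbitrary cut-off chosen for this sketch.
    """
    flagged = []
    for line in log.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headers, CPU-frequency warnings, CQE dumps
        size = int(m.group(1))
        peak = float(m.group(3))
        avg = float(m.group(4))
        if peak > 0 and avg < ratio * peak:
            flagged.append((size, peak, avg))
    return flagged


# A few lines copied from the RC log earlier in this thread.
rc_sample = """\
Conflicting CPU frequency values detected: 1499.713000 != 2999.890000. CPU Frequency is not max.
8192 5000 6528.70 117.22 0.015004
16384 5000 6573.54 6411.19 0.410316
65536 5000 6610.41 4672.71 0.074763
"""

for size, peak, avg in suspicious_rows(rc_sample):
    print(f"{size} bytes: average {avg} MB/sec vs peak {peak} MB/sec")
```

On this RC excerpt the 8192-byte and 65536-byte rows are flagged, while the 16384-byte row is not. None of the UC rows above would be flagged at the same threshold, since their average and peak values are nearly identical.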
Thanks for your reply,
The driver version is 525.60.11 and the CUDA version is 12.0. My GPUs are A6000s. The network card is a Mellanox Technologies MT27800 Family [ConnectX-5].
The perftest version is perftest-4.5-0.17, which I downloaded from this GitHub link.
I am not sure what you mean by “firmware”. Could you tell me how to check it?
Based on the information you provided, you are using the inbox driver.
Please update to the latest MOFED driver and the latest firmware available for this adapter, according to the OS you are using, and then re-test your setup. You can download it from here:
In case you still encounter any issues, you can contact enterprisesupport@nvidia.com and create a support case according to your entitlement.
Hi ypetrov,
Thanks for your reply. I followed your instructions and installed the latest driver (version MLNX_OFED_LINUX-5.9-0.5.6.0-ubuntu18.04-x86_64), but the IB network interface went down after the driver was installed. I cannot find the network interface with the ifconfig command. The issue remains even after rebooting a couple of times.
Here is the output of some IB-related commands after the driver was updated:
command: ibstat output:
CA 'mlx5_0'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.27.2008
    Hardware version: 0
    Node GUID: 0x043f720300cbb392
    System image GUID: 0x043f720300cbb392
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x063f72fffecbb392
        Link layer: Ethernet
command: ibdev2netdev output:
mlx5_0 port 1 ==> enp35s0np0 (Down)
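The ibstat output above already contains the firmware version asked about earlier in the thread, along with the port state. As a rough sketch, the relevant fields can be pulled out of such "Key: value" text as follows. The function name summarize_ibstat and the choice of fields are assumptions made for illustration.

```python
def summarize_ibstat(text: str) -> dict:
    """Extract a few 'Key: value' fields from ibstat-style output.

    Only the first occurrence of each field is kept, so on a multi-port
    adapter this sketch reports the first port only.
    """
    wanted = ("Firmware version", "State", "Physical state", "Link layer")
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in wanted and key not in summary:
            summary[key] = value.strip()
    return summary


# Lines copied from the ibstat output in this post.
ibstat_sample = """\
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.27.2008
Port 1:
State: Down
Physical state: Disabled
Rate: 100
Link layer: Ethernet
"""

for field, value in summarize_ibstat(ibstat_sample).items():
    print(f"{field}: {value}")
```

To run this against a live system instead of pasted text, the output of the ibstat command could be captured with subprocess.run(["ibstat"], capture_output=True, text=True).stdout, assuming ibstat is installed and on the PATH.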