Hi all,
I have a cluster running RoCE on Mellanox NICs:
# lspci | grep Mellanox
03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
03:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
There is a problem when I run a large-block-size workload on it: the bandwidth is very poor. I tried the perftest ib_*_bw tools, and ib_read_bw shows the same issue, as shown below:
Server:
# ib_read_bw -d mlx5_1 -i 1 -s 131072 -n 10000 -F --report_gbits
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x08cc PSN 0x9f5907 OUT 0x10 RKey 0x08da1a VAddr 0x007fdfbb260000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62
remote address: LID 0000 QPN 0x0ec1 PSN 0xc25c5e OUT 0x10 RKey 0x0d9351 VAddr 0x007f60c8de0000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
 131072 10000 0.354959 0.091025 0.000087
---------------------------------------------------------------------------------------
Client:
# ib_read_bw -d mlx5_1 -i 1 -s 131072 -n 10000 -F --report_gbits 10.252.4.62
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0ec1 PSN 0xc25c5e OUT 0x10 RKey 0x0d9351 VAddr 0x007f60c8de0000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61
remote address: LID 0000 QPN 0x08cc PSN 0x9f5907 OUT 0x10 RKey 0x08da1a VAddr 0x007fdfbb260000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
 131072 10000 0.354959 0.091025 0.000087
---------------------------------------------------------------------------------------
As you can see, the average bandwidth is only about 91 Mb/s, which is obviously far too low. While looking into possible causes, I found that the 'rx_discards_phy' counter increases constantly while the test is running.
# ethtool -S enp3s0f1 | grep discard
rx_discards_phy: 19459329
tx_discards_phy: 0
# ethtool -S enp3s0f1 | grep discard
rx_discards_phy: 19493876
tx_discards_phy: 0
# ethtool -S enp3s0f1 | grep discard
rx_discards_phy: 19517948
tx_discards_phy: 0
# ethtool -S enp3s0f1 | grep discard
rx_discards_phy: 19524980
tx_discards_phy: 0
# ethtool -S enp3s0f1 | grep discard
rx_discards_phy: 19660462
tx_discards_phy: 0
# ethtool -S enp3s0f1 | grep discard
rx_discards_phy: 19715074
tx_discards_phy: 0
According to another post, Understanding mlx5 ethtool Counters (https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters), this counter is described as:
"The number of received packets dropped due to lack of buffers on a physical port. If this counter is increasing, it implies that the adapter is congested and cannot absorb the traffic coming from the network."
So it looks like the receive side is constantly dropping packets because it is running out of port receive buffers, but I don't know how to dig deeper or fix the issue from here. I also tried to increase the NIC ring buffer with ethtool (see the commands below), but it did not help much.
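For reference, this is roughly what I ran to enlarge the RX ring (the maximum size depends on the NIC and driver, so 8192 is just the example value I used, not necessarily what your card reports):
Show the current and maximum ring sizes:
# ethtool -g enp3s0f1
Raise the RX ring towards the reported maximum:
# ethtool -G enp3s0f1 rx 8192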
Another interesting point is that both ib_send_bw and ib_write_bw run fine and do not show this issue:
# ib_send_bw -d mlx5_1 -i 1 -s 131072 -n 100000 -F --report_gbits 10.252.4.62
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0ec4 PSN 0xb420ab
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61
remote address: LID 0000 QPN 0x08cf PSN 0x98b61c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
131072 100000 0.00 92.16 0.087890
---------------------------------------------------------------------------------------
# ib_write_bw -d mlx5_1 -i 1 -s 131072 -n 100000 -F --report_gbits 10.252.4.62
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0ec5 PSN 0x31b13f RKey 0x0dbfc3 VAddr 0x007f59856a0000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61
remote address: LID 0000 QPN 0x08d0 PSN 0x25cb57 RKey 0x091496 VAddr 0x007fd7f8e20000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
131072 100000 0.00 92.57 0.088281
---------------------------------------------------------------------------------------
Does anyone have any clues on what might be causing the problem? Any suggestions are appreciated!
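In case it is relevant: since the link type is Ethernet, I assume lossless operation depends on PFC or global pause being configured correctly, which I have not verified yet. This is how I would check the current flow-control settings on my side (assuming ethtool and the MLNX_OFED mlnx_qos tool are available):
Global pause settings on the port:
# ethtool -a enp3s0f1
Pause-related counters:
# ethtool -S enp3s0f1 | grep -i pause
Per-priority PFC configuration:
# mlnx_qos -i enp3s0f1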