Poor bandwidth performance when running with large block size

Hi all,

I have a cluster running ROCE on Mellanox NIC.

# lspci | grep Mellanox

03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

03:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

There is a problem when I run a large-block-size workload on it: the bandwidth is very poor. I tried the ib_xxxx_bw tools, and ib_read_bw shows the same issue, as shown below:

Server:

# ib_read_bw -d mlx5_1 -i 1 -s 131072 -n 10000 -F --report_gbits

************************************

* Waiting for client to connect… *

************************************

---------------------------------------------------------------------------------------

RDMA_Read BW Test

Dual-port : OFF Device : mlx5_1

Number of qps : 1 Transport type : IB

Connection type : RC Using SRQ : OFF

CQ Moderation : 100

Mtu : 1024[B]

Link type : Ethernet

GID index : 3

Outstand reads : 16

rdma_cm QPs : OFF

Data ex. method : Ethernet

---------------------------------------------------------------------------------------

local address: LID 0000 QPN 0x08cc PSN 0x9f5907 OUT 0x10 RKey 0x08da1a VAddr 0x007fdfbb260000

GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62

remote address: LID 0000 QPN 0x0ec1 PSN 0xc25c5e OUT 0x10 RKey 0x0d9351 VAddr 0x007f60c8de0000

GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61

---------------------------------------------------------------------------------------

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]

131072 10000 0.354959 0.091025 0.000087

---------------------------------------------------------------------------------------

Client:

# ib_read_bw -d mlx5_1 -i 1 -s 131072 -n 10000 -F --report_gbits 10.252.4.62

---------------------------------------------------------------------------------------

RDMA_Read BW Test

Dual-port : OFF Device : mlx5_1

Number of qps : 1 Transport type : IB

Connection type : RC Using SRQ : OFF

TX depth : 128

CQ Moderation : 100

Mtu : 1024[B]

Link type : Ethernet

GID index : 3

Outstand reads : 16

rdma_cm QPs : OFF

Data ex. method : Ethernet

---------------------------------------------------------------------------------------

local address: LID 0000 QPN 0x0ec1 PSN 0xc25c5e OUT 0x10 RKey 0x0d9351 VAddr 0x007f60c8de0000

GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61

remote address: LID 0000 QPN 0x08cc PSN 0x9f5907 OUT 0x10 RKey 0x08da1a VAddr 0x007fdfbb260000

GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62

---------------------------------------------------------------------------------------

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]

131072 10000 0.354959 0.091025 0.000087

---------------------------------------------------------------------------------------

As you can see, the average bandwidth is only about 91 Mb/s, which is clearly wrong. While looking for possible causes, I found that the ‘rx_discards_phy’ counter increases constantly while the test is running.

# ethtool -S enp3s0f1 | grep discard

rx_discards_phy: 19459329

tx_discards_phy: 0

# ethtool -S enp3s0f1 | grep discard

rx_discards_phy: 19493876

tx_discards_phy: 0

# ethtool -S enp3s0f1 | grep discard

rx_discards_phy: 19517948

tx_discards_phy: 0

# ethtool -S enp3s0f1 | grep discard

rx_discards_phy: 19524980

tx_discards_phy: 0

# ethtool -S enp3s0f1 | grep discard

rx_discards_phy: 19660462

tx_discards_phy: 0

# ethtool -S enp3s0f1 | grep discard

rx_discards_phy: 19715074

tx_discards_phy: 0

From what I learned in another post, Understanding mlx5 ethtool Counters https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters , it looks like the receive side is constantly dropping packets because it is running out of port receive buffers. The article describes rx_discards_phy as:

The number of received packets dropped due to lack of buffers on a physical port. If this counter is increasing, it implies that the adapter is congested and cannot absorb the traffic coming from the network.

But I don’t know how to dig further or solve the issue starting from here. I also tried increasing the NIC ring buffer with ethtool, but it didn’t help much.
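In case it helps to see what I mean, this is roughly what I tried with the ring buffers (the valid maximum depends on the NIC and driver, so the value below is only an example):

# ethtool -g enp3s0f1

(shows the current and maximum RX/TX ring sizes)

# ethtool -G enp3s0f1 rx 8192

(raises the RX ring toward the maximum reported above; 8192 is just an illustration)

Even after raising the RX ring, rx_discards_phy kept climbing during the test.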

Another interesting point is that both ib_send_bw and ib_write_bw run fine and don’t show this issue.

# ib_send_bw -d mlx5_1 -i 1 -s 131072 -n 100000 -F --report_gbits 10.252.4.62

---------------------------------------------------------------------------------------

Send BW Test

Dual-port : OFF Device : mlx5_1

Number of qps : 1 Transport type : IB

Connection type : RC Using SRQ : OFF

TX depth : 128

CQ Moderation : 100

Mtu : 1024[B]

Link type : Ethernet

GID index : 3

Max inline data : 0[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet

---------------------------------------------------------------------------------------

local address: LID 0000 QPN 0x0ec4 PSN 0xb420ab

GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61

remote address: LID 0000 QPN 0x08cf PSN 0x98b61c

GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62

---------------------------------------------------------------------------------------

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]

131072 100000 0.00 92.16 0.087890

---------------------------------------------------------------------------------------

# ib_write_bw -d mlx5_1 -i 1 -s 131072 -n 100000 -F --report_gbits 10.252.4.62

---------------------------------------------------------------------------------------

RDMA_Write BW Test

Dual-port : OFF Device : mlx5_1

Number of qps : 1 Transport type : IB

Connection type : RC Using SRQ : OFF

TX depth : 128

CQ Moderation : 100

Mtu : 1024[B]

Link type : Ethernet

GID index : 3

Max inline data : 0[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet

---------------------------------------------------------------------------------------

local address: LID 0000 QPN 0x0ec5 PSN 0x31b13f RKey 0x0dbfc3 VAddr 0x007f59856a0000

GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61

remote address: LID 0000 QPN 0x08d0 PSN 0x25cb57 RKey 0x091496 VAddr 0x007fd7f8e20000

GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62

---------------------------------------------------------------------------------------

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]

131072 100000 0.00 92.57 0.088281

---------------------------------------------------------------------------------------

Does anyone have any clues on what might be causing the problem? Any suggestions are appreciated!

I got it.

You can refer to this article: https://community.mellanox.com/s/article/howto-tune-receive-buffers-on-mellanox-adapter-cards .

Tune each TC’s receive buffer on the physical port and test again.
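If I remember correctly, the method in the article boils down to writing the per-TC buffer sizes (in bytes) into the qos/buffer_size file, something like the lines below. Whether the file is writable, and the exact granularity the firmware accepts, depend on the driver/firmware version, so treat this only as a sketch:

# echo "65536,65536,0,0,0,0,0,0" > /sys/class/net/enp3s0f1/qos/buffer_size

(splits the buffer between TC0 and TC1; I believe the sum cannot exceed the total port buffer size reported by the same file)

# cat /sys/class/net/enp3s0f1/qos/buffer_size

(re-read the file to confirm the change took effect)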

Thanks Haonan.

However, the content of my /sys/class/net/xxx/qos/buffer_size file is a bit different:

# cat /sys/class/net/enp3s0f1/qos/buffer_size

Port buffer size = 262016

Spare buffer size = 0

Buffer Size xoff_threshold xon_threshold

0 262016 169984 82688

1 0 0 0

2 0 0 0

3 0 0 0

4 0 0 0

5 0 0 0

6 0 0 0

7 0 0 0

I’m not able to change it using the method in the post. Also, the post doesn’t have enough information about what exactly this parameter means: is it the size of one buffer or the number of buffers? What is the valid range? And so on.

Uh… I have tried and failed too.

The first column is the buffer number, which corresponds to the traffic class number, and the second column is the buffer size.

For more details, you may need to message the article’s author directly or email technical support with a detailed description of the problem.

The support email address is supportadmin@mellanox.com .

Hi, ZQ Huang.

Do you see the same result when the block size is smaller or larger than 131072?

Maybe you should enable PFC on both adapters with mlnx_qos.
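For example, assuming your RoCE traffic is mapped to priority 3 (adjust to whatever priority your setup actually uses), something like this on both hosts, plus the matching setting on the switch ports:

# mlnx_qos -i enp3s0f1 --pfc 0,0,0,1,0,0,0,0

(enables PFC for priority 3 only)

# mlnx_qos -i enp3s0f1

(prints the current QoS configuration so you can verify the change)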

I would suggest testing in the opposite direction and going through the Mellanox Tuning Guide first, before changing the ToS.

I tried connecting the adapters on the two hosts directly, and the problem is gone. So there must be something wrong with the switch…

That’s interesting. I haven’t experienced this myself.

But there should be only two kinds of causes: an abnormal software state or a mismatched hardware configuration. Maybe you should check these:

  • Reload the mlnx module (/etc/init.d/openibd restart) and re-test.
  • At 4K/8K/16K, check whether the bandwidth stays the same or increases linearly with block size (see the sketch after this list). If there are serious fluctuations, enable PFC on all communicating devices (NICs and switches) and test again.
  • Update OFED to the latest version and test again.
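For the second point, a quick way to do the sweep with the same tool is to run the sizes back to back, or use -a (if your perftest build supports it) to walk through all message sizes in one run:

# ib_read_bw -d mlx5_1 -i 1 -s 4096 -n 10000 -F --report_gbits 10.252.4.62

# ib_read_bw -d mlx5_1 -i 1 -s 8192 -n 10000 -F --report_gbits 10.252.4.62

# ib_read_bw -d mlx5_1 -i 1 -s 16384 -n 10000 -F --report_gbits 10.252.4.62

# ib_read_bw -d mlx5_1 -i 1 -a -F --report_gbits 10.252.4.62

(the server side needs to be restarted with matching options for each run)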

Thanks Haonan. Yes, the same problem appears when the block size is 32K or bigger, but it doesn’t exist for small block sizes such as 4K/8K/16K.

The firmware is already the latest version, and I’ve even restarted the hosts. The 4K/8K/16K ib_read_bw results are fairly steady, around 87 Gb/s.

What I don’t understand is that the ‘rx_discards_phy’ counter shows the adapter is running out of port receive buffer, yet I can’t find any useful information on how to enlarge that buffer. I also don’t know how this port receive buffer relates to the receive buffers the application posts with ‘ibv_post_recv’; I tried increasing/decreasing the posted buffers, but it doesn’t make much difference.

If these are two different kinds of buffers, then incoming packets must first land in the port receive buffer and only afterwards be placed into the posted receive buffers, and there should be some way to change the port receive buffer size/number, though I haven’t found it yet.
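For now, I’m watching the two drop counters side by side to try to tell the buffers apart. My (possibly wrong) understanding is that rx_out_of_buffer counts drops caused by missing posted receive WQEs (the ibv_post_recv side), while rx_discards_phy counts drops in the port’s own receive buffer:

# watch -n 1 "ethtool -S enp3s0f1 | grep -E 'rx_out_of_buffer|discard'"

If rx_out_of_buffer stays flat while rx_discards_phy keeps climbing, that would point to the port buffer rather than the buffers my application posts.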