How to configure the BF3 DPU to support line-rate RDMA over high-latency links

Hi,

I would like to know whether my BlueField-3 DPU can support line-rate RDMA communication over high-latency links.

Scenario
I am running RDMA communication over a high-latency link and facing performance issues. Both ends use ConnectX-7 NICs (on BlueField-3 DPUs). I simulated the high-latency link by delaying the ACK replies at the switch. The link characteristics are:

  • RTT ≈ 5 ms

  • Bandwidth: 40 Gbps

  • No packet loss

When testing bandwidth with ib_write_bw, the observed throughput is very poor (less than 1 Gbps). The commands I used are:

Server:
$ sudo ib_write_bw -d mlx5_0 -q 1 -x 3 -n 5 -s 100000000 --report_gbits -R

Client:
$ sudo ib_write_bw 172.17.1.106 -d mlx5_0 -q 1 -x 3 -n 5 -s 100000000 --report_gbits -R

Option explanations:

  • -d, --ib-dev=<dev> : Use IB device (default first device)

  • -n, --iters=<iters> : Number of exchanges (default 5000)

  • -q, --qp=<num> : Number of QPs (default 1)

  • -R, --rdma_cm : Connect QPs with rdma_cm

  • -s, --size=<size> : Message size (default 65536)

  • -x, --gid-index=<index> : GID index to use

Observation:
From my packet captures, I noticed that after the sender transmits a certain amount of data, it stops sending until the next ACK arrives; only then does it resume transmission. This behavior results in very poor throughput under high-RTT conditions.

Question:
It seems that the NIC may be enforcing a hidden transmission window at the sender side, limiting the number of in-flight packets.

  • Is there indeed such a window/limitation?

  • If so, is there any way to tune or disable it so that RDMA communication can achieve near line-rate (40 Gbps) throughput even with a 5 ms RTT?
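
For reference, the amount of data that has to be kept in flight to sustain line rate is the bandwidth-delay product of this link:

  40 Gbit/s × 5 ms = 0.2 Gbit = 25 MB

so roughly 25 MB of unacknowledged data needs to be outstanding at any moment to keep a 40 Gbps pipe full at a 5 ms RTT.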

My environment:

[~]$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e q

Device #1:
----------

Device type:        BlueField3          
Name:               900-9D3B6-00CV-A_Ax 
Description:        NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
Device:             /dev/mst/mt41692_pciconf0

[~]$ sudo mlxburn -d /dev/mst/mt41692_pciconf0  query
-I- Image type:            FS4
-I- FW Version:            32.41.1000
-I- FW Release Date:       28.4.2024
-I- Product Version:       32.41.1000
-I- Rom Info:              type=UEFI Virtio net version=21.4.13 cpu=AMD64,AARCH64
-I-                        type=UEFI Virtio blk version=22.4.13 cpu=AMD64,AARCH64
-I-                        type=UEFI version=14.34.12 cpu=AMD64,AARCH64
-I-                        type=PXE version=3.7.400 cpu=AMD64
-I- Description:           UID                GuidsNumber
-I- Base GUID:             5c257303006da676        38
-I- Base MAC:              5c25736da676            38
-I- Image VSD:             N/A
-I- Device VSD:            N/A
-I- PSID:                  MT_0000000884
-I- Security Attributes:   secure-fw

[~]$ cat /opt/mellanox/doca/applications/VERSION
2.9.3008

Thanks in advance for your help!

Hi,

I would recommend changing/adding the ib_write_bw command parameters as follows:

-q 16 instead of -q 1 - this increases the number of QPs used in the test; more QPs means more operations in flight at once.

-t 1024 - this raises the tx-depth from the default of 128 to 1024, allowing more outstanding work requests per QP.

-n 100000 - this raises the iteration count from 5 to 100000 so the test runs long enough to keep the pipeline full.
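
Putting those together with your original flags, the commands would look like this (the device name, GID index, and server address below are taken from your post):

Server:
$ sudo ib_write_bw -d mlx5_0 -q 16 -t 1024 -x 3 -n 100000 -s 100000000 --report_gbits -R

Client:
$ sudo ib_write_bw 172.17.1.106 -d mlx5_0 -q 16 -t 1024 -x 3 -n 100000 -s 100000000 --report_gbits -R

Note that 100000 iterations of a 100 MB message moves a very large amount of data; you may want to drop -s back to the default 65536 for a quicker run.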

These changes should, in theory, let you get very close to line rate, as long as the host and the BlueField firmware/driver are properly tuned as well.

Thanks,

Jonathan.

Hi,

Thank you very much for your suggestions.

However, as shown in my packet capture results, the bandwidth limitation is caused by the sending window introduced when the Selective Repeat feature is enabled. This window restricts the number of in-flight packets (its size is 512 packets), while my link requires about 25,000 packets to fully utilize the bandwidth. Therefore, increasing tx-depth or the number of iterations cannot solve this issue.
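
(To put numbers on this, assuming a 1024-byte path MTU, which is consistent with the figures above: a 512-packet window allows at most 512 × 1024 B ≈ 0.5 MB in flight, and 0.5 MB per 5 ms round trip is only about 0.8 Gbps, which matches the sub-1 Gbps I measured. Filling the 25 MB bandwidth-delay product of this link at that MTU indeed takes roughly 25,000 packets.)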

Increasing the number of QPs does improve bandwidth, since each QP maintains its own window. However, my goal is to achieve line-rate performance with a single QP. Hence, I’d like to know whether there is any way to remove or enlarge this window limitation while keeping the Selective Repeat feature enabled.

Thank you!