How to configure the BF3 DPU to support line-rate RDMA over high-latency links

Hi,

I would like to know whether my BlueField-3 DPU can support line-rate RDMA communication over high-latency links.

Scenario
I am running RDMA communication over a high-latency link and facing performance issues. Both ends use ConnectX-7 NICs (on BlueField-3 DPUs). I simulated the high-latency link by delaying the ACK replies at the switch. The link characteristics are:

  • RTT ≈ 5 ms

  • Bandwidth: 40 Gbps

  • No packet loss

When testing bandwidth with ib_write_bw, the observed throughput is very poor (less than 1 Gbps). The commands I used are:

Server:
$ sudo ib_write_bw -d mlx5_0 -q 1 -x 3 -n 5 -s 100000000 --report_gbits -R

Client:
$ sudo ib_write_bw 172.17.1.106 -d mlx5_0 -q 1 -x 3 -n 5 -s 100000000 --report_gbits -R

Option explanations:

  • -d, --ib-dev=<dev> : Use IB device (default first device)

  • -n, --iters=<iters> : Number of exchanges (default 5000)

  • -q, --qp=<num> : Number of QPs (default 1)

  • -R, --rdma_cm : Connect QPs with rdma_cm

  • -s, --size=<size> : Message size (default 65536)

  • -x, --gid-index=<index> : GID index to use

Observation:
From my packet captures, I noticed that after the sender transmits a certain amount of data, it stops sending until the next ACK arrives; only then does it resume transmission. This behavior results in very poor throughput under high-RTT conditions.

Question:
It seems that the NIC may be enforcing a hidden transmission window at the sender side, limiting the number of in-flight packets.

  • Is there indeed such a window/limitation?

  • If so, is there any way to tune or disable it so that RDMA communication can achieve near line-rate (40 Gbps) throughput even with a 5 ms RTT?
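
For reference, the amount of data that has to be kept in flight to sustain line rate is the bandwidth-delay product of this link:

  40 Gbit/s × 5 ms = 0.2 Gbit = 25 MB

so roughly 25 MB of unacknowledged data needs to be outstanding at any moment to keep a 40 Gbps pipe full at a 5 ms RTT.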

My environment:

[~]$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e q

Device #1:
----------

Device type:        BlueField3          
Name:               900-9D3B6-00CV-A_Ax 
Description:        NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
Device:             /dev/mst/mt41692_pciconf0

[~]$ sudo mlxburn -d /dev/mst/mt41692_pciconf0  query
-I- Image type:            FS4
-I- FW Version:            32.41.1000
-I- FW Release Date:       28.4.2024
-I- Product Version:       32.41.1000
-I- Rom Info:              type=UEFI Virtio net version=21.4.13 cpu=AMD64,AARCH64
-I-                        type=UEFI Virtio blk version=22.4.13 cpu=AMD64,AARCH64
-I-                        type=UEFI version=14.34.12 cpu=AMD64,AARCH64
-I-                        type=PXE version=3.7.400 cpu=AMD64
-I- Description:           UID                GuidsNumber
-I- Base GUID:             5c257303006da676        38
-I- Base MAC:              5c25736da676            38
-I- Image VSD:             N/A
-I- Device VSD:            N/A
-I- PSID:                  MT_0000000884
-I- Security Attributes:   secure-fw

[~]$ cat /opt/mellanox/doca/applications/VERSION
2.9.3008

Thanks in advance for your help!

Hi,

I would recommend changing/adding the ib_write_bw command parameters as follows:

-q 16 instead of -q 1 - this increases the number of QPs used in the test; more QPs means more operations in flight at once.

-t 1024 - this raises the tx-depth from the default of 128 to 1024, allowing more outstanding work requests per QP.

-n 100000 - this raises the iteration count from 5 to 100000 so the test runs long enough to keep the pipeline full.
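
Putting those together with your original flags, the commands would look like this (the device name, GID index, and server address below are taken from your post):

Server:
$ sudo ib_write_bw -d mlx5_0 -q 16 -t 1024 -x 3 -n 100000 -s 100000000 --report_gbits -R

Client:
$ sudo ib_write_bw 172.17.1.106 -d mlx5_0 -q 16 -t 1024 -x 3 -n 100000 -s 100000000 --report_gbits -R

Note that 100000 iterations of a 100 MB message moves a very large amount of data; you may want to drop -s back to the default 65536 for a quicker run.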

These changes should, in theory, let you get very close to line rate, as long as the host and the BlueField firmware/driver are properly tuned as well.

Thanks,

Jonathan.

Hi,

Thank you very much for your suggestions.

However, as shown in my packet capture results, the bandwidth limitation is caused by the sending window introduced when the Selective Repeat feature is enabled. This window restricts the number of in-flight packets (its size is 512 packets), while my link requires about 25,000 packets to fully utilize the bandwidth. Therefore, increasing tx-depth or the number of iterations cannot solve this issue.
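
(To put numbers on this, assuming a 1024-byte path MTU, which is consistent with the figures above: a 512-packet window allows at most 512 × 1024 B ≈ 0.5 MB in flight, and 0.5 MB per 5 ms round trip is only about 0.8 Gbps, which matches the sub-1 Gbps I measured. Filling the 25 MB bandwidth-delay product of this link at that MTU indeed takes roughly 25,000 packets.)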

Increasing the number of QPs does improve bandwidth, since each QP maintains its own window. However, my goal is to achieve line-rate performance with a single QP. Hence, I’d like to know whether there is any way to remove or enlarge this window limitation while keeping the Selective Repeat feature enabled.

Thank you!