Hi everyone,
A BlueField-3 DPU SF appears to be limited to 32 QPs. Is there any way to raise this limit?
Background
Our use case is in-network compute. We are trying to offload the parameter server function to DPUs to avoid the PCIe bottleneck.
The application-level symptom when we scale to 7 workers is:
Check failed: (rdma_create_qp(cm_id, pd, &attr)) == (0)
Create RDMA queue pair failed: Cannot allocate memory
Environment
- Platform: BlueField-3 DPU
- Mode: DPU mode
- OS on DPU: Ubuntu, 64KB page size
- Kernel: 6.8.0-1013-bluefield-64k
- OFED: MLNX_OFED_LINUX-25.10-1.7.1
- FW: 32.47.1088
Relevant info:
getconf PAGESIZE
65536
ofed_info -s
MLNX_OFED_LINUX-25.10-1.7.1:
ibv_devinfo -v | grep -i fw_ver
fw_ver: 32.47.1088
Test: QP stress test
We create QPs on the SF with the following script:
#!/usr/bin/env python3
import sys
import time

from pyverbs.device import Context
from pyverbs.pd import PD
from pyverbs.cq import CQ
from pyverbs.qp import QP, QPCap, QPInitAttr
from pyverbs.pyverbs_error import PyverbsRDMAError
from pyverbs.enums import IBV_QPT_RC


def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <mlx5_x> [cq_depth] [max_iter] [sleep_sec]")
        sys.exit(1)
    devname = sys.argv[1]
    cq_depth = int(sys.argv[2]) if len(sys.argv) > 2 else 16
    max_iter = int(sys.argv[3]) if len(sys.argv) > 3 else 200000
    sleep_sec = int(sys.argv[4]) if len(sys.argv) > 4 else 60

    ctx = Context(name=devname)
    attr = ctx.query_device()
    print(
        f"device={devname} "
        f"max_qp={attr.max_qp} max_cq={attr.max_cq} "
        f"max_pd={attr.max_pd} max_qp_wr={attr.max_qp_wr} max_cqe={attr.max_cqe}"
    )

    pd = PD(ctx)
    objs = []
    ok = 0
    for i in range(max_iter):
        try:
            cq = CQ(ctx, cq_depth)
            cap = QPCap(max_send_wr=1, max_recv_wr=1, max_send_sge=1, max_recv_sge=1)
            init_attr = QPInitAttr(
                qp_type=IBV_QPT_RC,
                scq=cq,
                rcq=cq,
                cap=cap
            )
            qp = QP(pd, init_attr)
            objs.append((cq, qp))
            ok += 1
            if ok % 1000 == 0:
                print(f"created {ok} CQ/QP pairs", flush=True)
        except PyverbsRDMAError as e:
            print(f"STOP at i={i}: {e}")
            break
        except Exception as e:
            print(f"STOP at i={i}: {e}")
            break

    print(f"SUCCESS: created {ok} CQ/QP pairs on {devname}")
    print(f"sleep {sleep_sec}s", flush=True)
    time.sleep(sleep_sec)


if __name__ == "__main__":
    main()
Result on Arm-side SF
Example on mlx5_2:
./qp_stress.py mlx5_2 1 200000 60
device=mlx5_2 max_qp=131072 max_cq=16777216 max_pd=8388608 max_qp_wr=32768 max_cqe=4194303
STOP at i=32: Failed to create QP. Errno: 12, Cannot allocate memory
SUCCESS: created 32 CQ/QP pairs on mlx5_2
sleep 60s
So although query_device() reports very large max_qp/max_cq, on the Arm-side SF I can only create 32 RC QPs before ENOMEM.
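To narrow this down further, a variant of the script above could check whether the ceiling depends on the QP type or on the one-CQ-per-QP pattern. The following is only a rough sketch, not something we have validated on the DPU: `count_until_failure` and `probe_device` are hypothetical helper names, the shared CQ depth of 16 and the default limit of 1024 are arbitrary, and we assume `IBV_QPT_UD` is importable from `pyverbs.enums` in the same way as `IBV_QPT_RC`.

```python
from typing import Callable, List, Tuple


def count_until_failure(
    create: Callable[[], object], limit: int
) -> Tuple[int, str, List[object]]:
    """Call create() up to `limit` times and stop at the first exception.

    Returns (number created, error text or "", list of live objects).
    The objects are returned so the caller keeps them alive; dropping
    them would let the QPs be destroyed and their resources freed.
    """
    held: List[object] = []
    for _ in range(limit):
        try:
            held.append(create())
        except Exception as exc:
            return len(held), str(exc), held
    return len(held), "", held


def probe_device(devname: str, limit: int = 1024) -> None:
    """Probe RC and UD QP creation against one shared CQ on `devname`."""
    # Imported lazily so count_until_failure stays usable without pyverbs.
    from pyverbs.cq import CQ
    from pyverbs.device import Context
    from pyverbs.enums import IBV_QPT_RC, IBV_QPT_UD
    from pyverbs.pd import PD
    from pyverbs.qp import QP, QPCap, QPInitAttr

    ctx = Context(name=devname)
    pd = PD(ctx)
    cq = CQ(ctx, 16)  # a single CQ shared by every QP in this probe
    cap = QPCap(max_send_wr=1, max_recv_wr=1, max_send_sge=1, max_recv_sge=1)

    for label, qp_type in (("RC", IBV_QPT_RC), ("UD", IBV_QPT_UD)):
        init_attr = QPInitAttr(qp_type=qp_type, scq=cq, rcq=cq, cap=cap)
        n, err, held = count_until_failure(lambda: QP(pd, init_attr), limit)
        print(f"{devname} {label}: created {n} QPs; stopped by: {err or 'limit reached'}")
        # Release this batch before probing the next QP type.
        for qp in held:
            qp.close()
        held.clear()
```

Running `probe_device("mlx5_2")` on the SF and `probe_device("mlx5_0")` on the PF would show whether the same count appears for both QP types when only one CQ exists. If it does, that would suggest the limit is tied to the QPs themselves rather than to CQs or to the RC transport.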
Very important observation: PF vs SF
I compared PF and SF behavior on the DPU Arm side.
SF behavior
- mlx5_2 can create 32 QPs
- mlx5_4 (a newly created “clean” SF) can also create 32 QPs
- mlx5_2 and mlx5_4 can each create 32 QPs simultaneously
PF behavior
On the same DPU Arm side, using PF:
- mlx5_0 can create 63952 QPs
Also, while mlx5_2 is holding its 32 QPs, mlx5_0 can still create 63952 QPs.
This makes it look like the limitation is per-SF, not global to the whole HCA.
SF setup details
I checked the default SF profile and saw:
Function max_io_eqs: 8
I tried increasing max_io_eqs, but it did not change the 32-QP ceiling.
I also tried:
- creating a new “clean” SF instead of using the default sf0
- increasing PF_SF_BAR_SIZE
- increasing PF_LOG_BAR_SIZE (currently set to 7)
- reducing application queue depths aggressively
None of these changed the 32 QP per SF behavior.
What I would like to understand
- What causes this RC QP limit per SF on BlueField-3 Arm side in DPU mode? Is this limited by hardware or some software configuration?
- Is this related to Arm-side SF UAR / doorbell / BAR / per-function resource limits? We are on a 64KB page-size system.
- Is there any supported way to increase the usable RC QP count per SF on the Arm side? We would really appreciate it if you could tell us how to bypass this limitation so that we can scale our training cluster.
- If the answer is that SF is not intended for this many RC QPs, is there any suggested way to support large-QP-count RDMA workloads on Arm-side in DPU mode?
Thanks.