BlueField-3 DPU mode: Arm-side SF can only create 32 RC QPs

Hi everyone,

On a BlueField-3 DPU, each SF appears to be limited to 32 RC QPs. Is there any way to raise this limit?

Background

Our use case is in-network compute: we are offloading the parameter-server function to DPUs to avoid the PCIe bottleneck.

The application-level symptom when we scale to 7 workers is:

Check failed: (rdma_create_qp(cm_id, pd, &attr)) == (0)
Create RDMA queue pair failed: Cannot allocate memory

Environment

  • Platform: BlueField-3 DPU

  • Mode: DPU mode

  • OS on DPU: Ubuntu, 64KB page size

  • Kernel: 6.8.0-1013-bluefield-64k

  • OFED: MLNX_OFED_LINUX-25.10-1.7.1

  • FW: 32.47.1088

Relevant info:

getconf PAGESIZE
65536

ofed_info -s
MLNX_OFED_LINUX-25.10-1.7.1:

ibv_devinfo -v | grep -i fw_ver
fw_ver: 32.47.1088

Test: QP stress test

We create CQ/QP pairs on the SF until creation fails, using the following script:

#!/usr/bin/env python3
import sys
import time

from pyverbs.device import Context
from pyverbs.pd import PD
from pyverbs.cq import CQ
from pyverbs.qp import QP, QPCap, QPInitAttr
from pyverbs.pyverbs_error import PyverbsRDMAError
from pyverbs.enums import IBV_QPT_RC

def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <mlx5_x> [cq_depth] [max_iter] [sleep_sec]")
        sys.exit(1)

    devname = sys.argv[1]
    cq_depth = int(sys.argv[2]) if len(sys.argv) > 2 else 16
    max_iter = int(sys.argv[3]) if len(sys.argv) > 3 else 200000
    sleep_sec = int(sys.argv[4]) if len(sys.argv) > 4 else 60

    ctx = Context(name=devname)
    attr = ctx.query_device()

    print(
        f"device={devname} "
        f"max_qp={attr.max_qp} max_cq={attr.max_cq} "
        f"max_pd={attr.max_pd} max_qp_wr={attr.max_qp_wr} max_cqe={attr.max_cqe}"
    )

    pd = PD(ctx)
    objs = []
    ok = 0

    for i in range(max_iter):
        try:
            cq = CQ(ctx, cq_depth)
            cap = QPCap(max_send_wr=1, max_recv_wr=1, max_send_sge=1, max_recv_sge=1)
            init_attr = QPInitAttr(
                qp_type=IBV_QPT_RC,
                scq=cq,
                rcq=cq,
                cap=cap
            )
            qp = QP(pd, init_attr)
            objs.append((cq, qp))
            ok += 1

            if ok % 1000 == 0:
                print(f"created {ok} CQ/QP pairs", flush=True)

        except PyverbsRDMAError as e:
            print(f"STOP at i={i}: {e}")
            break
        except Exception as e:
            print(f"STOP at i={i}: {e}")
            break

    print(f"SUCCESS: created {ok} CQ/QP pairs on {devname}")
    print(f"sleep {sleep_sec}s", flush=True)
    time.sleep(sleep_sec)

if __name__ == "__main__":
    main()

Result on Arm-side SF

Example on mlx5_2:

./qp_stress.py mlx5_2 1 200000 60
device=mlx5_2 max_qp=131072 max_cq=16777216 max_pd=8388608 max_qp_wr=32768 max_cqe=4194303
STOP at i=32: Failed to create QP. Errno: 12, Cannot allocate memory
SUCCESS: created 32 CQ/QP pairs on mlx5_2
sleep 60s

So although query_device() reports very large max_qp/max_cq, on the Arm-side SF I can only create 32 RC QPs before ENOMEM.
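
To make the gap between the advertised and the achieved QP count explicit, here is a minimal sketch that parses the output of the run above. The `summarize` helper is hypothetical (it is not part of the test script), and the sample text is copied from the mlx5_2 run.

```python
import re

# Output lines copied from the mlx5_2 run above.
SAMPLE = """\
device=mlx5_2 max_qp=131072 max_cq=16777216 max_pd=8388608 max_qp_wr=32768 max_cqe=4194303
STOP at i=32: Failed to create QP. Errno: 12, Cannot allocate memory
SUCCESS: created 32 CQ/QP pairs on mlx5_2
"""

def summarize(output):
    """Return (device, advertised max_qp, QPs actually created)."""
    dev = re.search(r"device=(\S+)", output).group(1)
    # "max_qp=" does not match "max_qp_wr=", so this picks the right field.
    max_qp = int(re.search(r"max_qp=(\d+)", output).group(1))
    # Only the final summary line ends with "pairs on <device>".
    created = int(re.search(r"created (\d+) CQ/QP pairs on", output).group(1))
    return dev, max_qp, created

dev, max_qp, created = summarize(SAMPLE)
print(f"{dev}: created {created} of {max_qp} advertised QPs "
      f"({created / max_qp:.4%})")
```

Running the same helper on the output from mlx5_0 and mlx5_4 gives a quick per-device comparison.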

Key observation: PF vs SF

I compared PF and SF behavior on the DPU Arm side.

SF behavior

  • mlx5_2 can create 32 QPs

  • mlx5_4 (a newly created “clean” SF) can also create 32 QPs

  • mlx5_2 and mlx5_4 can each create 32 QPs simultaneously

PF behavior

On the same DPU Arm side, using PF:

  • mlx5_0 can create 63952 QPs

Also, while mlx5_2 is holding its 32 QPs, mlx5_0 can still create 63952 QPs.

This makes it look like the limitation is per-SF, not global to the whole HCA.

SF setup details

I checked the default SF profile and saw:

Function max_io_eqs: 8

I tried increasing max_io_eqs, but it did not change the 32-QP ceiling.

I also tried:

  • creating a new “clean” SF instead of using the default sf0
  • increasing PF_SF_BAR_SIZE
  • increasing PF_LOG_BAR_SIZE (currently set to 7)
  • reducing application queue depths aggressively

None of these changed the 32 QP per SF behavior.

What I would like to understand

  1. What causes this RC QP limit per SF on BlueField-3 Arm side in DPU mode?
    Is this limited by hardware or some software configuration?

  2. Is this related to Arm-side SF UAR / doorbell / BAR / per-function resource limits?
    We are on a 64KB page-size system.

  3. Is there any supported way to increase the usable RC QP count per SF on the Arm side?
    We would really appreciate it if you could tell us how to bypass this limitation so that we can scale our training cluster.

  4. If SFs are not intended for this many RC QPs, is there a recommended way to support RDMA workloads with large QP counts on the Arm side in DPU mode?

Thanks.

Hi,

We were able to resolve this issue.

The root cause was the SF BAR size configuration.
After we increased the BAR allocation for SFs on the DPU side, the Arm-side SF was no longer limited to 32 RC QPs.

Solution

Run the following commands inside the DPU:

sudo mlxconfig -d 0000:03:00.0 s PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10
sudo mlxconfig -d 0000:03:00.1 s PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10
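
For a rough sense of what these values mean, here is a small arithmetic sketch. It assumes that PF_SF_BAR_SIZE is the base-2 logarithm of the per-SF BAR size in KB, which is how I understand the mlxconfig parameter description; please verify this against the documentation for your firmware. It does not attempt to explain how BAR space maps to QPs.

```python
def sf_bar_kib(log2_kib):
    """Per-SF BAR size in KiB, ASSUMING the parameter is log2 of the size in KB."""
    return 1 << log2_kib

PAGE_KIB = 64          # getconf PAGESIZE reported 65536 on this DPU
PF_SF_BAR_SIZE = 10    # value from the mlxconfig commands above
PF_TOTAL_SF = 236      # value from the mlxconfig commands above

per_sf = sf_bar_kib(PF_SF_BAR_SIZE)
print(f"per-SF BAR: {per_sf} KiB = {per_sf // PAGE_KIB} pages of {PAGE_KIB} KiB")
print(f"all {PF_TOTAL_SF} SFs on one PF: {per_sf * PF_TOTAL_SF // 1024} MiB")
```

Under that assumption, each SF gets 1 MiB of BAR space, which is 16 pages on a 64KB-page kernel.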

Important

After applying the configuration, a cold reboot / power cycle is required:

  • shut down the system

  • completely remove power

  • discharge the system

  • power it back on

A normal reboot is not sufficient.

After this change and cold boot, the problem was fixed.
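
To confirm that the settings are in place after the power cycle, the same parameters can be read back with the mlxconfig query command (`q`). Below is a minimal sketch of a wrapper; the helper names are hypothetical, and the output is printed raw because no assumption is made about its exact format.

```python
import subprocess

# Parameters set by the mlxconfig commands above.
PARAMS = ["PF_BAR2_ENABLE", "PER_PF_NUM_SF", "PF_TOTAL_SF", "PF_SF_BAR_SIZE"]

def build_query_cmd(pci_addr, params=PARAMS):
    """Build the argv for querying the given parameters on one PF."""
    return ["sudo", "mlxconfig", "-d", pci_addr, "q", *params]

def query(pci_addr):
    """Run the query on the DPU and return mlxconfig's raw text output."""
    result = subprocess.run(build_query_cmd(pci_addr),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

On the DPU, `print(query("0000:03:00.0"))` and `print(query("0000:03:00.1"))` show the values for each PF; rerunning the QP stress test on the SF afterwards confirms whether the 32-QP ceiling is gone.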