Hello:
I’m posting here to ask for advice on an issue I’ve been facing while running a DOCA-based UCX application on a BlueField-3 DPU.
The following error occurs during execution:
[1761260065.615737] [urom-daemon-bf:536101:0] ib_mlx5.c:592 UCX ERROR mlx5dv_devx_alloc_uar(device=mlx5_2, flags=0x0) type=WC failed: Cannot allocate memory. Consider increasing PF_LOG_BAR_SIZE using mlxconfig tool (requires reboot)
Following the suggestion in the message, I used the mlxconfig tool to increase PF_LOG_BAR_SIZE from its default value of 5 to 8 and rebooted the system, but the same error still occurs.
I also tried increasing the value of PF_BAR2_SIZE, but it did not help.
Environment
- Platform: NVIDIA BlueField-3
- DOCA: 3.1.0
- UCX: 1.19.0
- OS: Ubuntu 22.04
- Kernel: 5.15.0-1074-bluefield
- Driver (OFED): MLNX_OFED_LINUX-25.07-0.9.7
What I have tried
- Verified that PF_LOG_BAR_SIZE=8 using mlxconfig -d <device> query | grep PF_LOG_BAR_SIZE
- Rebooted the system, but the same error persists
- Increased PF_BAR2_SIZE, but no change
- Other parameters (NUM_OF_UARS, LOG_BAR_SIZE, etc.) remain at their default values
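For reference, the verification step above could be scripted roughly as follows. This is only a minimal sketch: the sample text stands in for real `mlxconfig -d <device> query` output, its values (including PF_BAR2_SIZE=3) are made up for illustration, and the assumed "<NAME> <value>" line layout may differ between MFT versions. The helper name parse_mlxconfig_query is hypothetical.

```python
import re

# Hypothetical sample of `mlxconfig -d <device> query` output. The values
# are illustrative only and the real layout may differ between MFT versions.
SAMPLE_OUTPUT = """\
Configurations:                              Next Boot
         NUM_OF_VFS                          16
         PF_LOG_BAR_SIZE                     8
         PF_BAR2_SIZE                        3
"""


def parse_mlxconfig_query(text, names):
    """Return {name: value} for the requested parameter names.

    Assumes each parameter line looks like "<NAME> <value>"; an optional
    leading '*' marker is tolerated (assumption about some output modes).
    """
    wanted = set(names)
    found = {}
    for line in text.splitlines():
        match = re.match(r"^\s*\*?\s*([A-Z0-9_]+)\s+(\S+)", line)
        if match and match.group(1) in wanted:
            found[match.group(1)] = match.group(2)
    return found


if __name__ == "__main__":
    params = parse_mlxconfig_query(
        SAMPLE_OUTPUT, ["PF_LOG_BAR_SIZE", "PF_BAR2_SIZE"]
    )
    for name, value in params.items():
        print(f"{name} = {value}")
```

On a real system, the captured output of the query command would replace SAMPLE_OUTPUT. Values are returned as strings because some parameters are reported in forms other than plain integers.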
Additional context
This issue occurs while running the DOCA UROM sample applications urom_multi_workers_bootstrap_sample and worker_graph.
When I set the number of workers to 2, the application runs successfully with the default PF_LOG_BAR_SIZE=5.
However, when I increase the number of workers to 4, the above error appears.
Even after increasing PF_LOG_BAR_SIZE to 8, the same error still occurs.
Questions
- What other possible causes could lead to this error?
- Are there additional parameters (e.g., PCI BAR resource limits, UAR allocation) that should be adjusted besides PF_LOG_BAR_SIZE and PF_BAR2_SIZE?
- Has anyone experienced a similar issue with UCX on BlueField-3?
Any suggestions or workarounds would be greatly appreciated.
I was unable to edit the original topic, so I posted the same content again at the following link: