We are experiencing the same issue in one of our clusters with ConnectX-6 cards. Our configurations are as follows:
Kernel: 3.10.0-1127.19.1.el7.x86_64 (CentOS 7.8)
OFED: MLNX_OFED_LINUX-5.0-2.1.8.0
Firmware: 20.27.6106
Hardware: Mellanox ConnectX-6 Single Port VPI HDR100 QSFP Adapter
In our case, the problem is not limited to UCX; it also impacts other transports such as OFI and verbs. We see many messages like the following in dmesg/syslog:
--------------------------------------------------------------------------
Failed to create a completion queue (CQ):
Hostname: node-a440
Requested CQE: 16384
Error: Cannot allocate memory
Check the CQE attribute.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.
You job will continue, but Open MPI will ignore the "ud" oob component
in this run.
Hostname: node-a440
--------------------------------------------------------------------------
[1609972400.139736] [node-a440:57085:0] ib_mlx5dv_md.c:710 UCX ERROR mlx5dv_devx_umem_reg() zero umem failed: Cannot allocate memory
[node-a440:57085:0:57085] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x318)
[1609972400.144726] [node-a440:57088:0] ib_mlx5dv_md.c:710 UCX ERROR mlx5dv_devx_umem_reg() zero umem failed: Cannot allocate memory
[node-a440:57088:0:57088] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x318)
==== backtrace (tid: 57085) ====
0 0x000000000004cb95 ucs_debug_print_backtrace() ???:0
1 0x000000000001ac29 uct_ib_md_open() ???:0
2 0x000000000000e432 uct_md_open() ???:0
3 0x0000000000010308 ???() /usr/lib64/libucp.so.0:0
4 0x00000000000113a1 ucp_init_version() ???:0
5 0x00000000001b2db0 mca_pml_ucx_open() ???:0
6 0x0000000000077aa7 mca_base_framework_components_open() ???:0
7 0x00000000001aefaf mca_pml_base_open() pml_base_frame.c:0
8 0x0000000000081e91 mca_base_framework_open() ???:0
9 0x0000000000073474 ompi_mpi_init() ???:0
10 0x00000000000a288f PMPI_Init() ???:0
11 0x0000000000400933 main() ???:0
12 0x0000000000022555 __libc_start_main() ???:0
13 0x0000000000400859 _start() ???:0
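To help narrow this down, below is a minimal standalone check (a sketch only, not something we have validated) that exercises the same two operations that fail in the logs above: a plain verbs CQ creation with the 16384 CQEs Open MPI requested, and a DEVX umem registration similar to what UCX does at ib_mlx5dv_md.c:710. It assumes the libibverbs/mlx5dv headers shipped with MLNX_OFED, and it registers one page rather than UCX's exact zero-length buffer, so it only approximates the failing call.

```c
/* cq_umem_check.c -- minimal reproducer sketch for the two failures above.
 * Build (library names may differ per install):
 *   gcc cq_umem_check.c -o cq_umem_check -libverbs -lmlx5
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>
#include <infiniband/mlx5dv.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first device (the single-port CX-6 on our nodes) with DEVX
     * enabled, which is roughly how UCX opens it before calling
     * mlx5dv_devx_umem_reg(). */
    struct mlx5dv_context_attr dv_attr = {0};
    dv_attr.flags = MLX5DV_CONTEXT_FLAGS_DEVX;
    struct ibv_context *ctx = mlx5dv_open_device(devs[0], &dv_attr);
    if (!ctx) {
        fprintf(stderr, "mlx5dv_open_device failed: %s\n", strerror(errno));
        return 1;
    }

    /* Same CQE count that Open MPI requested in the error above. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16384, NULL, NULL, 0);
    if (!cq)
        fprintf(stderr, "ibv_create_cq(16384) failed: %s\n", strerror(errno));
    else
        printf("ibv_create_cq(16384) OK\n");

    /* UCX fails inside mlx5dv_devx_umem_reg(); registering a single page
     * here only approximates that call, but should exercise the same
     * allocation path in the driver. */
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) == 0) {
        struct mlx5dv_devx_umem *umem =
            mlx5dv_devx_umem_reg(ctx, buf, 4096, IBV_ACCESS_LOCAL_WRITE);
        if (!umem) {
            fprintf(stderr, "mlx5dv_devx_umem_reg failed: %s\n",
                    strerror(errno));
        } else {
            printf("mlx5dv_devx_umem_reg OK\n");
            mlx5dv_devx_umem_dereg(umem);
        }
        free(buf);
    }

    if (cq)
        ibv_destroy_cq(cq);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

If something like this fails with "Cannot allocate memory" on an affected node but succeeds after a reboot, that would point at a leaked driver/firmware resource rather than anything specific to Open MPI or UCX.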
We currently have no way to mitigate the situation other than rebooting the affected nodes, and the problem seems to appear at random on a subset of them.
Is this a known firmware/OFED issue, and if so, what triggers it?
Any help would be appreciated. Please let us know if you need more information.