UCX error with driver 5.1-2.5.8 on RHEL 7.9

I get an error when executing ‘ucx_info -d’ as a normal user:

ucx_info -d

Transport: rc_verbs

Device: mlx5_0:1

[1608791980.432700] [drp-srcf-mon001:17816:0] ib_iface.c:961 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory

< failed to open interface >

Note that the same command looks OK when running as root:

root> ucx_info -d

Transport: rc_verbs

Device: mlx5_0:1

capabilities:

bandwidth: 94353.86/ppn + 0.00 MB/sec

latency: 600 + 1.000 * N nsec

overhead: 75 nsec

put_short: <= 124

put_bcopy: <= 8256

put_zcopy: <= 1G, up to 3 iov

put_opt_zcopy_align: <= 512

put_align_mtu: <= 4K

get_bcopy: <= 8256

get_zcopy: 65..1G, up to 3 iov

get_opt_zcopy_align: <= 512

get_align_mtu: <= 4K

am_short: <= 123

am_bcopy: <= 8255

am_zcopy: <= 8255, up to 2 iov

am_opt_zcopy_align: <= 512

am_align_mtu: <= 4K

am header: <= 127

domain: device

atomic_add: 64 bit

atomic_fadd: 64 bit

atomic_cswap: 64 bit

connection: to ep

device priority: 50

device num paths: 1

max eps: 256

device address: 3 bytes

ep address: 5 bytes

error handling: peer failure
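
Since the command succeeds as root but fails for a normal user, one thing that may be worth comparing between the two shells is the locked-memory limit (RLIMIT_MEMLOCK). RDMA verbs pin memory when creating completion queues, and root typically bypasses this limit. The snippet below is only a minimal sketch: the helper name memlock_status is hypothetical and not part of UCX, and whether the limit is the actual cause here is an assumption until it is checked.

```shell
#!/bin/sh
# memlock_status VALUE: classify a value as printed by `ulimit -l`,
# which is either the word "unlimited" or a size in KiB.
# (Hypothetical helper name, used here for illustration only.)
memlock_status() {
    case "$1" in
        unlimited)   echo "unlimited" ;;
        ''|*[!0-9]*) echo "unknown" ;;
        *)           echo "limited to $1 KiB" ;;
    esac
}

# Report the soft and hard locked-memory limits of the current shell.
echo "soft memlock: $(memlock_status "$(ulimit -S -l)")"
echo "hard memlock: $(memlock_status "$(ulimit -H -l)")"
```

Running this once as the normal user and once as root shows whether the two sessions have different limits.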

Current setup:

ethtool -i ib0

driver: mlx5_core[ib_ipoib]

version: 5.1-2.5.8

firmware-version: 20.28.2006 (MT_0000000222)

expansion-rom-version:

bus-info: 0000:01:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: no

supports-register-dump: no

supports-priv-flags: yes

uname -a

Linux drp-srcf-cmp034 3.10.0-1160.6.1.el7.x86_64 #1 SMP Wed Oct 21 13:44:38 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

rpm -qa | grep ucx

ucx-cma-1.9.0-1.51258.x86_64

ucx-1.9.0-1.51258.x86_64

ucx-knem-1.9.0-1.51258.x86_64

ucx-devel-1.9.0-1.51258.x86_64

ucx-rdmacm-1.9.0-1.51258.x86_64

ucx-ib-1.9.0-1.51258.x86_64

cat /etc/redhat-release

Red Hat Enterprise Linux Server release 7.9 (Maipo)

This error is currently preventing me from running mpirun using UCX.

Thank you very much for your help in this matter,

Amedeo

Hi,

Can you provide the full output?

Is your issue similar to this one:

If yes, a workaround is suggested there.

Regards

Marc

Hi,

We are experiencing the same issue in one of our clusters with ConnectX-6 cards. Our configurations are as follows:


Kernel: 3.10.0-1127.19.1.el7.x86_64 (CentOS 7.8)

OFED: MLNX_OFED_LINUX-5.0-2.1.8.0

Firmware: 20.27.6106

Hardware: Mellanox ConnectX-6 Single Port VPI HDR100 QSFP Adapter

In our case, the problem is not limited to UCX; it also affects other transports such as OFI and verbs. We see many messages like these in dmesg/syslog:


mlx5_core 0000:21:00.0: mlx5_cmd_check:795:(pid 46943): CREATE_QP(0x500) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x19c6d6)

and


[3654721.433472] mlx5_core 0000:21:00.0: reclaim_pages:437:(pid 12420): failed reclaiming pages: err -5

[3654721.437288] mlx5_core 0000:21:00.0: pages_work_handler:496:(pid 12420): reclaim fail -5

[3654728.566133] mlx5_core 0000:21:00.0: mlx5_cmd_check:795:(pid 12420): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)

The user-level MPI error is:


Failed to create a completion queue (CQ):

Hostname: node-a440

Requested CQE: 16384

Error: Cannot allocate memory

Check the CQE attribute.

--------------------------------------------------------------------------

--------------------------------------------------------------------------

Open MPI has detected that there are UD-capable Verbs devices on your

system, but none of them were able to be setup properly. This may

indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component

in this run.

Hostname: node-a440

--------------------------------------------------------------------------

[1609972400.139736] [node-a440:57085:0] ib_mlx5dv_md.c:710 UCX ERROR mlx5dv_devx_umem_reg() zero umem failed: Cannot allocate memory

[node-a440:57085:0:57085] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x318)

[1609972400.144726] [node-a440:57088:0] ib_mlx5dv_md.c:710 UCX ERROR mlx5dv_devx_umem_reg() zero umem failed: Cannot allocate memory

[node-a440:57088:0:57088] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x318)

==== backtrace (tid: 57085) ====

0 0x000000000004cb95 ucs_debug_print_backtrace() ???:0

1 0x000000000001ac29 uct_ib_md_open() ???:0

2 0x000000000000e432 uct_md_open() ???:0

3 0x0000000000010308 ???() /usr/lib64/libucp.so.0:0

4 0x00000000000113a1 ucp_init_version() ???:0

5 0x00000000001b2db0 mca_pml_ucx_open() ???:0

6 0x0000000000077aa7 mca_base_framework_components_open() ???:0

7 0x00000000001aefaf mca_pml_base_open() pml_base_frame.c:0

8 0x0000000000081e91 mca_base_framework_open() ???:0

9 0x0000000000073474 ompi_mpi_init() ???:0

10 0x00000000000a288f PMPI_Init() ???:0

11 0x0000000000400933 main() ???:0

12 0x0000000000022555 __libc_start_main() ???:0

13 0x0000000000400859 _start() ???:0

We currently have no way to mitigate the situation except rebooting the affected nodes. The problem seems to appear randomly on a subset of the nodes.
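
To find affected nodes before a job fails on them, the kernel log can be scanned for the mlx5_core failure lines quoted above. This is a rough sketch only: the helper name count_mlx5_failures is hypothetical, and the match patterns are taken from the messages in this post, so they may miss other failure wordings.

```shell
#!/bin/sh
# count_mlx5_failures: read kernel log lines on stdin and print how many
# of them are mlx5_core lines that report a failure.
# (Hypothetical helper; patterns come from the messages quoted in this post.)
count_mlx5_failures() {
    grep 'mlx5_core' | grep -c -E 'failed|reclaim fail' || true
}

# Example: scan the kernel ring buffer of the local node.
# Reading dmesg may require root on some systems.
if command -v dmesg >/dev/null 2>&1; then
    n=$(dmesg 2>/dev/null | count_mlx5_failures)
    echo "$(uname -n): ${n} mlx5_core failure lines"
fi
```

A non-zero count on a node would be a hint that it is in the bad state described here; it is not proof of the root cause.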

Is this a known firmware/OFED issue, and what triggers it?

Any help will be appreciated. Please let us know if you need more info.

Best regards,

Amiya.

We fixed this error on our system by adding the file /etc/security/limits.d/rdma.conf containing:

* soft memlock unlimited

* hard memlock unlimited
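
For reference, here is a sketch of creating that file and checking the result. Entries in limits.d files take a domain as their first field; the "*" used below (apply to all users) is an assumption, and a group such as @somegroup could be used instead. The helper name write_memlock_limits is hypothetical. These limits are applied by pam_limits when a session starts, so they should only become visible in a fresh login session; daemons or batch schedulers started outside PAM may need their own setting.

```shell
#!/bin/sh
# write_memlock_limits FILE: write soft and hard memlock entries to FILE.
# (Hypothetical helper; the "*" domain, meaning all users, is an assumption.)
write_memlock_limits() {
    printf '%s\n' \
        '* soft memlock unlimited' \
        '* hard memlock unlimited' > "$1"
}

# Usage (as root):
#   write_memlock_limits /etc/security/limits.d/rdma.conf
# Then, from a new login session as the normal user:
#   ulimit -l    # expected to print "unlimited" if the limit was applied
#   ucx_info -d
```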