We are experiencing the same issue in one of our clusters with ConnectX-6 cards. Our configurations are as follows:
Kernel: 3.10.0-1127.19.1.el7.x86_64 (CentOS 7.8)
OFED: MLNX_OFED_LINUX-5.0-2.1.8.0
Firmware: 20.27.6106
Hardware: Mellanox ConnectX-6 Single Port VPI HDR100 QSFP Adapter
In our case, the problem is not limited to UCX; it also impacts other transports such as OFI and verbs. We see many messages like the following in dmesg/syslog:
--------------------------------------------------------------------------
Failed to create a completion queue (CQ):
Hostname: node-a440
Requested CQE: 16384
Error: Cannot allocate memory
Check the CQE attribute.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.
You job will continue, but Open MPI will ignore the "ud" oob component
in this run.
Hostname: node-a440
--------------------------------------------------------------------------
[1609972400.139736] [node-a440:57085:0] ib_mlx5dv_md.c:710 UCX ERROR mlx5dv_devx_umem_reg() zero umem failed: Cannot allocate memory
[node-a440:57085:0:57085] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x318)
[1609972400.144726] [node-a440:57088:0] ib_mlx5dv_md.c:710 UCX ERROR mlx5dv_devx_umem_reg() zero umem failed: Cannot allocate memory
[node-a440:57088:0:57088] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x318)
==== backtrace (tid: 57085) ====
0 0x000000000004cb95 ucs_debug_print_backtrace() ???:0
1 0x000000000001ac29 uct_ib_md_open() ???:0
2 0x000000000000e432 uct_md_open() ???:0
3 0x0000000000010308 ???() /usr/lib64/libucp.so.0:0
4 0x00000000000113a1 ucp_init_version() ???:0
5 0x00000000001b2db0 mca_pml_ucx_open() ???:0
6 0x0000000000077aa7 mca_base_framework_components_open() ???:0
7 0x00000000001aefaf mca_pml_base_open() pml_base_frame.c:0
8 0x0000000000081e91 mca_base_framework_open() ???:0
9 0x0000000000073474 ompi_mpi_init() ???:0
10 0x00000000000a288f PMPI_Init() ???:0
11 0x0000000000400933 main() ???:0
12 0x0000000000022555 __libc_start_main() ???:0
13 0x0000000000400859 _start() ???:0
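To help narrow this down, below is a minimal standalone check (a sketch only, not something we have validated) that exercises the same two operations that fail in the logs above: a plain verbs CQ creation with the 16384 CQEs Open MPI requested, and a DEVX umem registration similar to what UCX does at ib_mlx5dv_md.c:710. It assumes the libibverbs/mlx5dv headers shipped with MLNX_OFED, and it registers one page rather than UCX's exact zero-length buffer, so it only approximates the failing call.

```c
/* cq_umem_check.c -- minimal reproducer sketch for the two failures above.
 * Build (library names may differ per install):
 *   gcc cq_umem_check.c -o cq_umem_check -libverbs -lmlx5
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>
#include <infiniband/mlx5dv.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first device (the single-port CX-6 on our nodes) with DEVX
     * enabled, which is roughly how UCX opens it before calling
     * mlx5dv_devx_umem_reg(). */
    struct mlx5dv_context_attr dv_attr = {0};
    dv_attr.flags = MLX5DV_CONTEXT_FLAGS_DEVX;
    struct ibv_context *ctx = mlx5dv_open_device(devs[0], &dv_attr);
    if (!ctx) {
        fprintf(stderr, "mlx5dv_open_device failed: %s\n", strerror(errno));
        return 1;
    }

    /* Same CQE count that Open MPI requested in the error above. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16384, NULL, NULL, 0);
    if (!cq)
        fprintf(stderr, "ibv_create_cq(16384) failed: %s\n", strerror(errno));
    else
        printf("ibv_create_cq(16384) OK\n");

    /* UCX fails inside mlx5dv_devx_umem_reg(); registering a single page
     * here only approximates that call, but should exercise the same
     * allocation path in the driver. */
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) == 0) {
        struct mlx5dv_devx_umem *umem =
            mlx5dv_devx_umem_reg(ctx, buf, 4096, IBV_ACCESS_LOCAL_WRITE);
        if (!umem) {
            fprintf(stderr, "mlx5dv_devx_umem_reg failed: %s\n",
                    strerror(errno));
        } else {
            printf("mlx5dv_devx_umem_reg OK\n");
            mlx5dv_devx_umem_dereg(umem);
        }
        free(buf);
    }

    if (cq)
        ibv_destroy_cq(cq);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

If something like this fails with "Cannot allocate memory" on an affected node but succeeds after a reboot, that would point at a leaked driver/firmware resource rather than anything specific to Open MPI or UCX.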
We currently have no way to mitigate the situation other than rebooting the affected nodes, and the problem seems to appear at random on a subset of them.
Is this a known firmware/OFED issue, and if so, what triggers it?
Any help would be appreciated. Please let us know if you need more information.