We are seeing the same problem with a Mellanox card on CentOS Linux release 8.1.1911 (Core).
Mellanox card installed:
[root@client2 ~]# lspci | grep Mell
03:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
[root@client2 ~]#
with mdtest run:
./mdtest
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: client2
Local device: mlx4_0
Local port: 1
CPCs attempted: rdmacm, udcm
[client2:2394 :0:2394] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc54b7d6768)
==== backtrace ====
0 /lib64/libucs.so.0(+0x18bb0) [0x7fc54b169bb0]
1 /lib64/libucs.so.0(+0x18d8a) [0x7fc54b169d8a]
2 /lib64/libuct.so.0(+0x1655b) [0x7fc5506f955b]
3 /lib64/ld-linux-x86-64.so.2(+0xfd0a) [0x7fc55e453d0a]
4 /lib64/ld-linux-x86-64.so.2(+0xfe0a) [0x7fc55e453e0a]
5 /lib64/ld-linux-x86-64.so.2(+0x13def) [0x7fc55e457def]
6 /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7fc55d8ecab7]
7 /lib64/ld-linux-x86-64.so.2(+0x1365e) [0x7fc55e45765e]
8 /lib64/libdl.so.2(+0x11ba) [0x7fc55d0461ba]
9 /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7fc55d8ecab7]
10 /lib64/libc.so.6(_dl_catch_error+0x33) [0x7fc55d8ecb53]
11 /lib64/libdl.so.2(+0x1939) [0x7fc55d046939]
12 /lib64/libdl.so.2(dlopen+0x4a) [0x7fc55d04625a]
13 /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6df05) [0x7fc55d2b6f05]
14 /usr/lib64/openmpi/lib/libopen-pal.s
15 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35a) [0x7fc55d293a5a]
16 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x7fc55d29f3ce]
17 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x252) [0x7fc55d29f8b2]
18 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x15) [0x7fc55d29f915]
19 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x674) [0x7fc55dde8494]
20 /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72) [0x7fc55de186b2]
21 ./mdtest() [0x407f24]
22 /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fc55d7d7873]
23 ./mdtest() [0x401a8e]
===================
Segmentation fault (core dumped)