MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running MPI with CUDA

## Problem: Segmentation fault: invalid permissions for mapped object running MPI with CUDA

## Configuration

OS:


CentOS 7.5 (3.10.0-862.el7.x86_64)

Connectivity:


Back to Back

Software:


cuda-repo-rhel7-9-2-local-9.2.88-1.x86_64

nccl_2.2.13-1+cuda9.2_x86_64.tar

MLNX_OFED_LINUX-4.3-3.0.2.1-rhel7.5-x86_64.tgz

nvidia-peer-memory_1.0-7.tar.gz

openmpi-3.1.1.tar.bz2

osu-micro-benchmarks-5.4.2.tar.gz

[root@LOCALNODE ~]# lsmod | grep nv_peer_mem

nv_peer_mem 13163 0

ib_core 283851 11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib

nvidia 14019833 9 nv_peer_mem,nvidia_modeset,nvidia_uvm

[root@LOCALNODE ~]#
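(The same can also be checked through the init script that the nvidia-peer-memory package installs, assuming that script is in place:

service nv_peer_mem status

which should report the module as loaded.)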

## Steps Followed

Followed document: http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf

Openmpi command: mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_0:1 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

## Two issues seen where we need help from MLNX

  1. While running the OSU micro-benchmarks Device to Device (i.e., D D), we get a segmentation fault.

  2. Normal RDMA traffic (ib_send_*) runs fine between the two nodes on both ports, but while running the OSU micro-benchmarks, traffic only goes through mlx5_1 (port 1).

Note: The NVIDIA GPU and the Mellanox adapter are on different NUMA nodes.

[root@LOCALNODE ~]# cat /sys/module/mlx5_core/drivers/pci:mlx5_core/0000:*/numa_node

1

1

[root@LOCALNODE ~]# cat /sys/module/nvidia/drivers/pci:nvidia/0000:*/numa_node

0

[root@LOCALNODE ~]# lspci -tv | grep -i nvidia

| +-02.0-[19]----00.0 NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]

[root@LOCALNODE ~]# lspci -tv | grep -i mellanox

-+-[0000:d7]-+-02.0-[d8]--+-00.0 Mellanox Technologies MT27800 Family [ConnectX-5]

| | \-00.1 Mellanox Technologies MT27800 Family [ConnectX-5]

## Issue Details:

******************************

Issue 1:

[root@LOCALNODE nccl-tests]# mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_0 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D


No OpenFabrics connection schemes reported that they were able to be

used on a specific port. As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

Local host: LOCALNODE

Local device: mlx5_0

Local port: 1

CPCs attempted: rdmacm, udcm


OSU MPI-CUDA Latency Test v5.4.1

Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

Size Latency (us)

0 1.20

[LOCALNODE:5297 :0:5297] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd69ea00000)

==== backtrace ====

0 0x0000000000045e92 ucs_debug_cleanup() ???:0

1 0x000000000000f6d0 _L_unlock_13() funlockfile.c:0

2 0x0000000000156e50 __memcpy_ssse3_back() :0

3 0x00000000000318e1 uct_rc_mlx5_ep_am_short() ???:0

4 0x0000000000027a5a ucp_tag_send_nbr() ???:0

5 0x0000000000004c71 mca_pml_ucx_send() ???:0

6 0x0000000000080202 MPI_Send() ???:0

7 0x0000000000401d42 main() /home/NVIDIA/osu-micro-benchmarks-5.4.2/mpi/pt2pt/osu_latency.c:116

8 0x0000000000022445 __libc_start_main() ???:0

9 0x000000000040205b _start() ???:0

===================


Primary job terminated normally, but 1 process returned

a non-zero exit code. Per user-direction, the job has been aborted.



mpirun noticed that process rank 0 with PID 0 on node LOCALNODE exited on signal 11 (Segmentation fault).


[LOCALNODE:05291] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

[LOCALNODE:05291] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

[root@LOCALNODE nccl-tests]#

Issue 2:

[root@LOCALNODE ~]# cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_*

0

0

0

0

0

0

0

0

0

0

0

[root@LOCALNODE ~]# cat /sys/class/infiniband/mlx5_1/ports/1/counters/port_*

0

18919889

0

1011812

0

0

0

9549739941

0

35318041

0

[root@LOCALNODE ~]#

Thanks & Regards

Ratan B

Thanks a lot for the reply. It solved the above issue, but after running mpirun I do not see any latency difference with and without GDR.

My questions:

  1. Why do I not see any latency difference with and without GDR?
  2. Is the below sequence of steps correct? Does it matter for my Question 1?

Note: I have a single GPU on both the host and the peer. The IOMMU is disabled.

nvidia-smi topo -m

GPU0 mlx5_0 mlx5_1 CPU Affinity

GPU0 X PHB PHB 18-35

mlx5_0 PHB X PIX

mlx5_1 PHB PIX X

Steps followed are:

  1. Install CUDA 9.2 and add the library and bin paths to .bashrc

  2. Install the latest MLNX_OFED

  3. Compile and install the nv_peer_mem driver

  4. Get UCX from git, configure it with CUDA, and install it

  5. Configure Open MPI 3.1.1 and install it:

./configure --prefix=/usr/local --with-wrapper-ldflags=-Wl,-rpath,/lib --enable-orterun-prefix-by-default --disable-io-romio --enable-picky --with-cuda=/usr/local/cuda-9.2

  6. Configure OSU Micro-Benchmarks 5.4.2 with CUDA and install it:

./configure prefix=/root/osu_benchmarks CC=mpicc --enable-cuda --with-cuda=/usr/local/cuda-9.2

Run mpirun. I do not see any latency difference with and without GDR.
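(A side question: is checking the rebuilt UCX with

ucx_info -v

ucx_info -d | grep -i cuda

the right way to confirm that CUDA support was actually compiled in? My understanding is that the first command prints the configure flags of the build and the second should list a cuda_copy transport, plus gdr_copy if gdrcopy was enabled.)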

Thanks for your Help.

I have encountered this problem, too.

It was because UCX was not compiled with CUDA (MLNX_OFED installs the default UCX).

When I recompiled UCX with CUDA and reinstalled it, it worked.
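For reference, my UCX build was roughly the following (the prefix is just the path I use, matching the --with-ucx option in the Open MPI configure below; add --with-gdrcopy if gdrcopy is installed):

git clone https://github.com/openucx/ucx.git

cd ucx

./autogen.sh

./configure --prefix=/opt/ucx-cuda --with-cuda=/usr/local/cuda

make -j && make install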

I'm not sure whether you have resolved the signal 11 problem my way.

For reference, I compile Open MPI against my UCX:

./configure --prefix=/usr/local/openmpi-3.1.1 --with-wrapper-ldflags=-Wl,-rpath,/lib --disable-vt --enable-orterun-prefix-by-default -disable-io-romio --enable-picky --with-cuda=/usr/local/cuda --with-ucx=/opt/ucx-cuda --enable-mem-debug --enable-debug --enable-timing
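As far as I know, once the ucx PML is selected the btl_openib_* parameters no longer apply, so for the CUDA runs I point mpirun at the UCX PML explicitly, roughly like this (hosts and benchmark path taken from your command; adjust to your install):

mpirun --allow-run-as-root -np 2 -host LOCALNODE,REMOTENODE -mca pml ucx -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D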

Actually, latency should be lower with GDR. What kind of NIC are you using, ConnectX-4 or ConnectX-3?

It would be great if you could share some test data and your test environment configuration.

Yes, using your approach the segmentation fault got resolved.

I am using a Mellanox ConnectX-5 adapter.

OS: CentOS 7.4

Does the below topology look good to you?

nvidia-smi topo -m

GPU0 mlx5_0 mlx5_1 CPU Affinity

GPU0 X PHB PHB 18-35

mlx5_0 PHB X PIX

mlx5_1 PHB PIX X

Running the below command to check the latency

mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_1 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

PHB: Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)

Based on the topology you give, mlx5_0 and mlx5_1 are connected to GPU0 through a PCIe Host Bridge.

That means that, even with GDR, the traffic goes from GPU0 through the local node's host bridge and only then to the NIC (mlx5_1).

On the remote node, the traffic goes from the NIC (mlx5_1) through the host bridge and then to GPU0.

In the non-GDR case, the GPU buffer is simply replaced by host memory (DDR); the traffic still goes through the host bridge. Maybe that is why the results look the same.

How much latency do you measure in your test?

Hi Jainkun yang,

Sorry for very late reply.

I am getting 7 microseconds of latency for the smallest message sizes.

When I run the osu_bw test, I see that system memory is also being used along with GPU memory. That seems strange. With GPUDirect RDMA we should not see any system memory usage, right? Am I missing something?

lspci -tv output for both systems:

+-[0000:80]-+-00.0-[81]--

| +-01.0-[82]--

| +-01.1-[83]--

| +-02.0-[84]--

| +-02.2-[85]----00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

| +-03.0-[86]----00.0 NVIDIA Corporation Device 15f8

On Host Systems:

80:02.2 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 2 (rev 02) (prog-if 00 [Normal decode])

80:03.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 3 (rev 02) (prog-if 00 [Normal decode])

On Peer System:

80:02.2 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 2 (rev 01) (prog-if 00 [Normal decode])

80:03.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 (rev 01) (prog-if 00 [Normal decode])

Host CPU:

lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 72

On-line CPU(s) list: 0-71

Thread(s) per core: 2

Core(s) per socket: 18

Socket(s): 2

NUMA node(s): 1

Vendor ID: GenuineIntel

CPU family: 6

Model: 63

Model name: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

Stepping: 2

CPU MHz: 1202.199

CPU max MHz: 3600.0000

CPU min MHz: 1200.0000

BogoMIPS: 4590.86

Virtualization: VT-x

L1d cache: 32K

L1i cache: 32K

L2 cache: 256K

L3 cache: 46080K

NUMA node0 CPU(s): 0-71

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts

Peer CPU:

lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 32

On-line CPU(s) list: 0-31

Thread(s) per core: 2

Core(s) per socket: 8

Socket(s): 2

NUMA node(s): 1

Vendor ID: GenuineIntel

CPU family: 6

Model: 79

Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

Stepping: 1

CPU MHz: 1201.019

CPU max MHz: 3000.0000

CPU min MHz: 1200.0000

BogoMIPS: 4191.23

Virtualization: VT-x

L1d cache: 32K

L1i cache: 32K

L2 cache: 256K

L3 cache: 20480K

NUMA node0 CPU(s): 0-31

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

We are seeing the same problem with Mellanox on CentOS Linux release 8.1.1911 (Core)

with mdtest run:

./mdtest


No OpenFabrics connection schemes reported that they were able to be

used on a specific port. As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

Local host: client2

Local device: mlx4_0

Local port: 1

CPCs attempted: rdmacm, udcm


[client2:2394 :0:2394] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc54b7d6768)

==== backtrace ====

0 /lib64/libucs.so.0(+0x18bb0) [0x7fc54b169bb0]

1 /lib64/libucs.so.0(+0x18d8a) [0x7fc54b169d8a]

2 /lib64/libuct.so.0(+0x1655b) [0x7fc5506f955b]

3 /lib64/ld-linux-x86-64.so.2(+0xfd0a) [0x7fc55e453d0a]

4 /lib64/ld-linux-x86-64.so.2(+0xfe0a) [0x7fc55e453e0a]

5 /lib64/ld-linux-x86-64.so.2(+0x13def) [0x7fc55e457def]

6 /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7fc55d8ecab7]

7 /lib64/ld-linux-x86-64.so.2(+0x1365e) [0x7fc55e45765e]

8 /lib64/libdl.so.2(+0x11ba) [0x7fc55d0461ba]

9 /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7fc55d8ecab7]

10 /lib64/libc.so.6(_dl_catch_error+0x33) [0x7fc55d8ecb53]

11 /lib64/libdl.so.2(+0x1939) [0x7fc55d046939]

12 /lib64/libdl.so.2(dlopen+0x4a) [0x7fc55d04625a]

13 /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6df05) [0x7fc55d2b6f05]

14 /usr/lib64/openmpi/lib/libopen-pal.s

15 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35a) [0x7fc55d293a5a]

16 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x7fc55d29f3ce]

17 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x252) [0x7fc55d29f8b2]

18 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x15) [0x7fc55d29f915]

19 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x674) [0x7fc55dde8494]

20 /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72) [0x7fc55de186b2]

21 ./mdtest() [0x407f24]

22 /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fc55d7d7873]

23 ./mdtest() [0x401a8e]

===================

Segmentation fault (core dumped)

Is there a workaround?