## Problem: Segmentation fault ("invalid permissions for mapped object") when running MPI with CUDA
## Configuration

OS: CentOS 7.5 (3.10.0-862.el7.x86_64)
Connectivity: back-to-back
Software:
- cuda-repo-rhel7-9-2-local-9.2.88-1.x86_64
- nccl_2.2.13-1+cuda9.2_x86_64.tar
- MLNX_OFED_LINUX-4.3-3.0.2.1-rhel7.5-x86_64.tgz
- nvidia-peer-memory_1.0-7.tar.gz
- openmpi-3.1.1.tar.bz2
- osu-micro-benchmarks-5.4.2.tar.gz
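For reference, the D D runs require an Open MPI built with CUDA support enabled at configure time. A minimal build sketch (the install prefix and CUDA path here are assumptions; adjust to the actual paths used):

```sh
# Sketch: build Open MPI 3.1.1 with CUDA awareness.
# --prefix and the CUDA location are assumptions.
tar xjf openmpi-3.1.1.tar.bz2 && cd openmpi-3.1.1
./configure --prefix=/usr/local --with-cuda=/usr/local/cuda
make -j"$(nproc)" && make install
```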
```
[root@LOCALNODE ~]# lsmod | grep nv_peer_mem
nv_peer_mem            13163  0
ib_core               283851  11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
nvidia              14019833  9 nv_peer_mem,nvidia_modeset,nvidia_uvm
[root@LOCALNODE ~]#
```
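The rest of the GPUDirect stack can be cross-checked like this (a sketch; it assumes MLNX_OFED and the nvidia-peer-memory package installed their usual tools):

```sh
# Sketch: sanity-check the GPUDirect RDMA stack
ofed_info -s                    # installed MLNX_OFED version
/etc/init.d/nv_peer_mem status  # nv_peer_mem service state
nvidia-smi -L                   # GPUs visible to the driver
```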
## Steps Followed

Followed document: http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf

Open MPI command:

```
mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_0:1 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D
```
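Before running the benchmark it is worth confirming that this Open MPI build is actually CUDA-aware; this parameter check is documented in the Open MPI FAQ:

```sh
# Prints "...mpi_built_with_cuda_support:value:true" when Open MPI
# was configured with --with-cuda
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```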
## Two issues where we need help from Mellanox

- While running the OSU micro-benchmarks Device to Device (i.e., D D), we get a segmentation fault.
- Normal RDMA traffic (ib_send_*) runs fine between both nodes and on both ports, but the OSU micro-benchmark traffic only goes through port 1 (mlx5_1).

Note: the NVIDIA GPU and the Mellanox adapter are on different NUMA nodes.
```
[root@LOCALNODE ~]# cat /sys/module/mlx5_core/drivers/pci:mlx5_core/0000:*/numa_node
1
1
[root@LOCALNODE ~]# cat /sys/module/nvidia/drivers/pci:nvidia/0000:*/numa_node
0
[root@LOCALNODE ~]# lspci -tv | grep -i nvidia
 |           +-02.0-[19]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
[root@LOCALNODE ~]# lspci -tv | grep -i mellanox
 +-[0000:d7]-+-02.0-[d8]--+-00.0  Mellanox Technologies MT27800 Family [ConnectX-5]
 |           |            \-00.1  Mellanox Technologies MT27800 Family [ConnectX-5]
```
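Since the GPU and the HCA sit on different NUMA nodes, the topology is worth inspecting directly. A sketch (assumes numactl is installed):

```sh
# 'nvidia-smi topo -m' shows the PCIe/NUMA relationship between the GPU
# and the mlx5 ports; GDR across the inter-socket link performs poorly.
nvidia-smi topo -m
numactl --hardware   # NUMA node / CPU / memory layout
```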
## Issue Details:
Issue 1:
```
[root@LOCALNODE nccl-tests]# mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_0 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
  Local host:     LOCALNODE
  Local device:   mlx5_0
  Local port:     1
  CPCs attempted: rdmacm, udcm
OSU MPI-CUDA Latency Test v5.4.1
Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
Size          Latency (us)
0                     1.20
[LOCALNODE:5297 :0:5297] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd69ea00000)
==== backtrace ====
 0 0x0000000000045e92 ucs_debug_cleanup()  ???:0
 1 0x000000000000f6d0 _L_unlock_13()  funlockfile.c:0
 2 0x0000000000156e50 __memcpy_ssse3_back()  :0
 3 0x00000000000318e1 uct_rc_mlx5_ep_am_short()  ???:0
 4 0x0000000000027a5a ucp_tag_send_nbr()  ???:0
 5 0x0000000000004c71 mca_pml_ucx_send()  ???:0
 6 0x0000000000080202 MPI_Send()  ???:0
 7 0x0000000000401d42 main()  /home/NVIDIA/osu-micro-benchmarks-5.4.2/mpi/pt2pt/osu_latency.c:116
 8 0x0000000000022445 __libc_start_main()  ???:0
 9 0x000000000040205b _start()  ???:0
===================
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node LOCALNODE exited on signal 11 (Segmentation fault).
[LOCALNODE:05291] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[LOCALNODE:05291] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[root@LOCALNODE nccl-tests]#
```
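Two things stand out in this output: the openib BTL was disabled ("no cpcs for port"), and the backtrace goes through mca_pml_ucx_send and uct_rc_mlx5_ep_am_short, so the UCX PML, not the openib BTL, is moving the data. It ends in a plain CPU memcpy (__memcpy_ssse3_back) on a device pointer, which is exactly what produces "invalid permissions for mapped object". A hedged diagnostic sketch (the ucx_info checks assume UCX is in the default path; forcing the ob1 PML is an assumption to test, not a confirmed fix):

```sh
# Was UCX built with CUDA support? Without it, ucp_tag_send treats the
# GPU buffer as host memory and the memcpy segfaults.
ucx_info -v                  # version and build configuration; look for cuda
ucx_info -d | grep -i cuda   # CUDA memory domains/transports, if any
# Work-around to test: force the ob1 PML so the CUDA-aware openib BTL
# (and btl_openib_want_cuda_gdr) is actually used.
mpirun --allow-run-as-root -np 2 -host LOCALNODE,REMOTENODE \
    -mca pml ob1 -mca btl_openib_want_cuda_gdr 1 \
    -x CUDA_VISIBLE_DEVICES=0 \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D
```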
Issue 2:
```
[root@LOCALNODE ~]# cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_*
0
0
0
0
0
0
0
0
0
0
0
[root@LOCALNODE ~]# cat /sys/class/infiniband/mlx5_1/ports/1/counters/port_*
0
18919889
0
1011812
0
0
0
9549739941
0
35318041
0
[root@LOCALNODE ~]#
```
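In the dump above, every mlx5_0 port counter is zero while mlx5_1 shows both receive and transmit traffic, which is the Issue 2 symptom. Reading the counters with their file names makes the split easier to see (a sketch using grep's file-name prefixes):

```sh
# 'grep .' prints every non-empty line prefixed with its file name,
# so each counter value comes out labeled.
grep . /sys/class/infiniband/mlx5_*/ports/1/counters/port_*
```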
Thanks & Regards
Ratan B