NVHPC Code with Multiple GPUs inside Singularity Container gives UCX Error

Hi Developers,

I'm trying to run my CUDA+MPI code, developed within a Singularity container, using the --nv flag. The issue I am facing is that the code runs successfully with NPROCS=1, but when I try to run it with NPROCS=2 it gives the following error:

[node3:2233858:0:2233858] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f7e9fef81b0)
[node3:2233857:0:2233857] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd463ef81b0)
==== backtrace (tid:2233858) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018b8f5 __nss_database_lookup()  ???:0
 2 0x000000000004bf19 ucp_dt_pack()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/dt/dt.c:118
 3 0x000000000007e48c ucp_tag_pack_eager_common()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/tag/eager_snd.c:31
 4 0x000000000001a793 uct_mm_ep_am_common_send()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/sm/mm/base/mm_ep.c:326
 5 0x000000000001a793 uct_mm_ep_am_bcopy()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/sm/mm/base/mm_ep.c:416
 6 0x00000000000800ef uct_ep_am_bcopy()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/api/uct.h:3020
 7 0x00000000000800ef ucp_tag_eager_bcopy_single()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/tag/eager_snd.c:132
 8 0x0000000000087f68 ucp_request_try_send()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/core/ucp_request.inl:334
 9 0x0000000000087f68 ucp_request_send()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/core/ucp_request.inl:357
10 0x0000000000087f68 ucp_tag_send_req()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/tag/tag_send.c:116
11 0x0000000000087f68 ucp_tag_send_nbx()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/tag/tag_send.c:298
12 0x00000000000047b6 mca_pml_ucx_send_nbr()  /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:904
13 0x00000000000047b6 mca_pml_ucx_send()  /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:944
14 0x0000000000072ac5 PMPI_Sendrecv()  /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/c/profile/psendrecv.c:91
15 0x0000000000405465 main()  /source/KKS_FD_CUDA_MPI/./microsim_kks_fd_cuda_mpi.c:443
16 0x0000000000024083 __libc_start_main()  ???:0
17 0x000000000040366e _start()  ???:0
=================================
[node3:2233858] *** Process received signal ***
[node3:2233858] Signal: Segmentation fault (11)
[node3:2233858] Signal code:  (-6)
[node3:2233858] Failing at address: 0x40200221602
[node3:2233858] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f8045f42420]
[node3:2233858] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18b8f5)[0x7f804584a8f5]
[node3:2233858] [ 2] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libucp.so.0(ucp_dt_pack+0x99)[0x7f80400fff19]
[node3:2233858] [ 3] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libucp.so.0(+0x7e48c)[0x7f804013248c]
[node3:2233858] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x133)[0x7f8040095793]
[node3:2233858] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libucp.so.0(+0x800ef)[0x7f80401340ef]
[node3:2233858] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libucp.so.0(ucp_tag_send_nbx+0x7d8)[0x7f804013bf68]
[node3:2233858] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/hpcx-2.13/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf6)[0x7f80179c87b6]
[node3:2233858] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/hpcx-2.13/ompi/lib/libmpi.so.40(MPI_Sendrecv+0x95)[0x7f80470b4ac5]
[node3:2233858] [ 9] ./microsim_kks_fd_cuda_mpi[0x405465]
[node3:2233858] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f80456e3083]
[node3:2233858] [11] ./microsim_kks_fd_cuda_mpi[0x40366e]
[node3:2233858] *** End of error message ***
==== backtrace (tid:2233857) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018b8f5 __nss_database_lookup()  ???:0
 2 0x000000000004bf19 ucp_dt_pack()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/dt/dt.c:118
 3 0x000000000007e48c ucp_tag_pack_eager_common()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/tag/eager_snd.c:31
 4 0x000000000001a793 uct_mm_ep_am_common_send()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/sm/mm/base/mm_ep.c:326
 5 0x000000000001a793 uct_mm_ep_am_bcopy()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/sm/mm/base/mm_ep.c:416
 6 0x00000000000800ef uct_ep_am_bcopy()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/api/uct.h:3020
 7 0x00000000000800ef ucp_tag_eager_bcopy_single()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/tag/eager_snd.c:132
 8 0x0000000000087f68 ucp_request_try_send()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/core/ucp_request.inl:334
 9 0x0000000000087f68 ucp_request_send()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/core/ucp_request.inl:357
10 0x0000000000087f68 ucp_tag_send_req()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/tag/tag_send.c:116
11 0x0000000000087f68 ucp_tag_send_nbx()  /build-result/src/hpcx-v2.13-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/tag/tag_send.c:298
12 0x00000000000047b6 mca_pml_ucx_send_nbr()  /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:904
13 0x00000000000047b6 mca_pml_ucx_send()  /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:944
14 0x0000000000072ac5 PMPI_Sendrecv()  /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/c/profile/psendrecv.c:91
15 0x0000000000405465 main()  /source/KKS_FD_CUDA_MPI/./microsim_kks_fd_cuda_mpi.c:443
16 0x0000000000024083 __libc_start_main()  ???:0
17 0x000000000040366e _start()  ???:0
=================================
[node3:2233857] *** Process received signal ***
[node3:2233857] Signal: Segmentation fault (11)
[node3:2233857] Signal code:  (-6)
[node3:2233857] Failing at address: 0x40200221601
[node3:2233857] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fd609836420]
[node3:2233857] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18b8f5)[0x7fd60913e8f5]
[node3:2233857] [ 2] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libucp.so.0(ucp_dt_pack+0x99)[0x7fd5f0097f19]
[node3:2233857] [ 3] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libucp.so.0(+0x7e48c)[0x7fd5f00ca48c]
[node3:2233857] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x133)[0x7fd604059793]
[node3:2233857] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libucp.so.0(+0x800ef)[0x7fd5f00cc0ef]
[node3:2233857] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/latest/ucx/mt/lib/libucp.so.0(ucp_tag_send_nbx+0x7d8)[0x7fd5f00d3f68]
[node3:2233857] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/hpcx-2.13/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf6)[0x7fd5d75ba7b6]
[node3:2233857] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/comm_libs/hpcx/hpcx-2.13/ompi/lib/libmpi.so.40(MPI_Sendrecv+0x95)[0x7fd60a9a8ac5]
[node3:2233857] [ 9] ./microsim_kks_fd_cuda_mpi[0x405465]
[node3:2233857] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fd608fd7083]
[node3:2233857] [11] ./microsim_kks_fd_cuda_mpi[0x40366e]
[node3:2233857] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node3 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
make: *** [Makefile:171: run] Error 139
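For reference, here is a minimal stand-alone sketch of the kind of communication pattern involved: an MPI_Sendrecv exchange on cudaMalloc'd device buffers via CUDA-aware MPI. This is a simplified illustration, not my actual source; the buffer names, sizes, and the launch command in the comment are hypothetical.

/* Minimal sketch (hypothetical, not the actual MicroSim source) of a
 * halo exchange via MPI_Sendrecv on device buffers, as in the
 * PMPI_Sendrecv frame of the backtrace above.
 *
 * Launched roughly as (image name hypothetical):
 *   singularity exec --nv microsim.sif mpirun -np $NPROCS ./repro
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One GPU per rank */
    cudaSetDevice(rank);

    const int n = 1 << 20;
    double *d_send, *d_recv;
    cudaMalloc(&d_send, (size_t)n * sizeof(double));
    cudaMalloc(&d_recv, (size_t)n * sizeof(double));
    cudaMemset(d_send, 0, (size_t)n * sizeof(double));

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    /* With NPROCS=1 this degenerates to a self-exchange and works;
     * with NPROCS=2 inside the container the equivalent call in my
     * code segfaults in ucp_dt_pack(), as shown in the backtrace. */
    MPI_Sendrecv(d_send, n, MPI_DOUBLE, next, 0,
                 d_recv, n, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: Sendrecv completed\n", rank);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}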

I have 4 A100 GPUs which I intend to use, rather than just 1, so it would be very helpful if you could give me some pointers regarding this issue.

Many thanks,
Pushkar