NV 24.1 Default MPI seg faulting on derived type host_data MPI calls - sometimes

Hi,

I have been testing the new nvhpc 24.1 on our multi-GPU MPI+OpenACC test code, available here:

I am using the main code (the version that uses manual data management).

I am compiling with the compile line from the README (with the cc value changed to match the current GPU):

mpif90 psi_multigpu_test_code.f -acc=gpu -gpu=cc80,nomanaged,nounified -Minfo=accel -o psi_multigpu_test_code

I am loading the nvhpc 24.1 using the following (from the documentation):

version=24.1

NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/$version/compilers/bin:$PATH; export PATH

export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/mpi/man
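Once the exports are in place, a quick sanity check that the intended wrappers are the ones actually being resolved can help (a sketch; `which_first` is just an illustrative helper around POSIX `command -v`, not part of the SDK):

```shell
# Print which executable a command name resolves to on the current PATH.
# Always exits 0 so it can be used in non-interactive scripts.
which_first() { command -v "$1" || echo "$1: not found on PATH"; }

# After the exports above, these should resolve into the
# .../hpc_sdk/Linux_x86_64/<version>/comm_libs/mpi/bin directory:
which_first mpif90
which_first mpirun
```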

When I do this, the test code works fine on the Delta system at NCSA on an A100, and on a local workstation with an RTX 2080 Ti. It even works out-of-the-box on multiple nodes of Delta (nice!).

However, when I try it on three additional systems (with an RTX 3090 Ti, RTX 3060 Ti, and RTX 4070 respectively), the code seg faults when it hits the MPI_Waitall call that follows the non-blocking MPI_Isend and MPI_Irecv calls involving a derived type array passed directly through host_data:

 Starting seam cycles...
[rolly-linux:129326:0:129326] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f1b595c8580)

15 0x000000000004e299 ompi_waitall_f()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
16 0x00000000004043ca seam_vvec_()  
<PATH>/psi_multigpu_test_code/psi_multigpu_test_code.f:700

The CUDA driver version, CUDA runtime version, etc. are identical across all workstations.

The only difference I can see is the Linux kernel version.

On the two systems that worked, the Linux kernel was 4.18 and 5.15.

On the three systems that did not work, the kernel was 6.5.

I have also tried invoking the hpcx init script and the hpcx_load command, but it still doesn’t work.

However, if I revert to the OpenMPI 3.1.5 that is (thankfully) still included in nvhpc by using:

export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/man

the code works!

Maybe there is some kind of incompatibility between the HPCX MPI and Linux kernel 6?

Thanks!

– Ron

Could you try to set up HPCX in this way?

$ source /opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/12.3/hpcx/hpcx-2.17.1/hpcx-mt-init-ompi.sh
$ hpcx_load

Hi,

Nope - still seg faults.

– Ron

I doubt this is a kernel issue; when I test your code on an Ubuntu 22.04 system, it works fine. Though I’m using A100s in this case, so it could be a combination of RTX cards and newer kernels.

I asked Chris, who does our HPCX builds, and he asked that you check if the multithreaded UCX is getting pulled in. To check, he said to run your binary in a debugger. Once it traps the segv, cat “/proc/<pid>/maps” (where <pid> is the process ID of the failing rank), and see which UCX is getting pulled in. I’m not sure what to look for, so if you can post the output, I’ll ask Chris to review.
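A minimal sketch of that check, assuming a Linux /proc filesystem (the helper name and the grep pattern are illustrative, not from Chris):

```shell
# List the unique backing files (shared objects) a process has mapped,
# filtered by a name pattern. Column 6 of /proc/<pid>/maps is the file path
# (empty for anonymous mappings).
list_maps() {
    pid="$1"; pattern="$2"
    awk '{print $6}' "/proc/$pid/maps" | grep -i -- "$pattern" | sort -u
}

# For the failing rank you would run:  list_maps <pid> ucx
# Demonstration on the current shell itself, which always has libc mapped:
list_maps $$ libc
```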

Also, I didn’t think the RTX 40xx cards supported peer-to-peer communication, which could cause problems with GPUDirect / CUDA-aware MPI with the newer HPCX. While we don’t ship this with the NVHPC SDK, the CUDA SDK includes a sample test called “simpleP2P” which you might try to see if P2P is enabled (on my system it’s located in “/opt/cuda-12.1/samples/Samples/0_Introduction/simpleP2P/”). I’d test it myself, but all our RTX systems only have a single device.

If neither of those help diagnose the issue, I’ll need to submit a report to the HPCX team so they can investigate.

-Mat

Hi,

I am not familiar with running a debugger.
What commands should I use to run it that way?

I installed the CUDA toolkit but those samples don’t seem to be there.

I am only doing these runs on a single RTX GPU. However, I DO have an MPI call that sends and receives to the same rank (rank 0), so maybe there is an issue there?

I am running on Linux Mint as follows:

Linux rlap4 6.5.0-15-generic #15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  
NAME="Linux Mint"
VERSION="21.3 (Virginia)"
ID=linuxmint
ID_LIKE="ubuntu debian"
PRETTY_NAME="Linux Mint 21.3"
VERSION_ID="21.3"
HOME_URL="https://www.linuxmint.com/"
SUPPORT_URL="https://forums.linuxmint.com/"
BUG_REPORT_URL="http://linuxmint-troubleshooting-guide.readthedocs.io/en/latest/"
PRIVACY_POLICY_URL="https://www.linuxmint.com/"
VERSION_CODENAME=virginia
UBUNTU_CODENAME=jammy

– Ron

I found the samples github and ran P2P:

RLAP4-NV2401: ~/Desktop/cuda_samples/cuda-samples/Samples/0_Introduction/simpleP2P $ ./simpleP2P 
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 1
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Waiving test.

– Ron

OK, I assumed you were doing multi-GPU runs, so that output is expected for a single device; in that case P2P shouldn’t matter. I did run on a single RTX 4080 here (1, 2, and 4 ranks) but didn’t see any issue.

To debug, run “mpirun -np 1 gdb ./psi_multigpu_test_code”, then type “run” at the (gdb) prompt. We’re just using the debugger to pause execution so you can then run “top” to get the PID for the process and cat the correct maps file.

However, I DO have an MPI call that sends and receives to the same rank (rank 0) so maybe there is an issue there?

Are you using multiple ranks, i.e. mpirun -np 2, with the launch_psi_multigpu_test_code_stdpar.sh script?

The script sets the CUDA_VISIBLE_DEVICES environment variable to match the rank number, so the second rank’s device mapping would be wrong. I don’t know the system, but if it has a second device, like one used for a display, the rank could be mapping there? If there isn’t a second device, I’d expect a different error, but it’s possible.
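For reference, the per-rank device selection the script presumably performs can be sketched as below (an assumption on my part: that it keys off Open MPI’s OMPI_COMM_WORLD_LOCAL_RANK variable; the actual script contents aren’t shown in this thread, and the fallback to device 0 is illustrative):

```shell
# Bind each MPI rank to the GPU matching its local rank. When the variable is
# unset (e.g. the code is launched without mpirun), fall back to device 0.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```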

Hi,

I am not using that launch script as it is only needed for the “stdpar” version.
I am running the original version (psi_multigpu_test_code.f) with 1 GPU.
I get the seg fault on both a system where a single GPU is used for graphics and compute, as well as on a system where CUDA_VISIBLE_DEVICES is set to use 1 GPU for compute and there is another GPU for graphics.

I ran the debugger as you showed and get:

(gdb) run
Starting program: /home/sumseq/Dropbox/PSI/TOOLS_DEV/psi_multigpu_test_code/psi_multigpu_test_code 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff0bff640 (LWP 6936)]
[New Thread 0x7fffeabff640 (LWP 6937)]
[New Thread 0x7fffe17ff640 (LWP 6938)]
[New Thread 0x7fffdbfff640 (LWP 6939)]
[New Thread 0x7fffe0ffe640 (LWP 6940)]
[New Thread 0x7fffca1ff640 (LWP 6945)]
[New Thread 0x7fffc99fe640 (LWP 6946)]
 Grid size per dimension per rank:           250
 Grid size per rank:      15625000
  
 Number of ranks in DIM1:             1
 Number of ranks in DIM2:             1
 Number of ranks in DIM3:             1
 Total number of ranks:             1
  
World rank   0 has cart rank   0 and shared rank   0 (i.e. GPU device number   1 out of   1)
 Starting seam cycles...

Thread 1 "psi_multigpu_te" received signal SIGSEGV, Segmentation fault.
317	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:317

However, I do not know what “you can then run ‘top’ to get the PID for the process and cat the correct map file” means.
What is the map file? I found the PID in nvidia-smi; what do I do with it?

– Ron

Hi,

Sorry - I just read the posts above and found the map stuff.

I found the map file:

maps.txt (94.1 KB)

– Ron

Thanks Ron. The maps file shows that the multi-threaded UCX libraries are getting pulled in, so that’s not the issue.

Can you rerun the code in the debugger and then type “where” at the prompt? Then post the output from the backtrace so I can see where the segv is coming from.

Hi,

It’s coming from the MPI_Waitall after the non-blocking MPI_Isend and MPI_Irecv:

(gdb) where
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:317
#1 0x00007ffff7af9a76 in ucs_memcpy_relaxed (len=498000, src=0x7fff815c8580, dst=<optimized out>)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucs/arch/x86_64/cpu.h:112
#2 ucp_memcpy_pack_unpack (name=<synthetic pointer>, length=498000, data=0x7fff815c8580, buffer=<optimized out>)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/dt.h:74
#3 ucp_dt_contig_unpack (mem_type=<optimized out>, length=498000, src=0x7fff815c8580, dest=<optimized out>,
worker=0x5aaa70)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/dt_contig.h:55
#4 ucp_datatype_iter_unpack (src=0x7fff815c8580, offset=<optimized out>, length=498000, worker=0x5aaa70,
dt_iter=0xed2018)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/datatype_iter.inl:445
#5 ucp_proto_rndv_progress_rkey_ptr (arg=0x5aaa70) at rndv/rndv_rkey_ptr.c:138
#6 0x00007ffff76caec1 in ucs_callbackq_spill_elems_dispatch (cbq=0x5ca9c0) at datastruct/callbackq.c:383
#7 ucs_callbackq_proxy_callback (arg=0x5ca9c0) at datastruct/callbackq.c:479
#8 0x00007ffff7ad2eba in ucs_callbackq_dispatch (cbq=<optimized out>)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucs/datastruct/callbackq.h:215
#9 uct_worker_progress (worker=<optimized out>)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/uct/api/uct.h:2787
#10 ucp_worker_progress (worker=0x5aaa70) at core/ucp_worker.c:2991
#11 0x00007ffff3036b9c in opal_progress () at ../../opal/runtime/opal_progress.c:231
#12 0x00007ffff303d5e5 in ompi_sync_wait_mt (sync=sync@entry=0x7fffffff7f10) at ../../opal/threads/wait_sync.c:85
#13 0x00007ffff704ea98 in ompi_request_default_wait_all (count=12, requests=0x688e80, statuses=0x688ee0)
at ../../ompi/request/req_wait.c:234
#14 0x00007ffff707620c in PMPI_Waitall (count=12, requests=requests@entry=0x688e80, statuses=statuses@entry=0x688ee0)
at pwaitall.c:80
#15 0x00007ffff744e299 in ompi_waitall_f (count=0x411628 <.C360_seam_vvec_>, array_of_requests=0x43dbc0 <.BSS9>,
array_of_statuses=0x43dc40 <mpi_fortran_statuses_ignore_>, ierr=0x7fffffff81fc) at pwaitall_f.c:104
#16 0x00000000004043ca in seam_vvec () at psi_multigpu_test_code.f:700
#17 0x000000000040f5b5 in psi_multigpu_test_code () at psi_multigpu_test_code.f:1062

– Ron

I’ll pass this to Chris, but we’ll likely need to pass this on to the HPCX folks. Hopefully the traceback gives them some clues on how to reproduce the error.