NV 24.1 Default MPI seg faulting on derived type host_data MPI calls - sometimes

Hi,

I have been testing the new nvhpc 24.1 on our multi-GPU MPI+OpenACC test code, available here:

I am using the main code (the version that uses manual data management).

I am compiling with the compile line from the README (with the cc value changed to match the current GPU):

mpif90 psi_multigpu_test_code.f -acc=gpu -gpu=cc80,nomanaged,nounified -Minfo=accel -o psi_multigpu_test_code

I am loading the nvhpc 24.1 using the following (from the documentation):

version=24.1

NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/$version/compilers/bin:$PATH; export PATH

export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/mpi/man
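Once the exports are in place, a quick sanity check that the intended wrappers are the ones actually being resolved can help (a sketch; `which_first` is just an illustrative helper around POSIX `command -v`, not part of the SDK):

```shell
# Print which executable a command name resolves to on the current PATH.
# Always exits 0 so it can be used in non-interactive scripts.
which_first() { command -v "$1" || echo "$1: not found on PATH"; }

# After the exports above, these should resolve into the
# .../hpc_sdk/Linux_x86_64/<version>/comm_libs/mpi/bin directory:
which_first mpif90
which_first mpirun
```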

When I do this, the test code works fine on the Delta system at NCSA on an A100, and on a local workstation with an RTX 2080 Ti. It even works out-of-the-box on multiple nodes of Delta (nice!).

However, when I try it on three additional systems (with an RTX 3090 Ti, RTX 3060 Ti, and RTX 4070 respectively), the code seg faults when it hits the MPI_Waitall call that follows the non-blocking MPI_Isend and MPI_Irecv calls involving a derived type array passed directly through host_data:

 Starting seam cycles...
[rolly-linux:129326:0:129326] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f1b595c8580)

15 0x000000000004e299 ompi_waitall_f()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
16 0x00000000004043ca seam_vvec_()  
<PATH>/psi_multigpu_test_code/psi_multigpu_test_code.f:700

The CUDA driver version, CUDA runtime version, etc. are identical across all workstations.

The only difference I can see is the Linux kernel version.

On the two systems that worked, the Linux kernel was 4.18 and 5.15.

On the three systems that did not work, the kernel was 6.5.

I have also tried invoking the hpcx init script and the hpcx_load command, but it still doesn’t work.

However, if I revert to the OpenMPI 3.1.5 that is (thankfully) still included in nvhpc by using:

export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/man

the code works!

Maybe there is some kind of incompatibility between the HPCX MPI and Linux kernel 6?

Thanks!

– Ron

Could you try to set up HPCX in this way?

$ source /opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/12.3/hpcx/hpcx-2.17.1/hpcx-mt-init-ompi.sh
$ hpcx_load

Hi,

Nope - still seg faults.

– Ron

I doubt this is a kernel issue; when I test your code on an Ubuntu 22.04 system, it works fine. Though I’m using A100s in this case, so it could be a combination of RTX cards and newer kernels.

I asked Chris, who does our HPCX builds, and he asked that you check if the multithreaded UCX is getting pulled in. To check, he said to run your binary in a debugger. Once it traps the segv, cat “/proc/<pid>/maps” (where <pid> is the process ID of the failing rank), and see which UCX is getting pulled in. I’m not sure what to look for, so if you can post the output, I’ll ask Chris to review.
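A minimal sketch of that check, assuming a Linux /proc filesystem (the helper name and the grep pattern are illustrative, not from Chris):

```shell
# List the unique backing files (shared objects) a process has mapped,
# filtered by a name pattern. Column 6 of /proc/<pid>/maps is the file path
# (empty for anonymous mappings).
list_maps() {
    pid="$1"; pattern="$2"
    awk '{print $6}' "/proc/$pid/maps" | grep -i -- "$pattern" | sort -u
}

# For the failing rank you would run:  list_maps <pid> ucx
# Demonstration on the current shell itself, which always has libc mapped:
list_maps $$ libc
```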

Also, I didn’t think the RTX 40xx cards supported peer-to-peer communication, which could cause problems with GPUDirect / CUDA-aware MPI with the newer HPCX. While we don’t ship this with the NVHPC SDK, the CUDA SDK includes a sample test called “simpleP2P” which you might try to see if P2P is enabled (on my system it’s located in “/opt/cuda-12.1/samples/Samples/0_Introduction/simpleP2P/”). I’d test it myself, but all our RTX systems only have a single device.

If neither of those help diagnose the issue, I’ll need to submit a report to the HPCX team so they can investigate.

-Mat

Hi,

I am not familiar with running a debugger.
What commands should I use to run it that way?

I installed the CUDA toolkit but those samples don’t seem to be there.

I am only doing these runs on a single RTX GPU. However, I DO have an MPI call that sends and receives to the same rank (rank 0), so maybe there is an issue there?

I am running on Linux Mint as follows:

Linux rlap4 6.5.0-15-generic #15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  
NAME="Linux Mint"
VERSION="21.3 (Virginia)"
ID=linuxmint
ID_LIKE="ubuntu debian"
PRETTY_NAME="Linux Mint 21.3"
VERSION_ID="21.3"
HOME_URL="https://www.linuxmint.com/"
SUPPORT_URL="https://forums.linuxmint.com/"
BUG_REPORT_URL="http://linuxmint-troubleshooting-guide.readthedocs.io/en/latest/"
PRIVACY_POLICY_URL="https://www.linuxmint.com/"
VERSION_CODENAME=virginia
UBUNTU_CODENAME=jammy

– Ron

I found the samples github and ran P2P:

RLAP4-NV2401: ~/Desktop/cuda_samples/cuda-samples/Samples/0_Introduction/simpleP2P $ ./simpleP2P 
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 1
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Waiving test.

– Ron

OK, I assumed you were doing multi-GPU runs, so that output is expected for a single device; in that case P2P shouldn’t matter. I did run on a single RTX 4080 here (1, 2, and 4 ranks) but didn’t see any issue.

To debug, run “mpirun -np 1 gdb ./psi_multigpu_test_code”, then type “run” at the (gdb) prompt. We’re just using the debugger to pause execution so you can then run “top” to get the PID for the process and cat the correct maps file.

However, I DO have an MPI call that sends and receives to the same rank (rank 0) so maybe there is an issue there?

Are you using multiple ranks, i.e. mpirun -np 2, with the launch_psi_multigpu_test_code_stdpar.sh script?

The script sets the CUDA_VISIBLE_DEVICES environment variable to match the rank number, so the second rank’s device mapping would be wrong. I don’t know the system, but if it has a second device, like one used for a display, the rank could be mapping there? If there isn’t a second device, I’d expect a different error, but it’s possible.
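For reference, the per-rank device selection the script presumably performs can be sketched as below (an assumption on my part: that it keys off Open MPI’s OMPI_COMM_WORLD_LOCAL_RANK variable; the actual script contents aren’t shown in this thread, and the fallback to device 0 is illustrative):

```shell
# Bind each MPI rank to the GPU matching its local rank. When the variable is
# unset (e.g. the code is launched without mpirun), fall back to device 0.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```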

Hi,

I am not using that launch script as it is only needed for the “stdpar” version.
I am running the original version (psi_multigpu_test_code.f) with 1 GPU.
I get the seg fault on both a system where a single GPU is used for graphics and compute, as well as on a system where CUDA_VISIBLE_DEVICES is set to use 1 GPU for compute and there is another GPU for graphics.

I ran the debugger as you showed and get:

(gdb) run
Starting program: /home/sumseq/Dropbox/PSI/TOOLS_DEV/psi_multigpu_test_code/psi_multigpu_test_code 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff0bff640 (LWP 6936)]
[New Thread 0x7fffeabff640 (LWP 6937)]
[New Thread 0x7fffe17ff640 (LWP 6938)]
[New Thread 0x7fffdbfff640 (LWP 6939)]
[New Thread 0x7fffe0ffe640 (LWP 6940)]
[New Thread 0x7fffca1ff640 (LWP 6945)]
[New Thread 0x7fffc99fe640 (LWP 6946)]
 Grid size per dimension per rank:           250
 Grid size per rank:      15625000
  
 Number of ranks in DIM1:             1
 Number of ranks in DIM2:             1
 Number of ranks in DIM3:             1
 Total number of ranks:             1
  
World rank   0 has cart rank   0 and shared rank   0 (i.e. GPU device number   1 out of   1)
 Starting seam cycles...

Thread 1 "psi_multigpu_te" received signal SIGSEGV, Segmentation fault.
317	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:317

However, I do not know what “you can then run ‘top’ to get the PID for the process and cat the correct map file” means.
What is the map file? I found the PID in nvidia-smi; what do I do with it?

– Ron

Hi,

Sorry - I just read the posts above and found the map stuff.

I found the map file:

maps.txt (94.1 KB)

– Ron

Thanks Ron. The maps file shows that the multi-threaded UCX libraries are getting pulled in, so that’s not the issue.

Can you rerun the code in the debugger and then type “where” at the prompt? Then post the output from the backtrace so I can see where the segv is coming from.

Hi,

It’s coming from the MPI_Waitall after the non-blocking MPI_Isend and MPI_Irecv:

(gdb) where
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:317
#1 0x00007ffff7af9a76 in ucs_memcpy_relaxed (len=498000, src=0x7fff815c8580, dst=<optimized out>)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucs/arch/x86_64/cpu.h:112
#2 ucp_memcpy_pack_unpack (name=<synthetic pointer>, length=498000, data=0x7fff815c8580, buffer=<optimized out>)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/dt.h:74
#3 ucp_dt_contig_unpack (mem_type=<optimized out>, length=498000, src=0x7fff815c8580, dest=<optimized out>,
worker=0x5aaa70)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/dt_contig.h:55
#4 ucp_datatype_iter_unpack (src=0x7fff815c8580, offset=<optimized out>, length=498000, worker=0x5aaa70,
dt_iter=0xed2018)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/datatype_iter.inl:445
#5 ucp_proto_rndv_progress_rkey_ptr (arg=0x5aaa70) at rndv/rndv_rkey_ptr.c:138
#6 0x00007ffff76caec1 in ucs_callbackq_spill_elems_dispatch (cbq=0x5ca9c0) at datastruct/callbackq.c:383
#7 ucs_callbackq_proxy_callback (arg=0x5ca9c0) at datastruct/callbackq.c:479
#8 0x00007ffff7ad2eba in ucs_callbackq_dispatch (cbq=<optimized out>)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucs/datastruct/callbackq.h:215
#9 uct_worker_progress (worker=<optimized out>)
at /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/uct/api/uct.h:2787
#10 ucp_worker_progress (worker=0x5aaa70) at core/ucp_worker.c:2991
#11 0x00007ffff3036b9c in opal_progress () at ../../opal/runtime/opal_progress.c:231
#12 0x00007ffff303d5e5 in ompi_sync_wait_mt (sync=sync@entry=0x7fffffff7f10) at ../../opal/threads/wait_sync.c:85
#13 0x00007ffff704ea98 in ompi_request_default_wait_all (count=12, requests=0x688e80, statuses=0x688ee0)
at ../../ompi/request/req_wait.c:234
#14 0x00007ffff707620c in PMPI_Waitall (count=12, requests=requests@entry=0x688e80, statuses=statuses@entry=0x688ee0)
at pwaitall.c:80
#15 0x00007ffff744e299 in ompi_waitall_f (count=0x411628 <.C360_seam_vvec_>, array_of_requests=0x43dbc0 <.BSS9>,
array_of_statuses=0x43dc40 <mpi_fortran_statuses_ignore_>, ierr=0x7fffffff81fc) at pwaitall_f.c:104
#16 0x00000000004043ca in seam_vvec () at psi_multigpu_test_code.f:700
#17 0x000000000040f5b5 in psi_multigpu_test_code () at psi_multigpu_test_code.f:1062

– Ron

I’ll pass this to Chris, but we’ll likely need to pass this on to the HPCX folks. Hopefully the traceback gives them some clues on how to reproduce the error.