Windows 11 + WSL + CUDA-aware MPI + GeForce 40 series = seg fault, but with GeForce 30 series = OK

Hi,

I am trying to install and run POT3D (GitHub - predsci/POT3D: POT3D: High Performance Potential Field Solver) on a GPU on a Windows 11 laptop using WSL.

I am using WSL2, with Ubuntu 22.04 (wsl --install Ubuntu-22.04).

I then installed the NV HPC SDK 24.3.

I am activating the compiler with:

#!/bin/bash
# Set up the NVIDIA HPC SDK compilers and the bundled OpenMPI 3
version=24.3
NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/$version/compilers/bin:$PATH; export PATH
export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/openmpi/openmpi-3.1.5/man
# Expose the CUDA driver libraries that WSL maps in from Windows
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:${LD_LIBRARY_PATH}

The last line gives access to the Windows CUDA driver.
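
As a quick sanity check that the compilers and the WSL-mapped CUDA driver are visible after sourcing this (a sketch, assuming the default install paths above):

which nvfortran mpif90            # should resolve under /opt/nvidia/hpc_sdk/...
nvidia-smi                        # provided by the Windows driver through WSL
ls /usr/lib/wsl/lib/libcuda.so*   # the driver library the last export points at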

Using this configuration, I am able to install and successfully run the GPU code HipFT (GitHub - predsci/HipFT: High-performance Flux Transport), which does NOT use CUDA-aware MPI.

However, for POT3D, the code compiles fine, but when I try to run it, it seg faults with an “address not mapped” error.

POT3D uses CUDA-aware MPI (through the use of OpenACC’s host_data directives) even when run on 1 GPU (due to a periodic boundary condition).
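
For reference, a quick way to check that the MPI build itself is CUDA-aware (a sketch, assuming the bundled Open MPI's ompi_info is the one picked up from the PATH above):

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# a CUDA-aware build reports ...:value:true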

A colleague of mine also has a Windows 11 laptop, and went through the exact same setup as I did.
He also got the seg fault when using the OpenMPI 4-based HPCX, but when using OpenMPI 3 as above, his code runs fine!

The only difference between our setups that we can find is that he has a 30-series GPU (3070), while I have a 40-series GPU (4070).

I remember reading somewhere that the 40-series GPUs have their GPU-to-GPU direct memory transfer disabled.
Could that be a possible cause? If so, is there a compiler update (or something I can do now) that may help get around this when using 1 GPU?

– Ron

This is a bit out of my area, so I’m not 100% sure, but according to the GPUDirect docs:

GPUDirect RDMA is available on both Tesla and Quadro GPUs.

Hence I question whether it’s really using GPUDirect on the 3070, given it’s an RTX. I’d run the program under Nsight Systems with MPI tracing enabled to see if the data is being brought back to the host rather than transferred directly between the devices.
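
Something along these lines should do it (a sketch; the executable path, output name, and trace list are placeholders to adapt to your run):

mpirun -np 1 nsys profile --trace=mpi,cuda,openacc -o pot3d_rank0 ./pot3d
# open the resulting report in the Nsight Systems GUI and check whether the MPI
# transfers appear as device-to-device copies or as host staging (DtoH/HtoD memcpys)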

Again, I’m not positive, but I wouldn’t think this would cause the HPCX segv; I’d expect it to fall back to the host. Though you’re using WSL, so maybe?

I’ve had issues with HPCX and CUDA-aware MPI before (which I’ve reported to the HPCX team), but my typical workaround is to change the transport via the following environment variables.

UCX_TLS=self,shm,cuda
UCX_MEMTYPE_CACHE=n

Not sure this will work for you, but the different transports are documented at: Frequently Asked Questions — OpenUCX documentation
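
For example, you can export them before the run or pass them through mpirun with -x (a sketch; the executable path is just a placeholder):

export UCX_TLS=self,shm,cuda
export UCX_MEMTYPE_CACHE=n
mpirun -np 1 ./pot3d

# or, equivalently
mpirun -np 1 -x UCX_TLS=self,shm,cuda -x UCX_MEMTYPE_CACHE=n ./pot3d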

Also, after looking at the Known Issues for HPCX:

Another thing to try is setting: UCX_IB_GPU_DIRECT_RDMA=n

This disables GPUDirect, so you wouldn’t see much benefit from CUDA-aware MPI, but if the 4070 doesn’t support GPUDirect anyway and this gets you past the error, then it should be OK.

Hi,

So it turns out I mistyped and also missed something during the first tests.

The situation as of now is that on both systems (3070 and 4070), when using OpenMPI 3, the code seg faults.

On both systems, using HPCX (even without your additional env settings), the code works!

However, the code does not work with more than 1 MPI rank (i.e., oversubscribing the GPU), but that is not an issue for me at the moment (that mode does work on my desktop system with OpenMPI 3, however).

In case anyone needs it, the following is how I activate HPCX (after commenting out the OpenMPI 3 lines above):

export PATH=$NVCOMPILERS/$NVARCH/$version/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/$version/comm_libs/mpi/man
. /opt/nvidia/hpc_sdk/Linux_x86_64/${version}/comm_libs/12.3/hpcx/hpcx-2.17.1/hpcx-mt-init-ompi.sh
hpcx_load
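
To confirm the HPCX Open MPI is the one actually being picked up afterwards, a quick check (assuming the paths above):

which mpirun mpif90    # should now resolve into the hpcx-2.17.1 ompi/bin directory
mpirun --version       # should report the Open MPI build that ships with HPCX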

– Ron

Well, I am not sure what happened, but now I get a seg fault even with HPCX, and even with all the flags, etc.

Below is the stack trace.

It is happening in the MPI_Waitall after the MPI_Irecv/MPI_Isend combination that is communicating with itself (only 1 rank).

Running POT3D with 1 MPI rank...
[RLAP4:21235:0:21235] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x731150138)
==== backtrace (tid:  21235) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000001a07cd __nss_database_lookup()  ???:0
 2 0x0000000000076a76 ucs_memcpy_relaxed()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucs/arch/x86_64/cpu.h:112
 3 0x0000000000076a76 ucp_memcpy_pack_unpack()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/dt.h:74
 4 0x0000000000076a76 ucp_dt_contig_unpack()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/dt_contig.h:55
 5 0x0000000000076a76 ucp_datatype_iter_unpack()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/dt/datatype_iter.inl:445
 6 0x0000000000076a76 ucp_proto_rndv_progress_rkey_ptr()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/rndv/rndv_rkey_ptr.c:138
 7 0x0000000000052ec1 ucs_callbackq_spill_elems_dispatch()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucs/datastruct/callbackq.c:383
 8 0x000000000004feba ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucs/datastruct/callbackq.h:215
 9 0x000000000004feba uct_worker_progress()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/uct/api/uct.h:2787
10 0x000000000004feba ucp_worker_progress()  /build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-02432d35d8228f44e9a3b809964cccdebc45703a/src/ucp/core/ucp_worker.c:2991
11 0x0000000000036b9c opal_progress()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/opal/../../opal/runtime/opal_progress.c:231
12 0x000000000003d5e5 ompi_sync_wait_mt()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/opal/../../opal/threads/wait_sync.c:85
13 0x000000000004ea98 ompi_request_default_wait_all()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/../../ompi/request/req_wait.c:234
14 0x000000000007620c PMPI_Waitall()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/c/profile/pwaitall.c:80
15 0x000000000004e299 ompi_waitall_f()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
16 0x00000000004359df seam_()  /home/sumseq/gpu/swig/pot3d/src/pot3d_cpp.f:5665
17 0x0000000000423997 potfld_()  /home/sumseq/gpu/swig/pot3d/src/pot3d_cpp.f:3848
18 0x000000000040b211 MAIN_()  /home/sumseq/gpu/swig/pot3d/src/pot3d_cpp.f:815
19 0x0000000000404ab1 main()  ???:0
20 0x0000000000029d90 __libc_init_first()  ???:0
21 0x0000000000029e40 __libc_start_main()  ???:0
22 0x00000000004049a5 _start()  ???:0
=================================
[RLAP4:21235] *** Process received signal ***
[RLAP4:21235] Signal: Segmentation fault (11)
[RLAP4:21235] Signal code:  (-6)
[RLAP4:21235] Failing at address: 0x3e8000052f3
[RLAP4:21235] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff8fc419520]
[RLAP4:21235] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a07cd)[0x7ff8fc5777cd]
[RLAP4:21235] [ 2] /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucx/mt/lib/libucp.so.0(+0x76a76)[0x7ff90059fa76]
[RLAP4:21235] [ 3] /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucx/mt/lib/libucs.so.0(+0x52ec1)[0x7ff8ffccaec1]
[RLAP4:21235] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucx/mt/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x7ff900578eba]
[RLAP4:21235] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ompi/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7ff8fb636b9c]
[RLAP4:21235] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7ff8fb63d5e5]
[RLAP4:21235] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x388)[0x7ff8ffe4ea98]
[RLAP4:21235] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ompi/lib/libmpi.so.40(PMPI_Waitall+0x1c)[0x7ff8ffe7620c]
[RLAP4:21235] [ 9] /opt/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ompi/lib/libmpi_mpifh.so.40(pmpi_waitall+0x79)[0x7ff90024e299]
[RLAP4:21235] [10] /home/sumseq/gpu/swig/pot3d/testsuite/../bin/pot3d[0x4359df]
[RLAP4:21235] [11] /home/sumseq/gpu/swig/pot3d/testsuite/../bin/pot3d[0x423997]
[RLAP4:21235] [12] /home/sumseq/gpu/swig/pot3d/testsuite/../bin/pot3d[0x40b211]
[RLAP4:21235] [13] /home/sumseq/gpu/swig/pot3d/testsuite/../bin/pot3d[0x404ab1]
[RLAP4:21235] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7ff8fc400d90]
[RLAP4:21235] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7ff8fc400e40]
[RLAP4:21235] [16] /home/sumseq/gpu/swig/pot3d/testsuite/../bin/pot3d[0x4049a5]
[RLAP4:21235] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

It seems this may be an OpenMPI issue?

I still do not understand why it works on my colleague’s WSL but not mine.

– Ron