OpenACC with CUDA

Using the nvhpc-hpcx/23.3 binaries from NVIDIA, a test of MPI_Alltoall fails with a segfault. This is on a single node with 4 A100s and 128 cores, running Rocky 8.

I’m not an expert in using OpenACC for accelerators. I’m told this code has been built successfully under NVHPC on other systems. We are bringing up a new system and these tests are failing. Any ideas what is going on here?

Stack trace:
[node2304:2154584] Failing at address: 0x3ec0020e058
0 0x0000000000012c20 __funlockfile() :0
1 0x0000000000160895 __memmove_avx_unaligned_erms() :0
2 0x0000000000050251 non_overlap_copy_content_same_ddt() /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/opal/datatype/../../../opal/datatype/opal_datatype_copy.h:155
3 0x000000000005fed3 ompi_datatype_sndrcv() /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/datatype/../../../ompi/datatype/ompi_datatype_sndrcv.c:62
4 0x00000000000904c3 ompi_coll_base_alltoall_intra_basic_linear() /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_alltoall.c:643
5 0x0000000000005906 ompi_coll_tuned_alltoall_intra_dec_fixed() /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mca/coll/tuned/../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:407
6 0x0000000000060fd1 PMPI_Alltoall() /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/c/profile/palltoall.c:110
7 0x0000000000044abc ompi_alltoall_f() /var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/palltoall_f.c:86
/xxxx/mpi_tests/test_alltoall.F90:121
9 0x00000000004020f3 main() ???:0
10 0x0000000000023493 __libc_start_main() ???:0
11 0x0000000000401fde _start() ???:0

code:
mpifort OpenACC flags: -acc -Minfo=accel -ta=nvidia -Mcudalib=cufft

!$acc enter data copyin(t2)
!$acc enter data copyin(t1)

some code

!$acc host_data use_device(t1,t2)
call MPI_ALLTOALL(t1, & ! <-- crashes here
bufsize, &
MPI_DOUBLE, &
t2, &
bufsize, &
MPI_DOUBLE, &
NEW_COMM_1, &
i_err)
!$acc end host_data
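
For anyone who wants to poke at this without the full code, here is a minimal, self-contained sketch of the same pattern. This is not the actual test_alltoall.F90 — the program name, buffer size, and the use of MPI_DOUBLE_PRECISION (the standard Fortran datatype) are my own placeholders:

program alltoall_acc_sketch
  use mpi
  implicit none
  integer, parameter :: bufsize = 1024        ! elements sent to each rank
  integer :: n_ranks, my_rank, i_err, new_comm_1
  real(8), allocatable :: t1(:), t2(:)

  call MPI_INIT(i_err)
  call MPI_COMM_DUP(MPI_COMM_WORLD, new_comm_1, i_err)
  call MPI_COMM_SIZE(new_comm_1, n_ranks, i_err)
  call MPI_COMM_RANK(new_comm_1, my_rank, i_err)

  allocate(t1(bufsize*n_ranks), t2(bufsize*n_ranks))
  t1 = real(my_rank, 8)
  t2 = 0.0d0

  ! create device copies of both buffers
  !$acc enter data copyin(t1, t2)

  ! host_data exposes the device addresses of t1/t2 to the MPI call;
  ! this only works if the MPI library is CUDA-aware
  !$acc host_data use_device(t1, t2)
  call MPI_ALLTOALL(t1, bufsize, MPI_DOUBLE_PRECISION, &
                    t2, bufsize, MPI_DOUBLE_PRECISION, &
                    new_comm_1, i_err)
  !$acc end host_data

  ! bring the result back to the host and release the device copies
  !$acc exit data copyout(t2) delete(t1)

  if (my_rank == 0) print *, 'alltoall done, t2(1) = ', t2(1)

  call MPI_FINALIZE(i_err)
end program alltoall_acc_sketch

Built and launched with the same flags as above, e.g.

mpifort -acc -Minfo=accel -ta=nvidia -Mcudalib=cufft alltoall_acc_sketch.F90 -o alltoall_acc_sketch
mpirun -np 4 ./alltoall_acc_sketch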

Hi jcwright,

It appears that CUDA-aware MPI support is disabled, hence the device pointer is causing the runtime to crash.

Are you explicitly disabling it via “--mca mpi_cuda_support 0” or some other method?

-Mat

Nope. Just the plain vanilla NVHPC tgz file from NVIDIA, loading the bundled module nvhpc/23.3.

ompi_info:
[jcwright@eofe4 platform]$ ompi_info |grep cuda
Configure command line: '--prefix=/proj/nv/libraries/Linux_x86_64/23.3/openmpi/224180-rel-1' '--enable-shared' '--enable-static' '--without-tm' '--enable-mpi-cxx' '--disable-wrapper-runpath' '--without-ucx' '--without-libnl' '--with-wrapper-ldflags=-Wl,-rpath -Wl,$ORIGIN:$ORIGIN/../../lib:$ORIGIN/../../../lib:$ORIGIN/../../../compilers/lib:$ORIGIN/../../../../compilers/lib:$ORIGIN/../../../../../compilers/lib' '--enable-mpirun-prefix-by-default' '--with-libevent=internal' '--with-slurm' '--without-libnl' '--with-cuda=/proj/cuda/10.0/Linux_x86_64'
MPI extensions: affinity, cuda
MCA btl: smcuda (MCA v2.1.0, API v3.0.0, Component v3.1.5)
MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v3.1.5)
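
For what it's worth, the build-time CUDA flag can also be checked directly (assuming the stock ompi_info options; the value should come back as true when CUDA support was compiled in):

[jcwright@eofe4 platform]$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value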

The only odd thing is that this is supposed to be for CUDA 12, but the bundled Open MPI is linked against CUDA 10. I don’t know if that causes a problem or not.

OpenACC flags for mpifort (using nvfortran) are:
-acc -Minfo=accel -ta=nvidia -Mcudalib=cufft

-john

The only odd thing is that this is supposed to be for CUDA 12, but the bundled Open MPI is linked against CUDA 10. I don’t know if that causes a problem or not.

This could be it. Our team is looking into ways for HPC-X to be bundled for CUDA 12 as well as earlier versions, but for now it’s linked against CUDA 11. This can cause problems, as CUDA-aware MPI may be disabled when running on CUDA 12 systems.

I’ve seen similar issues with my codes. While I don’t see segfaults, when I set the environment variable “UCX_TLS=self,shm,cuda_copy” my tests fail because the “cuda_copy” transport is unavailable.

The workaround I was given was to set LD_LIBRARY_PATH to point to the CUDA 11.8 lib64 directory, i.e. “/Linux_x86_64/23.3/cuda/11.8/lib64/”. Hopefully it will work for you as well.
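
Something along these lines, where <nvhpc-install-dir> is just a placeholder for wherever the 23.3 release was unpacked on your system:

export LD_LIBRARY_PATH=<nvhpc-install-dir>/Linux_x86_64/23.3/cuda/11.8/lib64:$LD_LIBRARY_PATH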

Alternatively, you might try using the Open MPI 4.0.5 install instead of HPC-X.

That seemed to work. The case ran without errors. Now to do some timings.