[MultiGPU][OpenMP][Cuda]: fatal error: Could not launch CUDA kernel on device 1

Hello all,

I am facing an issue with my code, using OpenMP offloading and Cuda library (AmgX). On a multiGPU node, if using more MPI processes than available devices, I got the error:
“fatal error: Could not launch CUDA kernel on device 1” when the first OpenMP kernel will be launched (and after AmgX library was initialized).
-If I am using less MPI processes than devices, it works
-If I am using only one device and even more MPI processes, it works
-If I disable AmgX library, it works whatever MPI processes I am using on the multiGPU node

Could be related to a missing option when building the code using both OpenMP offloading and Cuda ?

Thanks,

Pierre

Hi Pierre,

I’ve not worked with AmgX myself so not sure what’s going on.

Could be related to a missing option when building the code using both OpenMP offloading and Cuda ?

Possible, but doubtful. Though you can try adding the “-cuda” flag to see if it has an effect.

How are you setting the device number to use? Could AmgX be setting it to a different device?

-Mat

1 Like

Thanks Mat,

I was thinking indeed to the -cuda option I am not currently using during the build. So I need to try, and will inspect how the device number are set. For OpenMP, I am using this in the code:

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int device = rank % omp_get_num_devices();
  omp_set_default_device(device);
  cerr << "Initializing OpenMP offload on devices..."  << endl;
  cerr << "Detecting " << omp_get_num_devices() << " devices." << endl;
  cerr << "Assigning rank " << rank << " to device " << device << endl;
  // Dummy target region, so as not to measure startup time later:
  #pragma omp target
  { ; }

For AmgX (an issue have been opened too: How AmgX can cause an OpenMP-target code to crash when using more MPI processes than devices on a GPU node ? · Issue #233 · NVIDIA/AMGX · GitHub), I will have a look. Thanks,

Pierre

I rebuilt the code with -cuda option but got an error during the link:

/volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMPI/mpich/bin/mpicxx -DNDEBUG -mavx2 -O3 -fPIC -Ktrap=fp -Kieee -noswitcherror -std=c++14 -Wno-long-long -Wall -Wshadow -Wextra -Wno-unused-parameter -Werror --display_error_number --diag_suppress185 -cuda -mp -target=gpu   -lgfortran -lnvf -mp  CMakeFiles/TRUST_mpi_opt.dir/MAIN/the_main.cpp.o CMakeFiles/TRUST_mpi_opt.dir/MAIN/mon_main.cpp.o -o /volatile/catA/pl254994/trust/copyFromDevice/exec/TRUST_mpi_opt  -Wl,-rpath,/volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBAMGX/AmgX/lib:/volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64:/volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBROCALUTION/lib: /volatile/catA/pl254994/trust/copyFromDevice/lib/TRUST_mpi_opt.a /volatile/catA/pl254994/trust/copyFromDevice/lib/libmfft.a /volatile/catA/pl254994/trust/copyFromDevice/lib/libmfft_c.a /volatile/catA/pl254994/trust/copyFromDevice/lib/libskit.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBLATAFILTER/lib/liblatafilter.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBICOCOAPI/lib/libicocoapi.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libpetsc.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libsuperlu_dist.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libcmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libdmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libmumps_common.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libsmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libzmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libpord.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libHYPRE.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libscalapack.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libparmetis.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libmetis.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptesmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptscotcherr.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptscotcherrexit.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptscotchparmetis.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptscotch.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libscotch.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libscotcherr.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libscotcherrexit.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBAMGX/AmgX/lib/libamgxsh.so /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBAMGX/AmgXWrapper/lib64/libAmgXWrapper.a /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcudart.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcublas.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcublasLt.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcufft.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcusparse.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcusolver.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libnvToolsExt.so /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBLAPACK/liblapack.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBLAPACK/libblas.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBVC/lib/libVc.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBROCALUTION/lib/librocalution.so /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBOSQP/lib/libosqp.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libmedloader.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libmedcouplingremapper.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libmedcoupling.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libmedicoco.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libparamedmem.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libparamedloader.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libinterpkernel.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libmed.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libmedimport.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libmedC.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libhdf5.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libhdf5_hl.a -Wl,-Bstatic -lmpifort -Wl,-Bdynamic -ldl -lX11 -lpthread -ldl -lgfortran -lnvf -mp -lX11 -lpthread -lgfortran -lnvf -mp -lmpifort -lnvf -lrt
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:3:36: error: redefinition of '__cudaRegisterLinkedBinary_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_mon_main_cpp'
    3 | #define __REGISTERFUNCNAME_CORE(X) __cudaRegisterLinkedBinary##X
      |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:4:31: note: in expansion of macro '__REGISTERFUNCNAME_CORE'
    4 | #define __REGISTERFUNCNAME(X) __REGISTERFUNCNAME_CORE(X)
      |                               ^~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:9:8: note: in expansion of macro '__REGISTERFUNCNAME'
    9 |   void __REGISTERFUNCNAME(id)(void (*callback_fp)(void **),       \
      |        ^~~~~~~~~~~~~~~~~~
/volatile/catA/pl254994/.tmp_TRUST_trust/pgcudaq-QoiW-Fdr-Fg.reg.c:1364:1: note: in expansion of macro 'DEFINE_REGISTER_FUNC'
 1364 | DEFINE_REGISTER_FUNC(_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_mon_main_cpp)
      | ^~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:3:36: note: previous definition of '__cudaRegisterLinkedBinary_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_mon_main_cpp' was here
    3 | #define __REGISTERFUNCNAME_CORE(X) __cudaRegisterLinkedBinary##X
      |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:4:31: note: in expansion of macro '__REGISTERFUNCNAME_CORE'
    4 | #define __REGISTERFUNCNAME(X) __REGISTERFUNCNAME_CORE(X)
      |                               ^~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:9:8: note: in expansion of macro '__REGISTERFUNCNAME'
    9 |   void __REGISTERFUNCNAME(id)(void (*callback_fp)(void **),       \
      |        ^~~~~~~~~~~~~~~~~~
/volatile/catA/pl254994/.tmp_TRUST_trust/pgcudaq-QoiW-Fdr-Fg.reg.c:3:1: note: in expansion of macro 'DEFINE_REGISTER_FUNC'
    3 | DEFINE_REGISTER_FUNC(_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_mon_main_cpp)
      | ^~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:3:36: error: redefinition of '__cudaRegisterLinkedBinary_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_the_main_cpp'
    3 | #define __REGISTERFUNCNAME_CORE(X) __cudaRegisterLinkedBinary##X
      |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:4:31: note: in expansion of macro '__REGISTERFUNCNAME_CORE'
    4 | #define __REGISTERFUNCNAME(X) __REGISTERFUNCNAME_CORE(X)
      |                               ^~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:9:8: note: in expansion of macro '__REGISTERFUNCNAME'
    9 |   void __REGISTERFUNCNAME(id)(void (*callback_fp)(void **),       \
      |        ^~~~~~~~~~~~~~~~~~
/volatile/catA/pl254994/.tmp_TRUST_trust/pgcudaq-QoiW-Fdr-Fg.reg.c:1365:1: note: in expansion of macro 'DEFINE_REGISTER_FUNC'
 1365 | DEFINE_REGISTER_FUNC(_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_the_main_cpp)
      | ^~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:3:36: note: previous definition of '__cudaRegisterLinkedBinary_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_the_main_cpp' was here
    3 | #define __REGISTERFUNCNAME_CORE(X) __cudaRegisterLinkedBinary##X
      |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:4:31: note: in expansion of macro '__REGISTERFUNCNAME_CORE'
    4 | #define __REGISTERFUNCNAME(X) __REGISTERFUNCNAME_CORE(X)
      |                               ^~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:9:8: note: in expansion of macro '__REGISTERFUNCNAME'
    9 |   void __REGISTERFUNCNAME(id)(void (*callback_fp)(void **),       \
      |        ^~~~~~~~~~~~~~~~~~
/volatile/catA/pl254994/.tmp_TRUST_trust/pgcudaq-QoiW-Fdr-Fg.reg.c:2:1: note: in expansion of macro 'DEFINE_REGISTER_FUNC'
    2 | DEFINE_REGISTER_FUNC(_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_the_main_cpp)
      | ^~~~~~~~~~~~~~~~~~~~
pgacclnk: child process exit status 2: /product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/bin/tools/nvdd

Spot on Mat! Indeed, there was different process setting on devices between AmgX part and OpenMP part of the code. Thanks a lot for your suggestion, it saves my day :-)

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.