CdaCrr
February 23, 2023, 7:29pm
1
Hello all,
I am facing an issue with my code, using OpenMP offloading and Cuda library (AmgX). On a multiGPU node, if using more MPI processes than available devices, I got the error:
“fatal error: Could not launch CUDA kernel on device 1” when the first OpenMP kernel will be launched (and after AmgX library was initialized).
-If I am using less MPI processes than devices, it works
-If I am using only one device and even more MPI processes, it works
-If I disable AmgX library, it works whatever MPI processes I am using on the multiGPU node
Could be related to a missing option when building the code using both OpenMP offloading and Cuda ?
Thanks,
Pierre
Hi Pierre,
I’ve not worked with AmgX myself so not sure what’s going on.
Could be related to a missing option when building the code using both OpenMP offloading and Cuda ?
Possible, but doubtful. Though you can try adding the “-cuda” flag to see if it has an effect.
How are you setting the device number to use? Could AmgX be setting it to a different device?
-Mat
1 Like
CdaCrr
February 24, 2023, 7:04am
3
Thanks Mat,
I was thinking indeed to the -cuda option I am not currently using during the build. So I need to try, and will inspect how the device number are set. For OpenMP, I am using this in the code:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int device = rank % omp_get_num_devices();
omp_set_default_device(device);
cerr << "Initializing OpenMP offload on devices..." << endl;
cerr << "Detecting " << omp_get_num_devices() << " devices." << endl;
cerr << "Assigning rank " << rank << " to device " << device << endl;
// Dummy target region, so as not to measure startup time later:
#pragma omp target
{ ; }
For AmgX (an issue have been opened too: How AmgX can cause an OpenMP-target code to crash when using more MPI processes than devices on a GPU node ? · Issue #233 · NVIDIA/AMGX · GitHub ), I will have a look. Thanks,
Pierre
CdaCrr
February 24, 2023, 7:55am
4
I rebuilt the code with -cuda option but got an error during the link:
/volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMPI/mpich/bin/mpicxx -DNDEBUG -mavx2 -O3 -fPIC -Ktrap=fp -Kieee -noswitcherror -std=c++14 -Wno-long-long -Wall -Wshadow -Wextra -Wno-unused-parameter -Werror --display_error_number --diag_suppress185 -cuda -mp -target=gpu -lgfortran -lnvf -mp CMakeFiles/TRUST_mpi_opt.dir/MAIN/the_main.cpp.o CMakeFiles/TRUST_mpi_opt.dir/MAIN/mon_main.cpp.o -o /volatile/catA/pl254994/trust/copyFromDevice/exec/TRUST_mpi_opt -Wl,-rpath,/volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBAMGX/AmgX/lib:/volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64:/volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBROCALUTION/lib: /volatile/catA/pl254994/trust/copyFromDevice/lib/TRUST_mpi_opt.a /volatile/catA/pl254994/trust/copyFromDevice/lib/libmfft.a /volatile/catA/pl254994/trust/copyFromDevice/lib/libmfft_c.a /volatile/catA/pl254994/trust/copyFromDevice/lib/libskit.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBLATAFILTER/lib/liblatafilter.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBICOCOAPI/lib/libicocoapi.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libpetsc.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libsuperlu_dist.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libcmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libdmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libmumps_common.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libsmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libzmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libpord.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libHYPRE.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libscalapack.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libparmetis.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libmetis.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptesmumps.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptscotcherr.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptscotcherrexit.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptscotchparmetis.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libptscotch.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libscotch.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libscotcherr.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBPETSC/petsc/linux_opt/lib/libscotcherrexit.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBAMGX/AmgX/lib/libamgxsh.so /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBAMGX/AmgXWrapper/lib64/libAmgXWrapper.a /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcudart.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcublas.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcublasLt.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcufft.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcusparse.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libcusolver.so /volatile/catA/pl254994/.tmp_TRUST_trust/cuda-11.2.0/lib64/libnvToolsExt.so /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBLAPACK/liblapack.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBLAPACK/libblas.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBVC/lib/libVc.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBROCALUTION/lib/librocalution.so /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBOSQP/lib/libosqp.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libmedloader.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libmedcouplingremapper.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libmedcoupling.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libmedicoco.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libparamedmem.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libparamedloader.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMEDCOUPLING/install/lib/libinterpkernel.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libmed.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libmedimport.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libmedC.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libhdf5.a /volatile/catA/pl254994/trust/copyFromDevice/lib/src/LIBMED/lib/libhdf5_hl.a -Wl,-Bstatic -lmpifort -Wl,-Bdynamic -ldl -lX11 -lpthread -ldl -lgfortran -lnvf -mp -lX11 -lpthread -lgfortran -lnvf -mp -lmpifort -lnvf -lrt
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:3:36: error: redefinition of '__cudaRegisterLinkedBinary_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_mon_main_cpp'
3 | #define __REGISTERFUNCNAME_CORE(X) __cudaRegisterLinkedBinary##X
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:4:31: note: in expansion of macro '__REGISTERFUNCNAME_CORE'
4 | #define __REGISTERFUNCNAME(X) __REGISTERFUNCNAME_CORE(X)
| ^~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:9:8: note: in expansion of macro '__REGISTERFUNCNAME'
9 | void __REGISTERFUNCNAME(id)(void (*callback_fp)(void **), \
| ^~~~~~~~~~~~~~~~~~
/volatile/catA/pl254994/.tmp_TRUST_trust/pgcudaq-QoiW-Fdr-Fg.reg.c:1364:1: note: in expansion of macro 'DEFINE_REGISTER_FUNC'
1364 | DEFINE_REGISTER_FUNC(_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_mon_main_cpp)
| ^~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:3:36: note: previous definition of '__cudaRegisterLinkedBinary_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_mon_main_cpp' was here
3 | #define __REGISTERFUNCNAME_CORE(X) __cudaRegisterLinkedBinary##X
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:4:31: note: in expansion of macro '__REGISTERFUNCNAME_CORE'
4 | #define __REGISTERFUNCNAME(X) __REGISTERFUNCNAME_CORE(X)
| ^~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:9:8: note: in expansion of macro '__REGISTERFUNCNAME'
9 | void __REGISTERFUNCNAME(id)(void (*callback_fp)(void **), \
| ^~~~~~~~~~~~~~~~~~
/volatile/catA/pl254994/.tmp_TRUST_trust/pgcudaq-QoiW-Fdr-Fg.reg.c:3:1: note: in expansion of macro 'DEFINE_REGISTER_FUNC'
3 | DEFINE_REGISTER_FUNC(_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_mon_main_cpp)
| ^~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:3:36: error: redefinition of '__cudaRegisterLinkedBinary_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_the_main_cpp'
3 | #define __REGISTERFUNCNAME_CORE(X) __cudaRegisterLinkedBinary##X
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:4:31: note: in expansion of macro '__REGISTERFUNCNAME_CORE'
4 | #define __REGISTERFUNCNAME(X) __REGISTERFUNCNAME_CORE(X)
| ^~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:9:8: note: in expansion of macro '__REGISTERFUNCNAME'
9 | void __REGISTERFUNCNAME(id)(void (*callback_fp)(void **), \
| ^~~~~~~~~~~~~~~~~~
/volatile/catA/pl254994/.tmp_TRUST_trust/pgcudaq-QoiW-Fdr-Fg.reg.c:1365:1: note: in expansion of macro 'DEFINE_REGISTER_FUNC'
1365 | DEFINE_REGISTER_FUNC(_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_the_main_cpp)
| ^~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:3:36: note: previous definition of '__cudaRegisterLinkedBinary_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_the_main_cpp' was here
3 | #define __REGISTERFUNCNAME_CORE(X) __cudaRegisterLinkedBinary##X
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:4:31: note: in expansion of macro '__REGISTERFUNCNAME_CORE'
4 | #define __REGISTERFUNCNAME(X) __REGISTERFUNCNAME_CORE(X)
| ^~~~~~~~~~~~~~~~~~~~~~~
/product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/include_acc/linkstub.c:9:8: note: in expansion of macro '__REGISTERFUNCNAME'
9 | void __REGISTERFUNCNAME(id)(void (*callback_fp)(void **), \
| ^~~~~~~~~~~~~~~~~~
/volatile/catA/pl254994/.tmp_TRUST_trust/pgcudaq-QoiW-Fdr-Fg.reg.c:2:1: note: in expansion of macro 'DEFINE_REGISTER_FUNC'
2 | DEFINE_REGISTER_FUNC(_66_volatile_catA_pl254994_trust_copyFromDevice_src_MAIN_the_main_cpp)
| ^~~~~~~~~~~~~~~~~~~~
pgacclnk: child process exit status 2: /product/ubuntu20-x86_64/apps/NVHPC-nompi/22.1/Linux_x86_64/22.1/compilers/bin/tools/nvdd
CdaCrr
February 24, 2023, 12:23pm
5
Spot on Mat! Indeed, there was different process setting on devices between AmgX part and OpenMP part of the code. Thanks a lot for your suggestion, it saves my day :-)
system
Closed
March 10, 2023, 12:24pm
6
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.