Hello,
I am trying to run my multi-GPU code, built with mpif90 and OpenACC, using the following compilation flags:
-acc -fast -ta=multicore -ta=tesla:managed -Minfo=accel
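For clarity, here is the same flag set on a full compile line (the source list is a placeholder, not my actual Makefile). Note that `-ta=multicore` and `-ta=tesla:managed` name two different offload targets on the same line, in case that is itself relevant:

```shell
# Illustrative compile line; the source file list is a placeholder.
# -ta=multicore and -ta=tesla:managed request different offload targets.
FC=mpif90
ACCFLAGS="-acc -fast -ta=multicore -ta=tesla:managed -Minfo=accel"
# $FC $ACCFLAGS -o bin/dew src/*.F90
```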
and I run the code with:
mpirun --allow-run-as-root -np $(np) ./bin/dew
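Since each rank needs its own GPU, the launch can also go through a small wrapper that pins every local rank to one device. This is only a sketch of the usual pattern, assuming Open MPI (which exports `OMPI_COMM_WORLD_LOCAL_RANK` to each process) and one GPU per rank; the wrapper name and `NGPUS` default are made up:

```shell
#!/bin/sh
# bind_gpu.sh (hypothetical wrapper): map each local MPI rank to one GPU.
# Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for every launched process.
NGPUS=${NGPUS:-4}                            # GPUs per node; adjust to the machine
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % NGPUS ))
if [ "$#" -gt 0 ]; then
  exec "$@"                                  # run the real binary with the binding in place
fi
# usage: mpirun -np 4 ./bind_gpu.sh ./bin/dew
```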
When I do this, I get the following error:
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[36,1],4]
Exit code: 1
--------------------------------------------------------------------------
make: *** [Makefile:125: run] Error 1
I tried to debug by rebuilding with the following flags:
-g -C
Initially this exposed a few errors, which I have since fixed. Now, however, the code runs fine with the debug flags; only when I switch back to the optimized flags above does it fail with the error pasted at the top. I also tried compute-sanitizer, but that did not really help either.
Any advice would be appreciated. Thank you.
EDIT: I also tried setting PGI_ACC_DEBUG=1 and PGI_ACC_FILL=1. Neither had any effect; nothing was printed.
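For completeness, this is roughly how those variables were passed to the ranks. With Open MPI, variables set only in the launching shell are not guaranteed to reach remote ranks, so they are forwarded explicitly with `-x`; the rank count variable `NP` is a placeholder:

```shell
# Forward the OpenACC runtime-debug variables to every MPI rank.
export PGI_ACC_DEBUG=1
export PGI_ACC_FILL=1
NP=${NP:-4}   # placeholder rank count
# mpirun --allow-run-as-root -np "$NP" \
#        -x PGI_ACC_DEBUG -x PGI_ACC_FILL ./bin/dew
```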
EDIT: I also tried cuda-memcheck. I get the following output:
========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2caa7b]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/ucx/libuct_cuda.so.0 (uct_cuda_base_query_devices_common + 0x23) [0x6cc3]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libuct.so.0 (uct_md_query_tl_resources + 0x93) [0x130a3]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 [0x222e1]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 [0x230a1]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 (ucp_init_version + 0x378) [0x23e48]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/openmpi/mca_pml_ucx.so (mca_pml_ucx_open + 0x11f) [0x62af]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40 (mca_base_framework_components_open + 0xc5) [0x5adc5]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 [0xacd47]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40 (mca_base_framework_open + 0x85) [0x657b5]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 (ompi_mpi_init + 0x6dd) [0xb626d]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 (MPI_Init + 0x9b) [0x6d09b]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi_mpifh.so.40 (PMPI_Init_f08 + 0x25) [0x4a765]
========= Host Frame:./bin/dew (mpiwrapper_initmpi_ + 0x26) [0x369a6]
========= Host Frame:./bin/dew (MAIN_ + 0x29) [0x57e9]
========= Host Frame:./bin/dew (main + 0x33) [0x5773]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x24083]
========= Host Frame:./bin/dew (_start + 0x2e) [0x566e]
=========
========= (the identical CUDA_ERROR_INVALID_CONTEXT backtrace is printed a second time)
=========
Below is the output from compute-sanitizer:
========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x2caa7b]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:base/cuda_iface.c:22:uct_cuda_base_query_devices_common [0x6cc3]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/ucx/libuct_cuda.so.0
========= Host Frame:base/uct_md.c:115:uct_md_query_tl_resources [0x130a3]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libuct.so.0
========= Host Frame:core/ucp_context.c:1332:ucp_add_component_resources [0x222e1]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
========= Host Frame:core/ucp_context.c:1470:ucp_fill_resources [0x230a1]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
========= Host Frame:core/ucp_context.c:1887:ucp_init_version [0x23e48]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
========= Host Frame:../../../../../ompi/mca/pml/ucx/pml_ucx.c:236:mca_pml_ucx_open [0x62af]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/openmpi/mca_pml_ucx.so
========= Host Frame:../../../../opal/mca/base/mca_base_components_open.c:68:mca_base_framework_components_open [0x5adc5]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40
========= Host Frame:../../../../ompi/mca/pml/base/pml_base_frame.c:183:mca_pml_base_open [0xacd47]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
========= Host Frame:../../../../opal/mca/base/mca_base_framework.c:181:mca_base_framework_open [0x657b5]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40
========= Host Frame:../../ompi/runtime/ompi_mpi_init.c:617:ompi_mpi_init [0xb626d]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
========= Host Frame:/var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/c/profile/pinit.c:67:MPI_Init [0x6d09b]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
========= Host Frame:/var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/pinit_f.c:85:PMPI_Init_f08 [0x4a765]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi_mpifh.so.40
========= Host Frame:/home/hoffmann/work/main8_initProfile/src/mpi/mpiWrapper.F90:30:mpiwrapper_initmpi_ [0x369a6]
========= in /home/hoffmann/work/main8_initProfile/./bin/dew
========= Host Frame:/home/hoffmann/work/main8_initProfile/src/common/main.F90:37:MAIN_ [0x57e9]
========= in /home/hoffmann/work/main8_initProfile/./bin/dew
========= Host Frame:main [0x5773]
========= in /home/hoffmann/work/main8_initProfile/./bin/dew
========= Host Frame:__libc_start_main [0x24083]
========= in /usr/lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0x566e]
========= in /home/hoffmann/work/main8_initProfile/./bin/dew
=========
========= (the identical CUDA_ERROR_INVALID_CONTEXT backtrace is printed a second time)
=========