Error when running optimized code but runs fine with debug

Hello,

I am trying to run my multi-GPU code using mpif90 and OpenACC. I am using the following compilation flags:

 -acc -fast -ta=multicore -ta=tesla:managed -Minfo=accel

and I run the code with:

mpirun --allow-run-as-root -np $(np) ./bin/dew

When I do this, I get the following error:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[36,1],4]
  Exit code:    1
--------------------------------------------------------------------------
make: *** [Makefile:125: run] Error 1

I tried to debug using the following flags:

-g -C

Initially, there were a few errors that I have since fixed. However, now when I run the code with the debug flags, it runs fine. If I switch back to the optimized flags (the first ones above), the code returns the error I pasted above. I tried using compute-sanitizer, but that did not really help either.

Any advice would be appreciated. Thank you.

EDIT: I also tried using PGI_ACC_DEBUG=1 and PGI_ACC_FILL=1. Neither did anything. Nothing was printed.

EDIT: I also tried cuda-memcheck. I get the following output:

========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2caa7b]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/ucx/libuct_cuda.so.0 (uct_cuda_base_query_devices_common + 0x23) [0x6cc3]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libuct.so.0 (uct_md_query_tl_resources + 0x93) [0x130a3]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 [0x222e1]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 [0x230a1]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 (ucp_init_version + 0x378) [0x23e48]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/openmpi/mca_pml_ucx.so (mca_pml_ucx_open + 0x11f) [0x62af]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40 (mca_base_framework_components_open + 0xc5) [0x5adc5]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 [0xacd47]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40 (mca_base_framework_open + 0x85) [0x657b5]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 (ompi_mpi_init + 0x6dd) [0xb626d]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 (MPI_Init + 0x9b) [0x6d09b]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi_mpifh.so.40 (PMPI_Init_f08 + 0x25) [0x4a765]
=========     Host Frame:./bin/dew (mpiwrapper_initmpi_ + 0x26) [0x369a6]
=========     Host Frame:./bin/dew (MAIN_ + 0x29) [0x57e9]
=========     Host Frame:./bin/dew (main + 0x33) [0x5773]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x24083]
=========     Host Frame:./bin/dew (_start + 0x2e) [0x566e]
=========
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2caa7b]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/ucx/libuct_cuda.so.0 (uct_cuda_base_query_devices_common + 0x23) [0x6cc3]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libuct.so.0 (uct_md_query_tl_resources + 0x93) [0x130a3]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 [0x222e1]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 [0x230a1]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0 (ucp_init_version + 0x378) [0x23e48]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/openmpi/mca_pml_ucx.so (mca_pml_ucx_open + 0x11f) [0x62af]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40 (mca_base_framework_components_open + 0xc5) [0x5adc5]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 [0xacd47]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40 (mca_base_framework_open + 0x85) [0x657b5]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 (ompi_mpi_init + 0x6dd) [0xb626d]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40 (MPI_Init + 0x9b) [0x6d09b]
=========     Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi_mpifh.so.40 (PMPI_Init_f08 + 0x25) [0x4a765]
=========     Host Frame:./bin/dew (mpiwrapper_initmpi_ + 0x26) [0x369a6]
=========     Host Frame:./bin/dew (MAIN_ + 0x29) [0x57e9]
=========     Host Frame:./bin/dew (main + 0x33) [0x5773]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x24083]
=========     Host Frame:./bin/dew (_start + 0x2e) [0x566e]
=========

Below is the output from compute-sanitizer:

========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x2caa7b]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:base/cuda_iface.c:22:uct_cuda_base_query_devices_common [0x6cc3]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/ucx/libuct_cuda.so.0
=========     Host Frame:base/uct_md.c:115:uct_md_query_tl_resources [0x130a3]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libuct.so.0
=========     Host Frame:core/ucp_context.c:1332:ucp_add_component_resources [0x222e1]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
=========     Host Frame:core/ucp_context.c:1470:ucp_fill_resources [0x230a1]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
=========     Host Frame:core/ucp_context.c:1887:ucp_init_version [0x23e48]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
=========     Host Frame:../../../../../ompi/mca/pml/ucx/pml_ucx.c:236:mca_pml_ucx_open [0x62af]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/openmpi/mca_pml_ucx.so
=========     Host Frame:../../../../opal/mca/base/mca_base_components_open.c:68:mca_base_framework_components_open [0x5adc5]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40
=========     Host Frame:../../../../ompi/mca/pml/base/pml_base_frame.c:183:mca_pml_base_open [0xacd47]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
=========     Host Frame:../../../../opal/mca/base/mca_base_framework.c:181:mca_base_framework_open [0x657b5]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40
=========     Host Frame:../../ompi/runtime/ompi_mpi_init.c:617:ompi_mpi_init [0xb626d]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
=========     Host Frame:/var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/c/profile/pinit.c:67:MPI_Init [0x6d09b]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
=========     Host Frame:/var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/pinit_f.c:85:PMPI_Init_f08 [0x4a765]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi_mpifh.so.40
=========     Host Frame:/home/hoffmann/work/main8_initProfile/src/mpi/mpiWrapper.F90:30:mpiwrapper_initmpi_ [0x369a6]
=========                in /home/hoffmann/work/main8_initProfile/./bin/dew
=========     Host Frame:/home/hoffmann/work/main8_initProfile/src/common/main.F90:37:MAIN_ [0x57e9]
=========                in /home/hoffmann/work/main8_initProfile/./bin/dew
=========     Host Frame:main [0x5773]
=========                in /home/hoffmann/work/main8_initProfile/./bin/dew
=========     Host Frame:__libc_start_main [0x24083]
=========                in /usr/lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x566e]
=========                in /home/hoffmann/work/main8_initProfile/./bin/dew
========= 
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x2caa7b]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:base/cuda_iface.c:22:uct_cuda_base_query_devices_common [0x6cc3]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/ucx/libuct_cuda.so.0
=========     Host Frame:base/uct_md.c:115:uct_md_query_tl_resources [0x130a3]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libuct.so.0
=========     Host Frame:core/ucp_context.c:1332:ucp_add_component_resources [0x222e1]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
=========     Host Frame:core/ucp_context.c:1470:ucp_fill_resources [0x230a1]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
=========     Host Frame:core/ucp_context.c:1887:ucp_init_version [0x23e48]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ucx/mt/lib/libucp.so.0
=========     Host Frame:../../../../../ompi/mca/pml/ucx/pml_ucx.c:236:mca_pml_ucx_open [0x62af]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/openmpi/mca_pml_ucx.so
=========     Host Frame:../../../../opal/mca/base/mca_base_components_open.c:68:mca_base_framework_components_open [0x5adc5]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40
=========     Host Frame:../../../../ompi/mca/pml/base/pml_base_frame.c:183:mca_pml_base_open [0xacd47]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
=========     Host Frame:../../../../opal/mca/base/mca_base_framework.c:181:mca_base_framework_open [0x657b5]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libopen-pal.so.40
=========     Host Frame:../../ompi/runtime/ompi_mpi_init.c:617:ompi_mpi_init [0xb626d]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
=========     Host Frame:/var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/c/profile/pinit.c:67:MPI_Init [0x6d09b]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi.so.40
=========     Host Frame:/var/jenkins/workspace/rel_nv_lib_hpcx_x86_64/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/pinit_f.c:85:PMPI_Init_f08 [0x4a765]
=========                in /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/comm_libs/hpcx/hpcx-2.11/ompi/lib/libmpi_mpifh.so.40
=========     Host Frame:/home/hoffmann/work/main8_initProfile/src/mpi/mpiWrapper.F90:30:mpiwrapper_initmpi_ [0x369a6]
=========                in /home/hoffmann/work/main8_initProfile/./bin/dew
=========     Host Frame:/home/hoffmann/work/main8_initProfile/src/common/main.F90:37:MAIN_ [0x57e9]
=========                in /home/hoffmann/work/main8_initProfile/./bin/dew
=========     Host Frame:main [0x5773]
=========                in /home/hoffmann/work/main8_initProfile/./bin/dew
=========     Host Frame:__libc_start_main [0x24083]
=========                in /usr/lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x566e]
=========                in /home/hoffmann/work/main8_initProfile/./bin/dew
========= 

Hi Natan,

You can ignore this. The OpenACC runtime is just checking if there’s already a CUDA context created before starting a new one. Compute-sanitizer flags it as an error even though it’s benign.

When you say that it works in debugging mode, are you compiling with OpenACC enabled, or did you just replace “-fast” with “-g”?

No output with “PGI_ACC_DEBUG=1” suggests that the code isn’t running on the device, but I’m not clear in which case you used it.

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

This means that a bad address was accessed on the device. It’s similar to a seg fault on the host.

Since you’re using managed memory, it’s unlikely to be a host pointer being accessed. Hence I’d look for a stack or heap overflow.

Note that your second post may be related. Does your code use automatics in the device code?

My best guess is that you do use automatics, which are implicitly allocated arrays. The default heap size on the device is quite small, so it’s easy to get overflows. You can use the environment variable “NV_ACC_CUDA_HEAPSIZE=” to increase the heap, but I usually recommend removing automatics if possible, since device-side allocation gets serialized, which may cause performance issues.
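For illustration, here’s the kind of pattern I mean. This is a made-up sketch, not taken from your code, and all the names are hypothetical. The array “work” is an automatic: it’s sized by the dummy argument “n” and allocated implicitly on every call, and when the routine executes on the device that allocation comes out of the (small) device heap:

module flux_mod
contains
  ! Runs on the device when called from inside a compute region.
  subroutine point_flux(q, n)
    !$acc routine seq
    implicit none
    integer, intent(in)    :: n
    real(8), intent(inout) :: q(n)
    real(8) :: work(n)   ! automatic array: implicitly allocated on every call;
                         ! on the GPU this comes from the device heap
    integer :: j
    do j = 1, n
       work(j) = 2.0d0 * q(j)
    end do
    q = work
  end subroutine point_flux
end module flux_mod

If something like this gets called from an “!$acc parallel loop”, every thread does its own device-side allocation, which is both easy to overflow and slow since the allocations serialize. The usual fix is to pass in a preallocated workspace array (or make the scratch array private on the loop) instead of relying on the implicit allocation.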

Of course this is just a guess, so if you can provide a reproducing example, I should be able to help more.

-Mat

Hi Mat,

Thanks as always for your response.

When you say that it works in debugging mode, are you compiling with OpenACC enabled, or did you just replace “-fast” with “-g”?

Yes, it is with the -g flag instead of -acc, so I guess that would be why I do not see that error.

No output with “PGI_ACC_DEBUG=1” suggests that the code isn’t running on the device, but I’m not clear in which case you used it.

I tried this with the -acc flag and there was still no output.

Does your code use automatics in the device code?

Please excuse my ignorance, but what do you mean by automatics or implicitly allocated arrays? As far as I know, the arrays I use are either globally allocated or allocated at the beginning of subroutines. Could it be a result of declaring private variables in some loops? I try to avoid this, but in some loops it is necessary (I can try to modify some algorithms to avoid it).

You can use the environment variable “NV_ACC_CUDA_HEAPSIZE=” to increase the heap

I tried NV_ACC_CUDA_HEAPSIZE=64 but that did not help.

Of course this is just a guess, so if you can provide a reproducing example, I should be able to help more.

The code is rather large, and I am not sure how I would create a minimal reproducing example. But I would be happy to share the code if that would help.

Thanks again for your help.

-Natan

There’s some disconnect, since this indicates that the code isn’t running on the device. But then you shouldn’t be getting a device error.

You are also compiling with “-acc=multicore”, so is it possible that you’re running on the CPU in this case?

I tried NV_ACC_CUDA_HEAPSIZE=64 but that did not help.

I should have been clearer. The value is the size in bytes, so this is setting the heap to 64 bytes. Try using a much larger heap, like NV_ACC_CUDA_HEAPSIZE=512MB, or possibly even larger.

Please excuse my ignorance, what do you mean by automatics or implicitly allocated arrays?

Automatic arrays are implicitly allocated upon entry to a subroutine, with the size defined by an argument.
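To make the distinction concrete, here’s a tiny made-up example (the names are hypothetical, not from your project):

subroutine demo(n)
  implicit none
  integer, intent(in) :: n

  real(8) :: a(n)                ! automatic: implicitly allocated on entry,
                                 ! no ALLOCATE statement anywhere
  real(8), allocatable :: b(:)   ! explicit: you allocate it yourself

  allocate(b(n))                 ! "allocated at the beginning of the subroutine"
  a = 0.0d0
  b = 1.0d0
  deallocate(b)
end subroutine demo

Both end up as run-time allocations; the difference is that the automatic one happens behind your back, which makes it easy to miss when such a routine ends up running on the device.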

Could it be a result of declaring private variables in some loops?

Possible. For private worker/vector arrays, the compiler will allocate a large block of global memory whose size is the number of workers/vectors times the size of the array. If the total aggregate size is greater than 2 GB, you may need to add the flag “-Mlarge_arrays” so the indexing uses 64-bit offsets. It’s rare for this to occur, but possible.
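As a rough illustration of where that aggregate comes from (again, made-up names and sizes, not your code, and assuming nvars never exceeds nvmax):

subroutine accumulate(q, res, ncells, nvars, scale)
  implicit none
  integer, parameter     :: nvmax = 16
  integer, intent(in)    :: ncells, nvars
  real(8), intent(in)    :: q(nvars, ncells), scale
  real(8), intent(inout) :: res(nvars, ncells)
  real(8) :: tmp(nvmax)          ! per-iteration scratch, privatized below
  integer :: i, k

  ! Each gang/vector instance gets its own copy of tmp, so the compiler sets
  ! aside roughly (number of gangs * vector length) * sizeof(tmp) bytes of
  ! global memory up front. With a large tmp and many threads this aggregate
  ! can pass 2 GB, which is where -Mlarge_arrays (64-bit indexing) comes in.
  !$acc parallel loop gang vector private(tmp) copyin(q) copy(res)
  do i = 1, ncells
     do k = 1, nvars
        tmp(k) = q(k, i) * scale
     end do
     do k = 1, nvars
        res(k, i) = res(k, i) + tmp(k)
     end do
  end do
end subroutine accumulate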

But, I would be happy to share the code if that would help.

Minimal reproducers are ideal, but I’m fine with using the full source if you can share.

-Mat

There’s some disconnect, since this indicates that the code isn’t running on the device. But then you shouldn’t be getting a device error.
You are also compiling with “-acc=multicore”, so is it possible that you’re running on the CPU in this case?

Yes, it does seem a bit weird. I am not sure why there is no output. I removed that flag a while ago; my mistake for keeping it in the original post.

I should have been clearer. The value is the size in bytes, so this is setting the heap to 64 bytes. Try using a much larger heap, like NV_ACC_CUDA_HEAPSIZE=512MB, or possibly even larger.

I tried much larger values and it still did not work. I don’t think this is the issue, since a previous version of this code worked with a larger mesh/grid.

Possible. For private worker/vector arrays, the compiler will allocate a large block of global memory whose size is the number of workers/vectors times the size of the array. If the total aggregate size is greater than 2 GB, you may need to add the flag “-Mlarge_arrays” so the indexing uses 64-bit offsets. It’s rare for this to occur, but possible.

I have also added this flag, but no luck.

Minimal reproducers are ideal, but I’m fine with using the full source if you can share.

Thank you. I will send it via email. Is there a way to see where in the code this device-side seg fault is occurring?