Unexpected cudaErrorMissingConfiguration after returning from function in shared library

Hello,

In my project I encountered a strange issue where, after a call to a function returns that is implemented in a shared library, a call to cudaGetLastError() will return cudaErrorMissingConfiguration, while when called at the end of the function in the shared library, it returns no error. In both cases, I call cudaDeviceSynchronize() before to ensure that all asynchronous work is waited for. I am also pretty sure that there is no C++ class destructor executing any other code at the end of the shared library function, and the program is single-threaded.

The above behavior seems to me like it should never happen, thus I am suspecting it to be caused by a bug in CUDA. I created a (relatively) small test program to reproduce the issue, which is attached to this post: cuda-bug-sample-program.tar.gz (5.6 MB). Unfortunately, I had to include loguru with the program as the bug did not occur without it, but as far as I can tell, loguru’s code is executed only after the bug occurs.

To give a short description of what the program does:

  • The executable testapp's main function is implemented in main.cu. It allocates two pieces of CUDA memory and then calls LibFunction().
  • LibFunction() is implemented in lib.cu, which gets compiled into the shared library testlib. It executes cub::DeviceScan::ExclusiveSum() on the two pieces of CUDA memory, checks for errors, and returns. The error checks in this function do not report anything.
  • After having called LibFunction(), main() performs error checking itself. Here, suddenly the error cudaErrorMissingConfiguration is returned by the CUDA_CHECK_NO_ERROR() macro (which internally calls cudaGetLastError()).

Here is how you can further convince yourself that this appears to be some kind of compiler bug:

  • The file unrelated.cu contains a function ThisFunctionIsNotCalledAnywhere(), which, true to its name, is not called anywhere. This file is compiled into the testapp executable. If unrelated.cu is commented out or removed from CMakeLists.txt, then the program suddenly runs successfully, even though it shouldn’t affect its execution at all. (The function ThisFunctionIsNotCalledAnywhere() contains a call to cub::DeviceScan::ExclusiveSum() - I suspect that this in some weird way interferes with the call to the same function from the shared library).
  • If the file cuda_util.h is edited, replacing the two instances of loguru’s LOG(FATAL) in this file with std::cout, then the program also finishes without errors, even though these lines should only be executed if an error has already occurred.
  • The issue happens only if -DCMAKE_BUILD_TYPE=RelWithDebInfo is used in the CMake call to configure the program. With Debug or Release, it does not occur.
  • valgrind does not show any error, and compute-sanitizer only shows the cudaErrorMissingConfiguration error that is also found by the program’s error checks.

Given that the issue seemingly only happens in very specific circumstances, I hope that somebody will still be able to reproduce this. I included the build folder including the compiled program in the attached archive for the case it helps anything.
In the specific circumstances where the issue occurs, it seems to happen very consistently however. With my full project code, I also tried downgrading CUDA to 11.5.2 and 11.4.4, and the issue kept happening. Changing between GCC 9 and GCC 11 also did not make the issue go away.

My environment is:

  • System: Ubuntu 20.04 (with gcc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0)
  • GPU: Geforce RTX 3090
  • NVIDIA Driver Version: 510.47.03
  • CUDA version (output of /usr/local/cuda/bin/nvcc --version):
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Thu_Feb_10_18:23:41_PST_2022
    Cuda compilation tools, release 11.6, V11.6.112
    Build cuda_11.6.r11.6/compiler.30978841_0
    

I built and ran the provided sample program as follows:

mkdir build
cd build
# Using CMake version 3.22.3 installed via snap:
# sudo snap install cmake --classic
/snap/cmake/current/bin/cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CUDA_ARCHITECTURES="86" ..
make
./testapp

Am I overlooking anything substantial here or is that actually some kind of compiler bug related to CUDA?
Thanks in advance for any help.

I went ahead and also created a bug report for this as described in the sticky thread in this forum (bug ID 3571216).

Also, for completeness of the description, this is the output of the example program when the bug occurs:

Before checks in lib.cu
After checks in lib.cu
Before checks in main
Stack trace:
3       0x557dc570c77e ./testapp(+0xb77e) [0x557dc570c77e]
2       0x7fbc2c7180b3 __libc_start_main + 243
1       0x557dc570c145 ./testapp(+0xb145) [0x557dc570c145]
0       0x7fbc2cbb75ae loguru::StreamLogger::~StreamLogger() + 126
2022-03-16 01:28:57.836 (   0.105s) [        2C5A0000]                main.cu:27    FATL| Cuda Error: __global__ function call is not configured
Aborted (core dumped)

And this is the expected output without the bug:

Before checks in lib.cu
After checks in lib.cu
Before checks in main
After checks in main