Hello,
In my project I encountered a strange issue where, after a call to a function returns that is implemented in a shared library, a call to cudaGetLastError()
will return cudaErrorMissingConfiguration
, while when called at the end of the function in the shared library, it returns no error. In both cases, I call cudaDeviceSynchronize()
before to ensure that all asynchronous work is waited for. I am also pretty sure that there is no C++ class destructor executing any other code at the end of the shared library function, and the program is single-threaded.
The above behavior seems to me like it should never happen, thus I am suspecting it to be caused by a bug in CUDA. I created a (relatively) small test program to reproduce the issue, which is attached to this post: cuda-bug-sample-program.tar.gz (5.6 MB). Unfortunately, I had to include loguru with the program as the bug did not occur without it, but as far as I can tell, loguru’s code is executed only after the bug occurs.
To give a short description of what the program does:
- The executable
testapp
’s main function is implemented inmain.cu
. It allocates two pieces of CUDA memory and then callsLibFunction()
. -
LibFunction()
is implemented inlib.cu
, which gets compiled into the shared librarytestlib
. It executescub::DeviceScan::ExclusiveSum()
on the two pieces of CUDA memory, checks for errors, and returns. The error checks in this function do not report anything. - After having called
LibFunction()
,main()
performs error checking itself. Here, suddenly the errorcudaErrorMissingConfiguration
is returned by theCUDA_CHECK_NO_ERROR()
macro (which internally callscudaGetLastError()
).
Here is how you can further convince yourself that this appears to be some kind of compiler bug:
- The file
unrelated.cu
contains a functionThisFunctionIsNotCalledAnywhere()
, which, true to its name, is not called anywhere. This file is compiled into thetestapp
executable. Ifunrelated.cu
is commented out or removed fromCMakeLists.txt
, then the program suddenly runs successfully, even though it shouldn’t affect its execution at all. (The functionThisFunctionIsNotCalledAnywhere()
contains a call tocub::DeviceScan::ExclusiveSum()
- I suspect that this in some weird way interferes with the call to the same function from the shared library). - If the file
cuda_util.h
is edited, replacing the two instances of loguru’sLOG(FATAL)
in this file withstd::cout
, then the program also finishes without errors, even though these lines should only be executed if an error has already occurred. - The issue happens only if
-DCMAKE_BUILD_TYPE=RelWithDebInfo
is used in the CMake call to configure the program. WithDebug
orRelease
, it does not occur. - valgrind does not show any error, and compute-sanitizer only shows the cudaErrorMissingConfiguration error that is also found by the program’s error checks.
Given that the issue seemingly only happens in very specific circumstances, I hope that somebody will still be able to reproduce this. I included the build folder including the compiled program in the attached archive for the case it helps anything.
In the specific circumstances where the issue occurs, it seems to happen very consistently however. With my full project code, I also tried downgrading CUDA to 11.5.2 and 11.4.4, and the issue kept happening. Changing between GCC 9 and GCC 11 also did not make the issue go away.
My environment is:
- System: Ubuntu 20.04 (with
gcc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
) - GPU: Geforce RTX 3090
- NVIDIA Driver Version: 510.47.03
- CUDA version (output of
/usr/local/cuda/bin/nvcc --version
):nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2022 NVIDIA Corporation Built on Thu_Feb_10_18:23:41_PST_2022 Cuda compilation tools, release 11.6, V11.6.112 Build cuda_11.6.r11.6/compiler.30978841_0
I built and ran the provided sample program as follows:
mkdir build
cd build
# Using CMake version 3.22.3 installed via snap:
# sudo snap install cmake --classic
/snap/cmake/current/bin/cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CUDA_ARCHITECTURES="86" ..
make
./testapp
Am I overlooking anything substantial here or is that actually some kind of compiler bug related to CUDA?
Thanks in advance for any help.