Hello CUDA community,
I recently encountered `cudaErrorInvalidDeviceFunction` when launching a templated `__global__` function via the CUDA runtime. The function's declaration and definition are separate, and their signatures differ only in top-level `const` qualifiers on the parameters. This does not happen when the global function is not a template.
From my understanding of the C++ standard, the compiler should treat a declaration and a definition as the same function if the only difference is a top-level `const` on a parameter type, since top-level cv-qualifiers are ignored when determining the function type. I could not find anything in the CUDA documentation that explicitly says this works differently from the C++ standard.
This can be demonstrated with a simple prototype program of around 50 lines of code. Define the macro `MAIN_ENABLE_BUG` during compilation to reproduce the error; otherwise the program works as intended. You may need to run the Nsight debugger to catch the correct error, because the error-checking macro may report a stale error code due to the asynchronous nature of CUDA kernel launches.
```cpp
#include <iostream>

#include <cuda_runtime.h>
#include <device_launch_parameters.h>

using namespace std;

#ifdef MAIN_ENABLE_BUG
template<typename T>
__global__ static void addNumber(T*, size_t);
#else//MAIN_ENABLE_BUG
template<typename T>
__global__ static void addNumber(T* const, const size_t);
#endif//MAIN_ENABLE_BUG

constexpr static size_t DataCount = 32u;
__device__ static unsigned int Data[DataCount];

#define CHECK_ERROR(ERR) do { if((ERR) != cudaSuccess) {\
    cerr << "CUDA Error: " << cudaGetErrorString(ERR) << endl;\
    std::terminate();\
}} while(false)

__host__ int main() {
    CHECK_ERROR(cudaFree(0));

    unsigned int* data_dev;
    CHECK_ERROR(cudaGetSymbolAddress(reinterpret_cast<void**>(&data_dev), Data));
    CHECK_ERROR(cudaMemset(data_dev, 0x00, DataCount * sizeof(unsigned int)));

    addNumber<<<1, 32>>>(data_dev, DataCount);
    CHECK_ERROR(cudaDeviceSynchronize());
    CHECK_ERROR(cudaGetLastError());

    // check result
    unsigned int DataHost[DataCount];
    CHECK_ERROR(cudaMemcpy(DataHost, data_dev, DataCount * sizeof(unsigned int), cudaMemcpyDeviceToHost));
    for (unsigned int i = 0u; i < DataCount; i++) {
        cout << DataHost[i] << ' ';
        if ((i + 1u) % 8u == 0u) {
            cout << endl;
        }
    }
}

template<typename T>
__global__ void addNumber(T* const vec, const size_t count) {
    const unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= count) {
        return;
    }
    vec[index]++;
}
```
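As an aside on the error-reporting caveat mentioned above: launch-time errors such as `cudaErrorInvalidDeviceFunction` are reported by `cudaGetLastError()` immediately after the `<<<...>>>` launch statement, so a check ordered like the following sketch (a fragment reusing the `CHECK_ERROR` macro from the repro, not a standalone program) should catch them without waiting for a synchronize:

```cpp
// Sketch: query the launch error synchronously, right after the launch,
// before any later asynchronous error can replace it.
addNumber<<<1, 32>>>(data_dev, DataCount);
CHECK_ERROR(cudaGetLastError());      // launch errors are reported here
CHECK_ERROR(cudaDeviceSynchronize()); // errors from kernel execution itself
```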
Testing
I used `cuobjdump` to inspect the disassembly and `cu++filt` to demangle the name; the compiler generates the symbol `void addNumber<unsigned int>(T1 *, unsigned long long)` as expected.
When I run the bugged build under the Nsight debugger, it shows the function signature as `__wrapper__device_stub_addNumber<unsigned int>(unsigned int* const& vec, const unsigned __int64& count)` when the kernel is launched.
- Tested in both Debug and Release modes.
- It doesn't matter whether RDC (relocatable device code) is turned on or not.
- Code generation option is `compute_75,sm_75`.
- All other settings are unchanged.
System information
- CUDA runtime: 11.8
- CUDA driver version: 522.25
- OS: Windows 10 Pro 21H2
- Compiler: MSVC 16.11.21 (Visual Studio 2019)
- GPU: RTX 2080