InvalidDeviceFunction error when launching templated global function

Hello CUDA community,

I recently experience cudaErrorInvalidDeviceFunction when launching a template global function using CUDA runtime, whose declaration and definition are separate and their function signature differs in the const-declared arguments. This does not happen when the global function is not a template.

From my understanding, based on C++ specification, that compiler should consider declaration and definition of a function the same if the only difference is having const in the outermost-level parameter type. I am not able to find any relevant description on CUDA documentation that explicitly says this works differently from C++ standard.

This can be demonstrated with a simple prototype program in around 50 line-of-code. Define a macro MAIN_ENABLE_BUG during compilation to see the error, otherwise it works as intended. You may need to run Nsight debugger to catch the correct error, the error macro may return a wrong error code due to the asynchronous nature of CUDA.

#include <iostream>
#include <cuda_runtime.h>

#include <device_launch_parameters.h>

using namespace std;

template<typename T>
__global__ static void addNumber(T*, size_t);
template<typename T>
__global__ static void addNumber(T* const, const size_t);

constexpr static size_t DataCount = 32u;
__device__ static unsigned int Data[DataCount];

#define CHECK_ERROR(ERR) do { if(ERR != cudaSuccess){\
	cerr << "CUDA Error: " << cudaGetErrorString(ERR) << endl;\
}} while(false)

__host__ int main() {

	unsigned int* data_dev;
	CHECK_ERROR(cudaGetSymbolAddress(reinterpret_cast<void**>(&data_dev), Data));
	CHECK_ERROR(cudaMemset(data_dev, 0x00, DataCount * sizeof(unsigned int)));

	addNumber<<<1, 32>>>(data_dev, DataCount);

	//check result
	unsigned int DataHost[DataCount];
	CHECK_ERROR(cudaMemcpy(DataHost, data_dev, DataCount * sizeof(unsigned int), cudaMemcpyDeviceToHost));
	for (unsigned int i = 0u; i < DataCount; i++) {
		cout << DataHost[i] << ' ';
		if ((i + 1u) % 8u == 0u) {
			cout << endl;

template<typename T>
__global__ void addNumber(T* const vec, const size_t count) {
	const unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
	if (index >= count) {



I tried cuobjdump to see the disassembly and cu++filt to de-mangle the name, the compiler seems to generate the name void addNumber<unsigned int>(T1 *, unsigned long long) as expected.

When I try running the bugged code on Nsight debugger, it shows the function signature as __wrapper__device_stub_addNumber<unsigned int>(unsinged int* const& vec, const unsigned __int64& count) when it is launched.

  • Tested on both Debug and Release mode.
  • It doesn’t matter whether RDC is turned on or not.
  • Code generation option is compute_75,sm_75.
  • All other settings are unchanged.

System information

  • CUDA runtime: 11.8
  • CUDA driver version: 522.25
  • OS: Windows 10 Pro 21H2
  • Compiler: MSVC 16.11.21 (Visual Studio 2019)
  • GPU: RTX 2080

I suggest filing a bug.

1 Like