(error 98) due to "invalid device function" for a very simple templated kernel example

Hi,

I’ve got a type with a few template parameters to specialize its implementation for some different options. The code below compiles without error (MSVC 19.24.28314.0 and CUDA 11.0.16 on Windows 10),

#include <stdio.h>

enum class Shape { Triangle, Quadrilateral, Tetrahedron, Hexahedron };

template < Shape s, int p >
class Element;

template < int p >
struct Element < Shape::Triangle, p > {
  static constexpr int dofs = (p + 1) * (p + 2) / 2;
  int ids[dofs];
};

template < typename T > 
__global__
void gpu_kernel() {
  printf("gpu: %d\n", int(sizeof(T)));
}

template < typename T >
void cpu_kernel() {
  printf("cpu: %d\n", int(sizeof(T)));
}

int main() {

  // Element< Shape::Triangle, 2 > a;  <----
  cpu_kernel<Element<Shape::Triangle, 2>>();
  gpu_kernel<Element<Shape::Triangle, 2>><<<1,1>>>();

  return 0;

}

but produces unusual output (gpu output missing, GTX 1080ti w/ compute_61,code=sm_61):

$ ./main.exe
cpu: 24

Running it through cuda-memcheck reveals an error:

$ cuda-memcheck.exe main.exe 
========= CUDA-MEMCHECK
cpu: 24
========= Program hit cudaErrorInvalidDeviceFunction (error 98) due to "invalid device function" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x8ff3e) [0x2ad53e]
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x928e3) [0x2afee3]
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll [0x86ebe]
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x113e8a) [0x33148a]
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x12c212) [0x349812]
========= ERROR SUMMARY: 1 error

However, if I uncomment the indicated line in main() (Element< Shape::Triangle, 2 > a), then everything works again:

$ cuda-memcheck.exe main.exe 
========= CUDA-MEMCHECK
cpu: 24
gpu: 24
========= ERROR SUMMARY: 0 errors

Is the call to gpu_kernel<Element<Shape::Triangle,2>>() not instantiating the kernel template? It seems to have something to do with the existence of a partial specialization on Element too.

Try placing a “cudaDeviceSynchronize();” just below the kernel launch.

See the bit about output flushing here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#limitations

If you start to print a lot of things from kernels you will probably face this other limitation:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#associated-host-side-api

Weird things happen when the printf buffer is filled. Just giving you a heads up because I have been there puzzled about missing output from printf.

Raul,

Thanks for the input but I think you misunderstand my problem. The gpu kernel is never even launching, so the print buffer isn’t being filled. Adding cudaDeviceSynchronize() does not change the outcome.

To reiterate, the problem is that if main is defined as

int main() {
  cpu_kernel<Element<Shape::Triangle, 2>>();
  gpu_kernel<Element<Shape::Triangle, 2>><<<1,1>>>();
  cudaDeviceSynchronize();
}

The gpu kernel never executes at all, citing the error :
“Program hit cudaErrorInvalidDeviceFunction (error 98) due to “invalid device function” on CUDA API call to cudaLaunchKernel.”, which is usually explained by the wrong choice of compute capability (which is not the case here).

Confusingly, adding a single line to main(), which has nothing to do with the execution of the kernel, resolves the issue.

int main() {
  Element< Shape::Triangle, 2 > a; // <--- ?
  cpu_kernel<Element<Shape::Triangle, 2>>();
  gpu_kernel<Element<Shape::Triangle, 2>><<<1,1>>>();
  cudaDeviceSynchronize();
}

Is anyone able to reproduce this issue?

Following up: it may be caused by the fact that the template definition and partial template specializations have a mismatched “class” and “struct”. Making them both struct seems to fix the problem, maybe the name mangling in NVCC is different for each, so it can’t find the right symbol at runtime?

The mismatched “struct” and “class” does not seem to have any impact when calling C++ template functions, but maybe the CUDA compiler is different!