(error 98) due to "invalid device function" for a very simple templated kernel example

spmish · June 29, 2020, 8:31am

Hi,

I’ve got a type with a few template parameters to specialize its implementation for some different options. The code below compiles without error (MSVC 19.24.28314.0 and CUDA 11.0.16 on Windows 10),

#include <stdio.h>

enum class Shape { Triangle, Quadrilateral, Tetrahedron, Hexahedron };

template < Shape s, int p >
class Element;

template < int p >
struct Element < Shape::Triangle, p > {
  static constexpr int dofs = (p + 1) * (p + 2) / 2;
  int ids[dofs];
};

template < typename T > 
__global__
void gpu_kernel() {
  printf("gpu: %d\n", int(sizeof(T)));
}

template < typename T >
void cpu_kernel() {
  printf("cpu: %d\n", int(sizeof(T)));
}

int main() {

  // Element< Shape::Triangle, 2 > a;  <----
  cpu_kernel<Element<Shape::Triangle, 2>>();
  gpu_kernel<Element<Shape::Triangle, 2>><<<1,1>>>();

  return 0;

}

but produces unexpected output: the gpu line is missing (GTX 1080 Ti, built with compute_61,code=sm_61):

$ ./main.exe
cpu: 24

Running it through cuda-memcheck reveals an error:

$ cuda-memcheck.exe main.exe 
========= CUDA-MEMCHECK
cpu: 24
========= Program hit cudaErrorInvalidDeviceFunction (error 98) due to "invalid device function" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x8ff3e) [0x2ad53e]
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x928e3) [0x2afee3]
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll [0x86ebe]
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x113e8a) [0x33148a]
=========     Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x12c212) [0x349812]
========= ERROR SUMMARY: 1 error

However, if I uncomment the indicated line in main() (Element< Shape::Triangle, 2 > a), then everything works again:

$ cuda-memcheck.exe main.exe 
========= CUDA-MEMCHECK
cpu: 24
gpu: 24
========= ERROR SUMMARY: 0 errors

Is the call to gpu_kernel<Element<Shape::Triangle,2>>() not instantiating the kernel template? It seems to have something to do with the existence of a partial specialization on Element too.

RaulPPelaez · July 8, 2020, 4:20pm

Try placing a “cudaDeviceSynchronize();” just below the kernel launch.

See the bit about output flushing here:

If you start to print a lot of things from kernels you will probably face this other limitation:

Weird things happen when the printf buffer is filled. Just giving you a heads up because I have been there puzzled about missing output from printf.
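Besides synchronizing, it can help to check the status of the launch itself, since a failed launch prints nothing and is otherwise silent. A minimal sketch, assuming a plain CUDA runtime program; `cuda_ok` and `hello_kernel` are hypothetical names for illustration and do not come from this thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): report a CUDA runtime error, if any.
static bool cuda_ok(cudaError_t err, const char* what) {
  if (err == cudaSuccess) return true;
  fprintf(stderr, "%s failed: %s (%s)\n", what,
          cudaGetErrorName(err), cudaGetErrorString(err));
  return false;
}

// Placeholder kernel for illustration.
__global__ void hello_kernel() { printf("gpu: hello\n"); }

int main() {
  hello_kernel<<<1, 1>>>();
  // Launch-time failures such as "invalid device function" surface here.
  if (!cuda_ok(cudaGetLastError(), "kernel launch")) return 1;
  // Errors during execution surface here; this also flushes device-side printf.
  if (!cuda_ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;
  return 0;
}
```

With a check like this, a launch failure is reported by the program itself rather than only showing up under cuda-memcheck.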

spmish · July 8, 2020, 10:38pm

Raul,

Thanks for the input but I think you misunderstand my problem. The gpu kernel is never even launching, so the print buffer isn’t being filled. Adding cudaDeviceSynchronize() does not change the outcome.

To reiterate, the problem is that if main is defined as

int main() {
  cpu_kernel<Element<Shape::Triangle, 2>>();
  gpu_kernel<Element<Shape::Triangle, 2>><<<1,1>>>();
  cudaDeviceSynchronize();
}

The gpu kernel never executes at all, and cuda-memcheck reports:
“Program hit cudaErrorInvalidDeviceFunction (error 98) due to "invalid device function" on CUDA API call to cudaLaunchKernel.”
This error is usually explained by the wrong choice of compute capability, which is not the case here.

Confusingly, adding a single line to main(), which has nothing to do with the execution of the kernel, resolves the issue.

int main() {
  Element< Shape::Triangle, 2 > a; // <--- ?
  cpu_kernel<Element<Shape::Triangle, 2>>();
  gpu_kernel<Element<Shape::Triangle, 2>><<<1,1>>>();
  cudaDeviceSynchronize();
}

Is anyone able to reproduce this issue?

spmish · July 8, 2020, 11:48pm

Following up: it may be caused by the fact that the primary template declaration and its partial specialization use mismatched class-keys (“class” for the declaration, “struct” for the specialization). Making them both struct seems to fix the problem. Maybe the name mangling in NVCC is different for each, so it can’t find the right symbol at runtime?

The mismatched “struct” and “class” does not seem to have any impact when calling C++ template functions, but maybe the CUDA compiler is different!
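A minimal sketch of that workaround, assuming the only change needed is to declare the primary template with the same keyword as the specialization. The error check in main() is an addition that is not in the original program, and whether the mismatch is the actual root cause is not confirmed here:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

enum class Shape { Triangle, Quadrilateral, Tetrahedron, Hexahedron };

// Primary template declared with `struct`, matching the specialization below.
template <Shape s, int p>
struct Element;

template <int p>
struct Element<Shape::Triangle, p> {
  static constexpr int dofs = (p + 1) * (p + 2) / 2;
  int ids[dofs];
};

template <typename T>
__global__ void gpu_kernel() {
  printf("gpu: %d\n", int(sizeof(T)));
}

int main() {
  gpu_kernel<Element<Shape::Triangle, 2>><<<1, 1>>>();
  // Added for illustration: surface launch and execution errors explicitly.
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  return 0;
}
```

MSVC is known to encode the class-key in its decorated names, which would be consistent with the mangling guess above, but that is only a plausible explanation and not a verified diagnosis.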
