Hi,
I’ve got a class template whose parameters select between specialized implementations. The code below compiles without error (MSVC 19.24.28314.0 and CUDA 11.0.16 on Windows 10),
#include <stdio.h>

enum class Shape { Triangle, Quadrilateral, Tetrahedron, Hexahedron };

template < Shape s, int p >
class Element;

template < int p >
struct Element < Shape::Triangle, p > {
    static constexpr int dofs = (p + 1) * (p + 2) / 2;
    int ids[dofs];
};

template < typename T >
__global__
void gpu_kernel() {
    printf("gpu: %d\n", int(sizeof(T)));
}

template < typename T >
void cpu_kernel() {
    printf("cpu: %d\n", int(sizeof(T)));
}

int main() {
    // Element< Shape::Triangle, 2 > a; <----
    cpu_kernel<Element<Shape::Triangle, 2>>();
    gpu_kernel<Element<Shape::Triangle, 2>><<<1,1>>>();
    return 0;
}
but the GPU output is missing when I run it (GTX 1080 Ti, built with compute_61,code=sm_61):
$ ./main.exe
cpu: 24
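For reference, the 24 printed on the CPU side follows directly from the specialization: with p = 2, dofs = (2 + 1) * (2 + 2) / 2 = 6, and six ints take 24 bytes when int is 4 bytes (as it is with MSVC). The host-only sketch below checks that at compile time. The alias `Tri2` and the helper `size_in_bytes` are hypothetical names added for illustration, and the sketch uses `struct` for both the declaration and the specialization, whereas the posted code declares the primary template with `class`.

```cpp
#include <cstddef>

enum class Shape { Triangle, Quadrilateral, Tetrahedron, Hexahedron };

// Primary template: declared but never defined, as in the post.
template <Shape s, int p>
struct Element;

// Partial specialization for triangles of polynomial order p.
template <int p>
struct Element<Shape::Triangle, p> {
    static constexpr int dofs = (p + 1) * (p + 2) / 2;
    int ids[dofs];
};

// Hypothetical alias for the type used in the post.
using Tri2 = Element<Shape::Triangle, 2>;

// sizeof needs a complete type, so these checks implicitly
// instantiate the partial specialization on the host.
static_assert(Tri2::dofs == 6, "order-2 triangle should have 6 dofs");
static_assert(sizeof(Tri2) == 6 * sizeof(int), "six ints, no padding expected");

// Hypothetical helper mirroring what cpu_kernel prints.
template <typename T>
constexpr std::size_t size_in_bytes() { return sizeof(T); }
```

Calling `size_in_bytes<Tri2>()` gives the same value that `cpu_kernel` prints, so the host side has no trouble completing the type without the commented-out variable.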
Running it through cuda-memcheck reveals an error:
$ cuda-memcheck.exe main.exe
========= CUDA-MEMCHECK
cpu: 24
========= Program hit cudaErrorInvalidDeviceFunction (error 98) due to "invalid device function" on CUDA API call to cudaLaunchKernel.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x8ff3e) [0x2ad53e]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x928e3) [0x2afee3]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll [0x86ebe]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x113e8a) [0x33148a]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nv_dispui.inf_amd64_5ae9cabd19b3b3c7\nvcuda64.dll (cuProfilerStop + 0x12c212) [0x349812]
========= ERROR SUMMARY: 1 error
However, if I uncomment the indicated line in main() (Element< Shape::Triangle, 2 > a), then everything works again:
$ cuda-memcheck.exe main.exe
========= CUDA-MEMCHECK
cpu: 24
gpu: 24
========= ERROR SUMMARY: 0 errors
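The failing run only shows its error under cuda-memcheck because the posted main() never inspects the launch status. A minimal sketch of checking it in the program itself, assuming only the standard CUDA runtime calls cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString; the `check` helper is a hypothetical name, and the kernel is instantiated with `int` here just to keep the example short:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print a readable message and exit on any CUDA error.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

template <typename T>
__global__ void gpu_kernel() {
    printf("gpu: %d\n", int(sizeof(T)));
}

int main() {
    gpu_kernel<int><<<1, 1>>>();
    // The <<<...>>> syntax discards the launch status; fetch it explicitly.
    check(cudaGetLastError(), "kernel launch");
    // Wait for the kernel to finish; this also flushes device-side printf.
    check(cudaDeviceSynchronize(), "kernel execution");
    return 0;
}
```

With a check like this after the launch, an "invalid device function" failure would be reported by the program directly instead of the GPU line silently going missing.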
Is the call to gpu_kernel<Element<Shape::Triangle,2>>() not instantiating the kernel template? It also seems to have something to do with the partial specialization of Element.