Strange bug with __CUDA_ARCH__ and kernel template implicit instantiation

Hello guys,

I ran into something strange that I think is a bug. Here is a minimal example:

#include <cstdio>

template <typename T>
__global__ void kernel(T* array) {}

template<typename T>
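class Array {
 public:
  T* data = nullptr;
  Array(size_t n) {
    #ifndef __CUDA_ARCH__
    // Works if any other undefined macro is tested instead, e.g.:
    // #ifndef __XXXUISTISTUISTUSI__
    // Kernel launch that implicitly instantiates kernel<T>;
    // this is the cudaLaunchKernel call that fails.
    kernel<<<1, 1>>>(data);
    #endif
  }
};

int main() {
  Array<int> arr(1);
  return 0;
}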
The error reported by cuda-memcheck is: Program hit cudaErrorInvalidDeviceFunction (error 98) due to "invalid device function" on CUDA API call to cudaLaunchKernel.

If you change the #ifndef to test any macro other than __CUDA_ARCH__ (even one that is never defined), it works.

This minimal example is contrived, but in my real code I use __CUDA_ARCH__ to distinguish between host and device code in __host__ __device__ functions.

Other info:

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
nvidia-smi
NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3

If you think it’s a bug too, I’ll report it.