Possible bug with templated CUDA code in -deviceemu mode Templated CUDA code wouldn't link in -d

I skimmed the forums already, but I couldn’t find any other forum topics along these lines.

I have a templated CUDA kernel and matching templated function that does setup on the CPU and calls the templated kernel. I tried to add printf’s to my functions in emulation mode to track down a bug but I got some linker errors about the template functions being undefined. If I compile without the printf’s, it compiles and links correctly (regardless of the -deviceemu flag).

Here’s a simplified version of my code that produces the problem:

// templated functions

typedef struct gpu_adder
{
  inline __device__ __host__ static float do_operator(float a, float b) {return a - b;}
} gpu_adder;

template <typename operatorT>
__global__ void generic_kernel(float *dest, float* src, int offset)
{
  int pixel = blockIdx.x * blockDim.x + threadIdx.x;
  //  printf("test1"); // the compiler complains about generic_kernel if this call is here
  dest[pixel] = operatorT::do_operator(src[pixel], offset);
}

template <typename operatorT>
void generic_img(float *dest, float *value, float offset)
{
  // dummy block size/block number for testing purposes only
  generic_kernel<operatorT><<<1, 128>>>dest, value, offset);
}

// normal function for comparison
__global__ void subtract_kernel(float *dest, float *src, int offset)
{
  int pixel = blockIdx.x * blockDim.x + threadIdx.x;
  printf("test2"); // this printf is completely fine.
  dest[pixel] = src[pixel] - offset;
}

void subtract_img(float *dest, float *value, float offset)
{
  subtract_kernel<<<1, 128>>>(dest, value, offset);
}

void launch_kernels()
{
  float *src;
  float *dest;
  // allocate memory on the GPU and copy data over
  subtract_img(dest, src, 1.0);
  generic_img<gpu_adder>(dest, src, 1.0);
}

On Windows XP, with Visual Studio 2003 and CUDA 2.0 (because we haven’t upgraded to VS 2005 yet):

If I put a printf in generic_kernel I get the error:

tests\cuda\cuda-templates.cu(28) : error C2065: ‘_ZZ14generic_kernelI9gpu_adderEvPfS1_iEs’ : undeclared identifier

A printf in subtract_kernel compiles just fine.

On Mac I get the following (the function names differ because these errors come from my original code rather than my test code):

kernels.cu: In function ‘void _Z24generic_kernelI9gpu_adderEvPfiifS1_(float*, int, int, float, float*)’:

kernels.cu:41: error: ‘_ZZ24generic_kernelI9gpu_adderEvPfiifS1_Es’ was not declared in this scope

Any ideas what’s going wrong? My best guess is that the names for looking up the function during linking might not be consistent between where the function is defined and where the function is called? Aside from that, I’m at a loss.

Any help would be greatly appreciated.

You have a typo in your call to generic_kernel, but it works fine for me on 2.1. Templates are not officially supported in 2.0 and were known to be somewhat broken.

Support for printf (specifically, string literals) in templated kernels in device emulation mode has also been significantly improved in CUDA 2.2.
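For anyone who hits the same errors: the undeclared identifiers in both messages are mangled C++ names, and decoding them shows which symbol the compiler could not find. Under the Itanium C++ ABI, a name of the form _ZZ...Es refers to a string literal local to the function encoded between _ZZ and E. Below is a minimal host-side sketch for decoding such a name. It assumes a GCC or Clang toolchain (cxxabi.h is not available with MSVC), and the demangle helper is a hypothetical name for illustration, not part of CUDA.

```cpp
// Decode a mangled identifier such as the ones in the error messages above.
// Assumption: GCC or Clang, which provide abi::__cxa_demangle in <cxxabi.h>.
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of CUDA): returns the demangled form of
// `mangled`, or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* raw = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string result = (status == 0 && raw != nullptr) ? raw : mangled;
    std::free(raw);  // freeing a null pointer is a no-op
    return result;
}
```

Passing "_ZZ14generic_kernelI9gpu_adderEvPfS1_iEs" to this helper should give a description of a string literal inside generic_kernel<gpu_adder>(float*, float*, int); the exact wording depends on the toolchain. That fits the note above about string literals: the missing symbol is the printf format string, not the kernel itself.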