templated functions problem with CUDA

I have the following code (abstracted from the real thing). I can make it crash cudafe1 or give unresolved references, but can’t make it work.

// templated kernel
template <typename T>
__global__ void prim_kernel(float * dest, int n_elts,
                            float *a0, float *a1, T a2i);

__global__ void prim_kernel(float * dest, int n_elts,
                            float *a0, float *a1, float a2i)
{
    // (body omitted)
}

__global__ void prim_kernel(float * dest, int n_elts,
                            float *a0, float *a1, float *a2)
{
    // (body omitted)
}

// templated host side
template <typename T2>
void prim(float * d, float *a0, float *a1, T2 a2)
{
    int n_elts = 1000000;
    int block_size = 128;
    int n_blocks = (n_elts / block_size) + ((n_elts % block_size) ? 1 : 0);
    prim_kernel<<< n_blocks, block_size, 0>>>(d, n_elts, a0, a1, a2);
}

#ifdef USE_TEMPLATE // correct, but crashes cudafe1
template void prim(float *d, float *a0, float *a1, float *a2);
template void prim(float *d, float *a0, float *a1, float a2);
#else // gives unresolved externals because it generates non-templated versions
void prim(float *d, float *a0, float *a1, float *a2);
void prim(float *d, float *a0, float *a1, float a2);
#endif

int main()
{
    float f;
    prim(&f, &f, &f, &f);
    prim(&f, &f, &f, f);
}

If I compile it as is, it crashes cudafe1:

C:\tmp>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2007 NVIDIA Corporation
Built on Wed_Jul_16_12:57:50_PDT_2008
Cuda compilation tools, release 2.0, V0.2.1221

cudabug.cu(34): internal error: assertion failed: get_scope_for_routine_definition: scope is NULL (D:/Bld/rel/gpgpu/toolkit/r2.0/compiler/edg/EDG_3.9/src/c_gen_be.c, line 8887)

1 catastrophic error detected in the compilation of "C:\DOCUME~1\garyo\LOCALS~1
Compilation aborted.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application’s support team for more information.

--error 0x3 --


If I toggle the #define USE_TEMPLATE to turn off the template decls near the end, it compiles OK, but won’t link due to unresolved refs:

(just the tail of nvcc -v here, showing the invocation of cl to link):
#$ cl -D__CUDA_ARCH__=100 -nologo -Fe"a.exe" "C:\DOCUME~1\garyo\LOCALS~1\Temp/tmpxft_00001548_00000000-13_cudabug.obj" -link -INCREMENTAL:NO "/LIBPATH:C:\CUDA\bin/…/lib" cudart.lib
tmpxft_00001548_00000000-13_cudabug.obj : error LNK2019: unresolved external symbol "void __cdecl prim(float *,float *,float *,float)" (?prim@@YAXPAM00M@Z) referenced in function _main
tmpxft_00001548_00000000-13_cudabug.obj : error LNK2019: unresolved external symbol "void __cdecl prim(float *,float *,float *,float *)" (?prim@@YAXPAM000@Z) referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals

--error 0x2 --

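The latter error comes from the declarations themselves: without the “template” keyword they declare ordinary non-template overloads of prim, which main then calls and which are never defined (the mangled names in the linker output are those of plain functions). So I think the correct code is to use the “template” keyword anyway, and that’s the main problem: it crashes cudafe1. Is there any workaround? I need this kind of thing to work, and on a fairly large scale.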

– Gary
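
For readers following along, here is a minimal host-only C++ sketch of the difference between the two branches of the #ifdef. It is not the poster's code: the kernel launch and __global__ are left out so it builds with an ordinary C++ compiler, the parameter list is shortened, and the return value and demo_explicit_instantiation() helper are hypothetical additions for illustration.

```cpp
#include <cassert>
#include <string>

// Stand-in for the templated host-side prim(); returns a tag so the
// binding of each call can be checked.
template <typename T2>
std::string prim(float* d, T2 a2) {
    (void)d; (void)a2;
    return "template";
}

// Explicit instantiation definitions: the "template" keyword asks the
// compiler to emit code for these two argument types in this file.
template std::string prim<float*>(float* d, float* a2);
template std::string prim<float>(float* d, float a2);

// By contrast, a line without the keyword, such as
//     std::string prim(float* d, float a2);
// declares a separate non-template overload. Overload resolution prefers
// it to the template for matching calls, so it needs its own definition
// or the link fails with an unresolved external.

// Hypothetical helper: both calls bind to the template instantiations.
void demo_explicit_instantiation() {
    float f = 0.0f;
    assert(prim(&f, &f) == "template");  // T2 deduced as float*
    assert(prim(&f, f) == "template");   // T2 deduced as float
}
```

Un-commenting the plain declaration without adding a body should reproduce the same kind of unresolved-symbol error as in the linker output above.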

Hi Gary!

I think you must place

template <>

in front of the specializations. For example:

template <>
__global__ void prim_kernel(float * dest, int n_elts,
                            float *a0, float *a1, float a2i)

That way you can specialize for T=float and T=float*.

But I’m not sure if this will solve your problem.
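
As a rough illustration of what the template <> prefix does, here is a host-only C++ sketch. The function name which() and its integer return codes are hypothetical, and __global__ is omitted so it compiles without nvcc; it only shows how calls are routed once the definitions are explicit specializations.

```cpp
#include <cassert>

// Primary template: used for any T without a specialization.
template <typename T>
int which(float* dest, T a2) { (void)dest; (void)a2; return 0; }

// Explicit specialization for T = float.
template <>
int which<float>(float* dest, float a2) { (void)dest; (void)a2; return 1; }

// Explicit specialization for T = float*.
template <>
int which<float*>(float* dest, float* a2) { (void)dest; (void)a2; return 2; }

// Hypothetical helper exercising each path.
void demo_specializations() {
    float f = 0.0f;
    assert(which(&f, f) == 1);         // T deduced as float
    assert(which(&f, &f) == 2);        // T deduced as float*
    assert(which(&f, 7) == 0);         // T deduced as int: primary template
    assert(which<float>(&f, 7) == 1);  // explicit argument, 7 converts to float
}
```

A call that names the template argument, such as which<float>(...), can only reach the template and its specializations, never a plain non-template overload of the same name.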



Thanks for the advice, Navier-Stokes. But it makes no difference, cudafe1 still crashes in the same way. (Actually with your change it crashes cudafe1 whether I turn on or off the #define!)

Any other ideas anyone?

It looks like you’re using CUDA 2.0. The most recent version of CUDA, 2.2, features significantly improved support for C++ templates. Perhaps the latest version of cudafe will correctly compile your code.

yeah, doesn’t crash, but doesn’t compile either:

mjolnir:~ tim$ nvcc compiletest.cu
compiletest.cu: In function ‘void prim(float*, float*, float*, T2)’:
compiletest.cu:21: error: ‘prim_kernel’ was not declared in this scope
compiletest.cu:21: error: expected primary-expression before ‘>’ token
compiletest.cu:21: error: expected primary-expression before ‘)’ token

(on my mac, running 2.2)

this is a known problem and will be fixed in the 2.3 release.

a known workaround is to give the templated prototype an implementation (an empty body is enough), i.e.

template <typename T>
__global__ void prim_kernel(float * dest, int n_elts,
                            float *a0, float *a1, T a2i) {}
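
One thing to keep in mind with an empty generic body is sketched below in host-only C++. Everything here is an assumption for illustration: prim_body() is a hypothetical stand-in for the kernel, the arithmetic in the specializations is made up, and __global__ and the launch are omitted so it compiles without nvcc. The point is only that a type with no specialization silently falls through to the empty body.

```cpp
#include <cassert>

// Generic version with an empty body, as in the suggested workaround.
template <typename T>
void prim_body(float* dest, int n_elts, T a2) {
    (void)dest; (void)n_elts; (void)a2;  // intentionally does nothing
}

// T = float: add the same scalar to every element.
template <>
void prim_body<float>(float* dest, int n_elts, float a2) {
    for (int i = 0; i < n_elts; ++i) dest[i] += a2;
}

// T = float*: add the matching element of a second array.
template <>
void prim_body<float*>(float* dest, int n_elts, float* a2) {
    for (int i = 0; i < n_elts; ++i) dest[i] += a2[i];
}

// Hypothetical helper exercising all three paths.
void demo_empty_body() {
    float dest[2] = {1.0f, 2.0f};
    float other[2] = {10.0f, 20.0f};

    prim_body(dest, 2, 0.5f);   // float specialization
    assert(dest[0] == 1.5f && dest[1] == 2.5f);

    prim_body(dest, 2, other);  // T deduced as float*
    assert(dest[0] == 11.5f && dest[1] == 22.5f);

    prim_body(dest, 2, 3.0);    // double: empty generic body, no change
    assert(dest[0] == 11.5f && dest[1] == 22.5f);
}
```

If only float and float* are ever meant to be used, a call with any other type compiles cleanly and does nothing, which can be hard to spot.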