This falls slightly into the realm of a C++ question, but with CUDA bits!

I have a kernel which I want to be faster. It takes three input arguments, each ranging from 0 to 32 depending on earlier calculations, and an array that the kernel updates. If I make one of the input arguments a template parameter I get significant speedups, and templating the others may give similar gains. However… as far as I can tell the C++ language isn’t flexible enough to let me do this easily: each template argument has to be stated explicitly, i.e.:

if (a == 1 && b == 1) kernel<1, 1>
else if (a == 1 && b == 2) kernel<1, 2>

If I were to do this for each combination of two variables I would end up with 1024 lines of boring. Three variables would be insane. Now I imagine that three variables might actually cause a slowdown… and would take years to compile… but the sort of speedups which I may get could save days/weeks in the future.

So my question is this: is there any easy way of doing this? Some sort of loop would be lovely, but the compiler doesn’t appear to be clever enough to see that the values in the loop are constant and known at compile time… I contemplated some sort of macro but couldn’t get my head around how I’d do it.

Any ideas?

I use precisely this trick with a templated kernel that has 3 different parameters. It is a big speedup because it allows loops to be unrolled in the kernel body based on the template parameters, as well as eliminating some dead if-statements where applicable.
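To illustrate why the template parameters help, here is a minimal host-side sketch. The function accumulate and its parameters N and UseBias are made up for illustration and are not from my actual kernel; the same reasoning applies to a __global__ function.

```cpp
// Hypothetical stand-in for a kernel body. N and UseBias are template
// parameters, so the loop bound and the branch condition are compile-time
// constants in each instantiation.
template <int N, bool UseBias>
float accumulate(const float* x) {
    float sum = 0.0f;
    // Trip count is known at compile time, so the compiler is free to
    // unroll this loop completely.
    for (int i = 0; i < N; ++i) {
        sum += x[i];
    }
    // Resolved at compile time: when UseBias is false the compiler can
    // drop this branch entirely.
    if (UseBias) {
        sum += 1.0f;
    }
    return sum;
}
```

With runtime arguments instead, the compiler would have to keep the loop counter, the comparison, and the branch in the generated code.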

Unfortunately, the only suggestion I have for you here is to write a short Perl/Python script that generates your long chain of if-statements in a separate file, then #include that file right into your function body at the appropriate location. That way you avoid the error-prone task of cutting and pasting the selection block into existence.
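The generator can be written in any language. As a rough sketch of the idea, here it is as a small C++ function that builds the chain as text. The names kernel, a, b, grid, block and data in the emitted lines are placeholders, and generate_dispatch is a made-up helper, so adapt them to your own code.

```cpp
#include <sstream>
#include <string>

// Builds the if / else-if chain as text for every (a, b) pair in [lo, hi].
// The emitted kernel name and launch arguments are placeholders.
std::string generate_dispatch(int lo, int hi) {
    std::ostringstream out;
    bool first = true;
    for (int a = lo; a <= hi; ++a) {
        for (int b = lo; b <= hi; ++b) {
            out << (first ? "if" : "else if")
                << " (a == " << a << " && b == " << b << ") "
                << "kernel<" << a << ", " << b << "><<<grid, block>>>(data);\n";
            first = false;
        }
    }
    return out.str();
}
```

Write the returned string to a file (say dispatch.inc, a hypothetical name) as a build step and #include that file inside the host function that launches the kernel.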

Here’s an idea which works with the Driver API.

Create a templated kernel, explicitly instantiate functions with desired ranges with a dummy function, i.e.

void dummy() {
  for( int a1 = 0; a1 < 32; a1++ )
    for( int a2 = 0; a2 < 32; a2++ )
      for( int a3 = 0; a3 < 32; a3++ )

Then, on the host, create and initialize a 3-D array which will hold pointers to the instantiated functions. The trick here is to get the addresses of the compiled functions. If you look at the nvcc output or into the .cubin file, you’ll see that the values of a1, a2 and a3 are actually part of the mangled function name, so you can fill your array of function pointers for each a1, a2, a3. Now, instead of many ifs and switches, you can get the address of the required function with one table lookup.

Implementing this approach with the Runtime API is likely possible, but you’ll probably need two levels of templating: the first for the device functions and the second for the host-to-device stubs.
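A minimal host-only sketch of the table-lookup idea with templated stubs follows. It assumes a C++14 compiler (std::index_sequence), which is newer than the toolchains discussed in this thread. The names launch_stub, make_table, kTable, dispatch and kRange are invented for illustration, and the stub only records its template arguments where a real one would launch the kernel.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Assumed small range for illustration; the thread talks about values up to 32.
constexpr int kRange = 4;

// Host-side stub. In CUDA this is where kernel<A, B><<<grid, block>>>(...)
// would be launched; here it just writes A and B so the sketch runs on the host.
template <int A, int B>
void launch_stub(int* out) {
    out[0] = A;
    out[1] = B;
}

using StubPtr = void (*)(int*);

// Maps each flat index I in the pack to the pair (I / kRange, I % kRange)
// and takes the address of the matching stub instantiation.
template <std::size_t... I>
constexpr std::array<StubPtr, sizeof...(I)> make_table(std::index_sequence<I...>) {
    return {{ &launch_stub<static_cast<int>(I / kRange),
                           static_cast<int>(I % kRange)>... }};
}

// One entry per (a, b) combination, all instantiated at compile time.
static const std::array<StubPtr, kRange * kRange> kTable =
    make_table(std::make_index_sequence<kRange * kRange>{});

// Runtime dispatch is a single table lookup.
// The caller must ensure 0 <= a < kRange and 0 <= b < kRange.
void dispatch(int a, int b, int* out) {
    kTable[static_cast<std::size_t>(a) * kRange + b](out);
}
```

Calling dispatch(2, 3, out) runs launch_stub<2, 3>. Compile time grows with the number of table entries, so a third parameter at full range would mean tens of thousands of instantiations.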

My compiler doesn’t accept the above code to initialise a template. It gives:

error: expression must have a constant value

on the last line. It doesn’t seem to be unrolling the loops at compile time. I’m using 2.0 on linux. #pragma unroll doesn’t help at all.

I think I’ve found a way to force unrolling using recursive templates (from here: http://www.codeproject.com/KB/cpp/crc_meta.aspx). This seems to be producing the results that I’m looking for in my one-variable test case.
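For anyone following along, a rough sketch of what I mean for the one-variable case is below. It is not taken from the linked article. Dispatch and run_kernel are names I made up, and run_kernel is a host stand-in that returns N where the real code would launch kernel<N>.

```cpp
// Hypothetical stand-in for launching the templated kernel; it returns N
// so the dispatch can be checked on the host.
template <int N>
int run_kernel() {
    return N;
}

// Compares the runtime value against N, then recurses to N - 1.
// Instantiating Dispatch<32> therefore instantiates run_kernel<32>
// down to run_kernel<0>.
template <int N>
struct Dispatch {
    static int run(int a) {
        if (a == N) return run_kernel<N>();
        return Dispatch<N - 1>::run(a);
    }
};

// Terminating specialization: stops the recursion and reports that
// no value matched.
template <>
struct Dispatch<-1> {
    static int run(int) { return -1; }
};
```

Dispatch<32>::run(a) then picks the right instantiation for any a from 0 to 32 without writing the chain out by hand.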