Templates and CUDA

I was browsing a piece of code for reduction, and a device function was declared as

template <unsigned int BLOCKSIZE, class T> __device__ static void reduceOne(int tid, T *s_A);

How does this differ from:

template <class T> __device__ static void reduceOne(unsigned int BLOCKSIZE, int tid, T *s_A);

And why is the former preferred over the latter?

Edit: so the first one is resolved at compile time, and BLOCKSIZE must be a constant

Correct — BLOCKSIZE is known at compile time. If a loop's trip count depends on this value, the compiler can unroll the loop. Loop unrolling can improve performance, since the instructions that control the loop (the counter update and the branch) are removed.
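As a rough sketch of why this matters, here is a host-side (plain C++, not device code) simulation of a tree reduction with a compile-time block size. In the CUDA version each thread would execute one `tid`; here the inner loop stands in for the threads of a block. Because `BLOCKSIZE` is a template parameter, every `stride` value and trip count is known at compile time, so the compiler can unroll the whole thing. The function name `reduceOneHost` is made up for illustration.

```cpp
#include <cassert>

// Host-side sketch of a tree reduction with a compile-time block size.
// Since BLOCKSIZE is a template parameter, the outer loop's bounds are
// compile-time constants and the compiler can fully unroll it.
template <unsigned int BLOCKSIZE>
void reduceOneHost(int *s_A) {
    for (unsigned int stride = BLOCKSIZE / 2; stride > 0; stride /= 2)
        for (unsigned int tid = 0; tid < stride; ++tid)  // "threads" of the block
            s_A[tid] += s_A[tid + stride];
}
```

With the second (runtime-argument) signature, `stride` would depend on an ordinary function parameter and the compiler could not unroll the loop this way.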

Another important thing to remember: each call to reduceOne with a different BLOCKSIZE value generates a brand-new function, making your final executable larger. One disadvantage of templates is that every BLOCKSIZE must be a compile-time constant, so you need a separate instantiation (and call site) for each possible value of blocksize.
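The usual pattern for handling a block size only known at runtime is a switch that enumerates every supported value and dispatches to the matching instantiation. A minimal host-side sketch (hypothetical names; in real CUDA code each case would launch `kernel<N><<<grid, N>>>(...)`):

```cpp
#include <cassert>

// One machine-code copy of this function is generated per BLOCKSIZE
// value actually used -- the code-size cost mentioned above.
template <unsigned int BLOCKSIZE>
int reduceBlock(int *s_A) {
    for (unsigned int stride = BLOCKSIZE / 2; stride > 0; stride /= 2)
        for (unsigned int tid = 0; tid < stride; ++tid)
            s_A[tid] += s_A[tid + stride];
    return s_A[0];
}

// Runtime-to-compile-time dispatch: every supported size gets a case.
int reduceDispatch(unsigned int blocksize, int *s_A) {
    switch (blocksize) {
        case 8: return reduceBlock<8>(s_A);
        case 4: return reduceBlock<4>(s_A);
        case 2: return reduceBlock<2>(s_A);
        default: return -1;  // unsupported block size
    }
}
```

Each `case` pulls in a distinct instantiation, which is exactly where the executable growth comes from.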

Executable size is rarely a problem nowadays. A far greater annoyance is compile time. Separate compilation units and partial rebuilds help work around this: only rebuild your collection of kernels if they actually changed. If you are designing a library and want to provide a collection of kernel template instantiations, the preprocessor can help generate them for you.
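One way to let the preprocessor do that generation is an X-macro: list the shipped block sizes once, then expand the list both for the instantiations and for the dispatch switch. This is a sketch with made-up names and sizes, not a specific library's convention:

```cpp
#include <cassert>

// X-macro listing the block sizes the library ships. Changing the set
// of supported sizes means editing exactly this one line.
#define BLOCK_SIZES(X) X(64) X(128) X(256)

template <unsigned int BLOCKSIZE>
unsigned int kernelBlockSize() { return BLOCKSIZE; }  // stand-in for a kernel

// Expand one switch case per listed size; in real code each case would
// launch the kernel instantiated for that block size.
unsigned int launch(unsigned int blocksize) {
    switch (blocksize) {
#define CASE(N) case N: return kernelBlockSize<N>();
        BLOCK_SIZES(CASE)
#undef CASE
        default: return 0;  // unsupported size
    }
}
```

In a library split across compilation units, the same list can drive explicit template instantiations in the kernel's .cu file, so other translation units only ever see the declaration.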