Correct, BLOCKSIZE is known at compile time. Any loop whose trip count depends on this value can therefore be unrolled by the compiler. Loop unrolling can improve performance, since the instructions that control the loop (counter update, comparison, branch) are removed.
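To make the unrolling point concrete, here is a minimal host-side C++ sketch. The names sumBlock and sumRuntime are made up for illustration and are not taken from the reduceOne code under discussion; the same template mechanics apply to a `__global__` kernel. Whether the compiler actually unrolls depends on the compiler and optimization level.

```cpp
#include <cassert>

// Hypothetical example: BLOCKSIZE is a template parameter, so the trip
// count of the loop is a compile-time constant and the compiler is free
// to unroll it completely.
template <unsigned int BLOCKSIZE>
float sumBlock(const float* data) {
    assert(data != nullptr);
    float sum = 0.0f;
    for (unsigned int i = 0; i < BLOCKSIZE; ++i) {
        sum += data[i];
    }
    return sum;
}

// Same work with a run-time bound: the counter update, comparison, and
// branch generally stay in the generated code unless the compiler can
// prove the value of n.
float sumRuntime(const float* data, unsigned int n) {
    assert(data != nullptr);
    float sum = 0.0f;
    for (unsigned int i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Both functions return the same result for the same input; only the first gives the compiler the loop bound as a constant.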
Another important thing to remember: each call to reduceOne with a different BLOCKSIZE value generates a brand new function, which makes your final executable bigger. One disadvantage of templates is that every BLOCKSIZE value must be a compile-time constant, so you need a separate function call for each possible block size.
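A common way to deal with the "one call per block size" requirement is a switch that maps a run-time value onto the compile-time instantiations. The following is only a sketch in plain C++: reduceOneSketch and reduceDispatch are hypothetical names, the body is a stand-in for the real kernel, and the list of supported sizes is an assumption.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the templated kernel: sums BLOCKSIZE elements.
template <unsigned int BLOCKSIZE>
unsigned int reduceOneSketch(const unsigned int* data) {
    unsigned int sum = 0;
    for (unsigned int i = 0; i < BLOCKSIZE; ++i) {
        sum += data[i];
    }
    return sum;
}

// Each case below instantiates a separate function, which is where the
// extra code size comes from. Only the sizes listed here are supported.
unsigned int reduceDispatch(const unsigned int* data, unsigned int blockSize) {
    switch (blockSize) {
        case 512: return reduceOneSketch<512>(data);
        case 256: return reduceOneSketch<256>(data);
        case 128: return reduceOneSketch<128>(data);
        case 64:  return reduceOneSketch<64>(data);
        default:
            throw std::invalid_argument("unsupported block size");
    }
}
```

The caller can then pass a block size chosen at run time, as long as it is one of the instantiated values.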
Executable size is rarely a problem nowadays. A far greater annoyance is compile time. Separate compilation units and partial rebuilds help to work around this: only rebuild your collection of kernels if they actually changed. If you are designing a library and want to provide a collection of kernel template instantiations, the preprocessor can help to generate these kernels for you.
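One possible way to let the preprocessor generate the instantiations is an X-macro list combined with explicit template instantiation. This is a sketch, not a recommendation for a particular library layout: reduceKernelSketch, KERNEL_BLOCK_SIZES, and INSTANTIATE_REDUCE are invented names, and the function body is a plain C++ placeholder for the real kernel.

```cpp
// Hypothetical stand-in for the templated kernel: sums BLOCKSIZE elements.
template <unsigned int BLOCKSIZE>
unsigned int reduceKernelSketch(const unsigned int* data) {
    unsigned int sum = 0;
    for (unsigned int i = 0; i < BLOCKSIZE; ++i) {
        sum += data[i];
    }
    return sum;
}

// Single list of supported block sizes (an assumption for this sketch).
#define KERNEL_BLOCK_SIZES(X) X(64) X(128) X(256) X(512)

// Expands to one explicit instantiation per entry in the list, so the
// compiled object file contains all of these functions.
#define INSTANTIATE_REDUCE(N) \
    template unsigned int reduceKernelSketch<N>(const unsigned int*);

KERNEL_BLOCK_SIZES(INSTANTIATE_REDUCE)

#undef INSTANTIATE_REDUCE
```

If this lives in its own translation unit, it only needs to be recompiled when the kernel or the list of sizes changes, which fits the partial-rebuild approach described above.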