Quick Question on the Reduction Sample Code (CUDA SDK Reduction)


I am playing around with the reduction sample code from the SDK and also reading the white paper written by Harris. Here is one question about the "Complete Unrolling" version (Reduction #6, I believe). The SDK source uses a function template to bring in the block size. I don't understand this. I am sure there is some reason to do it, but wouldn't it be easier to just use blockDim.x instead of a function template? Could someone point this out for me?



I’m still pretty inexperienced in CUDA programming, but I also worked through this example.
On page 26 of Harris’s paper it says “Note: all code in RED will be evaluated at compile time,” and I think this is the advantage of using templates. Using blockDim.x would mean the if statements are evaluated at runtime instead of at compile time, which would slow your loop down.
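To see what "evaluated at compile time" means here, a plain host-side C++ analogy (not the actual device kernel, and the loops below stand in for what the warp's threads do in parallel) shows the idea: because blockSize is a template parameter, every `if (blockSize >= N)` below is a constant condition, so the compiler deletes the untaken branches entirely and leaves straight-line code.

```cpp
// Host-side sketch of the unrolled tail of the reduction, in the
// spirit of Harris's kernel. blockSize is a compile-time constant,
// so each "if" is folded away at compile time -- no runtime branch
// and no runtime read of a block-size variable survives.
template <unsigned int blockSize>
void reduceTree(float *sdata) {
    // Each step halves the active range; in the real kernel each
    // thread i performs one of these adds.
    if (blockSize >= 64) for (unsigned int i = 0; i < 32; ++i) sdata[i] += sdata[i + 32];
    if (blockSize >= 32) for (unsigned int i = 0; i < 16; ++i) sdata[i] += sdata[i + 16];
    if (blockSize >= 16) for (unsigned int i = 0; i <  8; ++i) sdata[i] += sdata[i +  8];
    if (blockSize >=  8) for (unsigned int i = 0; i <  4; ++i) sdata[i] += sdata[i +  4];
    if (blockSize >=  4) for (unsigned int i = 0; i <  2; ++i) sdata[i] += sdata[i +  2];
    if (blockSize >=  2) sdata[0] += sdata[1];
}
```

Instantiating `reduceTree<64>` compiles in all six steps; `reduceTree<32>` compiles in only five, with the first `if` removed as dead code.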

Though, it would be nice if someone more experienced than me could verify this.

The templating is a small performance optimization. blockDim is held in shared memory, so there are some cycles of latency introduced by reading it (Vasily Volkov’s analysis suggested it is on the order of 30 cycles on the G80; I don’t know whether that still holds for newer hardware). That is a bit more than the instruction pipeline latency, which is supposed to be 21 cycles IIRC. By templating blockDim to a constant, the authors of the code are attempting to win back those extra cycles by eliminating the shared memory broadcast read of blockDim at every stage of the reduction.

For what it is worth, my version of the reduction just uses blockDim directly from shared memory. It probably isn’t quite as fast as the templated version, but it remains truer to the idea of pure C for device code and I find it more readable. But each to his own…
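For contrast, a host-side sketch of that non-templated approach (assumed for illustration, not the poster's actual kernel) looks like this: the block size arrives as a runtime value, so the halving loop and its bound survive into the compiled code instead of being unrolled away.

```cpp
// Sketch of a reduction driven by a runtime block size, analogous to
// reading blockDim.x in the kernel. The loop count is not known at
// compile time, so the compiler keeps the loop and its branch.
void reduceTreeRuntime(float *sdata, unsigned int blockSize) {
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1)
        // The inner loop stands in for the block's threads, each of
        // which performs one add per step in the real kernel.
        for (unsigned int i = 0; i < s; ++i)
            sdata[i] += sdata[i + s];
}
```

This version is shorter and works for any power-of-two size without template instantiation, which is the readability trade-off being described.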

Using a template parameter means that the block size is known at compile time. Without knowing the block size at compile time, the compiler cannot fully unroll the reduction loop or strip out the untaken branches.
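And the block size being a template parameter does not mean it is fixed for the whole program: the SDK's host code picks among a fixed set of instantiations with a switch on the thread count. A minimal host-side sketch of that pattern (assumed names, not the SDK's exact code):

```cpp
// Each instantiation has a compile-time size, so the compiler can
// fully unroll this loop into straight-line code.
template <unsigned int blockSize>
float reduceFixed(const float *data) {
    float sum = 0.0f;
    for (unsigned int i = 0; i < blockSize; ++i)
        sum += data[i];
    return sum;
}

// The runtime block size selects a compile-time instantiation,
// mirroring the switch over the launch configuration in the SDK host
// code. Only a handful of power-of-two sizes need to be supported.
float reduceDispatch(const float *data, unsigned int threads) {
    switch (threads) {
        case 512: return reduceFixed<512>(data);
        case 256: return reduceFixed<256>(data);
        case 128: return reduceFixed<128>(data);
        default:  return 0.0f;  // size not supported in this sketch
    }
}
```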