The reduction example in the CUDA SDK has 7 different kernels (kernel 0 through kernel 6), each with different optimizations. However, I noticed the following in the reduction.cu file:
// execute the initial kernel
reduce<T>(n, numThreads, numBlocks, whichKernel, d_idata, d_odata);
// sum partial block sums on GPU
int s = numBlocks;
int kernel = (whichKernel == 6) ? 5 : whichKernel;
...
If you specify kernel 6, the initial pass of the reduction runs on kernel 6, but the remaining passes, which sum the partial block sums, run on kernel 5. Why is this? Thanks
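To make the behavior I am asking about concrete, here is a minimal host-only sketch of the kernel selection as I understand it. The function `kernelsPerPass` and the `numPasses` parameter are hypothetical names of mine, not part of the SDK. Only the ternary expression is taken from the snippet above, and the code elided by `...` is not reproduced.

```cpp
#include <vector>

// Hypothetical helper (not from the SDK): lists which kernel index would be
// used on each pass, assuming the first pass uses the requested kernel and
// every later pass uses the remapped one.
std::vector<int> kernelsPerPass(int whichKernel, int numPasses) {
    std::vector<int> used;
    if (numPasses <= 0) {
        return used;  // nothing to run
    }

    // Initial pass: the requested kernel is used as-is.
    used.push_back(whichKernel);

    // Remapping copied from reduction.cu: kernel 6 becomes kernel 5.
    int kernel = (whichKernel == 6) ? 5 : whichKernel;

    // Remaining passes (summing the partial block sums) use the remapped kernel.
    for (int pass = 1; pass < numPasses; ++pass) {
        used.push_back(kernel);
    }
    return used;
}
```

Under these assumptions, requesting kernel 6 with three passes yields the sequence 6, 5, 5, while any other kernel index is used unchanged on every pass.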