I may have relied on CUB for so long that I have started to forget how device-wide reduction is implemented… I believe we first do a block reduction and then somehow add up the sums from all blocks. Since there is no efficient way to share data across blocks, do we launch multiple kernels, shrinking the array each launch, until it finally fits within a single CUDA block? I can only recall what happens in a block reduction: we first do warp reductions across the block and then use the first warp to add up the per-warp sums stored in shared memory. Maybe things are the same at the device level, with the only difference being that the sums are shared through an array in global memory across blocks?
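For reference, this is roughly the block reduction I am describing, as far as I can reconstruct it (a minimal sketch, not CUB's actual implementation; the helper names are my own, and it assumes a 1D block whose size is a multiple of 32 and at most 1024 threads):

```cuda
// Reduce within a warp using shuffles: after the loop, lane 0 holds the sum.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Block reduction: warp reductions first, then the first warp adds up the
// per-warp sums stored in shared memory. Thread 0 ends up with the total.
__inline__ __device__ float blockReduceSum(float val) {
    __shared__ float warpSums[32];          // one slot per warp (<= 1024/32)
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    val = warpReduceSum(val);               // reduce within each warp
    if (lane == 0) warpSums[warp] = val;    // lane 0 publishes the warp sum
    __syncthreads();

    // Only the first warp reduces the per-warp sums.
    int numWarps = blockDim.x / 32;
    val = (threadIdx.x < numWarps) ? warpSums[lane] : 0.0f;
    if (warp == 0) val = warpReduceSum(val);
    return val;                             // valid in thread 0
}
```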
Is it a summation (integer? float?) or a more complicated reduction?
It is also possible to use global atomics, either atomicAdd or CAS, to reduce across blocks without launching a new kernel.
Not saying that it would necessarily be faster in your case. It depends.
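Roughly what I mean, as a sketch (it reuses the blockReduceSum helper from the sketch above; `result` is assumed to be zeroed before the launch):

```cuda
// Single-kernel variant: each block reduces its chunk, then thread 0 of the
// block adds the partial sum into a global accumulator with atomicAdd, so no
// second kernel launch is needed.
__global__ void reduceAtomic(const float* in, float* result, int n) {
    float sum = 0.0f;
    // Grid-stride loop so any grid size covers the whole array.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    sum = blockReduceSum(sum);
    if (threadIdx.x == 0)
        atomicAdd(result, sum);   // one atomic per block, not per thread
}
```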
We can think of the reduction as arithmetic summation for simplicity. In my experience, atomicAdd on global memory can be very slow when it has to be emulated with a while loop of atomicCAS (as for double-precision addition on older architectures). The situation may be better in this case, since the number of blocks is relatively small compared to the number of threads.
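The retry loop I have in mind is the emulation pattern from the CUDA C++ Programming Guide, e.g. for double before native support arrived on sm_60:

```cuda
// CAS-based emulation of atomicAdd for double: each thread keeps retrying
// until no other thread has modified the value in between. On sm_60+ a
// native atomicAdd(double*, double) exists and no loop is needed.
__device__ double atomicAddDouble(double* address, double val) {
    unsigned long long* addr = (unsigned long long*)address;
    unsigned long long old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread changed the value
    return __longlong_as_double(old);
}
```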
On the other hand, storing partial results in a global-memory buffer and doing repeated rounds of block reduction may not be that costly either: each iteration of the while loop is just another kernel launch, and the same buffers can be reused across all the rounds. The loop stops once the remaining data size is at most the size of a single block. I am starting to recall the different ways to do it.
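One of those ways, sketched on the host side (reducePartial is a hypothetical kernel that writes one partial sum per block; error checking is omitted):

```cuda
#include <utility>

// Hypothetical kernel: reduces n input elements to one partial sum per block,
// e.g. by calling blockReduceSum and having thread 0 write out[blockIdx.x].
__global__ void reducePartial(const float* in, float* out, int n);

// Repeated rounds of block reduction, ping-ponging between two buffers.
// Returns a device pointer to the single total.
float* reduceRounds(float* d_in, float* d_tmp, int n, int blockSize) {
    while (n > blockSize) {
        int blocks = (n + blockSize - 1) / blockSize;
        reducePartial<<<blocks, blockSize>>>(d_in, d_tmp, n);
        std::swap(d_in, d_tmp);   // this round's output feeds the next round
        n = blocks;               // the array shrinks by ~blockSize each round
    }
    reducePartial<<<1, blockSize>>>(d_in, d_tmp, n);  // final round, one block
    return d_tmp;                 // d_tmp[0] holds the total
}
```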
If the operation (e.g. integer addition, minimum, maximum) can be executed by the small arithmetic units integrated into the memory hierarchy, no while loop like the one around atomicCAS is necessary.
Depending on whether the returned value is actually needed, the threads can even choose not to wait for it.
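For illustration (a hypothetical kernel, not from any library): integer add/min/max each map to a single native atomic instruction, and when the return value is ignored the thread can fire the atomic and move on.

```cuda
// Native integer atomics: no CAS retry loop is involved. Since none of the
// return values are used here, the threads do not need to wait for them.
__global__ void sumMinMax(const int* in, int n, int* sum, int* mn, int* mx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = in[i];
        atomicAdd(sum, v);  // return value unused: fire and forget
        atomicMin(mn, v);   // single native instruction
        atomicMax(mx, v);
    }
}
```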