Ok, the way that code works is as follows;
Based on the size of the array with values to be summed, a determination is made regarding the number of strided values (elements in the input array) each thread in a block will load. In this case for Kepler/Maxwell 256 threads, 64 in the x dimension and 4 in the y dimension, so NbxGroups refers the number of x thread blocks of size 64.
So each thread block in the first launch will cover some multiple of the value in ;
const int blockSize1 = 16384;
but that means that if the array size is not a multiple of that value there will be ‘extra’ values which need to be considered, and that is how the ‘tail’ value is used.
So basically each thread block launched ends up with a sub-sum value for all the elements that block examined which is saved in a global array(of size one value per thread block launched).
That is what the first launch accomplishes, and then there is the second launch.
The second launch will then examine the ‘remainder’ values and sub them within the block. Once that is done then each thread in that second launch will go through a fraction of the values from the block sums saved during the first launch and sum/reduce with that final ‘dynamic’ thread block.
When that is all synchronized the final sum is saved to memory in the first value in the block-sum array by thread #0 . That answer is copied back to the host and done.
if(threadIdx.x == 0)
out[blockIdx.x] = smem[0][threadIdx.x]; // out[0] == ans
The reason this implementation tends to work better than the ‘canonical’ implementation is that (in my experience) GPUs tend to like to be ‘oversubscribed’ when it comes to memory operations, as long as it maintains a good level of occupancy and the operations are made in a coalesced fashion. So rather than trying to figure out the number of SMs on the GPU, and the threads possible per SM, it just floods the device with small groups of work.
The code is commented rather well so I think if you look through it you will be able to follow.
But then according to
Based on the size of the array with values to be summed, a determination is made regarding the number of strided values (elements in the input array) each thread in a block will load. In this case for Kepler/Maxwell 256 threads, 64 in the x dimension and 4 in the y dimension, so NbxGroups refers the number of x thread blocks of size 64.
The code would be static and won’t work on Fermi?