Sum reduction working in Fermi, Kepler and Maxwell

carcamovski · February 1, 2016, 2:43am

Doesn't the sum reduction code to which I posted the link (Jimmy Pettersson’s code) work across multiple architectures?

I have yet to see an implementation which beats that approach, and that code was posted 3 years ago I believe.

Maybe it does. But I didn’t understand what the variables NbXGroups, tail and start_adr are, or why he uses threads in a two-dimensional way.

OK, the way that code works is as follows:

Based on the size of the array of values to be summed, a determination is made regarding the number of strided values (elements in the input array) each thread in a block will load. In this case, for Kepler/Maxwell, a block uses 256 threads, 64 in the x dimension and 4 in the y dimension, so NbxGroups refers to the number of x thread blocks of size 64.
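To make the 64 x 4 layout concrete, here is a minimal host-side C++ sketch of how a 2D thread coordinate can be flattened and used for strided loads. The names kThreadsX, kThreadsY, linear_thread_id and strided_thread_sum are hypothetical and are not taken from Jimmy Pettersson's code; the stride of one full block (256) is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constants mirroring the 64 x 4 layout described above.
constexpr int kThreadsX = 64;
constexpr int kThreadsY = 4;
constexpr int kThreadsPerBlock = kThreadsX * kThreadsY;  // 256

// Flatten a 2D thread coordinate (tx, ty) into a linear id in [0, 256).
inline int linear_thread_id(int tx, int ty) {
    assert(tx >= 0 && tx < kThreadsX);
    assert(ty >= 0 && ty < kThreadsY);
    return ty * kThreadsX + tx;
}

// Sum the elements one thread would visit inside a chunk if it starts at
// its linear id and strides by the block size (an assumed access pattern).
inline float strided_thread_sum(const float* chunk, std::size_t chunk_len,
                                int tx, int ty) {
    float acc = 0.0f;
    for (std::size_t i = static_cast<std::size_t>(linear_thread_id(tx, ty));
         i < chunk_len; i += kThreadsPerBlock) {
        acc += chunk[i];
    }
    return acc;
}
```

With this pattern, neighbouring threads in x touch neighbouring addresses on every iteration, which is what keeps the loads coalesced on the device.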

So each thread block in the first launch will cover some multiple of the value in:
const int blockSize1 = 16384;
But that means that if the array size is not a multiple of that value, there will be ‘extra’ values which need to be considered, and that is how the ‘tail’ value is used.
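As a rough illustration of that arithmetic, the sketch below splits an input length into full chunks of blockSize1 elements plus a leftover tail. ChunkSplit and split_input are hypothetical names, and the sketch assumes one chunk per first-launch block; the original code may cover a larger multiple per block.

```cpp
#include <cassert>
#include <cstddef>

// Value quoted in the post above.
constexpr std::size_t kBlockSize1 = 16384;

// Hypothetical helper type: how an input of n elements is divided.
struct ChunkSplit {
    std::size_t full_blocks;  // chunks fully covered by the first launch
    std::size_t tail;         // leftover elements handled separately
};

// Split n elements into whole chunks plus a tail (assumes chunk > 0).
inline ChunkSplit split_input(std::size_t n, std::size_t chunk = kBlockSize1) {
    assert(chunk > 0);
    return ChunkSplit{n / chunk, n % chunk};
}
```

For example, 100000 elements give 6 full chunks (98304 elements) and a tail of 1696.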

So basically each thread block launched ends up with a sub-sum for all the elements that block examined, which is saved in a global array (holding one value per thread block launched).

That is what the first launch accomplishes, and then there is the second launch.

The second launch will then examine the ‘remainder’ values and sum them within the block. Once that is done, each thread in that second launch will go through a fraction of the block sums saved during the first launch and sum/reduce them within that final ‘dynamic’ thread block.

When that is all synchronized, the final sum is saved by thread #0 to the first element of the block-sum array. That answer is then copied back to the host, and the reduction is done.
if(threadIdx.x == 0)
		out[blockIdx.x] = smem[0][threadIdx.x]; // out[0] == ans
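The two launches can be emulated sequentially on the CPU to check the logic. This is only a sketch under assumptions: two_pass_sum is a hypothetical name, each first-launch block is assumed to own exactly one chunk, and accumulation is done in double for simplicity, which is not necessarily what the original kernels do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the two-launch scheme described above.
inline double two_pass_sum(const std::vector<float>& in,
                           std::size_t chunk = 16384) {
    assert(chunk > 0);
    const std::size_t full_blocks = in.size() / chunk;

    // "First launch": one partial sum per block, stored in a global array.
    std::vector<double> block_sums(full_blocks, 0.0);
    for (std::size_t b = 0; b < full_blocks; ++b) {
        for (std::size_t i = 0; i < chunk; ++i) {
            block_sums[b] += in[b * chunk + i];
        }
    }

    // "Second launch": sum the tail first, then fold in the block sums.
    double total = 0.0;
    for (std::size_t i = full_blocks * chunk; i < in.size(); ++i) {
        total += in[i];
    }
    for (double s : block_sums) {
        total += s;
    }
    return total;  // corresponds to out[0] in the snippet above
}
```

On the device the inner loops run in parallel across threads, but the data flow (per-block partial sums, then a single final block) is the same.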
The reason this implementation tends to work better than the ‘canonical’ implementation is that (in my experience) GPUs tend to like to be ‘oversubscribed’ when it comes to memory operations, as long as it maintains a good level of occupancy and the operations are made in a coalesced fashion. So rather than trying to figure out the number of SMs on the GPU, and the threads possible per SM, it just floods the device with small groups of work.

The code is commented rather well so I think if you look through it you will be able to follow.

But then, according to this:

Based on the size of the array of values to be summed, a determination is made regarding the number of strided values (elements in the input array) each thread in a block will load. In this case, for Kepler/Maxwell, a block uses 256 threads, 64 in the x dimension and 4 in the y dimension, so NbxGroups refers to the number of x thread blocks of size 64.

Does that mean the code is static and won’t work on Fermi?
