NVIDIA SDK Example on Reduction: Ver 1 from the Whitepaper on Reductions

As specified in the subject, I am referring to the reduction code given in the whitepaper on Reductions in NVIDIA's CUDA SDK example set.

The first version of the reduction, using shared memory with interleaved addressing, works well if all the elements fit in one block, but fails when there is more than one block.

To illustrate and make it clear:

I took a vector containing 4 elements <0…4> with blockSize = 4, so numBlocks = 1. This worked fine; the expected result of 10 came out.

When I take a vector of length 5 and blockSize = 8, it still works fine.

But when I take the vector size as 9 or 10, the kernel gives me <28, 17> as the final result. That is only an intermediate result: another step is still required to complete the sum as 28 + 17.

__global__ void reduce0(float *g_idata, float *g_odata, unsigned int n)
{
	extern __shared__ float sdata[];

	// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	sdata[tid] = g_idata[i];
	__syncthreads();

	// do reduction in shared mem
	for (unsigned int s = 1; s < blockDim.x; s *= 2)
	{
		if (tid % (2*s) == 0)
		{
			sdata[tid] += sdata[tid + s];
		}
		__syncthreads();
	}

	// write result for this block to global mem
	if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}


and in my main function I am calling it as:

int main()
{
	// ... allocation and initialization ...
	while (nBlocks > 0)
	{
		reduce0<<<nBlocks, blockSize>>>(a_d, result, N);   // call kernel
		N /= 2;
		nBlocks /= 2;   // number of blocks reduced by half
	}
	// ...
}

Being new to CUDA, I know there must be some trivial error that I am making. Please help me correct it. Thanks.

I seem to have solved it, but only if I take one block, i.e. a single block containing all the elements.

The reduction example you copied assumes equal numbers of threads and data elements, and that this count is a power of two.

The actual failure is that you're reading past the end of your data when the block size isn't a power of two.

So, for example, when the block size is 5, look at the for loop when s = 1.

The fifth and last thread, tid = 4, starts reading data at sdata[4+1], which is outside your data array.

Sure, it may work sometimes, but by reading undefined memory you're in for undefined results.

It can be modified to work with something like:

	if (tid % (2*s) == 0 && tid + s < blockDim.x)
	{
		sdata[tid] += sdata[tid + s];
	}


Notice we just check for going too far and skip the reference when we would.

Of course this is not the best reduction strategy at all, but you're studying that great tutorial, and that's a great reason to ask questions, even quick ones like this.

Will try your strategy. I understood that there is some out-of-bounds referencing. The way I solved my problem was to allocate shared memory equal to the number of elements being processed,
that is, I did: reduce0<<<numBlocks, blockSize, N * sizeof(float)>>>

I did not take care of the power-of-two concept here.

But again, as I said, if I have 15 elements instead of 16 and take the block size as 8, there are two blocks. Each block computes the sum correctly on the elements falling in its range.

Thus for N = 15, blockSize = 8, numBlocks = 2, I get two outputs. After discussing with my colleague and reading about shared memory, the reason for the above symptom seems obvious: shared memory cannot be shared between different blocks.

My colleague suggested that after I get the two outputs, I feed them back into the same kernel as a single array, so that the next time only one block is processed.

This may seem fine, but for large arrays there is a bottleneck associated with data transfers. Just thoughts… Thank you very much for your reply.