NVIDIA SDK Example on Reduction Reduction Ver 1: from the Whitepaper on Reductions

cirus · July 22, 2009, 7:44pm

As specified in subject, I am referring to reduction codes given in the whitepaper on Reductions in NVIDIA’s CUDA SDK Example Set.

The first version on reduction at shared memory using Interleaved addressing, works well if number of elements are all contained in one block but fails if there are more than one block.

To illustrate and make it clear:

I took a vector containing 4 elements <0…4> and blockSize =4 => numBlocks = 1 => Worked fine …expected result is 10.

When I take vector of length 5 and blockSize as 8 it still works fine.

But when I take vector size as 9 or 10 then kernel gives me result <28,17> as the final result => Intermediate result.There is still another step required

to complete the sum as 28+17.

__global__ void reduce0(float *g_idata, float *g_odata, unsigned int n)

{

   extern __shared__ float sdata[];

// each thread loads one element from global to shared mem

	unsigned int tid = threadIdx.x;

	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

	sdata[tid] = g_idata[i];

	__syncthreads();

	// do reduction in shared mem

	for(unsigned int s=1; s < blockDim.x; s *= 2)

	{

		if (tid % (2*s) == 0) 

		{

			sdata[tid] += sdata[tid + s];

		}

		__syncthreads();

	}

	// write result for this block to global mem

	if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

and in my main function I am calling it as :–

void main()

{

...

while(nBlocks > 0)

  {

	   reduce0<<<nBlocks, blockSize>>>(a_d,result,N);   //call function

	N /= 2;							  

	nBlocks /= 2;	//num of blocks reduced by half.

		

  }

..

}

Being new to CUDA, I know there must be some trivial error that I may be making. Please help me correct.Thanks.

cirus · July 23, 2009, 2:04am

Seem to have solved it but iff I take one block only.i.e single block containing all elements.

SPWorley · July 23, 2009, 4:36am

The reduction example you copied from assumes equal numbers of threads and data which must be powers of two.

The actual failure is you’re reading past the end of your data when the blocksize isn’t a power of two.

So for example when blocksize is 5, look at the for loop when s=1.

For the fifth and last thread, tid=4, starts reading data at sdata[4+1], which is outside your data array.

Sure it may work sometimes, but by reading undefined memory you’re in for undefined results.

It can be modified to work with something like

if (tid % (2*s) == 0 && tid+s<blockDim.x)

		{

			sdata[tid] += sdata[tid + s];

		}

Notice we just check for going to far and avoid the reference when we do.

Of course this is not the best reduction strategy at all, but you’re studying that great tutorial and that’s a great reason to ask questions, even quick ones like this.

cirus · July 23, 2009, 3:24pm

Will try your strategy. I understood that there is some out of bonds referencing. How I solved my problem was to allocate shared memory equal to number of elements being processed.
that is I did: reduce0<<<numBlocks, blockSize, N * sizeof(float) >>>

I did not take care of power of two concept here.

But again, as I said, if I have 15 elements instead of 16 and take block size as 8 then there are two blocks. Each block is computing the sum correctly on elements falling in its range.

Thus for N = 15 and blockSize =8 , numBlocks =2 I get two outputs. On discussion with my colleague and on reading about shared memory , the reason seems obvious for above symptom, because, shared memory can not be common to different blocks.

My colleague suggested that after I get two outputs, I re-insert them to above function as a single array carrying these two outputs, so that next time only 1 block is processed.

This may seem fine but for large arrays there is a bottleneck associated with date transfers. Just thoughts… Thank you very much for your reply.

Topic		Replies	Views
Reduction & block dimension Using the easiest reduction example of the SDK CUDA Programming and Performance	6	2286	November 23, 2009
reduction6 kernel from CUDA SDK not working correctly CUDA Programming and Performance	2	1871	August 10, 2010
Reduction Problem CUDA Programming and Performance	5	4800	October 13, 2010
Parallel reduction problem CUDA Programming and Performance	1	5119	November 29, 2010
Reduction questions(newbie-ish) CUDA Programming and Performance	7	1869	January 14, 2009
Simple (honest!) change to parallel reduction example yields bizarre result? CUDA Programming and Performance	1	2472	December 26, 2011
Vector Reduction CUDA Programming and Performance	3	19800	March 9, 2011
Problem with reduction CUDA Programming and Performance	4	2638	March 22, 2011
reduction example for other than powers of 2 CUDA Programming and Performance	10	1874	October 6, 2010
Reduction from SDK CUDA Programming and Performance	2	11639	March 12, 2009

NVIDIA SDK Example on Reduction Reduction Ver 1: from the Whitepaper on Reductions

Related topics