Hi all,
I’m a CUDA newbie, and I’m writing to ask you to share opinions and ideas.
First of all, the goal:
to reduce a 100 x 256 x 4 float array to just 4 floats
A CPU-computed equivalent might look like this:
memset(output, 0, 4*sizeof(float));
for( int i = 0; i < 100 * 256 * 4; i++ ) {
output[i%4] += input[i];
}
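Just to pin down the semantics: since element i contributes to output[i % 4], this is really four independent sums, one per channel of 4-float records. A plain C++ sketch of the same thing, written record-by-record (the function name is mine, just for illustration):

```cpp
#include <cstddef>

// Reduce n floats (n a multiple of 4) into four per-channel sums.
// Equivalent to: output[i % 4] += input[i] for every i.
void reduce_channels(const float* input, std::size_t n, float output[4])
{
    for (int c = 0; c < 4; ++c) output[c] = 0.0f;
    // View the buffer as n/4 records of 4 floats and sum each channel
    for (std::size_t r = 0; r < n / 4; ++r)
        for (int c = 0; c < 4; ++c)
            output[c] += input[4 * r + c];
}
```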
Do you guys think the reduce6 version from the SDK examples can be deployed in this case? How should the numbers of threads and blocks be chosen? Is it necessary to devolve some computation to the host, or is this entirely possible within the GPU?
The kernel signature should be something like this:
reduceKernel(float* input, float* output, size_t n);
where n may take values other than 100 * 256 * 4 (the 100 might be 200, 300 or even 900), and is in any case not a power of two.
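To make the dimensioning question concrete, here is the kind of host-side arithmetic I imagine; note that n need not be a power of two, only the block size does (it’s a template parameter). The 256-thread block and the 64-block cap are arbitrary guesses of mine, not anything mandated by the SDK:

```cpp
#include <cstddef>

// Pick a launch configuration for a grid-stride, reduce6-style kernel.
// threads must be a power of two (it becomes the blockSize template
// parameter); each thread reads 2 elements per grid-stride step, so the
// block count is capped and the kernel's while loop absorbs any leftover n.
void pick_launch(std::size_t n, unsigned& threads, unsigned& blocks,
                 std::size_t& sharedBytes)
{
    threads = 256;                                    // power of two
    std::size_t elemsPerBlock = 2u * threads;         // first add during load
    blocks = (unsigned)((n + elemsPerBlock - 1) / elemsPerBlock);
    if (blocks > 64) blocks = 64;                     // arbitrary cap
    if (blocks < 1) blocks = 1;
    sharedBytes = threads * sizeof(float);            // one float per thread
}
```

The launch itself would then look like reduce6<256><<<blocks, threads, sharedBytes>>>(d_in, d_out, n), with g_odata sized blocks*4 floats, if I understand the scheme correctly.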
My first idea was to use something like the following (I’m still unsure how best to dimension the blocks and the grid):
template <unsigned int blockSize>
__global__ void
reduce6(float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];  // must be float, and match the name used below
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    float mySum = 0;
    // we reduce multiple elements per thread. The number is determined by the
    // number of active thread blocks (via gridDim). More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += g_idata[i];
        // ensure we don't read out of bounds -- this is optimized away for power-of-two sized arrays
        if (i + blockSize < n)
            mySum += g_idata[i+blockSize];
        i += gridSize;
    }
    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();
    // do reduction in shared mem; every stride below is a multiple of 4,
    // so each thread only ever adds values from its own i%4 channel
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }
    if (tid < 32)
    {
        // within a warp: access shared memory through a volatile pointer
        // so the compiler doesn't cache the values in registers
        volatile float *smem = sdata;
        if (blockSize >= 64) { smem[tid] = mySum = mySum + smem[tid + 32]; }
        if (blockSize >= 32) { smem[tid] = mySum = mySum + smem[tid + 16]; }
        if (blockSize >= 16) { smem[tid] = mySum = mySum + smem[tid +  8]; }
        if (blockSize >=  8) { smem[tid] = mySum = mySum + smem[tid +  4]; }
        // stop at a stride of 4: threads 0..3 now hold the four channel sums
    }
    // write this block's four partial channel sums to global mem
    if (tid < 4)
        g_odata[blockIdx.x*4 + tid] = sdata[tid];
}
and, after a device-to-host copy from g_odata to hostg_odata (4 floats per block, so numBlocks*4 floats in total), letting the CPU take the last steps with something like:
memset(hostOutput, 0, 4*sizeof(float));
for( int i = 0; i < numBlocks*4; i++ )
{
    hostOutput[i%4] += hostg_odata[i];
}
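As a sanity check of the channel-alignment argument (every stride in the kernel is a multiple of 4, so thread tid only ever accumulates elements with index ≡ tid mod 4), here is a plain C++ simulation of the two stages. The record-to-block assignment below is not the kernel’s actual grid-stride pattern, but any partition that keeps whole 4-float records together gives the same totals:

```cpp
#include <cstddef>
#include <vector>

// Stage 1 (simulated): each "block" produces 4 per-channel partial sums
// over the records assigned to it, mimicking the kernel's g_odata writes.
// Stage 2: the host folds the blocks*4 partials into 4 channel totals.
std::vector<float> two_stage_reduce(const std::vector<float>& in,
                                    unsigned blocks)
{
    std::vector<float> partials(blocks * 4, 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned b = (unsigned)((i / 4) % blocks);   // block owning this record
        partials[b * 4 + (i % 4)] += in[i];
    }
    std::vector<float> out(4, 0.0f);
    for (std::size_t i = 0; i < partials.size(); ++i)
        out[i % 4] += partials[i];
    return out;
}
```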
Any better ideas? (Probably any idea will be better than this one.)
Thanks to anyone who will try (and manage) to help.
Bye!