Best way to face this problem

Hi all,

I’m a CUDA newbie programmer. I’m writing here to ask you guys to share opinions and ideas.

First of all the goal:

to reduce a 100 x 256 x 4 floats vector to just a 4 floats one

A cpu-computed equivalent might be expressed like follows:

memset(output, 0, 4*sizeof(float));

for( int i = 0; i < 100 * 256 * 4; i++ ) {

		output[i%4] += input[i];

}

Do you guys think that the reduce6 version of the SDK examples might be deployed in this case? How to choose threads’ and blocks’ number? Shall be necessary to devolve some computation to the host or is this entirely possible within the GPU?

The kernel sign should be something like this:

reduceKernel(float* input, float* output, size_t n);

where n might assume values different from 100 * 256 * 4 (that 100 might be 200, 300 or even 900) in the above specifications (btw, not a power of two).

My first idea was to use something like this (in the following code I still have no ideas about how to dimension blocks and grid):

template <unsigned int blockSize>

	__global__ void

	reduce6(float *g_idata, float *g_odata, unsigned int n)

	{   

		extern __shared__ int __sdata[];

		// perform first level of reduction,

		// reading from global memory, writing to shared memory

		unsigned int tid = threadIdx.x;

		unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;

		unsigned int gridSize = blockSize*2*gridDim.x;

		

		T mySum = 0; 

	

		// we reduce multiple elements per thread.  The number is determined by the 

		// number of active thread blocks (via gridDim).  More blocks will result

		// in a larger gridSize and therefore fewer elements per thread

		while (i < n)

		{

			mySum += g_idata[i];

			// ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays

			if (i + blockSize < n)

				mySum += g_idata[i+blockSize];

			i += gridSize;

		} 

		

		// each thread puts its local sum into shared memory  

		sdata[tid] = mySum;

		__syncthreads();

	

		

		// do reduction in shared mem

		if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }

		if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }

		if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }

	

		if (tid < 32)

		{

			if (blockSize >=  64) { sdata[tid] = mySum = mySum + sdata[tid + 32];  }

			if (blockSize >=  32) { sdata[tid] = mySum = mySum + sdata[tid + 16];  }

			if (blockSize >=  16) { sdata[tid] = mySum = mySum + sdata[tid +  8];  }

			if (blockSize >=   8) { sdata[tid] = mySum = mySum + sdata[tid +  4];  }

		}

		

		// write result for this block to global mem 

		if (tid < 4)

			g_odata[blockIdx.x*4 + tid] = sdata[tid];

	}

and, after a DEVICETOHOST copy from g_odata to hostg_odata, letting the CPU make the last steps with something like:

for( int i = 0; i < blockSize*4; i++ )

{

	hostOutput[i%4] = hostg_odata[i];

}

Any better ideas? (probably every idea will be better than this).

Thanks to whoever will [try/be able] to help.

Bye!

memset(output, 0, 4*sizeof(float));

for( int i = 0; i < 100 * 256 * 4; i++ ) {

		output[i%4] += input[i];

}

is same as

memset(output, 0, 4*sizeof(float));

for( int i = 0; i < 100 * 256; i++ ) {

	output[i] += input[i] + input[i+1] + input[i+2] + input[i+3];

}

Right?

Its an embarrassingly parallel problem… No problem at all…

You dont need reduction… Just assign threads to each element of I.

Let threads in a block use the shared memory region to avoid repeated loads…

Thats all.

@Sarnath: your re-write is something completely different from the original code. For the original code, I think reduction is indeed the right way to go.

Thanks sagrailo for pointing it out.

@Sarnath: yes, your rewrite is not the same as the original code. You’re summing up 4 contiguous vector elements, whereas I’m taking one element out of four.

Still, my question remains. If the above code is the way to go, how to dimension grid and blocks and how to perform the last reduction step? (I need a 4 elements vector as output).

Btw: I’ve just noticed there is an error in the final step as I wrote it in the original post. It lacks the sum of elements, so:

for( int i = 0; i < blockSize*4; i++ )

{

	hostOutput[i%4] += hostg_odata[i];

}

Thanks!

O yeah…Sorry about that… I misread % as /

Reduction is the right way to go in that case