Hi all,
I’m a CUDA newbie, and I’m writing to ask you to share opinions and ideas.
First of all, the goal:
to reduce a 100 x 256 x 4 float array to just 4 floats
A CPU-computed equivalent might look like this:
memset(output, 0, 4*sizeof(float));
for( int i = 0; i < 100 * 256 * 4; i++ ) {
output[i%4] += input[i];
}
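Just to pin down the semantics: since element i contributes to output[i % 4], this is really four independent sums, one per channel of 4-float records. A plain C++ sketch of the same thing, written record-by-record (the function name is mine, just for illustration):

```cpp
#include <cstddef>

// Reduce n floats (n a multiple of 4) into four per-channel sums.
// Equivalent to: output[i % 4] += input[i] for every i.
void reduce_channels(const float* input, std::size_t n, float output[4])
{
    for (int c = 0; c < 4; ++c) output[c] = 0.0f;
    // View the buffer as n/4 records of 4 floats and sum each channel
    for (std::size_t r = 0; r < n / 4; ++r)
        for (int c = 0; c < 4; ++c)
            output[c] += input[4 * r + c];
}
```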
Do you guys think the reduce6 version from the SDK examples can be deployed in this case? How should the numbers of threads and blocks be chosen? Is it necessary to devolve some computation to the host, or is this entirely possible within the GPU?
The kernel signature should be something like this:
reduceKernel(float* input, float* output, size_t n);
where n may take values other than 100 * 256 * 4 (the 100 might be 200, 300 or even 900), and is in any case not a power of two.
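To make the dimensioning question concrete, here is the kind of host-side arithmetic I imagine; note that n need not be a power of two, only the block size does (it’s a template parameter). The 256-thread block and the 64-block cap are arbitrary guesses of mine, not anything mandated by the SDK:

```cpp
#include <cstddef>

// Pick a launch configuration for a grid-stride, reduce6-style kernel.
// threads must be a power of two (it becomes the blockSize template
// parameter); each thread reads 2 elements per grid-stride step, so the
// block count is capped and the kernel's while loop absorbs any leftover n.
void pick_launch(std::size_t n, unsigned& threads, unsigned& blocks,
                 std::size_t& sharedBytes)
{
    threads = 256;                                    // power of two
    std::size_t elemsPerBlock = 2u * threads;         // first add during load
    blocks = (unsigned)((n + elemsPerBlock - 1) / elemsPerBlock);
    if (blocks > 64) blocks = 64;                     // arbitrary cap
    if (blocks < 1) blocks = 1;
    sharedBytes = threads * sizeof(float);            // one float per thread
}
```

The launch itself would then look like reduce6<256><<<blocks, threads, sharedBytes>>>(d_in, d_out, n), with g_odata sized blocks*4 floats, if I understand the scheme correctly.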
My first idea was to use something like the following (I’m still unsure how best to dimension the blocks and the grid):
template <unsigned int blockSize>
__global__ void
reduce6(float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];  // must be float, and match the name used below
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    float mySum = 0;
    // we reduce multiple elements per thread. The number is determined by the
    // number of active thread blocks (via gridDim). More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += g_idata[i];
        // ensure we don't read out of bounds -- this is optimized away for power-of-two sized arrays
        if (i + blockSize < n)
            mySum += g_idata[i+blockSize];
        i += gridSize;
    }
    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();
    // do reduction in shared mem; every stride below is a multiple of 4,
    // so each thread only ever adds values from its own i%4 channel
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }
    if (tid < 32)
    {
        // within a warp: access shared memory through a volatile pointer
        // so the compiler doesn't cache the values in registers
        volatile float *smem = sdata;
        if (blockSize >= 64) { smem[tid] = mySum = mySum + smem[tid + 32]; }
        if (blockSize >= 32) { smem[tid] = mySum = mySum + smem[tid + 16]; }
        if (blockSize >= 16) { smem[tid] = mySum = mySum + smem[tid +  8]; }
        if (blockSize >=  8) { smem[tid] = mySum = mySum + smem[tid +  4]; }
        // stop at a stride of 4: threads 0..3 now hold the four channel sums
    }
    // write this block's four partial channel sums to global mem
    if (tid < 4)
        g_odata[blockIdx.x*4 + tid] = sdata[tid];
}
and, after a device-to-host copy from g_odata to hostg_odata (4 floats per block, so numBlocks*4 floats in total), letting the CPU take the last steps with something like:
memset(hostOutput, 0, 4*sizeof(float));
for( int i = 0; i < numBlocks*4; i++ )
{
    hostOutput[i%4] += hostg_odata[i];
}
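As a sanity check of the channel-alignment argument (every stride in the kernel is a multiple of 4, so thread tid only ever accumulates elements with index ≡ tid mod 4), here is a plain C++ simulation of the two stages. The record-to-block assignment below is not the kernel’s actual grid-stride pattern, but any partition that keeps whole 4-float records together gives the same totals:

```cpp
#include <cstddef>
#include <vector>

// Stage 1 (simulated): each "block" produces 4 per-channel partial sums
// over the records assigned to it, mimicking the kernel's g_odata writes.
// Stage 2: the host folds the blocks*4 partials into 4 channel totals.
std::vector<float> two_stage_reduce(const std::vector<float>& in,
                                    unsigned blocks)
{
    std::vector<float> partials(blocks * 4, 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned b = (unsigned)((i / 4) % blocks);   // block owning this record
        partials[b * 4 + (i % 4)] += in[i];
    }
    std::vector<float> out(4, 0.0f);
    for (std::size_t i = 0; i < partials.size(); ++i)
        out[i % 4] += partials[i];
    return out;
}
```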
Any better ideas? (Probably any idea will be better than this one.)
Thanks to anyone who will try (and manage) to help.
Bye!