Reduction kernel for Fermi

I need to sum a lot of samples in a long array. Searching for reduction turns up this optimized kernel by Nvidia, but it is optimized for the 8800GT. How should it be re-optimized for Fermi?

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    // grid-stride loop: each thread accumulates several elements into shared memory
    sdata[tid] = 0;
    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i + blockSize]; i += gridSize; }
    __syncthreads();

    // tree reduction in shared memory, unrolled for the compile-time block size
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    // last warp: relies on warp-synchronous execution, so no __syncthreads() between steps
    if (tid < 32) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
        if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
        if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
        if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
    }

    // thread 0 writes this block's partial sum
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Well, given that the maximum number of threads per block on compute capability 2.0 is 1024, the algorithm needs to be changed to account for the larger block size. Other than that, I'm not sure how much more the reduction algorithm can be optimized. EDIT: never mind about the cache - you would still need __syncthreads() anyway to make sure you are getting the right values on subsequent iterations.
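
Concretely, supporting 1024-thread blocks would just mean adding one more unrolled step at the top of the shared-memory reduction (a sketch following the same pattern as the existing code; the dynamic shared memory size at launch has to grow accordingly):

    // extra first step for blockSize == 1024 (compute capability 2.0)
    if (blockSize >= 1024) { if (tid < 512) { sdata[tid] += sdata[tid + 512]; } __syncthreads(); }
    // ... then the existing blockSize >= 512 / 256 / 128 steps as before ...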

The kernel only works for power of 2 arrays, great…

Not true. Look at the code: it accumulates partial sums of an arbitrary-length array into shared memory, then does an in-shared-memory reduction of those partial sums. It only works for a power-of-two number of threads per block, but the input data can be any length.

This is stated at the beginning of the code:

/*
    Parallel sum reduction using shared memory
    - takes log(n) steps for n input elements
    - uses n threads
    - only works for power-of-2 arrays
*/

I guess it's rather easy to make it work for an arbitrary length. In my case I have 53824 values: first I reduce 16384 values three times, then 2048 values twice, then 512 values once, and finally 64 values once, and add the results together (3*16384 + 2*2048 + 512 + 64 = 53824).
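
An alternative that avoids splitting into power-of-2 chunks is to guard the second read in the accumulation loop, so a single launch handles any n. This is just a sketch of the idea, not the SDK code:

    // hypothetical variant of the load loop: the second read is bounds-checked,
    // so n no longer has to be a multiple of 2*blockSize
    sdata[tid] = 0;
    while (i < n) {
        sdata[tid] += g_idata[i];
        if (i + blockSize < n)
            sdata[tid] += g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();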

Why don't you just use CUDPP's or Thrust's implementation?

Where can I find it?

Always a good place to start…
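
For reference, with Thrust (bundled with recent CUDA toolkits, and also available from its project page) the whole reduction collapses to a single call. A minimal sketch, assuming the samples are already in a device vector and using placeholder data:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    int main()
    {
        // 53824 samples, filled with 1s here just as a placeholder
        thrust::device_vector<int> d_data(53824, 1);

        // one library call replaces the hand-written kernel plus the final host-side sum
        int total = thrust::reduce(d_data.begin(), d_data.end(), 0, thrust::plus<int>());

        return total == 53824 ? 0 : 1;
    }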