Kernel improvements?

I have modified the reduction SDK example to test whether CUDA is worth investigating for our research. Basically, I need to take an array of double-precision numbers, apply a function to each value (which involves reading some constants from constant memory), and then sum the results. I am seeing roughly a 5x speedup on one Tesla C1060 compared to our dual quad-core Xeon 5520 machine, which seems OK but perhaps not optimal. Profiling on a ~1,000,000-element array, my shared memory conflicts seem as low as I can get with double-precision numbers, and I have zero non-coalesced global memory loads. The kernel is below:


```
template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void
reduce6(T *g_idata, T *g_odata, unsigned int n)
{
    SharedMemory<T> smem;
    T *sdata = smem.getPointer();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    // read in parameters from constant memory
    // (d_params is declared elsewhere as __constant__ double d_params[2])
    double sigma = d_params[0];
    double mean = d_params[1];

    // we reduce multiple elements per thread. The number is determined by the
    // number of active thread blocks (via gridDim). More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        sdata[tid] -= LL(g_idata[i], sigma, mean);
        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            sdata[tid] -= LL(g_idata[i+blockSize], sigma, mean);
        i += gridSize;
    }
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32)
    {
        // warp-synchronous final steps; volatile stops the compiler from
        // caching shared-memory values in registers between steps
        volatile T *smem_v = sdata;
        if (blockSize >=  64) { smem_v[tid] += smem_v[tid + 32]; }
        if (blockSize >=  32) { smem_v[tid] += smem_v[tid + 16]; }
        if (blockSize >=  16) { smem_v[tid] += smem_v[tid +  8]; }
        if (blockSize >=   8) { smem_v[tid] += smem_v[tid +  4]; }
        if (blockSize >=   4) { smem_v[tid] += smem_v[tid +  2]; }
        if (blockSize >=   2) { smem_v[tid] += smem_v[tid +  1]; }
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```

where the only modifications I have made are calling my function LL as I read each value from global into shared memory, and reading the constant values sigma/mean into registers. Does anyone have any tips on improving this? Is there any way to calculate whether I am fully utilizing the chip, as this seems to be compute bound? Thanks a lot!

Look at your LL function.

Since sigma and mean are constants, you can precompute their effects by computing, say, A = 1.0/(2.0*sigma*sigma) and B = rsqrtTwoPi*(1.0/sigma), where rsqrtTwoPi = 1.0/sqrt(2.0*pi).

Your function then becomes:

```
__device__ double LL(double x, double sigma, double mean)
{
    return log(B * exp(-A*(x-mean)*(x-mean)));
}
```


But log XY = log X + log Y, and log(exp(x))=x.

So actually you have:

```
__device__ double LL(double x, double sigma, double mean)
{
    return log(B) - A*(x-mean)*(x-mean);
}
```


But we’re not done. Expand that out and you get a simple quadratic in x.

So you’re computing sum(A + B*x_i + C*x_i^2) for some new constant values of A, B and C.

The entire sum is then equal to n*A + B*(sum of x_i) + C*(sum of x_i^2).

So your computation only needs the sum of the array’s elements and the sum of the elements squared, weighted by those A, B, C constants.

This is a really minor amount of compute, and can be done on the CPU as fast as it can read from memory. The GPU could do the math faster, but the overhead of sending the data to the GPU will overshadow the compute speed savings.

Thanks for the help. I should have mentioned that the array data doesn’t change through my entire program, but my kernel gets called hundreds or thousands of times; each time, sigma/mean are the only two values that change, and only a single double needs to be transferred back to the host at the end of each kernel, so hopefully that is enough to minimize the effect of the overhead.