Kernel improvements?

I have modified the reduction SDK example to test whether CUDA is worth investigating for our research. Basically, I need to take an array of double-precision numbers, apply a function to each value (which involves reading some constants from constant memory), and then sum the results. I am seeing roughly a 5x speedup on one Tesla C1060 compared to our dual quad-core Xeon 5520 machine, which seems OK but perhaps not optimal. Profiling on a ~1,000,000-element array, my shared memory conflicts seem as low as I can get with double-precision numbers, and I have zero non-coalesced global memory loads. The kernel is below:


```
template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void
reduce6(T *g_idata, T *g_odata, unsigned int n)
{
    SharedMemory<T> smem;
    T *sdata = smem.getPointer();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    // read in parameters from constant memory
    // (d_params is declared elsewhere as __constant__ double d_params[2])
    double sigma = d_params[0];
    double mean = d_params[1];

    // we reduce multiple elements per thread. The number is determined by the
    // number of active thread blocks (via gridDim). More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        sdata[tid] -= LL(g_idata[i], sigma, mean);
        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            sdata[tid] -= LL(g_idata[i+blockSize], sigma, mean);
        i += gridSize;
    }
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32)
    {
        // warp-synchronous final steps; volatile stops the compiler from
        // caching shared-memory values in registers between steps
        volatile T *smem_v = sdata;
        if (blockSize >=  64) { smem_v[tid] += smem_v[tid + 32]; }
        if (blockSize >=  32) { smem_v[tid] += smem_v[tid + 16]; }
        if (blockSize >=  16) { smem_v[tid] += smem_v[tid +  8]; }
        if (blockSize >=   8) { smem_v[tid] += smem_v[tid +  4]; }
        if (blockSize >=   4) { smem_v[tid] += smem_v[tid +  2]; }
        if (blockSize >=   2) { smem_v[tid] += smem_v[tid +  1]; }
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```

where the only modifications I have made are calling my function LL as I read each value from global into shared memory, and reading the constant values sigma/mean into registers. Does anyone have any tips on improving this? Is there any way to calculate whether I am fully utilizing the chip, as this seems to be compute bound? Thanks a lot!

Look at your LL function.

Since sigma and mean are constants, you can precompute their effects by computing, say, A = 1.0/(2.0*sigma*sigma) and B = rsqrtTwoPi*(1.0/sigma), where rsqrtTwoPi = 1.0/sqrt(2.0*pi).

Your function then becomes:

```
__device__ double LL(double x, double sigma, double mean)
{
    return log(B * exp(-A*(x-mean)*(x-mean)));
}
```


But log XY = log X + log Y, and log(exp(x))=x.

So actually you have:

```
__device__ double LL(double x, double sigma, double mean)
{
    return log(B) - A*(x-mean)*(x-mean);
}
```


But we’re not done. Expand that out and you get a simple quadratic in x.

So you’re computing sum(A + B*x_i + C*x_i^2) for some new constant values of A, B and C.

The entire sum is then equal to n*A + B*(sum of x_i) + C*(sum of x_i^2).

So your computation only needs the sum of the array’s elements and the sum of the elements squared, weighted by those A, B, C constants.

This is a really minor amount of compute, and can be done on the CPU as fast as it can read from memory. The GPU could do the math faster, but the overhead of sending the data to the GPU will overshadow the compute speed savings.

Thanks for the help. I should have mentioned that the array data doesn’t change through my entire program, but my kernel gets called hundreds or thousands of times; each time, sigma/mean are the only two values that change, and only a single double needs to be transferred back to the host at the end of each kernel, so hopefully that is enough to minimize the effect of the overhead.