I have modified the reduction SDK example to test whether CUDA is worth investigating for our research. Basically, what I need to do is take an array of double-precision numbers, apply a function to each value (which involves reading some constants from constant memory), and then sum up the results. I am getting about a 5x speedup on one Tesla C1060 compared to our dual quad-core Xeon 5520 machine, which seems OK but perhaps not optimal. When I run the profiler on a ~1,000,000-element array, my shared memory conflicts seem as low as I can get them with double-precision numbers, and I have zero non-coalesced global memory loads. The kernel is below:

[codebox]
template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void
reduce6(T *g_idata, T *g_odata, unsigned int n)
{
    SharedMemory<T> smem;
    T *sdata = smem.getPointer();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    // read in parameters from constant memory
    double sigma = d_params[0];
    double mean  = d_params[1];

    // we reduce multiple elements per thread. The number is determined by the
    // number of active thread blocks (via gridDim). More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        sdata[tid] -= LL(g_idata[i], sigma, mean);
        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            sdata[tid] -= LL(g_idata[i+blockSize], sigma, mean);
        i += gridSize;
    }
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32)
    {
        // warp-synchronous phase: volatile stops the compiler from
        // caching the shared-memory values in registers between steps
        volatile T *smem_v = sdata;
        if (blockSize >= 64) { smem_v[tid] += smem_v[tid + 32]; }
        if (blockSize >= 32) { smem_v[tid] += smem_v[tid + 16]; }
        if (blockSize >= 16) { smem_v[tid] += smem_v[tid +  8]; }
        if (blockSize >=  8) { smem_v[tid] += smem_v[tid +  4]; }
        if (blockSize >=  4) { smem_v[tid] += smem_v[tid +  2]; }
        if (blockSize >=  2) { smem_v[tid] += smem_v[tid +  1]; }
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
[/codebox]
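For anyone who wants to compile this: d_params is a two-element __constant__ array that gets filled from the host before the launch. The LL body below is only a placeholder so the example builds (my real function is different), but the setup is the same:

[codebox]
__constant__ double d_params[2];

// placeholder body only -- my actual LL is different, but it consumes
// the constant-memory parameters the same way
template <class T>
__device__ T LL(T x, double sigma, double mean)
{
    T z = (x - mean) / sigma;
    return T(0.5) * z * z;
}

// host-side setup, done once before the kernel launch
double h_params[2] = { 2.0, 0.0 };   // sigma, mean (example values)
cudaMemcpyToSymbol(d_params, h_params, sizeof(h_params));
[/codebox]

Note that the kernel only leaves one partial sum per block in g_odata; as in the SDK sample, those partials still have to be summed on the host (or with a further kernel launch) to get the final value.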

The only modifications I have made relative to the SDK's reduce6 are calling my function LL as each value is read from global to shared memory, and reading the constant values sigma/mean into registers. Does anyone have any tips on improving this? Is there any way to calculate whether I am fully utilizing the chip, since this seems to be compute-bound?
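The only check I can think of is an effective-bandwidth estimate: time the kernel with CUDA events and compare bytes moved against the C1060's ~102 GB/s theoretical peak. The launch configuration and buffer names in the sketch below are just illustrative:

[codebox]
int numBlocks = 64;   // illustrative only
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// n ~= 1,000,000 is not a power of two, hence nIsPow2 = false
reduce6<double, 256, false><<<numBlocks, 256, 256 * sizeof(double)>>>(d_idata, d_odata, n);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// each input double is read exactly once, so bytes moved ~= 8*n;
// the per-block partial sums written back are negligible for large n
double gbps = (double)n * sizeof(double) / (ms * 1.0e6);
printf("effective bandwidth: %.1f GB/s\n", gbps);
[/codebox]

If the achieved figure comes out well below peak, I assume that points at the double-precision arithmetic in LL, rather than memory, as the limit. Thanks a lot!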