Hello, I am working on some matrix algebra functions, and I’m trying to write a simple kernel that adds two matrices of arbitrary size.

Here’s the kernel code:

```
// Parenthesized so the macro is safe with expression arguments.
#define MIN( a, b ) ((a) < (b) ? (a) : (b))

// Each thread adds one contiguous chunk of chunkSize elements.
template <class T, unsigned long chunkSize>
__global__ void addKernel( T * a, T * b, T * c, unsigned long N ) {
    unsigned long start = (blockIdx.x * blockDim.x + threadIdx.x) * chunkSize;
    unsigned long end = MIN( start + chunkSize, N );
    __syncthreads();
    for( unsigned long n = start; n < end; n++ )
        c[n] = a[n] + b[n];
}
```

This actually runs slower than the CPU version, for any input size and any combination of threads and blocks! (If there are more elements than the maximum number of threads/blocks can cover one-to-one, chunkSize makes up the difference.) chunkSize is a template parameter, much as in the reduction SDK example, since I was hoping that nvcc might unroll the loop…
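
For concreteness, here is a minimal sketch of how a kernel like this could be launched so that chunkSize covers the elements the threads alone do not. The block size of 256, the chunkSize of 4, and the helper names `ceilDiv` and `launchAndCheck` are assumptions for illustration, not the exact configuration behind the timings described above.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define MIN( a, b ) ((a) < (b) ? (a) : (b))

// Same kernel as in the post: each thread adds one contiguous chunk.
template <class T, unsigned long chunkSize>
__global__ void addKernel( T * a, T * b, T * c, unsigned long N ) {
    unsigned long start = (blockIdx.x * blockDim.x + threadIdx.x) * chunkSize;
    unsigned long end = MIN( start + chunkSize, N );
    __syncthreads();
    for( unsigned long n = start; n < end; n++ )
        c[n] = a[n] + b[n];
}

// Hypothetical helper: integer division rounded up.
static unsigned long ceilDiv( unsigned long x, unsigned long y ) {
    return (x + y - 1) / y;
}

// Hypothetical driver: runs the kernel on N floats and returns the number
// of elements that differ from the host-side sum (0 means correct).
static unsigned long launchAndCheck( unsigned long N ) {
    const unsigned long kChunk = 4;    // assumed chunkSize
    const unsigned int threads = 256;  // assumed threads per block

    // Small integer-valued inputs so the float sums are exact.
    std::vector<float> ha( N ), hb( N ), hc( N, 0.0f );
    for( unsigned long i = 0; i < N; i++ ) {
        ha[i] = float( i % 1000 );
        hb[i] = 2.0f * float( i % 1000 );
    }

    size_t bytes = N * sizeof( float );
    float *da = 0, *db = 0, *dc = 0;
    cudaMalloc( &da, bytes );
    cudaMalloc( &db, bytes );
    cudaMalloc( &dc, bytes );
    cudaMemcpy( da, ha.data(), bytes, cudaMemcpyHostToDevice );
    cudaMemcpy( db, hb.data(), bytes, cudaMemcpyHostToDevice );

    // One thread per chunk, rounded up so the tail of the array is covered.
    unsigned long totalThreads = ceilDiv( N, kChunk );
    unsigned int blocks = (unsigned int) ceilDiv( totalThreads, threads );
    addKernel<float, kChunk><<<blocks, threads>>>( da, db, dc, N );
    cudaError_t err = cudaDeviceSynchronize();
    assert( err == cudaSuccess );

    cudaMemcpy( hc.data(), dc, bytes, cudaMemcpyDeviceToHost );
    cudaFree( da );
    cudaFree( db );
    cudaFree( dc );

    unsigned long mismatches = 0;
    for( unsigned long i = 0; i < N; i++ )
        if( hc[i] != ha[i] + hb[i] ) mismatches++;
    return mismatches;
}
```

Calling `launchAndCheck( 1000003 )` from `main` exercises the case where N is not a multiple of either the chunk size or the block size. The sketch only checks correctness; it does not time anything.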

Does anyone know why this is running so slowly? Thank you!