Really slow results

Hello, I am working on some matrix algebra functions, and I’m trying to write a simple kernel that adds two matrices of arbitrary size.

Here’s the kernel code:

#define MIN( a, b )				((a) < (b) ? (a) : (b))

template <class T, unsigned long chunkSize>
__global__ void addKernel( T *a, T *b, T *c, unsigned long N ) {

	unsigned long start = (blockIdx.x * blockDim.x + threadIdx.x) * chunkSize;

	unsigned long end = MIN( start + chunkSize, N );

	for( unsigned long n = start; n < end; n++ )

		c[n] = a[n] + b[n];
}


This actually runs slower than the CPU, for any amount of input and any combination of threads/blocks! (If the amount of data is greater than the maximum number of threads × blocks, the difference is made up with chunkSize.) chunkSize is a template parameter, much like in the reduction SDK example, as I was hoping that nvcc might unroll the loop…

Does anyone know why this is running so slowly? Thank you!

Have you run this through the profiler?

(It’s because all of your memory accesses are completely uncoalesced, so your bandwidth utilization is going to get wrecked.)
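For context on why that happens: with chunkSize > 1 as written, thread 0 reads elements 0..chunkSize-1, thread 1 reads chunkSize..2*chunkSize-1, and so on, so adjacent threads in a warp touch addresses chunkSize elements apart and the hardware cannot merge their loads into wide transactions. A common coalesced alternative (a sketch, not the poster’s actual code) is a grid-stride loop, where neighbouring threads always read neighbouring elements:

```cuda
// Hypothetical coalesced variant: on every loop iteration, consecutive
// threads access consecutive elements, so each warp's loads and stores
// can be combined into a few wide memory transactions.
template <class T>
__global__ void addKernelCoalesced( const T *a, const T *b, T *c,
                                    unsigned long N ) {
	unsigned long stride = (unsigned long)blockDim.x * gridDim.x;
	for ( unsigned long n = blockIdx.x * blockDim.x + threadIdx.x;
	      n < N; n += stride )
		c[n] = a[n] + b[n];
}
```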

Thank you for your speedy response!

I have used the profiler, and it did flag uncoalesced accesses, now that you mention it. Thank you, I will research this!

Are you transferring the data from system memory to GPU memory each time you run the kernel? Because it’ll take longer to push the matrices over the PCIe bus than it would for the CPU to just add them together in the first place.

No, I am creating a matrix class that loads the data onto the device once at the start, so every operation after that requires no further transfers.
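A minimal sketch of that idea (the class and method names here are invented for illustration, not the poster’s actual class): allocate device memory and copy the data in once at construction, hand kernels the device pointer, and copy back only when a result is needed:

```cuda
#include <cuda_runtime.h>

// Hypothetical wrapper: one upload at construction, one download on demand.
template <class T>
class DeviceMatrix {
public:
	DeviceMatrix( const T *host, unsigned long n ) : N(n) {
		cudaMalloc( (void **)&d_data, N * sizeof(T) );
		cudaMemcpy( d_data, host, N * sizeof(T), cudaMemcpyHostToDevice );
	}
	~DeviceMatrix() { cudaFree( d_data ); }

	T *devicePtr() { return d_data; }	// kernels operate on this directly

	void download( T *host ) const {	// only when the result is needed
		cudaMemcpy( host, d_data, N * sizeof(T), cudaMemcpyDeviceToHost );
	}

private:
	T *d_data;
	unsigned long N;
};
```

This keeps the PCIe traffic to one transfer in and one transfer out, regardless of how many kernels run in between.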

It works now (it is 4-5x faster than the CPU). I see that launching many more blocks is a better solution than having a single thread do more than one operation.
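For readers landing on this thread later, the "many blocks, one element per thread" launch might look like this (a sketch; the size N and the block size of 256 are just example choices, and d_a/d_b/d_c are assumed to be device pointers):

```cuda
// One thread per element: chunkSize is effectively 1, and the grid is
// sized so that every element gets its own thread.
const unsigned long N = 1 << 20;	// example problem size
const unsigned int threads = 256;	// a common block size
const unsigned int blocks = (unsigned int)((N + threads - 1) / threads);

addKernel<float, 1><<<blocks, threads>>>( d_a, d_b, d_c, N );
cudaDeviceSynchronize();
```

With chunkSize = 1, the MIN(start + chunkSize, N) bound also doubles as the usual out-of-range guard for the last, partially filled block.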