Slow memory access on Tesla C1060


I'm experiencing very poor performance on a Tesla C1060 (SUSE Linux, 64-bit) compared to my laptop with a GT 240M. I was able to narrow the problem down and create a minimal kernel that shows the strange behaviour.

__global__ void memoryissuekernel(unsigned int* input, unsigned int* output, unsigned int* indices, unsigned int elements)
{
	// One thread per output element.
	unsigned int tid = getLinearThreadID();

	if (tid < elements)
	{
		unsigned int index = indices[tid];
		unsigned int mask = 0x7FFFFFF;	// low 27 bits, i.e. 2^27 - 1

		// Gather: read input at a random position, write output linearly.
		output[tid] = input[index & mask];
	}
}

indices and output each hold 12,800,000 unsigned ints; input holds 2^27 unsigned ints. indices and input are filled with random values. All three arrays are allocated on the device with cudaMalloc, and input and indices are copied to device memory before the kernel is launched.
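For anyone who wants to reproduce this, the setup described above can be sketched roughly as follows. This is a reconstruction under assumptions, not my exact code: the body of getLinearThreadID(), the helpers fillRandom, gridFor and runOnce, the block size of 256 and the 2D grid layout are all placeholders chosen for illustration. cudaDeviceSynchronize is the current runtime call; older toolkits used cudaThreadSynchronize instead.

```cuda
#include <cassert>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for getLinearThreadID(): flattens a 2D grid of 1D blocks.
__device__ unsigned int getLinearThreadID()
{
    unsigned int block = blockIdx.y * gridDim.x + blockIdx.x;
    return block * blockDim.x + threadIdx.x;
}

__global__ void memoryissuekernel(const unsigned int* input, unsigned int* output,
                                  const unsigned int* indices, unsigned int elements)
{
    unsigned int tid = getLinearThreadID();
    if (tid < elements) {
        unsigned int index = indices[tid];
        unsigned int mask = 0x7FFFFFF;          // low 27 bits
        output[tid] = input[index & mask];
    }
}

// Fills a host vector with pseudo-random 32-bit values (illustrative helper).
static void fillRandom(std::vector<unsigned int>& v)
{
    for (size_t i = 0; i < v.size(); ++i)
        v[i] = ((unsigned int)rand() << 16) ^ (unsigned int)rand();
}

// Picks a grid with at most 65535 blocks in x, spilling into y when needed
// (assumed layout; 2^27 threads do not fit in a 1D grid of that width).
static dim3 gridFor(unsigned int elements, unsigned int threads)
{
    assert(threads > 0);
    unsigned int blocks = (elements + threads - 1) / threads;
    if (blocks == 0) blocks = 1;
    unsigned int gx = blocks < 65535u ? blocks : 65535u;
    unsigned int gy = (blocks + gx - 1) / gx;
    return dim3(gx, gy);
}

// Allocates, fills and copies the arrays, then launches the kernel once.
// Returns 0 on success, 1 if any CUDA call fails.
static int runOnce()
{
    const unsigned int elements = 12800000u;
    const size_t inputElems = (size_t)1 << 27;

    std::vector<unsigned int> hInput(inputElems), hIndices(elements);
    fillRandom(hInput);
    fillRandom(hIndices);

    unsigned int *dInput = 0, *dOutput = 0, *dIndices = 0;
    bool ok =
        cudaMalloc(&dInput, inputElems * sizeof(unsigned int)) == cudaSuccess &&
        cudaMalloc(&dOutput, elements * sizeof(unsigned int)) == cudaSuccess &&
        cudaMalloc(&dIndices, elements * sizeof(unsigned int)) == cudaSuccess &&
        cudaMemcpy(dInput, hInput.data(), inputElems * sizeof(unsigned int),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dIndices, hIndices.data(), elements * sizeof(unsigned int),
                   cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const unsigned int threads = 256;       // assumed block size
        memoryissuekernel<<<gridFor(elements, threads), threads>>>(
            dInput, dOutput, dIndices, elements);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }

    cudaFree(dInput);
    cudaFree(dOutput);
    cudaFree(dIndices);
    return ok ? 0 : 1;
}
```

Calling runOnce() from main and checking for a return value of 0 is enough to confirm that the launch itself succeeds. Note that the host buffers alone need a bit over 560 MB of RAM.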

On my laptop the kernel takes about 60 ms, while on the Tesla it takes about 200 ms. Strangest of all, the execution time on the Tesla seems to grow exponentially with the width of the mask: when I reduce the mask to 0x7FFFFF, the time on my laptop drops slightly to 40 ms (from 60 ms), but on the Tesla it drops sharply to 20 ms (from 200 ms). When I increase the size of indices and output and the number of threads on the Tesla to 2^27, the behaviour is the same: 200 ms with 0x7FFFFF and 2000 ms with 0x7FFFFFF.
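To make the dependence on the mask easier to measure, the mask can be turned into a kernel argument and each launch timed with CUDA events. The following is only a sketch of how such a sweep might look: maskedGather, maskForBits, timeLaunchMs and sweepMasks are names made up for this example, the range of 20 to 27 bits is arbitrary, and the device buffers and launch configuration are assumed to be set up by the caller as described earlier.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Same gather as in the post, but with the mask passed in per launch.
__global__ void maskedGather(const unsigned int* input, unsigned int* output,
                             const unsigned int* indices, unsigned int elements,
                             unsigned int mask)
{
    // Assumed thread numbering: 2D grid of 1D blocks, flattened.
    unsigned int block = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int tid = block * blockDim.x + threadIdx.x;
    if (tid < elements)
        output[tid] = input[indices[tid] & mask];
}

// Mask with the lowest `bits` bits set, valid for 1..31 bits.
static unsigned int maskForBits(unsigned int bits)
{
    assert(bits >= 1 && bits <= 31);
    return (1u << bits) - 1u;
}

// Times a single launch in milliseconds, after one untimed warm-up launch.
static float timeLaunchMs(const unsigned int* dInput, unsigned int* dOutput,
                          const unsigned int* dIndices, unsigned int elements,
                          unsigned int mask, dim3 grid, dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    maskedGather<<<grid, block>>>(dInput, dOutput, dIndices, elements, mask);

    cudaEventRecord(start, 0);
    maskedGather<<<grid, block>>>(dInput, dOutput, dIndices, elements, mask);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the timed launch has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Prints one timing per mask width so the growth can be seen bit by bit.
static void sweepMasks(const unsigned int* dInput, unsigned int* dOutput,
                       const unsigned int* dIndices, unsigned int elements,
                       dim3 grid, dim3 block)
{
    for (unsigned int bits = 20; bits <= 27; ++bits) {
        unsigned int mask = maskForBits(bits);
        float ms = timeLaunchMs(dInput, dOutput, dIndices, elements,
                                mask, grid, block);
        printf("mask 0x%07X (%u bits): %.2f ms\n", mask, bits, ms);
    }
}
```

Passing the mask as an argument avoids recompiling for every value, and stepping one bit at a time should show whether the time really grows smoothly with the mask or jumps at a particular width.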

I tested this with driver versions 2.2 and 3.0b on the Tesla and 2.1 on my laptop.

All of my other kernels run about 5 to 10 times faster on the Tesla than on my laptop; only this one kernel is slower.

Has anyone else seen this, or does anyone have a clue why performance on the Tesla depends so strongly on the size of the mask?

Thank you in advance