Slow memory access on Tesla C1060

Hello,

I’m experiencing very poor performance on a Tesla C1060 (SUSE Linux, 64-bit) compared to my laptop with a GT 240M. I was able to narrow the problem down and create a minimal kernel that shows the strange behaviour.

__global__ void memoryissuekernel(unsigned int* input, unsigned int* output, unsigned int* indices, unsigned int elements)
{
	unsigned int tid = getLinearThreadID();

	if (tid < elements)
	{
		unsigned int index = indices[tid];
		unsigned int mask = 0x7FFFFFF;
		output[tid] = input[index & mask];
	}
}
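For completeness, a minimal version of the getLinearThreadID() helper, assuming the 1-D launches used here (my actual helper may differ slightly):

```cuda
// Linear thread index for a 1-D grid of 1-D blocks.
__device__ unsigned int getLinearThreadID()
{
	return blockIdx.x * blockDim.x + threadIdx.x;
}
```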

Indices and output each have a size of 12,800,000 ints; input has a size of 2^27 ints. Indices and input are filled with random values. All arrays are allocated on the device using cudaMalloc, and input and indices are copied to device memory prior to launching the kernel.
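Roughly, the host-side setup looks like this (a simplified sketch with the sizes from above; error checking and timing are omitted, and my actual code differs in details):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
	const unsigned int elements  = 12800000;  // size of indices and output
	const unsigned int inputSize = 1u << 27;  // size of input: 2^27 ints

	// Fill input and indices with random values on the host.
	unsigned int* hInput   = new unsigned int[inputSize];
	unsigned int* hIndices = new unsigned int[elements];
	for (unsigned int i = 0; i < inputSize; ++i) hInput[i]   = rand();
	for (unsigned int i = 0; i < elements;  ++i) hIndices[i] = rand();

	// Allocate all arrays on the device; copy input and indices over.
	unsigned int *dInput, *dOutput, *dIndices;
	cudaMalloc((void**)&dInput,   inputSize * sizeof(unsigned int));
	cudaMalloc((void**)&dOutput,  elements  * sizeof(unsigned int));
	cudaMalloc((void**)&dIndices, elements  * sizeof(unsigned int));
	cudaMemcpy(dInput,   hInput,   inputSize * sizeof(unsigned int), cudaMemcpyHostToDevice);
	cudaMemcpy(dIndices, hIndices, elements  * sizeof(unsigned int), cudaMemcpyHostToDevice);

	// Launch with enough threads to cover all elements.
	const unsigned int blockSize = 256;
	const unsigned int gridSize  = (elements + blockSize - 1) / blockSize;
	memoryissuekernel<<<gridSize, blockSize>>>(dInput, dOutput, dIndices, elements);
	cudaThreadSynchronize();  // CUDA 2.x-era synchronization call

	cudaFree(dInput); cudaFree(dOutput); cudaFree(dIndices);
	delete[] hInput; delete[] hIndices;
	return 0;
}
```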

On my laptop the kernel executes in about 60 ms, while it takes about 200 ms on the Tesla. The strangest part is that the execution time on the Tesla seems to depend on the size of the mask in an exponential way: when I reduce the mask to 0x7FFFFF, execution time on my laptop drops slightly to 40 ms (from 60 ms), but drops dramatically to 20 ms (from 200 ms) on the Tesla. (The smaller mask limits the gather to the first 2^23 ints of input, i.e. 32 MB, while 0x7FFFFFF spans the full 2^27 ints, i.e. 512 MB.) When I increase the indices size, the output size and the number of threads on the Tesla to 2^27 ints, the behaviour is the same: 200 ms using 0x7FFFFF and 2000 ms using 0x7FFFFFF.

I tested this using driver versions 2.2 and 3.0b on the Tesla and 2.1 on my laptop.

All of my other kernels execute about 5-10 times faster on the Tesla than on my laptop; it’s just this one kernel that is slower.

Has anyone else experienced this, or does anyone have a clue why the performance on the Tesla depends so strongly on the size of the mask?

Thank you in advance

Markus