Code runs 3x faster on GTX 260 than on Tesla C1060

At home I am using a GeForce GTX 260 and my program (an image registration algorithm) runs in 30 seconds.
Today I ran it on a Tesla C1060 board, hoping to gain a little more speedup, but the result was devastating: it took 90 seconds to execute.
I used the Visual Profiler and tried a few things, but I really can't figure out what the problem could be.

The program heavily uses shared-memory atomics (CC 1.3). It is quite short but needs 300-400 iterations.

At home I'm using:
Ubuntu 9.04 64bit
Cuda 2.1
Geforce GTX260
Execution time: 30s

At work:
OpenSuse 11.1 64bit
Cuda 2.2
Tesla c1060
Execution time: 90s

I could supply more specs if needed, in case somebody has an idea what could be relevant.

I would be really thankful if somebody could help me with this strange issue.

Hi,

I am very much interested in the solution to your problem, since I am facing a similar problem with atomic functions on device memory between a Quadro 5800 and an 8800 GT: the 8800 GT is much faster at atomicInc().

Sid.

I would try to avoid shared memory atomic operations. Replicating counters and then doing a reduction for the final result can often be faster. In the case of many conflicts even letting only a single thread increment a counter might be better.
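
A rough, untested sketch of the counter-replication idea (assuming 256 threads per block and 256 bins, with one histogram copy per warp; histReplicated, data and histOut are just placeholder names, not code from this thread):

[codebox]#define BINS 256
#define WARPS_PER_BLOCK 8   // 256 threads / 32

// Rough sketch: each warp increments its own shared-memory histogram copy,
// so atomics from different warps never collide; the copies are then reduced.
__global__ void histReplicated(const unsigned char *data, int n, unsigned int *histOut)
{
	__shared__ unsigned int hist[WARPS_PER_BLOCK][BINS];
	int warp = threadIdx.x / 32;

	// zero all copies
	for (int i = threadIdx.x; i < WARPS_PER_BLOCK * BINS; i += blockDim.x)
		((unsigned int *)hist)[i] = 0;
	__syncthreads();

	// each warp updates only its own copy
	for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
		atomicAdd(&hist[warp][data[i]], 1);
	__syncthreads();

	// reduce the copies and accumulate into the global histogram
	for (int bin = threadIdx.x; bin < BINS; bin += blockDim.x) {
		unsigned int sum = 0;
		for (int w = 0; w < WARPS_PER_BLOCK; w++)
			sum += hist[w][bin];
		atomicAdd(&histOut[bin], sum);
	}
}[/codebox]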

The thing is that I'm completely satisfied with my algorithm; it runs nicely on my desktop PC.
Now I have started benchmarking, and it is 3 times slower on a better GPU, so I am trying to find out what the problem is, maybe a compiler problem or a problem with the Tesla card.

Something’s definitely strange here. Certainly, there are cases where a newer GPU can run “slower” than an old one, but generally only when there are insufficient threads to utilize all resources. Is your computation split up into a large number of kernels? If so, there might be some overhead in there somewhere which happens to be different across the two platforms. Try running the thing through the CUDA profiler, and look over the numbers.

Second, make sure it’s actually running on the Tesla - if that same computer has another card for display, it might be running on that instead! Although I can’t think of any low end compute 1.3 cards yet…

It might be important to note that the Tesla and Quadro have slower memory than their desktop counterparts (they chose capacity over bandwidth). That would affect global atomics and global memory access in general, but I still don't see how it could account for a difference of the size you're seeing.

I only have 1 kernel with > 1 million threads.

I ran the profiler and couldn't find any strange results, only that the Tesla card needs fewer registers for the computation; I don't know why.

There is a Quadro in the system, but it only has CC 1.1 and therefore can't be used.

In the next few days I will try the computation on a GTX 280; then I can narrow down whether it is a problem with the OS, the compiler, or the Tesla card, because the GTX 280 runs on a similar system.

The difference here is almost certainly the compiler. You may have discovered a performance regression bug. You should probably try the CUDA 2.1 toolkit (no need to change the driver, as it is backward compatible) on the OpenSuse system and see if it reproduces the same speed as the Ubuntu system. Then you should consider moving to CUDA 2.3 to see if they fixed the problem you are seeing with CUDA 2.2.

Yep, the different toolkits behave differently.

It’s very possible that the more detailed reason is that the 2.2 compiler produces faster code, but at the cost of using a few more registers. That higher register use might mean that fewer blocks can run simultaneously on one SM, causing lower net performance.

Look at the register use of both your compiles to see if this might be the case.

This is truly possible… but a factor of 3? That would have to be a huge difference in register use. Don't forget that the Tesla also has even more MPs than the GTX 260…

@flobbes

If it is a short kernel, maybe you can post it here? How many blocks do you use? How much shared memory is needed? Do you access global memory often? PTX output of both compilations would be useful…

Have you tried making an executable and running it on both systems?

No, I haven't tried running the same executable on both systems; I'll try it next week.

Below you can see the complete kernel.

I start 1 thread per image voxel:

dim3 block(256);
dim3 grid((m_VolA_orig.xDim * m_VolA_orig.yDim * m_VolA_orig.zDim + 255) / 256);

Where/what should I look up about the PTX output?

__global__ void hist256copiesTex(Sample_Vol_Main_Header A, Sample_Vol_Main_Header B, values *v, int histo_A[256], int histo_B[256], int histo_A_Btmp[256 * 256 * 256]) {

	// per-block histograms in shared memory, one bin per thread
	__shared__ int histA[256];
	__shared__ int histB[256];
	histA[threadIdx.x] = 0;
	histB[threadIdx.x] = 0;

	// recover the voxel coordinates in volume A from the linear thread index
	float x = (blockIdx.x * blockDim.x + threadIdx.x) % A.xDim;
	float z = (blockIdx.x * blockDim.x + threadIdx.x) / (A.yDim * A.xDim);
	float y = (blockIdx.x * blockDim.x + threadIdx.x) % (A.yDim * A.xDim) / A.xDim;

	// transform the coordinates into volume B
	float xB = x * v->at[0] + y * v->at[1] + z * v->at[2] + v->dt[0];
	float yB = x * v->bt[0] + y * v->bt[1] + z * v->bt[2] + v->dt[1];
	float zB = x * v->ct[0] + y * v->ct[1] + z * v->ct[2] + v->dt[2];

	if (x < A.xDim && (xB >= 0 && xB < B.xDim) && (yB >= 0 && yB < B.yDim) && (zB >= 0 && zB < B.zDim))
	{
		int valueA = tex3D(texA, x, y, z) * 255.0;
		int valueB = tex3D(texB, xB, yB, zB) * 255.0;

		// per-block histograms via shared memory atomics
		atomicAdd(&histA[valueA], 1);
		atomicAdd(&histB[valueB], 1);

		// joint histogram: one 256x256 copy per thread index, via global memory atomics
		atomicAdd(&histo_A_Btmp[threadIdx.x * BLOCKSIZE * BLOCKSIZE + valueA * BLOCKSIZE + valueB], 1);
	}

	__syncthreads();

	// accumulate the per-block histograms into the global ones
	atomicAdd(&histo_A[threadIdx.x], histA[threadIdx.x]);
	atomicAdd(&histo_B[threadIdx.x], histB[threadIdx.x]);
}

No, even a single register increase may put you over the limit. Say your kernel uses 256 threads and 32 registers per thread. This is a very very typical block configuration.

Since the block uses 8192 registers, your device can run two blocks at once on one MP on a G200. That uses all 16384 registers per MP.

Now increase register use by one to 33. Now a block uses 8448 registers. Whups, you can now only run ONE block per MP… there’s no room for the second block’s registers.

So yes, you can indeed have dramatic shifts by a tiny change in register use.

Is that a 3X speed difference? Unlikely… but it could explain most of that difference anyway.

This issue came up with the SHA cracking tool… the older CUDA toolkits used fewer registers and therefore could pack more blocks onto G80/G90 boards. In that case it was only a 30% speed difference, but even that's quite noticeable.
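
To make the arithmetic concrete, here is a tiny illustration using the numbers above (16384 registers per SM on GT200, 256 threads per block); it is just the back-of-the-envelope calculation, not an exact occupancy model:

[codebox]#include <stdio.h>

// Back-of-the-envelope occupancy arithmetic: how many blocks fit on one SM
// given the register budget (assumed: GT200-class SM, 256-thread blocks).
int main(void)
{
	const int regs_per_sm   = 16384;
	const int threads_block = 256;

	for (int regs_per_thread = 32; regs_per_thread <= 33; ++regs_per_thread) {
		int regs_per_block = regs_per_thread * threads_block;  // 8192 vs. 8448
		int blocks_per_sm  = regs_per_sm / regs_per_block;     // 2    vs. 1
		printf("%d regs/thread -> %d regs/block -> %d block(s) per SM\n",
		       regs_per_thread, regs_per_block, blocks_per_sm);
	}
	return 0;
}[/codebox]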

I tried the code on a GTX 280 now and it runs 4 times slower.
I even tested the binaries I compiled at home and they run just as slowly. I also tested the program on the PC of a friend of mine, and there it runs as quickly as on my PC (he has a GTX 260 as well).

Therefore it seems to be a problem with the configuration of the workstations at my university and not a graphics card problem.

Does anybody have an idea what could be the problem, that makes the program run so much slower?

I am curious, have you run the sample applications that come with the SDK, such as the bandwidth test? It could give a hint about the problem.

Yes, I tried 4 benchmarks from the CUDA SDK and they all give me reasonable results.

It's really weird, and I can't think of anything else I could test to narrow down the problem.

Can you post a minimal repro code? Preferably a stand-alone .cu file that one can simply compile with nvcc, not linking any unnecessary libraries (preferably not cutil either).

I would have guessed that maybe the C1060 is in power-save mode, but if other CUDA programs perform as expected that’s unlikely. How are you timing your kernel? Ideally, you’d be using CUDA events to time execution on GPU.
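
Something along these lines would do it (a minimal, self-contained sketch; dummyKernel just stands in for your registration kernel):

[codebox]#include <cstdio>

__global__ void dummyKernel(int *out) { out[threadIdx.x] = threadIdx.x; }

int main()
{
	int *d_out;
	cudaMalloc((void **)&d_out, 256 * sizeof(int));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	dummyKernel<<<1, 256>>>(d_out);    // launch the kernel to be timed
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);        // wait until the kernel has finished

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
	printf("kernel time: %f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(d_out);
	return 0;
}[/codebox]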

Paulius

I measured the times with the cutCreateTimer method and with CUDA profiler.

I did a lot of testing now:

Windows:

Geforce GTX260 : 18 seconds

Geforce GTX275 : 18 seconds

Geforce GTX285 : 64 seconds

Geforce 8800 512: 99 seconds

Linux:

Geforce GTX285: 84 seconds

Tesla C1060 : 84 seconds

So I think there is a problem with Linux. But I don't know why the GTX 285 under Windows is so slow, because from the specs the 275 is much more like the 285 than the 260.

I can post the kernel and the kernel call, but a stand-alone .cu isn't possible so easily because there are a lot of dependencies.

[codebox]__global__ void hist256copiesTexValues(values *v, int histo_A_Btmp[256 * 256 * 256], int ax, int ay, int az, int bx, int by, int bz) {

	int position = blockIdx.x * blockDim.x + threadIdx.x;

	// voxel coordinates in volume A
	float x = (position) % ax;
	float z = (position) / (ay * ax);
	float y = (position) % (ay * ax) / ax;

	// transformed coordinates in volume B
	float xB = x * v->at[0] + y * v->at[1] + z * v->at[2] + v->dt[0];
	float yB = x * v->bt[0] + y * v->bt[1] + z * v->bt[2] + v->dt[1];
	float zB = x * v->ct[0] + y * v->ct[1] + z * v->ct[2] + v->dt[2];

	if (x < ax && (xB >= 0 && xB < bx) && (yB >= 0 && yB < by) && (zB >= 0 && zB < bz))
	{
		int valueA = tex3D(texA, x, y, z) * 255.0;
		int valueB = tex3D(texB, xB, yB, zB) * 255.0;

		// joint histogram: one 256x256 copy per thread index, via global memory atomics
		atomicAdd(&histo_A_Btmp[threadIdx.x * 256 * 256 + valueA * 256 + valueB], 1);
	}
}[/codebox]

[codebox]dim3 block(256);
dim3 grid((m_VolA_orig.xDim * m_VolA_orig.yDim * m_VolA_orig.zDim + 255) / 256);

hist256copiesTexValues<<<grid, block>>>(values_dev, histo_A_B_tmp_dev, m_VolA_orig.xDim, m_VolA_orig.yDim, m_VolA_orig.zDim, m_VolB_orig.xDim, m_VolB_orig.yDim, m_VolB_orig.zDim);[/codebox]

Number of memory partitions of each GPU:

Geforce GTX260 : 7

Geforce GTX275 : 7

Geforce GTX285 : 8

Geforce 8800 512: 4

Geforce GTX285: 8

Tesla C1060 : 8

Do you see a pattern emerging?

If valueA and valueB are not uniformly distributed, but rather centered around some value, you have a memory partition contention problem, whenever the number of memory partitions on the GPU is a power of two.

You are essentially accessing memory locations separated by 256*256 words, and these happen to fall all into the same partition(s).

So try turning the 256 factor into a slightly larger, non-power-of-two value and see if it helps…
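
For example, something along these lines (just an illustrative sketch, not your kernel; here valueA and valueB are assumed to be precomputed arrays, and the scratch buffer must be allocated with blockDim.x * PITCH * PITCH ints instead of 256 * 256 * 256):

[codebox]#define BINS  256
#define PITCH 257   // slightly larger, non-power-of-two stride

// Illustration only: pad the power-of-two strides so that threads no longer
// hit the same memory partition with every atomic update.
__global__ void histPadded(const int *valueA, const int *valueB, int *histo_A_Btmp, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		// same joint-histogram update as before, but with a padded pitch
		atomicAdd(&histo_A_Btmp[threadIdx.x * PITCH * PITCH + valueA[i] * PITCH + valueB[i]], 1);
	}
}[/codebox]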

You're my hero!

I just replaced:

atomicAdd(&histo_A_Btmp[threadIdx.x * 256 * 256 + valueA * 256 + valueB], 1)

with:

atomicAdd(&histo_A_Btmp[threadIdx.x * 256 + 256 * valueA * 256 + valueB], 1);

and it needed only 43 seconds instead of 85 on the Tesla board.

So I hope that when I replace it with a non-power-of-two value I'll get close to my result from home.

One more question:

I heard a whole lecture about CUDA and learnt so much about coalesced memory access, divergence, and shared memory bank conflicts, but I haven't heard anything about these memory partitions. Do you have any literature where I can read up on this issue to understand it a bit better?
I would be really happy if it is explained somewhere, and thanks for the help!

Will I need an electron microscope to spot any difference between these two lines? ;)

I felt the same, BUT:

atomicAdd(&histo_A_Btmp[threadIdx.x * 256 * 256 +

and

atomicAdd(&histo_A_Btmp[threadIdx.x * 256 +

are indeed different.