Performance difference between a Tesla card and a system where the CUDA GPU is also the display device

I am benchmarking code on different systems. One system has a Tesla card that is used exclusively for CUDA, while on the other system CUDA runs on a GeForce card that is also the display device. I am measuring much higher performance on the Tesla system than I would expect from the results I get on the other system. The CPU is the same in both cases and the other parts of the systems are very similar. I am sure the measurements on both systems are correct.

Now my speculation is as follows: the code results in many rather short kernel calls, i.e. around 80 kernel calls within a time frame of 10 ms. Is it possible that on the system where the card is also used as the display device there is simply a relatively high chance of stalls between kernel calls because the card is occupied by the display driver? The cards in question are a GeForce GTX 280 and a Tesla C1060. The code is mostly limited by memory bandwidth, uses less than 40 MB of GPU memory, and I am getting around 30% better performance on the Tesla system.
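For reference, this is roughly what the launch pattern looks like, as a simplified sketch; the copy kernel, buffer sizes, and launch count here are placeholders rather than my actual code:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for one of my short, bandwidth-bound kernels.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 20;      // ~4 MB per buffer, well under 40 MB total
    const int launches = 80;    // roughly the number of launches per 10 ms frame

    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int k = 0; k < launches; ++k)
        copyKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches: %.3f ms total, %.1f us per launch incl. work\n",
           launches, ms, 1000.0f * ms / launches);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}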

That certainly is a reasonable explanation. I know that people have reported being able to slow down CUDA programs by interacting with the GUI while they were running. You could check whether the problem gets worse if you move a window around on the screen randomly while your code is running. It’s possible that even an idle GUI increases the latency of starting short kernels.

Is the GTX 280 machine running Vista while the Tesla machine is running XP or something like that?

The GTX machine is running Vista while the one with the Tesla runs openSUSE 11.1 (the driver is the most recent version in both cases). The whole thing is not really a problem for me; I was just wondering about the results I am getting. Basically I am benchmarking different algorithms, and for one that tends to make lots of short kernel calls there is this performance difference between the systems. For the other algorithms, which split the work into fewer, longer kernel calls (around 10-20 times fewer calls than the “problematic” algorithm), the results match what I would expect.

And yes, performance decreases a little more when I move the mouse around while the benchmark runs, and the variance of the results increases. In any case this is a valuable insight for me: one really should not benchmark on a card that is also the display device. Or at least one should not try to compare results between a system where the GPU is used exclusively for CUDA and one where it is not.
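As a side note, if anyone wants to check from code whether the card they are benchmarking on is likely to be a display device, the runtime reports whether the execution-timeout watchdog is active. A quick sketch, using nothing beyond the standard device-properties query:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // The run time limit (watchdog) is normally only active on a card
        // that is also driving a display.
        printf("device %d (%s): run time limit %s\n", d, prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled (likely a display device)"
                                             : "disabled");
    }
    return 0;
}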

Welcome to WDDM. Kernel launch overhead is ~3 us on non-WDDM platforms. On WDDM, it’s ~40 us at a minimum and can potentially be much larger. Considering the number of kernels you’re launching in 10 ms, that’s going to add up.
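If you want to see it on your own setup, something like this rough sketch gives a ballpark per-launch figure: just an empty kernel launched in a loop, timed on the host, nothing platform-specific assumed.

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void nullKernel() {}

int main()
{
    const int launches = 10000;

    nullKernel<<<1, 1>>>();        // warm-up: context creation, first-launch cost
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; ++i)
        nullKernel<<<1, 1>>>();
    cudaDeviceSynchronize();       // wait for the whole launch queue to drain
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average cost per empty launch: %.2f us\n", us / launches);
    return 0;
}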

Ouch! It’d be interesting to see this overhead measured for XP / Vista / Win7 / Linux / Leopard / Snow Leopard. For some reason I thought launch overhead was pretty much flat across all platforms, but obviously I’m wrong.

Is WDDM also the cause of the 7-GPU limit in Win7?

Are there other WDDM gotchas?

There is no 7-GPU limit; that’s just as many as I could fit in a machine. Presumably there’s a larger upper bound (16?), but good luck getting a BIOS to enumerate that many.

There are other gotchas related to memory allocation, paging (you can’t really see how much free memory you have because WDDM will page in and out), kernel queueing, TDR… lots of stuff.
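For instance, the free-memory number you get back from the runtime is only a snapshot under WDDM, since the video memory manager can page allocations in and out behind your back. A quick sketch of the query (the caveat, not the call itself, is the WDDM-specific part):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    // Under WDDM this "free" figure is only a snapshot; the OS video memory
    // manager can page allocations in and out, so it is not a hard limit.
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %zu MB, total: %zu MB\n", freeBytes >> 20, totalBytes >> 20);
    return 0;
}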

I translate this as “Use 64 bit Linux, you fool!”

Good luck telling that to all your customers…