Performance difference between a Tesla card and a system where the CUDA GPU is also the display device

I am benchmarking code on different systems. One system has a Tesla card that is used exclusively for CUDA, while on the other system CUDA runs on a GeForce card that is also the display device. I am measuring much higher performance on the Tesla system than I would expect from the results I get on the other system. The CPU is the same in both cases and the other parts of the systems are very similar. I am sure the measurements on both systems are correct.

Now my speculation is as follows: the code results in many rather short kernel calls, i.e. around 80 kernel calls within a time frame of 10 ms. Is it possible that on the system where the card is also used as the display device there is simply a relatively high chance of stalls between kernel calls because the card is occupied by the display driver? The cards in question are a GeForce GTX 280 and a Tesla C1060. The code is mostly limited by memory bandwidth, uses less than 40 MB of GPU memory, and I am getting around 30% better performance on the Tesla system.
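For reference, this is roughly what the launch pattern looks like, as a simplified sketch; the copy kernel, buffer sizes, and launch count here are placeholders rather than my actual code:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for one of my short, bandwidth-bound kernels.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 20;      // ~4 MB per buffer, well under 40 MB total
    const int launches = 80;    // roughly the number of launches per 10 ms frame

    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int k = 0; k < launches; ++k)
        copyKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches: %.3f ms total, %.1f us per launch incl. work\n",
           launches, ms, 1000.0f * ms / launches);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}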

That certainly is a reasonable explanation. I know that people have reported being able to slow down CUDA programs by interacting with the GUI while they were running. You could check whether the problem gets worse if you move a window around on the screen randomly while your code is running. It’s possible that even an idle GUI increases the latency of starting short kernels.

Is the GTX 280 machine running Vista while the Tesla machine is running XP or something like that?

The GTX machine is running Vista while the one with the Tesla runs openSUSE 11.1 (the driver is the most recent version in both cases). The whole thing is not really a problem for me; I was just wondering about the results I am getting. Basically I am benchmarking different algorithms, and for one that tends to make lots of short kernel calls there is this performance difference between the systems. For the other algorithms, which split the work into fewer, longer kernel calls (around 10-20 times fewer calls than the “problematic” algorithm), the results match what I would expect.

And yes, performance decreases a little more when I move the mouse around while the benchmark runs, and the variance of the results increases. In any case this is a valuable insight for me: one really should not benchmark on a card that is also the display device. Or at least one should not try to compare results between a system where the GPU is used exclusively for CUDA and one where it is not.
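As a side note, if anyone wants to check from code whether the card they are benchmarking on is likely to be a display device, the runtime reports whether the execution-timeout watchdog is active. A quick sketch, using nothing beyond the standard device-properties query:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // The run time limit (watchdog) is normally only active on a card
        // that is also driving a display.
        printf("device %d (%s): run time limit %s\n", d, prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled (likely a display device)"
                                             : "disabled");
    }
    return 0;
}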

Welcome to WDDM. Kernel launch overhead is ~3 us on non-WDDM platforms. On WDDM, it’s ~40 us at a minimum and can potentially be much larger. Considering the number of kernels you’re launching in 10 ms, that’s going to add up.
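If you want to see it on your own setup, something like this rough sketch gives a ballpark per-launch figure: just an empty kernel launched in a loop, timed on the host, nothing platform-specific assumed.

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void nullKernel() {}

int main()
{
    const int launches = 10000;

    nullKernel<<<1, 1>>>();        // warm-up: context creation, first-launch cost
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; ++i)
        nullKernel<<<1, 1>>>();
    cudaDeviceSynchronize();       // wait for the whole launch queue to drain
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average cost per empty launch: %.2f us\n", us / launches);
    return 0;
}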

Ouch! It’d be interesting to see this overhead measured for XP / Vista / Win7 / Linux / Leopard / Snow Leopard. For some reason I thought launch overhead was pretty much flat across all platforms, but obviously I’m wrong.

Is WDDM also the cause of the 7-GPU limit in Win7?

Are there other WDDM gotchas?

There is no 7-GPU limit; that’s just as many as I could fit in a machine. Presumably there’s a larger upper bound (16?), but good luck getting a BIOS to enumerate that many.

There are other gotchas related to memory allocation, paging (you can’t really see how much free memory you have because WDDM will page in and out), kernel queueing, TDR… lots of stuff.
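For instance, the free-memory number you get back from the runtime is only a snapshot under WDDM, since the video memory manager can page allocations in and out behind your back. A quick sketch of the query (the caveat, not the call itself, is the WDDM-specific part):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    // Under WDDM this "free" figure is only a snapshot; the OS video memory
    // manager can page allocations in and out, so it is not a hard limit.
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %zu MB, total: %zu MB\n", freeBytes >> 20, totalBytes >> 20);
    return 0;
}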

I translate this as “Use 64 bit Linux, you fool!”

Good luck telling that to all your customers…