I am benchmarking code on two different systems. One system has a Tesla card that is used exclusively for CUDA; on the other system, CUDA runs on a GeForce that is also the display device. I am measuring much higher performance on the Tesla system than the results on the other system would lead me to expect. The CPU is the same in both cases and the other parts of the systems are very similar to each other. I am confident the measurements on both systems are correct.
Now my speculation is as follows: the code results in many rather short kernel calls, i.e. around 80 kernel launches within a time frame of 10 ms. Is it possible that, on the system where the card is also used as the display device, there is a relatively high chance of stalls between kernel calls because the card is periodically occupied by the display driver? The cards in question are a GeForce GTX 280 and a Tesla C1060. The code is largely limited by memory bandwidth, uses less than 40 MB of GPU memory, and I am actually getting around 30% better performance on the Tesla system.
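To check this theory, one way would be to time each launch individually and look for outliers. Here is a minimal sketch using CUDA events; the kernel, sizes, and launch count are illustrative assumptions, not the actual benchmark code:

```cuda
// Sketch: time many short kernel launches individually with CUDA events.
// A few launches taking much longer than the rest would point to stalls
// caused by the display driver grabbing the GPU between launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void shortKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 1.000001f;   // trivial bandwidth-bound work
}

int main() {
    const int n = 1 << 20;                // ~4 MB, well under the 40 MB footprint
    const int launches = 80;              // matches the ~80 calls per 10 ms
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int k = 0; k < launches; ++k) {
        cudaEventRecord(start);
        shortKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("launch %2d: %.3f ms\n", k, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

One caveat: synchronizing after every launch changes the launch pattern compared to letting the 80 calls queue up back-to-back, so it is a diagnostic for per-launch stalls rather than a faithful reproduction of the original timing.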