Linux vs. Windows comparison of cudaDeviceSynchronize overhead

Hi everyone,

I noticed something that I can’t explain, so I thought I would share it with you.

Somewhere in my code I have a loop that executes 6 kernels in a row. I do not use streams, so they all execute serially. Something like this:

for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
   Launch_Kernel1();
   Launch_Kernel2();
   Launch_Kernel3();
   Launch_Kernel4();
   Launch_Kernel5();
   Launch_Kernel6();
}

with nfOfSubSteps = 48 here.

I ran into a problem when I tried to use streams to interleave some of those computations. In particular, I wanted to run Kernel3 and Kernel4 concurrently, as they are the most computationally expensive and are independent of each other. Doing so, I had to introduce a cudaDeviceSynchronize at the beginning of the loop, because Kernel1 takes as input the output of Kernel6. So I had something like this:

for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
   cudaDeviceSynchronize();
   Launch_Kernel1();
   Launch_Kernel2();
   Launch_Kernel3();
   Launch_Kernel4();
   Launch_Kernel5();
   Launch_Kernel6();
}
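
To be explicit, here is a simplified sketch of what the streamed variant looks like. I am assuming here that the Launch_KernelN wrappers are modified to take a stream argument and forward it to the kernel launch, and the event is just one way to express Kernel5’s dependency on Kernel4; the real code has more plumbing than this:

cudaStream_t streamA, streamB;
cudaEvent_t  kernel4Done;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);
cudaEventCreateWithFlags(&kernel4Done, cudaEventDisableTiming);

for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
   cudaDeviceSynchronize();                      // Kernel6 of the previous subStep must be done
   Launch_Kernel1(streamA);
   Launch_Kernel2(streamA);
   Launch_Kernel3(streamA);                      // independent of Kernel4
   Launch_Kernel4(streamB);                      // may overlap with Kernel3
   cudaEventRecord(kernel4Done, streamB);
   cudaStreamWaitEvent(streamA, kernel4Done, 0); // Kernel5 needs both results
   Launch_Kernel5(streamA);
   Launch_Kernel6(streamA);
}
cudaDeviceSynchronize();

cudaEventDestroy(kernel4Done);
cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);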

On Linux it worked nicely and improved performance substantially. But the same code on Windows ran slower than the version without streams; much slower even, which did not make sense. After investigating, I realised the problem was the call to cudaDeviceSynchronize.

I measured the time taken to execute this loop, and on average I obtained the following:

Linux
Loop: 5.1ms
Loop with cudaDeviceSynchronize: 5.6ms

Windows
Loop: 5.7ms
Loop with cudaDeviceSynchronize: 21ms

Yes, 21ms! How come the overhead is so high on Windows?

The weird thing is that when I time the call to cudaDeviceSynchronize() itself, it seems very fast. On Linux I measure about 0.8 microseconds and on Windows about 4.1 microseconds, so there’s a factor of 5 between them. But 4 µs times 48 (the number of subSteps) should only add something like 0.2 ms. So why do I see a 15 ms overhead…
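
For the record, the kind of standalone measurement I mean looks roughly like this: just a sketch with an idle GPU and a simple host timer, not my full benchmark:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Time cudaDeviceSynchronize() on an idle GPU, averaged over many calls.
int main()
{
    cudaFree(0);                       // force context creation before timing

    const int nbCalls = 1000;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < nbCalls; ++i)
        cudaDeviceSynchronize();
    auto stop = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(stop - start).count() / nbCalls;
    printf("cudaDeviceSynchronize: %.2f us per call (idle GPU)\n", us);
    return 0;
}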

If I set the environment variable CUDA_LAUNCH_BLOCKING to 1 (which makes every kernel launch synchronous, so no kernels can run concurrently), it does not make any noticeable difference on Linux: still about 5 ms, which makes sense since my kernels are already serialised. But when I do the same thing on Windows, I measure 92 ms!!!

I tried these tests with both a GTX 580 and a GTX TITAN. The timings are slightly different, but in both cases the overhead added by cudaDeviceSynchronize is negligible on Linux and huge on Windows (tested with CUDA 5.0 and CUDA 5.5).

Could someone explain these differences to me? I’d like to understand them.

This sounds like it is caused by the overhead of CUDA working with the WDDM driver interface used in Windows.

I think this is why (among other reasons) the TCC driver exists, allowing the CUDA device to be controlled outside the usual Windows display driver system. Unfortunately, the TCC driver is only available for Tesla cards.

On my Windows 7 machine, the only verification I have that the TCC driver is being used for the K20 is when I run CUDA-Z.

For the K20c it states the driver is 311.35 (TCC),
while for the 680 it states only 311.35.

Is there any other direct method to verify the TCC driver is working for the Tesla? The device query only gives the driver version number.
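
Something along these lines might work as a more direct check, assuming the toolkit exposes the tccDriver field of cudaDeviceProp (just a sketch):

#include <cstdio>
#include <cuda_runtime.h>

// Report for every CUDA device whether it is running under the TCC driver.
// cudaDeviceProp::tccDriver is 1 for a device using the TCC driver, 0 otherwise.
int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s -> %s\n", dev, prop.name,
               prop.tccDriver ? "TCC driver" : "WDDM (or non-TCC) driver");
    }
    return 0;
}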

Hmm. Thanks, guys. So you’re basically saying that the Windows driver makes everything slower and there is no way for me to fix it. For a $1000 card, that’s disappointing, NVIDIA…

I am currently in a very similar boat, with the WDDM driver being the limiting factor for my code running on a TITAN. It is not the limitation I was expecting, as it artificially caps how fast I can get things to run. Using a K20 is an alternative we have considered, but it is essentially a sledgehammer-to-crack-a-nut approach: we would not make full use of the double precision or ECC memory, and for the price it just is not worthwhile.

I have considered using dynamic parallelism to circumvent it, similar to what you described, but unfortunately I also rely on the cuFFT library, which does not support dynamic parallelism.

It’s not a hardware or software problem on the NVIDIA side; it’s Microsoft’s driver model, starting with Vista. You could use Windows XP 64, though… that sounds awkward, but it works and is not bad in practice. Or Linux, of course.

Sure, if you want to use a GPU both for display and compute on Windows Vista and later, you are stuck with WDDM. However, for compute-only work, one could argue that the restriction of the TCC driver to Tesla cards is an attempt by NVIDIA to “artificially” add some additional value to the Tesla line. I could see how that would annoy some people.

(Fortunately, I do all my work in Linux, so this doesn’t obstruct my work, but I can sympathize.)

So if I have a GTX 680 handling the video out (WDDM) and a K20 (which I believe uses the TCC driver, but I am not sure how to verify), will the K20 use the TCC driver?

I just want to make sure that the Tesla is being fully utilized in Windows 7, and I am not sure what is going on behind the scenes with this particular configuration.