Hi everyone,
I noticed something that I can't explain, so I thought I would share it with you.
Somewhere in my code I have a loop that executes 6 kernels in a row. I do not use streams, so they are all executed serially. Something like this:
for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
    Launch_Kernel1();
    Launch_Kernel2();
    Launch_Kernel3();
    Launch_Kernel4();
    Launch_Kernel5();
    Launch_Kernel6();
}
with nfOfSubSteps = 48 here.
I found a problem when I tried to make use of streams to interleave some of those computations. In particular, I wanted to run Kernel3 and Kernel4 concurrently, as they are the most computationally expensive and are independent of each other. To do so, I had to introduce a cudaDeviceSynchronize at the beginning of the loop, because Kernel1 takes as input the output of Kernel6 from the previous sub-step. So I had something like this (a fuller sketch with the streams is shown after the loop):
for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
    cudaDeviceSynchronize();
    Launch_Kernel1();
    Launch_Kernel2();
    Launch_Kernel3();
    Launch_Kernel4();
    Launch_Kernel5();
    Launch_Kernel6();
}
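For completeness, the streamed version looks roughly like this. It is a simplified sketch, not my exact code: stream3 and stream4 are just illustrative cudaStream_t handles, and I assume here that the Launch_* wrappers can take a stream argument.
cudaStream_t stream3, stream4;
cudaStreamCreate(&stream3);
cudaStreamCreate(&stream4);

for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
    // Make sure Kernel6 of the previous sub-step has finished,
    // since Kernel1 consumes its output.
    cudaDeviceSynchronize();
    Launch_Kernel1();
    Launch_Kernel2();
    Launch_Kernel3(stream3); // independent of Kernel4, launched in its own stream
    Launch_Kernel4(stream4); // so the two can overlap on the GPU
    Launch_Kernel5();
    Launch_Kernel6();
}

cudaStreamDestroy(stream3);
cudaStreamDestroy(stream4);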
On Linux it worked nicely: it improved performance substantially. But the same code on Windows appeared to run slower than without streams. Much slower, even, which did not make sense. After investigating, I realised the problem was the call to cudaDeviceSynchronize.
I measured the time taken to execute this loop, and on average I obtained the following:
Linux
Loop: 5.1ms
Loop with cudaDeviceSynchronize: 5.6ms
Windows
Loop: 5.7ms
Loop with cudaDeviceSynchronize: 21ms
Yes, 21ms! How come the overhead is so high on Windows?
The weird thing is that when I time an individual call to cudaDeviceSynchronize(), it seems very fast: on Linux I measure about 0.8 microseconds, and on Windows about 4.1 microseconds. So there is roughly a factor of 5 between them. But 4 us times 48 (the number of subSteps) should add up to something like 0.2 ms, so why do I see a roughly 15 ms overhead…
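For reference, a simplified sketch of how such a per-call measurement could look (this is not my exact timing code, just the idea: a host-side std::chrono timer around a single call):
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Returns the duration of one cudaDeviceSynchronize() call in microseconds.
// With no pending work on the GPU this mostly measures the call overhead itself.
static double TimeOneDeviceSynchronize()
{
    auto start = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();
    auto stop  = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}

int main()
{
    cudaFree(0); // force context creation so it is not included in the timing
    const int nbCalls = 1000;
    double totalUs = 0.0;
    for (int i = 0; i < nbCalls; i++)
        totalUs += TimeOneDeviceSynchronize();
    printf("average cudaDeviceSynchronize: %f us\n", totalUs / nbCalls);
    return 0;
}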
If I set the environment variable CUDA_LAUNCH_BLOCKING to 1 (which makes every kernel launch synchronous, so no kernels can run concurrently), it does not make any noticeable difference on Linux: still about 5 ms, which makes sense since my kernels are already serialised. But if I do the same thing on Windows, I measure 92 ms!!!
I ran these tests with both a GTX 580 and a GTX TITAN. The timings are slightly different, but in both cases the difference when I add cudaDeviceSynchronize is negligible on Linux and huge on Windows (tested with CUDA 5.0 and CUDA 5.5).
Could someone explain these differences to me? I'd like to understand what is going on.