Hi everyone,
I noticed something that I can't explain, so I thought I would share it with you.
Somewhere in my code I have a loop that executes 6 kernels in a row. I do not use streams, so they are all executed serially. Something like this:
for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
    Launch_Kernel1();
    Launch_Kernel2();
    Launch_Kernel3();
    Launch_Kernel4();
    Launch_Kernel5();
    Launch_Kernel6();
}
with nfOfSubSteps = 48 here.
I found a problem when I tried to make use of streams to interleave some of those computations. In particular, I wanted to run Kernel3 and Kernel4 concurrently, as they are the most computationally expensive and are independent of each other. To do so, I had to introduce a cudaDeviceSynchronize at the beginning of the loop, because Kernel1 takes as input the output of Kernel6 from the previous sub-step. So I had something like this (a fuller sketch with the streams is shown after the loop):
for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
    cudaDeviceSynchronize();
    Launch_Kernel1();
    Launch_Kernel2();
    Launch_Kernel3();
    Launch_Kernel4();
    Launch_Kernel5();
    Launch_Kernel6();
}
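For completeness, the streamed version looks roughly like this. It is a simplified sketch, not my exact code: stream3 and stream4 are just illustrative cudaStream_t handles, and I assume here that the Launch_* wrappers can take a stream argument.
cudaStream_t stream3, stream4;
cudaStreamCreate(&stream3);
cudaStreamCreate(&stream4);

for (unsigned int subStep = 0; subStep < nfOfSubSteps; subStep++)
{
    // Make sure Kernel6 of the previous sub-step has finished,
    // since Kernel1 consumes its output.
    cudaDeviceSynchronize();
    Launch_Kernel1();
    Launch_Kernel2();
    Launch_Kernel3(stream3); // independent of Kernel4, launched in its own stream
    Launch_Kernel4(stream4); // so the two can overlap on the GPU
    Launch_Kernel5();
    Launch_Kernel6();
}

cudaStreamDestroy(stream3);
cudaStreamDestroy(stream4);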
On Linux it worked nicely: it improved performance substantially. But the same code on Windows appeared to run slower than without streams. Much slower, even, which did not make sense. After investigating, I realised the problem was the call to cudaDeviceSynchronize.
I measured the time taken to execute this loop, and on average I obtained the following:
Linux
Loop: 5.1ms
Loop with cudaDeviceSynchronize: 5.6ms
Windows
Loop: 5.7ms
Loop with cudaDeviceSynchronize: 21ms
Yes, 21ms! How come the overhead is so high on Windows?
The weird thing is that when I time an individual call to cudaDeviceSynchronize(), it seems very fast: on Linux I measure about 0.8 microseconds, and on Windows about 4.1 microseconds. So there is roughly a factor of 5 between them. But 4 us times 48 (the number of subSteps) should add up to something like 0.2 ms, so why do I see a roughly 15 ms overhead…
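For reference, a simplified sketch of how such a per-call measurement could look (this is not my exact timing code, just the idea: a host-side std::chrono timer around a single call):
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Returns the duration of one cudaDeviceSynchronize() call in microseconds.
// With no pending work on the GPU this mostly measures the call overhead itself.
static double TimeOneDeviceSynchronize()
{
    auto start = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();
    auto stop  = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}

int main()
{
    cudaFree(0); // force context creation so it is not included in the timing
    const int nbCalls = 1000;
    double totalUs = 0.0;
    for (int i = 0; i < nbCalls; i++)
        totalUs += TimeOneDeviceSynchronize();
    printf("average cudaDeviceSynchronize: %f us\n", totalUs / nbCalls);
    return 0;
}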
If I set the environment variable CUDA_LAUNCH_BLOCKING to 1 (which makes every kernel launch synchronous, so no kernels can run concurrently), it does not make any noticeable difference on Linux: still about 5 ms, which makes sense since my kernels are already serialised. But if I do the same thing on Windows, I measure 92 ms!!!
I ran these tests with both a GTX 580 and a GTX TITAN. The timings are slightly different, but in both cases the difference when I add cudaDeviceSynchronize is negligible on Linux and huge on Windows (tested with CUDA 5.0 and CUDA 5.5).
Could someone explain these differences to me? I'd like to understand what is going on.