Different execution times: Ubuntu vs Windows


I have been testing the same code on the same computer, and for some reason when I measure times on both operating systems, there is a very big difference.

I am using the same configuration (CUDA 7.5) and exactly the same code on both OSes. For example, these are typical results I have been getting during testing:

Ubuntu: 38.99 seconds
Windows: 118.72 seconds

Both OSes have CUDA 7.5.

Ubuntu: 15.04
Windows: 10

Perhaps the only difference is that Windows has the latest driver version, while Ubuntu has the driver version that came with CUDA 7.5.

Can anyone give a hint as to why this is happening?

Both programs were compiled and run as release builds.

It is not clear what or how you are measuring. Does the timed portion of the code include CUDA startup overhead? If you don’t know, insert a call to cudaFree(0) prior to the measured portion of your program. Does the timed portion include potentially expensive host activity such as calls to memory allocation APIs? Move such API calls before the timed portion.
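As a sketch of the advice above (the buffer name and size are placeholders, not taken from the original program), startup overhead and allocations can be hoisted out of the timed region like this:

```
cudaFree(0);               // absorb one-time CUDA context creation cost here

float *d_data;             // hypothetical device buffer
cudaMalloc(&d_data, N * sizeof(float));  // allocate before timing, not inside it

// --- only the work below this point should fall inside the timed portion ---
```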

You can use the CUDA profiler to check whether there are any noticeable differences in the timing of GPU kernels. If the only difference between your two systems is the OS, that shouldn’t be the case. If there are such differences, check for bugs or for code paths conditional on OS or toolchain.


My code has two parts; the first one is the initialization, setting initial values for the device vectors.

The second part is a while loop that iterates N times (N is defined by me). I am measuring only the time spent in the while loop, taking time readings every few iterations.

Example: N = 25,000, and I usually take the elapsed time every 5,000 iterations.

I am taking the measurements using CUDA events:

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

while (counter < maxT) {
    // ... kernel launches and other work ...

    if (printResults && (counter % Tcicle == 0 || counter == maxT - 1)) {
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // wait for the stop event before reading it
        cudaEventElapsedTime(&time, start, stop);
        printf("Time for the kernel: %f sec\n", time / 1000.0f);
    }
}



And for some reason the Visual Profiler gets stuck generating the timeline on both Ubuntu and Windows.
The only way I can get performance measurements is with the Visual Studio tools; the Visual Profiler locks up and nothing happens.


There is too little detail in the above to give any specific recommendations. If you just want to time one particular kernel from the host side, I would suggest using the following sequence:

cudaDeviceSynchronize(); // wait until all previous GPU activity has finished
start = timer();
kernel<<<grid, block>>>(/* args */); // the kernel you want to time
cudaDeviceSynchronize(); // wait until the kernel has finished
stop = timer();

Here, timer() would be a high-resolution system timer, such as gettimeofday() on Linux. This methodology has a bit of overhead due to the call to cudaDeviceSynchronize(), typically >= 20 microseconds, so this approach will not work well for very short kernels, but since your kernel apparently runs for seconds, that shouldn’t be a problem. I have used this successfully for timing CUBLAS API calls.

Note that if you are on Windows and use the WDDM driver, your timing activities may be affected by artifacts particular to that environment, such as launch batching by the CUDA driver. There are techniques for force-flushing the WDDM driver’s launch queue, but I don’t recall what they are.

According to an old post from Greg@NV - https://devtalk.nvidia.com/default/topic/548639/cuda-programming-and-performance/is-wddm-causing-this-/post/3840052/#3840052 - cudaEventQuery(0) can be called to flush the software queue.
I tried this recently and, if I recall correctly, it did not work; the call ended with an error.
It is possible that cudaStreamQuery(stream) could force the queue to be flushed, but I did not test this.
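Combining the two suggestions above into one untested sketch, with the caveats already noted (cudaEventQuery(0) reportedly failed for one poster, and cudaStreamQuery() was not tried):

```
kernel<<<grid, block>>>(d_data);  // launch sits in the WDDM software queue

// Candidate ways to force-flush the queue (both unverified here):
cudaEventQuery(0);                // per Greg@NV's old post; one report of it erroring
// cudaStreamQuery(0);            // alternative mentioned above, not tested
```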

In general, once a kernel has been launched on a GPU, the OS should not matter. Have you tried running more basic applications under both OSes and compared the results? There are a number in the CUDA samples SDK, such as the nbody sample and the cuBLAS SGEMM matrix multiplication sample. Those numbers should be about the same regardless of OS.

Does CUDA-Z show 100% GPU utilization while your program is running?

Is your program gobbling up most of the GPU’s memory? There have been reports of slowdowns with large memory allocations on NVIDIA GPUs running under WDDM drivers.

What do you mean by

“There have been reports of slowdowns with large memory allocations on NVIDIA GPUs running under WDDM drivers”?

I couldn’t find those reports on the forum. Where can I find them?

I have a similar issue, and I found that allocating and freeing memory (cudaMalloc, cudaFree) is much slower on Windows than on Ubuntu 14.04.
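To check whether allocation overhead accounts for the difference, the cudaMalloc/cudaFree calls can be timed directly on both OSes. A minimal sketch (the 512 MB size is arbitrary, and timer() is the wall-clock timer discussed earlier in the thread):

```
double t0 = timer();
float *d_buf;
cudaMalloc(&d_buf, 512UL * 1024 * 1024);  // 512 MB test allocation, arbitrary size
double t1 = timer();
cudaFree(d_buf);
double t2 = timer();
printf("cudaMalloc: %.3f ms, cudaFree: %.3f ms\n",
       (t1 - t0) * 1e3, (t2 - t1) * 1e3);
```

Running this on both systems with a range of sizes would show whether the slowdown grows with allocation size, as the WDDM reports suggest.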