Big performance difference between Linux and Windows, is that normal?

I have the same code and the same GTX 1080 Ti, but the performance varies a lot between Linux and Windows. Is that normal? And how can I improve the performance on Windows?

           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   34.24%  90.191ms      5  18.038ms  17.822ms  18.511ms  GsDenoise(float*, float*, int, int)
                  25.24%  66.492ms      5  13.298ms  3.5114ms  19.926ms  findMinMaxRow(float*, unsigned short*, int, int)
                  23.67%  62.348ms      5  12.470ms  9.7317ms  15.070ms  [CUDA memcpy HtoD]
                  14.52%  38.238ms      5  7.6476ms  7.4774ms  7.8828ms  rawToGray(unsigned short*, float*, int, int)
                   1.18%  3.0995ms      5  619.90us  531.51us  777.41us  nomalization(float*, unsigned short*, float*, int, int)
                   1.15%  3.0195ms      5  603.91us  495.99us  907.68us  GsFit(unsigned short*, float*, float*, int, int)
                   0.00%  10.144us      6  1.6900us     896ns  2.6560us  [CUDA memcpy DtoH]


           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   79.02%  22.419ms      5  4.4838ms  4.2993ms  4.5552ms  [CUDA memcpy HtoD]
                   9.35%  2.6529ms      5  530.59us  517.90us  544.94us  findMinMaxRow(float*, unsigned short*, int, int)
                   8.24%  2.3386ms      5  467.72us  465.81us  469.87us  GsDenoise(float*, float*, int, int)
                   1.59%  451.82us      5  90.364us  90.051us  90.978us  nomalization(float*, unsigned short*, float*, int, int)
                   1.12%  317.32us      5  63.464us  60.354us  74.754us  rawTypeTransform(unsigned short*, float*, int, int)
                   0.33%  94.595us      6  15.765us  4.4480us  21.601us  [CUDA memcpy DtoH]
                   0.33%  94.243us      5  18.848us  18.209us  19.168us  GsFit(unsigned short*, float*, float*, int, int)
                   0.01%  2.1760us      1  2.1760us  2.1760us  2.1760us  EvaluateError(float*, float*, int, int)

In my experience, the application runs faster on Linux, mainly because the GPU is usually not also rendering the UI, as it is on Windows. Also, if you use several streams of execution, Windows does not let them run in parallel as well, because of the behind-the-scenes management of the UI and/or other applications that happen to be running in the background.

However, on Linux (as long as you run it from the command line, of course) this does not happen, and the GPU multitasks much better than on Windows.
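One way to check the multi-stream claim on your own machine is to time the same small kernels launched in a single stream and then spread over several streams. The following is only a minimal sketch: the kernel `spin`, the helpers `runMs` and `speedup`, and all sizes are made up for illustration, and the result will depend on the GPU, the driver and whatever else is using the device.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical busy-work kernel: a long dependent loop so that each launch
// runs long enough for overlap (or the lack of it) to show up in the timing.
__global__ void spin(float* out, int iters) {
    float v = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) v = v * 0.999f + 0.5f;
    out[threadIdx.x] = v;
}

// Ratio of serial time to concurrent time; 0 if the input is unusable.
double speedup(double serialMs, double concurrentMs) {
    return concurrentMs > 0.0 ? serialMs / concurrentMs : 0.0;
}

// Launches `launches` one-block kernels round-robin over the given streams
// and returns the wall-clock time in milliseconds until all have finished.
double runMs(const std::vector<cudaStream_t>& streams, int launches,
             float* buf, int iters) {
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        cudaStream_t s = streams[i % streams.size()];
        spin<<<1, 64, 0, s>>>(buf + i * 64, iters);
    }
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int launches = 4, iters = 1 << 20;
    float* buf = nullptr;
    if (cudaMalloc(&buf, launches * 64 * sizeof(float)) != cudaSuccess) {
        std::printf("No usable CUDA device\n");
        return 1;
    }
    std::vector<cudaStream_t> one(1), many(launches);
    cudaStreamCreate(&one[0]);
    for (auto& s : many) cudaStreamCreate(&s);

    runMs(one, 1, buf, iters);  // warm-up, not measured
    double serial = runMs(one, launches, buf, iters);
    double concurrent = runMs(many, launches, buf, iters);
    std::printf("1 stream: %.3f ms, %d streams: %.3f ms, ratio: %.2fx\n",
                serial, launches, concurrent, speedup(serial, concurrent));

    cudaStreamDestroy(one[0]);
    for (auto& s : many) cudaStreamDestroy(s);
    cudaFree(buf);
    return 0;
}
```

A ratio close to the number of streams suggests the kernels overlapped; a ratio close to 1 suggests they were serialized. Running the same binary logic on both operating systems gives a like-for-like comparison.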

If you want to run the GPU on Windows in a “non-managed” mode, you will need a second GPU to render the interface and some other tweaks that I am not aware of at the moment.
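To see how the driver is treating each GPU, you can query two device attributes through the CUDA runtime. This is a rough sketch, not a full diagnosis: `describeMode` is a helper invented for this example, and reading a set run-time limit as "the GPU is also handled by the display system" is an assumption that usually holds but is not guaranteed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turns the two attribute values into a short label.
const char* describeMode(int watchdog, int tcc) {
    if (tcc) return "TCC driver (compute only)";
    if (watchdog) return "kernel run-time limit active (likely shared with display)";
    return "no kernel run-time limit reported";
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        int watchdog = 0, tcc = 0;
        // 1 if kernels on this device are subject to a run-time limit.
        cudaDeviceGetAttribute(&watchdog, cudaDevAttrKernelExecTimeout, dev);
        // 1 if the device is running under the TCC driver model.
        cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, dev);

        std::printf("Device %d: %s -> %s\n", dev, prop.name,
                    describeMode(watchdog, tcc));
    }
    return 0;
}
```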

Hope this helps a bit :)

Thanks, Dread13. I found that the reason lies in the VS compiler, and it is better to use nvcc.

I’m currently comparing OpenCL and CUDA and was facing the same problem. I found out that CUDA runs really slowly in Visual Studio debug mode. You can simply change the solution configuration to Release. Took me 2 days to find this out :-/

Generally speaking, debug builds of CUDA programs will run slow independent of host platform, as for debug builds all compiler optimizations are disabled to ensure observability of all objects.
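A quick way to confirm which kind of build you are actually timing is to have the program report it. The sketch below assumes nvcc defines `__CUDACC_DEBUG__` when device debug mode (`-G`) is on, as its documentation describes; `polyKernel` and `deviceBuildMode` are illustrative names, not part of the original poster's code. It times a small kernel with CUDA events after one warm-up launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a short arithmetic loop per element.
__global__ void polyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i], acc = 0.0f;
    for (int k = 0; k < 64; ++k) acc = acc * 0.5f + x;
    data[i] = acc;
}

// Reports whether this file was compiled in nvcc device debug mode (-G).
const char* deviceBuildMode() {
#ifdef __CUDACC_DEBUG__
    return "device debug (-G), optimizations off";
#else
    return "device optimized";
#endif
}

int main() {
    const int n = 1 << 22, threads = 256, reps = 10;
    const int blocks = (n + threads - 1) / threads;
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        std::printf("No usable CUDA device\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    polyKernel<<<blocks, threads>>>(d, n);  // warm-up, not measured
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) polyKernel<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("Build: %s\n", deviceBuildMode());
    std::printf("Average kernel time: %.3f ms\n", ms / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Building it once with `nvcc -O3 bench.cu -o bench` and once with `nvcc -G bench.cu -o bench_dbg` (the file name is arbitrary) lets you compare the two modes on the same machine. In Visual Studio, the Debug configuration of a CUDA project typically turns on the equivalent of `-G`.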

OpenCL also uses the CUDA toolkit, and without optimizations enabled its performance drops as well. Changing the solution configuration made my code five times faster :-)

Of course, using Release mode instead of Debug is a must when measuring performance in Visual Studio on Windows :) But still, even in Release, with no debug info whatsoever in the code and every optimization you can possibly enable, I found that Linux was faster for the reasons I explained before. Take that into consideration when extracting results for any chart or table!