Big performance difference between Linux and Windows, is that normal?

Hi,
I have the same code and the same GTX 1080 Ti, but the performance varies a lot between Linux and Windows. Is that normal? And how can I improve the performance on Windows?

Windows:
Type             Time(%)   Time      Calls  Avg       Min       Max       Name
GPU activities:  34.24%    90.191ms  5      18.038ms  17.822ms  18.511ms  GsDenoise(float*, float*, int, int)
                 25.24%    66.492ms  5      13.298ms  3.5114ms  19.926ms  findMinMaxRow(float*, unsigned short*, int, int)
                 23.67%    62.348ms  5      12.470ms  9.7317ms  15.070ms  [CUDA memcpy HtoD]
                 14.52%    38.238ms  5      7.6476ms  7.4774ms  7.8828ms  rawToGray(unsigned short*, float*, int, int)
                  1.18%    3.0995ms  5      619.90us  531.51us  777.41us  nomalization(float*, unsigned short*, float*, int, int)
                  1.15%    3.0195ms  5      603.91us  495.99us  907.68us  GsFit(unsigned short*, float*, float*, int, int)
                  0.00%    10.144us  6      1.6900us  896ns     2.6560us  [CUDA memcpy DtoH]

Linux:

Type             Time(%)   Time      Calls  Avg       Min       Max       Name
GPU activities:  79.02%    22.419ms  5      4.4838ms  4.2993ms  4.5552ms  [CUDA memcpy HtoD]
                  9.35%    2.6529ms  5      530.59us  517.90us  544.94us  findMinMaxRow(float*, unsigned short*, int, int)
                  8.24%    2.3386ms  5      467.72us  465.81us  469.87us  GsDenoise(float*, float*, int, int)
                  1.59%    451.82us  5      90.364us  90.051us  90.978us  nomalization(float*, unsigned short*, float*, int, int)
                  1.12%    317.32us  5      63.464us  60.354us  74.754us  rawTypeTransform(unsigned short*, float*, int, int)
                  0.33%    94.595us  6      15.765us  4.4480us  21.601us  [CUDA memcpy DtoH]
                  0.33%    94.243us  5      18.848us  18.209us  19.168us  GsFit(unsigned short*, float*, float*, int, int)
                  0.01%    2.1760us  1      2.1760us  2.1760us  2.1760us  EvaluateError(float*, float*, int, int)
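
Side note for anyone reading these numbers: the HtoD copies average roughly 12.5 ms on Windows versus roughly 4.5 ms on Linux. If the host buffers are plain pageable allocations, page-locked (pinned) host memory often narrows that gap, though whether it explains the whole difference here is only a guess. A minimal sketch, with a made-up buffer size:

#include <cuda_runtime.h>

int main() {
    const size_t n = 2048 * 2048;   // hypothetical image size
    float *h_img = nullptr, *d_img = nullptr;

    // Pinned allocation instead of malloc()/new: the driver can DMA
    // directly from this buffer, which usually speeds up HtoD copies.
    cudaHostAlloc((void **)&h_img, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_img, n * sizeof(float));

    // ... fill h_img with the raw image ...

    cudaMemcpy(d_img, h_img, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_img);
    cudaFreeHost(h_img);
    return 0;
}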

In my experience, running the application on Linux is faster, mainly because there the GPU is usually not also rendering the UI, as it is on Windows. Also, if you use several streams of execution, Windows will not give you good parallelism between the streams, because of this behind-the-scenes management of the UI and/or of other applications that happen to be running in the background.

However, on Linux (as long as you run it from the command line, of course) this does not happen, and the GPU is capable of multitasking much better than on Windows.
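
To illustrate the kind of stream concurrency being described (the kernel and sizes here are placeholders, not the code from the question):

#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // trivial placeholder work
}

int main() {
    const int nStreams = 4, n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *d_buf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_buf[s], n * sizeof(float));
        // Kernels in different streams may overlap on Linux; under the
        // Windows display driver, command batching can serialize them.
        dummyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n);
    }
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
        cudaFree(d_buf[s]);
    }
    return 0;
}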

If you want to run the GPU on Windows in a “non-managed” mode, you will need a second GPU to render the interface and some other tweaks that I am not aware of at the moment.
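
For completeness: that "non-managed" mode is the TCC driver model, and as far as I know GeForce cards such as the 1080 Ti cannot use it (it is limited to Tesla/Quadro-class boards). You can at least check from code which driver model a device is running; a minimal sketch:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // tccDriver is 1 when the device runs the compute-only TCC driver;
        // it is 0 under WDDM, where Windows also schedules the GPU for the UI.
        printf("%s: %s\n", prop.name,
               prop.tccDriver ? "TCC" : "WDDM (or non-Windows)");
    }
    return 0;
}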

Hope this helps a bit :)

Thanks, Dread13. I found that the reason lies in the VS compiler, and that it is better to use nvcc.

I’m currently comparing OpenCL and CUDA and was facing the same problem. It turned out that in Visual Studio’s debug mode, CUDA runs really slowly. You can simply change the solution configuration to Release. Took me 2 days to find this out :-/

Generally speaking, debug builds of CUDA programs will run slowly regardless of the host platform, since in debug builds all compiler optimizations are disabled to ensure observability of all objects.
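
To make that concrete: with nvcc, device-side debug code is produced by the -G flag, and that flag is what disables the device optimizations; a release-style build simply drops it. A small self-contained example (file and kernel names made up) that can be built both ways and compared:

// Debug build (slow device code):  nvcc -G -g example.cu -o example
// Release build:                   nvcc -O3   example.cu -o example
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.5f;   // compiled without optimizations under -G
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}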

OpenCL on NVIDIA also uses the CUDA toolkit, so its debug builds likewise run without optimizations, which slows performance down. My gain from changing the solution configuration was about a five-fold speedup :-)

Of course, using Release mode instead of Debug is a must for measuring performance in Visual Studio on Windows :) But still, even in Release, with no debug info whatsoever in the code and with all the optimizations you can possibly enable, I found that Linux was faster, for the reasons I explained before. Take that into consideration when extracting results for any chart or table!
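
On that note, if anyone wants to sanity-check their Release-mode numbers without a profiler, CUDA events are an easy way to time a kernel on either platform. A minimal sketch with a placeholder kernel:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];   // placeholder workload
}

int main() {
    const int n = 1 << 22;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    work<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}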