Hello everyone. A few days back I posted a similar topic, but that case was with a GTX 750 Ti.
Now I've managed to get another GPU for the same PC. It's quite strange how the run times of the same code on Ubuntu are up to 4 times faster than on Windows.
In both cases I am using the same PC; I just switch the OS, and the code is identical for both versions.
I don't know exactly why that might be. My code makes heavy use of Thrust, cuBLAS, and ArrayFire, as well as raw CUDA kernels.
I have also been experiencing some issues on Windows with video games: in a few games the GPU utilization is pretty low, around 30-50%.
Can anyone give me some direction or ideas about what could be happening?
OS: Windows 10 / Ubuntu 15
CPU: AMD FX-6300
GPU: Gigabyte GTX 970 Turbo Twin
RAM: 8 GB
Motherboard: MSI 970A-G43
For Windows, are you building debug projects in Visual Studio, or release projects?
In general, performance differences between Windows (WDDM) and Linux don't surprise me. I'm not sure that 4x is typical, but that may be due to how you are building the project.
A given CUDA kernel, once it is started, should run in approximately the same time (assuming the same data and input conditions) on Windows vs. Linux, although the amount of memory allocated on the GPU can apparently affect this. You should be able to verify this with a profiler.
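If you want a quick check without a profiler, one option is to bracket a single kernel with CUDA events and compare the reported time on both OSes. A minimal sketch, with a made-up placeholder kernel rather than anything from your actual application:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute one of your real kernels
__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    myKernel<<<(n + 255) / 256, 256>>>(d, n);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d, n);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);       // compare this number between Windows and Linux

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}

If that per-kernel number is essentially the same on both systems, the 4x gap is coming from somewhere else (host code, launch overhead, synchronization, or the frameworks).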
Of course, this is going to be hard to manage with a complicated application that uses multiple frameworks. If you want to understand the difference, you may need to start narrowing the problem down or subdividing it.
Hi. On both OSes I am using a release build of the CUDA app.
I don't know exactly why this is happening, but the time difference is quite big. On Ubuntu some iterative runs take around 400 seconds, while on Windows they can take around 1400 seconds, both with exactly the same configuration.
I have another issue: for some reason the Visual Profiler gets stuck creating the timeline. The only way I can get metrics is by using the Visual Studio Nsight tools, and I don't know why that happens either.
The profilers generally have more difficulty on large, complicated applications. There’s a variety of reasons for this, but a simple one is that various hardware counters can overflow, leading to a loss of profiling data.
The usual suggestion is to simplify the application to either narrow down the problem, or subdivide the problem.
If you have access to the source code for the application, you can usually help the profiler out by turning on and off profiling for just a short segment of the application, using the profiler API.
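As a rough sketch of what that can look like (the kernel, loop, and iteration indices here are arbitrary placeholders), capturing only a short steady-state slice of a long run:

#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one iteration of the real workload
__global__ void workKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    for (int iter = 0; iter < 1000; ++iter) {
        if (iter == 100) cudaProfilerStart();        // start capturing once the run has reached steady state
        workKernel<<<(n + 255) / 256, 256>>>(d, n);  // stands in for one iteration of the app
        if (iter == 110) cudaProfilerStop();         // stop after ~10 iterations to keep the trace small
    }
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}

Run it under nvprof --profile-from-start off [your app name] so that only the region between cudaProfilerStart() and cudaProfilerStop() is traced.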
You do not need to use the Visual Profiler. You can get a basic profile (as well as very sophisticated profiles) from the command line profiler, nvprof:
(1) #include <cuda_profiler_api.h> in your source code
(2) call cudaProfilerStop() before the application exit point
(3) run nvprof --print-gpu-trace [your app name]
This should produce a list of all kernel invocations and H/D copies. If you use identical hardware for Linux and Windows, the timing of these should match within 2% or so. If that is not the case, your build settings or application configurations are probably not the same.
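To make those three steps concrete, here is a minimal sketch (the kernel, array size, and file names are placeholders, not taken from the application discussed above):

// Build: nvcc -O3 -o trace_demo trace_demo.cu
// Run:   nvprof --print-gpu-trace ./trace_demo
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h = new float[n]();
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // appears as an H->D copy in the trace
    scale<<<(n + 255) / 256, 256>>>(d, n);                         // appears as a kernel launch
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // appears as a D->H copy

    cudaFree(d);
    delete[] h;
    cudaProfilerStop();   // flush profiling data before the application exits
    return 0;
}

Collecting the same trace on both OSes and comparing the per-kernel and per-copy times should show fairly quickly whether the gap is in the GPU work itself or somewhere else.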
“To collect all the events and metrics that are needed for the guided analysis system, you can simply use the --analysis-metrics option along with the --kernels option to select the kernel(s) to collect events and metrics for. See Remote Profiling for more information.”
Best I know, under the hood nvprof uses the same infrastructure as the Visual Profiler. Given the problems you described above with a (presumably) complex and long-running application overwhelming the profiler’s processing capabilities, I would suggest starting out with some simple profiles first to see if you can exclude things like kernel run time from the list of potential culprits, and narrow down the areas of interest.
You may also want to simplify tracing by turning profiling on/off programmatically, which will generate shorter, more manageable traces. This would also help mitigate the risk of hardware counter overflows pointed out by txbob. Profiling a short segment of steady-state execution may be sufficient.
Glad to hear that using nvprof solved your profiling issues. I think nvprof may be under-appreciated by many CUDA developers, possibly because command line tools are considered 20th century technology.
I warmly recommend it, however, since it is a tool that can scale from simple profiling tasks to quite sophisticated ones, with a user interface that I consider well thought out. And as you found out, if visualization is ultimately required, it is trivial to import the data into the Visual Profiler.