163x performance boost on Fedora 28 vs Windows 10?

I typically do all my CUDA work in Linux, but today I was setting up CUDA under Windows 10 in order to help a student through the setup process. I was able to get CUDA installed and can compile and run the CUDA code samples under Windows 10, as well as compile my own CUDA code via command prompt with nvcc. However, when I ran the nbody simulation sample in Windows the performance was abysmal. Booting my laptop into Fedora and running the same sample code gives a 163x performance boost.

Additionally, while my own CUDA code compiles with no errors and executes without any error messages, it does not return the expected results, i.e. everything that is calculated on the GPU is simply 0. This is code that works as expected when compiled and run in Fedora.

Has anyone else experienced similar issues? My guess is that it has something to do with the difference in the way Optimus technology is handled on Linux (via Bumblebee) versus Windows, though if anything I would have expected the officially supported Windows implementation to work better than the community-made hack that is Bumblebee on Linux.

Here are some specifics about the performance disparity:

Fedora 28: nbody --benchmark
n = 5120, time for 10 iterations = 5.825 ms
45.003 billion interactions per second
900.066 single-precision GFLOPs at 20 FLOPs per interaction

Windows 10: nbody --benchmark
n = 5120, time for 10 iterations = 948.492 ms
0.276 billion interactions per second
5.528 single-precision GFLOPs at 20 FLOPs per interaction

Both tests were run on my laptop with the following specifications:
Model: Lenovo Ideapad Y700 15ISK
CPU: Intel Core i7-6700HQ (4 core/8 thread)
IGPU: Intel HD Graphics 530
System Memory: 32 GB DDR4
NVIDIA Discrete GPU: GTX 960M 4 GB
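
As a cross-check on the figures above, the reported throughput can be reproduced from n, the iteration count, and the elapsed time. This is only a sketch: it assumes the sample counts n × n interactions per iteration (which is consistent with the printed numbers), and the function names are made up for illustration, not taken from the nbody source.

```cpp
#include <cassert>
#include <cstdio>

// Billions of pairwise interactions per second, assuming n * n
// interactions per iteration (an assumption that matches the output).
double billion_interactions_per_second(int n, int iterations, double time_ms) {
    assert(time_ms > 0.0);
    const double interactions = static_cast<double>(n) * n * iterations;
    return interactions / (time_ms * 1e-3) / 1e9;
}

// GFLOP/s at the 20 FLOPs per interaction quoted in the benchmark output.
double gflops(double billion_interactions, double flops_per_interaction = 20.0) {
    return billion_interactions * flops_per_interaction;
}

// How many times longer the slow run took than the fast run.
double slowdown(double slow_ms, double fast_ms) {
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}

// Prints the derived figures for the two runs quoted in this post.
void print_summary() {
    const double linux_rate = billion_interactions_per_second(5120, 10, 5.825);
    const double windows_rate = billion_interactions_per_second(5120, 10, 948.492);
    std::printf("Fedora:  %.3f billion interactions/s, %.1f GFLOP/s\n",
                linux_rate, gflops(linux_rate));
    std::printf("Windows: %.3f billion interactions/s, %.1f GFLOP/s\n",
                windows_rate, gflops(windows_rate));
    std::printf("Slowdown on Windows: %.1fx\n", slowdown(948.492, 5.825));
}
```

Calling print_summary() from main gives roughly 45.0 and 0.276 billion interactions per second, and a ratio of about 163, which is where the number in the title comes from.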

Are you building a debug project on Windows? That will slow things down a lot. Build a release project.

RIGHT! Sorry. Like I mentioned, I usually use Linux for CUDA with makefiles. Not used to having to switch IDEs from debug to release mode. After that the nbody sample runs much faster, but still slower than on Linux. The Windows run of nbody --benchmark now took 7.083 ms, or about 1.2 times longer than on Linux.

I also still cannot explain why my own code compiled via the command line with nvcc produces the correct results on Linux, but not Windows. The only includes are from the C++ standard library or the CUDA SDK. The portion of the code that calculates on the CPU works just fine, but it seems like the GPU isn’t actually doing anything on Windows (e.g. host arrays are initialized to zero, and should be copied over with the results from the GPU calculations, but in the end the host arrays are all still zero on Windows, while the exact same code gives non-zero results when run on Linux).

I’ll have to wrap my CUDA calls in my error checking code, and probably run some debugging on Windows. Just seems like there shouldn’t be a platform dependent problem here.

Thanks for the reminder about the debug build though!

On Windows, you may be running into a WDDM TDR timeout if any of the kernels in your code run for longer than about 2 seconds.

There are various platform differences as well. For example, long is 32 bits on 64-bit Windows, but 64 bits on 64-bit Linux.
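
To illustrate the long issue, here is a small host-only sketch. The helper names (long_bits, flat_index) are hypothetical and not from anyone's code in this thread. It reports the width of long on the current platform and uses fixed-width types from <cstdint> so that index arithmetic behaves the same on both systems.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Width of the platform's `long` in bits: 32 on 64-bit Windows,
// 64 on 64-bit Linux.
constexpr int long_bits() {
    return static_cast<int>(sizeof(long) * CHAR_BIT);
}

// Fixed-width types have the same size on every platform that provides them.
static_assert(sizeof(std::int32_t) * CHAR_BIT == 32, "int32_t must be 32 bits");
static_assert(sizeof(std::int64_t) * CHAR_BIT == 64, "int64_t must be 64 bits");

// Hypothetical helper: flat index into a row-major 2D array. With a 32-bit
// `long` the product row * cols could overflow; int64_t avoids that.
std::int64_t flat_index(std::int64_t row, std::int64_t cols, std::int64_t col) {
    return row * cols + col;
}
```

Code that only uses int and float, as described later in this thread, would not be affected by this particular difference.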

It’s the kernel time limit. I didn’t realize that Windows had such a draconian kernel execution time limit.

The code only uses ints and floats, which I believe have the same number of bits regardless of platform.

It’s a simple code that was designed to mimic a CPU-based Python code written by a student, and while I implemented a few optimizations, the kernels have not been optimized very well just yet and are quite memory bound at the moment. The first kernel takes about 40 ms on my laptop, so no problem there, but the second kernel takes about 5.8 seconds on my laptop running under Linux. Linux still has a kernel runtime limit, but it gives you more than 2 seconds; I believe it’s more in the 20 second range, if memory serves.

At least this may lead to some good discussions with this student who is just starting to learn CUDA and I can leave it as a task to him to get the code to work without hitting the kernel time limit should he run into the same problem on his desktop.
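
One common way to stay under the limit is to split the work of a long-running kernel across several shorter launches. The host-side sketch below only plans the slices; Slice and plan_slices are hypothetical names, and the right slice size would have to be found by timing on the actual GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One launch's share of the work: elements [offset, offset + count).
struct Slice {
    std::size_t offset;
    std::size_t count;
};

// Split `total` work items into consecutive slices of at most
// `max_per_launch` items each, so that each launch can finish quickly.
std::vector<Slice> plan_slices(std::size_t total, std::size_t max_per_launch) {
    assert(max_per_launch > 0);
    std::vector<Slice> slices;
    for (std::size_t offset = 0; offset < total; offset += max_per_launch) {
        const std::size_t remaining = total - offset;
        slices.push_back({offset, remaining < max_per_launch ? remaining
                                                             : max_per_launch});
    }
    return slices;
}
```

In the CUDA program, each slice would become one kernel launch that takes the offset and count as arguments and is followed by a synchronization, so that no single launch runs long enough to trip the watchdog.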

Thanks for all the input @Robert_Crovella!

The default limit of the Windows GUI watchdog timer is about 2 seconds. GUI watchdog timer limits exist in all other operating systems supported by CUDA and in my recollection their default values are always in the low single-digit seconds range.

When using a TCC driver with a GPU on Windows, or when not running X11 on a GPU on Linux, GUI operation is not affected by long-running kernels and no watchdog limits are imposed.

A fairly standard configuration is therefore to install a high-end GPU for compute, and a low-end GPU to drive the GUI.

I was fully aware of the run time limit for kernels. My surprise was that it was so short on Windows. As chronicled in my posts above, the code in question executes fine on my laptop’s Linux installation with one kernel taking 5.8 s to complete. The run time limit was being enforced in Linux, but was not tripped by a 5.8 s run time, so the limit there is larger. I can’t find a reference right now, but from back when I first started learning CUDA in 2011 or so, I seem to recall seeing something of the order of 20 seconds being the limit for Linux.

I also don’t normally run CUDA programs on my laptop. I use my desktop, which always has one GPU for my displays and one for compute work. Currently I’m running a GTX 950 for the display and a GTX 1070 Ti for compute. I’d prefer to be running a Tesla or Titan V for compute, but unfortunately that’s not currently in the budget.

Interesting data about the GUI watchdog timer limit on Linux. Maybe I misremembered, or different Linux distros use different defaults. For about a decade, I used RHEL exclusively.