I typically do all my CUDA work in Linux, but today I was setting up CUDA under Windows 10 in order to help a student through the setup process. I was able to get CUDA installed and can compile and run the CUDA code samples under Windows 10, as well as compile my own CUDA code via command prompt with nvcc. However, when I ran the nbody simulation sample in Windows the performance was abysmal. Booting my laptop into Fedora and running the same sample code gives a 163x performance boost.
Additionally, while my own CUDA code compiles with no errors and runs without printing any error messages, it does not return the expected results: every value calculated on the GPU simply comes back as 0. The same code works as expected when compiled and run in Fedora.
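For reference, here is a minimal sketch of the kind of check that should surface a silent failure. A kernel launch does not return an error code itself, so if the return values of the runtime calls and cudaGetLastError() are never inspected, a failed launch can leave the output buffer untouched without printing anything. The CUDA_CHECK macro and the fill kernel are hypothetical names made up for this illustration, not part of my actual code; the runtime calls themselves are standard CUDA Runtime API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from my real code): print the CUDA error
// string with file and line, then exit, whenever a runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical stand-in kernel: each thread writes its global index.
__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    // Report which device the runtime selected, to confirm it is the
    // discrete GPU and to see its compute capability.
    int dev = 0;
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDevice(&dev));
    CUDA_CHECK(cudaGetDeviceProperties(&prop, dev));
    printf("device %d: %s (compute %d.%d)\n",
           dev, prop.name, prop.major, prop.minor);

    const int n = 256;
    int host[n] = {0};
    int *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(int)));

    fill<<<(n + 127) / 128, 128>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host, d_out, n * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_out));

    // If everything succeeded, this should no longer be 0.
    printf("host[%d] = %d\n", n - 1, host[n - 1]);
    return 0;
}
```

Compiled with nvcc in the same way as the rest of my code, this should either print the device name and a non-zero last element, or stop at the first failing call with the runtime's error string.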
Has anyone else experienced similar issues? My guess is that it has something to do with the difference in the way Optimus technology is handled between Linux (via Bumblebee) and Windows, though if anything I would have expected the officially supported Windows implementation to work better than the community-made hack that is Bumblebee on Linux.
Here are some specifics about the performance disparity:
Fedora 28: nbody --benchmark
n = 5120, time for 10 iterations = 5.825 ms
45.003 billion interactions per second
900.066 single-precision GFLOPs at 20 FLOPs per interaction
Windows 10: nbody --benchmark
n = 5120, time for 10 iterations = 948.492 ms
0.276 billion interactions per second
5.528 single-precision GFLOPs at 20 FLOPs per interaction
Both tests were run on my laptop with the following specifications:
Model: Lenovo Ideapad Y700 15ISK
CPU: Intel Core i7-6700HQ (4 core/8 thread)
IGPU: Intel HD Graphics 530
System Memory: 32 GB DDR4
NVIDIA Discrete GPU: GTX 960M 4 GB