I ported my CUDA code from Ubuntu 20.04 (Code::Blocks) to Windows 10 (Visual Studio 2019) on the same hardware (a dual-boot machine). The latest CUDA toolkit is installed on both OSes. After compiling with nvcc, the Windows binary is approximately 15x slower than the Ubuntu one.
Without posting the entire code for now, is there a “classic” mistake or trap when porting code to another IDE/OS?
My guess is that the difference is attributable not to kernel execution but to some other aspect. My expectation is that a given kernel launch should take the same amount of time to execute, whether the OS is Windows or Linux, all other things (GPU, machine config, kernel code, input parameters, grid config, compile settings, etc.) being the same.
So the difference should be pretty evident from a profiler timeline view. For example, if the GPU is running under the WDDM driver model, the Windows OS's own use of the GPU could be getting in the way of the CUDA use of the GPU.
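Short of a full profiler run, one rough way to check this is to time the same launch twice over: once with CUDA events (GPU-side time) and once with a host wall clock (which also includes launch and driver overhead), and compare the numbers on both OSes. Below is a minimal sketch of that idea, not your code: the `scale` kernel, the `Timing` struct, and `measure_scale` are hypothetical stand-ins, and the problem size is arbitrary. It also prints `tccDriver` from `cudaDeviceProp`, which is 1 when the device uses the TCC driver and 0 otherwise (so 0 under WDDM, and also 0 on Linux). Event timings under WDDM can themselves be affected by how the driver batches work, so treat the output as a hint, not a measurement of record.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical stand-in for the real kernel being ported.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical result type: GPU-side and host-side time for one launch.
struct Timing {
    float  gpu_ms;   // elapsed time between the two CUDA events
    double wall_ms;  // host wall-clock time around the same launch
};

Timing measure_scale(int n)
{
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d, 0, n * sizeof(float)));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    const int block = 256;
    const int grid  = (n + block - 1) / block;

    // Warm-up launch so one-time initialization is not part of the timing.
    scale<<<grid, block>>>(d, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Timed launch: events bracket the kernel, the host clock brackets both.
    auto t0 = std::chrono::steady_clock::now();
    CUDA_CHECK(cudaEventRecord(start));
    scale<<<grid, block>>>(d, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    auto t1 = std::chrono::steady_clock::now();

    Timing t;
    CUDA_CHECK(cudaEventElapsedTime(&t.gpu_ms, start, stop));
    t.wall_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d));
    return t;
}

int main()
{
    // Report which driver model device 0 is using.
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("%s, tccDriver = %d\n", prop.name, prop.tccDriver);

    Timing t = measure_scale(1 << 24);
    std::printf("GPU (events): %.3f ms, host wall clock: %.3f ms\n",
                t.gpu_ms, t.wall_ms);
    return 0;
}
```

If the event time is about the same on both OSes but the wall-clock time is much larger on Windows, the slowdown is outside the kernel and the profiler timeline is the place to look next. If the event time itself is 15x larger on Windows, it would be worth comparing the compile settings between the two builds.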