Huge Linux vs XP performance boost with beta 2.0

I installed CUDA 1.1 under Fedora Core 8 x86_64, and have been benchmarking my program (along with some of the other programs in the SDK) vs benchmarks on Windows with the exact same hardware.

My program:
Windows XP (32-bit): 1450.8ms
Linux (FC8, x86_64): 2554.2ms

So, how about the stock BlackScholes SDK example?

Windows XP (32-bit): 3.5ms
Linux (FC8, x86_64): 5.3ms

Linux is running at 55%-67% of Windows with the 1.1 SDK!

So, I installed 2.0 beta under Linux.

My Linux results now exactly match my Windows 1.1 results, which is obviously a big performance jump.

So, I installed 2.0 under Windows.

My program went from 1450.8ms to 1170.4ms under Windows. So, my Linux 2.0 benchmarks now match my Windows 1.1 benchmark. But my Windows 2.0 benchmark is now faster than my Linux 2.0 benchmark by quite a bit.

Bottom line, the 2.0 beta is definitely worth installing immediately, but the performance mismatch between Linux and Windows is puzzling.

Firstly, it's a beta, so it's definitely worth flagging this difference with NVIDIA whatever the cause is. I was going to suggest the overhead of 64-bit processing, but even so, that's way too big a gap.

Might be worth detailing which compiler you're using under Linux - presumably .NET's cl for Windows?

Our application only shows a +/- 2% performance delta between Windows XP 32-bit and Linux 64-bit, and that is due to MSVC poorly optimizing portions of the CPU code. I've never noticed a difference in any of the microbenchmarks I've written, either.

Does your application read pointers out of global memory or otherwise perform a lot of operations with pointers? In 64-bit OSes, pointers are 64-bit in CUDA to match the host pointer size. This can potentially lead to more registers used (check the cubin), or more memory transfers, etc…
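To make that concrete (a minimal sketch, not from the application being discussed): a kernel like the one below, which loads pointers from global memory, moves 8-byte pointer values and holds each of them in two 32-bit registers when built on a 64-bit OS, versus 4 bytes and one register on 32-bit. Comparing the register count in the cubin (or the output of --ptxas-options=-v) between the two builds would show whether this is what's happening.

```
// Hypothetical pointer-chasing kernel: on a 64-bit OS each entry of `table`
// is an 8-byte pointer, so the load is twice as wide and the pointer value
// occupies two 32-bit registers instead of one.
__global__ void gatherThroughPointers(float **table, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float *p = table[i];   // 64-bit load on x86_64 hosts, 32-bit on x86
        out[i] = *p;
    }
}
```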

One other area where performance often differs between Linux and Windows is host<->device transfers. On many systems, Linux is slower than Windows, usually to the tune of 2.5 GiB/s on Linux vs 3 GiB/s on Windows.
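If you want to check where your own system lands, timing a few pinned-memory copies yourself gives roughly the same number as the SDK's bandwidthTest sample. The sketch below is only illustrative; the buffer size and iteration count are arbitrary.

```
#include <cstdio>
#include <cuda_runtime.h>

// Rough host->device bandwidth check using pinned memory and CUDA events.
int main()
{
    const size_t bytes = 64 << 20;              // 64 MiB test buffer
    const int    reps  = 10;
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);      // pinned memory for peak rates
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gib_per_s = (double)reps * bytes / (ms / 1000.0)
                       / (1024.0 * 1024.0 * 1024.0);
    printf("host->device: %.2f GiB/s\n", gib_per_s);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```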

Have you tried the Black Scholes benchmark across XP-32 and Linux-x86_64? My application's performance boost mirrored the BlackScholes results.

Upgrading Linux to beta 2.0 of CUDA instantly gave me the performance that existed on Windows XP 1.1, so my guess is something significant has changed.

Going from CUDA 1.1 to CUDA 2.0 on 64-bit Linux had no significant performance delta for me. Unfortunately I don’t have a Black-Scholes benchmark from before the upgrade to compare to.

I'll try it when I get back to the office.

Hardware: 8800 GTS 512MB
Linux benchmarks are performed in text console mode with nothing but sshd running in the background. Windows Vista benchmarks are performed with Aero disabled and all default background services running, except SuperFetch.

Black Scholes times
CUDA 1.1 Linux 64-bit: 2.0534 +/- 0.0006 ms
CUDA 2.0 beta Linux 64-bit: 1.6797 +/- 0.002 ms
CUDA 2.0 beta Vista32: 1.6538 +/- 0.0009 ms

So, Black Scholes does seem a little slower with CUDA 1.1, but both Linux 64-bit and Vista32 with CUDA 2.0 beta perform nearly identically. Apparently, there was some compiler improvement between 1.1 and 2.0 here.

However, Mersenne Twister offers a counterexample: Windows is slower.

Mersenne Twister BoxMullerGPU() samples per second
CUDA 1.1 Linux 64-bit: 5.57 +/- 0.2 billion
CUDA 2.0 beta Linux 64-bit: 5.63 +/- 0.1 billion
CUDA 2.0 beta Vista32: 4.57 +/- 0.1 billion

I think this goes to show that compiler/performance differences between architectures need to be evaluated on a case-by-case basis. I would suggest starting by examining the register counts and occupancy numbers for cases where the performance differs from one OS to another. This is the most likely cause of differences in compute-intensive kernels.

For instance, the BlackScholes SDK example compiles to 16 regs in 64-bit Linux with the CUDA 2.0 beta, and 16 with CUDA 1.1. OK, so my idea didn't work here… I can't explain the performance difference in this case; it will probably take wumpus and decuda to completely unravel the compiler differences in BlackScholes. But in my own kernels I have seen register count differences between 64-bit and 32-bit compiles cause performance differences due to the changed occupancy.
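To make the occupancy effect concrete, here is a rough back-of-the-envelope estimate. The 128-thread block size is an assumption; the compute-1.1 limits of 8192 registers, 768 threads, and 8 blocks per multiprocessor are from the programming guide, and the real allocator also rounds register usage to an allocation granularity, so treat this as an approximation.

```
#include <cstdio>

// Back-of-the-envelope occupancy estimate for a compute-1.1 part (G80/G92):
// 8192 registers, 768 resident threads, and 8 resident blocks per
// multiprocessor. Block size and register count below are illustrative;
// substitute the numbers ptxas reports for your own kernel.
int main()
{
    const int regs_per_sm     = 8192;
    const int max_threads_sm  = 768;
    const int max_blocks_sm   = 8;
    const int block_size      = 128;  // assumed launch configuration
    const int regs_per_thread = 16;   // e.g. the 16 regs reported above

    int by_regs    = regs_per_sm / (regs_per_thread * block_size);
    int by_threads = max_threads_sm / block_size;
    int blocks     = by_regs < by_threads ? by_regs : by_threads;
    if (blocks > max_blocks_sm) blocks = max_blocks_sm;

    printf("resident blocks per SM: %d, occupancy: %d%%\n",
           blocks, 100 * blocks * block_size / max_threads_sm);
    return 0;
}
```

With these numbers the estimate is 4 resident blocks (66% occupancy); if a 64-bit build pushed the kernel to, say, 20 registers per thread, only 3 blocks would fit, which is exactly the kind of occupancy shift that can show up as an OS-to-OS performance difference.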