CUDA performance on Linux: sample programs show it's slower?

I compared the results of the sample programs in the SDK, and it seems that CUDA is slower on Linux.

The only thing that was faster was simpleGL, where the wave moves much faster than on Windows (if (faster == performance)).

Any specific cause of this?
I’m currently using Debian. I know it’s not supported yet; will that have any effect on performance?

I’d say it is more likely that the motherboard you’re using is impacting performance. How are you measuring performance?
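To make the "how are you measuring" question concrete, here is a minimal sketch of timing a kernel with CUDA events, so the same measurement can be repeated on both OSes. It assumes a recent CUDA toolkit (the event and `cudaDeviceSynchronize` calls may not match the beta releases discussed in this thread), and the `scale` kernel and `grid_size` helper are hypothetical stand-ins, not SDK code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Number of blocks needed to cover n elements (hypothetical helper).
static int grid_size(int n, int threads) {
    return (n + threads - 1) / threads;
}

// Hypothetical stand-in kernel: scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 22;      // element count, chosen arbitrarily
    const int reps = 100;       // launches averaged over
    const int threads = 256;
    const int blocks = grid_size(n, threads);

    float *d_data = nullptr;
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    // Warm-up launch so one-time startup cost is not part of the timing.
    scale<<<blocks, threads>>>(d_data, n, 1.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Events are recorded on the GPU timeline, so host-side scheduling
    // differences between operating systems do not enter the result.
    CHECK(cudaEventRecord(start, 0));
    for (int r = 0; r < reps; ++r)
        scale<<<blocks, threads>>>(d_data, n, 1.0f);
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("average kernel time: %.4f ms\n", ms / reps);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_data));
    return 0;
}
```

If the event-based kernel times match across OSes while wall-clock times do not, the difference is probably in launch overhead or host-side work rather than in the kernel itself.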

I do not see any runtime differences between XP and Linux for the kernel time (using the CUDA profiler). Kernel startup and memory operations (pageable) tend to be a bit faster on Linux. The emulator can be dramatically faster on Linux with some kernels because of the better thread scheduling.

I am using WinXP SP2 and openSuSE 10.2 (2.6.18 kernel) respectively. CK804 board, 3GHz P4 HT

Peter

My machine spec is:
Intel 5000X
2x Xeon 5160 3.0GHz
3GB RAM
8800GTX

OSes are Windows XP/SP2 and Debian

Notable differences are

Bandwidth Test - Device to Device
Linux: 3331 MB/s
Win: 9504 MB/s

Binomial
Linux: 218.4 ms
Win: 162.6 ms

matrixMul
Linux: 162.8 ms
Win: 16 ms

MultiGPU
Linux: 797.8 ms
Win: 576 ms

Scan
Linux: 0.477 ms, 0.771 ms, 0.306 ms
Win: 0.29 ms, 0.38 ms, 0.167 ms

Vectorload
Linux: 160 ms
Win: 24 ms

Any ideas what is causing this?
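For anyone who wants to cross-check the device-to-device figure independently of the SDK sample, below is a rough sketch that times `cudaMemcpy` with `cudaMemcpyDeviceToDevice` using CUDA events. The buffer size, repetition count, and the `mb_per_s` helper are my own assumptions, and it targets a recent CUDA toolkit rather than the beta releases in this thread. Since a D2D copy both reads and writes device memory, the sketch prints the figure under both counting conventions; I am not claiming either one is what the SDK test reports.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Bandwidth in MB/s (1 MB = 2^20 bytes) for `bytes` moved in `ms` milliseconds.
static double mb_per_s(double bytes, double ms) {
    return (bytes / (1024.0 * 1024.0)) / (ms / 1000.0);
}

int main() {
    const size_t bytes = 32u << 20;  // 32 MB per copy, chosen arbitrarily
    const int reps = 10;             // copies averaged over

    void *d_src = nullptr, *d_dst = nullptr;
    CHECK(cudaMalloc(&d_src, bytes));
    CHECK(cudaMalloc(&d_dst, bytes));
    CHECK(cudaMemset(d_src, 1, bytes));

    // Warm-up copy so one-time initialization is not timed.
    CHECK(cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    for (int r = 0; r < reps; ++r)
        CHECK(cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(stop, 0));
    // Wait for the stop event, which also waits for the copies queued before it.
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    const double copied = static_cast<double>(bytes) * reps;
    std::printf("D2D, bytes copied:        %.1f MB/s\n", mb_per_s(copied, ms));
    std::printf("D2D, bytes read + written: %.1f MB/s\n", mb_per_s(2.0 * copied, ms));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_src));
    CHECK(cudaFree(d_dst));
    return 0;
}
```

Running the same binary logic on both OSes with identical sizes should show whether the gap is in the copy itself or in how the sample measures it.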

I saw this on Fedora Core, Knoppix (Debian), and Ubuntu. All show the same performance problem; at least the D2D numbers were all quite close to 333x MB/s.

On Fedora, I remember seeing good D2D bandwidth at some point, but I never saw it again.

It would be worth trying SUSE.

With the same motherboard and everything else, WinXP is faster than Linux.

I got 85xx MB/s D2D bandwidth on Windows, but 333x MB/s D2D on Linux.

I got binomial 304 ms, matrixMul 44 ms, scan 0.5, 0.89, 0.24 ms, vectorload 44 ms (all on Linux).

But something’s funny – I get 3334.7 MB/s D2D – and this on an 8800GTS, not GTX.

Why should the number be virtually the same?

I have been chasing this 333x MB/s D2D figure for quite some time.

This makes me hesitant to commit to doing development and measurement on Linux.

I would expect Linux to perform better on compute.

I would very much appreciate it if someone could explain that D2D number and the other lower benchmark numbers.

The new version is going to be way faster.
If you look at the FAQ, these are the new numbers:

                  Pageable      Page-locked
Host - Device     1.7 GB/sec    3.1 GB/sec
Device - Host     1.7 GB/sec    3.1 GB/sec
Device - Device   70.7 GB/sec   70.7 GB/sec
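To see the pageable versus page-locked difference on your own machine, here is a minimal sketch that times the same host-to-device copy from a `malloc` buffer and from a `cudaMallocHost` buffer. The transfer size, repetition count, and the `gb_per_s` and `time_h2d` helpers are hypothetical choices of mine (with 1 GB taken as 10^9 bytes), and the code assumes a recent CUDA toolkit. It is not how the FAQ numbers were produced.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Bandwidth in GB/s (1 GB = 1e9 bytes, an assumption) for `bytes` in `ms`.
static double gb_per_s(double bytes, double ms) {
    return (bytes / 1e9) / (ms / 1e3);
}

// Average time in ms of one host-to-device copy of `bytes` bytes.
static float time_h2d(void *d_dst, const void *h_src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy so one-time setup is excluded from the timing.
    CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int r = 0; r < reps; ++r)
        CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms / reps;
}

int main() {
    const size_t bytes = 32u << 20;  // 32 MB per copy, chosen arbitrarily
    const int reps = 10;

    void *d_buf = nullptr;
    CHECK(cudaMalloc(&d_buf, bytes));

    // Pageable host memory from the ordinary C allocator.
    void *h_pageable = std::malloc(bytes);
    if (!h_pageable) {
        std::fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    // Page-locked (pinned) host memory from the CUDA runtime.
    void *h_pinned = nullptr;
    CHECK(cudaMallocHost(&h_pinned, bytes));

    std::memset(h_pageable, 1, bytes);
    std::memset(h_pinned, 1, bytes);

    const float ms_pageable = time_h2d(d_buf, h_pageable, bytes, reps);
    const float ms_pinned = time_h2d(d_buf, h_pinned, bytes, reps);

    std::printf("Host - Device, pageable:    %.2f GB/sec\n",
                gb_per_s(static_cast<double>(bytes), ms_pageable));
    std::printf("Host - Device, page-locked: %.2f GB/sec\n",
                gb_per_s(static_cast<double>(bytes), ms_pinned));

    CHECK(cudaFreeHost(h_pinned));
    std::free(h_pageable);
    CHECK(cudaFree(d_buf));
    return 0;
}
```

Swapping the copy direction to `cudaMemcpyDeviceToHost` (with the argument order reversed) would give the Device - Host row in the same way.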

I can confirm the values of mfatica on Linux (CUDA 0.9beta).

I don’t see this difference. And I actually don’t see why a device-to-device copy should depend on the host OS :blink:

Peter