Speed difference for same CUDA code under Windows/Linux

Hi, everyone.

I bought a GT240 and started “playing” with CUDA last week. I ran my first few tests, then wrote a simple Mandelbrot zoomer (I blogged about it and released the GPL code on my site).

I like the CUDA API - it’s nice and clean - but I just realized that there’s a major speed difference between the two most popular OSes:

My Mandelbrot code runs at around 200 fps under Linux.
My Mandelbrot code runs at around 400 fps under Windows XP.

In both cases, I disabled VSync to get the maximum frame rate.
I am using CUDA 2.3 (the stable version, that is) under both OSes.

Since I use an OpenGL pixel buffer object (PBO) to draw into, all operations happen entirely in card memory… so I was puzzled by this. I thought that maybe the OpenGL implementation under Windows is so much more optimized that it runs circles around the Linux one - so I commented out the code that draws the generated data using the texture…
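
For reference, the render/draw path looks roughly like this (a sketch with placeholder names, not the exact code from my repo); the second test below keeps the kernel and skips the blit at the end:

#include <cuda_gl_interop.h>

GLuint pbo;                          // pixel buffer object, created with glGenBuffers/glBufferData
cudaGLRegisterBufferObject(pbo);     // register the PBO with CUDA once, right after creating it

// per frame:
uchar4 *devPtr = 0;
cudaGLMapBufferObject((void **)&devPtr, pbo);           // map the PBO into CUDA address space
renderMandelbrot<<<grid, block>>>(devPtr, w, h, zoom);  // kernel writes the pixels directly on the card
cudaGLUnmapBufferObject(pbo);                           // hand the buffer back to OpenGL

// ...then bind the PBO as GL_PIXEL_UNPACK_BUFFER and glTexSubImage2D() it onto a full-screen quad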

Speed when only doing calculations, not drawing, under Linux: 400fps
Speed when only doing calculations, not drawing, under Windows: 680fps

Again, even for pure computations, CUDA under Windows runs a lot faster.

Perhaps my code is buggy somehow? I tried the official nbody sample from the SDK:

Under Linux:
./nbody -benchmark -n=30000
Run “nbody -benchmark [-n=<numBodies>]” to measure perfomance.
30000 bodies, total time for 100 iterations: 19701.262 ms
= 4.568 billion interactions per second
= 91.365 GFLOP/s at 20 flops per interaction

Under Windows XP:
Run “nbody -benchmark [-n=<numBodies>]” to measure perfomance.
30000 bodies, total time for 100 iterations: 12137.919 ms
= 7.415 billion interactions per second
= 148.296 GFLOP/s at 20 flops per interaction

So it’s not just my code… For some reason, my GT240 runs at least 60% faster under Windows XP.

Any ideas why? Is this a driver bug? It seems weird, since for pure calculations in the card’s global memory I would expect nvcc to generate pretty much the same code under Windows and Linux.

Is anyone out there doing serious computations with Linux/CUDA who has seen this?

Thanks for any help,

Thanassis Tsiodras, Dr.-Ing.

P.S. Under Linux, nvidia-settings, in the PowerMizer section, shows 3 performance levels, with the 2nd one (i.e. not the best) selected. The other two levels appear to be disabled, and selecting “Preferred mode: Maximum performance” doesn’t change this selection.

Could it be that under Linux the card operates at a lower clock frequency because of this?
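
One quick way to check, I suppose (assuming these are the right attribute names; they are what my nvidia-settings reports):

$ nvidia-settings -q GPU3DClockFreqs -q GPUCurrentClockFreqs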

You can try setting the performance level manually.
(link to a PowerMizer how-to hosted at tutanhamon.com.ua - the page is no longer online)
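
From memory, it boils down to adding registry options to the Device section of xorg.conf - I can’t vouch for the exact key names on your driver version, so treat this as a rough sketch only:

Section "Device"
    Identifier "nvidia"
    Driver     "nvidia"
    # ask PowerMizer for a fixed/maximum performance level instead of adaptive clocking
    Option "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x1"
EndSection

and then restarting X.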

N.

Same symptoms here.

I see a similar performance drop on Ubuntu: the 3dfd sample takes 0.19 s to finish, whereas Windows 7 needs only 0.079 s.

My post is here: http://forums.nvidia.com/index.php?showtopic=153624

Just a guess: maybe it is due to the compiler, i.e. the Microsoft compiler vs. GCC.
Since some amount of code is still executed on the CPU, the slowdown may be caused entirely by differences in CPU execution.

So it may make sense to measure just the kernel execution time - something like the sketch below.
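
A rough sketch (kernel name, launch configuration and arguments are placeholders):

#include <cstdio>
#include <cuda_runtime.h>

// ...wherever the kernel is launched:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
mandelbrotKernel<<<grid, block>>>(devPtr);   // whatever your kernel is
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                  // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // GPU-side time, independent of the host compiler
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);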

I don’t think so. The sample I used (simpleTexture3D) depends heavily on the GPU, as it loops indefinitely until stopped. On Ubuntu 64-bit it freezes the screen; on Windows 7 it runs smoothly like any other program. The confusing part is that the reported fps says 60 even on Ubuntu.

nvidia-settings says that if any CUDA program is running, PowerMizer will switch to maximum performance mode. However, this promise is simply not kept, judging by the awful performance.

I found a way to speed up the graphics card in Windows (http://www.laptops-drivers.com/miscellaneous/alienware-m15x-disabled-nvidia-powermizer.html). There seems to be a similar yet ugly way to do that in Linux too (a page on Swik; the original link is lost).

To mikola: No, we are talking about pure-GPU calculations here; the CPU simply launches the CUDA kernels.
GCC vs. MSVC can’t explain these differences.

To Niko: I tried every combination I could think of (using the “registry” options), but nothing came of it. nvidia-settings always shows the same thing: the 3rd level (1.7GHz) remains disabled and is never used. This could explain the differences… I sent an email to NVIDIA support, and they sort-of-confirmed that this is a bug in the current Linux drivers (stable and beta) for the GT240. I am waiting for the next driver release, hoping… (195.30 didn’t fix this bug.)

If you’re running this from X (i.e. with some sort of GL-accelerated UI running), try killing that.
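
On Ubuntu that usually means switching to a text console (Ctrl-Alt-F1) and stopping the display manager, e.g. (adjust for whatever display manager you run):

sudo /etc/init.d/gdm stop     # stops GDM and the X server
# ...run the CUDA benchmark from the console...
sudo /etc/init.d/gdm start    # bring X back afterwards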

Hi tumrray,

Your suggestion worked like a charm. I turned off all visual effects and the simpleTexture3D sample got a boost, up to 120 fps (avg.) and 350 fps (peak) on my GTX 260M. The screen no longer freezes.

Btw, the statistics from 3dfd still give 461 Mpoints/s and 0.19264 s, which are much lower than the Windows numbers.

I am running a spartan IceWM, with no OpenGL “waves and wobbles” for my windows :-)

So there’s nothing to kill.

“nbody” is executed with “-benchmark”, so there’s no OpenGL code running - still, the difference is substantial:

Under Linux:

./nbody -benchmark -n=30000
Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.
30000 bodies, total time for 100 iterations: 19701.262 ms
= 4.568 billion interactions per second
= 91.365 GFLOP/s at 20 flops per interaction

Under Windows XP:

Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.
30000 bodies, total time for 100 iterations: 12137.919 ms
= 7.415 billion interactions per second
= 148.296 GFLOP/s at 20 flops per interaction

For my Mandelbrot, it is true that I am using OpenGL to “blit” the generated images to the screen - but I am doing the same thing under Windows, where it runs 60% faster… and EVEN IF I COMMENT OUT the code that blits the texture to the window, leaving only the pure Mandelbrot calculations with no displaying, the code still runs at HALF the speed under Linux.

Try it yourself; the code is open source and GPLed.

This is not a fluke - it is reproducible every time. It is probably because of the disabled 3rd performance level (the 1.7GHz mode), which I can only attribute to the fact that the GT240 is a new chip and the drivers haven’t caught up yet - PowerMizer probably thinks it only has levels one and two.

This needs to be addressed, people from NVIDIA…

Other people have noticed, too…

Thanassis.

Hi,

I am using Ubuntu 9.10 and my card is an NVIDIA GeForce 9200M GE with drivers 190.42.
My Ubuntu can use the level-3 option, so maybe there is a problem with your card model - if I am right!

That’s what I am saying, biebo: for the GT240/GDDR5 cards, the Linux driver does not support the 1.7GHz mode.

Here’s my own screenshot:

Same thing happens to me, too - GT240/GDDR5.

NVIDIA, please fix this!

A large percentage of the people coding for CUDA are doing so in academic environments (me included), and Linux rules there!

We can’t live with 1/3 of the speed we get under Windows!

The bandwidthTest (included in the CUDA SDK), executed on a GT240/GDDR5, under Windows and Linux:

Windows:

...
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               26698.5
&&&& Test PASSED

Linux:

...
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               8394.9
&&&& Test PASSED

In other words, the more your CUDA code depends on GPU global memory accesses, the more pronounced the difference between Windows and Linux becomes.

The reason, as shown in the previous posts, is that the Linux drivers run the card’s memory at 324MHz instead of 1.7GHz, even when executing CUDA or OpenGL code (see the PowerMizer screenshots above).
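
Back-of-the-envelope, assuming I’m reading the clocks right and using the GT240/GDDR5’s 128-bit memory bus: 1700 MHz × 2 (DDR) × 128 bit ÷ 8 ≈ 54 GB/s of theoretical bandwidth at the full memory clock, versus 324 MHz × 2 × 128 bit ÷ 8 ≈ 10 GB/s when stuck at the low PowerMizer level - which is consistent with the ~8.4 GB/s Linux figure from bandwidthTest above, while the Windows figure (~26.7 GB/s) is only possible at the higher clock.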

This bug makes GT200-series cards rather useless for CUDA work under Linux…

Can anyone from NVIDIA provide an ETA for a fix for this serious bug?

P.S. According to this, it also affects GTX260 and GTX280 cards, which means that even $700 cards run at 1/3 to 1/5 of the speed they should be running at…

From a post over at nvnews, it appears that NVIDIA has accepted this bug (look at the bottom of that page) as bug number 636716 in their internal bug tracking system.

The post also includes a reference to this thread, as well as to a similar one about the same PowerMizer issue for GTX260/280 cards (which seem to suffer from the same Linux driver problem).

I hope this will drive NVIDIA into fixing this powermizer thing…

J.Glenn

(recent owner of a GT240/GDDR5, which runs under Linux at 324MHz instead of 1.7GHz because of this bug).

Phoronix just did a review of a GT240/GDDR5 and noticed the same bug - that the card is much slower under Linux than it should be…

“ECS NVIDIA GeForce GT 240 512MB Review” on Phoronix (the original link is no longer available)

The bandwidthTest figure for GDDR5 under Windows doesn’t look too good either… around 25 GB/s is also what I get on a GT220, and that card only has DDR3.

Potentially, there should be room for driver improvements that would double the frame rates on those cards.

I’m seeing the same problem on both Windows and Linux with a range of different drivers. Theoretical peak bandwidth is 54.4 GB/s so we’re getting less than 50% of the theoretical peak on a simple memory to memory copy?! Typically that should be around 75% or so. Anyone found a newer driver that fixes this problem?

I am looking to buy a GeForce 9800 GTX+ or a GeForce GTS250 video card. Given the speed problem with the driver under Linux, should I stay away from the GTS250? Any suggestions would be appreciated.

The GTS250 is still based on the older G92 GPU, which is the same one the 9800GTX uses. It should not be affected by this problem, which seems to be restricted to the very new GT215 GPU (i.e. the GT240).
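
If you want to double-check which GPU a given card reports (and its shader clock), the SDK’s deviceQuery sample prints both - something like:

$ ./deviceQuery | grep -i -e 'device 0' -e 'clock'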

Just tested 195.36.03 on my GT240/GDDR5 - powermizer bug remains, memory speed is still 324MHz instead of 1.7GHz, even when running OpenGL/CUDA apps…

$ glxinfo | grep version | grep 195
OpenGL version string: 3.2.0 NVIDIA 195.36.03

$ glxgears & # or any CUDA application you want

$ nvidia-settings -q all 2>&1 | egrep '(Current)?ClockFreqs' | grep '0\.0'
  Attribute 'GPU2DClockFreqs' (home:0.0): 135,135.
  Attribute 'GPU3DClockFreqs' (home:0.0): 550,1700.
  Attribute 'GPUDefault2DClockFreqs' (home:0.0): 135,135.
  Attribute 'GPUDefault3DClockFreqs' (home:0.0): 550,1700.
  Attribute 'GPUCurrentClockFreqs' (home:0.0): 405,324.

The card can do 1.7GHz, but even when running CUDA and/or OpenGL, the current memory clock remains at 324MHz… :-(

Different test, from the CUDA SDK:

$ ./bandwidthTest  | grep -A1 Bandwi
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1653.1
--
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1197.6
--
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               8389.7

Under Windows, all numbers are much greater: e.g. the last one is 26700, that is, more than 3 times faster…

Guess we have to wait for the next driver version again…