9800 GTX and CUDA performance problems: slower than 8800 GT in some cases

shawkie · June 17, 2008, 10:31am

I really don’t understand what’s happening but I am seeing my 9800 GTX perform slower than an 8800 GT on some of my tests. In some cases it can take almost twice as long to complete a task! In other tests it comes out much faster. I understood that the 9800 GTX was just an 8800 GTS 512 running at higher clock speeds - is that true? I’m using CUDA 1.1. Any help would be much appreciated.

seibert · June 17, 2008, 12:40pm

Can you say more about your setup?

Are both cards in the same computer?
Which PCI-Express slots are they installed in?
Do you know if your kernels are compute-bound or memory bound?
If memory bound, what does bandwidthTest in the SDK report for the two cards?

shawkie · June 17, 2008, 2:02pm

I have been swapping the cards in and out of the same slot in the same computer. It’s a 16X PCI-E 2.0 slot on an i5400-based motherboard. Actually I’ve also tried it on another PC with a 16X PCI-E 1.0 slot on an old nForce4-based motherboard, with similar results.

I’m really not sure if they are memory bound or computation bound. They don’t do an awful lot of maths, although there is a division (which is quite costly, I believe). Otherwise it’s mostly just texture lookups (with bilinear filtering) and some coalesced reads and writes.

I’ll try the bandwidthTest.

But surely the 9800 GTX should basically be faster in every way, shouldn’t it? There isn’t something weird about the architecture I need to be aware of, is there?

The only thing I can think of that might be important is that I found blocks of 512 (rather than 256 for instance) threads to perform best on the 8800 GT. Could this be my problem?
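One way to reason about the 512-versus-256 question is to work out how many blocks can be resident on a multiprocessor at once. The following is a minimal sketch, assuming the per-multiprocessor limits of compute capability 1.0/1.1 devices (768 threads, 8 blocks, 8192 registers, 16 KB of shared memory). The figure of 10 registers per thread is purely hypothetical, and the sketch ignores register allocation granularity, so treat the results as rough estimates rather than what the official occupancy calculator would report.

```python
# Per-multiprocessor limits for compute capability 1.0/1.1 devices.
MAX_THREADS_PER_SM = 768
MAX_BLOCKS_PER_SM = 8
REGISTERS_PER_SM = 8192
SHARED_BYTES_PER_SM = 16384


def resident_blocks(threads_per_block, regs_per_thread, shared_bytes_per_block=0):
    """Blocks that fit on one multiprocessor (ignores allocation granularity)."""
    limits = [
        MAX_BLOCKS_PER_SM,
        MAX_THREADS_PER_SM // threads_per_block,
        REGISTERS_PER_SM // (threads_per_block * regs_per_thread),
    ]
    if shared_bytes_per_block > 0:
        limits.append(SHARED_BYTES_PER_SM // shared_bytes_per_block)
    return min(limits)


def occupancy(threads_per_block, regs_per_thread, shared_bytes_per_block=0):
    """Resident threads as a fraction of the multiprocessor maximum."""
    blocks = resident_blocks(threads_per_block, regs_per_thread, shared_bytes_per_block)
    return blocks * threads_per_block / MAX_THREADS_PER_SM


# Hypothetical kernel using 10 registers per thread and no shared memory.
for threads in (128, 256, 512):
    print(threads, resident_blocks(threads, 10), round(occupancy(threads, 10), 2))
```

Under these assumptions a 512-thread block leaves room for only one block per multiprocessor (two thirds of the thread limit), while 256-thread blocks can fill it. The 8800 GT and 9800 GTX share the same limits, so occupancy by itself would not be expected to differ between the two cards.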

seibert · June 17, 2008, 2:37pm

Yes, as far as I know, the 9800 GTX should be faster than the 8800 GT in all aspects: computation and memory transfer. However, I’ve seen a weird driver bug with CUDA 1.1 that mysteriously handicapped my 8800 GTX’s memory bandwidth until I started X.org. (CUDA 2.0 beta fixed this) I don’t think this is your problem, but I’m asking about the configuration mostly to narrow down the bug.

Actually… Now that I think about it, I don’t believe the 9800 GTX is supported in CUDA 1.1. Technically, it isn’t supported in CUDA 2.0 beta either, but there have been successful reports of people using it anyway. (9800 GTX was released after CUDA 1.1 and CUDA 2.0 beta) Can you try the CUDA 2.0 beta driver and toolkit and see if this helps? You might be hitting some kind of driver incompatibility in the older version.

shawkie · June 17, 2008, 3:46pm


I’ve got the results of the bandwidth test and there are no surprises there. This is for the PCI-E 2.0 slot:

8800 GT

Host (pinned) → Device: 5,129.8 MB/s

Device → Host (pinned): 4,524.3 MB/s

Device → Device: 47,874.4 MB/s

9800 GTX

Host (pinned) → Device: 5,564.7 MB/s

Device → Host (pinned): 4,759.2 MB/s

Device → Device: 56,308.4 MB/s

I think you may have a point with the CUDA versions, but I’m not sure it’s possible to run CUDA 2.0 beta on the 9800 GTX, since there is only one version of the driver that supports CUDA 2.0 and it doesn’t support the 9800 GTX. Also, I thought CUDA was meant to be compatible with future hardware without recompilation?
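The Device → Device figures above can serve as a yardstick for the compute-bound versus memory-bound question. A rough sketch: estimate the global-memory traffic of one kernel launch, divide by its run time, and compare the result with the measured copy bandwidth. The kernel sizes and timing below are hypothetical, the assumption that "MB" means 2**20 bytes should be checked against your SDK version, and texture fetches served from the cache are not counted.

```python
# Device -> Device figures reported by bandwidthTest in this thread (MB/s).
PEAK_MB_S = {"8800 GT": 47874.4, "9800 GTX": 56308.4}

MB = 1 << 20  # assumption: "MB" means 2**20 bytes


def achieved_mb_s(bytes_read, bytes_written, seconds):
    """Global-memory traffic of one kernel launch, in MB/s."""
    return (bytes_read + bytes_written) / seconds / MB


def fraction_of_peak(card, bytes_read, bytes_written, seconds):
    """Achieved traffic as a fraction of the measured copy bandwidth."""
    return achieved_mb_s(bytes_read, bytes_written, seconds) / PEAK_MB_S[card]


# Hypothetical kernel: reads 64 MB, writes 64 MB, takes 4 ms.
example = fraction_of_peak("8800 GT", 64 * MB, 64 * MB, 0.004)

# Upper bound on the speedup a purely bandwidth-bound kernel could see.
speedup_bound = PEAK_MB_S["9800 GTX"] / PEAK_MB_S["8800 GT"]

print(f"fraction of peak on 8800 GT: {example:.2f}")
print(f"9800 GTX / 8800 GT copy bandwidth: {speedup_bound:.2f}")
```

A kernel running close to the measured peak is probably memory bound, and one far below it is limited by something else (arithmetic, latency, or texture fetches). Either way, these numbers give no reason for the 9800 GTX to take twice as long.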

seibert · June 17, 2008, 8:45pm

Ah, ok, for some reason I thought you were using a driver from the CUDA 1.1 release. So you are using a driver newer than the CUDA 2.0 beta driver? (And ignore my statement about the toolkit. You are right that code compiled with the older toolkit ought to run on the newer drivers, as far as I know.)

shawkie · June 18, 2008, 8:13am

Yep, I’m using a newer driver.

It looks like it’s probably all down to the texture cache and my access patterns. My run times seem to be very sensitive to the size and shape of my thread block. Strangely, what works well on the 8800 GT also works well on the 8800 GTS 640 but not on the 9800 GTX. Is there a way to get a better idea of what is going on?

Sibi_A · June 18, 2008, 10:33am

Improper memory access may reduce the program execution speed. I faced such an issue: the output was absolutely right, but the memory accessing was slightly wrong (array indexing out of bounds). When I ran my program in EmuDebug mode it crashed with the infamous access violation error message box. I fixed it and the speed came back. (But all of this happened with an 8800 GTS.)

You may also look into something like that.

shawkie · June 18, 2008, 10:49pm

Definitely looks like it’s the texture cache. I’ve found that depending on the exact details of my computation I fall into one of two cases:

1. Computation and (non-texture) memory bandwidth limited - a minimum amount of arithmetic and a maximum amount of coalesced reads and writes is crucial
2. Texture memory bandwidth limited - the most efficient use of the texture cache is crucial

I think that I originally tuned my program for case 1 and then threw a case 2 at it. Strangely the cost of getting it wrong seems to be much higher on the 9800 GTX than the other cards. Maybe it has higher latencies or something?

I’m still interested if anyone has any information on the texture cache and how to measure or model hits and misses.
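As a starting point for modelling, one can count how many distinct cache lines a single thread block touches for a given block shape. The sketch below is a toy model only: it assumes each cache line holds a hypothetical 8x4 tile of texels (the real texture cache organization is not stated anywhere in this thread) and that thread (x, y) performs one bilinear lookup covering texels (x, y) to (x+1, y+1). It measures footprint, not hits and misses over time.

```python
def tiles_touched(block_w, block_h, tile_w=8, tile_h=4):
    """Distinct tiles read by one block under the toy model.

    tile_w and tile_h are hypothetical; they stand in for an unknown
    cache-line layout with 2D locality.
    """
    tiles = set()
    for ty in range(block_h):
        for tx in range(block_w):
            # A bilinear fetch reads a 2x2 neighbourhood of texels.
            for dy in (0, 1):
                for dx in (0, 1):
                    tiles.add(((tx + dx) // tile_w, (ty + dy) // tile_h))
    return len(tiles)


# Compare block shapes with the same number of threads.
for shape in ((256, 1), (64, 4), (16, 16)):
    print(shape, tiles_touched(*shape))
```

In this model a 16x16 block touches fewer than half as many tiles as a 256x1 block with the same thread count, which is consistent with run times being sensitive to block shape once texture fetches dominate. Changing the assumed tile size changes the numbers but not the general preference for compact 2D blocks.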

shawkie · June 26, 2008, 7:56pm

Actually there is more to this. The texture cache is certainly a factor, but I am also seeing very erratic and generally poor performance as soon as I use any driver newer than 169.21.

shawkie · June 27, 2008, 7:04pm

The attached graph demonstrates what I’m seeing. The dark blue and dark red plots are for the Forceware 169.21 drivers. They are well defined and totally repeatable. You will also notice that the second pass is a mirror image of the first - that’s a feature of an inherent symmetry in the calculation. Now look at the bright red and blue plots. They are for the latest 177.35 drivers. They follow the darker plots quite closely in places (although they are slightly higher), but over most of the plot they are massively inflated in a fairly haphazard way. The effect is only vaguely similar on repeated runs. Also notice that it doesn’t obey the inherent symmetry in the calculation!
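The symmetry argument can be turned into a number, which makes driver versions easier to compare than by eye. A minimal sketch, assuming the per-step timings of each pass have been collected into two equal-length lists (the values below are made up for illustration and are not taken from the graph):

```python
def mirror_asymmetry(first_pass, second_pass):
    """Largest relative deviation between pass 1 and the mirrored pass 2.

    Returns 0.0 when the second pass is an exact mirror image of the first.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must have the same number of samples")
    mirrored = list(reversed(second_pass))
    return max(abs(a - b) / a for a, b in zip(first_pass, mirrored))


# Hypothetical timings in milliseconds.
well_behaved = mirror_asymmetry([1.0, 2.0, 4.0], [4.0, 2.0, 1.0])
erratic = mirror_asymmetry([1.0, 2.0, 4.0], [4.0, 3.0, 1.0])
print(well_behaved, erratic)
```

A value near zero on every run matches the behaviour described for the 169.21 drivers. A large value that also changes from run to run would point at something outside the kernel, since the calculation itself is symmetric.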
