9800 GTX and CUDA performance problems: slower than 8800 GT in some cases

I really don’t understand what’s happening but I am seeing my 9800 GTX perform slower than an 8800 GT on some of my tests. In some cases it can take almost twice as long to complete a task! In other tests it comes out much faster. I understood that the 9800 GTX was just an 8800 GTS 512 running at higher clock speeds - is that true? I’m using CUDA 1.1. Any help would be much appreciated.

Can you say more about your setup?

  • Are both cards in the same computer?
  • Which PCI-Express slots are they installed in?
  • Do you know if your kernels are compute-bound or memory-bound?
  • If memory bound, what does bandwidthTest in the SDK report for the two cards?

I have been swapping the cards in and out of the same slot in the same computer. It’s a 16X PCI-E 2.0 slot on an i5400-based motherboard. Actually, I’ve also tried it on another PC with a 16X PCI-E 1.0 slot on an old nForce4-based motherboard, with similar results.

I’m really not sure if they are memory-bound or compute-bound. They don’t do an awful lot of maths, although there is a division (which is quite costly, I believe). Otherwise it’s mostly just texture lookups (with bilinear filtering) and some coalesced reads and writes.

I’ll try the bandwidthTest.

But surely the 9800 GTX should basically be faster in every way, shouldn’t it? There isn’t something weird about the architecture I need to be aware of, is there?

The only thing I can think of that might be important is that I found blocks of 512 (rather than 256 for instance) threads to perform best on the 8800 GT. Could this be my problem?

Yes, as far as I know, the 9800 GTX should be faster than the 8800 GT in all aspects: computation and memory transfer. However, I’ve seen a weird driver bug with CUDA 1.1 that mysteriously handicapped my 8800 GTX’s memory bandwidth until I started X.org. (CUDA 2.0 beta fixed this.) I don’t think this is your problem, but I’m asking about the configuration mostly to narrow down the bug.

Actually… Now that I think about it, I don’t believe the 9800 GTX is supported in CUDA 1.1. Technically, it isn’t supported in CUDA 2.0 beta either, but there have been successful reports of people using it anyway. (The 9800 GTX was released after both CUDA 1.1 and CUDA 2.0 beta.) Can you try the CUDA 2.0 beta driver and toolkit and see if this helps? You might be hitting some kind of driver incompatibility in the older version.

I’ve got the results of the bandwidth test and there are no surprises there. This is for the PCI-E 2.0 slot:

8800 GT

  • Host (pinned) -> Device: 5,129.8 MB/s
  • Device -> Host (pinned): 4,524.3 MB/s
  • Device -> Device: 47,874.4 MB/s

9800 GTX

  • Host (pinned) -> Device: 5,564.7 MB/s
  • Device -> Host (pinned): 4,759.2 MB/s
  • Device -> Device: 56,308.4 MB/s
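As a quick sanity check on these figures, the ratios between the two cards can be worked out directly. The snippet below is only a sketch that reuses the numbers reported above; the dictionary layout and the `speedup` helper are illustrative names, not part of the SDK.

```python
# Bandwidth figures (MB/s) copied from the bandwidthTest results above.
results = {
    "8800 GT": {"h2d": 5129.8, "d2h": 4524.3, "d2d": 47874.4},
    "9800 GTX": {"h2d": 5564.7, "d2h": 4759.2, "d2d": 56308.4},
}


def speedup(metric):
    """Ratio of 9800 GTX to 8800 GT bandwidth for one transfer type."""
    return results["9800 GTX"][metric] / results["8800 GT"][metric]


for metric in ("h2d", "d2h", "d2d"):
    print(f"{metric}: {speedup(metric):.2f}x")
# prints:
# h2d: 1.08x
# d2h: 1.05x
# d2d: 1.18x
```

On these numbers the 9800 GTX is ahead by roughly 5 to 18 percent in every direction, so raw copy bandwidth on its own does not account for a task taking almost twice as long.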

I think you may have a point with the CUDA versions, but I’m not sure it’s possible to run CUDA 2.0 beta on the 9800 GTX, since there is only one version of the driver that supports CUDA 2.0 and it doesn’t support the 9800 GTX. Also, I thought CUDA was meant to be compatible with future hardware without recompilation?

Ah, ok, for some reason I thought you were using a driver from the CUDA 1.1 release. So you are using a driver newer than the CUDA 2.0 beta driver? (And ignore my statement about the toolkit. You are right that code compiled with the older toolkit ought to run on the newer drivers, as far as I know.)

Yep, I’m using a newer driver.

It looks like it’s probably all down to the texture cache and my access patterns. My run times seem to be very sensitive to the size and shape of my thread block. Strangely, what works well on the 8800 GT also works well on the 8800 GTS 640 but not on the 9800 GTX. Is there a way to get a better idea of what is going on?
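One way to explore this sensitivity systematically is to time the same kernel over a fixed list of block shapes on each card and compare the resulting curves. The sketch below only generates such a list. It assumes power-of-two block sides, a warp size of 32 and the 512-thread per-block limit of these GPUs; `candidate_block_shapes` is a hypothetical helper name, not part of the CUDA toolkit, and the timing itself would still have to be done in the CUDA host code.

```python
WARP_SIZE = 32     # threads per warp on these GPUs
MAX_THREADS = 512  # per-block thread limit on these GPUs


def candidate_block_shapes(max_threads=MAX_THREADS, warp=WARP_SIZE):
    """Return (width, height) pairs with power-of-two sides whose
    thread count is a whole number of warps and within the limit."""
    shapes = []
    w = 1
    while w <= max_threads:
        h = 1
        while w * h <= max_threads:
            if (w * h) % warp == 0:
                shapes.append((w, h))
            h *= 2
        w *= 2
    return shapes


for w, h in candidate_block_shapes():
    print(f"{w:4d} x {h:<4d} = {w * h} threads")
```

Running every shape on each card, several times each, would show whether the 9800 GTX simply has a different optimum or is slower across the board.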

Improper memory access may reduce the program’s execution speed. I faced such an issue: the output was absolutely right, but the memory accesses were slightly wrong (array indexing out of bounds). When I ran my program in EmuDebug mode it crashed with the infamous access violation error message box. I fixed it and it was back to speed. (But all of this happened with an 8800 GTS.)

You may also look into something like that.

Definitely looks like it’s the texture cache. I’ve found that depending on the exact details of my computation I fall into one of two cases:

  1. Computation and (non-texture) memory bandwidth limited - minimum amount of arithmetic and maximum amount of coalesced reads and writes is crucial

  2. Texture memory bandwidth limited - most efficient use of texture cache is crucial

I think that I originally tuned my program for case 1 and then threw a case 2 at it. Strangely the cost of getting it wrong seems to be much higher on the 9800 GTX than the other cards. Maybe it has higher latencies or something?

I’m still interested if anyone has any information on the texture cache and how to measure or model hits and misses.
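In the absence of hard numbers, a toy model can at least show why block shape matters for a 2D cache. The sketch below is not a description of the real hardware: the tile shape (`tile_w`, `tile_h`), the capacity (`cache_lines`) and the LRU replacement policy are all made-up assumptions, and threads are visited in simple row-major order rather than warp by warp. It counts how many cache lines one thread block misses when each thread does a bilinear (2x2 texel) lookup at its own coordinates.

```python
from collections import OrderedDict


def count_misses(block_w, block_h, tile_w=8, tile_h=4, cache_lines=16):
    """Count cache-line misses for one thread block reading a 2D texture.

    Toy model only: tile shape, capacity and LRU policy are invented
    parameters, not measured properties of any GPU.
    """
    cache = OrderedDict()  # LRU order: least recently used entry first
    misses = 0
    for y in range(block_h):
        for x in range(block_w):
            # Bilinear filtering touches a 2x2 texel footprint.
            for dy in (0, 1):
                for dx in (0, 1):
                    line = ((x + dx) // tile_w, (y + dy) // tile_h)
                    if line in cache:
                        cache.move_to_end(line)
                    else:
                        misses += 1
                        cache[line] = True
                        if len(cache) > cache_lines:
                            cache.popitem(last=False)
    return misses


# Two blocks with the same 256 threads but different shapes.
print(count_misses(16, 16))  # 15
print(count_misses(256, 1))  # 33
```

Under these assumptions the square block touches fewer than half as many cache lines as the one-row block. Fitting the made-up parameters to measured run times on each card would be one way to guess at the real cache geometry.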

Actually, there is more to this. Texture cache is certainly a factor, but I am also seeing very erratic and generally poor performance as soon as I use any driver newer than 169.21.

The attached graph demonstrates what I’m seeing. The dark blue and dark red plots are for the Forceware 169.21 drivers. They are well defined and totally repeatable. You will also notice that the second pass is a mirror image of the first - that’s a feature of an inherent symmetry in the calculation. Now look at the bright red and blue plots. They are for the latest 177.35 drivers. They follow the darker plots quite closely in places (although they are slightly higher), but over most of the plot they are massively inflated in a fairly haphazard way. The effect is only vaguely similar on repeated runs. Also notice that it doesn’t obey the inherent symmetry in the calculation!
cuda_drivers_perf.gif
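The two properties described above, run-to-run repeatability and the mirror symmetry between passes, can be turned into simple numeric checks on logged timings. This is a minimal sketch: the function names are hypothetical and the sample values are invented for illustration, not taken from the graph.

```python
def mirror_error(first_pass, second_pass):
    """Largest relative gap between the first pass and the reversed
    second pass; 0.0 means the second pass is an exact mirror image."""
    mirrored = list(reversed(second_pass))
    return max(abs(a - b) / max(a, b) for a, b in zip(first_pass, mirrored))


def run_to_run_spread(runs):
    """Largest relative (max - min) / min spread across repeated runs,
    taken step by step."""
    return max((max(step) - min(step)) / min(step) for step in zip(*runs))


# Synthetic timings in milliseconds, invented purely for illustration.
first = [1.0, 2.0, 4.0]
second = [4.0, 2.0, 1.0]
print(mirror_error(first, second))                   # 0.0
print(run_to_run_spread([first, [1.0, 2.0, 5.0]]))   # 0.25
```

Both values should stay close to zero for a well-behaved driver, so tracking them per driver version would give a single number to report alongside the plots.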