Raw speed for CUDA apps: what is the fastest card at present?

Hi all,

I’m in the process of trying to decide which cards to grab for a bunch of new machines to run as a test bed for our new CUDA-based engines.

I have a bank of GTX-8800’s (G80) and was looking into maybe getting a number of the newer GTS-8800’s (G92), and wanted to know if anyone has figures for comparing the two cards in CUDA apps.

I’m also aware that new cards are due in the next quarter (it’s the 6th of Feb, 2008 when I’m posting this, for posterity), but we need some cards right now - we will upgrade later if they are sufficiently faster.

According to the nVidia wiki page, the on-card memory bandwidth is 86.4 GB/sec (G80) versus 62.7 GB/sec (G92), but the shader rate is listed as 518 GFLOPS (G80) and 624 GFLOPS (G92) - so I’m thinking that maybe the higher shader rate, combined with PCI-Express v2.0, might give enough of a boost to make it worthwhile?

Anyone? I’m trying to work out the whole “2x GTS” vs “1x Ultra” question to determine the best “bang for buck” we can get for the $$$.

Thanks,

RPS.

As you have noticed, the internal memory bus is wider on the “old” G80, but the “raw speed” is higher on the new 8800GTS/512.
I would personally get the G92-based 8800GTS instead of the G80-based 8800Ultra, because its price/performance ratio is much better.

Hi RPS,

First of all, you’ve got the theoretical peak GFLOPS figures way too high. The numbers presented in some PR material include capacity not available to CUDA. To get the correct figures, multiply the shader frequency by the number of thread processors by two:

E.g. 8800GTS/512 GPU: 1.625 GHz * 128 * 2 = 416 GFLOPS
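
Spelled out as a trivial host-side calculation, if that helps (the clocks and processor counts below are just the published figures, quoted from memory):

    #include <stdio.h>

    /* Peak single-precision GFLOPS usable from CUDA:
       shader clock (GHz) * thread processors * 2 (one MAD per clock). */
    static float peak_gflops(float shader_clock_ghz, int thread_processors)
    {
        return shader_clock_ghz * thread_processors * 2.0f;
    }

    int main(void)
    {
        printf("8800GTS/512 (G92): %.0f GFLOPS\n", peak_gflops(1.625f, 128)); /* 416 */
        printf("8800GTX     (G80): %.0f GFLOPS\n", peak_gflops(1.350f, 128)); /* ~346 */
        return 0;
    }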

I have tested a few cards against each other and my conclusion is that they are not comparable without knowing what type of problem you are going to use them for. For example, the eigenvalues demo runs faster on an 8800GT than on a Tesla, but if you need more memory capacity or memory bandwidth, the Tesla wins. If you do a lot of transfers to/from the CPU, a PCIe v2.0 system is of course preferred, etc. There is no generally best GPU today.

Myself, I’d settle for the 8800GTS for now, mainly because G92 is a newer architecture than G80. Also keep in mind that the rest of the system is of vital importance for the overall performance. You need at least as many CPU cores as GPUs in the machine, and different motherboard chipsets seem to give large variance in DMA transfer speeds. nForce 780i motherboards seem to be slower than X38, but on the other hand the 780i can run one more GPU at PCIe v2.0 speeds, etc.
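
If you want to compare chipsets yourself, this is a rough sketch of the kind of host-to-device copy timing I mean, using the standard runtime API and a pinned buffer (the 64 MB size is arbitrary, and you would want to average several copies):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Rough host-to-device bandwidth test: time one large pinned-memory copy. */
    int main(void)
    {
        const size_t bytes = 64 * 1024 * 1024;   /* 64 MB, arbitrary */
        void *h_buf, *d_buf;
        cudaMallocHost(&h_buf, bytes);           /* pinned, so DMA is used */
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("host->device: %.2f GB/s\n", (bytes / (ms / 1000.0f)) / 1.0e9f);

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }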

It’s a jungle.

  • Kuisma

I’ve benchmarked my application on both the GTX and the 8800 GT (very similar specs to the new GTS). Performance for the global-memory-bandwidth-bound kernels is as expected: the GT is slower by roughly the same factor as the drop in raw memory bandwidth.

Performance for the computation-limited kernels is roughly equal. So the overall program (about half of each type of kernel) is only a little slower on the GT.

As Kuisma says, you need to pick the GPU that best fits your needs.

Hi all,

Thanks for your input, it’s appreciated.

I have a strange situation here, which is confusing things somewhat regarding the decision on which way to go for the additional hardware:

In short, I have done about as much optimizing as I can, and the figures I am getting from the cards are pretty impressive, but worrying, as they don’t seem to add up…

From what I can understand from reading the documentation and this forum, the limit for throughput should be ~64 GB/sec (G80-GTS) and ~86.4 GB/sec (G80-GTX).

My figures are now somewhat in excess of those figures, which makes me think either I’m doing something wrong - or maybe there is something I am not taking into consideration, maybe the cache?

After much trial and error, I have abandoned the use of shared memory altogether. My code now runs purely from the texture and constant memory regions and registers. It then push-pulls context information via global memory (which is minuscule).
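
To give a rough idea of the pattern (this is not my actual kernel, just a made-up sketch of “texture and constant in, registers for work, a tiny global write out”; all names and sizes are hypothetical):

    #include <cuda_runtime.h>

    /* Hypothetical sketch only. Input comes from the (cached) texture and
       constant spaces, work stays in registers, and only a small result
       goes back out through global memory. */
    texture<float, 1, cudaReadModeElementType> tex_data;  /* bound to a linear device buffer */
    __constant__ float c_coeffs[16];                      /* small read-only table */

    __global__ void iterate(float *g_context, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        float acc = 0.0f;                                 /* register */
        for (int k = 0; k < 16; ++k)
            acc += tex1Dfetch(tex_data, i * 16 + k) * c_coeffs[k];

        g_context[i] = acc;                               /* tiny global-memory write */
    }

    /* Host side (error checking omitted):
       cudaBindTexture(0, tex_data, d_data, n * 16 * sizeof(float));
       cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(c_coeffs));
       iterate<<<blocks, threads>>>(d_context, n);
    */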

The figures I am seeing are 5 billion iterations / second on the GTS-8800-320MB (G80). Each iteration is reading 14 bytes - so by my math, that equates to some 70 GB/sec - isn’t the card only supposed to be capable of 64 GB/sec?!

I’m thinking that maybe it could be the cache (which covers both the texture and constant memory regions) which is inflating the speed (and hence, my results)?

I have yet to run it on the GTX’s - I’ll leave that for the morning - but I think I’m pretty well there (basically, I need to know when to stop optimizing, so I’m a little out of sorts now that I have exceeded what I thought was the limit… :)

RPS.

I think you’re right in your guess. Both texture and const memory are cached ;-)

I have the same problem: when do you stop optimizing? As far as I read in some very nice slides from Mark Harris, you should basically calculate your GFLOPS and GB/s and see if either of these is approaching the theoretical limit. But with kernels that have parts inside if-statements, switches and such, it quickly gets complicated to do the math.
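
For what it’s worth, the back-of-the-envelope part is easy once you have a kernel time from the event timers; the numbers below are just placeholders you’d replace with your own counts:

    #include <stdio.h>

    /* Compare achieved rates against the card's peaks.
       All values are placeholders - time the kernel with CUDA events and
       count the flops/bytes of your own inner loop. */
    int main(void)
    {
        double kernel_ms      = 3.2;      /* from cudaEventElapsedTime(), hypothetical */
        double elements       = 10.0e6;   /* work items processed per launch */
        double flops_per_elem = 20.0;     /* adds + muls in the inner loop */
        double bytes_per_elem = 14.0;     /* reads + writes per element */

        double sec = kernel_ms / 1000.0;
        printf("achieved: %.1f GFLOPS, %.1f GB/s\n",
               elements * flops_per_elem / sec / 1e9,
               elements * bytes_per_elem / sec / 1e9);
        return 0;
    }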

So now I have decided to stop when I reach real-time performance :D

Yikes, I’ve got a long ways to go then. A full day of number crunching only nets me nanoseconds of simulation time :)