I’ve been reading some disturbing rumours suggesting that the 8800 GTS 512 and 8800 GT are already end-of-life and the 9800 GTX and 9800 GX2 will be soon.
I’ve been using Direct3D to do GPGPU computing for a while and I’ve recently been evaluating CUDA. My overall impression is that there are some gains to be had (up to about 2.5x for my application) but it is still a bit immature and unstable. I get good results from the 169.21 driver and CUDA 1.1 but performance becomes erratic when I try using newer drivers. This makes it difficult to use the newer cards.
The 8800 GTS 512 is small and not too power hungry, making it very easy to use. The 9800 GTX is too big, uses too much power and makes too much heat. It simply won’t fit in our current motherboard. As a result I’m now looking at new motherboards, but because I have to use server boards there aren’t many options. It tends to be a lengthy process involving several BIOS updates before everything works together, just in time for the card to be dropped. It’s very frustrating.
This brings me to the 9800 GX2. Yes, it’s big. Yes, it’s power hungry. Yes, it gets very hot. But it’s also incredibly fast and relatively cheap. The benchmarks I’ve seen indicate the 280 will be slower despite its higher price.
So, what we really need is a dual-GPU card based on the new 55nm chip (from the 9800 GTX+) but clocked at current GTS or GTX speeds.
I have not seen one single benchmark yet comparing the GTX280 to 9800GX2 for CUDA performance. The differences between the 2 are:
double the registers per MP
much higher memory bandwidth
apparently a MAD and a MUL can be issued at the same time much more often than in previous generations.
240 ALUs compared to 2 * 128 on the 9800GX2, but the max. number of threads running is 30 * 1024 vs 2 * 16 * 768 = 30720 vs 24576 (and you will achieve more of those 30k than of those 24k because of the amount of available registers)
atomic shared memory operations (can be useful for some tasks)
The first one brings much higher occupancy for register-hungry algorithms.
The second brings higher performance when hitting the bandwidth barrier.
I do not see one reason why the GTX280 would perform slower than the 9800GX2, and I would be surprised if it did.
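The thread-count arithmetic in the list above can be sketched as a small host-side calculation. The MP counts and thread limits are the ones quoted in the post; the register counts (8192 and 16384 per MP) are an assumption taken from the CUDA programming guide figures for compute capability 1.1 and 1.3, and `MpLimits` and `maxResidentThreads` are names made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor (MP) limits. Register counts are assumed from the
// CUDA programming guide; thread limits and MP counts are from the post.
struct MpLimits {
    int registers;   // 32-bit registers per MP
    int maxThreads;  // maximum resident threads per MP
    int mpCount;     // MPs per GPU
};

const MpLimits G92   = {8192, 768, 16};    // one GPU of a 9800 GX2
const MpLimits GT200 = {16384, 1024, 30};  // GTX 280

// Upper bound on resident threads per GPU for a kernel that uses
// regsPerThread registers. This ignores shared memory, block size and
// register allocation granularity, so real occupancy can be lower.
int maxResidentThreads(const MpLimits& mp, int regsPerThread) {
    assert(regsPerThread > 0);
    int perMp = std::min(mp.maxThreads, mp.registers / regsPerThread);
    return perMp * mp.mpCount;
}
```

With 8 registers per thread this reproduces the 30720 vs 2 * 12288 = 24576 figures. At 16 registers per thread the bound stays at 30720 for the GTX280 but drops to 2 * 8192 = 16384 for the GX2, which is the "more of those 30k than of those 24k" effect.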
It is true.
And the biggest benefit of the GTX280 is 1GB of RAM on a single GPU, compared to 2x512MB on the 9800 GX2. This matters because performance on the 9800GX2 drops dramatically if your application needs to use all available computing power (almost always) with the kernel distributed over all GPUs. The problem arises when the kernels running on different GPUs need to exchange data, because each GPU has its own global memory that is inaccessible from the other GPU. In such situations the host CPU must be used to transfer data from one GPU to the other, which has a huge impact on performance. As if that were not enough, things are worse in practice since each device (GPU) must be accessed from a different CPU thread.
So when exchanging data across multiple GPUs, the host CPU threads must share a memory block and synchronize so that the exchanged data is available to the second CPU thread and can then be transferred to the second GPU. This problem exists on every multi-GPU platform and NVidia knows that. So they developed a technology that lets GPUs communicate directly over the northbridge without involving the host, and it is already integrated in their 790i chipset. The problem is that it is not supported by CUDA yet. Hello NV guys and CUDA developers, can someone say when it will be supported?
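The host-side handoff described above might look roughly like the following. This is only a sketch of the synchronization pattern, not real CUDA code: plain `std::vector` buffers stand in for each GPU's global memory, the comments mark where `cudaSetDevice` and `cudaMemcpy` calls would go, and `Staging` and `exchangeViaHost` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical staging area shared by the two host threads.
struct Staging {
    std::mutex m;
    std::condition_variable cv;
    std::vector<float> data;  // host memory visible to both threads
    bool ready = false;       // set once the first thread has filled data
};

// gpu0Mem stands in for GPU 0's global memory; the return value is what
// ends up in the stand-in for GPU 1's global memory.
std::vector<float> exchangeViaHost(const std::vector<float>& gpu0Mem) {
    Staging s;
    std::vector<float> gpu1Mem;

    // Host thread that owns GPU 0 (real code: cudaSetDevice(0) first).
    std::thread t0([&] {
        // Real code: cudaMemcpy(..., cudaMemcpyDeviceToHost).
        std::vector<float> tmp = gpu0Mem;
        {
            std::lock_guard<std::mutex> lock(s.m);
            s.data = std::move(tmp);
            s.ready = true;
        }
        s.cv.notify_one();
    });

    // Host thread that owns GPU 1 (real code: cudaSetDevice(1) first).
    std::thread t1([&] {
        std::unique_lock<std::mutex> lock(s.m);
        s.cv.wait(lock, [&] { return s.ready; });
        // Real code: cudaMemcpy(..., cudaMemcpyHostToDevice).
        gpu1Mem = s.data;
    });

    t0.join();
    t1.join();
    assert(gpu1Mem.size() == gpu0Mem.size());
    return gpu1Mem;
}
```

Every exchange pays for two copies over the bus plus a wait on the host, which is the overhead a direct GPU-to-GPU path would avoid.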
I haven’t seen any CUDA benchmarks either. I was basing my conclusions on gaming benchmarks where the 280 generally performs worse than the 9800 GX2. I find this particularly strange since SLI doesn’t normally scale very well.
My particular CUDA problem scales linearly with the number of GPUs. I don’t need any communications between GPUs. For me a single GPU that’s twice as fast, has twice the memory and twice the memory bandwidth is exactly the same as two GPUs.
The increased number of registers is a factor I hadn’t considered, but my problem isn’t particularly register hungry either. I can make do with 8 registers per thread if I need to, although I can effectively pack several iterations into one by using more.
Yet another peculiarity of my problem is that with the 9800 GX2 I am already reaching the point where GPU performance is not my bottleneck. Any performance improvements will mostly now come from increased video memory (because it happens to improve the efficiency of the non-GPU part of the problem).
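A rough way to see why a faster GPU stops helping once the GPU is no longer the bottleneck is Amdahl's law. The post does not name it, and the fractions used here are made-up illustrations rather than measurements; `overallSpeedup` is a hypothetical helper.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction gpuFraction of the run
// time is accelerated by gpuSpeedup and the remainder is unchanged.
double overallSpeedup(double gpuFraction, double gpuSpeedup) {
    assert(gpuFraction >= 0.0 && gpuFraction <= 1.0);
    assert(gpuSpeedup > 0.0);
    return 1.0 / ((1.0 - gpuFraction) + gpuFraction / gpuSpeedup);
}
```

If, say, only half the run time is spent on the GPU, `overallSpeedup(0.5, 2.0)` gives about 1.33, so a GPU that is twice as fast yields well under a 2x gain overall.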
Hopefully this helps to explain why I am not over the moon about the next generation of nVidia GPUs. Of course my problem may not be typical.
Well, sounds like you will have to wait for the Tesla version then with 4 GB :) But indeed it sounds like a particular algorithm/problem (and also a very parallel problem so you should be happy ;))
Unfortunately the Tesla is out of my price range and the improvements with video memory size aren’t huge anyway (in many cases the 2GB of two 9800 GX2 cards is already plenty). It’s certainly true that my problem is well suited to GPUs but that also means that the major gains (about 250x faster than 2 years ago for similarly priced hardware) have already been had. Even a 2x improvement is hard to come by now.
Are you referring to a 2x improvement over the 250x improvement you made 2 years ago? If that is the case, then it’s not surprising, as you are still 500x over 2 years ago. You become hardware limited just like you would be if you were using a CPU. This just seems like the nature of the beast to me. What is confusing me is the 250x improvement. I assume that is a GPU to CPU comparison and now the 2x improvement is a GPU to GPU improvement. If not, I’m lost.
Edit: I reread your post, and am still confused as to what you are asking or if you are just unhappy in general and wanted to share it.