CUDA vs ATI Stream comparison

I’m interested in a cheap personal workstation for math-intensive calculations and graphics processing.

  1. Has anyone tried to compare ATI Stream and CUDA benchmarks, say on their fastest processors (i.e. the 4870X2 and GTX295)? Perhaps even using Badaboom and ATI Avivo on the same file?
  2. As I see it, one strike against CUDA and PhysX is their closed, proprietary APIs versus OpenCL and the general-purpose GPU functions in the upcoming DirectX 11. Are there any plans to open up CUDA?
  3. The upcoming ATI FireStream 9270 claims 1.2 TFLOPS in single precision and 240 GFLOPS in double. What might be the equivalent cards/combos for the Nvidia 200 series?

I don’t know that either of those would make good benchmarks, since they are specific to each GPU and there’s really no reference to determine a ‘correct’ output. Maybe something like BLAS or LAPACK, optimized for each GPU, would be a better benchmark. CUBLAS 2.0 should be available soon (from what the nVidia staff says), so you can use that against whatever ATI supplies.
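To give an idea of what I mean, here is a rough, untested sketch of timing one large SGEMM through CUBLAS (the matrix size is just a placeholder, and the ATI side would use whatever BLAS their SDK provides):

    // Untested sketch: time one large SGEMM with the CUBLAS API.
    // Build with something like: nvcc sgemm_bench.cu -lcublas
    #include <cstdio>
    #include <cstdlib>
    #include <cublas.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int N = 2048;                          // pick whatever fits your card
        const size_t bytes = (size_t)N * N * sizeof(float);

        float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes);
        for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

        cublasInit();
        float *dA, *dB, *dC;
        cublasAlloc(N * N, sizeof(float), (void**)&dA);
        cublasAlloc(N * N, sizeof(float), (void**)&dB);
        cublasAlloc(N * N, sizeof(float), (void**)&dC);
        cublasSetMatrix(N, N, sizeof(float), hA, N, dA, N);
        cublasSetMatrix(N, N, sizeof(float), hB, N, dB, N);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cublasSgemm('N', 'N', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);  // C = A*B
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gflops = 2.0 * N * N * (double)N / (ms * 1e6);  // ~2*N^3 FLOPs per GEMM
        printf("SGEMM %d x %d: %.1f ms, %.1f GFLOP/s\n", N, N, ms, gflops);

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
        free(hA); free(hB);
        return 0;
    }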

What do you mean, open up CUDA? How open do you want it to be? Some (or all?) of the compiler is already open-sourced, if you look on nVIDIA’s FTP site. nVIDIA doesn’t provide open-source drivers for their hardware, so if that’s something you absolutely must have (though I don’t see why it would be a huge issue), you’re out of luck.

From what I’ve read, a single GTX280 can do around 1 TFLOP in SP and something like 120 GFLOPS in DP. So, if you used proper multi-GPU programming methods, you could get approximately twice that from a GTX295. The upcoming GT300 series (rumored to be due towards the end of this year) should probably improve on this by a good bit.

I wouldn’t spend too much time worrying about ATI Stream. It sounds like that API (which wasn’t very flexible) will be retired in favor of OpenCL. OpenCL bears a lot of resemblance to CUDA, so no matter which one you pick, switching to the other will be less trouble in the future. NVIDIA has said that they will also be supporting both OpenCL and CUDA in the future. My guess is that CUDA will be used to showcase new GPU features, which may or may not appear in future OpenCL revisions. The runtime API (rather than the “driver API”) of CUDA also seems a lot less verbose than OpenCL.
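To illustrate the verbosity point, here is roughly what a trivial launch looks like with the runtime API (an untested sketch; the OpenCL equivalent needs explicit platform, context, queue, and program setup before anything is enqueued):

    // Sketch of a trivial CUDA runtime API launch: no context or queue boilerplate.
    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    void scale_on_gpu(float *host, int n)
    {
        float *dev;
        cudaMalloc((void**)&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);  // the launch itself is one line
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
    }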

As for comparing ATI to NVIDIA with Badaboom and Avivo, Anandtech did this in December:

http://www.anandtech.com/video/showdoc.aspx?i=3475

Sadly, their conclusions indicate that Avivo is not a very good product yet, and so it is hard to compare it to Badaboom. GPU-accelerated video encoding is still too new to be a good cross-platform benchmark of GPU performance.

I understand what it means for software (like libraries) to be open, but what does it mean for an API to be open? The CUDA API is published, which seems pretty “open” to me. Do you mean open source drivers and tools?

The graphics drivers (which provide both 3D and CUDA support) are closed source. Some parts of the CUDA tools, such as the nvcc compiler and the CUDA debugger, derive from GPL projects and have published source code. The rest of the toolkit is closed source, but some utility libraries built on top of CUDA, like CUBLAS, CUFFT, and CUDPP, are open source.

NVIDIA has made no mention of opening things up any more than this, though.

There is no single NVIDIA device that can achieve this performance now, and NVIDIA generally does not announce the features of future products in advance. There are of course rumors on various news sites about the next NVIDIA GPU, but it’s pretty much impossible to reliably compare future products from NVIDIA with future products from ATI.

The GTX 285, released in January, claims about 1 TFLOP “peak” in single precision, but these numbers have all kinds of caveats depending on what you are actually doing. The GTX 285 has 240 single precision floating point units and 30 double precision units clocked at 1.476 GHz. With full pipelines the single precision units can finish one multiply-add (two FLOPs) per clock, giving you 708 GFLOPS, assuming that is all you are doing. The GTX 200 series chips can also issue an additional multiply instruction overlapping with the multiply-add, but it is not entirely clear (to me, anyway) how this capability actually works in practice. The texture units also give you some additional FLOPS by performing linear interpolation for you. Once the dual-issue and texture FLOPs are added in, you get a somewhat fuzzy 1 TFLOP.
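Spelling out the ALU arithmetic from those numbers:

    MAD only:       240 units x 2 FLOPs/clock x 1.476 GHz ≈  708 GFLOPS
    MAD + dual MUL: 240 units x 3 FLOPs/clock x 1.476 GHz ≈ 1063 GFLOPS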

In the case of double precision, the CUDA documentation gives no guidance at all. Assuming the one double precision unit per multiprocessor works like the single precision units, you could imagine getting 2-3 double precision FLOPs per DP unit per clock cycle, for a total of 88 to 132 double precision GFLOPS on the GTX 285.
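That range comes from the same kind of arithmetic:

    30 DP units x 2 FLOPs/clock x 1.476 GHz = 88.6 GFLOPS
    30 DP units x 3 FLOPs/clock x 1.476 GHz = 132.8 GFLOPS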

Anyway, the take-home message here should be that FLOPs is a tricky thing to calculate, and the number is highly dependent on what you are doing. Look at numbers from both NVIDIA and ATI with skepticism. :)

Also: when comparing GPUs you should take a look at the total graphics memory bandwidth. Many GPU programs are limited by how fast you can move data to the floating point units, not by the floating point performance itself.

For example, if your program only does 1 floating point operation per element read from the graphics memory, the GTX 285 floating point units will only be operating at 1/10th of their peak capability.
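For instance, a trivial kernel like this sketch does exactly one FLOP per element it reads, so it spends essentially all of its time waiting on memory rather than computing:

    // One add per element, but 8 bytes of memory traffic (4 read + 4 written):
    // throughput is set by memory bandwidth, not by the FP units.
    __global__ void add_constant(const float *in, float *out, float c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] + c;
    }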

Brief correction:

The peak FLOP numbers do not include anything related to texturing, just the MAD from each SP and the dual issued MUL.

Ah, OK. That makes those numbers easy to calculate then. :)

In a related topic sometime back,

Someone said ATI is not fully IEEE 754 compliant even in DP. At least CUDA is IEEE 754 compliant in DP.

Don’t know how far the ATI thing is true… Can someone confirm?

"I understand what it means for software (like libraries) to be open, but what does it mean for an API to be open? The CUDA API is published, which seems pretty “open” to me. Do you mean open source drivers and tools?
The graphics drivers (which provide both 3D and CUDA support) are closed source. Some parts of the CUDA tools, such as the nvcc compiler and the CUDA debugger, derive from GPL projects and have published source code. The rest of the toolkit is closed source, but some utility libraries built on top of CUDA, like CUBLAS, CUFFT, and CUDPP, are open source.

NVIDIA has made no mention of opening things up any more than this, though. "

For starters, full documentation and no hidden APIs and hooks (like in early Windows). User-extensible drivers, a la UNIX, would be useful. OpenCL and DirectX 11’s Compute Shader will be open standards. Didn’t know NVCC is open source. ATI claims CUDA is a closed language, while their Stream is an “open standard”. I understand however that Nvidia’s compiler is based on the open-source Open64 Itanium compiler. Eventually the market will settle on a standard, but it would be useful if users didn’t have to choose. Imagine if ATI and Nvidia agreed to support each other - ATI would support CUDA and Nvidia would support ATI’s Brook+ and Havok APIs, and/or a cross compiler would support all the standards…

"There is no single NVIDIA device that can achieve this performance now, and NVIDIA generally does not announce the features of future products in advance. "

Since the Tesla products come as multi-board arrays, perhaps 2 or 3 280s or 295s in SLI could be scaled to such performance. Don’t know if this is possible.

"Anyway, the take-home message here should be that FLOPs is a tricky thing to calculate, and the number is highly dependent on what you are doing. Look at numbers from both NVIDIA and ATI with skepticism. "

I hear some company that does peripherals has a new chipset capable of petaflops calculations :)

It sounds like what you are really asking for is “documented” and “cross platform”. CUDA is definitely documented, though one could argue about whether “full” has been achieved. :) I expect CUDA will never be as cross-platform as OpenCL will be. At best, we will soon see a CUDA compiler which can target both NVIDIA GPUs and multithreaded SSE on x86 processors. I doubt CUDA will ever run on ATI cards. For cross-platform data-parallel computing, it looks like OpenCL is the future.

That said, it may be a year before OpenCL has the level of OS support (Win/Mac/Linux) that CUDA does now. CUDA has had 2 years to deal with the cross-OS issues, and still has some work to do.

Yes, you can run multiple CUDA devices at the same time, although the programmer is responsible for partitioning the work between them. Four GTX 295s (which will appear to the OS as 8 CUDA devices) have been crammed into one computer and actually worked. The main limitations are power, space, and cooling in the case. I assume multiple ATI cards can also be used at the same time, so I don’t know if it is a useful criterion for comparing ATI and NVIDIA.
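In case it is useful, the usual pattern looks something like this untested sketch: one host thread per device, each calling cudaSetDevice() before it does any CUDA work, with the data split into slices (all names here are just for illustration):

    // Untested sketch: one pthread per CUDA device, each working on its own slice.
    #include <cstdio>
    #include <pthread.h>
    #include <cuda_runtime.h>

    struct Slice { int device; float *data; int n; };

    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void *worker(void *arg)
    {
        Slice *s = (Slice*)arg;
        cudaSetDevice(s->device);                      // bind this host thread to one GPU
        float *d;
        cudaMalloc((void**)&d, s->n * sizeof(float));
        cudaMemcpy(d, s->data, s->n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(s->n + 255) / 256, 256>>>(d, s->n);   // work on this device's slice
        cudaMemcpy(s->data, d, s->n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
        return 0;
    }

    int main()
    {
        const int N = 1 << 20;
        float *data = new float[N];
        for (int i = 0; i < N; ++i) data[i] = 1.0f;

        int count = 0;
        cudaGetDeviceCount(&count);                    // a GTX 295 shows up as 2 devices
        if (count == 0) { printf("no CUDA devices\n"); return 1; }

        pthread_t *threads = new pthread_t[count];
        Slice *slices = new Slice[count];
        int per = N / count;
        for (int i = 0; i < count; ++i) {
            slices[i].device = i;
            slices[i].data   = data + i * per;
            slices[i].n      = (i == count - 1) ? N - i * per : per;
            pthread_create(&threads[i], 0, worker, &slices[i]);
        }
        for (int i = 0; i < count; ++i) pthread_join(threads[i], 0);

        printf("done on %d device(s), data[0] = %f\n", count, data[0]);
        delete[] threads; delete[] slices; delete[] data;
        return 0;
    }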

Hah, I bet…

I don’t know about you guys, but all the problems I have been working on in CUDA were IO bound, and this for me is where the GPU has the big upper hand over the CPU. If you look at the hardware architecture of the GT200 and the 4xxx series you can see two very different approaches. I’ll leave it up to you to decide which fits your needs…

Well, not quite happy to say this, but for compute-bound kernels the 4870 (not 4870X2) is about 30% faster than the GTX280. The GTX295 is certainly faster than a single 4870 but still much slower than the 4870X2. Again, these figures are for a compute-bound kernel. Here are some actual benchmarks (higher is better):

GTX280 - 11800

GTX285 - 12500

GTX295 - 21700

  4870 - 15750

 4870X2 - 31000

Problem with ATI is their software and API. You’ll need to spend a lot of time to make things work, and then some more time to make them work fast. CUDA is much more developer-friendly.

Also, I’m not aware of many ATI Stream-enabled products, so if you plan to use third-party software CUDA seems to be better choice.

Yes, as AnderiB says, in raw compute power ATI cards are more powerful (this is in direct correlation to the FLOPS numbers). The CUDA toolchain is much easier to work with than the current Stream one; OpenCL will even the playing field quite a bit :) But still, for real-world problems, a lot of the time the IO is the bottleneck, and there NVIDIA has the upper hand in a significant way.

How do NVIDIA and IO go together?

I thought CUDA was for compute-intensive jobs. What is IO doing here?

IO means memory bandwidth in this context. And NVIDIA is better than ATI in this respect.

You can split problems into IO-bound problems and compute-bound problems. IO is any data transfer, from the hard disk, RAM, or cache. Of course, a problem that is IO bound on one architecture might be compute bound on another. When you want to accelerate your code it is vitally important to know whether you are IO bound or compute bound, because optimizing for one does nothing (or worse) for the other. I can’t remember where I saw a nice explanation of the IO versus compute capabilities of the latest cards from NVIDIA and ATI. In the end, in IO-bound problems you might have your compute cores sitting around most of the time, waiting for data to be read into local registers.
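A quick sanity check is to compare a kernel’s FLOP-to-byte ratio with the card’s ratio of peak FLOPS to memory bandwidth. Using the GTX 285 numbers quoted earlier in the thread, and its memory bandwidth of roughly 159 GB/s (from memory, so treat it as approximate):

    card balance point:  ~708 GFLOPS / ~159 GB/s       ≈ 4.5 FLOPs per byte
    1 FLOP per 4-byte float read    = 0.25 FLOPs/byte  -> badly IO bound
    ~20 FLOPs per 4-byte float read = 5 FLOPs/byte     -> roughly balanced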

"Well, not quite happy to say this, but for compute-bound kernels the 4870 (not 4870X2) is about 30% faster than the GTX280. The GTX295 is certainly faster than a single 4870 but still much slower than the 4870X2. Again, these figures are for a compute-bound kernel. Here are some actual benchmarks (higher is better):

GTX280 - 11800
GTX285 - 12500
GTX295 - 21700
  4870 - 15750
 4870X2 - 31000"

Do you have any benchmarks (ATI vs. NVIDIA) for double precision calculations? I assume that the results presented above were done for single precision.

No, I do not have DP benchmarks. The numbers quoted above are for 32-bit integer ops (which are quite close to SP). With DP, the difference between ATI and NVIDIA should be even bigger, IMHO.

I was under the impression that none of the ATI cards supported double precision yet…has that changed?

Just checked the docs once again. They say that ATI cards support double precision starting with HD3870.

Yes, they have had DP for quite a while. The number of stream cores inside their GPUs is huge… one of their latest has 800 cores… Grrr…