NVIDIA vs. ATI GPGPU Comparison

Hi everyone,

I’m new to the forum, so sorry if I opened this topic in the wrong place.

I recently heard about parallel programming on GPUs, got interested, and looked it up on Google. I found out that GPGPU can be done on both NVIDIA and ATI video cards. Since I want to start parallel programming on GPUs (more precisely, I want to use GPUs in computational finance), I’d like to know the advantages and disadvantages of each side.

NVIDIA has CUDA and the Parallel Nsight plugin for Visual Studio; ATI has OpenCL and the gDEBugger plugin for Visual Studio. If I remember correctly, OpenCL works on both ATI and NVIDIA cards, while CUDA is NVIDIA-only. Programming in OpenCL is probably hard. Is it easy to write big projects in OpenCL? My guess is that if I want to write a large code base without bugs, I should work in Visual Studio. But there are no tools there to run the same code on both ATI and NVIDIA video cards, right? If so, please tell me which one is the better solution and why. What are the prices? Which is easier to use, gDEBugger or Parallel Nsight? (If I remember correctly, NVIDIA needs two video cards for debugging, while ATI needs only one.) Which one has better performance? Is there really no way to write the same code for both NVIDIA and ATI cards?

Thanks for the help.

CUDA is easier to use than OpenCL, especially for a beginner. The code that you write for the GPU is very similar, but OpenCL requires a lot of extra work on the CPU side that is avoided with CUDA. Here’s a picture that I made back when I was learning OpenCL. Two programs doing exactly the same task, one using the CUDA runtime API, the other using vanilla OpenCL. In both cases there’s a short kernel (on top) which is executed on the GPU, and the rest of the code initializes the system, copies data to and from the GPU, and launches the kernel:

Granted, it is a bit exaggerated, and ATI (AMD) provides software packages that make the code less miserable to work with. But it should still demonstrate the point.

Performance-wise, it depends on the task; sometimes one is faster, sometimes the other. Personally, I’ve never used any debuggers from either side. CUDA and OpenCL both allow you to put printf() inside the kernel code, and that’s often all you need.

This comparison isn’t really fair… if the CUDA example were written using the CUDA driver API, then both examples would be about the same size. This does matter for other languages.

I guess the CUDA runtime API (cuda_runtime.h) simply automates some things, at the cost of some flexibility, but that flexibility can be regained.

OpenCL could do the same, but apparently it does not.

Thanks for your reply :) As I understand it, there is a big difference between CUDA and OpenCL in how much work the programmer has to do.

One more thing: is Tesla somehow better than the GeForce series? Let’s take the GTX 580 and the Tesla M2090. They both have the same number of parallel processors (512) and both are Fermi-based. Is the Tesla M2090 faster than the GTX 580? I read somewhere that Tesla is built solely for parallel processing and is faster because of it. And how much does one Tesla M2090 cost? I couldn’t find the price on the website.

I found the price of the Tesla C2070: it’s $2499.99. Why is the price so high when the GTX 580 (which has more cores) costs only $470.99 on Amazon? And what is the difference between the C and M series?

The Tesla vs. GeForce thing is a long-standing point of confusion. In terms of single precision floating point performance and memory bandwidth, the GeForce GTX 580 is FASTER than the Tesla. The top of the line Tesla and GeForce are basically the same chip, with the Tesla downclocked slightly for reliability reasons.
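To make the single precision comparison concrete, here is a rough back-of-the-envelope sketch in Python. It uses the usual "cores x clock x 2" estimate of theoretical peak throughput (one fused multiply-add per core per cycle). The core counts are the ones quoted in this thread; the shader clocks (about 1.544 GHz for the GTX 580, about 1.3 GHz for the Tesla M2090) are my assumptions from the published spec sheets, so check them before relying on the numbers.

```python
def peak_sp_gflops(cores, shader_clock_ghz, flops_per_cycle=2):
    """Theoretical single-precision peak: cores x clock (GHz) x flops per core per cycle."""
    return cores * shader_clock_ghz * flops_per_cycle


# (cores, shader clock in GHz); the clocks are assumed values, not taken from this thread
cards = {
    "GeForce GTX 580": (512, 1.544),
    "Tesla M2090": (512, 1.30),
}

peaks = {name: peak_sp_gflops(cores, clock) for name, (cores, clock) in cards.items()}

for name, gflops in peaks.items():
    print(f"{name}: about {gflops:.0f} GFLOPS single precision (theoretical peak)")
```

With the same number of cores, the difference comes entirely from the clock, which is the "downclocked slightly" part. Real kernels rarely reach the theoretical peak on either card.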

Tesla cards add (or enable, depending on your perspective) several other features:

  • Better quality assurance for 24/7 computational use

  • Double precision floating point that is 4x faster than the equivalent GeForce card.

  • More memory than you can get in a GeForce card. There is a 3GB GTX 580 (got one next to me now), but the Tesla is available in 3 GB and 6 GB.

  • The driver option to enable ECC in device memory at the cost of some memory performance.

  • Technical support from NVIDIA for CUDA development (so I’m told)

  • A second DMA engine allowing simultaneous transfers to and from the host system (both directions at the same time) along with the execution of a kernel. GeForce cards can only overlap one memory transfer with kernel execution.

  • The ability to use a special “TCC” driver in Windows 7 for computing rather than the standard graphics driver. The graphics subsystem in Windows 7 adds quite a bit of overhead to CUDA function calls. Linux and Windows XP don’t have this problem, so GeForce and Tesla are equivalent in this respect there.

For people building compute clusters, these features can be very important, so they’ll pay the price premium. Many of us (including myself) don’t need these things and do all of our CUDA work on GeForce cards quite effectively. Unless you know you need one of the above features, the GeForce cards are a much lower cost way to evaluate whether CUDA works for you.

A lot of people worry about the double precision performance difference between GeForce and Tesla, but for many problems it is not as noticeable as you think. CUDA programs often become memory bandwidth limited given the ridiculous amount of floating point power on these chips. The memory subsystem can’t keep up unless you are doing at least a dozen operations per float you read.
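The bandwidth argument can be checked with a little arithmetic. The sketch below estimates how many floating point operations a kernel must do per float it reads before the arithmetic units, rather than memory, become the bottleneck. The GTX 580 figures (about 1581 GFLOPS single precision peak and about 192.4 GB/s memory bandwidth) are assumptions taken from the published specifications, and the helper name is made up for illustration.

```python
def breakeven_flops_per_float(peak_gflops, bandwidth_gb_per_s, bytes_per_float=4):
    """Flops needed per float read from memory to keep the arithmetic units busy."""
    gfloats_per_s = bandwidth_gb_per_s / bytes_per_float  # floats streamed per second (in billions)
    return peak_gflops / gfloats_per_s


# Assumed GTX 580 numbers: ~1581 GFLOPS single precision, ~192.4 GB/s device memory bandwidth
sp_breakeven = breakeven_flops_per_float(1581.0, 192.4)

# GeForce Fermi double precision is assumed here to be 1/8 of the single precision rate,
# and each double is 8 bytes instead of 4
dp_breakeven = breakeven_flops_per_float(1581.0 / 8, 192.4, bytes_per_float=8)

print(f"single precision: ~{sp_breakeven:.0f} flops per float read")
print(f"double precision: ~{dp_breakeven:.0f} flops per double read")
```

Under these assumptions the single precision break-even is in the low thirties of flops per float, which is roughly 16 fused multiply-adds, so the same order of magnitude as the "dozen operations" above. A kernel that does less arithmetic than that per value loaded is limited by memory bandwidth, and for such kernels the slower double precision units on a GeForce matter much less than the raw ratio suggests.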

(Note: This is the perspective of an academic user who works exclusively in Linux with GeForce cards in small, homegrown clusters we administer. In other environments, the Tesla value-add might be more significant.)

The C series is a card with a standard, on-board fan and heatsink that you can install in a normal workstation. The M series is intended for OEMs that are designing an entire computer specifically with Tesla boards in mind. The cards have heatsinks, but assume the airflow in the case has been designed properly to cool them. Stick an M series card (assuming you can even buy them individually…) into a random workstation, and there is a danger you will overheat the card. M series cards are usually found in compact rackmount enclosures where the case fans can do double-duty cooling the GPU along with the rest of the system.