Fastest CUDA card on the market — choosing the best card for CUDA computation

hi everyone,

I know this question might be more appropriate for the hardware forum, but I’d like to emphasize “choosing the best CUDA card for CUDA computation purposes,” so I’m posting it here.

I’d like to find the fastest (or most powerful) CUDA card on the market for parallel CUDA programming (i.e., not for video gaming), to be installed in a desktop PC.

I looked into the CUDA Zone GPU table.

At first I assumed that a CUDA card with a higher “compute capability” would be more powerful, e.g., GeForce GTX 295 (compute capability 1.3) vs. GeForce GTX 560 Ti (compute capability 2.1). I know the number indicates CUDA feature compatibility, and I reasonably assumed it would also reflect performance.

After looking at their specs and prices, I found out I was probably wrong??

It’s even harder to make a cross-family comparison, e.g., Quadro vs. GeForce; each model has its pros and cons according to its spec??

Can someone kindly point me in the right direction: how do I find the most powerful card?? Or, to put it another way, how do I compare these cards for parallel CUDA computing purposes??



The two most important metrics are the memory bandwidth (GB/s) and single precision arithmetic throughput (GFLOP/s). To a lesser degree, double precision throughput might also be of interest, depending on your particular code.

Nvidia directly specifies memory bandwidth for all their cards. The single precision arithmetic throughput is, for all practical purposes, 2 × number_of_CUDA_cores × processor_clock (the peak for compute capability 1.x devices is 50% higher than that, but this is difficult or even impossible to achieve in practice). Both values are listed in Nvidia’s spec sheets. Double precision throughput for Tesla and Quadro cards is half their single precision throughput. For GeForce cards of compute capability 2.1 it is 1/12th of the single precision throughput, and for compute capability 1.3 and 2.0 it is 1/8th of the single precision throughput as given above.

As you figured out, higher compute capability does not necessarily relate to higher computational power. However, nowadays you would probably want to buy a compute capability 2.x device, as their new features make programming them a lot easier.

For single precision or integer code, the fastest single-GPU CUDA card from Nvidia currently is the GeForce GTX 580. On double precision problems it might (or might not) be beaten by the Tesla 20x0 cards. This however depends a lot on the particular code.

I’ll give you my answer:

  1. Look at the memory clock, especially if you are dealing with very large datasets, like hundreds of megabytes or even gigabytes — and ultimately that’s what parallel stuff is about: lots of data, with lots of work done on it in parallel ;)

  2. Then look at the memory bus width in bits. The more bits it has, the more bits it can push through at once. It would be nice if the specs mentioned whether this could also be organized as 2×128 bits, 4×64 bits, 8×32 bits, or 16×16 bits (ideally each with a different memory address, for extreme random memory access performance!)

Better sequential performance speaks for itself, I would think ;) though there could always be caveats ;)

  1. If memory size, memory bandwidth, and memory access performance are not your main concern, but rather doing many computations, then:

  2. Look at number of multi-processors.

  3. Look at the number of CUDA cores.

  4. Last, look at shared memory, but this is probably difficult to use and doesn’t do that much… it’s usually very small, around 48 KB or so… compared to 1 GB of RAM, it’s peanuts ;) Only very small problems, or ones with small inner loops, can use it ;) So far I have seen some algorithms that use it, but it makes the algorithms much more complex.

Something which is apparently new is:

  1. Caches like L1 and L2… I am not sure how big they are, or whether this information is available… I am also not sure whether they are necessary for “coalescing” or not…

I would definitely go for the highest compute capability so that you can program with ease and use the latest tips, tricks, techniques, and language features; this will make your software last longer.

You can always buy a new card in the future, in 5 years, 3 years, whatever… but to limit yourself to compute capability 1.3 or 1.1 while compute capability 2.2 or so is already out would be kind of foolish, I think… because these compute capabilities can come in handy and are probably needed for the somewhat more advanced algorithms.

So if you do not plan on buying a new graphics card every 3 months, go for the long term ;) :) and get the highest compute capability! ;) :)

thank you so much, both of you, for your help!!

Also one last important tip:

PCI Express 2.0 can be a bottleneck if you have to transfer a lot of data between the CPU/host and the GPU/device.

PCI Express 3.0 is going to be faster, so if you can get one of those cards, that would be better too ;)

Good luck finding a PCIe 3.0 card. I must have missed the part where Alex said he is planning a system for the distant future.

Lol news message today:

I have seen other news messages about this as well…

PCI Express 3.0 is coming real soon ! ;) :)

I’d also take a look at AMD’s APUs… AMD doesn’t seem to support CUDA (yet) on their graphics cards or the embedded graphics chips inside their processors,

But there is an open source project, mostly for Linux, called Ocelot; maybe it can recompile CUDA kernels to AMD’s IL (intermediate language).

So then there is some hope of running CUDA kernels on AMD hardware.

Also, replacing nvcuda.dll with something that calls, for example, the OpenCL APIs shouldn’t be too hard… the question is whether anybody is going to invest the time to make a CUDA API clone on top of OpenCL…

That would kinda be weird/silly.

cuda api on top of
opencl api on top of
cuda api

^ OpenCL is implemented on top of CUDA… at least for Nvidia, I think… so maybe on AMD it would look like:

cuda api on top of
opencl api

So then it’s not so silly ;)

The question remains what the OpenCL API is on top of on AMD… maybe “Close to Metal” or FireStream or Stream or something…

If Nvidia would implement CUDA on top of OpenCL, or on top of AMD/ATI hardware…

That would make me real happy…

Then I could simply/happily continue development in CUDA and optimize for Nvidia hardware… and hopefully still have somewhat reasonable performance on ATI/AMD ;)

Currently, performance on ATI/AMD = 0, because it won’t run, lol.

This is a nice chart:

Some of these motherboards have 2× PCIe 3.0, so if you are looking for an SLI board… ;)


If you need to use TCC (e.g., you plan on RDP’ing into your machine, etc.), or need ECC capability, or are planning to use the GPU in some production context with MS HPC — you need a Tesla.

Otherwise, the GTX 590 is the most powerful CUDA card on the market. It offers over 2× a Tesla’s performance and costs <$700, vs. >$2,000 for a Tesla. The architecture of the GTX 590 is also newer than, and superior to, that of the Tesla 2050/2070.

You will have to do funny things like tweaking the registry in Windows to disable the TDR delay and such, in order to fully use a GeForce for computation, but it’s well worth the extra effort.