Fastest card for CUDA?

I did some CUDA programming several months ago, but I’m not up to date on the newest GPUs. What is currently the fastest card for CUDA? Are there any benchmarks anywhere?

Last time I looked, it seemed like the 9800 GTX was probably the fastest. What about the new 260 and 280 cards?

Definitely the new cards. GTX 260 has good bang for buck, GTX 280 somewhat better if the extra $ don’t matter. There has been some discussion on the forum - check out the latest threads.

Fastest depends on your application of course. My memory bound applications run much faster on the original 8800 GTX than on the 9800 GTX :)

Of course, the GTX 280 is faster than all previous cards in every way. If you wait for the Tesla 1000 series, it will be slightly faster (same core as the GTX 280, but higher clocks and more memory).

But there is lower memory bandwidth on the Teslas, so when people don’t need more than 1 GB, I suspect a GTX 280 is the fastest option.

I just double checked this from the NVIDIA specs page:

GTX 280:

Processor Clock (MHz) 1296 MHz

Memory Bandwidth (GB/sec) 141.7

Tesla C1060:

Frequency of processor cores 1.296 GHz

Memory Bandwidth 102 GB/sec

So I guess I was wrong about the clocks here too…

It is the Tesla S1070 that has the boosted clock:

Frequency of processor cores 1.44 GHz

So yeah, roughly 40% more memory bandwidth on the GTX 280 than on the Tesla 1000 series (141.7 vs. 102 GB/sec, i.e. about 28% less on the Tesla) = a bad thing for memory bound apps on the Tesla. Oh well. I guess there is a price for that 4 GiB of memory.

It seems counterintuitive!

Not really. NVIDIA says all the time that they run the memory on Teslas below its spec.

It’s to get fewer bit errors. If you’re paranoid about your production code (or actually notice the errors), you can downclock gaming cards for more stability.

Now that is Rather Interesting. I suspect I’ve seen a few bit errors, though it could just as well be my weird code. Have you tried this yourself, or got any links?

I’ve only ever seen something that seems like bit errors in an overheating GPU. I’ve never had any other evidence for them on any consumer or Tesla card even for week+ long runs.

But I think the bit error issue is not the reason here. NVIDIA lowered the clock on Tesla C870 to reduce bit errors, too, but the memory bandwidth suffered only a few percent.

I’m no DRAM expert, but I do remember something from computer architecture class about DRAM taking longer to address and read when you’ve got a larger total memory area. That could account for the lower bandwidth, and is what I was implying when I mentioned there was “a price for that 4 GiB of memory”.

Hmm, then they have not been telling the truth at NVISION. But I can imagine that is indeed the case.

I agree, especially because the new cards have much better memory coalescing capability. For example, I used to access a 3D array (of float4) the usual way on an 8800 GT:

variable accessed by thread (i,j,k) = array[Nx*(Ny*(k-1) + (j-1)) + (i-1)],

so that everything was coalesced according to the old rules. Now, I can access the array on a GTX 280 using indirect addressing:

variable accessed by thread (i,j,k) = new_array[ pointer_array[Nx*(Ny*(k-1) + (j-1)) + (i-1)] - 1],

which breaks the old coalescing rule. Surprisingly, not only do I not get any speed penalty for doing so, but I actually get a small (0.5%) speedup! At the same time, my memory usage went down by 45% because I now only have to allocate the much smaller, condensed array ‘new_array’.
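For reference, a rough sketch of the two access patterns in CUDA C (the kernel names, the output array ‘out’, and the thread-to-(i,j,k) mapping are made up for illustration; this is not the actual code from the post):

__global__ void direct_access(const float4 *array, float4 *out, int Nx, int Ny)
{
    // 1-based (i, j, k) indices, matching the formulas above; assumes
    // blockDim.x == Nx so a warp sweeps consecutive i.
    int i = threadIdx.x + 1;
    int j = blockIdx.x + 1;
    int k = blockIdx.y + 1;
    int idx = Nx * (Ny * (k - 1) + (j - 1)) + (i - 1);

    // Old-style fully coalesced access: consecutive threads in a warp
    // read consecutive float4 elements.
    out[idx] = array[idx];
}

__global__ void indirect_access(const float4 *new_array, const int *pointer_array,
                                float4 *out, int Nx, int Ny)
{
    int i = threadIdx.x + 1;
    int j = blockIdx.x + 1;
    int k = blockIdx.y + 1;
    int idx = Nx * (Ny * (k - 1) + (j - 1)) + (i - 1);

    // Indirect access through a 1-based index array: this breaks the old
    // G80/G92 coalescing rules, but GT200 (GTX 280) can still combine the
    // scattered loads into a small number of memory transactions.
    out[idx] = new_array[pointer_array[idx] - 1];
}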

You only really “see” bit errors when you’re getting millions of them. That’s why they’re not such a big concern. But if you have the time, write a memory burn-in kernel to specifically look for them. It’d be very interesting.
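Something along these lines would be a starting point (the pattern choice, buffer size, and pass count here are arbitrary; it’s only a minimal sketch of the write-then-verify idea, not a proper memtest):

#include <cstdio>
#include <cuda_runtime.h>

// Write an address-dependent pattern into the buffer (grid-stride loop).
__global__ void write_pattern(unsigned int *buf, size_t n, unsigned int pattern)
{
    for (size_t idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
         idx += (size_t)gridDim.x * blockDim.x)
        buf[idx] = pattern ^ (unsigned int)idx;
}

// Read the pattern back and count mismatches (needs global atomics, sm_11+).
__global__ void check_pattern(const unsigned int *buf, size_t n,
                              unsigned int pattern, unsigned int *errors)
{
    for (size_t idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
         idx += (size_t)gridDim.x * blockDim.x)
        if (buf[idx] != (pattern ^ (unsigned int)idx))
            atomicAdd(errors, 1u);
}

int main()
{
    const size_t n = 64 * 1024 * 1024;   // 256 MB worth of unsigned ints
    unsigned int *buf;
    unsigned int *errors;
    unsigned int host_errors = 0;
    cudaMalloc((void **)&buf, n * sizeof(unsigned int));
    cudaMalloc((void **)&errors, sizeof(unsigned int));
    cudaMemset(errors, 0, sizeof(unsigned int));

    for (int pass = 0; pass < 100; ++pass) {   // repeat to heat the board up
        unsigned int pattern = (pass & 1) ? 0xAAAAAAAAu : 0x55555555u;
        write_pattern<<<1024, 256>>>(buf, n, pattern);
        check_pattern<<<1024, 256>>>(buf, n, pattern, errors);
    }
    cudaMemcpy(&host_errors, errors, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("bit errors detected: %u\n", host_errors);

    cudaFree(buf);
    cudaFree(errors);
    return 0;
}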

You mean the memory bandwidth drops by more than the mere decrease in frequency would suggest? Yes, that has an explanation. When you start adding multiple chips to the same address lines, the signals take longer to stabilize (data lines are always point-to-point, so they’re fine). The solution is to wait. Normally you have idle time anyway because the DDR sends you data in bursts, but if they really pushed the limits that time wouldn’t be enough, and the waiting (what you see as 1T, 2T addressing in a PC BIOS) becomes longer than the slack. You really want to be careful with the waiting, because an error in the addressing step is equivalent to a hundred bit errors. It’s possible NVIDIA originally wanted there to be no stall time, but then had to change things when parts weren’t working.

I’d really love to see someone do independent testing to see if the Teslas are any more reliable than the GeForces, or if they might even be worse.

Not when I run the same deterministic simulation every time and diff the binary output files, even between different hardware. Plus, any random bit error is likely to send a particle in my simulation flying outside the box or cause an infinite loop in the kernel, both of which will show up as a crashed application.

That’s pretty reassuring. But keep in mind that you may not be using 100% of your memory and, in the cases where you’re not diffing, a bit error is likely to just change a float’s mantissa slightly and won’t cause an out-of-box particle.
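A quick host-side illustration of that point (the value is arbitrary; it assumes IEEE 754 single precision and a 32-bit unsigned int):

#include <cstdio>
#include <cstring>

int main()
{
    float x = 1.2345678f;   // arbitrary example value

    // Reinterpret the float's bits and flip the least significant of the
    // 23 mantissa bits.
    unsigned int bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= 1u;
    float y;
    memcpy(&y, &bits, sizeof y);

    printf("original:    %.9g\n", x);
    printf("bit flipped: %.9g\n", y);
    printf("relative change: %g\n", (y - x) / x);   // on the order of 1e-7
    return 0;
}

A flip in the exponent bits or in a high mantissa bit would of course be far more visible, but those are only a small fraction of the bits.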

A burn-in program would be very interesting (and useful!), maybe I’ll work on one. Besides checking memory, it’d also stress-test the ALUs. What are your ideas for hitting as many parts of the GPU as possible and causing the most heat?